Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 80]
- cs.CV [Total: 190]
- cs.AI [Total: 31]
- cs.SD [Total: 11]
- cs.LG [Total: 149]
- cs.MA [Total: 5]
- cs.MM [Total: 7]
- eess.AS [Total: 8]
- eess.IV [Total: 14]
cs.CL
[1] ParallelSearch: Train your LLMs to Decompose Query and Search Sub-queries in Parallel with Reinforcement Learning
Shu Zhao, Tan Yu, Anbang Xu, Japinder Singh, Aaditya Shukla, Rama Akkiraju
Main category: cs.CL
TL;DR: ParallelSearch introduces a reinforcement learning framework for LLMs to execute parallel search operations, improving efficiency and performance over sequential methods.
Details
Motivation: Existing search agents process queries sequentially, limiting efficiency for parallelizable tasks.
Method: Proposes ParallelSearch, a reinforcement learning framework with rewards for identifying parallelizable queries and executing concurrent searches.
Result: Outperforms baselines by 2.9% on average, with a 12.7% improvement on parallelizable questions while requiring only 69.6% of the LLM calls of sequential approaches.
Conclusion: ParallelSearch effectively addresses the sequential bottleneck, enhancing computational efficiency and performance in multi-step information retrieval.
Abstract: Reasoning-augmented search agents such as Search-R1, trained via reinforcement learning with verifiable rewards (RLVR), demonstrate remarkable capabilities in multi-step information retrieval from external knowledge sources. These agents address the limitations of their parametric memory by dynamically gathering relevant facts to address complex reasoning tasks. However, existing approaches suffer from a fundamental architectural limitation: they process search queries strictly sequentially, even when handling inherently parallelizable and logically independent comparisons. This sequential bottleneck significantly constrains computational efficiency, particularly for queries that require multiple entity comparisons. To address this critical limitation, we propose ParallelSearch, a novel reinforcement learning framework that empowers large language models (LLMs) to recognize parallelizable query structures and execute multiple search operations concurrently. Our approach introduces dedicated reward functions that incentivize the identification of independent query components while preserving answer accuracy through jointly considering correctness, query decomposition quality, and parallel execution benefits. Comprehensive experiments demonstrate that ParallelSearch outperforms state-of-the-art baselines by an average performance gain of 2.9% across seven question-answering benchmarks. Notably, on parallelizable questions, our method achieves a 12.7% performance improvement while requiring only 69.6% of the LLM calls compared to sequential approaches.
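To make the reward design concrete, the sketch below shows one way a composite reward of the kind described above could be assembled; the individual components and weights are illustrative assumptions, not the paper’s published formula.

```python
# Minimal sketch of a ParallelSearch-style composite reward. The
# components and weights are illustrative assumptions: the abstract only
# states that correctness, decomposition quality, and parallel-execution
# benefit are jointly rewarded.

def composite_reward(answer_correct: bool,
                     decomposition_quality: float,  # in [0, 1]
                     sequential_calls: int,
                     parallel_calls: int,
                     w_correct: float = 1.0,
                     w_decomp: float = 0.3,
                     w_parallel: float = 0.3) -> float:
    # Fraction of LLM calls saved by issuing independent sub-queries concurrently.
    calls_saved = 1.0 - parallel_calls / max(sequential_calls, 1)
    return (w_correct * float(answer_correct)
            + w_decomp * decomposition_quality
            + w_parallel * calls_saved)

print(composite_reward(True, 0.8, sequential_calls=10, parallel_calls=7))
```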
[2] Leveraging Large Language Models for Rare Disease Named Entity Recognition
Nan Miles Xi, Yu Deng, Lin Wang
Main category: cs.CL
TL;DR: GPT-4o is evaluated for rare disease NER using prompt-based strategies, achieving competitive or SOTA results. Few-shot prompting is cost-effective, while RAG offers limited benefits.
Details
Motivation: Address challenges in rare disease NER like limited data and semantic ambiguity by leveraging GPT-4o's capabilities under low-resource settings.
Method: Uses zero-shot, few-shot, RAG, and fine-tuning with structured prompts and semantically guided example selection.
Result: GPT-4o outperforms BioClinicalBERT, with fine-tuning achieving SOTA. Few-shot is cost-effective; RAG has marginal benefits.
Conclusion: Prompt-optimized LLMs are scalable alternatives for biomedical NER, especially in rare diseases with scarce annotated data.
Abstract: Named Entity Recognition (NER) in the rare disease domain poses unique challenges due to limited labeled data, semantic ambiguity between entity types, and long-tail distributions. In this study, we evaluate the capabilities of GPT-4o for rare disease NER under low-resource settings, using a range of prompt-based strategies including zero-shot prompting, few-shot in-context learning, retrieval-augmented generation (RAG), and task-level fine-tuning. We design a structured prompting framework that encodes domain-specific knowledge and disambiguation rules for four entity types. We further introduce two semantically guided few-shot example selection methods to improve in-context performance while reducing labeling effort. Experiments on the RareDis Corpus show that GPT-4o achieves competitive or superior performance compared to BioClinicalBERT, with task-level fine-tuning yielding new state-of-the-art (SOTA) results. Cost-performance analysis reveals that few-shot prompting delivers high returns at low token budgets, while RAG offers marginal additional benefit. An error taxonomy highlights common failure modes such as boundary drift and type confusion, suggesting opportunities for post-processing and hybrid refinement. Our results demonstrate that prompt-optimized LLMs can serve as effective, scalable alternatives to traditional supervised models in biomedical NER, particularly in rare disease applications where annotated data is scarce.
[3] UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech
Shuhei Kato
Main category: cs.CL
TL;DR: UtterTune is a lightweight adaptation method for multilingual TTS systems using LLM architecture, improving pronunciation control in target languages like Japanese without sacrificing performance in others.
Details
Motivation: LLM-based TTS models struggle with accurate G2P mapping and prosody, especially without explicit G2P modules. UtterTune aims to address this.
Method: Uses low-rank adaptation to control segmental pronunciation and pitch accent at the phoneme level for Japanese, maintaining naturalness and speaker similarity.
Result: Effective in objective and subjective evaluations.
Conclusion: UtterTune successfully enhances pronunciation controllability in target languages while preserving multilingual performance.
Abstract: We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its effectiveness.
[4] TEN: Table Explicitization, Neurosymbolically
Nikita Mehrotra, Aayush Kumar, Sumit Gulwani, Arjun Radhakrishna, Ashish Tiwari
Main category: cs.CL
TL;DR: TEN is a neurosymbolic method for extracting tabular data from semistructured text, combining LLM prompting with symbolic checks to improve accuracy and reduce hallucinations.
Details
Motivation: Extracting tabular data from inconsistently delimited text is challenging for neural methods due to hallucinations and lack of constraint enforcement.
Method: TEN uses Structural Decomposition prompting on an LLM to generate tables, followed by symbolic checks and a self-debug loop with a critique-LLM for corrections.
Result: TEN outperforms neural baselines in accuracy and hallucination reduction, with user studies confirming higher accuracy and preference.
Conclusion: TEN’s hybrid approach effectively addresses the limitations of purely neural methods for tabular data extraction.
Abstract: We present a neurosymbolic approach, TEN, for extracting tabular data from semistructured input text. This task is particularly challenging for text input that does not use special delimiters consistently to separate columns and rows. Purely neural approaches perform poorly due to hallucinations and their inability to enforce hard constraints. TEN uses Structural Decomposition prompting - a specialized chain-of-thought prompting approach - on a large language model (LLM) to generate an initial table, and thereafter uses a symbolic checker to evaluate not only the well-formedness of that table, but also detect cases of hallucinations or forgetting. The output of the symbolic checker is processed by a critique-LLM to generate guidance for fixing the table, which is presented to the original LLM in a self-debug loop. Our extensive experiments demonstrate that TEN significantly outperforms purely neural baselines across multiple datasets and metrics, achieving significantly higher exact match accuracy and substantially reduced hallucination rates. A 21-participant user study further confirms that TEN’s tables are rated significantly more accurate (mean score: 5.0 vs 4.3; p = 0.021), and are consistently preferred for ease of verification and correction, with participants favoring our method in over 60% of the cases.
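The generate-check-critique loop is the core of TEN’s design. Below is a minimal sketch of that control flow; `call_llm`, `symbolic_check`, and `call_critique_llm` are hypothetical stubs standing in for the paper’s actual prompts and checker rules.

```python
# Sketch of TEN's generate -> check -> critique self-debug loop.
# call_llm, symbolic_check, and call_critique_llm are hypothetical stubs;
# the paper's actual prompts and checker rules are not reproduced here.

def call_llm(prompt: str) -> str:
    return "col_a,col_b\n1,2"  # stand-in for LLM table generation

def symbolic_check(table: str, source: str) -> list[str]:
    # Stand-in: the real checker validates well-formedness and flags
    # hallucinated or forgotten cells against the source text.
    rows = [r.split(",") for r in table.splitlines()]
    return [] if len({len(r) for r in rows}) == 1 else ["ragged rows"]

def call_critique_llm(issues: list[str]) -> str:
    return "Fix: " + "; ".join(issues)

def extract_table(source: str, max_rounds: int = 3) -> str:
    prompt = f"Extract a table from:\n{source}"
    table = call_llm(prompt)
    for _ in range(max_rounds):
        issues = symbolic_check(table, source)
        if not issues:  # checker satisfied: stop debugging
            break
        guidance = call_critique_llm(issues)
        table = call_llm(prompt + "\n" + guidance)  # self-debug round
    return table

print(extract_table("a b\n1 2"))
```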
[5] Decoding Neural Emotion Patterns through Natural Language Processing Embeddings
Gideon Vos, Maryam Ebrahimpour, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
Main category: cs.CL
TL;DR: A computational framework maps textual emotional content to brain regions without neuroimaging, using semantic embeddings and clustering. It distinguishes emotions in clinical populations and evaluates AI-generated text.
Details
Motivation: To bridge the gap between neuroimaging-based emotion localization and computational text analysis by creating a cost-effective, scalable method for emotion-brain mapping.
Method: Uses OpenAI’s text-embedding-ada-002 to generate semantic representations, applies dimensionality reduction and clustering, and maps the resulting emotional groups to 18 brain regions. Tested on conversational data (healthy vs. depressed), the GoEmotions dataset, and human vs. LLM text.
Result: Neuroanatomically plausible mappings with high specificity. Depressed subjects showed greater limbic engagement. LLM text matched humans in basic emotions but lacked nuanced activation in empathy-related regions.
Conclusion: The framework enables large-scale analysis of naturalistic language, distinguishes clinical populations, and provides a brain-based benchmark for AI emotional expression.
Abstract: Understanding how emotional expression in language relates to brain function is a challenge in computational neuroscience and affective computing. Traditional neuroimaging is costly and lab-bound, but abundant digital text offers new avenues for emotion-brain mapping. Prior work has largely examined neuroimaging-based emotion localization or computational text analysis separately, with little integration. We propose a computational framework that maps textual emotional content to anatomically defined brain regions without requiring neuroimaging. Using OpenAI’s text-embedding-ada-002, we generate high-dimensional semantic representations, apply dimensionality reduction and clustering to identify emotional groups, and map them to 18 brain regions linked to emotional processing. Three experiments were conducted: i) analyzing conversational data from healthy vs. depressed subjects (DAIC-WOZ dataset) to compare mapping patterns, ii) applying the method to the GoEmotions dataset, and iii) comparing human-written text with large language model (LLM) responses to assess differences in inferred brain activation. Emotional intensity was scored via lexical analysis. Results showed neuroanatomically plausible mappings with high spatial specificity. Depressed subjects exhibited greater limbic engagement tied to negative affect. Discrete emotions were successfully differentiated. LLM-generated text matched humans in basic emotion distribution but lacked nuanced activation in empathy and self-referential regions (medial prefrontal and posterior cingulate cortex). This cost-effective, scalable approach enables large-scale analysis of naturalistic language, distinguishes between clinical populations, and offers a brain-based benchmark for evaluating AI emotional expression.
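The embed-reduce-cluster stage of the pipeline can be sketched as follows; random vectors stand in for text-embedding-ada-002 outputs, and the mapping from clusters to the 18 brain regions is omitted.

```python
# Sketch of the embed -> reduce -> cluster stage of the framework.
# Random vectors stand in for text-embedding-ada-002 outputs (1536-d);
# the mapping of clusters to the 18 brain regions is not reproduced here.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 1536))  # one row per text snippet

reduced = PCA(n_components=50, random_state=0).fit_transform(embeddings)
labels = KMeans(n_clusters=18, n_init=10, random_state=0).fit_predict(reduced)
print(np.bincount(labels))  # cluster sizes, later mapped to brain regions
```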
[6] The Human-AI Hybrid Delphi Model: A Structured Framework for Context-Rich, Expert Consensus in Complex Domains
Cathy Speed, Ahmed A. Metwally
Main category: cs.CL
TL;DR: The paper introduces a Human-AI Hybrid Delphi (HAH-Delphi) framework to improve expert consensus by combining AI (Gemini 2.5 Pro) with human experts, showing high accuracy and efficiency in three testing phases.
Details
Motivation: Traditional consensus methods like Delphi studies face limitations such as high panel burden and oversimplification, worsened by information overload. The study aims to enhance consensus development with AI-human collaboration.
Method: The HAH-Delphi framework integrates AI, small expert panels, and structured facilitation. It was tested in retrospective replication, prospective comparison, and applied deployment in two domains (endurance training and resistance/mixed cardio-strength training).
Result: AI replicated 95% of consensus conclusions (Phase I) and agreed 95% with human experts (Phase II). Phase III showed >90% consensus coverage with compact panels, aided by AI scaffolding.
Conclusion: HAH-Delphi is a scalable, robust method for high-quality consensus, validated in health and performance science, enabling personalized guidance at scale.
Abstract: Expert consensus plays a critical role in domains where evidence is complex, conflicting, or insufficient for direct prescription. Traditional methods, such as Delphi studies, consensus conferences, and systematic guideline synthesis, offer structure but face limitations including high panel burden, interpretive oversimplification, and suppression of conditional nuance. These challenges are now exacerbated by information overload, fragmentation of the evidence base, and increasing reliance on publicly available sources that lack expert filtering. This study introduces and evaluates a Human-AI Hybrid Delphi (HAH-Delphi) framework designed to augment expert consensus development by integrating a generative AI model (Gemini 2.5 Pro), small panels of senior human experts, and structured facilitation. The HAH-Delphi was tested in three phases: retrospective replication, prospective comparison, and applied deployment in two applied domains (endurance training and resistance and mixed cardio/strength training). The AI replicated 95% of published expert consensus conclusions in Phase I and showed 95% directional agreement with senior human experts in Phase II, though it lacked experiential and pragmatic nuance. In Phase III, compact panels of six senior experts achieved >90% consensus coverage and reached thematic saturation before the final participant. The AI provided consistent, literature-grounded scaffolding that supported divergence resolution and accelerated saturation. The HAH-Delphi framework offers a flexible, scalable approach for generating high-quality, context-sensitive consensus. Its successful application across health, coaching, and performance science confirms its methodological robustness and supports its use as a foundation for generating conditional, personalised guidance and published consensus frameworks at scale.
[7] Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling
Ju-Chieh Chou, Jiawei Zhou, Karen Livescu
Main category: cs.CL
TL;DR: The paper proposes a joint model for linguistic and acoustic information in textless spoken language models, improving acoustic detail without sacrificing linguistic performance.
Details
Motivation: Existing textless SLMs lack acoustic context and control, relying on separate vocoders. This work aims to integrate both linguistic and acoustic modeling.
Method: The model jointly generates semantic tokens and continuous acoustic representations using a flow-matching objective, predicting multiple future tokens for better linguistic preservation.
Result: The approach matches linguistic benchmarks while enhancing acoustic detail in generation.
Conclusion: Joint modeling of linguistic and acoustic information improves textless SLMs, offering better control and detail.
Abstract: Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.
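For reference, the standard conditional flow-matching loss has the following form; the exact parameterization Flow-SLM uses is an assumption here, with the learned vector field conditioned on the semantic tokens.

```latex
% Standard conditional flow-matching loss; the exact parameterization
% used by Flow-SLM is an assumption here. x_1 is the target acoustic
% frame, x_0 ~ N(0, I) is noise, x_t = (1 - t) x_0 + t x_1, and s is
% the semantic-token conditioning.
\mathcal{L}_{\mathrm{FM}}(\theta) =
  \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; (x_1, s)}
  \big\| v_\theta(x_t, t, s) - (x_1 - x_0) \big\|^2
```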
[8] APIO: Automatic Prompt Induction and Optimization for Grammatical Error Correction and Text Simplification
Artem Chernodub, Aman Saini, Yejin Huh, Vivek Kulkarni, Vipul Raheja
Main category: cs.CL
TL;DR: APIO is a prompt induction and optimization method for Grammatical Error Correction and Text Simplification, achieving state-of-the-art performance without manual seed prompts.
Details
Motivation: To advance automatic prompt optimization (APO) by eliminating reliance on manually specified seed prompts for tasks like GEC and Text Simplification.
Method: APIO, a simple yet effective approach for prompt induction and optimization, tested on GEC and Text Simplification tasks.
Result: APIO achieves state-of-the-art performance for purely LLM-based prompting methods on these tasks.
Conclusion: APIO is a successful method for prompt optimization, with publicly available resources for further research.
Abstract: Recent advancements in large language models (LLMs) have enabled a wide range of natural language processing (NLP) tasks to be performed through simple prompt-based interactions. Consequently, several approaches have been proposed to engineer prompts that most effectively enable LLMs to perform a given task (e.g., chain-of-thought prompting). In settings with a well-defined metric to optimize model performance, automatic prompt optimization (APO) methods have been developed to refine a seed prompt. Advancing this line of research, we propose APIO, a simple but effective prompt induction and optimization approach for the tasks of Grammatical Error Correction (GEC) and Text Simplification, without relying on manually specified seed prompts. APIO achieves a new state-of-the-art performance for purely LLM-based prompting methods on these tasks. We make our data, code, prompts, and outputs publicly available.
[9] Leveraging Zipformer Model for Effective Language Identification in Code-Switched Child-Directed Speech
Lavanya Shankar, Leibny Paola Garcia Perera
Main category: cs.CL
TL;DR: The paper proposes using Zipformer for language identification in bilingual child-directed speech, achieving a 15.47% improvement in Balanced Accuracy.
Details
Motivation: Addressing challenges in code-switching and language identification in bilingual child-directed scenarios, particularly with imbalanced Mandarin and English speech.
Method: Utilizes Zipformer to encode language characteristics, selecting inner layers for embeddings and comparing different back-ends.
Result: Achieves a Balanced Accuracy (BAC) of 81.89%, a 15.47% improvement over the baseline.
Conclusion: Highlights the effectiveness of Zipformer in handling imbalanced data and its potential for real-world applications.
Abstract: Code-switching and language identification in child-directed scenarios present significant challenges, particularly in bilingual environments. This paper addresses this challenge by using Zipformer to handle the nuances of speech, which contains two imbalanced languages, Mandarin and English, in an utterance. This work demonstrates that the internal layers of the Zipformer effectively encode the language characteristics, which can be leveraged in language identification. We present the selection methodology of the inner layers to extract the embeddings and make a comparison with different back-ends. Our analysis shows that Zipformer is robust across these backends. Our approach effectively handles imbalanced data, achieving a Balanced Accuracy (BAC) of 81.89%, a 15.47% improvement over the language identification baseline. These findings highlight the potential of the transformer encoder architecture model in real scenarios.
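A minimal sketch of the inner-layer-embedding idea follows, with random features standing in for pooled activations from a chosen Zipformer encoder layer; the layer index and back-end choice are assumptions, not the paper’s exact setup.

```python
# Sketch of the inner-layer embedding + back-end idea. Random features
# stand in for mean-pooled activations from a chosen Zipformer layer;
# the actual layer index and back-ends are assumptions here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 256))   # pooled layer activations per utterance
y = rng.integers(0, 2, size=500)  # 0 = Mandarin, 1 = English

backend = LogisticRegression(max_iter=1000, class_weight="balanced")
backend.fit(X[:400], y[:400])
print(balanced_accuracy_score(y[400:], backend.predict(X[400:])))
```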
[10] Leveraging Audio and Text Modalities in Mental Health: A Study of LLMs Performance
Abdelrahman A. Ali, Aya E. Fouda, Radwa J. Hanafy, Mohammed E. Fouda
Main category: cs.CL
TL;DR: The study explores using LLMs for multimodal mental health diagnostics (depression and PTSD) via text and audio, showing improved accuracy with combined modalities, especially with Gemini 1.5 Pro.
Details
Motivation: Address the urgent need for innovative tools in early mental health diagnosis due to rising global prevalence of disorders.
Method: Utilizes the E-DAIC dataset, compares text and audio modalities, and integrates both. Evaluates performance using custom metrics (Modal Superiority Score, Disagreement Resolvement Score) in zero-shot and few-shot settings.
Result: Gemini 1.5 Pro achieves the highest performance (F1: 0.67, BA: 77.4%) with combined modalities, outperforming single modalities by 3.1% (text) and 2.7% (audio).
Conclusion: Combining modalities enhances diagnostic accuracy, and LLMs like Gemini 1.5 Pro and GPT-4o mini perform robustly without task-specific fine-tuning.
Abstract: Mental health disorders are increasingly prevalent worldwide, creating an urgent need for innovative tools to support early diagnosis and intervention. This study explores the potential of Large Language Models (LLMs) in multimodal mental health diagnostics, specifically for detecting depression and Post Traumatic Stress Disorder through text and audio modalities. Using the E-DAIC dataset, we compare text and audio modalities to investigate whether LLMs can perform equally well or better with audio inputs. We further examine the integration of both modalities to determine if this can enhance diagnostic accuracy, which generally results in improved performance metrics. Our analysis specifically utilizes custom-formulated metrics, the Modal Superiority Score and Disagreement Resolvement Score, to evaluate how combined modalities influence model performance. The Gemini 1.5 Pro model achieves the highest scores in binary depression classification when using the combined modality, with an F1 score of 0.67 and a Balanced Accuracy (BA) of 77.4%, assessed across the full dataset. These results represent an increase of 3.1% over its performance with the text modality and 2.7% over the audio modality, highlighting the effectiveness of integrating modalities to enhance diagnostic accuracy. Notably, all results are obtained in zero-shot inference, highlighting the robustness of the models without requiring task-specific fine-tuning. To explore the impact of different configurations on model performance, we conduct binary, severity, and multiclass tasks using both zero-shot and few-shot prompts, examining the effects of prompt variations on performance. The results reveal that models such as Gemini 1.5 Pro in text and audio modalities, and GPT-4o mini in the text modality, often surpass other models in balanced accuracy and F1 scores across multiple tasks.
[11] Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models
Ting Cai, Stephen Sheen, AnHai Doan
Main category: cs.CL
TL;DR: The paper introduces Columbo, an LLM-based solution for expanding table column abbreviations, outperforming existing methods by 4-29%. It addresses limitations in prior work with new datasets and improved accuracy measures.
Details
Motivation: The need to expand abbreviated column names in tables is critical for downstream tasks across enterprises, sciences, and government agencies. Prior work lacks real-world datasets and accurate evaluation measures.
Method: The authors introduce 4 new real-world datasets, propose synonym-aware accuracy measures, and develop Columbo, an LLM-based solution using context, rules, chain-of-thought reasoning, and token-level analysis.
Result: Columbo outperforms the current state-of-the-art solution, NameGuess, by 4-29% across 5 datasets and is deployed in production on the EDI data portal.
Conclusion: The paper advances the field by addressing prior limitations and demonstrating the effectiveness of Columbo, a robust solution for abbreviation expansion.
Abstract: Expanding the abbreviated column names of tables, such as “esal” to “employee salary”, is critical for numerous downstream data tasks. This problem arises in enterprises, domain sciences, government agencies, and more. In this paper we make three contributions that significantly advance the state of the art. First, we show that synthetic public data used by prior work has major limitations, and we introduce 4 new datasets in enterprise/science domains, with real-world abbreviations. Second, we show that accuracy measures used by prior work seriously undercount correct expansions, and we propose new synonym-aware measures that capture accuracy much more accurately. Finally, we develop Columbo, a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis. Extensive experiments show that Columbo significantly outperforms NameGuess, the current most advanced solution, by 4-29%, over 5 datasets. Columbo has been used in production on EDI, a major data portal for environmental sciences.
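To illustrate what a synonym-aware measure could look like (the paper’s exact definition is not reproduced here), the sketch below credits a predicted expansion if it matches any gold synonym after light normalization.

```python
# Sketch of a synonym-aware accuracy measure: a predicted expansion is
# credited if it matches any gold synonym after light normalization.
# The paper's exact measure is an assumption here.

def normalize(s: str) -> str:
    return " ".join(s.lower().split())

def synonym_aware_accuracy(preds: list[str], golds: list[set[str]]) -> float:
    hits = sum(normalize(p) in {normalize(g) for g in gold}
               for p, gold in zip(preds, golds))
    return hits / len(preds)

preds = ["employee salary", "employee ID"]
golds = [{"employee salary", "staff salary"}, {"employee identifier"}]
print(synonym_aware_accuracy(preds, golds))  # 0.5
```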
[12] Non-native Children’s Automatic Speech Assessment Challenge (NOCASA)
Yaroslav Getman, Tamás Grósz, Mikko Kurimo, Giampiero Salvi
Main category: cs.CL
TL;DR: The paper introduces NOCASA, a competition for developing systems to assess L2 learners’ pronunciation using gamified training, addressing data limitations and imbalance. It provides a dataset (TeflonNorL2) and baseline models, with a wav2vec 2.0 model achieving 36.37% UAR.
Details
Motivation: To improve pronunciation assessment for young L2 learners through gamified training, addressing challenges like limited and imbalanced data.
Method: Participants develop systems using provided pseudo-anonymized data (TeflonNorL2) and baseline models (SVM and wav2vec 2.0).
Result: The multi-task wav2vec 2.0 baseline performs best, achieving 36.37% unweighted average recall (UAR) on the test set.
Conclusion: NOCASA advances L2 pronunciation assessment by providing data and baselines, with wav2vec 2.0 showing promise.
Abstract: This paper presents the “Non-native Children’s Automatic Speech Assessment” (NOCASA) - a data competition part of the IEEE MLSP 2025 conference. NOCASA challenges participants to develop new systems that can assess single-word pronunciations of young second language (L2) learners as part of a gamified pronunciation training app. To achieve this, several issues must be addressed, most notably the limited nature of available training data and the highly unbalanced distribution among the pronunciation level categories. To expedite the development, we provide a pseudo-anonymized training data (TeflonNorL2), containing 10,334 recordings from 44 speakers attempting to pronounce 205 distinct Norwegian words, human-rated on a 1 to 5 scale (number of stars that should be given in the game). In addition to the data, two already trained systems are released as official baselines: an SVM classifier trained on the ComParE_16 acoustic feature set and a multi-task wav2vec 2.0 model. The latter achieves the best performance on the challenge test set, with an unweighted average recall (UAR) of 36.37%.
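Unweighted average recall is macro-averaged recall: each of the five star classes contributes equally regardless of its frequency, which is why it suits NOCASA’s imbalanced labels. A toy computation:

```python
# UAR equals macro-averaged recall: every class (here, star ratings 1-5)
# contributes equally to the score regardless of how frequent it is.
from sklearn.metrics import recall_score

y_true = [1, 1, 2, 3, 3, 3, 4, 5, 5, 2]
y_pred = [1, 2, 2, 3, 3, 1, 4, 5, 4, 2]
print(recall_score(y_true, y_pred, average="macro"))
```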
[13] From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text
Ridwan Mahbub, Mohammed Saidul Islam, Mir Tafseer Nayeem, Md Tahmid Rahman Laskar, Mizanur Rahman, Shafiq Joty, Enamul Hoque
Main category: cs.CL
TL;DR: The paper investigates geo-economic biases in Vision-Language Models (VLMs) when generating chart summaries, finding that high-income countries receive more positive descriptions than lower-income ones.
Details
Motivation: To address the overlooked issue of biases in VLM-generated chart summaries and their potential societal harm.
Method: Large-scale evaluation of 6,000 chart-country pairs across six VLMs, analyzing sentiment differences based on economic status.
Result: VLMs produce more positive summaries for high-income countries, with models like GPT-4o-mini, Gemini-1.5-Flash, and Phi-3.5 showing varying bias levels. Prompt-based debiasing is only partially effective.
Conclusion: The study highlights the need for more robust debiasing strategies in VLMs to mitigate geo-economic biases.
Abstract: Charts are very common for exploring data and communicating insights, but extracting key takeaways from charts and articulating them in natural language can be challenging. The chart-to-text task aims to automate this process by generating textual summaries of charts. While with the rapid advancement of large Vision-Language Models (VLMs), we have witnessed great progress in this domain, little to no attention has been given to potential biases in their outputs. This paper investigates how VLMs can amplify geo-economic biases when generating chart summaries, potentially causing societal harm. Specifically, we conduct a large-scale evaluation of geo-economic biases in VLM-generated chart summaries across 6,000 chart-country pairs from six widely used proprietary and open-source models to understand how a country’s economic status influences the sentiment of generated summaries. Our analysis reveals that existing VLMs tend to produce more positive descriptions for high-income countries compared to middle- or low-income countries, even when country attribution is the only variable changed. We also find that models such as GPT-4o-mini, Gemini-1.5-Flash, and Phi-3.5 exhibit varying degrees of bias. We further explore inference-time prompt-based debiasing techniques using positive distractors but find them only partially effective, underscoring the complexity of the issue and the need for more robust debiasing strategies. Our code and dataset are publicly available here.
[14] User-centric Subjective Leaderboard by Customizable Reward Modeling
Qi Jia, Xiujie Song, Zicheng Zhang, Yijin Guo, Kaiwei Zhang, Zijian Chen, Guangtao Zhai
Main category: cs.CL
TL;DR: The paper introduces a User-Centric Subjective Leaderboard (USL) and Customizable Reward Models (CRMs) to address the limitations of static benchmarks for LLMs, offering dynamic, preference-driven rankings based on real human preferences.
Details
Motivation: Existing benchmarks for LLMs focus on verifiable tasks, lacking utility for practical model selection. The paper aims to bridge this gap by incorporating human preferences.
Method: The USL is built on 10K+ subjective queries, revealing diverse human preferences. CRMs are introduced to address contradictions in preferences, achieving superior performance with 4B parameters.
Result: CRMs outperform models like GPT-4.1 and Gemini-2.5-pro, showing strong generalization. USL exhibits negative correlations to contradictory preferences.
Conclusion: The USL and CRMs provide a dynamic, preference-driven solution for LLM selection, addressing the limitations of static benchmarks.
Abstract: Existing benchmarks for large language models (LLMs) predominantly focus on assessing their capabilities through verifiable tasks. Such objective and static benchmarks offer limited utility for practical LLM selection, making it difficult for users to find suitable models for their individual needs. To bridge this gap, we present the first User-Centric Subjective Leaderboard (USL), which provides a preference-driven, dynamic ranking of LLMs across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries. Our investigation reveals significant diversity and contradictions in human preferences, which limit the effectiveness of state-of-the-art reward models. To address this, we introduce Customizable Reward Models (CRMs). With only 4B parameters, our CRM surpasses the performance of leading models such as GPT-4.1 and Gemini-2.5-pro, showing exceptional generalization capabilities across new topics and criteria. The USL, powered by CRMs, exhibits strong negative correlations to contradictory preferences.
[15] Learning Facts at Scale with Active Reading
Jessy Lin, Vincent-Pierre Berges, Xilun Chen, Wen-Tau Yih, Gargi Ghosh, Barlas Oğuz
Main category: cs.CL
TL;DR: Active Reading improves LLMs’ knowledge absorption by training them with self-generated learning strategies, outperforming vanilla finetuning and other methods.
Details
Motivation: Addressing the unreliable learning and recall of facts in LLMs by providing a tool for consistent knowledge absorption.
Method: Proposes Active Reading, a framework where models study material with self-generated learning strategies.
Result: Significant knowledge gains: 66% on SimpleQA (+313% over vanilla) and 26% on FinanceBench (+160% over vanilla). Meta WikiExpert-8B outperforms larger models on factual QA.
Conclusion: Active Reading enhances factual knowledge in LLMs, scalable to pre-training, and outperforms traditional methods.
Abstract: LLMs are known to store vast amounts of knowledge in their parametric memory. However, learning and recalling facts from this memory is known to be unreliable, depending largely on the prevalence of particular facts in the training data and other factors which are poorly understood. Practitioners are lacking tools which will allow them to ensure that the models learn a given body of knowledge reliably and consistently. To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies. First, we demonstrate models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning and other data augmentations. We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning) by applying Active Reading to the source documents for each benchmark. Finally, we show that Active Reading can be utilized at pre-training scale to build more factual models. As a demonstration of this, we release Meta WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outcompetes models with hundreds of billions of parameters on factual QA.
[16] Memp: Exploring Agent Procedural Memory
Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Main category: cs.CL
TL;DR: The paper introduces Memp, a method to enhance LLM-based agents with a dynamic, learnable procedural memory, improving task success and efficiency.
Details
Motivation: LLM-based agents lack robust procedural memory, which is often manually engineered or static, limiting adaptability and performance.
Method: Memp distills past agent trajectories into fine-grained instructions and higher-level abstractions, with strategies for Build, Retrieval, and Update.
Result: Agents with refined memory achieve higher success rates and efficiency; memory from stronger models boosts weaker models’ performance.
Conclusion: Dynamic procedural memory enhances LLM agents’ adaptability and performance, with potential for transferability across models.
Abstract: Large Language Model (LLM)-based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp that distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model yields substantial performance gains.
[17] From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation
Siyuan Meng, Junming Liu, Yirong Chen, Song Mao, Pinlong Cai, Guohang Yan, Botian Shi, Ding Wang
Main category: cs.CL
TL;DR: DPS is a dynamic passage selector for RAG systems that improves reranking by adaptively selecting relevant passages, outperforming state-of-the-art methods.
Details
Motivation: Traditional reranking modules in RAG systems struggle with multi-hop queries, either omitting crucial information or introducing noise due to fixed Top-K selection.
Method: DPS treats passage selection as a supervised learning problem, capturing inter-passage dependencies and dynamically selecting passages without modifying the RAG pipeline.
Result: DPS outperforms baselines, improving F1-score by 30.06% and 15.4% on MuSiQue over Qwen3-reranker and RankingGPT, respectively.
Conclusion: DPS enhances reasoning in complex RAG scenarios by enabling adaptive evidence selection.
Abstract: Retrieval-augmented generation (RAG) systems are often bottlenecked by their reranking modules, which typically score passages independently and select a fixed Top-K size. This approach struggles with complex multi-hop queries that require synthesizing evidence across multiple documents, creating a trade-off where small K values omit crucial information and large K values introduce noise. To address this, we introduce the Dynamic Passage Selector (DPS), a novel reranking framework that treats passage selection as a supervised learning problem. Unlike traditional point-wise or list-wise methods, DPS is fine-tuned to capture inter-passage dependencies and dynamically select the most relevant set of passages for generation. As a seamless plug-and-play module, DPS requires no modifications to the standard RAG pipeline. Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. Notably, on the challenging MuSiQue dataset, DPS improves the F1-score by 30.06% and 15.4% over strong baselines like Qwen3-reranker and RankingGPT, respectively. Our results demonstrate that by enabling adaptive evidence selection, DPS substantially enhances reasoning capabilities in complex RAG scenarios.
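The contrast between fixed Top-K selection and dynamic selection can be illustrated as follows; DPS itself is a fine-tuned model that emits a relevant set of passages, so the score threshold here is only a simplified stand-in for that learned behavior.

```python
# Illustration of fixed Top-K vs. dynamic selection. DPS itself is a
# fine-tuned model that emits a relevant *set* of passages; the score
# threshold below is a simplified stand-in for that learned behavior.

def top_k(scored: list[tuple[str, float]], k: int = 3) -> list[str]:
    return [p for p, _ in sorted(scored, key=lambda x: -x[1])[:k]]

def dynamic_select(scored: list[tuple[str, float]], tau: float = 0.5) -> list[str]:
    return [p for p, s in scored if s >= tau]  # variable-size evidence set

scored = [("p1", 0.9), ("p2", 0.8), ("p3", 0.2), ("p4", 0.7)]
print(top_k(scored))           # always k passages, may include noise
print(dynamic_select(scored))  # set size adapts to the query
```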
[18] LACA: Improving Cross-lingual Aspect-Based Sentiment Analysis with LLM Data Augmentation
Jakub Šmíd, Pavel Přibáň, Pavel Král
Main category: cs.CL
TL;DR: A new method for cross-lingual aspect-based sentiment analysis (ABSA) avoids unreliable translation tools by using a large language model (LLM) to generate pseudo-labelled data, improving performance across languages and models.
Details
Motivation: Existing methods rely on unreliable translation tools for cross-lingual ABSA, limiting accuracy and robustness.
Method: The approach uses an LLM to generate pseudo-labelled data in the target language, fine-tuning an ABSA model on this data without translation.
Result: The method outperforms translation-based approaches across six languages and five models, with fine-tuned LLMs surpassing smaller multilingual models.
Conclusion: The proposed framework offers a robust, translation-free solution for cross-lingual ABSA, enhancing performance and scalability.
Abstract: Cross-lingual aspect-based sentiment analysis (ABSA) involves detailed sentiment analysis in a target language by transferring knowledge from a source language with available annotated data. Most existing methods depend heavily on often unreliable translation tools to bridge the language gap. In this paper, we propose a new approach that leverages a large language model (LLM) to generate high-quality pseudo-labelled data in the target language without the need for translation tools. First, the framework trains an ABSA model to obtain predictions for unlabelled target language data. Next, LLM is prompted to generate natural sentences that better represent these noisy predictions than the original text. The ABSA model is then further fine-tuned on the resulting pseudo-labelled dataset. We demonstrate the effectiveness of this method across six languages and five backbone models, surpassing previous state-of-the-art translation-based approaches. The proposed framework also supports generative models, and we show that fine-tuned LLMs outperform smaller multilingual models.
[19] Cross-lingual Aspect-Based Sentiment Analysis: A Survey on Tasks, Approaches, and Challenges
Jakub Šmíd, Pavel Král
Main category: cs.CL
TL;DR: This paper provides a comprehensive survey of cross-lingual aspect-based sentiment analysis (ABSA), summarizing tasks, datasets, methods, and challenges, while suggesting future research directions.
Details
Motivation: To address the under-explored area of cross-lingual ABSA by systematically reviewing the field and bridging the gap between resource-rich and low-resource languages.
Method: The paper reviews key ABSA tasks (e.g., aspect term extraction, sentiment classification), datasets, modeling paradigms, and cross-lingual transfer methods, including insights from monolingual, multilingual, and LLM-based ABSA.
Result: The survey highlights current progress, identifies gaps, and synthesizes contributions from related ABSA areas to cross-lingual ABSA.
Conclusion: The paper outlines challenges and proposes future research directions to advance cross-lingual ABSA systems.
Abstract: Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that focuses on understanding opinions at the aspect level, including sentiment towards specific aspect terms, categories, and opinions. While ABSA research has seen significant progress, much of the focus has been on monolingual settings. Cross-lingual ABSA, which aims to transfer knowledge from resource-rich languages (such as English) to low-resource languages, remains an under-explored area, with no systematic review of the field. This paper aims to fill that gap by providing a comprehensive survey of cross-lingual ABSA. We summarize key ABSA tasks, including aspect term extraction, aspect sentiment classification, and compound tasks involving multiple sentiment elements. Additionally, we review the datasets, modelling paradigms, and cross-lingual transfer methods used to solve these tasks. We also examine how existing work in monolingual and multilingual ABSA, as well as ABSA with LLMs, contributes to the development of cross-lingual ABSA. Finally, we highlight the main challenges and suggest directions for future research to advance cross-lingual ABSA systems.
[20] UWBa at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval
Ladislav Lenc, Daniel Cífka, Jiří Martínek, Jakub Šmíd, Pavel Král
Main category: cs.CL
TL;DR: A zero-shot system for fact-checked claim retrieval using state-of-the-art language models, achieving competitive rankings in monolingual and cross-lingual tasks.
Details
Motivation: To develop an effective zero-shot system for retrieving fact-checked claims using advanced language models.
Method: Employed multiple large language models for text embeddings, combined models for optimal results, and used cosine similarity to identify relevant claims.
Result: Achieved 7th place in monolingual and 9th in cross-lingual tasks, with NVIDIA NV-Embed-v2 performing best.
Conclusion: Combining models and leveraging embeddings effectively retrieves fact-checked claims, though multilingual models underperformed.
Abstract: This paper presents a zero-shot system for fact-checked claim retrieval. We employed several state-of-the-art large language models to obtain text embeddings. The models were then combined to obtain the best possible result. Our approach achieved 7th place in monolingual and 9th in cross-lingual subtasks. We used only English translations as an input to the text embedding models since multilingual models did not achieve satisfactory results. We identified the most relevant claims for each post by leveraging the embeddings and measuring cosine similarity. Overall, the best results were obtained by the NVIDIA NV-Embed-v2 model. For some languages, we benefited from model combinations (NV-Embed & GPT or Mistral).
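The retrieval step reduces to embedding the post and every fact-checked claim, then ranking claims by cosine similarity. A minimal sketch, with random vectors standing in for NV-Embed-v2 embeddings of the English translations:

```python
# Minimal sketch of the retrieval step: embed the post and all
# fact-checked claims, then rank claims by cosine similarity. Random
# vectors stand in for NV-Embed-v2 embeddings of English translations.
import numpy as np

rng = np.random.default_rng(0)
post = rng.normal(size=768)
claims = rng.normal(size=(1000, 768))

def cosine_top_k(query: np.ndarray, docs: np.ndarray, k: int = 10) -> np.ndarray:
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k]  # indices of the k most similar claims

print(cosine_top_k(post, claims))
```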
[21] COMPEER: Controllable Empathetic Reinforcement Reasoning for Emotional Support Conversation
Yunxiao Wang, Meng Liu, Wenqi Liu, Kaiyu Jiang, Bin Wen, Fan Yang, Tingting Gao, Guorui Zhou, Liqiang Nie
Main category: cs.CL
TL;DR: The paper introduces controllable empathetic reasoning for emotional support conversations, combining NLP with psychological steps, and uses reinforcement learning and dataset annotation to improve model performance.
Details
Motivation: Current models lack deep empathetic reasoning grounded in psychological principles, limiting their effectiveness in emotional support.
Method: Proposes controllable empathetic reasoning, constructs a fine-grained annotated dataset, employs reinforcement learning with a unified reward model, and introduces personality-based rewriting and redundancy-aware rewards.
Result: The approach significantly improves the model’s emotional support ability, making it more empathetic and human-like.
Conclusion: The method advances the development of empathetic support systems by integrating psychological reasoning and precise feedback mechanisms.
Abstract: Emotional support conversations are crucial for promoting emotional well-being, yet current models often lack deep empathetic reasoning grounded in psychological principles. To address this, we propose controllable empathetic reasoning, which combines natural language reasoning with structured psychological steps. We construct a fine-grained dataset annotated with reasoning correctness and response preferences to enable this capability. To further enhance training, we employ reinforcement learning with a unified process-outcome reward model that delivers precise feedback. To mitigate response repetitiveness from entropy collapse, we introduce personality-based dialogue rewriting and a redundancy-aware reward reweighting strategy. Our approach significantly improves model’s emotional support ability, advancing the development of empathetic, human-like support systems.
[22] The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage
Skyler Hallinan, Jaehun Jung, Melanie Sclar, Ximing Lu, Abhilasha Ravichander, Sahana Ramnath, Yejin Choi, Sai Praneeth Karimireddy, Niloofar Mireshghallah, Xiang Ren
Main category: cs.CL
TL;DR: The paper introduces the N-Gram Coverage Attack, a black-box membership inference attack using only text outputs, outperforming other black-box methods and matching white-box attacks. It scales with compute budget and reveals GPT-4’s improved privacy robustness.
Details
Motivation: To enable membership inference attacks on black-box models like GPT-4, whose hidden states and probability distributions are inaccessible, addressing limitations of current methods.
Method: The N-Gram Coverage Attack uses n-gram overlap metrics to compare model-generated text with candidate data, predicting membership based on similarity. Performance scales with the number of generated sequences.
Result: The attack outperforms black-box methods and rivals white-box attacks. It reveals GPT-4’s increased robustness to membership inference, indicating better privacy protections.
Conclusion: The N-Gram Coverage Attack is effective for black-box models, highlighting advancements in privacy for newer models like GPT-4.
Abstract: Membership inference attacks serve as useful tools for the fair use of language models, such as detecting potential copyright infringement and auditing data leakage. However, many current state-of-the-art attacks require access to models’ hidden states or probability distribution, which prevents investigation into more widely-used, API-access only models like GPT-4. In this work, we introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model, enabling attacks on completely black-box models. We leverage the observation that models are more likely to memorize and subsequently generate text patterns that were commonly observed in their training data. Specifically, to make a prediction on a candidate member, N-Gram Coverage Attack first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs with the ground truth suffix; high similarities indicate likely membership. We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods while also impressively achieving comparable or even better performance to state-of-the-art white-box attacks - despite having access to only text outputs. Interestingly, we find that the success rate of our method scales with the attack compute budget - as we increase the number of sequences generated from the target model conditioned on the prefix, attack performance tends to improve. Having verified the accuracy of our method, we use it to investigate previously unstudied closed OpenAI models on multiple domains. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference, suggesting an evolving trend toward improved privacy protections.
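A minimal sketch of an n-gram coverage score of the kind the attack relies on; the exact aggregation across n-gram orders and generations is an assumption here.

```python
# Sketch of an n-gram coverage score: how much of the ground-truth
# suffix's n-grams appear in model generations conditioned on the
# prefix. The attack's exact aggregation across n-gram orders and
# generations is an assumption here.

def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def coverage(suffix: str, generations: list[str], n: int = 3) -> float:
    target = ngrams(suffix.split(), n)
    if not target:
        return 0.0
    generated = set().union(*(ngrams(g.split(), n) for g in generations))
    return len(target & generated) / len(target)  # high => likely member

gens = ["the quick brown fox jumps over", "a quick brown fox leaps"]
print(coverage("the quick brown fox jumps", gens))  # 1.0
```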
[23] AINL-Eval 2025 Shared Task: Detection of AI-Generated Scientific Abstracts in Russian
Tatiana Batura, Elena Bruches, Milana Shvenk, Valentin Malykh
Main category: cs.CL
TL;DR: The AINL-Eval 2025 Shared Task introduces a dataset and challenge for detecting AI-generated scientific abstracts in Russian, aiming to address academic integrity concerns.
Details
Motivation: The rise of LLMs makes it hard to distinguish human- from AI-generated content, threatening academic integrity, especially in multilingual contexts.
Method: A large-scale dataset of 52,305 samples (human and AI-generated abstracts) was created, and a shared task was organized with 10 teams submitting 159 solutions.
Result: Top systems showed strong performance in detecting AI-generated content, even for unseen domains and models.
Conclusion: The shared task and platform aim to foster ongoing research in AI-generated content detection, with the dataset publicly available.
Abstract: The rapid advancement of large language models (LLMs) has revolutionized text generation, making it increasingly difficult to distinguish between human- and AI-generated content. This poses a significant challenge to academic integrity, particularly in scientific publishing and multilingual contexts where detection resources are often limited. To address this critical gap, we introduce the AINL-Eval 2025 Shared Task, specifically focused on the detection of AI-generated scientific abstracts in Russian. We present a novel, large-scale dataset comprising 52,305 samples, including human-written abstracts across 12 diverse scientific domains and AI-generated counterparts from five state-of-the-art LLMs (GPT-4-Turbo, Gemma2-27B, Llama3.3-70B, Deepseek-V3, and GigaChat-Lite). A core objective of the task is to challenge participants to develop robust solutions capable of generalizing to both (i) previously unseen scientific domains and (ii) models not included in the training data. The task was organized in two phases, attracting 10 teams and 159 submissions, with top systems demonstrating strong performance in identifying AI-generated content. We also establish a continuous shared task platform to foster ongoing research and long-term progress in this important area. The dataset and platform are publicly available at https://github.com/iis-research-team/AINL-Eval-2025.
[24] Improving Diversity in Language Models: When Temperature Fails, Change the Loss
Alexandre Verine, Florian Le Bronnec, Kunhao Zheng, Alexandre Allauzen, Yann Chevaleyre, Benjamin Negrevergne
Main category: cs.CL
TL;DR: The paper explores how adjusting decoding temperature affects language model diversity, proposing a Precision-Recall framework for better trade-offs.
Details
Motivation: To understand why temperature adjustments often fail to improve coverage (Recall) and how to achieve better diversity in language models.
Method: Analyzes temperature scaling effects, proposes rethinking loss functions using the Precision-Recall framework.
Result: The proposed approach achieves a better Precision-Recall trade-off than traditional methods.
Conclusion: Rethinking loss functions via Precision-Recall can lead to more versatile and robust language models.
Abstract: Increasing diversity in language models is a challenging yet essential objective. A common approach is to raise the decoding temperature. In this work, we investigate this approach through a simplistic yet common case to provide insights into why decreasing temperature can improve quality (Precision), while increasing it often fails to boost coverage (Recall). Our analysis reveals that for a model to be effectively tunable through temperature adjustments, it must be trained toward coverage. To address this, we propose rethinking loss functions in language models by leveraging the Precision-Recall framework. Our results demonstrate that this approach achieves a substantially better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling. These findings offer a pathway toward more versatile and robust language modeling techniques.
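The temperature mechanism the paper analyzes is the usual rescaling of logits before the softmax: raising T flattens the sampling distribution (more diversity), while lowering T sharpens it.

```latex
% Temperature-scaled softmax over vocabulary logits z: raising T
% flattens the sampling distribution, lowering T sharpens it.
p_T(i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
```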
[25] EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization
Yaoning Wang, Jiahao Ying, Yixin Cao, Yubo Ma, Yugang Jiang
Main category: cs.CL
TL;DR: EffiEval is a training-free method for efficient benchmarking of large language models (LLMs), addressing data redundancy while maintaining reliability by selecting representative subsets using the Model Utility Index (MUI).
Details
Motivation: The rapid growth of LLMs and diverse benchmarks creates computational challenges for evaluation, necessitating a more efficient and reliable method.
Method: EffiEval uses the Model Utility Index (MUI) to adaptively select high-quality, representative subsets of data, ensuring fairness and generalizability without extensive evaluation data.
Result: Experiments show EffiEval achieves strong ranking consistency with full-dataset evaluation using only a fraction of the data, while remaining flexible and scalable.
Conclusion: EffiEval offers a practical, fair, and efficient solution for LLM evaluation, balancing representativeness and computational efficiency.
Abstract: The rapid advancement of large language models (LLMs) and the development of increasingly large and diverse evaluation benchmarks have introduced substantial computational challenges for model assessment. In this paper, we present EffiEval, a training-free approach for efficient benchmarking that effectively addresses data redundancy while maintaining high evaluation reliability. Our method is specifically designed to meet three key criteria for high-quality evaluation: representativeness, by ensuring comprehensive coverage of model capabilities; fairness, by remaining independent of model performance during sample selection to avoid bias; and generalizability, by enabling flexible transfer across datasets and model families without reliance on large-scale evaluation data. Unlike traditional methods that rely on absolute performance or require extensive evaluation data, our approach adaptively selects high-quality representative subsets based on the Model Utility Index (MUI). Extensive experiments on multiple public benchmarks and diverse LLMs demonstrate that EffiEval achieves strong ranking consistency with full-dataset evaluation using only a small fraction of the original data. Furthermore, our method is flexible and scalable in size, allowing users to balance evaluation efficiency and representativeness according to specific needs. Overall, EffiEval provides a practical and generalizable solution for reliable, fair, and efficient evaluation in the era of LLMs.
[26] Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation
Ziyang Ma, Qingyue Yuan, Linhai Zhang, Deyu Zhou
Main category: cs.CL
TL;DR: The paper introduces SLowED, a safe distillation method for Small Language Models (SLMs) to maintain safety while enhancing reasoning during Chain-of-Thought (CoT) distillation.
Details
Motivation: Existing CoT distillation methods improve SLM reasoning but compromise safety. Current safety alignment techniques require extra resources and may harm reasoning. This work aims to balance safety and reasoning in SLMs.
Method: Proposes SLowED with two modules: Slow Tuning (limits weight changes) and Low-Entropy Masking (excludes low-entropy tokens from fine-tuning).
Result: SLowED retains SLM safety and improves reasoning on benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench). Ablation confirms module effectiveness.
Conclusion: SLowED effectively balances safety and reasoning in SLMs during CoT distillation, outperforming existing methods.
Abstract: Previous chain-of-thought (CoT) distillation methods primarily focused on enhancing the reasoning capabilities of Small Language Models (SLMs) by utilizing high-quality rationales generated by powerful Large Language Models (LLMs, e.g., GPT-4). However, few works have noted the negative effects on SLM safety brought by the training, which are revealed in this study. Although there are works on safety alignment that fine-tune language models or manipulate model weights to defend against harmful inputs, they require extra computation or annotated data, and probably impact the reasoning ability of SLMs. In this paper, we investigate how to maintain the safety of SLMs during the CoT distillation process. Specifically, we propose a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), containing two modules: Slow Tuning and Low-Entropy Masking. Slow Tuning scales down the magnitude of model weight changes to optimize the model weights in the neighboring space near the initial weight distribution. Low-Entropy Masking masks low-entropy tokens, which are regarded as unnecessary learning targets, to exclude them from fine-tuning. Experiments on three SLMs (Qwen2.5-1.5B, Llama-3.2-1B, BLOOM-1.1B) across reasoning benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench) show that SLowED retains the safety of SLMs and comparably improves their reasoning capability compared to existing distillation methods. Furthermore, our ablation study presents the effectiveness of Slow Tuning and Low-Entropy Masking, with the former maintaining the model’s safety in the early stage and the latter prolonging the safe training epochs.
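As a rough illustration of the Low-Entropy Masking module, the PyTorch sketch below excludes tokens whose predictive entropy falls below a threshold from the cross-entropy loss; the threshold value and the distribution the entropy is computed from are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def masked_ce_loss(logits, targets, entropy_threshold=0.5):
    """Cross-entropy over tokens whose predictive entropy exceeds a threshold."""
    # logits: (batch, seq, vocab); targets: (batch, seq)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)       # (batch, seq)
    keep = (entropy >= entropy_threshold).float()              # drop low-entropy tokens
    token_loss = F.nll_loss(log_probs.transpose(1, 2), targets,
                            reduction="none")                  # (batch, seq)
    return (token_loss * keep).sum() / keep.sum().clamp(min=1.0)

logits = torch.randn(2, 8, 100)                                # toy batch
targets = torch.randint(0, 100, (2, 8))
print(masked_ce_loss(logits, targets))
```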
[27] Evaluating the Role of Large Language Models in Legal Practice in India
Rahul Hemrajani
Main category: cs.CL
TL;DR: LLMs like GPT, Claude, and Llama perform well in drafting and issue spotting but struggle with specialized legal research, often producing incorrect outputs. Human expertise remains crucial for nuanced legal tasks.
Details
Motivation: To evaluate the performance of LLMs in key legal tasks in the Indian context and compare them with human lawyers.
Method: Survey experiment comparing LLM outputs with a junior lawyer’s work, rated by advanced law students on helpfulness, accuracy, and comprehensiveness.
Result: LLMs excel in drafting and issue spotting but falter in specialized research, often generating hallucinations or incorrect outputs.
Conclusion: LLMs can augment certain legal tasks, but human expertise is essential for nuanced reasoning and precise legal application.
Abstract: The integration of Artificial Intelligence (AI) into the legal profession raises significant questions about the capacity of Large Language Models (LLMs) to perform key legal tasks. In this paper, I empirically evaluate how well LLMs, such as GPT, Claude, and Llama, perform key legal tasks in the Indian context, including issue spotting, legal drafting, advice, research, and reasoning. Through a survey experiment, I compare outputs from LLMs with those of a junior lawyer, with advanced law students rating the work on helpfulness, accuracy, and comprehensiveness. LLMs excel in drafting and issue spotting, often matching or surpassing human work. However, they struggle with specialised legal research, frequently generating hallucinations: factually incorrect or fabricated outputs. I conclude that while LLMs can augment certain legal tasks, human expertise remains essential for nuanced reasoning and the precise application of law.
[28] The Perils of Chart Deception: How Misleading Visualizations Affect Vision-Language Models
Ridwan Mahbub, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Mizanur Rahman, Mir Tafseer Nayeem, Enamul Hoque
Main category: cs.CL
TL;DR: The study evaluates how Vision-Language Models (VLMs) interpret misleading visualizations, finding that most are deceived by deceptive designs, altering chart interpretations despite unchanged data.
Details
Motivation: To understand VLMs' susceptibility to deceptive visualization designs, as their misuse can spread misinformation, especially among non-experts.
Method: Analyzed over 16,000 responses from ten VLMs across eight types of misleading chart designs.
Result: Most VLMs were deceived by misleading visualizations, leading to altered interpretations of the same data.
Conclusion: Highlights the need for safeguards in VLMs to prevent visual misinformation.
Abstract: Information visualizations are powerful tools that help users quickly identify patterns, trends, and outliers, facilitating informed decision-making. However, when visualizations incorporate deceptive design elements, such as truncated or inverted axes, unjustified 3D effects, or violations of best practices, they can mislead viewers and distort understanding, spreading misinformation. While some deceptive tactics are obvious, others subtly manipulate perception while maintaining a facade of legitimacy. As Vision-Language Models (VLMs) are increasingly used to interpret visualizations, especially by non-expert users, it is critical to understand how susceptible these models are to deceptive visual designs. In this study, we conduct an in-depth evaluation of VLMs’ ability to interpret misleading visualizations. By analyzing over 16,000 responses from ten different models across eight distinct types of misleading chart designs, we demonstrate that most VLMs are deceived by them. This leads to altered interpretations of charts, despite the underlying data remaining the same. Our findings highlight the need for robust safeguards in VLMs against visual misinformation.
[29] Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, Dimitris Papailiopoulos
Main category: cs.CL
TL;DR: GFPO reduces response length inflation in large language models by filtering training responses based on length and token efficiency, maintaining accuracy while cutting unnecessary verbosity.
Details
Motivation: Address the issue of models inflating response lengths to gain accuracy, often with filler content, by optimizing for efficiency.
Method: Introduces GFPO, which samples larger groups per problem during training and filters responses based on length and token efficiency (reward per token). Also proposes Adaptive Difficulty GFPO for dynamic resource allocation.
Result: GFPO reduces length inflation by 46-71% (up to 85% with token efficiency optimization) on benchmarks while maintaining accuracy. Adaptive Difficulty GFPO improves efficiency on harder questions.
Conclusion: GFPO effectively trades training-time compute for reduced test-time compute, offering a simple yet powerful method for efficient reasoning.
Abstract: Large language models trained with reinforcement learning with verifiable rewards tend to trade accuracy for length, inflating response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely “filler”: repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering responses to train on based on two key metrics: (1) response length and (2) token efficiency: reward per token ratio. By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO’s length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases reductions in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy, especially on difficult questions. GFPO demonstrates that increased training-time compute directly translates to reduced test-time compute, a simple yet effective trade-off for efficient reasoning.
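The filtering step at GFPO's core is easy to sketch; the group contents, retention count, and the choice between the length and reward-per-token criteria below are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Response:
    text: str
    reward: float        # verifiable reward, e.g. 1.0 if the answer is correct
    n_tokens: int

def filter_group(group: list[Response], keep: int, by: str = "efficiency"):
    """Keep only the shortest or most token-efficient responses for training."""
    if by == "length":
        key = lambda r: r.n_tokens                  # prefer short responses
    else:
        key = lambda r: -(r.reward / r.n_tokens)    # prefer reward per token
    return sorted(group, key=key)[:keep]

group = [Response("a", 1.0, 900), Response("b", 1.0, 300), Response("c", 0.0, 120)]
print([r.text for r in filter_group(group, keep=2)])  # ['b', 'a']
```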
[30] Transforming Questions and Documents for Semantically Aligned Retrieval-Augmented Generation
Seokgi Lee
Main category: cs.CL
TL;DR: A novel RAG framework for multihop QA uses LLM for query decomposition and answerable-question embeddings, improving performance on benchmarks.
Details
Motivation: Address ambiguity in multihop queries and enhance retrieval accuracy by decomposing questions and embedding answerable questions.
Method: Decompose multihop questions into single-hop subquestions; generate and embed answerable questions from document chunks for retrieval.
Result: Improved RAG performance on MuSiQue, 2WikiMultiHopQa, and HotpotQA datasets.
Conclusion: Answerable-question embeddings and LLM-based query decomposition enhance multihop QA performance.
Abstract: We introduce a novel retrieval-augmented generation (RAG) framework tailored for multihop question answering. First, our system uses a large language model (LLM) to decompose complex multihop questions into a sequence of single-hop subquestions that guide document retrieval. This decomposition mitigates the ambiguity inherent in multi-hop queries by clearly targeting distinct knowledge facets. Second, instead of embedding raw or chunked documents directly, we generate answerable questions from each document chunk using Qwen3-8B, embed these generated questions, and retrieve relevant chunks via question-question embedding similarity. During inference, the retrieved chunks are then fed along with the original question into the RAG pipeline. We evaluate on three multihop question datasets (MuSiQue, 2WikiMultiHopQa, HotpotQA) from LongBench. Our method improves RAG performance compared to baseline systems. Our contributions highlight the benefits of using answerable-question embeddings for RAG, and the effectiveness of LLM-based query decomposition for multihop scenarios.
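A loose sketch of the answerable-question retrieval path: the paper generates questions with Qwen3-8B, stubbed out below with a placeholder function, and the embedding model named here is an assumption rather than the paper's choice.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def generate_questions(chunk: str) -> list[str]:
    # Placeholder for the LLM call that writes questions answerable by `chunk`.
    return [f"What does the passage say about {chunk.split()[0]}?"]

chunks = ["Paris is the capital of France.", "The Nile flows through Egypt."]
embedder = SentenceTransformer("all-MiniLM-L6-v2")

index = []                                    # (chunk_id, question embedding)
for i, chunk in enumerate(chunks):
    for q in generate_questions(chunk):
        index.append((i, embedder.encode(q, normalize_embeddings=True)))

def retrieve(subquestion: str, k: int = 1) -> list[str]:
    """Rank chunks by question-question cosine similarity."""
    q_emb = embedder.encode(subquestion, normalize_embeddings=True)
    scored = sorted(index, key=lambda p: -float(np.dot(p[1], q_emb)))
    out, seen = [], set()
    for cid, _ in scored:
        if cid not in seen:
            seen.add(cid)
            out.append(chunks[cid])
        if len(out) == k:
            break
    return out

print(retrieve("Which city is France's capital?"))
```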
[31] Echoes of Agreement: Argument Driven Opinion Shifts in Large Language Models
Avneet Kaur
Main category: cs.CL
TL;DR: LLMs’ political bias evaluations are highly sensitive to prompts, especially when prompts include suggestive arguments. Experiments show model responses align with provided arguments, indicating sycophantic tendencies.
Details
Motivation: To understand how robust bias evaluations are and how LLMs behave when interacting with opinionated text, especially with suggestive prompts.
Method: Conducted experiments evaluating political bias in LLMs with supporting and refuting arguments in single-turn and multi-turn settings.
Result: Model responses align with the direction of provided arguments, and argument strength influences directional agreement rates.
Conclusion: LLMs exhibit sycophantic tendencies, adapting stances to align with arguments, impacting bias measurement and mitigation strategies.
Abstract: There have been numerous studies evaluating bias of LLMs towards political topics. However, positions towards these topics in model outputs are highly sensitive to the prompt. What happens when the prompt itself is suggestive of certain arguments towards those positions remains underexplored. This is crucial for understanding how robust these bias evaluations are and for understanding model behaviour, as these models frequently interact with opinionated text. To that end, we conduct experiments for political bias evaluation in the presence of supporting and refuting arguments. Our experiments show that such arguments substantially alter model responses towards the direction of the provided argument in both single-turn and multi-turn settings. Moreover, we find that the strength of these arguments influences the directional agreement rate of model responses. These effects point to a sycophantic tendency in LLMs to adapt their stance to align with the presented arguments, which has downstream implications for measuring political bias and developing effective mitigation strategies.
[32] Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study
Mahdi Dhaini, Juraj Vladika, Ege Erdogan, Zineb Attaoui, Gjergji Kasneci
Main category: cs.CL
TL;DR: An automated framework using LLMs generates high-quality textual explanations, rivaling human annotations in improving NLP model performance.
Details
Motivation: To address the cost and scalability issues of human-annotated explanations in NLP, leveraging LLMs for automated explanation generation.
Method: Utilizes multiple state-of-the-art LLMs to generate explanations, evaluated with NLG metrics, and tests their impact on PLMs and LLMs in inference tasks.
Result: Automated explanations are highly competitive with human-annotated ones in enhancing model performance.
Conclusion: LLM-based automated explanation generation is a scalable and effective method for enriching NLP datasets and boosting model performance.
Abstract: In the rapidly evolving field of Explainable Natural Language Processing (NLP), textual explanations, i.e., human-like rationales, are pivotal for explaining model predictions and enriching datasets with interpretable labels. Traditional approaches rely on human annotation, which is costly, labor-intensive, and impedes scalability. In this work, we present an automated framework that leverages multiple state-of-the-art large language models (LLMs) to generate high-quality textual explanations. We rigorously assess the quality of these LLM-generated explanations using a comprehensive suite of Natural Language Generation (NLG) metrics. Furthermore, we investigate the downstream impact of these explanations on the performance of pre-trained language models (PLMs) and LLMs across natural language inference tasks on two diverse benchmark datasets. Our experiments demonstrate that automated explanations exhibit highly competitive effectiveness compared to human-annotated explanations in improving model performance. Our findings underscore a promising avenue for scalable, automated LLM-based textual explanation generation for extending NLP datasets and enhancing model performance.
[33] Adoption of Explainable Natural Language Processing: Perspectives from Industry and Academia on Practices and Challenges
Mahdi Dhaini, Tobias Müller, Roksoliana Rabets, Gjergji Kasneci
Main category: cs.CL
TL;DR: The paper explores practitioners’ perspectives on explainable NLP, revealing gaps, low satisfaction, and evaluation challenges, advocating for clearer definitions and user-centric frameworks.
Details
Motivation: The opacity of complex NLP models necessitates transparency and explanations for understanding and deployment, especially in high-stakes environments.
Method: Qualitative interview-based study with industry practitioners and academic researchers to analyze motivations, techniques, satisfaction, and challenges.
Result: Findings show conceptual gaps, low satisfaction with current methods, and evaluation challenges.
Conclusion: Clearer definitions and user-centric frameworks are needed for practical adoption of explainable NLP.
Abstract: The field of explainable natural language processing (NLP) has grown rapidly in recent years. The growing opacity of complex models calls for transparency and explanations of their decisions, which is crucial to understand their reasoning and facilitate deployment, especially in high-stakes environments. Despite increasing attention given to explainable NLP, practitioners' perspectives regarding its practical adoption and effectiveness remain underexplored. This paper addresses this research gap by investigating practitioners’ experiences with explainability methods, specifically focusing on their motivations for adopting such methods, the techniques employed, satisfaction levels, and the practical challenges encountered in real-world NLP applications. Through a qualitative interview-based study with industry practitioners and complementary interviews with academic researchers, we systematically analyze and compare their perspectives. Our findings reveal conceptual gaps, low satisfaction with current explainability methods, and highlight evaluation challenges. Our findings emphasize the need for clear definitions and user-centric frameworks for better adoption of explainable NLP in practice.
[34] BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning
Ahmed Masry, Abhay Puri, Masoud Hashemi, Juan A. Rodriguez, Megh Thakkar, Khyati Mahajan, Vikas Yadav, Sathwik Tejaswi Madhusudhan, Alexandre Piché, Dzmitry Bahdanau, Christopher Pal, David Vazquez, Enamul Hoque, Perouz Taslakian, Sai Rajeswar, Spandana Gella
Main category: cs.CL
TL;DR: The paper introduces BigCharts, a dataset and training framework for improving chart comprehension in vision-language models, addressing limitations of current datasets and methods.
Details
Motivation: Current VLMs struggle with chart comprehension due to low-quality, non-diverse datasets and reliance on supervised fine-tuning, limiting their effectiveness.
Method: Proposes BigCharts, a pipeline for generating diverse, authentic chart images using real-world data, and a training framework combining supervised fine-tuning with GRPO-based reinforcement learning.
Result: BigCharts-R1 outperforms existing models on chart question-answering benchmarks, demonstrating superior robustness and generalization.
Conclusion: The BigCharts approach effectively addresses dataset and training limitations, advancing the state-of-the-art in chart reasoning.
Abstract: Charts are essential to data analysis, transforming raw data into clear visual representations that support human decision-making. Although current vision-language models (VLMs) have made significant progress, they continue to struggle with chart comprehension due to training on datasets that lack diversity and real-world authenticity, or on automatically extracted underlying data tables of charts, which can contain numerous estimation errors. Furthermore, existing models only rely on supervised fine-tuning using these low-quality datasets, severely limiting their effectiveness. To address these issues, we first propose BigCharts, a dataset creation pipeline that generates visually diverse chart images by conditioning the rendering process on real-world charts sourced from multiple online platforms. Unlike purely synthetic datasets, BigCharts incorporates real-world data, ensuring authenticity and visual diversity, while still retaining accurate underlying data due to our proposed replotting process. Additionally, we introduce a comprehensive training framework that integrates supervised fine-tuning with Group Relative Policy Optimization (GRPO)-based reinforcement learning. By introducing novel reward signals specifically designed for chart reasoning, our approach enhances model robustness and generalization across diverse chart styles and domains, resulting in a state-of-the-art chart reasoning model, BigCharts-R1. Extensive experiments demonstrate that our models surpass existing methods on multiple chart question-answering benchmarks, even when compared to larger open-source and closed-source models.
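For readers unfamiliar with GRPO, its group-relative advantage replaces a learned value baseline with statistics of the response group sampled per prompt; the reward values below are made up for illustration.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each response's reward against its sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 1.0, 0.5])   # e.g. chart-QA correctness rewards
print(group_relative_advantages(rewards).round(3))
```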
[35] A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems
Aishik Mandal, Prottay Kumar Adhikary, Hiba Arnaout, Iryna Gurevych, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: A survey of clinical mental health datasets for AI development highlights gaps like lack of longitudinal data and cultural diversity, offering recommendations for better datasets.
Details
Motivation: The rise in mental health disorders and shortage of clinicians necessitates AI assistance, but current datasets are scattered and inadequate.
Method: The paper surveys and categorizes datasets by disorder, modality, task, accessibility, and sociocultural context, including synthetic data.
Result: Identified gaps include limited longitudinal data, cultural representation, and inconsistent standards, hindering AI model robustness.
Conclusion: Recommendations are provided to improve dataset curation for more equitable and generalizable mental health AI systems.
Abstract: Mental health disorders are rising worldwide. However, the availability of trained clinicians has not scaled proportionally, leaving many people without adequate or timely support. To bridge this gap, recent studies have shown the promise of Artificial Intelligence (AI) to assist mental health diagnosis, monitoring, and intervention. However, the development of efficient, reliable, and ethical AI to assist clinicians is heavily dependent on high-quality clinical training datasets. Despite growing interest in data curation for training clinical AI assistants, existing datasets largely remain scattered, under-documented, and often inaccessible, hindering the reproducibility, comparability, and generalizability of AI models developed for clinical mental health care. In this paper, we present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. We categorize these datasets by mental disorders (e.g., depression, schizophrenia), data modalities (e.g., text, speech, physiological signals), task types (e.g., diagnosis prediction, symptom severity estimation, intervention generation), accessibility (public, restricted or private), and sociocultural context (e.g., language and cultural background). Along with these, we also investigate synthetic clinical mental health datasets. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data. We conclude by outlining key challenges in curating and standardizing future datasets and provide actionable recommendations to facilitate the development of more robust, generalizable, and equitable mental health AI systems.
[36] Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng
Main category: cs.CL
TL;DR: A survey on innovative LLM architectures addressing transformer limitations to improve efficiency, covering methods like linear/sparse modeling, efficient attention variants, and hybrid models.
Details
Motivation: Traditional transformers are computationally heavy, hindering large-scale training and deployment. This survey explores efficient alternatives to overcome these challenges.
Method: Systematic examination of techniques like linear/sparse sequence modeling, efficient attention variants, sparse mixture-of-experts, hybrid architectures, and diffusion LLMs.
Result: Provides a blueprint of modern efficient LLM architectures, grouping studies to highlight advancements in scalability and resource efficiency.
Conclusion: The survey aims to inspire future research for more efficient and versatile AI systems by summarizing current innovations in LLM architectures.
Abstract: Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and push the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.
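As one concrete member of the linear sequence modeling family the survey covers, here is a toy kernelized linear attention; the ELU+1 feature map follows a common "linear transformer" choice, and the dimensions are toy values not tied to any particular surveyed paper.

```python
import numpy as np

def phi(x):
    """ELU(x) + 1: a positive feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """phi(Q) (phi(K)^T V): O(n) in sequence length vs O(n^2) softmax attention."""
    Qf, Kf = phi(Q), phi(K)                  # (n, d)
    kv = Kf.T @ V                            # (d, d_v), computed once
    z = Qf @ Kf.sum(axis=0)                  # (n,) per-position normalizer
    return (Qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)       # (6, 4)
```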
[37] PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou
Main category: cs.CL
TL;DR: PRELUDE is a benchmark for evaluating long-context understanding by assessing the consistency of prequel stories with original narratives, revealing gaps in AI performance compared to humans.
Details
Motivation: To address the need for benchmarks that demand global comprehension and deep reasoning, as existing ones fall short in evaluating long-context understanding.
Method: Introduces PRELUDE, a task requiring evidence from multiple parts of a narrative to assess prequel plausibility, tested with state-of-the-art LLMs and commercial services.
Result: AI models lag behind humans by >15%, with a 30% gap in reasoning accuracy due to flawed reasoning despite correct answers.
Conclusion: The study highlights significant challenges and room for improvement in long-context understanding and reasoning for AI models.
Abstract: We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character’s prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks, as the prequels are not part of the original story; assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
[38] Assessing the Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription
Abdul Rehman Antall, Naveed Akhtar
Main category: cs.CL
TL;DR: The study evaluates lightweight Whisper models (Tiny, Base, Small) for Urdu speech recognition in low-resource settings, finding Whisper-Small performs best (33.68% WER) but highlights persistent challenges.
Details
Motivation: Urdu, despite being widely spoken, lacks robust ASR systems due to dialectal diversity, code-switching, and sparse data. This study aims to address this gap.
Method: Benchmarked Whisper models (Tiny, Base, Small) on a curated Urdu dataset using WER without fine-tuning.
Result: Whisper-Small achieved the lowest WER (33.68%), outperforming Tiny (67.08%) and Base (53.67%), but challenges in phonetic accuracy and lexical coherence remain.
Conclusion: Whisper-Small shows promise for Urdu ASR, but further research is needed to address gaps in low-resource settings.
Abstract: This study evaluates the feasibility of lightweight Whisper models (Tiny, Base, Small) for Urdu speech recognition in low-resource settings. Despite Urdu being the 10th most spoken language globally with over 230 million speakers, its representation in automatic speech recognition (ASR) systems remains limited due to dialectal diversity, code-switching, and sparse training data. We benchmark these models on a curated Urdu dataset using word error rate (WER), without fine-tuning. Results show Whisper-Small achieves the lowest error rates (33.68% WER), outperforming Tiny (67.08% WER) and Base (53.67% WER). Qualitative analysis reveals persistent challenges in phonetic accuracy and lexical coherence, particularly for complex utterances. While Whisper-Small demonstrates promise for deployable Urdu ASR, significant gaps remain. Our findings lay the groundwork for future research into effective, low-resource ASR systems.
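The benchmarking loop is straightforward to reproduce in outline; the sketch below assumes the openai-whisper and jiwer packages, and the audio path and reference text are placeholders.

```python
import whisper
from jiwer import wer

model = whisper.load_model("small")        # also try "tiny" and "base"

dataset = [("clip_001.wav", "reference transcription for clip one")]  # placeholder
references, hypotheses = [], []
for audio_path, reference in dataset:
    result = model.transcribe(audio_path, language="ur")  # Urdu, no fine-tuning
    references.append(reference)
    hypotheses.append(result["text"])

print(f"WER: {wer(references, hypotheses):.2%}")
```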
[39] Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models
Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
Main category: cs.CL
TL;DR: Memory Decoder is a plug-and-play pretrained memory for efficient domain adaptation of LLMs, reducing perplexity by 6.17 points on average.
Details
Motivation: Adapting LLMs to specific domains is costly and suffers from issues like catastrophic forgetting (DAPT) or high latency (RAG).
Method: Uses a small transformer decoder to mimic an external retriever, enabling seamless integration with any pretrained model sharing the same tokenizer.
Result: Effective adaptation of Qwen and Llama models to biomedicine, finance, and law, with reduced perplexity.
Conclusion: Memory Decoder offers a novel, efficient paradigm for domain-specific adaptation without modifying original model parameters.
Abstract: Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current methods like Domain Adaptive Pretraining (DAPT) require costly full-parameter training and suffer from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model’s parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.
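The summary does not spell out how the memory's predictions combine with the frozen LM at inference time; interpolating the two next-token distributions, as in the kNN-LM that Memory Decoder's retriever-imitation objective echoes, is one natural reading, sketched here with an assumed mixing weight.

```python
import torch
import torch.nn.functional as F

def fused_next_token_probs(lm_logits, memory_logits, lam=0.25):
    """Mix the memory's distribution into the base LM's (shared tokenizer)."""
    p_lm = F.softmax(lm_logits, dim=-1)
    p_mem = F.softmax(memory_logits, dim=-1)
    return lam * p_mem + (1.0 - lam) * p_lm

vocab = 32000
probs = fused_next_token_probs(torch.randn(vocab), torch.randn(vocab))
print(probs.sum())   # tensor(1.0000): still a valid distribution
```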
[40] A Survey of Cognitive Distortion Detection and Classification in NLP
Archie Sage, Jeroen Keppens, Helen Yannakoudakis
Main category: cs.CL
TL;DR: A survey of NLP applications in mental health for detecting cognitive distortions (CDs), reviewing 38 studies to address inconsistencies and improve research coherence.
Details
Motivation: The increasing use of NLP in mental health highlights the need for standardized approaches to detect and classify CDs, given their therapeutic importance.
Method: The paper reviews 38 studies over two decades, analyzing datasets, modeling approaches, and evaluation strategies to propose a consolidated CD taxonomy.
Result: The survey identifies inconsistencies in CD taxonomies and evaluation practices, offering a structured reference to guide future research.
Conclusion: The study calls for more coherent and reproducible research in NLP for mental health, emphasizing standardized taxonomies and evaluation methods.
Abstract: As interest grows in the application of natural language processing (NLP) techniques to mental health, a growing body of work explores the automatic detection and classification of cognitive distortions (CDs). CDs are habitual patterns of negatively biased or flawed thinking that distort how people perceive events, judge themselves, and react to the world around them. Identifying and addressing them is an important part of therapy. Despite its momentum, the field remains fragmented, with inconsistencies in CD taxonomies, task formulations, and evaluation practices. This survey reviews 38 studies spanning two decades, providing a structured overview of datasets, modelling approaches, and evaluation strategies. We provide a consolidated CD taxonomy reference, summarise common task setups, and highlight open challenges to support more coherent and reproducible research in this emerging area.
[41] Language of Persuasion and Misrepresentation in Business Communication: A Textual Detection Approach
Sayem Hossen, Monalisa Moon Joti, Md. Golam Rashed
Main category: cs.CL
TL;DR: The paper explores how digitization in business communication enables both transparency and deception, proposing a method to detect deceptive language using persuasive lexicon with high accuracy, though multilingual challenges persist.
Details
Motivation: To address the growing gap between theoretical and empirical representations of communication in digital business contexts, especially with the rise of AI-based discourse.
Method: Combines classical rhetoric, communication psychology, linguistic theory, and empirical studies to detect deceptive language using computational textual analysis and personalised transformer models.
Result: Achieved over 99% detection accuracy in controlled settings, but faced challenges in multilingual reproducibility due to data scarcity and lack of infrastructure.
Conclusion: Strong automatic text-identification systems are needed as AI-based discourse becomes more realistic, highlighting the gap between theory and practice in communication.
Abstract: The digitisation of business communication has reorganised persuasive discourse, enabling not only greater transparency but also more sophisticated deception. This inquiry synthesises classical rhetoric and communication psychology with linguistic theory and empirical studies in financial reporting, sustainability discourse, and digital marketing to explain how deceptive language can be systematically detected using a persuasive lexicon. In controlled settings, detection accuracies greater than 99% were achieved using computational textual analysis and personalised transformer models. However, reproducing this performance in multilingual settings remains problematic, largely because sufficient data is hard to obtain and few multilingual text-processing infrastructures are in place. This evidence points to a growing gap between theoretical representations of communication and their empirical approximations, and to the need for strong automatic text-identification systems as AI-based discourse becomes more realistic in communicating with humans.
[42] A Comprehensive Evaluation framework of Alignment Techniques for LLMs
Muneeza Azmat, Momin Abbas, Maysa Malfiza Garcia de Macedo, Marcelo Carpinette Grave, Luan Soares de Souza, Tiago Machado, Rogerio A de Paula, Raya Horesh, Yixin Chen, Heloisa Caroline de Souza Pereira Candello, Rebecka Nordenlow, Aminat Adebiyi
Main category: cs.CL
TL;DR: The paper introduces a multi-dimensional evaluation framework for comparing alignment techniques in LLMs, assessing detection, quality, efficiency, and robustness.
Details
Motivation: To address the lack of unified evaluation frameworks for comparing diverse LLM alignment approaches, ensuring safe and value-aligned outputs.
Method: Proposes a comprehensive framework evaluating alignment techniques across four dimensions: detection, quality, efficiency, and robustness, tested on various models and strategies.
Result: The framework effectively identifies strengths and limitations of current alignment methods, offering insights for future research.
Conclusion: The introduced framework provides a systematic way to evaluate and compare LLM alignment techniques, aiding deployment decisions and future advancements.
Abstract: As Large Language Models (LLMs) become increasingly integrated into real-world applications, ensuring their outputs align with human values and safety standards has become critical. The field has developed diverse alignment approaches including traditional fine-tuning methods (RLHF, instruction tuning), post-hoc correction systems, and inference-time interventions, each with distinct advantages and limitations. However, the lack of unified evaluation frameworks makes it difficult to systematically compare these paradigms and guide deployment decisions. This paper introduces a multi-dimensional evaluation of alignment techniques for LLMs, a comprehensive evaluation framework that provides a systematic comparison across all major alignment paradigms. Our framework assesses methods along four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. Through experiments across diverse base models and alignment strategies, we demonstrate the utility of our framework in identifying strengths and limitations of current state-of-the-art models, providing valuable insights for future research directions.
[43] VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models
Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, Furu Wei
Main category: cs.CL
TL;DR: VisCodex integrates vision and coding models to enhance multimodal code generation, supported by the new Multimodal Coding Dataset (MCD) and benchmark InfiBench-V, achieving near-GPT-4o performance.
Details
Motivation: Current MLLMs lack strong multimodal code generation capabilities, limiting their utility in visually-rich programming tasks.
Method: VisCodex merges a coding LLM with a vision-language backbone using task vector-based model merging, preserving both visual and coding skills.
Result: VisCodex achieves state-of-the-art performance among open-source MLLMs, nearing proprietary models like GPT-4o.
Conclusion: The framework and datasets effectively bridge the gap in multimodal code generation, demonstrating the potential of model merging and high-quality data.
Abstract: Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
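Task-vector merging can be sketched in a few lines: the coding model's delta from a shared base is added, scaled, into the matching weights of the vision-language backbone. The scaling factor and the assumption that the checkpoints share parameter names are illustrative.

```python
import torch

def merge_task_vector(vl_state, base_state, coder_state, alpha=0.5):
    """Add alpha * (coder - base) into the vision-language model's weights."""
    merged = {}
    for name, w in vl_state.items():
        if name in base_state and name in coder_state:
            merged[name] = w + alpha * (coder_state[name] - base_state[name])
        else:
            merged[name] = w        # e.g. vision-tower weights: left untouched
    return merged

# Toy state dicts standing in for real checkpoints.
base  = {"lm.weight": torch.zeros(2, 2)}
coder = {"lm.weight": torch.ones(2, 2)}
vl    = {"lm.weight": torch.full((2, 2), 0.5), "vision.weight": torch.ones(2, 2)}
print(merge_task_vector(vl, base, coder)["lm.weight"])   # 0.5 + 0.5 * 1.0 = 1.0
```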
[44] Specialised or Generic? Tokenization Choices for Radiology Language Models
Hermione Warr, Wentian Xu, Harry Anthony, Yasin Ibrahim, Daniel McGowan, Konstantinos Kamnitsas
Main category: cs.CL
TL;DR: Medical and domain-specific tokenizers outperform general ones in radiology report summarization, especially when trained from scratch. Pre-training reduces performance gaps, but domain-specific tokenizers still lead, offering efficiency benefits.
Details
Motivation: To explore the impact of tokenizer vocabularies (general, medical, domain-specific) on radiology report summarization quality, addressing a gap in radiology-focused LM research.
Method: Systematic comparison of tokenizers across three imaging modalities, with and without LM pre-training on PubMed abstracts.
Result: Domain-specific tokenizers achieve the best performance and efficiency, reducing memory needs and sequence lengths. Pre-training mitigates but doesn’t eliminate differences.
Conclusion: Adapting LM vocabularies to clinical domains improves performance and efficiency, enhancing accessibility for healthcare applications.
Abstract: The vocabulary used by language models (LM) - defined by the tokenizer - plays a key role in text generation quality. However, its impact remains under-explored in radiology. In this work, we address this gap by systematically comparing general, medical, and domain-specific tokenizers on the task of radiology report summarisation across three imaging modalities. We also investigate scenarios with and without LM pre-training on PubMed abstracts. Our findings demonstrate that medical and domain-specific vocabularies outperformed widely used natural language alternatives when models are trained from scratch. Pre-training partially mitigates performance differences between tokenizers, whilst the domain-specific tokenizers achieve the most favourable results. Domain-specific tokenizers also reduce memory requirements due to smaller vocabularies and shorter sequences. These results demonstrate that adapting the vocabulary of LMs to the clinical domain provides practical benefits, including improved performance and reduced computational demands, making such models more accessible and effective for both research and real-world healthcare settings.
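The vocabulary effect is easy to observe directly; in this sketch the two Hugging Face tokenizers are examples of general versus biomedical vocabularies, not necessarily the ones the paper compares.

```python
from transformers import AutoTokenizer

text = "No focal consolidation, pleural effusion, or pneumothorax identified."
for name in ("bert-base-uncased", "dmis-lab/biobert-base-cased-v1.1"):
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(text)
    print(f"{name}: {len(pieces)} tokens -> {pieces[:8]} ...")
# Vocabularies that keep clinical terms intact yield shorter sequences,
# and hence lower memory use, in line with the paper's findings.
```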
[45] Shaping Event Backstories to Estimate Potential Emotion Contexts
Johannes Schäfer, Roman Klinger
Main category: cs.CL
TL;DR: The paper introduces a novel approach to emotion analysis by adding contextual narratives to event descriptions, improving annotation reliability.
Details
Motivation: Addressing ambiguity in emotion analysis by considering missing context, which previous work overlooked.
Method: Automatically generating multiple event chains with differing emotions to enrich context, using short story generation techniques.
Result: Contextual narratives improve emotion interpretation and annotation consistency, validated by automatic and human evaluation.
Conclusion: Enriched contexts enhance emotion analysis reliability, offering a systematic approach to contextualized emotion annotation.
Abstract: Emotion analysis is an inherently ambiguous task. Previous work studied annotator properties to explain disagreement, but this overlooks the possibility that ambiguity may stem from missing information about the context of events. In this paper, we propose a novel approach that adds reasonable contexts to event descriptions, which may better explain a particular situation. Our goal is to understand whether these enriched contexts enable human annotators to annotate emotions more reliably. We disambiguate a target event description by automatically generating multiple event chains conditioned on differing emotions. By combining techniques from short story generation in various settings, we achieve coherent narratives that result in a specialized dataset for the first comprehensive and systematic examination of contextualized emotion analysis. Through automatic and human evaluation, we find that contextual narratives enhance the interpretation of specific emotions and support annotators in producing more consistent annotations.
[46] Performance of GPT-5 Frontier Models in Ophthalmology Question Answering
Fares Antaki, David Mikhail, Daniel Milad, Danny A Mammo, Sumit Sharma, Sunil K Srivastava, Bing Yu Chen, Samir Touma, Mertcan Sevgi, Jonathan El-Khoury, Pearse A Keane, Qingyu Chen, Yih Chung Tham, Renaud Duval
Main category: cs.CL
TL;DR: GPT-5-high outperforms other models in accuracy and rationale quality for medical QA tasks, with GPT-5-mini-low offering a cost-effective balance.
Details
Motivation: To evaluate the performance and cost-efficiency of GPT-5 configurations on complex medical question-answering tasks.
Method: Tested 12 GPT-5 configurations on 260 ophthalmology questions, measuring accuracy, rationale quality, and cost.
Result: GPT-5-high achieved the highest accuracy (0.965) and rationale quality, with GPT-5-mini-low being cost-effective.
Conclusion: GPT-5 configurations show promise for medical QA, with trade-offs between accuracy and cost.
Abstract: Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI’s GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high (0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high). Cost-accuracy analysis identified several GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low offering the most favorable low-cost, high-performance balance. These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of LLM-generated answers against reference standards in ophthalmology.
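For reference, the Bradley-Terry head-to-head ranking used as a secondary outcome can be fit with standard minorization-maximization updates; the win counts below are invented for illustration.

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """MM updates for Bradley-Terry strengths; wins[i, j] = times i beat j."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        total = wins.sum(axis=1)                         # total wins per model
        denom = np.array([
            sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                for j in range(n) if j != i)
            for i in range(n)
        ])
        p = total / denom
        p /= p.sum()                                     # fix the scale
    return p

wins = np.array([[0, 7, 9], [3, 0, 6], [1, 4, 0]])       # illustrative counts
strengths = fit_bradley_terry(wins)
print(strengths / strengths.min())   # relative strengths ("x.xx times stronger")
```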
[47] Which one Performs Better? Wav2Vec or Whisper? Applying both in Badini Kurdish Speech to Text (BKSTT)
Renas Adnan, Hossein Hassani
Main category: cs.CL
TL;DR: The paper addresses the lack of speech-to-text (STT) systems for the Badini Kurdish dialect by developing and evaluating language models using Wav2Vec2-Large-XLSR-53 and Whisper-small, with Wav2Vec2 outperforming Whisper in accuracy and readability.
Details
Motivation: To bridge the gap in STT systems for under-resourced Kurdish dialects like Badini, enhancing accessibility and global visibility for its speakers.
Method: Utilized Badini kids’ stories (78 stories from 8 books) as textual input, recorded by six narrators (~17 hours). Preprocessed data (~15 hours, 19193 segments, 25221 words) and developed models using Wav2Vec2-Large-XLSR-53 and Whisper-small.
Result: Wav2Vec2-Large-XLSR-53 outperformed Whisper-small with 90.38% readability and 82.67% accuracy, compared to 65.45% and 53.17%, respectively.
Conclusion: The Wav2Vec2 model is more effective for Badini STT, demonstrating its potential for under-resourced languages.
Abstract: Speech-to-text (STT) systems have a wide range of applications. They are available in many languages, albeit at different quality levels. Although Kurdish is considered a less-resourced language from a processing perspective, STT is available for some Kurdish dialects, for instance, Sorani (Central Kurdish). However, the same does not hold for other Kurdish dialects such as Badini and Hawrami. This research is an attempt to address this gap. Badini has approximately two million speakers, and STT systems can help their community use mobile and computer-based technologies while giving their dialect more global visibility. We aim to create a language model based on Badini speech and evaluate its performance. To cover a conversational register with a proper level of grammatical accuracy and ready transcriptions, we chose Badini kids’ stories, eight books comprising 78 stories, as the textual input. Six narrators narrated the books, which resulted in approximately 17 hours of recording. We cleaned, segmented, and tokenized the input. The preprocessing produced nearly 15 hours of speech, comprising 19193 segments and 25221 words. We used Wav2Vec2-Large-XLSR-53 and Whisper-small to develop the language models. The experiments indicate that the transcription process based on the Wav2Vec2-Large-XLSR-53 model provides significantly more accurate and readable output than the Whisper-small model, with 90.38% versus 65.45% readability, and 82.67% versus 53.17% accuracy, respectively.
[48] Neural Bandit Based Optimal LLM Selection for a Pipeline of Tasks
Baran Atalar, Eddie Zhang, Carlee Joe-Wong
Main category: cs.CL
TL;DR: A neural contextual bandit-based algorithm is proposed to select sequences of LLMs for complex tasks by learning performance dependencies online, outperforming existing LLM selection methods.
Details
Motivation: The increasing use of LLMs for specialized tasks requires strategies to predict successful LLM sequences at low cost, especially as tasks may be too complex for a single LLM.
Method: A neural contextual bandit algorithm trains neural networks to model LLM success on subtasks online, guiding LLM selections dynamically.
Result: Experiments on telecom QA and medical diagnosis datasets show the proposed approach outperforms other LLM selection algorithms.
Conclusion: The method effectively handles complex task dependencies in LLM sequences, improving cost and success rates.
Abstract: With the increasing popularity of large language models (LLMs) for a variety of tasks, there has been a growing interest in strategies that can predict which out of a set of LLMs will yield a successful answer at low cost. This problem promises to become more and more relevant as providers like Microsoft allow users to easily create custom LLM “assistants” specialized to particular types of queries. However, some tasks (i.e., queries) may be too specialized and difficult for a single LLM to handle alone. These applications often benefit from breaking down the task into smaller subtasks, each of which can then be executed by a LLM expected to perform well on that specific subtask. For example, in extracting a diagnosis from medical records, one can first select an LLM to summarize the record, select another to validate the summary, and then select another, possibly different, LLM to extract the diagnosis from the summarized record. Unlike existing LLM selection or routing algorithms, this setting requires that we select a sequence of LLMs, with the output of each LLM feeding into the next and potentially influencing its success. Thus, unlike single LLM selection, the quality of each subtask’s output directly affects the inputs, and hence the cost and success rate, of downstream LLMs, creating complex performance dependencies that must be learned and accounted for during selection. We propose a neural contextual bandit-based algorithm that trains neural networks that model LLM success on each subtask in an online manner, thus learning to guide the LLM selections for the different subtasks, even in the absence of historical LLM performance data. Experiments on telecommunications question answering and medical diagnosis prediction datasets illustrate the effectiveness of our proposed approach compared to other LLM selection algorithms.
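A stripped-down version of the selection idea for a single subtask might look like the following, where the per-arm network, the epsilon-greedy exploration rule, and the toy reward signal are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class ArmModel:
    """Tiny one-hidden-layer reward model for one candidate LLM."""
    def __init__(self, d, h=16, lr=0.05):
        self.W1 = rng.normal(0, 0.3, (d, h))
        self.W2 = rng.normal(0, 0.3, h)
        self.lr = lr
    def predict(self, x):
        self.hid = np.tanh(x @ self.W1)
        return float(self.hid @ self.W2)
    def update(self, x, reward):
        err = self.predict(x) - reward                   # squared-loss gradient
        self.W2 -= self.lr * err * self.hid
        self.W1 -= self.lr * err * np.outer(x, (1 - self.hid ** 2) * self.W2)

def select_llm(arms, context, eps=0.1):
    """Epsilon-greedy choice over predicted subtask success."""
    if rng.random() < eps:
        return int(rng.integers(len(arms)))              # explore
    return int(np.argmax([a.predict(context) for a in arms]))

arms = [ArmModel(d=8) for _ in range(3)]                 # three candidate LLMs
for _ in range(200):                                     # simulated online loop
    ctx = rng.normal(size=8)                             # subtask context features
    k = select_llm(arms, ctx)
    reward = float(ctx[0] * (k == 0) + ctx[1] * (k == 1) > 0)   # toy outcome
    arms[k].update(ctx, reward)
print([round(a.predict(np.ones(8)), 2) for a in arms])
```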
[49] From Stars to Insights: Exploration and Implementation of Unified Sentiment Analysis with Distant Supervision
Wenchang Li, John P. Lalor, Yixing Chen, Vamsi K. Kanuri
Main category: cs.CL
TL;DR: The paper introduces a unified sentiment analysis framework (DSPN) that integrates aspect-category detection, sentiment analysis, and rating prediction, using distant supervision for efficiency.
Details
Motivation: Conventional sentiment analysis methods handle tasks independently, missing interdependencies and requiring costly annotations. A unified approach is needed.
Method: Proposes the Distantly Supervised Pyramid Network (DSPN), a hierarchical model capturing sentiment at word, aspect, and document levels.
Result: DSPN performs comparably to benchmarks using only star ratings for supervision, with added interpretability.
Conclusion: DSPN is an effective, efficient, and interpretable unified framework for sentiment analysis.
Abstract: Sentiment analysis is integral to understanding the voice of the customer and informing businesses’ strategic decisions. Conventional sentiment analysis involves three separate tasks: aspect-category detection, aspect-category sentiment analysis, and rating prediction. However, independently tackling these tasks can overlook their interdependencies and often requires expensive, fine-grained annotations. This paper introduces unified sentiment analysis, a novel learning paradigm that integrates the three aforementioned tasks into a coherent framework. To achieve this, we propose the Distantly Supervised Pyramid Network (DSPN), which employs a pyramid structure to capture sentiment at word, aspect, and document levels in a hierarchical manner. Evaluations on multi-aspect review datasets in English and Chinese show that DSPN, using only star rating labels for supervision, demonstrates significant efficiency advantages while performing comparably well to a variety of benchmark models. Additionally, DSPN’s pyramid structure enables the interpretability of its outputs. Our findings validate DSPN’s effectiveness and efficiency, establishing a robust, resource-efficient, unified framework for sentiment analysis.
[50] Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning Code LLMs
Zichao Hu, Junyi Jessy Li, Arjun Guha, Joydeep Biswas
Main category: cs.CL
TL;DR: ROBO-INSTRUCT dynamically synthesizes task-specific simulation environments to generate valid robot programs, outperforming baselines and matching larger models.
Details
Motivation: Specialized LLMs for robot tasks require task-program datasets, which are costly to collect. Existing methods lack physical-world constraint adherence.
Method: ROBO-INSTRUCT creates simulation environments on the fly, infers entity properties, and enforces constraints. It also refines instructions using LLM post-processing.
Result: Fine-tuned models with ROBO-INSTRUCT outperform baselines and rival larger, proprietary models.
Conclusion: ROBO-INSTRUCT effectively addresses dataset and constraint challenges, enabling efficient robot task programming.
Abstract: Code LLMs have shown promising results with converting tasks in natural language to programs that can be executed by service robots. We are interested in finetuning small, specialized LLMs for this purpose, but collecting datasets of task-program pairs specific to each robot is time-consuming and expensive. While approaches such as SELF-INSTRUCT and EVOL-INSTRUCT are capable of generating novel tasks given a few examples, they are unable to provide the corresponding programs that correctly abide by physical-world and robot constraints using the provided programming interface. Using a simulator is a natural potential solution to checking for such constraints, but building simulation environments that can handle arbitrary tasks and their necessary objects and locations is challenging. To address these challenges, we introduce ROBO-INSTRUCT, which synthesizes task-specific simulation environments on the fly during program execution, by opportunistically inferring entity properties and enforcing corresponding constraints based on how the entities are used in the task program. Additionally, ROBO-INSTRUCT integrates an LLM-aided post-processing procedure to refine instructions for better alignment with robot programs. We demonstrate the effectiveness of ROBO-INSTRUCT across multiple LLMs, showing that our fine-tuned models outperform all baseline methods and even match or surpass the performance of several larger and proprietary models.
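The on-the-fly constraint idea can be caricatured in a few lines: an entity's kind is inferred from its first use, and later uses that contradict it are rejected. The API names here are invented for illustration and are not ROBO-INSTRUCT's actual interface.

```python
class LazySim:
    """Infers entity kinds from usage and rejects contradictory programs."""
    def __init__(self):
        self.kind = {}                      # entity -> inferred kind
    def _bind(self, entity, kind):
        if self.kind.setdefault(entity, kind) != kind:
            raise ValueError(f"{entity!r} used as {kind}, "
                             f"but earlier as {self.kind[entity]}")
    def go_to(self, place):
        self._bind(place, "location")
    def pick_up(self, obj):
        self._bind(obj, "object")

sim = LazySim()
sim.go_to("kitchen")
sim.pick_up("mug")
try:
    sim.pick_up("kitchen")                  # "kitchen" was bound as a location
except ValueError as e:
    print("rejected:", e)
```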
[51] LongIns: A Challenging Long-context Instruction-based Exam for LLMs
Shawn Gavin, Tuney Zheng, Jiaheng Liu, Quehry Que, Noah Wang, Jian Yang, Chenchen Zhang, Wenhao Huang, Ge Zhang
Main category: cs.CL
TL;DR: The paper introduces LongIns, a benchmark to evaluate LLMs’ long-context reasoning, revealing gaps in current benchmarks and LLM performance.
Details
Motivation: Existing benchmarks focus on retrieval, not reasoning, and fail to test claimed context lengths of LLMs.
Method: Proposes LongIns with three settings (GIST, LIST, LIMT) to assess LLMs’ long-context reasoning.
Result: GPT-4 performs poorly at 16k context, and many LLMs struggle with multi-hop reasoning under 4k.
Conclusion: LongIns highlights limitations in LLMs’ long-context reasoning, urging further improvement.
Abstract: The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, because most of these benchmarks focus on identifying key information to answer questions, which mainly requires the retrieval ability of LLMs, they can only partially represent LLMs’ ability to reason over large amounts of information. Meanwhile, although LLMs often claim to have context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual supported length of these LLMs. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs built on existing instruction datasets. Specifically, LongIns introduces three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations of existing LLMs and report the following important findings: (1) The top-performing GPT-4, with a claimed 128k context length, performs poorly at an evaluation context window of 16k in our LongIns. (2) Many existing LLMs still need significant improvement in multi-hop reasoning, even under short context windows (less than 4k).
[52] Improving Multimodal Large Language Models Using Continual Learning
Shikhar Srivastava, Md Yousuf Harun, Robik Shrestha, Christopher Kanan
Main category: cs.CL
TL;DR: The study addresses performance loss in multimodal LLMs (MLLMs) when integrating vision models, proposing a continual learning approach to mitigate linguistic degradation while enhancing visual understanding.
Details
Motivation: Integrating vision models into LLMs often reduces their natural language performance. This study aims to solve this issue by treating it as a continual learning problem.
Method: The study evaluates five continual learning methods using the LLaVA MLLM to minimize linguistic performance loss while improving visual understanding.
Result: The proposed method reduces linguistic degradation by up to 15% compared to LLaVA, maintaining high multimodal accuracy and robustness across tasks.
Conclusion: Continual learning effectively preserves linguistic skills in MLLMs while acquiring new multimodal capabilities, offering a balanced solution.
Abstract: Generative large language models (LLMs) exhibit impressive capabilities, which can be further augmented by integrating a pre-trained vision model into the original LLM to create a multimodal LLM (MLLM). However, this integration often significantly decreases performance on natural language understanding and generation tasks, compared to the original LLM. This study investigates this issue using the LLaVA MLLM, treating the integration as a continual learning problem. We evaluate five continual learning methods to mitigate forgetting and identify a technique that enhances visual understanding while minimizing linguistic performance loss. Our approach reduces linguistic performance degradation by up to 15% over the LLaVA recipe, while maintaining high multimodal accuracy. We also demonstrate the robustness of our method through continual learning on a sequence of vision-language tasks, effectively preserving linguistic skills while acquiring new multimodal capabilities. Project webpage: https://shikhar-srivastava.github.io/cl-for-improving-mllms
[53] Memorization Over Reasoning? Exposing and Mitigating Verbatim Memorization in Large Language Models’ Character Understanding Evaluation
Yuxuan Jiang, Francis Ferraro
Main category: cs.CL
TL;DR: The paper addresses concerns about LLMs memorizing fictional works in character understanding tasks, proposing ‘gist memory’ over ‘verbatim memory’ to reduce memorization-driven performance.
Details
Motivation: To ensure LLMs genuinely understand and reason about fictional characters rather than relying on memorization from pre-training corpora.
Method: Introduces a method to mitigate mechanized memorization while preserving essential cues for comprehension and reasoning.
Result: Reduces memorization-driven accuracy from 96% to 72% and causes up to an 18% drop in accuracy across tasks.
Conclusion: Highlights data contamination in benchmarks, emphasizing the need to measure true character understanding over memorization.
Abstract: Recently, Large Language Models (LLMs) have shown impressive performance in character understanding tasks, such as analyzing the roles, personalities, and relationships of fictional characters. However, the extensive pre-training corpora used by LLMs raise concerns that they may rely on memorizing popular fictional works rather than genuinely understanding and reasoning about them. In this work, we argue that ‘gist memory’ (capturing essential meaning) should be the primary mechanism for character understanding tasks, as opposed to ‘verbatim memory’ (exact match of a string). We introduce a simple yet effective method to mitigate mechanized memorization in character understanding evaluations while preserving the essential implicit cues needed for comprehension and reasoning. Our approach reduces memorization-driven performance on popular fictional works from 96% accuracy to 72% and results in up to an 18% drop in accuracy across various character understanding tasks. These findings underscore the issue of data contamination in existing benchmarks, which often measure memorization rather than true character understanding.
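A hedged sketch of the general perturbation idea: consistently renaming characters defeats verbatim (string-level) recall while keeping the relational gist intact. This illustrates the distinction the paper draws; the authors' actual mitigation procedure may differ.

```python
import re

# Hypothetical illustration: consistently renaming characters breaks
# verbatim recall while preserving the relational "gist" a model needs
# for genuine character understanding.

def anonymize(passage: str, name_map: dict[str, str]) -> str:
    for original, alias in name_map.items():
        # word-boundary match so "Darcy" does not hit "Darcys" etc.
        passage = re.sub(rf"\b{re.escape(original)}\b", alias, passage)
    return passage

passage = "Elizabeth initially despises Darcy, but Darcy slowly earns her respect."
aliases = {"Elizabeth": "Person A", "Darcy": "Person B"}
print(anonymize(passage, aliases))
# The who-feels-what-about-whom structure (the gist) is intact;
# any memorized verbatim continuation of the original text is not.
```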
[54] Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions
Wesley Scivetti, Melissa Torgbi, Austin Blodgett, Mollie Shichman, Taylor Hudson, Claire Bonial, Harish Tayyar Madabushi
Main category: cs.CL
TL;DR: The paper evaluates LLMs’ generalization to out-of-domain language using Construction Grammar (CxG), revealing an over-40% performance drop on syntactically identical but semantically divergent cases.
Details
Motivation: To assess if LLMs can generalize beyond common pretraining data to dynamic, real-world language instances.
Method: Constructed a diagnostic evaluation using CxG, testing models on phrasal constructions with abstract meanings.
Result: State-of-the-art models like GPT-o1 show a performance drop of over 40% on tasks requiring generalization over syntactically identical forms.
Conclusion: LLMs struggle with generalization to rare but intuitive language cases, highlighting a gap in human-like understanding.
Abstract: Web-scale pretraining data has created an important evaluation challenge: to disentangle linguistic competence on cases well-represented in pretraining data from generalization to out-of-domain language, specifically the dynamic, real-world instances less common in pretraining data. To this end, we construct a diagnostic evaluation to systematically assess natural language understanding in LLMs by leveraging Construction Grammar (CxG). CxG provides a psycholinguistically grounded framework for testing generalization, as it explicitly links syntactic forms to abstract, non-lexical meanings. Our novel inference evaluation dataset consists of English phrasal constructions, for which speakers are known to be able to abstract over commonplace instantiations in order to understand and produce creative instantiations. Our evaluation dataset uses CxG to evaluate two central questions: first, whether models can ‘understand’ the semantics of sentences for instances that likely appear less often in pretraining data but are intuitive and easy for people to understand; second, whether LLMs can deploy the appropriate constructional semantics given constructions that are syntactically identical but have divergent meanings. Our results demonstrate that state-of-the-art models, including GPT-o1, exhibit a performance drop of over 40% on our second task, revealing a failure to generalize over syntactically identical forms to arrive at distinct constructional meanings in the way humans do. We make our novel dataset and associated experimental data, including prompts and model responses, publicly available.
[55] Benchmarking LLMs’ Mathematical Reasoning with Unseen Random Variables Questions
Zijin Hong, Hao Wu, Su Dong, Junnan Dong, Yilin Xiao, Yujing Zhang, Zhu Wang, Feiran Huang, Linyi Li, Hongxia Yang, Xiao Huang
Main category: cs.CL
TL;DR: RV-Bench is a new method to evaluate LLMs’ mathematical reasoning by using random variable questions (RVQs) to test genuine understanding and robustness.
Details
Motivation: Address concerns about unreliable math benchmarks by creating a method to assess LLMs' true reasoning capabilities.
Method: Generate RVQs with randomized variables to test LLMs on unseen data, evaluating accuracy and robustness.
Result: LLMs show proficiency imbalance between seen and unseen data, with limited generalization but potential for improvement via test-time scaling.
Conclusion: RV-Bench effectively evaluates LLMs’ mathematical reasoning, highlighting gaps and potential for eliciting better performance.
Abstract: Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable benchmark that effectively evaluates large language models’ (LLMs) genuine capabilities in mathematical reasoning remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. Specifically, we build question-generating functions to produce random variable questions (RVQs), whose background content mirrors original benchmark problems but with randomized variable combinations, rendering them “unseen” to LLMs. Models must fully understand the inherent question pattern to correctly answer RVQs with diverse variable combinations. Thus, an LLM’s genuine reasoning capability is reflected in its accuracy and robustness on RV-Bench. We conducted extensive experiments on over 30 representative LLMs across more than 1,000 RVQs. Our findings suggest that LLMs exhibit a proficiency imbalance between encountered and “unseen” data distributions. Furthermore, RV-Bench reveals that proficiency generalization across similar mathematical reasoning tasks is limited, though we verify that it can still be effectively elicited through test-time scaling.
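The question-generating-function idea is straightforward to make concrete: the reasoning pattern is fixed by a template while the variables are resampled, so each instance is effectively unseen. A toy generator under those assumptions:

```python
import random

# Toy random-variable question (RVQ) generator in the spirit of RV-Bench:
# the question pattern is fixed, the variable combination is randomized,
# and the ground-truth answer is recomputed for each draw.

def make_rvq(seed: int) -> dict:
    rng = random.Random(seed)
    speed = rng.randint(40, 90)    # km/h
    hours = rng.randint(2, 9)
    question = (f"A train travels at {speed} km/h for {hours} hours. "
                f"How many kilometers does it cover?")
    return {"question": question, "answer": speed * hours}

for seed in range(3):
    rvq = make_rvq(seed)
    print(rvq["question"], "->", rvq["answer"])
```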
[56] RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Main category: cs.CL
TL;DR: RocketKV is a training-free KV cache compression strategy for Transformer-based models, achieving high compression and speedup with minimal accuracy loss.
Details
Motivation: The KV cache in large language models grows with input length, straining memory bandwidth and capacity during decoding.
Method: RocketKV uses two stages: coarse-grain permanent KV cache eviction and fine-grain hybrid sparse attention with dimensionality reduction.
Result: Achieves up to 400x compression, 3.7x speedup, and 32.6% memory reduction with negligible accuracy loss.
Conclusion: RocketKV effectively addresses KV cache inefficiency and outperforms existing methods, especially in multi-turn scenarios.
Abstract: Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400×, end-to-end speedup of up to 3.7× as well as peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme. The source code is available here: https://github.com/NVlabs/RocketKV.
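A minimal sketch of the fine-grain stage: approximate attention scores in a reduced key dimension, keep only the top-k keys, then attend exactly over that subset. The random projection and the omission of the coarse eviction stage are simplifying assumptions, not RocketKV's actual design.

```python
import torch

# Sketch of top-k sparse attention with a cheap low-dimensional score
# approximation. RocketKV's actual head/sequence reductions differ; the
# random projection here is just a stand-in.

def topk_sparse_attention(q, K, V, k=32, r=16):
    # q: (d,), K: (n, d), V: (n, d); r = reduced dim for cheap scoring
    d = q.shape[-1]
    proj = torch.randn(d, r) / r ** 0.5          # stand-in reduction
    approx = (K @ proj) @ (q @ proj)             # (n,) approximate scores
    idx = approx.topk(min(k, K.shape[0])).indices
    scores = (K[idx] @ q) / d ** 0.5             # exact scores, top-k only
    weights = torch.softmax(scores, dim=-1)
    return weights @ V[idx]

q = torch.randn(64)
K, V = torch.randn(4096, 64), torch.randn(4096, 64)
out = topk_sparse_attention(q, K, V)
print(out.shape)  # torch.Size([64])
```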
[57] EvoP: Robust LLM Inference via Evolutionary Pruning
Shangyu Wu, Hongchao Du, Ying Xiong, Shuai Chen, Tei-Wei Kuo, Nan Guan, Chun Jason Xue
Main category: cs.CL
TL;DR: EvoP is an evolutionary pruning framework for LLMs that improves efficiency and performance by using a diverse calibration dataset and optimal pruning patterns.
Details
Motivation: LLMs are resource-intensive, and existing pruning methods are heuristic and ignore data characteristics, leading to suboptimal performance.
Method: EvoP uses cluster-based calibration dataset sampling (CCDS) and evolutionary pruning pattern searching (EPPS) to optimize pruning.
Result: EvoP outperforms existing pruning techniques in performance and efficiency across various LLMs and tasks.
Conclusion: EvoP is a practical and scalable solution for deploying LLMs in resource-constrained environments.
Abstract: Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, but their massive size and computational demands hinder their deployment in resource-constrained environments. Existing model pruning methods address this issue by removing redundant structures (e.g., elements, channels, layers) from the model. However, these methods employ heuristic pruning strategies, which lead to suboptimal performance. They also ignore data characteristics when pruning the model. To overcome these limitations, we propose EvoP, an evolutionary pruning framework for robust LLM inference. EvoP first presents a cluster-based calibration dataset sampling (CCDS) strategy for creating a more diverse calibration dataset. EvoP then introduces an evolutionary pruning pattern searching (EPPS) method to find the optimal pruning pattern. Compared to existing model pruning techniques, EvoP achieves the best performance while maintaining the best efficiency. Experiments across different LLMs and downstream tasks validate the effectiveness of the proposed EvoP, making it a practical and scalable solution for deploying LLMs in real-world applications.
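The EPPS idea can be sketched as a small evolutionary loop over layer-pruning masks, where fitness would be measured on a calibration set. The fitness function below is a stub and the operators are generic; EvoP's actual objective and CCDS sampling are not reproduced.

```python
import random

# Minimal evolutionary search over layer-pruning masks. Fitness is a stub;
# in practice it would be (negative) perplexity of the pruned model on a
# calibration set.

def fitness(mask):
    # Stub: pretend pruning later layers hurts less.
    return -sum(w * m for w, m in zip(range(len(mask), 0, -1), mask))

def evolve(n_layers=32, n_pruned=8, pop=20, gens=50, seed=0):
    rng = random.Random(seed)

    def random_mask():
        m = [0] * n_layers
        for i in rng.sample(range(n_layers), n_pruned):
            m[i] = 1  # 1 = prune this layer
        return m

    population = [random_mask() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            # mutation: move one pruned layer somewhere else
            on = rng.choice([i for i, v in enumerate(child) if v])
            off = rng.choice([i for i, v in enumerate(child) if not v])
            child[on], child[off] = 0, 1
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
print("layers to prune:", [i for i, v in enumerate(best) if v])
```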
[58] Efficient Inference for Large Reasoning Models: A Survey
Yue Liu, Jiaying Wu, Yufei He, Ruihan Gong, Jun Xia, Liang Li, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, Bryan Hooi, Stan Z. Li, Keqin Li
Main category: cs.CL
TL;DR: A survey on efficient inference methods for Large Reasoning Models (LRMs) to address token inefficiency while maintaining reasoning quality.
Details
Motivation: LRMs improve reasoning in LLMs but suffer from inefficiencies in token usage, memory, and inference time. This paper reviews methods to mitigate these issues.
Method: Introduces a taxonomy of methods: (a) explicit compact Chain-of-Thought (CoT) and (b) implicit latent CoT, followed by empirical analyses and discussion of strengths/weaknesses.
Result: Identifies open challenges like human-centric reasoning, interpretability-efficiency trade-offs, safety, and broader applications. Highlights techniques like model merging and agent routers.
Conclusion: The paper serves as a guide for researchers, offering insights and resources to enhance LRMs’ inference efficiency.
Abstract: Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason, exhibiting promising performance in solving complex tasks. However, their deliberative reasoning process leads to inefficiencies in token usage, memory consumption, and inference time. Thus, this survey provides a review of efficient inference methods designed specifically for LRMs, focusing on mitigating token inefficiency while preserving reasoning quality. First, we introduce a taxonomy that groups recent methods into two main categories: (a) explicit compact Chain-of-Thought (CoT), which reduces tokens while keeping the explicit reasoning structure, and (b) implicit latent CoT, which encodes reasoning steps within hidden representations instead of explicit tokens, and we discuss their strengths and weaknesses. Then, we conduct empirical analyses of existing methods across reasoning scenarios, objective functions, and performance and efficiency aspects. Besides, we present open challenges in this field, including human-centric controllable reasoning, the trade-off between interpretability and efficiency of reasoning, ensuring the safety of efficient reasoning, and broader applications of efficient reasoning. In addition, we highlight key insights for enhancing LRMs’ inference efficiency via techniques such as model merging, new architectures, and agent routers. We hope this work serves as a valuable guide, helping researchers overcome challenges in this vibrant field. A collection of efficient reasoning methods for LRMs (papers and codes) is provided at this link: https://github.com/yueliu1999/Awesome-Efficient-Inference-for-LRMs.
[59] Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models
Guy Kaplan, Michael Toker, Yuval Reif, Yonatan Belinkov, Roy Schwartz
Main category: cs.CL
TL;DR: The paper analyzes how semantic information is distributed in token representations during the textual encoding stage of text-to-image models, revealing surprising patterns of information concentration and isolation.
Details
Motivation: Prior work focused on refining the diffusion process for better alignment, but this study investigates the textual encoding stage to understand how semantic information is distributed across tokens.
Method: The authors analyze information flow at two levels: in-item representation (token-level encoding within lexical items) and cross-item interaction (information flow between tokens of different lexical items). Patching techniques are used to uncover encoding patterns.
Result: Findings show information is often concentrated in one or two tokens of a lexical item (e.g., ‘Gate’ in ‘Golden Gate Bridge’), while cross-item interactions are rare but can lead to misinterpretations (e.g., ‘pool’ representing a pool table).
Conclusion: Token-level encoding plays a critical role in image generation, and misalignment issues may stem from the textual encoding stage rather than the diffusion process.
Abstract: Text-to-image (T2I) models generate images by encoding text prompts into token representations, which then guide the diffusion process. While prior work has largely focused on improving alignment by refining the diffusion process, we focus on the textual encoding stage. Specifically, we investigate how semantic information is distributed across token representations within and between lexical items (i.e., words or expressions conveying a single concept) in the prompt. We analyze information flow at two levels: (1) in-item representation: whether individual tokens represent their lexical item, and (2) cross-item interaction: whether information flows across the tokens of different lexical items. We use patching techniques to uncover surprising encoding patterns. We find information is usually concentrated in only one or two of the item’s tokens. For example, in the item “San Francisco’s Golden Gate Bridge”, the token “Gate” sufficiently captures the entire expression while the other tokens could effectively be discarded. Lexical items also tend to remain isolated; for instance, the token “dog” encodes no visual information about “green” in the prompt “a green dog”. However, in some cases, items do influence each other’s representation, often leading to misinterpretations; e.g., in the prompt “a pool by a table”, the token “pool” represents a pool table after contextualization. Our findings highlight the critical role of token-level encoding in image generation, suggesting that misalignment issues may originate already during the textual encoding.
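The mechanics of patching are easy to illustrate: swap one token's representation from a source context into a target context and measure how much the encoding moves. The toy embedding table below stands in for a real T2I text encoder, so this is a sketch of the technique rather than the paper's setup.

```python
import torch

# Toy patching experiment: replace one token's vector and measure how far
# the pooled encoding moves. Real analyses patch inside a T2I text encoder.

torch.manual_seed(0)
vocab = {"a": 0, "green": 1, "dog": 2, "pool": 3, "table": 4, "by": 5}
emb = torch.nn.Embedding(len(vocab), 16)

def encode(tokens, patch=None):
    ids = torch.tensor([vocab[t] for t in tokens])
    x = emb(ids)
    if patch is not None:           # patch = (position, replacement vector)
        pos, vec = patch
        x = x.clone()
        x[pos] = vec
    return x.mean(dim=0)            # stand-in for contextualized pooling

base = encode(["a", "green", "dog"])
# patch position 2 ("dog") with the representation of "pool"
patched = encode(["a", "green", "dog"],
                 patch=(2, emb(torch.tensor(vocab["pool"]))))
print(torch.dist(base, patched))   # how far the encoding moved
```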
[60] CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
Weiwei Sun, Shengyu Feng, Shanda Li, Yiming Yang
Main category: cs.CL
TL;DR: The paper introduces CO-Bench, a benchmark suite for evaluating LLM-based agents in combinatorial optimization (CO), addressing the lack of systematic benchmarks in this area.
Details
Motivation: The role of LLM-based agents in combinatorial optimization is underexplored, and there's a need for benchmarks to study their potential in constraint-intensive problems.
Method: CO-Bench is introduced, featuring 36 real-world CO problems with structured formulations and curated data. Multiple agentic frameworks are evaluated against human-designed algorithms.
Result: The evaluation reveals the strengths and limitations of existing LLM agents in CO, highlighting areas for future research.
Conclusion: CO-Bench provides a foundation for systematic investigation of LLM agents in CO, with the suite publicly available for further research.
Abstract: Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint-intensive problems – a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO-Bench, a benchmark suite featuring 36 real-world CO problems drawn from a broad range of domains and complexity levels. CO-Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agentic frameworks against established human-designed algorithms, revealing the strengths and limitations of existing LLM agents and identifying promising directions for future research. CO-Bench is publicly available at https://github.com/sunnweiwei/CO-Bench.
[61] AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
Tuhin Chakrabarty, Philippe Laban, Chien-Sheng Wu
Main category: cs.CL
TL;DR: The paper introduces a Writing Quality Benchmark (WQ) and specialized Writing Quality Reward Models (WQRM) to evaluate and improve AI-generated text quality, showing strong generalization and human preference alignment.
Details
Motivation: Assessing and improving AI-generated text quality is challenging due to its subjective nature and lack of community focus.
Method: Consolidated five datasets into 4,729 judgments for WQ, trained WQRM models, and used test-time compute for candidate revisions.
Result: WQRM achieved 74% accuracy on WQ and 66-72.2% human preference alignment in expert evaluations.
Conclusion: The work advances AI writing quality assessment and alignment with human preferences, releasing datasets and models for community use.
Abstract: AI-generated text is proliferating across domains, from creative writing and journalism to marketing content and scientific articles. Models can follow user-provided instructions to generate coherent and grammatically correct outputs but in this work, we study a more fundamental question: how do we evaluate and improve the writing quality of AI-generated text? Writing quality assessment has received less attention from the community, in part because it is fundamentally subjective and requires expertise. We first introduce the Writing Quality Benchmark (WQ) by consolidating five writing-preference datasets into 4,729 writing quality judgments. Our experiments show that most of the competitive baselines, including state-of-the-art LLMs that excel at reasoning tasks, barely outperform random baselines on WQ. We then train specialized Writing Quality Reward Models (WQRM) of various sizes for writing quality assessment that demonstrate strong generalization on four out-of-distribution test sets and 74% accuracy on the WQ benchmark. To further show WQRM’s practical benefits during inference, we leverage additional test-time compute to generate and rank multiple candidate revisions, allowing us to select higher-quality outputs from an initial draft. Human evaluation with 9 experienced writers confirms that WQRM-based selection produces writing samples preferred by experts 66% of the time overall, and 72.2% of the time when the reward gap is larger than 1 point. We release our datasets and models to encourage community engagement with writing quality assessment and the development of AI writing systems better aligned with human preferences.
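The test-time-compute loop reduces to best-of-n selection: sample several candidate revisions, score each with the reward model, keep the highest-scoring one. In the sketch below, generate and score are placeholders for an LLM and a WQRM-style model, not the released interfaces.

```python
import random

# Best-of-n selection with a reward model: both functions below are
# stand-ins, not the paper's actual LLM or WQRM interfaces.

def generate(draft: str, n: int) -> list[str]:
    # stand-in: pretend we sampled n candidate revisions of the draft
    return [f"{draft} (revision {i})" for i in range(n)]

def score(text: str) -> float:
    # stand-in for a learned writing-quality reward model
    return random.random()

def best_of_n(draft: str, n: int = 8) -> str:
    candidates = generate(draft, n)
    return max(candidates, key=score)

random.seed(0)
print(best_of_n("The sunset was very beautiful and nice."))
```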
[62] IP-CRR: Information Pursuit for Interpretable Classification of Chest Radiology Reports
Yuyan Ge, Kwan Ho Ryan Chan, Pablo Messina, René Vidal
Main category: cs.CL
TL;DR: An interpretable AI framework for classifying chest radiology reports, using representative facts and query-answer pairs for transparent predictions.
Details
Motivation: To address the lack of interpretability in AI-based medical diagnosis, which hinders clinical adoption.
Method: Extracts representative facts, queries entailment, and predicts diagnoses using Information Pursuit, a natural language inference model, and a classifier.
Result: Effective on the MIMIC-CXR dataset, enhancing trust and usability in medical AI.
Conclusion: The framework improves interpretability and potential clinical adoption of AI in radiology.
Abstract: The development of AI-based methods to analyze radiology reports could lead to significant advances in medical diagnosis, from improving diagnostic accuracy to enhancing efficiency and reducing workload. However, the lack of interpretability of AI-based methods could hinder their adoption in clinical settings. In this paper, we propose an interpretable-by-design framework for classifying chest radiology reports. First, we extract a set of representative facts from a large set of reports. Then, given a new report, we query whether a small subset of the representative facts is entailed by the report, and predict a diagnosis based on the selected subset of query-answer pairs. The explanation for a prediction is, by construction, the set of selected queries and answers. We use the Information Pursuit framework to select the most informative queries, a natural language inference model to determine if a fact is entailed by the report, and a classifier to predict the disease. Experiments on the MIMIC-CXR dataset demonstrate the effectiveness of the proposed method, highlighting its potential to enhance trust and usability in medical AI.
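At its core, Information Pursuit is a greedy loop: ask the query with the largest expected information gain, update the posterior over diagnoses, and stop when uncertainty is low; the selected query-answer pairs are the explanation. The toy probabilities below are invented for illustration; the paper answers queries with an NLI model rather than a lookup table.

```python
import math

# Greedy information pursuit over a toy 2-disease, 3-fact world.
# P(fact entailed by report | disease); numbers are illustrative only.
LIKELIHOOD = {
    "opacity":  {"pneumonia": 0.9,  "healthy": 0.1},
    "effusion": {"pneumonia": 0.6,  "healthy": 0.05},
    "fracture": {"pneumonia": 0.05, "healthy": 0.04},
}

def entropy(p):
    return -sum(v * math.log(v + 1e-12) for v in p.values())

def posterior(prior, fact, answer):
    post = {d: p * (LIKELIHOOD[fact][d] if answer else 1 - LIKELIHOOD[fact][d])
            for d, p in prior.items()}
    z = sum(post.values())
    return {d: v / z for d, v in post.items()}

def expected_entropy(prior, fact):
    p_yes = sum(prior[d] * LIKELIHOOD[fact][d] for d in prior)
    return (p_yes * entropy(posterior(prior, fact, True))
            + (1 - p_yes) * entropy(posterior(prior, fact, False)))

# Answers the NLI model would give for one particular report (stubbed).
report_answers = {"opacity": True, "effusion": True, "fracture": False}
belief = {"pneumonia": 0.5, "healthy": 0.5}
asked = []
while entropy(belief) > 0.2 and len(asked) < len(LIKELIHOOD):
    fact = min((f for f in LIKELIHOOD if f not in asked),
               key=lambda f: expected_entropy(belief, f))
    belief = posterior(belief, fact, report_answers[fact])
    asked.append(fact)
print(asked, belief)  # the asked queries are, by construction, the explanation
```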
[63] Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal
Main category: cs.CL
TL;DR: The paper analyzes harmful content in datasets used for LLM pretraining, introduces tools for filtering toxic content, and provides benchmarks for safer AI development.
Details
Motivation: To address the risks of training LLMs on unfiltered datasets containing harmful content like hate speech and misinformation, which can perpetuate toxicity and bias.
Method: Conducts a large-scale analysis of inappropriate content, introduces a taxonomy (Topical and Toxic), develops a prompt evaluation dataset (TTP), a transformer-based model (HarmFormer), and a toxicity benchmark (HAVOC).
Result: Provides tools (TTP, HarmFormer, HAVOC) and insights for filtering harmful content and evaluating model responses to toxic inputs, aiding safer LLM pretraining.
Conclusion: The work contributes to Responsible AI by offering resources for compliance and safer LLM development.
Abstract: Large language models (LLMs) have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. While these datasets provide linguistic data essential for high-quality natural language generation, they often contain harmful content, such as hate speech, misinformation, and biased narratives. Training LLMs on such unfiltered data risks perpetuating toxic behaviors, spreading misinformation, and amplifying societal biases which can undermine trust in LLM-driven applications and raise ethical concerns about their use. This paper presents a large-scale analysis of inappropriate content across these datasets, offering a comprehensive taxonomy that categorizes harmful webpages into Topical and Toxic based on their intent. We also introduce a high-accuracy prompt evaluation dataset, Topical and Toxic Prompt (TTP), and a transformer-based model (HarmFormer) for harmful content filtering. Additionally, we create a new multi-harm open-ended toxicity benchmark (HAVOC) and provide crucial insights into how models respond to adversarial toxic inputs. We share TTP, TTP-Eval, HAVOC, and a sample of C4 annotated by HarmFormer. Our work offers insights into ensuring safer LLM pretraining and serves as a resource for Responsible AI (RAI) compliance.
[64] Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning
Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Main category: cs.CL
TL;DR: CANOE is a framework to reduce faithfulness hallucinations in LLMs using synthesized QA data and Dual-GRPO, a rule-based RL method, outperforming advanced models like GPT-4o.
Details
Motivation: Ensuring LLMs are faithful to context is critical for reliable information-seeking systems.
Method: Uses synthesized short-form QA data and Dual-GRPO, a rule-based RL method with tailored rewards, to optimize response generation.
Result: Improves faithfulness across 11 tasks, surpassing models like GPT-4o.
Conclusion: CANOE effectively reduces hallucinations in LLMs without human annotations, enhancing reliability.
Abstract: Teaching large language models (LLMs) to be faithful to the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to reduce faithfulness hallucinations of LLMs across different downstream tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data covering four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from the synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
[65] LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Complex Reasoning
Tao Liu, Xutao Mao, Hongying Zan, Dixuan Zhang, Yifan Li, Haixin Liu, Lulu Kong, Jiaming Hou, Rui Li, YunLong Li, Aoze Zheng, Zhiqiang Zhang, Luo Zhewei, Kunli Zhang, Min Peng
Main category: cs.CL
TL;DR: LogicCat is a new Text-to-SQL benchmark dataset focusing on complex reasoning, including physics, arithmetic, commonsense, and hypothetical scenarios, surpassing existing datasets in complexity and difficulty.
Details
Motivation: Existing Text-to-SQL datasets lack coverage of complex reasoning like domain knowledge, math, and hypothetical scenarios, limiting real-world applicability.
Method: Introduces LogicCat, a dataset with 4,038 questions and 12,114 reasoning steps across 45 databases, designed for chain-of-thought parsing.
Result: State-of-the-art models achieve only 33.20% execution accuracy on LogicCat, highlighting its challenge.
Conclusion: LogicCat advances Text-to-SQL for real-world enterprise data analysis and autonomous query generation.
Abstract: Text-to-SQL is a critical task in natural language processing that aims to transform natural language questions into accurate and executable SQL queries. In real-world scenarios, these reasoning tasks are often accompanied by complex mathematical computations, domain knowledge, and hypothetical reasoning scenarios. However, existing large-scale Text-to-SQL datasets typically focus on business logic and task logic, neglecting critical factors such as vertical domain knowledge, complex mathematical reasoning, and hypothetical reasoning, which are essential for realistically reflecting the reasoning demands of practical data querying and analysis. To bridge this gap, we introduce LogicCat, the first Text-to-SQL benchmark dataset specifically designed for complex reasoning and chain-of-thought parsing, encompassing physics, arithmetic, commonsense, and hypothetical reasoning scenarios. LogicCat comprises 4,038 English questions paired with 12,114 detailed chain-of-thought reasoning steps, spanning 45 databases across diverse domains, significantly surpassing existing datasets in complexity. Experimental results demonstrate that LogicCat is substantially more difficult for current state-of-the-art models, which achieve at most 33.20% execution accuracy, indicating that this task remains exceptionally challenging. The advancement of LogicCat represents a crucial step toward developing systems suitable for real-world enterprise data analysis and autonomous query generation. We have released our dataset code at https://github.com/Ffunkytao/LogicCat.
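Execution accuracy, the metric quoted above, compares result sets rather than SQL strings: a prediction counts as correct if it returns the same rows as the gold query on the same database. A self-contained check with an illustrative schema:

```python
import sqlite3

# Execution-accuracy check: two syntactically different queries match if
# they return the same rows. Schema and queries are illustrative.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE runs (id INTEGER, distance_km REAL, hours REAL);
    INSERT INTO runs VALUES (1, 12.0, 1.0), (2, 30.0, 2.5);
""")

def execution_match(pred_sql: str, gold_sql: str) -> bool:
    pred = conn.execute(pred_sql).fetchall()
    gold = conn.execute(gold_sql).fetchall()
    return sorted(pred) == sorted(gold)  # order-insensitive comparison

gold = "SELECT id FROM runs WHERE distance_km / hours > 11"
pred = "SELECT id FROM runs WHERE distance_km > 11 * hours"
print(execution_match(pred, gold))  # True: different SQL, same result
```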
[66] MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents
Yiming Du, Bingbing Wang, Yang He, Bin Liang, Baojun Wang, Zhongyang Li, Lin Gui, Jeff Z. Pan, Ruifeng Xu, Kam-Fai Wong
Main category: cs.CL
TL;DR: MemGuide, a two-stage framework for intent-driven memory selection in task-oriented dialogues, improves task success by 11% and reduces dialogue length by 2.84 turns.
Details
Motivation: Current TOD systems rely on semantic similarity, neglecting task intent and reducing coherence in multi-session dialogues.
Method: MemGuide uses Intent-Aligned Retrieval and Missing-Slot Guided Filtering to select memory units, enhancing task coherence.
Result: MemGuide achieves a 99% task success rate and reduces dialogue length by 2.84 turns in multi-session settings.
Conclusion: MemGuide effectively addresses intent-driven memory selection, improving multi-session TOD performance.
Abstract: Modern task-oriented dialogue (TOD) systems increasingly rely on large language model (LLM) agents, leveraging Retrieval-Augmented Generation (RAG) and long-context capabilities for long-term memory utilization. However, these methods are primarily based on semantic similarity, overlooking task intent and reducing task coherence in multi-session dialogues. To address this challenge, we introduce MemGuide, a two-stage framework for intent-driven memory selection. (1) Intent-Aligned Retrieval matches the current dialogue context with stored intent descriptions in the memory bank, retrieving QA-formatted memory units that share the same goal. (2) Missing-Slot Guided Filtering employs a chain-of-thought slot reasoner to enumerate unfilled slots, then uses a fine-tuned LLaMA-8B filter to re-rank the retrieved units by marginal slot-completion gain. The resulting memory units inform a proactive strategy that minimizes conversational turns by directly addressing information gaps. Based on this framework, we introduce the MS-TOD, the first multi-session TOD benchmark comprising 132 diverse personas, 956 task goals, and annotated intent-aligned memory targets, supporting efficient multi-session task completion. Evaluations on MS-TOD show that MemGuide raises the task success rate by 11% (88% -> 99%) and reduces dialogue length by 2.84 turns in multi-session settings, while maintaining parity with single-session benchmarks.
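A schematic of the two stages, with similarity and slot extraction stubbed out (MemGuide uses a retriever and a fine-tuned LLaMA-8B filter for these): retrieve memory units whose stored intent matches the current one, then re-rank them by how many still-missing slots each would fill.

```python
# Two-stage memory selection sketch; exact intent matching and a slot-count
# gain stand in for MemGuide's learned retrieval and filtering.

MEMORY = [
    {"intent": "book_flight", "qa": "Prefers aisle seats", "slots": {"seat": "aisle"}},
    {"intent": "book_flight", "qa": "Flies from Boston",   "slots": {"origin": "Boston"}},
    {"intent": "order_food",  "qa": "Allergic to peanuts", "slots": {"allergy": "peanuts"}},
]

def memguide_select(current_intent, filled_slots, required_slots, top_k=2):
    # Stage 1: intent-aligned retrieval (exact match stands in for similarity)
    pool = [m for m in MEMORY if m["intent"] == current_intent]
    # Stage 2: re-rank by marginal slot-completion gain
    missing = set(required_slots) - set(filled_slots)
    gain = lambda m: len(set(m["slots"]) & missing)
    return sorted(pool, key=gain, reverse=True)[:top_k]

selected = memguide_select("book_flight",
                           filled_slots={"destination": "Tokyo"},
                           required_slots=["origin", "destination", "seat"])
print([m["qa"] for m in selected])  # units that fill the remaining slots
```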
[67] Exploring Scaling Laws for EHR Foundation Models
Sheng Zhang, Qin Liu, Naoto Usuyama, Cliff Wong, Tristan Naumann, Hoifung Poon
Main category: cs.CL
TL;DR: Scaling laws, proven effective for large language models, are now explored for electronic health records (EHRs), revealing similar patterns and potential for clinical utility.
Details
Motivation: To investigate if scaling laws, which predict performance gains in large language models, apply to EHRs, a structurally different but rich data source.
Method: Training transformer architectures on MIMIC-IV patient timeline data, varying model sizes and compute budgets to analyze scaling patterns.
Result: EHR models show consistent scaling behavior (parabolic IsoFLOPs curves, power-law relationships) akin to LLMs, enabling resource-efficient training strategies.
Conclusion: Scaling laws apply to EHRs, paving the way for powerful foundation models to enhance clinical predictions and personalized healthcare.
Abstract: The emergence of scaling laws has profoundly shaped the development of large language models (LLMs), enabling predictable performance gains through systematic increases in model size, dataset volume, and compute. Yet, these principles remain largely unexplored in the context of electronic health records (EHRs) – a rich, sequential, and globally abundant data source that differs structurally from natural language. In this work, we present the first empirical investigation of scaling laws for EHR foundation models. By training transformer architectures on patient timeline data from the MIMIC-IV database across varying model sizes and compute budgets, we identify consistent scaling patterns, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. These findings demonstrate that EHR models exhibit scaling behavior analogous to LLMs, offering predictive insights into resource-efficient training strategies. Our results lay the groundwork for developing powerful EHR foundation models capable of transforming clinical prediction tasks and advancing personalized healthcare.
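The power-law relationships reported here are straight lines in log-log space, so the exponent can be recovered with an ordinary linear fit. A sketch on synthetic numbers (the paper fits real MIMIC-IV training runs, and its full functional form may include an irreducible-loss term):

```python
import numpy as np

# Fit loss(N) ≈ a * N^(-alpha) by linear regression in log-log space.
# Synthetic data stand in for real training runs.

params = np.array([1e6, 1e7, 1e8, 1e9])  # model sizes N
loss = 8.0 * params ** -0.12 + np.random.default_rng(0).normal(0, 0.005, 4)

slope, intercept = np.polyfit(np.log(params), np.log(loss), deg=1)
alpha, a = -slope, np.exp(intercept)
print(f"loss ~ {a:.2f} * N^(-{alpha:.3f})")  # recovers a~8, alpha~0.12
```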
[68] Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques
Lang Xiong, Raina Gao, Alyssa Jeong, Yicheng Fu, Sean O’Brien, Vasu Sharma, Kevin Zhu
Main category: cs.CL
TL;DR: The paper introduces Sarc7, a benchmark for classifying 7 types of sarcasm, and proposes an emotion-based prompting method for sarcasm classification and generation, outperforming other techniques.
Details
Motivation: Sarcasm is nuanced and challenging for computational models, making classification and generation vital for understanding human communication.
Method: The study uses the MUStARD dataset, evaluating classification via zero-shot, few-shot, chain-of-thought, and a novel emotion-based prompting technique. It also proposes an emotion-based generation method focusing on incongruity, shock value, and context dependency.
Result: Gemini 2.5 with emotion-based prompting achieved the highest F1 score (0.3664). Human evaluators preferred emotion-based generations, with 38.46% more success than zero-shot.
Conclusion: Emotion-based prompting is effective for sarcasm classification and generation, outperforming traditional methods and improving human evaluation outcomes.
Abstract: Sarcasm is a form of humor where expressions convey meanings opposite to their literal interpretations. Classifying and generating sarcasm using large language models is vital for interpreting human communication. Sarcasm poses challenges for computational models, due to its nuanced nature. We introduce Sarc7, a benchmark that classifies 7 types of sarcasm: self-deprecating, brooding, deadpan, polite, obnoxious, raging, and manic by annotating entries of the MUStARD dataset. Classification was evaluated using zero-shot, few-shot, chain-of-thought (CoT), and a novel emotion-based prompting technique. We propose an emotion-based generation method developed by identifying key components of sarcasm: incongruity, shock value, and context dependency. Our classification experiments show that Gemini 2.5, using emotion-based prompting, outperforms other setups with an F1 score of 0.3664. Human evaluators preferred our emotion-based prompting, with 38.46% more successful generations than zero-shot prompting.
[69] DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments
Chiyu Zhang, Marc-Alexandre Cote, Michael Albada, Anush Sankaran, Jack W. Stokes, Tong Wang, Amir Abdi, William Blum, Muhammad Abdul-Mageed
Main category: cs.CL
TL;DR: DefenderBench is an open-source toolkit for evaluating LLM agents in cybersecurity tasks, benchmarking models like Claude-3.7-sonnet and Llama 3.3 70B.
Details
Motivation: The potential of LLM agents in cybersecurity is underexplored, prompting the need for a practical evaluation framework.
Method: DefenderBench includes tasks like network intrusion detection and vulnerability analysis, using a standardized agentic framework to benchmark LLMs.
Result: Claude-3.7-sonnet scored highest (81.65), followed by Claude-3.7-sonnet-think (78.40), with Llama 3.3 70B at 71.81.
Conclusion: DefenderBench provides an affordable, modular tool for fair LLM evaluation in cybersecurity, promoting reproducibility.
Abstract: Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench’s modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.
[70] ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark
Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng
Main category: cs.CL
TL;DR: The paper introduces a professionally annotated benchmark for Chinese harmful content detection, addressing the scarcity of Chinese datasets. It includes a knowledge rule base and proposes a knowledge-augmented baseline for improved performance.
Details
Motivation: Existing harmful content detection resources are mainly English-focused, with limited Chinese datasets. The paper aims to fill this gap by providing a comprehensive Chinese benchmark.
Method: A professionally annotated benchmark is created, covering six categories of harmful content. A knowledge rule base is developed, and a knowledge-augmented baseline integrates human-annotated rules and LLM knowledge.
Result: The proposed method enables smaller models to achieve performance comparable to state-of-the-art LLMs in Chinese harmful content detection.
Conclusion: The paper provides a valuable resource for Chinese harmful content detection and demonstrates the effectiveness of knowledge-augmented approaches.
Abstract: Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.
[71] Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs
Michal Podstawski
Main category: cs.CL
TL;DR: The paper explores using pretrained text embedding models to enhance semantic analysis in labeled property graphs, improving tasks like node classification and relation prediction without altering the graph structure.
Details
Motivation: To leverage rich textual attributes in labeled property graphs for better analytical tasks by incorporating semantic understanding.
Method: Integrates pretrained text embedding models to embed textual node and edge properties, maintaining the original graph structure.
Result: Demonstrates that textual semantics enhance accuracy and interpretability in property graph analysis.
Conclusion: Textual semantics from embeddings significantly improve property graph analysis without structural changes.
Abstract: Labeled property graphs often contain rich textual attributes that can enhance analytical tasks when properly leveraged. This work explores the use of pretrained text embedding models to enable efficient semantic analysis in such graphs. By embedding textual node and edge properties, we support downstream tasks including node classification and relation prediction with improved contextual understanding. Our approach integrates language model embeddings into the graph pipeline without altering its structure, demonstrating that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis.
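A minimal version of the pipeline, assuming the sentence-transformers and scikit-learn libraries: embed each node's textual properties with a pretrained encoder and train a downstream classifier on the vectors, leaving the graph structure untouched. The model name and toy labels are illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Embed textual node properties, then train a node classifier on the
# vectors; the graph itself is never modified.

nodes = [
    {"text": "Invoice for cloud hosting services, net 30 days", "label": "document"},
    {"text": "ACME Corp, registered in Delaware",               "label": "company"},
    {"text": "Purchase order for 200 laptops",                  "label": "document"},
    {"text": "Globex Inc, industrial manufacturer",             "label": "company"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
X = encoder.encode([n["text"] for n in nodes])     # (n_nodes, dim) vectors
y = [n["label"] for n in nodes]

clf = LogisticRegression(max_iter=1000).fit(X, y)  # node-classification head
print(clf.predict(encoder.encode(["Quarterly earnings report PDF"]))[0])
```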
[72] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains
Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
Main category: cs.CL
TL;DR: The paper introduces CSEDB, a framework to evaluate LLMs in clinical settings, revealing moderate performance and safety gaps, especially in high-risk scenarios.
Details
Motivation: To address the lack of standardized evaluation for LLMs in clinical decision support, ensuring safety and effectiveness.
Method: Developed CSEDB with expert consensus, tested six LLMs using 2,069 Q&A items across 26 departments.
Result: LLMs showed moderate performance (57.2% avg), with a 13.3% drop in high-risk scenarios. Domain-specific models outperformed general ones.
Conclusion: CSEDB provides a standardized metric for LLM evaluation, aiding safer and more effective deployment in healthcare.
Abstract: Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.
[73] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
Hongze Tan, Jianfei Pan
Main category: cs.CL
TL;DR: The paper introduces Dynamic Entropy Weighting to improve RL-based LLM reasoning by addressing coarse-grained credit assignment. It proposes GTPO and GRPO-S for fine-grained rewards, outperforming DAPO.
Details
Motivation: Coarse-grained credit assignment in RL limits LLM reasoning, especially in long-chain tasks.
Method: Dynamic Entropy Weighting with GTPO (token-level) and GRPO-S (sequence-level) for fine-grained rewards.
Result: Outperforms DAPO baseline, showing entropy-weighting boosts reasoning performance.
Conclusion: Entropy-weighting enhances deep reasoning, offering a better path for policy updates.
Abstract: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper solves this with Dynamic Entropy Weighting. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) Group Token Policy Optimization (GTPO) assigns an entropy-weighted reward to each token for fine-grained credit assignment. 2) Sequence-Level Group Relative Policy Optimization (GRPO-S) assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
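The token-level variant can be sketched in a few lines: compute the policy's entropy at each position and use the normalized entropies to apportion the sequence reward across tokens. The exact normalization in GTPO may differ; this is a sketch of the mechanism, not the paper's implementation.

```python
import torch

# Entropy-weighted token rewards: uncertain ("forking") tokens in a
# correct response receive a larger share of the sequence reward.

def entropy_weighted_rewards(logits, seq_reward):
    # logits: (T, vocab) pre-softmax policy outputs for one response
    logp = torch.log_softmax(logits, dim=-1)
    token_entropy = -(logp.exp() * logp).sum(dim=-1)   # (T,)
    weights = token_entropy / token_entropy.sum()      # normalize over T
    # rescale so the mean token reward equals the sequence reward
    return seq_reward * weights * logits.shape[0]

logits = torch.randn(5, 100)  # 5 tokens, vocabulary of 100
print(entropy_weighted_rewards(logits, seq_reward=1.0))
```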
[74] Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM
Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu
Main category: cs.CL
TL;DR: The paper reviews the dual role of LLMs in beneficial applications and harmful content generation, proposing a taxonomy of harms and defenses, and assessing mitigation techniques like RLHF and prompt engineering.
Details
Motivation: The dual role of LLMs as powerful tools and potential sources of harmful content presents a sociotechnical challenge, necessitating a systematic review of harms and defenses.
Method: The paper systematically reviews studies on LLM-related harms (unintentional toxicity, adversarial attacks) and defenses (RLHF, prompt engineering, safety alignment), proposing a unified taxonomy.
Result: The synthesis highlights the evolving landscape of LLM safety, identifies limitations in current evaluation methods, and outlines future research directions.
Conclusion: The paper emphasizes the need for robust, ethically aligned language technologies and provides guidance for future research in LLM safety.
Abstract: Large Language Models (LLMs) have revolutionized content creation across digital platforms, offering unprecedented capabilities in natural language generation and understanding. These models enable beneficial applications such as content generation, question and answering (Q&A), programming, and code reasoning. Meanwhile, they also pose serious risks by inadvertently or intentionally producing toxic, offensive, or biased content. This dual role of LLMs, both as powerful tools for solving real-world problems and as potential sources of harmful language, presents a pressing sociotechnical challenge. In this survey, we systematically review recent studies spanning unintentional toxicity, adversarial jailbreaking attacks, and content moderation techniques. We propose a unified taxonomy of LLM-related harms and defenses, analyze emerging multimodal and LLM-assisted jailbreak strategies, and assess mitigation efforts, including reinforcement learning with human feedback (RLHF), prompt engineering, and safety alignment. Our synthesis highlights the evolving landscape of LLM safety, identifies limitations in current evaluation methodologies, and outlines future research directions to guide the development of robust and ethically aligned language technologies.
[75] InfoCausalQA:Can Models Perform Non-explicit Causal Reasoning Based on Infographic?
Keummin Ka, Junhyeong Park, Jaehyun Jeon, Youngjae Yu
Main category: cs.CL
TL;DR: InfoCausalQA is a new benchmark for evaluating causal reasoning in VLMs using infographics, revealing their limitations compared to humans.
Details
Motivation: To address the underexplored area of causal inference in VLMs, particularly in multimodal settings.
Method: Developed InfoCausalQA with two tasks (quantitative and semantic causal reasoning) using 494 infographic-text pairs and GPT-4o-generated QA pairs.
Result: Current VLMs show limited computational and semantic causal reasoning abilities, lagging behind humans.
Conclusion: InfoCausalQA underscores the need to improve causal reasoning in multimodal AI systems.
Abstract: Recent advances in Vision-Language Models (VLMs) have demonstrated impressive capabilities in perception and reasoning. However, the ability to perform causal inference – a core aspect of human cognition – remains underexplored, particularly in multimodal settings. In this study, we introduce InfoCausalQA, a novel benchmark designed to evaluate causal reasoning grounded in infographics that combine structured visual data with textual context. The benchmark comprises two tasks: Task 1 focuses on quantitative causal reasoning based on inferred numerical trends, while Task 2 targets semantic causal reasoning involving five types of causal relations: cause, effect, intervention, counterfactual, and temporal. We manually collected 494 infographic-text pairs from four public sources and used GPT-4o to generate 1,482 high-quality multiple-choice QA pairs. These questions were then carefully revised by humans to ensure they cannot be answered based on surface-level cues alone but instead require genuine visual grounding. Our experimental results reveal that current VLMs exhibit limited capability in computational reasoning and even more pronounced limitations in semantic causal reasoning. Their significantly lower performance compared to humans indicates a substantial gap in leveraging infographic-based information for causal inference. Through InfoCausalQA, we highlight the need for advancing the causal reasoning abilities of multimodal AI systems.
[76] Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens
Anna Seo Gyeong Choi, Hoon Choi
Main category: cs.CL
TL;DR: The paper examines ASR bias through a philosophical lens, highlighting its ethical implications beyond technical limitations, and calls for recognition of diverse speech varieties.
Details
Motivation: To address the limited research on fairness implications of ASR systems and their potential to compound historical injustices against marginalized linguistic communities.
Method: The paper distinguishes between morally neutral classification and harmful discrimination, identifies unique ethical dimensions of ASR bias, and analyzes the tension between linguistic standardization and pluralism.
Result: ASR systems can inadvertently transform neutral classification into harmful discrimination, creating asymmetric power relationships not captured by existing fairness metrics.
Conclusion: Addressing ASR bias requires recognizing diverse speech varieties as legitimate and developing systems that respect linguistic diversity and speaker autonomy.
Abstract: Automatic Speech Recognition (ASR) systems now mediate countless human-technology interactions, yet research on their fairness implications remains surprisingly limited. This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation – it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. We distinguish between morally neutral classification (discriminate$_1$) and harmful discrimination (discriminate$_2$), demonstrating how ASR systems can inadvertently transform the former into the latter when they consistently misrecognize non-standard dialects. We identify three unique ethical dimensions of speech technologies that differentiate ASR bias from other algorithmic fairness concerns: the temporal burden placed on speakers of non-standard varieties (“temporal taxation”), the disruption of conversational flow when systems misrecognize speech, and the fundamental connection between speech patterns and personal/cultural identity. These factors create asymmetric power relationships that existing technical fairness metrics fail to capture. The paper analyzes the tension between linguistic standardization and pluralism in ASR development, arguing that current approaches often embed and reinforce problematic language ideologies. We conclude that addressing ASR bias requires more than technical interventions; it demands recognition of diverse speech varieties as legitimate forms of expression worthy of technological accommodation. This philosophical reframing offers new pathways for developing ASR systems that respect linguistic diversity and speaker autonomy.
[77] Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu
Main category: cs.CL
TL;DR: ASearcher is an open-source project for large-scale RL training of search agents, improving scalability, efficiency, and data quality in LLM-based agents.
Details
Motivation: Open-source agents lack expert-level Search Intelligence, and existing methods are limited in scalability and efficiency.
Method: ASearcher uses scalable fully asynchronous RL training and a prompt-based LLM agent to synthesize high-quality QAs for training.
Result: The QwQ-32B agent achieves 46.7% and 20.8% Avg@4 gains on xBench and GAIA, with extreme long-horizon search capabilities.
Conclusion: ASearcher outperforms existing open-source 32B agents, demonstrating significant improvements in search intelligence.
Abstract: Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, the small turn limits in existing online RL methods (e.g., $\le 10$ turns) restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and code at https://github.com/inclusionAI/ASearcher.
[78] Capabilities of GPT-5 on Multimodal Medical Reasoning
Shansong Wang, Mingzhe Hu, Qiang Li, Mojtaba Safari, Xiaofeng Yang
Main category: cs.CL
TL;DR: GPT-5 outperforms GPT-4o and human experts in medical QA tasks, excelling in multimodal reasoning and decision support.
Details
Motivation: To evaluate GPT-5's zero-shot reasoning for medical decision-making, integrating text and visual data.
Method: Benchmarked GPT-5 variants on MedQA, MedXpertQA, MMLU, USMLE, and VQA-RAD using standardized protocols.
Result: GPT-5 achieves state-of-the-art accuracy, surpassing GPT-4o and human experts in reasoning and understanding.
Conclusion: GPT-5’s superior performance suggests potential for advanced clinical decision-support systems.
Abstract: Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.26% and +26.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5’s ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.
[79] MLLM-CTBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis
Haiyun Guo, ZhiYan Hou, Yu Chen, Jinghan He, Yandu Sun, Yuzhe Zhou, Shujing Guo, Kuan Zhu, Jinqiao Wang
Main category: cs.CL
TL;DR: The paper introduces MLLM-CTBench, a dataset for evaluating continual learning in multimodal large language models (MLLMs), with multidimensional metrics, comprehensive algorithm assessment, and a comparison of Reinforcement Fine-tuning (RFT) versus Supervised Fine-tuning (SFT).
Details
Motivation: The lack of rigorous benchmarks for continual instruction tuning in MLLMs hinders progress, prompting the creation of a systematic evaluation framework.
Method: The study curates seven tasks from six domains, introduces multidimensional evaluation metrics, assesses eight continual learning algorithms, and compares RFT and SFT.
Result: Reasoning processes in MLLMs are more resilient to forgetting than final outputs. Stronger models resist forgetting better, and properly regularized RFT outperforms SFT.
Conclusion: The work provides actionable insights for continual learning in MLLMs, highlighting the importance of model capability, task sequence, and regularization in RFT.
Abstract: Multimodal large language models (MLLMs) require continual instruction tuning during their post-training phase to adapt to dynamic real-world demands. However, the absence of rigorous and systematic benchmarks has hindered progress in this area. To bridge this gap, we introduce \textbf{MLLM-CTBench}, a dataset curating seven challenging tasks from six diverse domains with three contributions. First, to enable fine-grained analysis of continual learning ability, we introduce \textbf{multidimensional evaluation metrics}, which combine final answer accuracy with Chain-of-Thought (CoT) reasoning quality assessment through a carefully trained MLLM evaluator. Then, we conduct a \textbf{comprehensive evaluation of continual learning algorithms}, systematically assessing eight algorithms from four major categories to provide actionable insights for algorithm design and adoption. Finally, we evaluate the efficacy of \textbf{Reinforcement Fine-tuning (RFT) versus Supervised Fine-tuning (SFT)} in maintaining model performance across sequential tasks during continual instruction tuning. Our experiments demonstrate that reasoning processes in MLLMs exhibit greater resilience than final outputs to forgetting during continual learning, aligning with cognitive theories of hierarchical forgetting. We further show that both model capability and task sequence significantly influence continual learning outcomes, with stronger baseline models exhibiting greater resistance to forgetting. Notably, properly regularized RFT emerges as a more robust approach than SFT for maintaining performance across tasks. One of the key contributing factors is KL-divergence regularization, without which RFT leads to even worse forgetting than SFT on old tasks, though it may perform better on new tasks.
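The closing observation, that KL-divergence regularization is what keeps RFT from forgetting old tasks, matches the standard practice of anchoring the fine-tuned policy to a frozen reference. Below is a minimal sketch of such a penalty; the toy logits, dimensions, and `beta` weight are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a KL-anchoring penalty for RFT: keep the fine-tuned
# policy close to a frozen reference policy on the same tokens.
torch.manual_seed(0)
policy_logits = torch.randn(5, 100)                      # (tokens, vocab)
ref_logits = policy_logits + 0.1 * torch.randn(5, 100)   # frozen reference

# KL(policy || ref), averaged over tokens; kl_div takes the reference
# log-probs as its first argument when log_target=True.
kl = F.kl_div(F.log_softmax(ref_logits, dim=-1),
              F.log_softmax(policy_logits, dim=-1),
              log_target=True, reduction="batchmean")
beta = 0.05                                              # illustrative weight
print(float(beta * kl))   # added to the RL objective to anchor old behavior
```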
[80] A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models
Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu
Main category: cs.CL
TL;DR: The paper surveys parallel text generation methods to overcome the speed limitations of autoregressive LLMs, categorizing them into AR-based and Non-AR-based paradigms, and analyzing their trade-offs and future potential.
Details
Motivation: Autoregressive generation in LLMs is slow due to sequential token production. Parallel text generation aims to improve speed and efficiency, but lacks comprehensive analysis.
Method: The paper systematically surveys parallel text generation methods, categorizing them into AR-based and Non-AR-based paradigms, and evaluates their trade-offs in speed, quality, and efficiency.
Result: The survey provides a detailed taxonomy of techniques, assesses their performance, and identifies opportunities for combining or comparing them with other acceleration strategies.
Conclusion: The paper highlights recent advancements, open challenges, and future directions in parallel text generation, along with a GitHub repository for resources.
Abstract: As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time based on previously generated context, resulting in limited generation speed due to the inherently sequential nature of the process. To address this challenge, an increasing number of researchers have begun exploring parallel text generation, a broad class of techniques aimed at breaking the token-by-token generation bottleneck and improving inference efficiency. Despite growing interest, there remains a lack of comprehensive analysis on what specific techniques constitute parallel text generation and how they improve inference performance. To bridge this gap, we present a systematic survey of parallel text generation methods. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. Following this taxonomy, we assess their theoretical trade-offs in terms of speed, quality, and efficiency, and examine their potential for combination and comparison with alternative acceleration strategies. Finally, based on our findings, we highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation. We have also created a GitHub repository for indexing relevant papers and open resources available at https://github.com/zhanglingzhe0820/Awesome-Parallel-Text-Generation.
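To make the bottleneck concrete, here is a toy sketch of the autoregressive loop that parallel generation methods aim to break: step t cannot begin until step t-1 has produced its token, so latency grows linearly with output length. `next_token` is a hypothetical stand-in for a full LLM forward pass.

```python
# Toy autoregressive decoding: the loop is inherently sequential because
# each step's input is the output of all previous steps.
def next_token(context: list[int]) -> int:
    return (sum(context) * 31 + len(context)) % 50257  # stand-in "model"

def generate_ar(prompt: list[int], n_new: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(n_new):           # n_new strictly sequential steps
        tokens.append(next_token(tokens))
    return tokens

print(generate_ar([101, 2023], n_new=8))
```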
cs.CV
[81] A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection
Mohammad Zia Ur Rehman, Sufyaan Zahoor, Areeb Manzoor, Musharaf Maqbool, Nagendra Kumar
Main category: cs.CV
TL;DR: A novel multimodal framework for detecting misogynistic content on social media, outperforming existing methods by 10.17% and 8.88% on two datasets.
Details
Motivation: Addressing the challenge of detecting misogynistic content, which general offensive content detection methods struggle with.
Method: Proposes three modules: MANM for multimodal attention, GFRM for graph-based feature refinement, and CFLM for content-specific feature learning, along with misogynous lexicons and test-time augmentation.
Result: Achieves average improvements of 10.17% and 8.88% in macro-F1 on MAMI and MMHS150K datasets, respectively.
Conclusion: The framework effectively detects misogynistic content, demonstrating significant performance gains over existing methods.
Abstract: A substantial portion of offensive content on social media is directed towards women. Since general approaches to offensive content detection struggle with misogynistic content, solutions tailored to offensive content against women are required. To this end, we propose a novel multimodal framework for the detection of misogynistic and sexist content. The framework comprises three modules: the Multimodal Attention module (MANM), the Graph-based Feature Reconstruction Module (GFRM), and the Content-specific Features Learning Module (CFLM). The MANM employs adaptive gating-based multimodal context-aware attention, enabling the model to focus on relevant visual and textual information and generating contextually relevant features. The GFRM module utilizes graphs to refine features within individual modalities, while the CFLM focuses on learning text and image-specific features such as toxicity features and caption features. Additionally, we curate a set of misogynous lexicons to compute the misogyny-specific lexicon score from the text. We apply test-time augmentation in feature space to better generalize the predictions on diverse inputs. The performance of the proposed approach has been evaluated on two multimodal datasets, MAMI and MMHS150K, with 11,000 and 13,494 samples, respectively. The proposed method demonstrates an average improvement of 10.17% and 8.88% in macro-F1 over existing methods on the MAMI and MMHS150K datasets, respectively.
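The test-time augmentation in feature space mentioned above can be sketched in a few lines: perturb the fused feature vector several times and average the resulting class probabilities. The Gaussian perturbation, the toy `classifier` head, and all parameters here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def classifier(feat: np.ndarray) -> np.ndarray:
    # Toy softmax head over two classes (misogynous / not misogynous).
    logits = np.array([feat.sum(), -feat.sum()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def tta_predict(feat: np.ndarray, n_aug: int = 8, sigma: float = 0.05) -> np.ndarray:
    # Average predictions over noisy copies of the feature vector.
    probs = [classifier(feat + rng.normal(0.0, sigma, feat.shape))
             for _ in range(n_aug)]
    return np.mean(probs, axis=0)

print(tta_predict(rng.normal(size=128)))
```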
[82] IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection
Yanhui Li, Yunkang Cao, Chengliang Liu, Yuan Xiong, Xinghui Dong, Chao Huang
Main category: cs.CV
TL;DR: IAD-R1 is a universal post-training framework for Vision-Language Models (VLMs) that enhances anomaly detection in industrial settings through a two-stage training strategy, achieving significant performance improvements.
Details
Motivation: The scarcity of defective samples limits traditional anomaly detection methods, and VLMs, despite their generalization capabilities, underperform in industrial anomaly detection.
Method: IAD-R1 uses a two-stage approach: PA-SFT for anomaly perception training with a Chain-of-Thought dataset (Expert-AD), and SC-GRPO for optimizing anomaly interpretation with reward functions.
Result: IAD-R1 improves average accuracy by up to 43.3% across 7 VLMs on 6 benchmarks, with a 0.5B parameter model outperforming GPT-4.1 and Claude-Sonnet-4 in zero-shot settings.
Conclusion: IAD-R1 is effective and superior, with publicly available resources, demonstrating its potential for industrial anomaly detection.
Abstract: Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from “Anomaly Perception” to “Anomaly Interpretation”. Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs, attaining up to 43.3% enhancement in average accuracy on 6 industrial anomaly detection benchmark datasets. Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at https://github.com/Yanhui-Lee/IAD-R1.
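The abstract does not spell out SC-GRPO's reward functions, so the following is only a generic sketch of a verifiable reward of the kind used in GRPO-style training: a small format-adherence term plus an answer-correctness term. The tag names, weights, and decomposition are our assumptions, not IAD-R1's actual design.

```python
import re

def reward(completion: str, gold: str) -> float:
    # Format term: reasoning and answer must appear in the expected tags.
    fmt = 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                           completion, re.S) else 0.0
    # Correctness term: extracted answer must match the gold label.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    correct = 1.0 if m and m.group(1).strip().lower() == gold.lower() else 0.0
    return 0.2 * fmt + 0.8 * correct   # illustrative weights

good = "<think>scratch on the casing</think><answer>anomalous</answer>"
print(reward(good, "anomalous"), reward("anomalous", "anomalous"))  # 1.0 0.0
```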
[83] A Neurosymbolic Framework for Interpretable Cognitive Attack Detection in Augmented Reality
Rongqian Chen, Allison Andreyev, Yanming Xiu, Mahdi Imani, Bin Li, Maria Gorlatova, Gang Tan, Tian Lan
Main category: cs.CV
TL;DR: CADAR is a neurosymbolic approach for detecting cognitive attacks in AR, combining neural vision-language models with symbolic reasoning for improved accuracy and interpretability.
Details
Motivation: Cognitive attacks in AR manipulate users' perception, but existing detection methods lack semantic reasoning or interpretability.
Method: CADAR fuses vision-language inputs into a symbolic perception-graph and uses particle-filter based reasoning to detect attacks.
Result: Experiments show CADAR improves accuracy by up to 10.7% over baselines in challenging AR attack scenarios.
Conclusion: CADAR demonstrates the potential of neurosymbolic methods for effective and interpretable cognitive attack detection in AR.
Abstract: Augmented Reality (AR) enriches perception by overlaying virtual elements on the physical world. Due to its growing popularity, cognitive attacks that alter AR content to manipulate users’ semantic perception have received increasing attention. Existing detection methods often focus on visual changes, which are restricted to pixel- or image-level processing and lack semantic reasoning capabilities, or they rely on pre-trained vision-language models (VLMs), which function as black-box approaches with limited interpretability. In this paper, we present CADAR, a novel neurosymbolic approach for cognitive attack detection in AR. It fuses multimodal vision-language inputs using neural VLMs to obtain a symbolic perception-graph representation, incorporating prior knowledge, salience weighting, and temporal correlations. The model then enables particle-filter based statistical reasoning – a sequential Monte Carlo method – to detect cognitive attacks. Thus, CADAR inherits the adaptability of pre-trained VLM and the interpretability and reasoning rigor of particle filtering. Experiments on an extended AR cognitive attack dataset show accuracy improvements of up to 10.7% over strong baselines on challenging AR attack scenarios, underscoring the promise of neurosymbolic methods for effective and interpretable cognitive attack detection.
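Since the method leans on particle filtering, the bootstrap variant of sequential Monte Carlo is sketched below on a scalar latent state. CADAR applies the same predict-weight-resample loop to symbolic perception graphs, so the state space and noise models here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def particle_filter(observations, n_particles=500):
    particles = rng.normal(0.0, 1.0, n_particles)   # samples from the prior
    for z in observations:
        particles += rng.normal(0.0, 0.1, n_particles)         # predict step
        weights = np.exp(-0.5 * ((z - particles) / 0.5) ** 2)  # likelihood weights
        weights /= weights.sum()
        idx = rng.choice(n_particles, n_particles, p=weights)  # resample
        particles = particles[idx]
    return particles.mean()   # posterior mean estimate of the latent state

print(particle_filter([0.9, 1.1, 1.0, 1.2]))
```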
[84] Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving
Guangxun Zhu, Shiyu Fan, Hang Dai, Edmond S. L. Ho
Main category: cs.CV
TL;DR: Waymo-3DSkelMo is a large-scale dataset for high-quality 3D skeletal motions in multi-person interactions, derived from LiDAR data, addressing limitations of monocular RGB-based methods.
Details
Motivation: Existing datasets for 3D motion lack quality due to occlusion and temporal discontinuity in monocular RGB data, hindering fine-grained pedestrian interaction understanding in autonomous driving.
Method: Utilizes 3D human body shape and motion priors to enhance 3D pose sequences from LiDAR point clouds, creating a dataset with explicit interaction semantics.
Result: The dataset includes 14,000+ seconds of real driving scenarios with rich interactions (up to 250 agents per scene) and benchmarks for 3D pose forecasting.
Conclusion: Waymo-3DSkelMo serves as a foundational resource for future research on human behavior understanding in complex urban environments.
Abstract: Large-scale high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and lack of temporal continuity, thus resulting in unrealistic and low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate its value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments. The dataset and code will be available at https://github.com/GuangxunZhu/Waymo-3DSkelMo
[85] RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System
Abdolazim Rezaei, Mehdi Sookhak, Mahboobeh Haghparast
Main category: cs.CV
TL;DR: RL-MoE transforms visual data into privacy-preserving text, balancing accuracy and privacy, reducing replay-attack success rates to 9.4%.
Details
Motivation: Resolve the trade-off between data utility and privacy in AI-powered ITS cameras.
Method: Combines Mixture-of-Experts for scene decomposition with Reinforcement Learning for optimized text generation.
Result: Superior privacy protection (9.4% replay attack success) and richer textual output than baselines.
Conclusion: RL-MoE offers a scalable solution for privacy-sensitive AI systems in smart cities and autonomous vehicles.
Abstract: The proliferation of AI-powered cameras in Intelligent Transportation Systems (ITS) creates a severe conflict between the need for rich visual data and the fundamental right to privacy. Existing privacy-preserving mechanisms, such as blurring or encryption, are often insufficient, creating an undesirable trade-off where either privacy is compromised against advanced reconstruction attacks or data utility is critically degraded. To resolve this impasse, we propose RL-MoE, a novel framework that transforms sensitive visual data into privacy-preserving textual descriptions, eliminating the need for direct image transmission. RL-MoE uniquely combines a Mixture-of-Experts (MoE) architecture for nuanced, multi-aspect scene decomposition with a Reinforcement Learning (RL) agent that optimizes the generated text for a dual objective of semantic accuracy and privacy preservation. Extensive experiments demonstrate that RL-MoE provides superior privacy protection, reducing the success rate of replay attacks to just 9.4% on the CFP-FP dataset, while simultaneously generating richer textual content than baseline methods. Our work provides a practical and scalable solution for building trustworthy AI systems in privacy-sensitive domains, paving the way for more secure smart city and autonomous vehicle networks.
[86] Episodic Memory Representation for Long-form Video Understanding
Yun Wang, Long Zhang, Jingren Liu, Jiaqi Yan, Zhanjie Zhang, Jiahao Zheng, Xun Yang, Dapeng Wu, Xiangyu Chen, Xuelong Li
Main category: cs.CV
TL;DR: Video-EM improves long-form video understanding by modeling keyframes as episodic events, leveraging temporal dynamics and CoT reasoning for better accuracy with fewer frames.
Details
Motivation: Current Video-LLMs struggle with long-form videos due to context limits and oversimplified keyframe methods, missing spatio-temporal relationships.
Method: Video-EM treats keyframes as temporally ordered episodic events, capturing spatial and temporal dynamics, and uses CoT reasoning to select informative frames.
Result: Video-EM achieves 4-9% performance gains over baselines on benchmarks like Video-MME, EgoSchema, HourVideo, and LVBench, using fewer frames.
Conclusion: Video-EM addresses key limitations of existing methods, offering a robust, training-free framework for accurate long-form video understanding.
Abstract: Video Large Language Models (Video-LLMs) excel at general video understanding but struggle with long-form videos due to context window limits. Consequently, recent approaches focus on keyframe retrieval, condensing lengthy videos into a small set of informative frames. Despite their practicality, these methods simplify the problem to static text-image matching, overlooking spatio-temporal relationships crucial for capturing scene transitions and contextual continuity, and may yield redundant keyframes with limited information, diluting salient cues essential for accurate video question answering. To address these limitations, we introduce Video-EM, a training-free framework inspired by the principles of human episodic memory, designed to facilitate robust and contextually grounded reasoning. Rather than treating keyframes as isolated visual entities, Video-EM explicitly models them as temporally ordered episodic events, capturing both spatial relationships and temporal dynamics necessary for accurately reconstructing the underlying narrative. Furthermore, the framework leverages chain-of-thought (CoT) reasoning with LLMs to iteratively identify a minimal yet highly informative subset of episodic memories, enabling efficient and accurate question answering by Video-LLMs. Extensive evaluations on the Video-MME, EgoSchema, HourVideo, and LVBench benchmarks confirm the superiority of Video-EM, which achieves highly competitive results with performance gains of 4-9% over respective baselines while utilizing fewer frames.
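One way to read "temporally ordered episodic events" is as gap-based grouping of retrieved keyframes. The sketch below splits sorted timestamps wherever the gap exceeds a threshold; the grouping rule and the 5-second gap are illustrative assumptions, not Video-EM's actual episode construction.

```python
# Group keyframe timestamps (in seconds) into episodes by temporal proximity.
def group_into_episodes(timestamps: list[float], max_gap: float = 5.0):
    ts = sorted(timestamps)
    episodes, current = [], [ts[0]]
    for t in ts[1:]:
        if t - current[-1] <= max_gap:
            current.append(t)          # same episode: gap is small
        else:
            episodes.append(current)   # gap too large: start a new episode
            current = [t]
    episodes.append(current)
    return episodes

print(group_into_episodes([3.0, 4.5, 6.0, 40.2, 41.0, 95.5]))
# -> [[3.0, 4.5, 6.0], [40.2, 41.0], [95.5]]
```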
[87] Synthetic Data Generation for Emotional Depth Faces: Optimizing Conditional DCGANs via Genetic Algorithms in the Latent Space and Stabilizing Training with Knowledge Distillation
Seyed Muhammad Hossein Mousavi, S. Younes Mirinezhad
Main category: cs.CV
TL;DR: A framework for synthetic depth face generation using an optimized GAN with Knowledge Distillation and Genetic Algorithms improves diversity and quality, outperforming other methods in emotion recognition.
Details
Motivation: The lack of high-quality, diverse depth facial datasets for recognizing subtle emotional expressions in affective computing.
Method: Uses an optimized GAN with Knowledge Distillation (EMA teacher models) and Genetic Algorithms to evolve GAN latent vectors, enhancing diversity and quality. Feature extraction (LBP, HOG, Sobel edge, intensity histogram) and XGBoost for classification.
Result: Outperforms GAN, VAE, GMM, and KDE in diversity and quality. Achieves 94% and 96% accuracy in classification. Evaluation metrics (FID, IS, SSIM, PSNR) show consistent improvement over state-of-the-art methods.
Conclusion: The proposed framework effectively addresses the challenge of generating high-quality, diverse synthetic depth faces for emotion recognition, demonstrating superior performance.
Abstract: Affective computing faces a major challenge: the lack of high-quality, diverse depth facial datasets for recognizing subtle emotional expressions. We propose a framework for synthetic depth face generation using an optimized GAN with Knowledge Distillation (EMA teacher models) to stabilize training, improve quality, and prevent mode collapse. We also apply Genetic Algorithms to evolve GAN latent vectors based on image statistics, boosting diversity and visual quality for target emotions. The approach outperforms GAN, VAE, GMM, and KDE in both diversity and quality. For classification, we extract and concatenate LBP, HOG, Sobel edge, and intensity histogram features, achieving 94% and 96% accuracy with XGBoost. Evaluation using FID, IS, SSIM, and PSNR shows consistent improvement over state-of-the-art methods.
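The genetic search over latent vectors can be sketched compactly: score each latent with an image-statistic fitness, keep the best half, and mutate them. The toy `generate` function and the contrast-based fitness below are illustrative stand-ins; the paper's fitness is computed from image statistics tied to a target emotion.

```python
import numpy as np

rng = np.random.default_rng(2)

def generate(z: np.ndarray) -> np.ndarray:
    # Toy "generator": maps a latent vector to a fake 8x8 image.
    return np.tanh(np.outer(z[:8], z[8:16]))

def fitness(z: np.ndarray) -> float:
    return float(generate(z).std())    # reward higher-contrast outputs

def evolve(pop_size=32, dim=64, gens=20, sigma=0.1):
    pop = rng.normal(size=(pop_size, dim))
    for _ in range(gens):
        scores = np.array([fitness(z) for z in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]        # selection
        children = parents + rng.normal(0, sigma, parents.shape)  # mutation
        pop = np.vstack([parents, children])
    return pop[np.argmax([fitness(z) for z in pop])]

print(fitness(evolve()))
```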
[88] $\Delta$-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation
Jucheng Hu, Suorong Yang, Dongzhan Zhou
Main category: cs.CV
TL;DR: The paper introduces Δ-AttnMask, a data-efficient framework for Visual Instruction Finetuning (VIF) that evaluates sample quality using attention-guided masking, achieving high performance with minimal data.
Details
Motivation: VIF requires multimodal data for joint visual-textual understanding, posing data selection challenges. Current methods are inefficient and understudied.
Method: Δ-AttnMask quantifies sample quality by masking hidden states and computing loss differences (Δ) between original and masked states, without needing extra labels or models.
Result: The framework achieves state-of-the-art performance with 20% of data, speeding up training by 5x and improving accuracy by +10.1%.
Conclusion: Δ-AttnMask is model- and data-agnostic, offering a scalable and efficient solution for VIF data selection.
Abstract: Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). Unlike unimodal instruction finetuning in plain-text large language models, which mainly requires instruction datasets to enable model instruction-following ability, VIF also requires multimodal data to enable joint visual and textual understanding; therefore, it typically requires more data. Consequently, VIF imposes stricter data selection challenges: the method must scale efficiently to handle larger data demands while ensuring the quality of both visual and textual content, as well as their alignment. Despite its critical impact on performance, data selection for VIF remains an understudied area. In this paper, we propose $\Delta$-AttnMask. This data-efficient framework quantifies sample quality through attention-guided masking of the model’s hidden states, jointly evaluating image-text pairs without requiring domain labels, auxiliary models, or extra training. By computing loss differences ($\Delta$) between the original states and states masked using high-attention regions, $\Delta$-AttnMask intrinsically assesses sample quality. Experiments across multiple VLMs and datasets show that $\Delta$-AttnMask achieves state-of-the-art performance with just 20% of data, accelerating training by 5x while surpassing full-dataset baselines by +10.1% in overall accuracy. Its model-agnostic and data-agnostic design ensures broad applicability across modalities and architectures.
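The core Δ computation can be sketched in a few lines: mask the hidden states at the highest-attention positions and compare the loss before and after masking. The tiny linear head, the 30% masking ratio, and the random tensors are illustrative assumptions; the method itself operates on a VLM's own hidden states and attention maps.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d, vocab = 16, 32, 100
hidden = torch.randn(T, d)                    # per-token hidden states
attn = torch.softmax(torch.randn(T), dim=0)   # per-token attention weights
labels = torch.randint(0, vocab, (T,))
head = torch.nn.Linear(d, vocab)              # toy LM head

def lm_loss(h: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(head(h), labels)

k = int(0.3 * T)                              # mask top-30% attention positions
top = attn.topk(k).indices
masked = hidden.clone()
masked[top] = 0.0

delta = lm_loss(masked) - lm_loss(hidden)     # larger delta = sample relies more
print(float(delta))                           # on its salient regions
```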
[89] Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method
Masoumeh Sharafi, Soufiane Belharbi, Houssem Ben Salem, Ali Etemad, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger
Main category: cs.CV
TL;DR: The paper proposes Personalized Feature Translation (PFT) for source-free domain adaptation (SFDA) in facial expression recognition (FER), addressing challenges like subtle expressions and single-class target data. PFT operates in latent space, avoiding image synthesis and reducing computational overhead.
Details
Motivation: Current FER models struggle with subtle expressions and inter-subject variability. SFDA methods often require labeled source data or multi-class target data, which are unavailable in this scenario. PFT aims to adapt models using only neutral target data without source data or image synthesis.
Method: PFT pre-trains a translator on source domain data to transform style features between subjects while preserving expression information. It then adapts the translator on neutral target data without source data or image generation, operating in latent space for efficiency.
Result: PFT avoids the complexity of face expression generation, produces discriminative embeddings, and reduces computational overhead compared to image-based translation methods.
Conclusion: PFT is an efficient and lightweight solution for SFDA in FER, eliminating the need for image synthesis and adapting only part of the model, making it suitable for real-world applications.
Abstract: Facial expression recognition (FER) models are employed in many video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting their performance in real-world applications. To improve their performance, source-free domain adaptation (SFDA) methods have been proposed to personalize a pretrained source model using only unlabeled target domain data, thereby avoiding data privacy, storage, and transmission constraints. This paper addresses a challenging scenario where source data is unavailable for adaptation, and only unlabeled target data consisting solely of neutral expressions is available. SFDA methods are not typically designed to adapt using target data from only a single class. Further, using models to generate facial images with non-neutral expressions can be unstable and computationally intensive. In this paper, personalized feature translation (PFT) is proposed for SFDA. Unlike current image translation methods for SFDA, our lightweight method operates in the latent space. We first pre-train the translator on the source domain data to transform the subject-specific style features from one source subject into another. Expression information is preserved by optimizing a combination of expression consistency and style-aware objectives. Then, the translator is adapted on neutral target data, without using source data or image synthesis. By translating in the latent space, PFT avoids the complexity and noise of face expression generation, producing discriminative embeddings optimized for classification. Using PFT eliminates the need for image synthesis, reduces computational overhead (using a lightweight translator), and only adapts part of the model, making the method efficient compared to image-based translation.
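Because PFT's selling point is translating in latent space rather than pixel space, the idea can be outlined with a small sketch: an MLP maps source-subject features toward a target style while an expression-consistency term keeps a frozen expression head's outputs stable. Dimensions, loss terms, and weights below are illustrative assumptions, not the paper's objectives.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
translator = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
expr_head = nn.Linear(d, 7)             # frozen 7-way expression classifier
for p in expr_head.parameters():
    p.requires_grad_(False)

src = torch.randn(32, d)                # latent features from source subject A
tgt_style = torch.randn(32, d)          # style reference from subject B

out = translator(src)
style_loss = (out.mean(0) - tgt_style.mean(0)).pow(2).mean()  # match style stats
expr_loss = (expr_head(out) - expr_head(src)).pow(2).mean()   # keep expression
loss = style_loss + 0.5 * expr_loss
loss.backward()                         # updates only the translator
print(float(loss))
```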
[90] GANime: Generating Anime and Manga Character Drawings from Sketches with Deep Learning
Tai Vu, Robert Yang
Main category: cs.CV
TL;DR: C-GAN is the most effective model for generating high-quality colorized anime images from sketches.
Details
Motivation: Addressing the costly bottleneck of colorizing sketches in the manga and anime industry.
Method: Evaluated Neural Style Transfer, C-GAN, and CycleGAN for image-to-image translation.
Result: C-GAN outperformed others, producing high-quality, human-like images.
Conclusion: C-GAN is recommended for efficient and high-resolution sketch-to-image translation in anime production.
Abstract: The process of generating fully colorized drawings from sketches is a large, usually costly bottleneck in the manga and anime industry. In this study, we examine multiple models for image-to-image translation between anime characters and their sketches, including Neural Style Transfer, C-GAN, and CycleGAN. By assessing them qualitatively and quantitatively, we find that C-GAN is the most effective model that is able to produce high-quality and high-resolution images close to those created by humans.
[91] MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models
Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, Ziyu Guo, Zhihong Zhu, Hao Wu, Haixin Wang, Yefeng Zheng, Xiaojiang Peng, Xian Wu, Kun Wang, Xiangang Li, Jieping Ye, Pheng-Ann Heng
Main category: cs.CV
TL;DR: The paper introduces MME-Emotion, a benchmark to evaluate multimodal large language models (MLLMs) in emotional understanding and reasoning, revealing their current limitations and potential improvements.
Details
Motivation: Current emotional benchmarks for MLLMs lack assessment of generalization and reasoning capabilities, prompting the creation of MME-Emotion to address these gaps.
Method: MME-Emotion includes over 6,000 video clips with QA pairs, spanning eight emotional tasks, and uses hybrid metrics for evaluation via a multi-agent system framework.
Result: Evaluation of 20 MLLMs shows unsatisfactory emotional intelligence, with top models scoring 39.3% in recognition and 56.0% in reasoning. Generalist and specialist models exhibit different strengths.
Conclusion: MME-Emotion aims to advance MLLMs’ emotional intelligence by providing a scalable, diverse, and unified benchmark for future research.
Abstract: Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as it is still unknown: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present \textbf{MME-Emotion}, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, enjoying \textit{scalable capacity}, \textit{diverse settings}, and \textit{unified protocols}. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: (1) Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only $39.3\%$ recognition score and $56.0\%$ Chain-of-Thought (CoT) score on our benchmark. (2) Generalist models (\emph{e.g.}, Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (\emph{e.g.}, R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. By introducing MME-Emotion, we hope that it can serve as a foundation for advancing MLLMs’ emotional intelligence in the future.
[92] STAC: Leveraging Spatio-Temporal Data Associations For Efficient Cross-Camera Streaming and Analytics
Ragini Gupta, Lingzhi Zhao, Jiaxi Li, Volodymyr Vakhniuk, Claudiu Danilov, Josh Eckhardt, Keyshla Bernard, Klara Nahrstedt
Main category: cs.CV
TL;DR: STAC is a cross-camera surveillance system that optimizes bandwidth and reduces redundant data in IoT networks while maintaining high tracking accuracy.
Details
Motivation: Addressing the trade-off between reducing data to save bandwidth and maintaining model performance in multi-camera video analytics.
Method: STAC uses spatio-temporal associations, multi-resolution feature learning, frame filtering, FFmpeg compression, and RoI masking to eliminate redundancy.
Result: 76% improvement in tracking accuracy, 8.6x reduction in latency, and 29% fewer redundant frames on the AICity Challenge dataset.
Conclusion: STAC effectively balances network efficiency and model performance for real-time multi-camera surveillance.
Abstract: In IoT-based distributed camera networks, real-time multi-camera video analytics is challenged by high bandwidth demands and redundant visual data, creating a fundamental tension where reducing data saves network overhead but can degrade model performance, and vice versa. We present STAC, a cross-camera surveillance system that leverages spatio-temporal associations for efficient object tracking under constrained network conditions. STAC integrates multi-resolution feature learning, ensuring robustness under variable networked system-level optimizations such as frame filtering, FFmpeg-based compression, and Region-of-Interest (RoI) masking, to eliminate redundant content across distributed video streams while preserving downstream model accuracy for object identification and tracking. Evaluated on NVIDIA’s AICity Challenge dataset, STAC achieves a 76% improvement in tracking accuracy and an 8.6x reduction in inference latency over a standard multi-object multi-camera tracking baseline (using YOLOv4 and DeepSORT). Furthermore, 29% of redundant frames are filtered, significantly reducing data volume without compromising inference quality.
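Of the optimizations listed, frame filtering is the easiest to sketch: transmit a frame only if it differs enough from the last transmitted one. The mean-absolute-difference test and the threshold of 8 gray levels are illustrative assumptions, not STAC's actual filter.

```python
import numpy as np

def filter_frames(frames: list, thresh: float = 8.0) -> list:
    kept = [frames[0]]
    for f in frames[1:]:
        # int16 cast avoids uint8 wraparound when differencing.
        diff = np.abs(f.astype(np.int16) - kept[-1].astype(np.int16)).mean()
        if diff >= thresh:
            kept.append(f)         # enough change: worth transmitting
    return kept

rng = np.random.default_rng(3)
a = rng.integers(0, 256, (64, 64), dtype=np.uint8)
b = rng.integers(0, 256, (64, 64), dtype=np.uint8)
stream = [a] * 5 + [b]             # five identical frames, then a change
print(len(filter_frames(stream)))  # -> 2 of 6 frames kept
```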
[93] Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity
Zuoou Li, Weitong Zhang, Jingyuan Wang, Shuyuan Zhang, Wenjia Bai, Bernhard Kainz, Mengyun Qiao
Main category: cs.CV
TL;DR: The paper introduces a four-axis framework to evaluate jailbreak attacks on MLLMs, revealing a trade-off between prompt relevance and novelty. It proposes BSD, a recursive rewriting strategy, which outperforms previous methods in success rates and harmfulness.
Details
Motivation: Current evaluation standards for jailbreak attacks on MLLMs may overestimate effectiveness, as many 'successful' responses are benign or unrelated to malicious goals.
Method: A four-axis evaluation framework (on-topicness, OOD intensity, harmfulness, refusal rate) is introduced. BSD, a recursive rewriting strategy, is developed to balance relevance and novelty in prompts.
Result: BSD improves attack success rates by 67% and harmfulness by 21% across 13 MLLMs, highlighting weaknesses in current safety systems.
Conclusion: The study underscores the need for better evaluation metrics and safety mechanisms in MLLMs, with BSD demonstrating higher effectiveness in exploiting vulnerabilities.
Abstract: Multimodal large language models (MLLMs) are widely used in vision-language reasoning tasks. However, their vulnerability to adversarial prompts remains a serious concern, as safety mechanisms often fail to prevent the generation of harmful outputs. Although recent jailbreak strategies report high success rates, many responses classified as “successful” are actually benign, vague, or unrelated to the intended malicious goal. This mismatch suggests that current evaluation standards may overestimate the effectiveness of such attacks. To address this issue, we introduce a four-axis evaluation framework that considers input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate. This framework identifies truly effective jailbreaks. In a substantial empirical study, we reveal a structural trade-off: highly on-topic prompts are frequently blocked by safety filters, whereas those that are too OOD often evade detection but fail to produce harmful content. However, prompts that balance relevance and novelty are more likely to evade filters and trigger dangerous output. Building on this insight, we develop a recursive rewriting strategy called Balanced Structural Decomposition (BSD). The approach restructures malicious prompts into semantically aligned sub-tasks, while introducing subtle OOD signals and visual cues that make the inputs harder to detect. BSD was tested across 13 commercial and open-source MLLMs, where it consistently led to higher attack success rates, more harmful outputs, and fewer refusals. Compared to previous methods, it improves success rates by $67\%$ and harmfulness by $21\%$, revealing a previously underappreciated weakness in current multimodal safety systems.
[94] Towards Scalable Training for Handwritten Mathematical Expression Recognition
Haoyang Li, Jiaqing Li, Jialun Cao, Zongyuan Yang, Yongping Xiong
Main category: cs.CV
TL;DR: The paper introduces a scalable data engine to generate a large dataset (Tex80M) for HMER by combining limited handwritten formulas with LaTeX-rendered ones. It also presents TexTeller, a model achieving SOTA performance through mix-training.
Details
Motivation: Addressing the data scarcity in HMER due to costly manual annotation by leveraging LaTeX-rendered formulas.
Method: Developed a scalable data engine to create Tex80M, then trained TexTeller by mix-training Tex80M with a small HME dataset.
Result: TexTeller achieved state-of-the-art performance across benchmarks.
Conclusion: The complete model, dataset, and codebase will be openly released to advance HMER research.
Abstract: Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of \textbf{H}andwritten \textbf{M}athematical \textbf{E}xpression \textbf{R}ecognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed \texttt{Tex80M}, comprising over 80 million high-quality training instances. Then we propose \texttt{TexTeller}, the first HMER model trained at scale, by mix-training \texttt{Tex80M} with a relatively small HME dataset. The expansive training dataset and our refined pipeline have equipped \texttt{TexTeller} with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research building upon our contributions.
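The scalable data engine can be pictured as a formula sampler plus a renderer. The sketch below pairs a tiny recursive grammar with matplotlib's mathtext renderer; the actual engine produces far more complex, consistent LaTeX and renders it at the scale of 80M instances.

```python
import random
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

random.seed(0)
ATOMS = ["x", "y", r"\alpha", r"\beta", "2", "n"]

def sample_formula(depth: int = 2) -> str:
    # Recursively combine atoms into fractions, powers, and sums.
    if depth == 0:
        return random.choice(ATOMS)
    a, b = sample_formula(depth - 1), sample_formula(depth - 1)
    return random.choice([rf"\frac{{{a}}}{{{b}}}", f"{a}^{{{b}}}", f"{a}+{b}"])

for i in range(3):
    tex = sample_formula()
    fig = plt.figure(figsize=(2, 1))
    fig.text(0.5, 0.5, f"${tex}$", ha="center", va="center", fontsize=18)
    fig.savefig(f"formula_{i}.png", dpi=150)   # (image, LaTeX) training pair
    plt.close(fig)
    print(tex)
```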
[95] Gradient-Direction-Aware Density Control for 3D Gaussian Splatting
Zheng Zhou, Yu-Jie Xiong, Chun-Ming Xia, Jia-Chen Zhang, Hong-Jian Zhan
Main category: cs.CV
TL;DR: GDAGS introduces gradient-direction-aware adaptive density control to address over-reconstruction and over-densification in 3D Gaussian Splatting, improving rendering quality and reducing memory usage.
Details
Motivation: Existing 3DGS methods suffer from over-reconstruction due to persistent large Gaussians and over-densification from redundant Gaussians, increasing memory overhead.
Method: GDAGS uses a gradient coherence ratio (GCR) and nonlinear dynamic weighting to prioritize splitting conflicting-gradient Gaussians and densifying concordant-direction Gaussians.
Result: GDAGS achieves superior rendering quality, reduces memory consumption by 50%, and mitigates over-reconstruction and over-densification.
Conclusion: GDAGS effectively optimizes Gaussian utilization, enhancing scene representation compactness and rendering performance.
Abstract: The emergence of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis through explicit scene representation, enabling real-time photorealistic rendering. However, existing approaches manifest two critical limitations in complex scenarios: (1) Over-reconstruction occurs when persistent large Gaussians cannot meet adaptive splitting thresholds during density control. This is exacerbated by conflicting gradient directions that prevent effective splitting of these Gaussians; (2) Over-densification of Gaussians occurs in regions with aligned gradient aggregation, leading to redundant component proliferation. This redundancy significantly increases memory overhead due to unnecessary data retention. We present Gradient-Direction-Aware Gaussian Splatting (GDAGS), a gradient-direction-aware adaptive density control framework to address these challenges. Our key innovations are: the gradient coherence ratio (GCR), computed through normalized gradient vector norms, which explicitly discriminates Gaussians with concordant versus conflicting gradient directions; and a nonlinear dynamic weighting mechanism that leverages the GCR to enable gradient-direction-aware density control. Specifically, GDAGS prioritizes conflicting-gradient Gaussians during splitting operations to enhance geometric details while suppressing redundant concordant-direction Gaussians. Conversely, in cloning processes, GDAGS promotes concordant-direction Gaussian densification for structural completion while preventing conflicting-direction Gaussian overpopulation. Comprehensive evaluations across diverse real-world benchmarks demonstrate that GDAGS achieves superior rendering quality while effectively mitigating over-reconstruction, suppressing over-densification, and constructing compact scene representations with 50% reduced memory consumption through optimized Gaussian utilization.
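The abstract defines the GCR through normalized gradient vector norms; one natural reading is the norm of the summed per-view gradients divided by the sum of their norms, so concordant gradients score near 1 and conflicting ones near 0. The sketch below implements that reading; the paper's exact formula and split/clone thresholds may differ.

```python
import numpy as np

def gcr(grads: np.ndarray) -> float:
    # grads: (n_views, 3) positional gradients accumulated for one Gaussian.
    return float(np.linalg.norm(grads.sum(axis=0)) /
                 (np.linalg.norm(grads, axis=1).sum() + 1e-12))

aligned = np.tile([1.0, 0.0, 0.0], (4, 1))               # concordant directions
conflicting = np.array([[1, 0, 0], [-1, 0, 0],
                        [0, 1, 0], [0, -1, 0]], float)   # conflicting directions
print(gcr(aligned), gcr(conflicting))                    # -> 1.0 0.0
```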
[96] FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents
Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Miao Fang, Xiuying Chen
Main category: cs.CV
TL;DR: FineState-Bench is introduced as the first evaluation standard for fine-grained GUI agent operations, addressing flaws in existing benchmarks by focusing on detailed control capabilities. It includes a multi-platform framework and a Visual Diagnostic Assistant (VDA) for quantitative analysis. Results show current models struggle with fine-grained interactions, with visual positioning identified as the main bottleneck.
Details
Motivation: Current GUI agent evaluation frameworks focus too much on coarse-grained task completion and neglect fine-grained control, which is critical for real-world applications. This gap motivates the creation of FineState-Bench.
Method: The authors introduce FineState-Bench, a multi-platform framework with 2257 task benchmarks and a four-phase indicator for assessing perception-to-control. They also develop the VDA for quantitative decoupling analysis of visual capabilities.
Result: Experiments reveal that advanced models achieve only 32.8% fine-grained interaction accuracy. The VDA shows ideal visual localization can improve success rates by 14.9%, confirming visual positioning as the primary bottleneck.
Conclusion: FineState-Bench provides a comprehensive diagnostic framework for fine-grained GUI agent evaluation, highlighting visual positioning as the key challenge. All resources are open-source for broader adoption.
Abstract: With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks through natural language instructions. However, current evaluation frameworks for GUI agents suffer from fundamental flaws: existing benchmarks overly focus on coarse-grained task completion while neglecting fine-grained control capabilities crucial for real-world applications. To address this, we introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI agent operations, designed to quantify fine-grained control. This multi-platform (desktop, Web, mobile) framework includes 2257 task benchmarks in four components and uses a four-phase indicator for comprehensive perception-to-control assessment. To analyze perception and positioning for refined operations, we developed the plug-and-play Visual Diagnostic Assistant (VDA), enabling the first quantitative decoupling analysis of these capabilities. Experimental results on our benchmark show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. Using our VDA in controlled experiments, quantifying the impact of visual capabilities, we showed that ideal visual localization boosts Gemini-2.5-Flash’s success rate by 14.9%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI agents is basic visual positioning capability. All resources are fully open-source. GitHub: https://github.com/AnonymousThewarehouse/FineState-Bench Hugging Face: https://huggingface.co/datasets/Willtime2006/Static-FineBench
[97] Beyond Blanket Masking: Examining Granularity for Privacy Protection in Images Captured by Blind and Low Vision Users
Jeffri Murrugarra-LLerena, Haoran Niu, K. Suzanne Barber, Hal Daumé III, Yang Trista Cao, Paola Cascante-Bonilla
Main category: cs.CV
TL;DR: FiGPriv is a fine-grained privacy protection framework for VLMs that selectively masks high-risk private info, improving usability while ensuring privacy.
Details
Motivation: Address privacy concerns for blind/low-vision users who may unintentionally capture private info in images, overcoming limitations of coarse-grained masking.
Method: Integrates fine-grained segmentation with a data-driven risk scoring mechanism.
Result: Preserves 26% more image content, improves VLM response usefulness by 11%, and enhances content identification by 45%.
Conclusion: FiGPriv effectively balances privacy and usability, outperforming existing methods.
Abstract: As visual assistant systems powered by visual language models (VLMs) become more prevalent, concerns over user privacy have grown, particularly for blind and low vision users who may unknowingly capture personal private information in their images. Existing privacy protection methods rely on coarse-grained segmentation, which uniformly masks entire private objects, often at the cost of usability. In this work, we propose FiGPriv, a fine-grained privacy protection framework that selectively masks only high-risk private information while preserving low-risk information. Our approach integrates fine-grained segmentation with a data-driven risk scoring mechanism. We evaluate our framework using the BIV-Priv-Seg dataset and show that FiGPriv preserves 26% more image content, enhancing the ability of VLMs to provide useful responses by 11% and identify the image content by 45%, while ensuring privacy protection. Project Page: https://artcs1.github.io/VLMPrivacy/
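Fine-grained protection then reduces to masking only the regions whose risk score clears a threshold. The sketch below shows that thresholding step; the region names, scores, and the 0.7 cutoff are illustrative assumptions, and in FiGPriv the scores come from its data-driven risk scoring mechanism.

```python
import numpy as np

def mask_high_risk(image: np.ndarray, regions, risk_threshold: float = 0.7):
    out = image.copy()
    for mask, risk in regions:        # (boolean mask, risk score) pairs
        if risk >= risk_threshold:
            out[mask] = 0             # blank only high-risk pixels
    return out

img = np.full((8, 8), 255, dtype=np.uint8)
credit_card = np.zeros((8, 8), bool); credit_card[:2, :4] = True
shirt_logo = np.zeros((8, 8), bool); shirt_logo[6:, 6:] = True
masked = mask_high_risk(img, [(credit_card, 0.95), (shirt_logo, 0.20)])
print(masked[0, 0], masked[7, 7])     # -> 0 255: card masked, logo kept
```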
[98] Harnessing Input-Adaptive Inference for Efficient VLN
Dongwoo Kang, Akhil Perincherry, Zachary Coalson, Aiden Gabriel, Stefan Lee, Sanghyun Hong
Main category: cs.CV
TL;DR: The paper proposes input-adaptive navigation methods to enhance efficiency in vision-and-language navigation (VLN) models, addressing computational bottlenecks without significant performance loss.
Details
Motivation: The scale of history-aware multi-modal transformer models in VLN is a computational bottleneck, especially in resource-limited settings.
Method: Three adaptive algorithms are introduced: (1) selective panoramic view processing, (2) importance-based adaptive thresholding for early-exit, and (3) a caching mechanism for previously seen views.
Result: Evaluations show over a 2× reduction in computation across three off-the-shelf agents on seven VLN benchmarks.
Conclusion: The proposed methods effectively improve efficiency in VLN models without substantial performance degradation.
Abstract: An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate action for an agent. While they have significantly improved performance, the scale of these models can be a bottleneck in practical settings with limited computational resources. In this work, we propose a novel input-adaptive navigation method to enhance VLN model efficiency. We first show that existing input-adaptive mechanisms fail to reduce computations without substantial performance degradation. To address this, we introduce three adaptive algorithms, each deployed at a different level: (1) To improve spatial efficiency, we selectively process panoramic views at each observation of an agent. (2) To improve intra-model efficiency, we propose importance-based adaptive thresholding for the early-exit methods. (3) To improve temporal efficiency, we implement a caching mechanism that prevents reprocessing of views previously seen by the agent. In evaluations on seven VLN benchmarks, we demonstrate over a 2$\times$ reduction in computation across three off-the-shelf agents in both standard and continuous environments. Our code is publicly available at https://github.com/secure-ai-systems-group/adaptive-vision-and-language-navigation.
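The third algorithm above (temporal efficiency) is the easiest to picture in code. Below is a minimal, hypothetical sketch of a view-feature cache in that spirit; the pixel-hash keying and the toy encoder are illustrative assumptions, not the authors' implementation.

```python
import hashlib
import numpy as np

class ViewCache:
    """Caches per-view features so views revisited by the agent are not re-encoded."""
    def __init__(self, encoder):
        self.encoder = encoder   # any callable: image array -> feature vector
        self._store = {}         # view key -> cached feature

    def _key(self, view: np.ndarray) -> str:
        # Toy keying: hash raw pixels. A real agent would more likely key on a
        # stable (viewpoint id, heading index) pair -- an assumption, not the paper's scheme.
        return hashlib.md5(view.tobytes()).hexdigest()

    def features(self, view: np.ndarray) -> np.ndarray:
        k = self._key(view)
        if k not in self._store:      # encode only previously unseen views
            self._store[k] = self.encoder(view)
        return self._store[k]

# Toy usage with a dummy "encoder" that averages pixels per channel.
cache = ViewCache(encoder=lambda v: v.mean(axis=(0, 1)))
pano = np.random.rand(224, 224, 3)
f1 = cache.features(pano)   # computed on first sight
f2 = cache.features(pano)   # served from cache on revisit
assert np.allclose(f1, f2)
```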
[99] RoHOI: Robustness Benchmark for Human-Object Interaction Detection
Di Wen, Kunyu Peng, Kailun Yang, Yufan Chen, Ruiping Liu, Junwei Zheng, Alina Roitberg, Danda Pani Paudel, Luc Van Gool, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: The paper introduces RoHOI, the first robustness benchmark for HOI detection, addressing model degradation in real-world conditions. It proposes SAMPL, a learning strategy to enhance robustness, outperforming existing methods.
Details
Motivation: Current HOI detection models fail under real-world corruptions like noise and occlusions, necessitating a robustness benchmark and improved methods.
Method: The authors create RoHOI, a benchmark with 20 corruption types, and propose SAMPL, a semantic-aware masking-based progressive learning strategy.
Result: Experiments show SAMPL outperforms state-of-the-art methods, improving robustness in HOI detection.
Conclusion: The work sets a new standard for robust HOI detection, with benchmarks and code made publicly available.
Abstract: Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate prediction. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusions, and noise. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the HOI field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, thus dynamically adjusting the model’s optimization to enhance robust feature learning. Extensive experiments show that our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code will be made publicly available at https://github.com/Kratos-Wen/RoHOI.
[100] SegDAC: Segmentation-Driven Actor-Critic for Visual Reinforcement Learning
Alexandre Brown, Glen Berseth
Main category: cs.CV
TL;DR: SegDAC, a Segmentation-Driven Actor-Critic method, leverages SAM and YOLO-World for object-centric decomposition and semantic grounding, achieving superior visual generalization and sample efficiency in RL tasks.
Details
Motivation: Visual RL struggles with integrating large perception models for effective learning from high-dimensional inputs and noisy rewards.
Method: SegDAC combines SAM for object-centric decomposition, YOLO-World for semantic grounding via text prompts, and a transformer-based architecture for dynamic segment focus using online RL.
Result: SegDAC doubles prior performance on the hardest visual generalization benchmark and matches/surpasses sample efficiency in diverse tasks.
Conclusion: SegDAC effectively integrates perception models into RL, improving visual generalization and sample efficiency without human labels.
Abstract: Visual reinforcement learning (RL) is challenging due to the need to learn both perception and actions from high-dimensional inputs and noisy rewards. Although large perception models exist, integrating them effectively into RL for visual generalization and improved sample efficiency remains unclear. We propose SegDAC, a Segmentation-Driven Actor-Critic method. SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground segments semantically via text prompts. It includes a novel transformer-based architecture that supports a dynamic number of segments at each time step and effectively learns which segments to focus on using online RL, without using human labels. By evaluating SegDAC over a challenging visual generalization benchmark using Maniskill3, which covers diverse manipulation tasks under strong visual perturbations, we demonstrate that SegDAC achieves significantly better visual generalization, doubling prior performance on the hardest setting and matching or surpassing prior methods in sample efficiency across all evaluated tasks.
[101] CD-TVD: Contrastive Diffusion for 3D Super-Resolution with Scarce High-Resolution Time-Varying Data
Chongke Bi, Xin Gao, Jiangkang Deng, Guan Li, Jun Han
Main category: cs.CV
TL;DR: CD-TVD combines contrastive learning and diffusion-based super-resolution to achieve accurate 3D super-resolution from limited HR data, reducing reliance on large datasets.
Details
Motivation: Existing super-resolution methods require extensive HR training data, limiting their applicability in diverse simulation scenarios.
Method: CD-TVD uses a contrastive encoder and diffusion model with local attention, pre-trained on historical data and fine-tuned with minimal new HR data.
Result: Experiments show CD-TVD provides accurate, resource-efficient 3D super-resolution for fluid and atmospheric simulations.
Conclusion: CD-TVD advances data augmentation for large-scale simulations by minimizing HR data dependency while preserving detail recovery.
Abstract: Large-scale scientific simulations require significant resources to generate high-resolution time-varying data (TVD). While super-resolution is an efficient post-processing strategy to reduce costs, existing methods rely on a large amount of HR training data, limiting their applicability to diverse simulation scenarios. To address this constraint, we proposed CD-TVD, a novel framework that combines contrastive learning and an improved diffusion-based super-resolution model to achieve accurate 3D super-resolution from limited time-step high-resolution data. During pre-training on historical simulation data, the contrastive encoder and diffusion superresolution modules learn degradation patterns and detailed features of high-resolution and low-resolution samples. In the training phase, the improved diffusion model with a local attention mechanism is fine-tuned using only one newly generated high-resolution timestep, leveraging the degradation knowledge learned by the encoder. This design minimizes the reliance on large-scale high-resolution datasets while maintaining the capability to recover fine-grained details. Experimental results on fluid and atmospheric simulation datasets confirm that CD-TVD delivers accurate and resource-efficient 3D super-resolution, marking a significant advancement in data augmentation for large-scale scientific simulations. The code is available at https://github.com/Xin-Gao-private/CD-TVD.
[102] Lung-DDPM+: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model
Yifan Jiang, Ahmad Shariftabrizi, Venkata SK. Manem
Main category: cs.CV
TL;DR: Lung-DDPM+ is an improved generative AI model for lung cancer diagnosis, offering higher efficiency and anatomical precision compared to its predecessor and other SOTA models.
Details
Motivation: Existing generative models for lung cancer diagnosis are inefficient and anatomically imprecise, limiting clinical use.
Method: Lung-DDPM+ uses a DDPM guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver to focus on lesions and improve efficiency.
Result: The model achieves 8× fewer FLOPs, 6.8× lower GPU memory use, and 14× faster sampling while maintaining sample quality. It also performs well in segmentation tasks and a Visual Turing Test.
Conclusion: Lung-DDPM+ effectively generates high-quality thoracic CT images, showing promise for broader medical imaging applications.
Abstract: Generative artificial intelligence (AI) has been playing an important role in various domains. Leveraging its high capability to generate high-fidelity and diverse synthetic data, generative AI is widely applied in diagnostic tasks, such as lung cancer diagnosis using computed tomography (CT). However, existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limit their clinical applicability. To address these drawbacks, we propose Lung-DDPM+, an improved version of our previous model, Lung-DDPM. This novel approach is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, enabling the method to focus on lesion areas while achieving a better trade-off between sampling efficiency and quality. Evaluation results on the public LIDC-IDRI dataset suggest that the proposed method achieves 8$\times$ fewer FLOPs (floating point operations), 6.8$\times$ lower GPU memory consumption, and 14$\times$ faster sampling compared to Lung-DDPM. Moreover, it maintains comparable sample quality to both Lung-DDPM and other state-of-the-art (SOTA) generative models in two downstream segmentation tasks. We also conducted a Visual Turing Test with an experienced radiologist, confirming the quality and fidelity of synthetic samples generated by the proposed method. These experimental results demonstrate that Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, highlighting its potential for broader applications, such as general tumor synthesis and lesion generation in medical imaging. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM-PLUS.
[103] UltraLight Med-Vision Mamba for Classification of Neoplastic Progression in Tubular Adenomas
Aqsa Sultana, Nordin Abouzahra, Ahmed Rahu, Brian Shula, Brandon Combs, Derrick Forchetti, Theus Aspiras, Vijayan K. Asari
Main category: cs.CV
TL;DR: Ultralight Med-Vision Mamba, a deep learning model, improves adenoma classification in colonoscopy screenings, enhancing risk assessment and enabling personalized patient care.
Details
Motivation: Precise identification of precancerous polyps is crucial for reducing colorectal cancer risk, requiring advanced tools for accurate classification and risk stratification.
Method: The paper employs Ultralight Med-Vision Mamba, a state-space model (SSM), to analyze whole slide images, leveraging its ability to model long- and short-range dependencies and generalize across images.
Result: The model excels in adenoma classification and stratification, offering computational efficiency and scalability for real-time clinical use.
Conclusion: Ultralight Med-Vision Mamba is a promising tool for improving colonoscopy screenings through precise, efficient, and scalable deep learning.
Abstract: Identification of precancerous polyps during routine colonoscopy screenings is vital for their excision, lowering the risk of developing colorectal cancer. Advanced deep learning algorithms enable precise adenoma classification and stratification, improving risk assessment accuracy and enabling personalized surveillance protocols that optimize patient outcomes. Ultralight Med-Vision Mamba, a state-space based model (SSM), has excelled in modeling long- and short-range dependencies and image generalization, critical factors for analyzing whole slide images. Furthermore, Ultralight Med-Vision Mamba’s efficient architecture offers advantages in both computational speed and scalability, making it a promising tool for real-time clinical deployment.
[104] Blink-to-code: real-time Morse code communication via eye blink detection and classification
Anushka Bhatt
Main category: cs.CV
TL;DR: A real-time system translates eye blinks into Morse code for communication in motor-impaired individuals, achieving 62% accuracy.
Details
Motivation: To provide a low-cost assistive communication method for people with severe motor impairments.
Method: Uses a webcam and computer vision to detect and classify blinks as Morse code dots or dashes, then decodes them into characters.
Result: 62% decoding accuracy with 18-20 seconds response time in experiments with five participants.
Conclusion: The system is a viable, low-cost solution for assistive communication.
Abstract: This study proposes a real-time system that translates voluntary eye blinks into Morse code, enabling communication for individuals with severe motor impairments. Using a standard webcam and computer vision, the system detects and classifies blinks as short (dot) or long (dash), then decodes them into alphanumeric characters. Experiments with five participants show 62% decoding accuracy and 18–20 second response times, demonstrating a viable, low-cost assistive communication method.
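The dot/dash classification and Morse decoding stage is simple enough to sketch end-to-end. The duration threshold and the abbreviated Morse table below are assumptions for the demo; the actual system derives blink durations from webcam eye tracking.

```python
# Illustrative sketch of the classification/decoding stage only. The 0.4 s
# dash threshold and the partial Morse table are assumptions, not the paper's values.
MORSE = {".-": "A", "-...": "B", "-.-.": "C", "...": "S", "---": "O"}

def classify_blinks(durations_s, dash_threshold=0.4):
    """Map blink durations (seconds) to Morse symbols: short -> dot, long -> dash."""
    return "".join("." if d < dash_threshold else "-" for d in durations_s)

def decode(symbol_groups: str) -> str:
    """Decode space-separated Morse groups into characters ('?' if unknown)."""
    return "".join(MORSE.get(g, "?") for g in symbol_groups.split(" "))

# Three short blinks, three long blinks, three short blinks -> "SOS".
groups = " ".join([
    classify_blinks([0.15, 0.15, 0.15]),
    classify_blinks([0.6, 0.6, 0.6]),
    classify_blinks([0.15, 0.15, 0.15]),
])
print(decode(groups))  # SOS
```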
[105] FusionEnsemble-Net: An Attention-Based Ensemble of Spatiotemporal Networks for Multimodal Sign Language Recognition
Md. Milon Islam, Md Rezwanul Haque, S M Taslim Uddin Raju, Fakhri Karray
Main category: cs.CV
TL;DR: FusionEnsemble-Net, an attention-based ensemble of spatiotemporal networks, improves sign language recognition by fusing visual and motion data, achieving 99.44% accuracy on the MultiMeDaLIS dataset.
Details
Motivation: Accurate sign language recognition in healthcare is challenging due to complex multimodal gestures, requiring advanced frameworks.
Method: Proposes FusionEnsemble-Net, which processes RGB video and radar data through four spatiotemporal networks with attention-based fusion and ensemble classification.
Result: Achieves 99.44% test accuracy on the MultiMeDaLIS dataset, outperforming state-of-the-art methods.
Conclusion: Attention-based fusion and ensemble of diverse spatiotemporal networks create a robust framework for multimodal gesture recognition.
Abstract: Accurate recognition of sign language in healthcare communication poses a significant challenge, requiring frameworks that can accurately interpret complex multimodal gestures. To deal with this, we propose FusionEnsemble-Net, a novel attention-based ensemble of spatiotemporal networks that dynamically fuses visual and motion data to enhance recognition accuracy. The proposed approach processes RGB video and range Doppler map radar modalities synchronously through four different spatiotemporal networks. For each network, features from both modalities are continuously fused using an attention-based fusion module before being fed into an ensemble of classifiers. Finally, the outputs of these four different fused channels are combined in an ensemble classification head, thereby enhancing the model’s robustness. Experiments demonstrate that FusionEnsemble-Net outperforms state-of-the-art approaches with a test accuracy of 99.44% on the large-scale MultiMeDaLIS dataset for Italian Sign Language. Our findings indicate that an ensemble of diverse spatiotemporal networks, unified by attention-based fusion, yields a robust and accurate framework for complex, multimodal isolated gesture recognition tasks. The source code is available at: https://github.com/rezwanh001/Multimodal-Isolated-Italian-Sign-Language-Recognition.
[106] A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition
Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Fakhri Karray
Main category: cs.CV
TL;DR: A dual-architecture framework for Continuous Sign Language Recognition (CSLR) addresses signer variability and novel sentence structures, achieving state-of-the-art results on the Isharah-1000 dataset.
Details
Motivation: To overcome challenges like inter-signer variability and poor generalization to unseen sentences in CSLR.
Method: Proposes a Signer-Invariant Conformer for signer-agnostic representations and a Multi-Scale Fusion Transformer for novel sentence comprehension.
Result: Achieves WER of 13.07% (SI task) and 47.78% (US task), outperforming prior work.
Conclusion: Task-specific networks for CSLR challenges significantly improve performance, setting a new research benchmark.
Abstract: Continuous Sign Language Recognition (CSLR) faces multiple challenges, including significant inter-signer variability and poor generalization to novel sentence structures. Traditional solutions frequently fail to handle these issues efficiently. To overcome these constraints, we propose a dual-architecture framework. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints. For the Unseen-Sentences (US) task, we designed a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures fine-grained posture dynamics, enabling the model to comprehend novel grammatical compositions. Experiments on the challenging Isharah-1000 dataset establish a new standard for both CSLR benchmarks. The proposed conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state-of-the-art. On the US task, the transformer model scores a WER of 47.78%, surpassing previous work. In the SignEval 2025 CSLR challenge, our team placed 2nd in the US task and 4th in the SI task, demonstrating the competitiveness of these models. The findings validate our key hypothesis: developing task-specific networks designed for the particular challenges of CSLR leads to considerable performance improvements and establishes a new baseline for further research. The source code is available at: https://github.com/rezwanh001/MSLR-Pose86K-CSLR-Isharah.
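Both headline numbers above are Word Error Rates. For readers unfamiliar with the metric, here is a compact reference implementation via word-level Levenshtein distance; this is the standard definition, not code from the paper.

```python
# Standard WER: (substitutions + insertions + deletions) / reference length.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat", "the cat sat down"))  # 0.333... (one insertion / 3 words)
```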
[107] What Can We Learn from Inter-Annotator Variability in Skin Lesion Segmentation?
Kumar Abhishek, Jeremy Kawahara, Ghassan Hamarneh
Main category: cs.CV
TL;DR: The paper studies variability in medical image segmentation, focusing on skin lesions, and introduces IMA++, a multi-annotator dataset. It links inter-annotator agreement (IAA) to malignancy and uses IAA to improve segmentation accuracy.
Details
Motivation: Address variability in medical image segmentation due to ambiguous boundaries and annotator differences, particularly in skin lesions.
Method: Curate IMA++, analyze variability factors, predict IAA from images, and integrate IAA into multi-task learning.
Result: Found significant link between IAA and malignancy (p<0.001), predicted IAA with 0.108 MAE, and improved segmentation accuracy by 4.2%.
Conclusion: IAA is a useful soft feature for improving segmentation models, validated across datasets.
Abstract: Medical image segmentation exhibits intra- and inter-annotator variability due to ambiguous object boundaries, annotator preferences, expertise, and tools, among other factors. Lesions with ambiguous boundaries, e.g., spiculated or infiltrative nodules, or irregular borders per the ABCD rule, are particularly prone to disagreement and are often associated with malignancy. In this work, we curate IMA++, the largest multi-annotator skin lesion segmentation dataset, on which we conduct an in-depth study of variability due to annotator, malignancy, tool, and skill factors. We find a statistically significant (p<0.001) association between inter-annotator agreement (IAA), measured using Dice, and the malignancy of skin lesions. We further show that IAA can be accurately predicted directly from dermoscopic images, achieving a mean absolute error of 0.108. Finally, we leverage this association by utilizing IAA as a “soft” clinical feature within a multi-task learning objective, yielding a 4.2% improvement in balanced accuracy averaged across multiple model architectures and across IMA++ and four public dermoscopic datasets. The code is available at https://github.com/sfu-mial/skin-IAV.
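The last step, using IAA as a "soft" auxiliary target in a multi-task objective, can be sketched in a few lines. The shared encoder, head sizes, and loss weighting below are illustrative assumptions, not the paper's architecture.

```python
# Minimal PyTorch sketch of a multi-task objective with an auxiliary IAA
# regression head. Layer sizes and the 0.5 loss weight are assumptions.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1),
                                     nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.cls_head = nn.Linear(feat_dim, 2)   # main task, e.g. benign vs. malignant
        self.iaa_head = nn.Linear(feat_dim, 1)   # auxiliary: predict Dice-based IAA

    def forward(self, x):
        z = self.encoder(x)
        return self.cls_head(z), self.iaa_head(z).squeeze(-1)

model = MultiTaskHead()
images = torch.randn(4, 3, 64, 64)
labels = torch.randint(0, 2, (4,))
iaa = torch.rand(4)                               # agreement scores in [0, 1]
logits, iaa_pred = model(images)
loss = nn.functional.cross_entropy(logits, labels) \
     + 0.5 * nn.functional.mse_loss(iaa_pred, iaa)  # weighting is an assumption
loss.backward()
```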
[108] X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents
Guoxian Song, Hongyi Xu, Xiaochen Zhao, You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Linjie Luo
Main category: cs.CV
TL;DR: X-UniMotion introduces a unified latent representation for whole-body human motion, enabling high-fidelity cross-identity motion transfer via disentangled tokens and a self-supervised framework.
Details
Motivation: Prior methods rely on explicit skeletal poses and heuristic adjustments, lacking expressiveness and identity-agnostic capabilities.
Method: A self-supervised framework encodes motion into four disentangled latent tokens (face, body, hands) using a DiT-based generative model and auxiliary decoders.
Result: X-UniMotion outperforms state-of-the-art methods, producing expressive animations with superior motion fidelity and identity preservation.
Conclusion: The approach successfully achieves high-fidelity, identity-agnostic motion transfer, advancing whole-body motion representation.
Abstract: We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular motion directly from a single image into a compact set of four disentangled latent tokens – one for facial expression, one for body pose, and one for each hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations. To achieve this, we introduce a self-supervised, end-to-end framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale, diverse human motion datasets. Motion-identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, we guide motion token learning with auxiliary decoders that promote fine-grained, semantically aligned, and depth-aware motion embeddings. Extensive experiments show that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation.
[109] DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection
Kang Ni, Minrui Zou, Yuxuan Li, Xiang Li, Kehua Guo, Ming-Ming Cheng, Yimian Dai
Main category: cs.CV
TL;DR: DenoDet V2 introduces a novel transform-domain approach for SAR object detection, leveraging amplitude and phase information via mutual modulation, outperforming DenoDet V1 with improved accuracy and reduced complexity.
Details
Motivation: Addressing the challenge of coherent noise in SAR object detection, the paper explores a new perspective in the transform domain to enhance feature modulation.
Method: DenoDet V2 uses a band-wise mutual modulation mechanism to exploit amplitude and phase information, improving feature enhancement in the transform domain.
Result: DenoDet V2 achieves a 0.8% improvement on SARDet-100K and reduces model complexity by half compared to DenoDet V1.
Conclusion: DenoDet V2 sets a new benchmark for SAR object detection by effectively leveraging transform-domain features and mutual modulation.
Abstract: One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted approaches or deep learning-based methods, employ the analysis or enhancement of object spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8% improvement on SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half. The code is available at https://github.com/GrokCV/GrokSAR.
[110] Skyshield: Event-Driven Submillimetre Thin Obstacle Detection for Drone Flight Safety
Zhengli Zhang, Xinyu Luo, Yuchen Sun, Wenhua Ding, Dongyu Huang, Xinlei Chen
Main category: cs.CV
TL;DR: SkyShield is an event-driven framework for detecting submillimeter obstacles like wires, using a lightweight U-Net and Dice-Contour Loss, achieving high accuracy and low latency.
Details
Motivation: Thin obstacles (e.g., wires) are hard to detect with conventional sensors, posing risks to drones in complex environments.
Method: Uses event-driven data, a lightweight U-Net, and Dice-Contour Regularization Loss for precise obstacle detection.
Result: Achieves a mean F1 Score of 0.7088 with 21.2 ms latency, suitable for edge/mobile platforms.
Conclusion: SkyShield effectively detects thin obstacles with high performance, ideal for drone applications.
Abstract: Drones operating in complex environments face a significant threat from thin obstacles, such as steel wires and kite strings at the submillimeter level, which are notoriously difficult for conventional sensors like RGB cameras, LiDAR, and depth cameras to detect. This paper introduces SkyShield, an event-driven, end-to-end framework designed for the perception of submillimeter scale obstacles. Drawing upon the unique features that thin obstacles present in the event stream, our method employs a lightweight U-Net architecture and an innovative Dice-Contour Regularization Loss to ensure precise detection. Experimental results demonstrate that our event-based approach achieves mean F1 Score of 0.7088 with a low latency of 21.2 ms, making it ideal for deployment on edge and mobile platforms.
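The loss name suggests a Dice term plus a contour regularizer. Below is a hedged sketch of one plausible formulation, where the contour term penalizes disagreement between the spatial gradients of predicted and target masks; the exact formulation in the paper may differ.

```python
# Sketch of a Dice loss with an assumed contour-regularization term
# (gradient-magnitude agreement between masks). Not the paper's exact loss.
import torch
import torch.nn.functional as F

def dice_contour_loss(pred, target, lam=0.1, eps=1e-6):
    """pred, target: (B, 1, H, W); pred holds probabilities in [0, 1]."""
    inter = (pred * target).sum(dim=(2, 3))
    dice = 1 - (2 * inter + eps) / (pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3)) + eps)

    def grads(m):  # finite-difference spatial gradients
        return m[..., :, 1:] - m[..., :, :-1], m[..., 1:, :] - m[..., :-1, :]

    pgx, pgy = grads(pred)
    tgx, tgy = grads(target)
    contour = F.l1_loss(pgx, tgx) + F.l1_loss(pgy, tgy)
    return dice.mean() + lam * contour   # lam is an assumed weighting

pred = torch.rand(2, 1, 64, 64, requires_grad=True)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
dice_contour_loss(pred, target).backward()
```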
[111] Autonomous AI Bird Feeder for Backyard Biodiversity Monitoring
El Mustapha Mansouri
Main category: cs.CV
TL;DR: A low-cost, on-premise system for autonomous bird monitoring in Belgian urban gardens uses motion-triggered IP cameras and local processing to classify birds without cloud dependency.
Details
Motivation: To enable citizen-science-grade biodiversity logging at home while preserving privacy and avoiding cloud fees.
Method: Uses motion-triggered IP cameras, Detectron2 for bird localization, and an EfficientNet-B3 model for classification, all running on commodity hardware.
Result: Achieves high validation performance (99.5%) and practical field accuracy (88% top-1) on held-out species.
Conclusion: Demonstrates feasibility for affordable, privacy-preserving autonomous bird monitoring in urban gardens.
Abstract: This paper presents a low-cost, on-premise system for autonomous backyard bird monitoring in Belgian urban gardens. A motion-triggered IP camera uploads short clips via FTP to a local server, where frames are sampled and birds are localized with Detectron2; cropped regions are then classified by an EfficientNet-B3 model fine-tuned on a 40-species Belgian subset derived from a larger Kaggle corpus. All processing runs on commodity hardware without a discrete GPU, preserving privacy and avoiding cloud fees. The physical feeder uses small entry ports (30 mm) to exclude pigeons and reduce nuisance triggers. Detector-guided cropping improves classification accuracy over raw-frame classification. The classifier attains high validation performance on the curated subset (about 99.5 percent) and delivers practical field accuracy (top-1 about 88 percent) on held-out species, demonstrating feasibility for citizen-science-grade biodiversity logging at home.
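The detect-then-classify flow is straightforward to outline. In this sketch the detector and classifier are stubs standing in for the paper's Detectron2 and fine-tuned EfficientNet-B3 models; the confidence threshold is an assumption.

```python
# Pipeline sketch: localize birds, crop, then classify the crops.
# detect_birds/classify_crop are stubs, not the real models.
import numpy as np

def detect_birds(frame):
    """Return bird bounding boxes as (x1, y1, x2, y2); stubbed for the sketch."""
    return [(50, 40, 180, 160)]

def classify_crop(crop):
    """Return (species, confidence); stubbed for the sketch."""
    return ("Parus major", 0.93)

def process_frame(frame, min_conf=0.8):
    sightings = []
    for (x1, y1, x2, y2) in detect_birds(frame):
        crop = frame[y1:y2, x1:x2]        # detector-guided cropping
        species, conf = classify_crop(crop)
        if conf >= min_conf:              # drop low-confidence species calls
            sightings.append((species, conf))
    return sightings

frame = np.zeros((240, 320, 3), dtype=np.uint8)
print(process_frame(frame))  # [('Parus major', 0.93)]
```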
[112] RampNet: A Two-Stage Pipeline for Bootstrapping Curb Ramp Detection in Streetscape Images from Open Government Metadata
John S. O’Meara, Jared Hwang, Zeyu Wang, Michael Saugstad, Jon E. Froehlich
Main category: cs.CV
TL;DR: The paper introduces RampNet, a two-stage pipeline for large-scale curb ramp detection, achieving state-of-the-art performance with a dataset of 210,000 annotated GSV panoramas and a modified ConvNeXt V2 model.
Details
Motivation: Curb ramps are vital for urban accessibility, but detecting them in images is challenging due to the lack of quality datasets. Existing methods are limited in scale or quality.
Method: A two-stage pipeline: Stage 1 auto-translates government curb ramp data into pixel coordinates in GSV panoramas, creating a dataset. Stage 2 trains a modified ConvNeXt V2 model on this dataset.
Result: The dataset achieves 94.0% precision and 92.5% recall; the model reaches 0.9236 AP, outperforming prior work.
Conclusion: The work provides the first large-scale, high-quality curb ramp detection dataset, benchmark, and model, significantly advancing the field.
Abstract: Curb ramps are critical for urban accessibility, but robustly detecting them in images remains an open problem due to the lack of large-scale, high-quality datasets. While prior work has attempted to improve data availability with crowdsourced or manually labeled data, these efforts often fall short in either quality or scale. In this paper, we introduce and evaluate a two-stage pipeline called RampNet to scale curb ramp detection datasets and improve model performance. In Stage 1, we generate a dataset of more than 210,000 annotated Google Street View (GSV) panoramas by auto-translating government-provided curb ramp location data to pixel coordinates in panoramic images. In Stage 2, we train a curb ramp detection model (modified ConvNeXt V2) from the generated dataset, achieving state-of-the-art performance. To evaluate both stages of our pipeline, we compare to manually labeled panoramas. Our generated dataset achieves 94.0% precision and 92.5% recall, and our detection model reaches 0.9236 AP – far exceeding prior work. Our work contributes the first large-scale, high-quality curb ramp detection dataset, benchmark, and model.
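Stage 1's core step, projecting a known curb-ramp GPS location into pixel coordinates of an equirectangular panorama, reduces to a bearing-and-pitch computation. The sketch below uses a flat-earth approximation and assumed panorama conventions (camera height, heading at image center); the actual pipeline may differ.

```python
# Plausible sketch of geo-to-pixel translation for an equirectangular panorama.
# Camera height, panorama size, and conventions are assumptions.
import math

def geo_to_pixel(cam_lat, cam_lng, cam_heading_deg, ramp_lat, ramp_lng,
                 pano_w=8192, pano_h=4096, cam_height_m=2.5):
    # Bearing from camera to ramp (flat-earth approximation, fine at ~10 m).
    d_east = math.radians(ramp_lng - cam_lng) * 6371000 * math.cos(math.radians(cam_lat))
    d_north = math.radians(ramp_lat - cam_lat) * 6371000
    bearing = math.degrees(math.atan2(d_east, d_north)) % 360
    dist = math.hypot(d_east, d_north)
    # Yaw relative to the camera heading (assumed at image center) maps to x.
    yaw = (bearing - cam_heading_deg + 180) % 360 - 180
    x = (0.5 + yaw / 360.0) * pano_w
    # The ramp lies on the ground, so it sits below the horizon by this pitch.
    pitch = -math.degrees(math.atan2(cam_height_m, dist))
    y = (0.5 - pitch / 180.0) * pano_h
    return int(x), int(y)

print(geo_to_pixel(47.6097, -122.3331, 90.0, 47.60975, -122.33295))
```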
[113] Distilling LLM Prior to Flow Model for Generalizable Agent’s Imagination in Object Goal Navigation
Badi Li, Ren-jie Lu, Yu Zhou, Jingke Meng, Wei-shi Zheng
Main category: cs.CV
TL;DR: GOAL is a generative flow-based framework for Object Goal Navigation (ObjectNav) that improves generalization by modeling semantic distributions using LLM-enriched maps.
Details
Motivation: Prior deterministic models overlook uncertainty in indoor layouts, limiting generalization to unseen environments.
Method: GOAL uses generative flow-based modeling and LLM-enriched semantic maps, encoding spatial priors as Gaussian fields.
Result: Achieves state-of-the-art performance on MP3D and Gibson, with strong generalization in HM3D.
Conclusion: GOAL’s generative approach effectively addresses uncertainty and improves generalization in ObjectNav.
Abstract: The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D. Codes and pretrained models are available at https://github.com/Badi-Li/GOAL.
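Encoding a spatial prior as a two-dimensional Gaussian field on a map grid can be shown compactly. The grid size, sigma, and additive injection below are illustrative assumptions about how such priors could be rasterized into target maps.

```python
# Sketch: rasterize an LLM-suggested object location as a 2D Gaussian field.
import numpy as np

def gaussian_field(h, w, center, sigma=5.0):
    """Render a unit-peak 2D Gaussian centered at (row, col) on an h x w grid."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

# Suppose an LLM suggests the target object is likely near cell (20, 34).
prior = gaussian_field(64, 64, center=(20, 34), sigma=4.0)
semantic_map = np.zeros((64, 64))                  # target-category channel
target_map = np.clip(semantic_map + prior, 0, 1)   # inject prior into the map
print(target_map.max(), target_map[20, 34])        # peak at the suggested cell
```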
[114] What-Meets-Where: Unified Learning of Action and Contact Localization in a New Dataset
Yuxiao Wang, Yu Lei, Wolin Liang, Weiying Xue, Zhenao Wei, Nan Zhuang, Qi Liu
Main category: cs.CV
TL;DR: The paper introduces PaIR-Net, a framework for predicting action semantics and body-part contact regions, addressing the gap in joint modeling of action and spatial context.
Details
Motivation: Current methods fail to jointly model action semantics and spatial contextualization, limiting comprehensive understanding of actions in visual contexts.
Method: Proposes PaIR-Net with three components: CPAM for contact-relevant body parts, PGCS for pixel-wise segmentation, and IIM for global interaction relationships. Uses the PaIR dataset (13,979 images, 654 actions, 80 objects, 17 body parts).
Result: PaIR-Net outperforms baselines, with ablation studies confirming the efficacy of its components.
Conclusion: The framework successfully bridges the gap in joint modeling of action and spatial context, with plans to release the code and dataset.
Abstract: People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider what action is occurring and where it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts. Experimental evaluation demonstrates that PaIR-Net significantly outperforms baseline approaches, while ablation studies confirm the efficacy of each architectural component. The code and dataset will be released upon publication.
[115] MPT: Motion Prompt Tuning for Micro-Expression Recognition
Jiateng Liu, Hengcan Shi, Feng Chen, Zhiwen Shao, Yaonan Wang, Jianfei Cai, Wenming Zheng
Main category: cs.CV
TL;DR: The paper introduces Motion Prompt Tuning (MPT) to adapt large pre-training models for micro-expression recognition (MER), addressing challenges like scarce annotations and subtle motion capture.
Details
Motivation: Micro-expression recognition is vital for applications like lie detection but suffers from limited training data and the inability of existing models to capture subtle facial movements.
Method: MPT uses motion magnification and Gaussian tokenization to generate prompts for large models, along with a group adapter to enhance MER-specific representation.
Result: Experiments on three MER datasets show MPT outperforms state-of-the-art methods.
Conclusion: MPT effectively adapts large models for MER, improving recognition of subtle facial motions.
Abstract: Micro-expression recognition (MER) is crucial in the affective computing field due to its wide application in medical diagnosis, lie detection, and criminal investigation. Despite its significance, obtaining micro-expression (ME) annotations is challenging due to the expertise required from psychological professionals. Consequently, ME datasets often suffer from a scarcity of training samples, severely constraining the learning of MER models. While current large pre-training models (LMs) offer general and discriminative representations, their direct application to MER is hindered by an inability to capture transitory and subtle facial movements-essential elements for effective MER. This paper introduces Motion Prompt Tuning (MPT) as a novel approach to adapting LMs for MER, representing a pioneering method for subtle motion prompt tuning. Particularly, we introduce motion prompt generation, including motion magnification and Gaussian tokenization, to extract subtle motions as prompts for LMs. Additionally, a group adapter is carefully designed and inserted into the LM to enhance it in the target MER domain, facilitating a more nuanced distinction of ME representation. Furthermore, extensive experiments conducted on three widely used MER datasets demonstrate that our proposed MPT consistently surpasses state-of-the-art approaches and verifies its effectiveness.
[116] RASR: Retrieval-Augmented Super Resolution for Practical Reference-based Image Restoration
Jiaqi Yan, Shuning Xu, Xiangyu Chen, Dell Zhang, Jie Tang, Gangshan Wu, Jie Liu
Main category: cs.CV
TL;DR: RASR introduces a scalable RefSR method by automatically retrieving relevant references, improving practicality and performance over SISR.
Details
Motivation: Overcome the impracticality of manually curated target-reference pairs in existing RefSR methods.
Method: Proposes RASRNet, combining a semantic reference retriever with a diffusion-based generator for enhanced RefSR.
Result: RASRNet outperforms SISR baselines (+0.38 dB PSNR, -0.0131 LPIPS) and generates more realistic textures.
Conclusion: Retrieval augmentation bridges the gap between academic RefSR research and real-world applications.
Abstract: Reference-based Super Resolution (RefSR) improves upon Single Image Super Resolution (SISR) by leveraging high-quality reference images to enhance texture fidelity and visual realism. However, a critical limitation of existing RefSR approaches is their reliance on manually curated target-reference image pairs, which severely constrains their practicality in real-world scenarios. To overcome this, we introduce Retrieval-Augmented Super Resolution (RASR), a new and practical RefSR paradigm that automatically retrieves semantically relevant high-resolution images from a reference database given only a low-quality input. This enables scalable and flexible RefSR in realistic use cases, such as enhancing mobile photos taken in environments like zoos or museums, where category-specific reference data (e.g., animals, artworks) can be readily collected or pre-curated. To facilitate research in this direction, we construct RASR-Flickr30, the first benchmark dataset designed for RASR. Unlike prior datasets with fixed target-reference pairs, RASR-Flickr30 provides per-category reference databases to support open-world retrieval. We further propose RASRNet, a strong baseline that combines a semantic reference retriever with a diffusion-based RefSR generator. It retrieves relevant references based on semantic similarity and employs a diffusion-based generator enhanced with semantic conditioning. Experiments on RASR-Flickr30 demonstrate that RASRNet consistently improves over SISR baselines, achieving +0.38 dB PSNR and -0.0131 LPIPS, while generating more realistic textures. These findings highlight retrieval augmentation as a promising direction to bridge the gap between academic RefSR research and real-world applicability.
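The retrieval step amounts to nearest-neighbor search in an embedding space. A minimal sketch, assuming a generic semantic encoder (the paper pairs the retriever with a diffusion-based generator, omitted here):

```python
# Sketch of semantic reference retrieval by cosine similarity.
# The embedding model is abstracted away; a CLIP-like encoder is one option.
import numpy as np

def retrieve_references(query_emb, db_embs, k=3):
    """Return indices of the top-k database entries by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(1)
db = rng.normal(size=(100, 512))              # embeddings of the reference database
query = db[42] + 0.1 * rng.normal(size=512)   # LR input resembling entry 42
print(retrieve_references(query, db))         # entry 42 should rank first
```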
[117] HyperKD: Distilling Cross-Spectral Knowledge in Masked Autoencoders via Inverse Domain Shift with Spatial-Aware Masking and Specialized Loss
Abdul Matin, Tanjim Bin Faruk, Shrideep Pallickara, Sangmi Lee Pallickara
Main category: cs.CV
TL;DR: HyperKD is a knowledge distillation framework for hyperspectral remote sensing, transferring learned representations from a simpler teacher model to a student model, overcoming spectral disparities and improving downstream tasks.
Details
Motivation: Direct application of foundation models to hyperspectral remote sensing is challenging due to spectral disparities and limited observations. HyperKD aims to bridge this gap.
Method: HyperKD uses a simpler teacher model for inverse knowledge transfer, featuring spectral range-based channel alignment, spatial feature-guided masking, and an enhanced loss function.
Result: HyperKD improves representation learning, enhancing reconstruction fidelity and performance on tasks like land cover classification and crop type identification.
Conclusion: HyperKD effectively bridges spectral domain gaps, demonstrating the potential of knowledge distillation in hyperspectral remote sensing analytics.
Abstract: The proliferation of foundation models, pretrained on large-scale unlabeled datasets, has emerged as an effective approach in creating adaptable and reusable architectures that can be leveraged for various downstream tasks using satellite observations. However, their direct application to hyperspectral remote sensing remains challenging due to inherent spectral disparities and the scarcity of available observations. In this work, we present HyperKD, a novel knowledge distillation framework that enables transferring learned representations from a teacher model into a student model for effective development of a foundation model on hyperspectral images. Unlike typical knowledge distillation frameworks, which use a complex teacher to guide a simpler student, HyperKD enables an inverse form of knowledge transfer across different types of spectral data, guided by a simpler teacher model. Building upon a Masked Autoencoder, HyperKD distills knowledge from the Prithvi foundational model into a student tailored for EnMAP hyperspectral imagery. HyperKD addresses the inverse domain adaptation problem with spectral gaps by introducing a feature-based strategy that includes spectral range-based channel alignment, spatial feature-guided masking, and an enhanced loss function tailored for hyperspectral images. HyperKD bridges the substantial spectral domain gap, enabling the effective use of pretrained foundation models for geospatial applications. Extensive experiments show that HyperKD significantly improves representation learning in MAEs, leading to enhanced reconstruction fidelity and more robust performance on downstream tasks such as land cover classification, crop type identification, and soil organic carbon prediction, underpinning the potential of knowledge distillation frameworks in remote sensing analytics with hyperspectral imagery.
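Of the three ingredients, spectral range-based channel alignment is the most self-contained: many hyperspectral bands are pooled into the few channels the teacher expects. The wavelength grid and teacher band ranges below are assumptions for illustration.

```python
# Sketch of spectral range-based channel alignment: average hyperspectral bands
# whose wavelengths fall in each teacher band's range. Ranges are assumptions.
import numpy as np

def align_channels(hsi, wavelengths_nm, teacher_ranges_nm):
    """hsi: (C, H, W); returns (len(teacher_ranges), H, W) by range-wise averaging."""
    out = []
    for lo, hi in teacher_ranges_nm:
        idx = [i for i, w in enumerate(wavelengths_nm) if lo <= w < hi]
        out.append(hsi[idx].mean(axis=0))
    return np.stack(out)

hsi = np.random.rand(224, 32, 32)               # 224 hyperspectral bands
wl = np.linspace(420, 2450, 224)                # nominal band centers (nm)
teacher = [(450, 520), (520, 600), (630, 690)]  # e.g. blue/green/red ranges
print(align_channels(hsi, wl, teacher).shape)   # (3, 32, 32)
```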
[118] Animate-X++: Universal Character Image Animation with Dynamic Backgrounds
Shuai Tan, Biao Gong, Zhuoxin Liu, Yan Wang, Xi Chen, Yifan Feng, Hengshuang Zhao
Main category: cs.CV
TL;DR: Animate-X++ is a universal animation framework for various character types, addressing limitations in motion modeling and static backgrounds by introducing Pose Indicator and multi-task training.
Details
Motivation: Existing methods for character image animation are limited to human figures and static backgrounds, lacking realism and generalization for anthropomorphic characters.
Method: Proposes Animate-X++ with Pose Indicator for motion representation and multi-task training for dynamic backgrounds.
Result: Achieves superior performance in universal animation, including anthropomorphic characters and dynamic backgrounds.
Conclusion: Animate-X++ effectively addresses key challenges in character animation, offering a versatile and realistic solution.
Abstract: Character image animation, which generates high-quality videos from a reference image and target pose sequence, has seen significant progress in recent years. However, most existing methods only apply to human figures, which usually do not generalize well to anthropomorphic characters commonly used in industries like gaming and entertainment. Furthermore, previous methods could only generate videos with static backgrounds, which limits the realism of the videos. For the first challenge, our in-depth analysis attributes this limitation to insufficient modeling of motion, which fails to comprehend the movement pattern of the driving video, thus imposing a pose sequence rigidly onto the target character. To this end, this paper proposes Animate-X++, a universal animation framework based on DiT for various character types, including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion patterns from the driving video in both implicit and explicit manners. The former leverages CLIP visual features of a driving video to extract its gist of motion, like the overall movement pattern and temporal relations among motions, while the latter strengthens the generalization of DiT by simulating possible inputs in advance that may arise during inference. For the second challenge, we introduce a multi-task training strategy that jointly trains the animation and TI2V tasks. Combined with the proposed partial parameter training, this approach achieves not only character animation but also text-driven background dynamics, making the videos more realistic. Moreover, we introduce a new Animated Anthropomorphic Benchmark (A2Bench) to evaluate the performance of Animate-X++ on universal and widely applicable animation images. Extensive experiments demonstrate the superiority and effectiveness of Animate-X++.
[119] IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding
Junxian Li, Beining Xu, Di Zhang
Main category: cs.CV
TL;DR: The paper introduces IAG, a novel input-aware backdoor attack method for vision-language models (VLMs), manipulating grounding behavior by embedding semantic triggers into images. It achieves high attack success rates while maintaining stealth and minimal impact on clean samples.
Details
Motivation: Security issues in VLMs, particularly backdoor attacks in visual grounding tasks, are underexplored. The paper aims to address this gap by proposing a method to manipulate VLMs' grounding behavior.
Method: IAG uses an adaptive trigger generator with a text-conditional U-Net to embed semantic attack targets into images, ensuring stealth via reconstruction loss. A unified method for attack data generation is also introduced.
Result: IAG achieves over 65% ASR@0.5 on InternVL-2.5-8B and shows effectiveness on Ferret-7B and LlaVA-1.5-7B with minimal clean sample accuracy loss. Ablation studies and defense tests confirm robustness.
Conclusion: IAG is a feasible, effective, and stealthy backdoor attack method for VLMs, demonstrating high success rates and transferability while maintaining clean sample performance.
Abstract: Vision-language models (VLMs) have shown significant advancements in tasks such as visual grounding, where they localize specific objects in images based on natural language queries and images. However, security issues in visual grounding tasks for VLMs remain underexplored, especially in the context of backdoor attacks. In this paper, we introduce a novel input-aware backdoor attack method, IAG, designed to manipulate the grounding behavior of VLMs. This attack forces the model to ground a specific target object in the input image, regardless of the user’s query. We propose an adaptive trigger generator that embeds the semantic information of the attack target’s description into the original image using a text-conditional U-Net, thereby overcoming the open-vocabulary attack challenge. To ensure the attack’s stealthiness, we utilize a reconstruction loss to minimize visual discrepancies between poisoned and clean images. Additionally, we introduce a unified method for generating attack data. IAG is evaluated theoretically and empirically, demonstrating its feasibility and effectiveness. Notably, our ASR@0.5 on InternVL-2.5-8B reaches over 65% on various testing sets. IAG also shows promising potential on manipulating Ferret-7B and LlaVA-1.5-7B with very little accuracy decrease on clean samples. Extensive specific experiments, such as ablation study and potential defense, also indicate the robustness and transferability of our attack.
[120] RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization
Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia
Main category: cs.CV
TL;DR: RelayFormer is a unified architecture for visual manipulation localization (VML) in images and videos, offering scalable, resolution-agnostic processing with strong generalization.
Details
Motivation: Existing VML methods lack cross-modal generalization and struggle with high-resolution or long-duration inputs.
Method: RelayFormer uses flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, integrating with Transformer-based backbones via lightweight adaptation modules. It includes a query-based mask decoder for efficient video inference.
Result: Achieves state-of-the-art localization performance across benchmarks.
Conclusion: RelayFormer sets a new baseline for scalable and modality-agnostic VML.
Abstract: Visual manipulation localization (VML) – across both images and videos – is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently. We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations. Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: https://github.com/WenOOI/RelayFormer.
[121] Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy
Hao Yu, Rupayan Mallick, Margrit Betke, Sarah Adel Bargal
Main category: cs.CV
TL;DR: GEN-AFFECT is a novel framework for generating expressive and identity-consistent 2D avatars using a multimodal diffusion transformer and consistent attention at inference.
Details
Motivation: Existing avatar generation methods often fail to capture fine-grained facial expressions and struggle with identity preservation across expressions.
Method: GEN-AFFECT conditions a multimodal diffusion transformer on identity-expression representations and uses consistent attention at inference for identity consistency.
Result: The framework outperforms state-of-the-art methods in expression accuracy, identity preservation, and consistency across expressions.
Conclusion: GEN-AFFECT offers a robust solution for personalized avatar generation with high fidelity in expressions and identity.
Abstract: Different forms of customized 2D avatars are widely used in gaming applications, virtual communication, education, and content creation. However, existing approaches often fail to capture fine-grained facial expressions and struggle to preserve identity across different expressions. We propose GEN-AFFECT, a novel framework for personalized avatar generation that generates expressive and identity-consistent avatars with a diverse set of facial expressions. Our framework proposes conditioning a multimodal diffusion transformer on an extracted identity-expression representation. This enables identity preservation and representation of a wide range of facial expressions. GEN-AFFECT additionally employs consistent attention at inference for information sharing across the set of generated expressions, enabling the generation process to maintain identity consistency over the array of generated fine-grained expressions. GEN-AFFECT demonstrates superior performance compared to previous state-of-the-art methods on the basis of the accuracy of the generated expressions, the preservation of the identity and the consistency of the target identity across an array of fine-grained facial expressions.
[122] Event-driven Robust Fitting on Neuromorphic Hardware
Tam Ngoc-Bang Nguyen, Anh-Dzung Doan, Zhipeng Cai, Tat-Jun Chin
Main category: cs.CV
TL;DR: The paper introduces an energy-efficient robust fitting method for computer vision using neuromorphic computing (Intel Loihi 2), achieving 15% of the energy consumption of traditional CPU methods.
Details
Motivation: Energy efficiency in robust fitting is underexplored but critical for sustainable AI adoption.
Method: A novel spiking neural network is designed for neuromorphic hardware, with event-driven model estimation and algorithmic adaptations for hardware limitations.
Result: The neuromorphic approach consumes only 15% of the energy of standard CPU methods while maintaining equivalent accuracy.
Conclusion: Neuromorphic computing offers a promising, energy-efficient solution for robust fitting in computer vision.
Abstract: Robust fitting of geometric models is a fundamental task in many computer vision pipelines. Numerous innovations have been produced on the topic, from improving the efficiency and accuracy of random sampling heuristics to generating novel theoretical insights that underpin new approaches with mathematical guarantees. However, one aspect of robust fitting that has received little attention is energy efficiency. This performance metric has become critical as high energy consumption is a growing concern for AI adoption. In this paper, we explore energy-efficient robust fitting via the neuromorphic computing paradigm. Specifically, we designed a novel spiking neural network for robust fitting on real neuromorphic hardware, the Intel Loihi 2. Enabling this are novel event-driven formulations of model estimation that allow robust fitting to be implemented in the unique architecture of Loihi 2, and algorithmic strategies to alleviate the current limited precision and instruction set of the hardware. Results show that our neuromorphic robust fitting consumes only a fraction (15%) of the energy required to run the established robust fitting algorithm on a standard CPU to equivalent accuracy.
[123] CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios
Jialei Xu, Zizhuang Wei, Weikang You, Linyun Li, Weijian Sun
Main category: cs.CV
TL;DR: CitySeg is a foundation model for city-scale point cloud semantic segmentation, using text modality for open vocabulary and zero-shot inference, achieving SOTA performance.
Details
Motivation: Addressing limited 3D data scale and domain gaps in existing models to improve generalization for UAV perception systems.
Method: Custom data preprocessing, local-global cross-attention network, hierarchical classification strategy, and two-stage training with hinge loss.
Result: SOTA performance on nine benchmarks and zero-shot generalization in city-scale point clouds.
Conclusion: CitySeg effectively tackles domain gaps and label discrepancies, enabling advanced UAV perception without visual data.
Abstract: Semantic segmentation of city-scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding. However, existing models are frequently constrained by the limited scale of 3D data and the domain gap between datasets, which lead to reduced generalization capability. To address these challenges, we propose CitySeg, a foundation model for city-scale point cloud semantic segmentation that incorporates text modality to achieve open vocabulary segmentation and zero-shot inference. Specifically, in order to mitigate the issue of non-uniform data distribution across multiple domains, we customize the data preprocessing rules, and propose a local-global cross-attention network to enhance the perception capabilities of point networks in UAV scenarios. To resolve semantic label discrepancies across datasets, we introduce a hierarchical classification strategy. A hierarchical graph established according to the data annotation rules consolidates the data labels, and the graph encoder is used to model the hierarchical relationships between categories. In addition, we propose a two-stage training strategy and employ hinge loss to increase the feature separability of subcategories. Experimental results demonstrate that the proposed CitySeg achieves state-of-the-art (SOTA) performance on nine closed-set benchmarks, significantly outperforming existing approaches. Moreover, for the first time, CitySeg enables zero-shot generalization in city-scale point cloud scenarios without relying on visual information.
[124] Leveraging Failed Samples: A Few-Shot and Training-Free Framework for Generalized Deepfake Detection
Shibo Yao, Renshuai Tao, Xiaolong Zheng, Chao Liang, Chunjie Zhang
Main category: cs.CV
TL;DR: The paper proposes FTNet, a few-shot training-free network for deepfake detection, outperforming existing methods by 8.7% by leveraging failed samples for improvement.
Details
Motivation: Addressing the challenge of poor generalization in deepfake detection by treating it as a few-shot task, where minimal samples can enhance performance.
Method: FTNet uses one fake sample from an evaluation set, comparing test samples to known fake and real samples without training or parameter updates.
Result: Achieves state-of-the-art performance with an 8.7% average improvement over existing methods on 29 generative models.
Conclusion: Leveraging failed samples in few-shot scenarios improves deepfake detection, offering a practical real-world solution.
Abstract: Recent deepfake detection studies often treat unseen sample detection as a "zero-shot" task, training on images generated by known models but generalizing to unknown ones. A key real-world challenge arises when a model performs poorly on unknown samples, yet these samples remain available for analysis. This highlights that it should be approached as a "few-shot" task, where effectively utilizing a small number of samples can lead to significant improvement. Unlike typical few-shot tasks focused on semantic understanding, deepfake detection prioritizes image realism, which closely mirrors real-world distributions. In this work, we propose the Few-shot Training-free Network (FTNet) for real-world few-shot deepfake detection. Simple yet effective, FTNet differs from traditional methods that rely on large-scale known data for training. Instead, FTNet uses only one fake sample from an evaluation set, mimicking the scenario where new samples emerge in the real world and can be gathered for use, without any training or parameter updates. During evaluation, each test sample is compared to the known fake and real samples, and it is classified based on the category of the nearest sample. We conduct a comprehensive analysis of AI-generated images from 29 different generative models and achieve a new SoTA performance, with an average improvement of 8.7% compared to existing methods. This work introduces a fresh perspective on real-world deepfake detection: when the model struggles to generalize on a few-shot sample, leveraging the failed samples leads to better performance.
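The decision rule described above is simple enough to sketch directly. Below is a minimal illustration, assuming embeddings come from some frozen pretrained encoder; the encoder, variable names, and cosine metric are placeholder assumptions rather than the authors' code.

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(test_feat, real_feats, fake_feat):
    # Similarity to each known real sample and to the single known fake.
    best_real = max(cosine_sim(test_feat, r) for r in real_feats)
    fake_sim = cosine_sim(test_feat, fake_feat)
    # The label follows the nearest known sample; no training or updates.
    return "fake" if fake_sim > best_real else "real"

rng = np.random.default_rng(0)
real_feats = [rng.normal(size=512) for _ in range(8)]   # known real embeddings
fake_feat = rng.normal(size=512)                        # the one known fake
print(classify(rng.normal(size=512), real_feats, fake_feat))
```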
[125] From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts
Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Chengming Xu, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Main category: cs.CV
TL;DR: The paper introduces a Mixture of Facial Experts (MoFE) and a tailored data pipeline to improve identity preservation in video generation for large facial angles, outperforming prior methods.
Details
Motivation: Current video generation models fail to preserve identity under large facial angles due to ineffective feature integration in DiT structures and lack of suitable datasets.
Method: Proposes MoFE, combining three experts for identity, semantics, and details, and a data pipeline with Face Constraints and Identity Consistency to create the LFA Dataset.
Result: Outperforms SOTA methods in face similarity, FID, and CLIP alignment on the LFA benchmark.
Conclusion: The MoFE and LFA Dataset effectively address identity preservation in large-angle video generation, with code and dataset made public.
Abstract: Current video generation models struggle with identity preservation under large facial angles, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into DiT structure, and the lack of targeted coverage of large facial angles in existing open-source video datasets. To address these, we present two key innovations. First, we introduce a Mixture of Facial Experts (MoFE) that dynamically combines complementary cues from three specialized experts, each designed to capture distinct but mutually reinforcing aspects of facial attributes. The identity expert captures cross-pose identity-sensitive features, the semantic expert extracts high-level visual semantics, and the detail expert preserves pixel-level features (e.g., skin texture, color gradients). Furthermore, to mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. Face Constraints ensure facial angle diversity and a high proportion of facial regions, while Identity Consistency preserves coherent person-specific features across temporal sequences, collectively addressing the scarcity of large facial angles and identity-stable training data in existing datasets. Leveraging this pipeline, we have curated and refined a Large Face Angles (LFA) Dataset from existing open-source human video datasets, comprising 460K video clips with annotated facial angles. Experimental results on the LFA benchmark demonstrate that our method, empowered by the LFA dataset, significantly outperforms prior SOTA methods in face similarity, face FID, and CLIP semantic alignment. The code and dataset will be made publicly available at https://github.com/rain152/LFA-Video-Generation.
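As a rough illustration of the expert-fusion idea, the toy module below routes a feature through three stand-in expert branches and mixes them with learned weights; the actual experts are specialized extractors inside a DiT pipeline, and the softmax router here is an assumption.

```python
import torch
import torch.nn as nn

class MoFE(nn.Module):
    """Toy mixture of three facial experts (identity / semantics / details).
    Linear layers stand in for the paper's specialized expert backbones."""
    def __init__(self, dim=256):
        super().__init__()
        self.identity = nn.Linear(dim, dim)
        self.semantic = nn.Linear(dim, dim)
        self.detail = nn.Linear(dim, dim)
        self.router = nn.Linear(dim, 3)  # per-expert mixing weights (assumed)

    def forward(self, x):
        w = torch.softmax(self.router(x), dim=-1)            # (B, 3)
        experts = torch.stack(
            [self.identity(x), self.semantic(x), self.detail(x)], dim=1
        )                                                    # (B, 3, dim)
        return (w.unsqueeze(-1) * experts).sum(dim=1)        # weighted fusion

feat = torch.randn(2, 256)
print(MoFE()(feat).shape)  # torch.Size([2, 256])
```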
[126] CLIP-Flow: A Universal Discriminator for AI-Generated Images Inspired by Anomaly Detection
Zhipeng Yuan, Kai Wang, Weize Quan, Dong-Ming Yan, Tieru Wu
Main category: cs.CV
TL;DR: A universal AI-generated image detector is proposed using anomaly detection, trained without AIIs but with proxy images, achieving strong performance across unseen generative models.
Details
Motivation: Addressing the security concerns of AI-generated images (AIIs) by improving detection performance for unseen generative models, as conventional methods often fail.
Method: Uses a pre-trained CLIP encoder for feature extraction and a normalizing flow-like unsupervised model, trained with proxy images (spectrally modified natural images) instead of AIIs.
Result: Demonstrates effectiveness in detecting AIIs from various unseen generative models.
Conclusion: The proposed detector is generalizable and effective for detecting AIIs without requiring access to actual AIIs during training.
Abstract: With the rapid advancement of AI generative models, the visual quality of AI-generated images (AIIs) has become increasingly close to natural images, which inevitably raises security concerns. Most AII detectors employ the conventional image classification pipeline with natural images and AIIs (generated by a generative model), which can result in limited detection performance for AIIs from unseen generative models. To solve this, we propose a universal AI-generated image detector from the perspective of anomaly detection. Our discriminator does not need to access any AIIs and learns a generalizable representation with unsupervised learning. Specifically, we use the pre-trained CLIP encoder as the feature extractor and design a normalizing flow-like unsupervised model. Instead of AIIs, proxy images, e.g., obtained by applying a spectral modification operation on natural images, are used for training. Our models are trained by minimizing the likelihood of proxy images, optionally combined with maximizing the likelihood of natural images. Extensive experiments demonstrate the effectiveness of our method on AIIs produced by various image generators.
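The proxy-image construction can be pictured as a spectral edit of a natural image. The sketch below damps high-frequency content via the FFT; the paper's exact spectral modification may differ, and the normalizing flow is only indicated in comments.

```python
import numpy as np

def spectral_proxy(img, cutoff=0.25, damp=0.5):
    """Build a proxy image by attenuating high-frequency spectral content.
    One plausible spectral modification; the paper's exact operation may differ."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)  # normalized radius
    mask = np.where(r > cutoff, damp, 1.0)                  # damp outer spectrum
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

natural = np.random.rand(64, 64)
proxy = spectral_proxy(natural)
# Training (schematically): fit a flow on CLIP features, minimizing the
# likelihood of proxies and optionally maximizing it for naturals; at test
# time, a low likelihood flags the input as a likely AI-generated image.
```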
[127] GazeLT: Visual attention-guided long-tailed disease classification in chest radiographs
Moinak Bhattacharya, Gagandeep Singh, Shubham Jain, Prateek Prasanna
Main category: cs.CV
TL;DR: GazeLT improves long-tailed disease classification by integrating radiologists’ gaze patterns into a deep learning framework, outperforming baselines by significant margins.
Details
Motivation: Radiologists' gaze patterns capture fine-grained and coarse disease information, which can enhance automated image interpretation, especially for long-tailed classes.
Method: GazeLT uses an integration-disintegration mechanism to harness temporal aspects of visual search, applied to NIH-CXR-LT and MIMIC-CXR-LT datasets.
Result: GazeLT outperforms the best long-tailed loss by 4.1% and visual attention baselines by 21.7% in average accuracy.
Conclusion: GazeLT effectively leverages radiologists’ gaze for improved long-tailed disease classification, with code publicly available.
Abstract: In this work, we present GazeLT, a human visual attention integration-disintegration approach for long-tailed disease classification. A radiologist’s eye gaze has distinct patterns that capture both fine-grained and coarser level disease related information. While interpreting an image, a radiologist’s attention varies throughout the duration; it is critical to incorporate this into a deep learning framework to improve automated image interpretation. Another important aspect of visual attention is that apart from looking at major/obvious disease patterns, experts also look at minor/incidental findings (few of these constituting long-tailed classes) during the course of image interpretation. GazeLT harnesses the temporal aspect of the visual search process, via an integration and disintegration mechanism, to improve long-tailed disease classification. We show the efficacy of GazeLT on two publicly available datasets for long-tailed disease classification, namely the NIH-CXR-LT (n=89237) and the MIMIC-CXR-LT (n=111898) datasets. GazeLT outperforms the best long-tailed loss by 4.1% and the visual attention-based baseline by 21.7% in average accuracy metrics for these datasets. Our code is available at https://github.com/lordmoinak1/gazelt.
[128] SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images
Xuejun Huang, Xinyi Liu, Yi Wan, Zhi Zheng, Bin Zhang, Mingtao Xiong, Yingying Pei, Yongjun Zhang
Main category: cs.CV
TL;DR: SkySplat is a self-supervised framework for 3D scene reconstruction from sparse-view satellite images, integrating RPC models into generalizable 3DGS for improved accuracy and speed.
Details
Motivation: Existing 3DGS methods are incompatible with RPC models and struggle with multi-temporal satellite images due to limited geometric constraints, transient objects, and radiometric inconsistencies.
Method: SkySplat uses RGB images and radiometric-robust relative height supervision, avoiding ground-truth height maps. It includes a Cross-Self Consistency Module (CSCM) for transient object mitigation and multi-view consistency aggregation.
Result: SkySplat achieves an 86x speedup over EOGS with higher accuracy, reduces MAE from 13.18 m to 1.80 m on DFC19, and shows strong cross-dataset generalization on MVS3D.
Conclusion: SkySplat effectively addresses limitations of existing methods, offering a fast, accurate, and generalizable solution for satellite image reconstruction.
Abstract: Three-dimensional scene reconstruction from sparse-view satellite images is a long-standing and challenging task. While 3D Gaussian Splatting (3DGS) and its variants have recently attracted attention for their high efficiency, existing methods remain unsuitable for satellite images due to incompatibility with rational polynomial coefficient (RPC) models and limited generalization capability. Recent advances in generalizable 3DGS approaches show potential, but they perform poorly on multi-temporal sparse satellite images due to limited geometric constraints, transient objects, and radiometric inconsistencies. To address these limitations, we propose SkySplat, a novel self-supervised framework that integrates the RPC model into the generalizable 3DGS pipeline, enabling more effective use of sparse geometric cues for improved reconstruction. SkySplat relies only on RGB images and radiometric-robust relative height supervision, thereby eliminating the need for ground-truth height maps. Key components include a Cross-Self Consistency Module (CSCM), which mitigates transient object interference via consistency-based masking, and a multi-view consistency aggregation strategy that refines reconstruction results. Compared to per-scene optimization methods, SkySplat achieves an 86 times speedup over EOGS with higher accuracy. It also outperforms generalizable 3DGS baselines, significantly reducing MAE from 13.18 m to 1.80 m on the DFC19 dataset, and demonstrates strong cross-dataset generalization on the MVS3D benchmark.
[129] SARE: Semantic-Aware Reconstruction Error for Generalizable Diffusion-Generated Image Detection
Ju Yeon Kang, Jaehong Park, Semin Kim, Ji Won Yoon, Nam Soo Kim
Main category: cs.CV
TL;DR: The paper introduces Semantic-Aware Reconstruction Error (SARE) to detect fake images by measuring semantic differences between images and their caption-guided reconstructions, outperforming existing methods on unseen generative models.
Details
Motivation: Existing detection methods fail on out-of-distribution fake images due to reliance on model-specific artifacts. Fake images often align closely with captions, unlike real images.
Method: Proposes SARE, a measure of semantic difference between an image and its caption-guided reconstruction, leveraging the hypothesis that fake images show minimal semantic shifts.
Result: SARE demonstrates strong generalization, outperforming baselines on benchmarks like GenImage and CommunityForensics.
Conclusion: SARE provides a robust, generalizable solution for detecting fake images across diverse generative models.
Abstract: Recently, diffusion-generated image detection has gained increasing attention, as the rapid advancement of diffusion models has raised serious concerns about their potential misuse. While existing detection methods have achieved promising results, their performance often degrades significantly when facing fake images from unseen, out-of-distribution (OOD) generative models, since they primarily rely on model-specific artifacts. To address this limitation, we explore a fundamental property commonly observed in fake images. Motivated by the observation that fake images tend to exhibit higher similarity to their captions than real images, we propose a novel representation, namely Semantic-Aware Reconstruction Error (SARE), that measures the semantic difference between an image and its caption-guided reconstruction. The hypothesis behind SARE is that real images, whose captions often fail to fully capture their complex visual content, may undergo noticeable semantic shifts during the caption-guided reconstruction process. In contrast, fake images, which closely align with their captions, show minimal semantic changes. By quantifying these semantic shifts, SARE can be utilized as a discriminative feature for robust detection across diverse generative models. We empirically demonstrate that the proposed method exhibits strong generalization, outperforming existing baselines on benchmarks including GenImage and CommunityForensics.
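Schematically, SARE is one minus the semantic similarity between an image and its caption-guided reconstruction. The sketch below uses trivial stand-ins for the encoder and reconstructor purely to show the computation; in the paper these would be, e.g., a semantic image encoder and a caption-conditioned diffusion process.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins (assumptions, not the paper's components): a semantic encoder
# and a caption-guided reconstructor.
def embed(image):
    return image.reshape(-1)[:512]

def reconstruct(image, caption):
    return image + 0.01 * rng.normal(size=image.shape)

def sare_score(image, caption):
    recon = reconstruct(image, caption)
    a, b = embed(image), embed(recon)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Small shift => image aligns with its caption => likely fake.
    return 1.0 - cos

img = rng.random((32, 32, 3))
print(sare_score(img, "a photo of a dog"))
```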
[130] CWFBind: Geometry-Awareness for Fast and Accurate Protein-Ligand Docking
Liyan Jia, Chuan-Xian Ren, Hong Yan
Main category: cs.CV
TL;DR: CWFBind is a fast, accurate docking method using local curvature features to improve geometric representation and address class imbalance in pocket prediction.
Details
Motivation: Traditional and some deep learning-based docking methods neglect geometric information, leading to inaccurate pocket localization and binding conformations.
Method: CWFBind integrates local curvature descriptors, degree-aware weighting, and a ligand-aware dynamic radius strategy with an enhanced loss function.
Result: CWFBind achieves competitive performance in docking benchmarks, balancing accuracy and efficiency.
Conclusion: CWFBind offers a robust solution for accurate ligand-protein binding prediction by leveraging geometric and dynamic features.
Abstract: Accurately predicting the binding conformation of small-molecule ligands to protein targets is a critical step in rational drug design. Although recent deep learning-based docking surpasses traditional methods in speed and accuracy, many approaches rely on graph representations and language model-inspired encoders while neglecting critical geometric information, resulting in inaccurate pocket localization and unrealistic binding conformations. In this study, we introduce CWFBind, a weighted, fast, and accurate docking method based on local curvature features. Specifically, we integrate local curvature descriptors during the feature extraction phase to enrich the geometric representation of both proteins and ligands, complementing existing chemical, sequence, and structural features. Furthermore, we embed degree-aware weighting mechanisms into the message passing process, enhancing the model’s ability to capture spatial structural distinctions and interaction strengths. To address the class imbalance challenge in pocket prediction, CWFBind employs a ligand-aware dynamic radius strategy alongside an enhanced loss function, facilitating more precise identification of binding regions and key residues. Comprehensive experimental evaluations demonstrate that CWFBind achieves competitive performance across multiple docking benchmarks, offering a balanced trade-off between accuracy and efficiency.
[131] Generation of Indian Sign Language Letters, Numbers, and Words
Ajeet Kumar Yadav, Nishant Kumar, Rathna G N
Main category: cs.CV
TL;DR: A GAN variant combining ProGAN and SAGAN improves Indian Sign Language image generation, outperforming ProGAN in quality metrics and releasing a new dataset.
Details
Motivation: Addressing the gap in high-quality sign language generation for better communication between hearing and hard-of-hearing individuals.
Method: Developed a modified Attention-based GAN combining ProGAN and SAGAN to generate high-resolution, feature-rich sign language images.
Result: Outperformed ProGAN in Inception Score (3.2 improvement) and Fréchet Inception Distance (30.12 improvement).
Conclusion: The proposed GAN variant enhances sign language image generation and contributes a new dataset for Indian Sign Language.
Abstract: Sign language, which contains hand movements, facial expressions and bodily gestures, is a significant medium for communicating with hard-of-hearing people. A well-trained sign language community communicates easily, but those who don’t know sign language face significant challenges. Recognition and generation are basic communication methods between hearing and hard-of-hearing individuals. Despite progress in recognition, sign language generation still needs to be explored. The Progressive Growing of Generative Adversarial Network (ProGAN) excels at producing high-quality images, while the Self-Attention Generative Adversarial Network (SAGAN) generates feature-rich images at medium resolutions. Balancing resolution and detail is crucial for sign language image generation. We are developing a Generative Adversarial Network (GAN) variant that combines both models to generate feature-rich, high-resolution, and class-conditional sign language images. Our modified Attention-based model generates high-quality images of Indian Sign Language letters, numbers, and words, outperforming the traditional ProGAN in Inception Score (IS) and Fréchet Inception Distance (FID), with improvements of 3.2 and 30.12, respectively. Additionally, we are publishing a large dataset incorporating high-quality images of Indian Sign Language alphabets, numbers, and 129 words.
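For reference, the self-attention ingredient borrowed from SAGAN is typically implemented as below (the standard formulation, not necessarily the authors' exact variant; ProGAN's progressive growing is omitted).

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over feature maps."""
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)
        self.k = nn.Conv2d(ch, ch // 8, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.k(x).flatten(2)                   # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)        # (B, HW, HW)
        v = self.v(x).flatten(2)                   # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                # residual attention

x = torch.randn(1, 64, 16, 16)
print(SelfAttention2d(64)(x).shape)  # torch.Size([1, 64, 16, 16])
```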
[132] SOI is the Root of All Evil: Quantifying and Breaking Similar Object Interference in Single Object Tracking
Yipei Wang, Shiyu Hu, Shukun Jia, Panxi Xu, Hongfei Ma, Yiping Ma, Jing Zhang, Xiaobo Lu, Xin Zhao
Main category: cs.CV
TL;DR: The paper investigates Similar Object Interference (SOI) in Single Object Tracking (SOT), introduces SOIBench for semantic cognitive guidance, and proposes a VLM-based paradigm for improved tracking performance.
Details
Motivation: To address the overlooked bottleneck of SOI in SOT and explore the potential of external cognitive guidance, particularly natural language, to enhance tracking robustness.
Method: Conducts controlled OIM experiments to quantify SOI, constructs SOIBench for semantic guidance evaluation, and integrates large-scale VLMs into RGB trackers.
Result: Eliminating SOI improves SOT performance (AUC up to 4.35). Existing VLT methods fail with semantic guidance, while the proposed VLM approach achieves AUC gains up to 0.93.
Conclusion: SOIBench sets a standard for semantic cognitive tracking research, and the VLM paradigm offers significant advancements over current methods.
Abstract: In this paper, we present the first systematic investigation and quantification of Similar Object Interference (SOI), a long-overlooked yet critical bottleneck in Single Object Tracking (SOT). Through controlled Online Interference Masking (OIM) experiments, we quantitatively demonstrate that eliminating interference sources leads to substantial performance improvements (AUC gains up to 4.35) across all SOTA trackers, directly validating SOI as a primary constraint for robust tracking and highlighting the feasibility of external cognitive guidance. Building upon these insights, we adopt natural language as a practical form of external guidance, and construct SOIBench, the first semantic cognitive guidance benchmark specifically targeting SOI challenges. It automatically mines SOI frames through multi-tracker collective judgment and introduces a multi-level annotation protocol to generate precise semantic guidance texts. Systematic evaluation on SOIBench reveals a striking finding: existing vision-language tracking (VLT) methods fail to effectively exploit semantic cognitive guidance, achieving only marginal improvements or even performance degradation (AUC changes of -0.26 to +0.71). In contrast, we propose a novel paradigm employing large-scale vision-language models (VLM) as external cognitive engines that can be seamlessly integrated into arbitrary RGB trackers. This approach demonstrates substantial improvements under semantic cognitive guidance (AUC gains up to 0.93), representing a significant advancement over existing VLT methods. We hope SOIBench will serve as a standardized evaluation platform to advance semantic cognitive tracking research and contribute new insights to the tracking research community.
[133] COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets
Lingyu Chen, Yawen Zeng, Yue Wang, Peng Wan, Guo-chen Ning, Hongen Liao, Daoqiang Zhang, Fang Chen
Main category: cs.CV
TL;DR: The paper proposes COME, a framework for multi-heterogeneous ultrasound datasets, addressing inter-dataset interference while preserving discriminative features for robust performance.
Details
Motivation: Overcoming limitations of single-dataset training in ultrasound image analysis due to data scarcity and noise, aiming for a universal solution.
Method: COME uses dual structure-semantic shared experts and source-specific experts to create a universal representation space and extract discriminative features.
Result: COME outperforms state-of-the-art methods, showing significant improvements in mean AP across three evaluation modes.
Conclusion: COME provides a robust and universal solution for ultrasound image analysis, leveraging cross-dataset experience and universal priors.
Abstract: Conventional single-dataset training often fails with new data distributions, especially in ultrasound (US) image analysis due to limited data, acoustic shadows, and speckle noise. Therefore, constructing a universal framework for multi-heterogeneous US datasets is imperative. However, a key challenge arises: how to effectively mitigate inter-dataset interference while preserving dataset-specific discriminative features for robust downstream tasks? Previous approaches utilize either a single source-specific decoder or a domain adaptation strategy, but these methods experience a decline in performance when applied to other domains. Considering this, we propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). Specifically, COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features through providing complementary features. This design enables robust generalization by leveraging cross-dataset experience distributions and providing universal US priors for small-batch or unseen data scenarios. Extensive experiments under three evaluation modes (single-dataset, intra-organ, and inter-organ integration datasets) demonstrate COME’s superiority, achieving significant mean AP improvements over state-of-the-art methods. Our project is available at: https://universalcome.github.io/UniversalCOME/.
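A toy rendering of the shared-plus-source-specific layout: a shared expert builds the universal representation and a per-dataset expert contributes complementary discriminative features. Linear layers stand in for the paper's structure-semantic experts.

```python
import torch
import torch.nn as nn

class ToyCOME(nn.Module):
    """Sketch of shared + source-specific experts (a simplification;
    the paper's experts are structure- and semantics-oriented subnetworks)."""
    def __init__(self, dim=128, n_sources=3):
        super().__init__()
        self.shared = nn.Linear(dim, dim)            # universal representation
        self.specific = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_sources)]
        )                                            # one expert per dataset

    def forward(self, x, source_id):
        shared = self.shared(x)
        specific = self.specific[source_id](x)       # discriminative features
        return shared + specific                     # complementary fusion

x = torch.randn(4, 128)
print(ToyCOME()(x, source_id=1).shape)  # torch.Size([4, 128])
```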
[134] Learning Spatial Decay for Vision Transformers
Yuxin Mao, Zhen Qin, Jinxing Zhou, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai
Main category: cs.CV
TL;DR: SDT introduces a Context-Aware Gating mechanism for dynamic spatial decay in Vision Transformers, improving performance on spatially-structured tasks.
Details
Motivation: ViTs lack explicit spatial inductive biases, leading to suboptimal performance on spatially-structured tasks. Existing methods use fixed spatial decay, limiting adaptability.
Method: SDT uses a Context-Aware Gating (CAG) mechanism to generate dynamic, data-dependent decay for patch interactions, combining spatial proximity and content relevance.
Result: SDT outperforms baselines on ImageNet-1K classification and generation tasks.
Conclusion: Data-dependent spatial decay is a promising paradigm for enhancing spatial attention in Vision Transformers.
Abstract: Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases, leading to suboptimal performance on spatially-structured tasks. Existing approaches introduce data-independent spatial decay based on fixed distance metrics, applying uniform attention weighting regardless of image content and limiting adaptability to diverse visual scenarios. Inspired by recent advances in large language models where content-aware gating mechanisms (e.g., GLA, HGRN2, FOX) significantly outperform static alternatives, we present the first successful adaptation of data-dependent spatial decay to 2D vision transformers. We introduce the Spatial Decay Transformer (SDT), featuring a novel Context-Aware Gating (CAG) mechanism that generates dynamic, data-dependent decay for patch interactions. Our approach learns to modulate spatial attention based on both content relevance and spatial proximity. We address the fundamental challenge of 1D-to-2D adaptation through a unified spatial-content fusion framework that integrates Manhattan distance-based spatial priors with learned content representations. Extensive experiments on ImageNet-1K classification and generation tasks demonstrate consistent improvements over strong baselines. Our work establishes data-dependent spatial decay as a new paradigm for enhancing spatial attention in vision transformers.
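One plausible reading of data-dependent spatial decay, offered as an assumption-laden sketch rather than the authors' implementation: each token predicts a non-negative decay rate that scales a Manhattan-distance penalty on its attention logits.

```python
import torch
import torch.nn as nn

def manhattan_prior(h, w):
    """Pairwise Manhattan distances between patches on an h x w grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    return (pos[:, None, :] - pos[None, :, :]).abs().sum(-1)         # (N, N)

class GatedSpatialDecayAttention(nn.Module):
    """Content-dependent spatial decay: per-token gates scale a fixed
    Manhattan-distance penalty (an interpretive sketch of CAG)."""
    def __init__(self, dim, h, w):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.gate = nn.Linear(dim, 1)           # per-token decay strength
        self.register_buffer("dist", manhattan_prior(h, w))
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (B, N, dim), N = h*w
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        decay = nn.functional.softplus(self.gate(x))   # (B, N, 1), >= 0
        logits = logits - decay * self.dist            # data-dependent decay
        return torch.softmax(logits, dim=-1) @ v

x = torch.randn(2, 49, 64)
print(GatedSpatialDecayAttention(64, 7, 7)(x).shape)  # torch.Size([2, 49, 64])
```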
[135] Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li
Main category: cs.CV
TL;DR: GPT-4o’s synthetic image data complements real-world datasets by addressing rare scenarios and providing clean supervision, leading to improved open-source models like Echo-4o.
Details
Motivation: To explore why synthetic images from GPT-4o are valuable despite existing real-world datasets, focusing on rare scenarios and cleaner supervision.
Method: Created Echo-4o-Image, a 180K synthetic dataset from GPT-4o, and fine-tuned the Bagel model. Introduced GenEval++ and Imagine-Bench for better evaluation.
Result: Echo-4o outperforms on benchmarks and improves other models like OmniGen2 and BLIP3-o, showing strong transferability.
Conclusion: Synthetic data from GPT-4o effectively enhances open-source models by filling gaps in real-world datasets and improving text-to-image alignment.
Abstract: Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the dataset’s strong transferability.
[136] Physics-guided Deep Unfolding Network for Enhanced Kronecker Compressive sensing
Gang Qu, Ping Wang, Siming Zheng, Xin Yuan
Main category: cs.CV
TL;DR: The paper proposes an asymmetric Kronecker CS (AKCS) model and a measurement-aware cross attention (MACA) mechanism to improve image compressed sensing (CS) by enhancing measurement incoherence and learning implicit representations, resulting in a state-of-the-art MEUNet.
Details
Motivation: Existing deep networks for image CS lack incoherent compressed measurements and implicit measurement representations, limiting performance. The paper addresses these gaps.
Method: The authors introduce AKCS for better measurement incoherence and MACA for learning implicit representations, integrating them into an unfolding network (MEUNet).
Result: MEUNet achieves state-of-the-art performance in reconstruction accuracy and inference speed.
Conclusion: The proposed AKCS and MACA mechanisms significantly enhance CS performance, demonstrating the effectiveness of MEUNet.
Abstract: Deep networks have achieved remarkable success in image compressed sensing (CS) task, namely reconstructing a high-fidelity image from its compressed measurement. However, existing works are deficient in incoherent compressed measurements at the sensing phase and implicit measurement representations at the reconstruction phase, limiting the overall performance. In this work, we answer two questions: 1) how to improve the measurement incoherence for decreasing the ill-posedness; 2) how to learn informative representations from measurements. To this end, we propose a novel asymmetric Kronecker CS (AKCS) model and theoretically present its better incoherence than previous Kronecker CS with minimal complexity increase. Moreover, we reveal that the unfolding networks’ superiority over non-unfolding ones results from sufficient gradient descents, called explicit measurement representations. We propose a measurement-aware cross attention (MACA) mechanism to learn implicit measurement representations. We integrate AKCS and MACA into widely-used unfolding architecture to get a measurement-enhanced unfolding network (MEUNet). Extensive experiments demonstrate that our MEUNet achieves state-of-the-art performance in reconstruction accuracy and inference speed.
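The Kronecker structure itself is easy to verify numerically: sensing an image with two small per-side matrices equals applying one large Kronecker-product operator. AKCS's asymmetric construction of the two matrices is not reproduced here; random Gaussian matrices are placeholders.

```python
import numpy as np

# Kronecker CS measures an image X with two small matrices instead of one
# huge one: vec(Y) = (B kron A) vec(X) is computed cheaply as Y = A X B^T.
rng = np.random.default_rng(0)
n = 32                       # image side length
m_a, m_b = 12, 10            # per-side measurement counts
A = rng.normal(size=(m_a, n)) / np.sqrt(m_a)
B = rng.normal(size=(m_b, n)) / np.sqrt(m_b)
X = rng.random((n, n))

Y = A @ X @ B.T                                   # (12, 10) measurements
full = np.kron(B, A) @ X.flatten(order="F")       # equivalent flat operator
print(np.allclose(Y.flatten(order="F"), full))    # True
```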
[137] COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection
Peiran Peng, Tingfa Xu, Liqiang Song, Mengqi Zhu, Yuqiang Fang, Jianan Li
Main category: cs.CV
TL;DR: COXNet improves RGBT tiny object detection with cross-layer fusion, dynamic alignment, and optimized label assignment, achieving a 3.32% mAP improvement.
Details
Motivation: Addressing challenges in detecting tiny objects in RGBT imagery, especially in drone-based scenarios with misalignment, low-light, and cluttered backgrounds.
Method: Proposes COXNet with Cross-Layer Fusion, Dynamic Alignment and Scale Refinement, and GeoShape Similarity Measure for label assignment.
Result: Achieves a 3.32% mAP improvement on the RGBTDronePerson dataset.
Conclusion: COXNet effectively enhances detection in complex environments.
Abstract: Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. COXNet achieves a 3.32% mAP$_{50}$ improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments.
[138] Iterative Volume Fusion for Asymmetric Stereo Matching
Yuanting Gao, Linghao Shen
Main category: cs.CV
TL;DR: The paper addresses stereo matching challenges in asymmetric multi-camera systems by proposing a two-phase Iterative Volume Fusion network (IVF-AStereo) to enhance matching accuracy.
Details
Motivation: Traditional stereo matching assumes symmetric visual properties, but asymmetric systems (e.g., tele-wide cameras) disrupt this, complicating cost volume computation.
Method: The proposed IVF-AStereo method first refines the correlation volume with the aggregated concatenation volume, then fuses both volumes to enhance fine details.
Result: The method performs robustly in asymmetric scenarios, handling resolution and color degradation effectively.
Conclusion: Experiments confirm IVF-AStereo’s effectiveness in asymmetric stereo matching.
Abstract: Stereo matching is vital in 3D computer vision, with most algorithms assuming symmetric visual properties between binocular visions. However, the rise of asymmetric multi-camera systems (e.g., tele-wide cameras) challenges this assumption and complicates stereo matching. Visual asymmetry disrupts stereo matching by affecting the crucial cost volume computation. To address this, we explore the matching cost distribution of two established cost volume construction methods in asymmetric stereo. We find that each cost volume experiences distinct information distortion, indicating that both should be comprehensively utilized to solve the issue. Based on this, we propose the two-phase Iterative Volume Fusion network for Asymmetric Stereo matching (IVF-AStereo). Initially, the aggregated concatenation volume refines the correlation volume. Subsequently, both volumes are fused to enhance fine details. Our method excels in asymmetric scenarios and shows robust performance against significant visual asymmetry. Extensive comparative experiments on benchmark datasets, along with ablation studies, confirm the effectiveness of our approach in asymmetric stereo with resolution and color degradation.
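The two cost-volume constructions analyzed in the paper are standard in stereo matching; minimal versions (naive loops, kept for clarity) are sketched below.

```python
import torch

def correlation_volume(fl, fr, max_disp):
    """Correlation cost volume: channel-averaged dot products between left
    features and disparity-shifted right features. fl, fr: (B, C, H, W)."""
    b, c, h, w = fl.shape
    vol = fl.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        vol[:, d, :, d:] = (fl[..., d:] * fr[..., : w - d]).mean(dim=1)
    return vol

def concat_volume(fl, fr, max_disp):
    """Concatenation cost volume: stacks left/right features per disparity,
    leaving similarity computation to subsequent 3D conv layers."""
    b, c, h, w = fl.shape
    vol = fl.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        vol[:, :c, d, :, d:] = fl[..., d:]
        vol[:, c:, d, :, d:] = fr[..., : w - d]
    return vol

fl, fr = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
print(correlation_volume(fl, fr, 4).shape, concat_volume(fl, fr, 4).shape)
```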
[139] GoViG: Goal-Conditioned Visual Navigation Instruction Generation
Fengyi Wu, Yifei Dong, Zhi-Qi Cheng, Yilong Dai, Guangyu Chen, Hang Wang, Qi Dai, Alexander G. Hauptmann
Main category: cs.CV
TL;DR: GoViG generates navigation instructions from raw egocentric visual data, improving adaptability to unseen environments. It uses visual forecasting and instruction generation, integrated in a multimodal model, and outperforms state-of-the-art methods.
Details
Motivation: To create precise navigation instructions without relying on structured inputs like maps or annotations, enhancing adaptability to unstructured environments.
Method: Decomposes the task into visual forecasting (predicting intermediate states) and instruction generation, using an autoregressive multimodal model with tailored objectives and multimodal reasoning strategies.
Result: Achieves superior BLEU-4 and CIDEr scores, with robust cross-domain generalization, outperforming existing methods.
Conclusion: GoViG effectively generates contextually coherent navigation instructions from raw visual data, demonstrating strong performance and adaptability.
Abstract: We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike conventional approaches that rely on structured inputs such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) visual forecasting, which predicts intermediate visual states bridging the initial and goal views; and (2) instruction generation, which synthesizes linguistically coherent instructions grounded in both observed and anticipated visuals. These subtasks are integrated within an autoregressive multimodal large language model trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two complementary multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human cognitive processes during navigation. To evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant improvements over state-of-the-art methods, achieving superior BLEU-4 and CIDEr scores along with robust cross-domain generalization.
[140] Exploring the Equivalence of Closed-Set Generative and Real Data Augmentation in Image Classification
Haowen Wang, Guowei Zhang, Xiang Zhang, Zeyuan Chen, Haiyang Xu, Dou Hoon Kwark, Zhuowen Tu
Main category: cs.CV
TL;DR: The paper explores using synthetic images from generative models for data augmentation in image classification, comparing their effectiveness to real images and open-set augmentation.
Details
Motivation: To determine if synthetic data from generative models can enhance classification performance and quantify its equivalence to real data augmentation.
Method: The study compares real images, closed-set synthetic images, and open-set synthetic images through experiments, measuring their impact on classification performance.
Result: Synthetic data can augment classification but requires more volume to match real data’s performance. The effect varies with baseline dataset size and synthetic data amount.
Conclusion: While real images are preferred, synthetic data can be effectively used for augmentation with quantified scaling guidelines.
Abstract: In this paper, we address a key scientific problem in machine learning: Given a training set for an image classification task, can we train a generative model on this dataset to enhance the classification performance? (i.e., closed-set generative data augmentation). We start by exploring the distinctions and similarities between real images and closed-set synthetic images generated by advanced generative models. Through extensive experiments, we offer systematic insights into the effective use of closed-set synthetic data for augmentation. Notably, we empirically determine the equivalent scale of synthetic images needed for augmentation. In addition, we also show quantitative equivalence between the real data augmentation and open-set generative augmentation (generative models trained using data beyond the given training set). While it aligns with the common intuition that real images are generally preferred, our empirical formulation also offers a guideline to quantify the increased scale of synthetic data augmentation required to achieve comparable image classification performance. Our results on natural and medical image datasets further illustrate how this effect varies with the baseline training set size and the amount of synthetic data incorporated.
[141] Hierarchical Brain Structure Modeling for Predicting Genotype of Glioma
Haotian Tang, Jianwei Chen, Xinrui Tang, Yunjia Wu, Zhengyang Miao, Chao Li
Main category: cs.CV
TL;DR: Hi-SMGNN, a hierarchical framework, improves IDH mutation prediction in gliomas by integrating structural and morphological connectomes, outperforming existing methods.
Details
Motivation: Current prediction methods for IDH mutation status in gliomas are limited by low availability and noise in functional MRI, and ignore the brain's hierarchical organization.
Method: Hi-SMGNN integrates structural and morphological connectomes hierarchically, using a multimodal interaction module (Siamese network and cross-modal attention), multiscale feature fusion, and personalized modular partitioning.
Result: Hi-SMGNN outperforms baseline and state-of-the-art models on the UCSF-PDGM dataset, showing improved robustness and effectiveness.
Conclusion: Hi-SMGNN offers a superior, interpretable, and non-invasive approach for predicting IDH mutation status in gliomas.
Abstract: Isocitrate DeHydrogenase (IDH) mutation status is a crucial biomarker for glioma prognosis. However, current prediction methods are limited by the low availability and noise of functional MRI. Structural and morphological connectomes offer a non-invasive alternative, yet existing approaches often ignore the brain’s hierarchical organisation and multiscale interactions. To address this, we propose Hi-SMGNN, a hierarchical framework that integrates structural and morphological connectomes from regional to modular levels. It features a multimodal interaction module with a Siamese network and cross-modal attention, a multiscale feature fusion mechanism for reducing redundancy, and a personalised modular partitioning strategy to enhance individual specificity and interpretability. Experiments on the UCSF-PDGM dataset demonstrate that Hi-SMGNN outperforms baseline and state-of-the-art models, showing improved robustness and effectiveness in IDH mutation prediction.
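The multimodal interaction module's cross-modal attention can be sketched generically: structural node features attend to morphological ones. Dimensions (90 regions, 64-dim features) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Generic cross-modal attention between structural and morphological
    connectome node embeddings (a sketch, not the paper's exact module)."""
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, struct_feat, morph_feat):   # (B, N_regions, dim) each
        attn = torch.softmax(
            self.q(struct_feat) @ self.k(morph_feat).transpose(-2, -1)
            * self.scale, dim=-1)
        # Structural nodes gather complementary morphological information.
        return struct_feat + attn @ self.v(morph_feat)

s, m = torch.randn(2, 90, 64), torch.randn(2, 90, 64)
print(CrossModalAttention()(s, m).shape)  # torch.Size([2, 90, 64])
```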
[142] Topological Invariant-Based Iris Identification via Digital Homology and Machine Learning
Ahmet Öztel, İsmet Karaca
Main category: cs.CV
TL;DR: A novel iris recognition method using topological invariants (Betti numbers) outperforms deep learning, offering interpretability and efficiency.
Details
Motivation: To develop a compact, interpretable, and accurate biometric identification method using topological invariants, addressing limitations of deep learning in explainability and data efficiency.
Method: Iris images are divided into grids, and Betti0, Betti1, and their ratio are computed for each subregion. These features are used with logistic regression, KNN, and SVM, compared against a CNN.
Result: Logistic regression achieved 97.78% accuracy, surpassing CNN (96.44%) and other models, with low variance.
Conclusion: Topological invariants provide a robust, explainable alternative to deep learning, applicable beyond iris recognition to fields like medical imaging and interpretable AI.
Abstract: Objective - This study presents a biometric identification method based on topological invariants from 2D iris images, representing iris texture via formally defined digital homology and evaluating classification performance. Methods - Each normalized iris image (48x482 pixels) is divided into grids (e.g., 6x54 or 3x27). For each subregion, we compute Betti0, Betti1, and their ratio using a recent algorithm for homology groups in 2D digital images. The resulting invariants form a feature matrix used with logistic regression, KNN, and SVM (with PCA and 100 randomized repetitions). A convolutional neural network (CNN) is trained on raw images for comparison. Results - Logistic regression achieved 97.78 +/- 0.82% accuracy, outperforming CNN (96.44 +/- 1.32%) and other feature-based models. The topological features showed high accuracy with low variance. Conclusion - This is the first use of topological invariants from formal digital homology for iris recognition. The method offers a compact, interpretable, and accurate alternative to deep learning, useful when explainability or limited data is important. Beyond iris recognition, it can apply to other biometrics, medical imaging, materials science, remote sensing, and interpretable AI. It runs efficiently on CPU-only systems and produces robust, explainable features valuable for security-critical domains.
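The per-cell topological features can be computed with standard tools, using the 2D identity b1 = b0 - chi, where chi is the Euler characteristic; this is a conventional computation and not necessarily the authors' exact algorithm, and the binarization threshold and grid split below are assumptions.

```python
import numpy as np
from skimage import measure

def betti_features(region, thresh=0.5):
    """Betti0, Betti1 and their ratio for one binarized grid cell."""
    binary = region > thresh
    b0 = measure.label(binary, connectivity=2).max()     # connected components
    chi = measure.euler_number(binary, connectivity=2)   # chi = b0 - b1 in 2D
    b1 = b0 - chi                                        # number of holes
    return b0, b1, b1 / b0 if b0 else 0.0

iris = np.random.rand(48, 482)                   # normalized iris strip
rows = np.array_split(iris, 6, axis=0)           # 6 x 54 grid of cells
cells = [c for r in rows for c in np.array_split(r, 54, axis=1)]
feats = np.array([betti_features(cell) for cell in cells]).ravel()
print(feats.shape)   # one flat feature vector per image, fed to the classifier
```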
[143] WeatherPrompt: Multi-modality Representation Learning for All-Weather Drone Visual Geo-Localization
Jiahao Wen, Hang Yu, Zhedong Zheng
Main category: cs.CV
TL;DR: WeatherPrompt improves drone geo-localization under diverse weather by fusing image-text modalities and dynamic gating, outperforming existing methods in challenging conditions.
Details
Motivation: Existing drone geo-localization methods struggle with weather perturbations due to limited weather categories and poor feature disentanglement.
Method: WeatherPrompt uses multi-modality learning, training-free weather reasoning, and dynamic gating to create weather-invariant representations.
Result: Achieves significant recall improvements: +13.37% under night and +18.69% under fog/snow conditions.
Conclusion: WeatherPrompt effectively addresses weather-related challenges in drone geo-localization, offering scalable and robust performance.
Abstract: Visual geo-localization for drones faces critical degradation under weather perturbations, e.g., rain and fog, where existing methods struggle with two inherent limitations: 1) Heavy reliance on limited weather categories that constrain generalization, and 2) Suboptimal disentanglement of entangled scene-weather features through pseudo weather categories. We present WeatherPrompt, a multi-modality learning paradigm that establishes weather-invariant representations through fusing the image embedding with the text context. Our framework introduces two key contributions: First, a Training-free Weather Reasoning mechanism that employs off-the-shelf large multi-modality models to synthesize multi-weather textual descriptions through human-like reasoning. It improves the scalability to unseen or complex weather, and could reflect different weather strength. Second, to better disentangle the scene and weather feature, we propose a multi-modality framework with the dynamic gating mechanism driven by the text embedding to adaptively reweight and fuse visual features across modalities. The framework is further optimized by the cross-modal objectives, including image-text contrastive learning and image-text matching, which maps the same scene with different weather conditions closer in the representation space. Extensive experiments validate that, under diverse weather conditions, our method achieves competitive recall rates compared to state-of-the-art drone geo-localization methods. Notably, it improves Recall@1 by 13.37% under night conditions and by 18.69% under fog and snow conditions.
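The text-driven dynamic gating reduces, in its simplest form, to a text-conditioned convex mix of visual feature streams. A minimal sketch with assumed dimensions and a two-stream setup follows.

```python
import torch
import torch.nn as nn

class TextDrivenGate(nn.Module):
    """Sketch of gated fusion: a text (weather) embedding predicts per-channel
    weights that reweight and mix two visual feature streams. Layer sizes and
    the two-stream setup are illustrative assumptions."""
    def __init__(self, txt_dim=512, vis_dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(txt_dim, vis_dim), nn.Sigmoid())

    def forward(self, text_emb, feat_a, feat_b):
        g = self.gate(text_emb)             # (B, vis_dim), values in (0, 1)
        return g * feat_a + (1 - g) * feat_b

fuse = TextDrivenGate()
out = fuse(torch.randn(2, 512), torch.randn(2, 256), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 256])
```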
[144] MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography
Daniel Barco, Marc Stadelmann, Martin Oswald, Ivo Herzig, Lukas Lichtensteiger, Pascal Paysan, Igor Peterlik, Michal Walczak, Bjoern Menze, Frank-Peter Schilling
Main category: cs.CV
TL;DR: MInDI-3D is a 3D diffusion-based model for CBCT artefact removal, reducing radiation exposure by refining sparse-view inputs. It outperforms uncorrected scans and matches 3D U-Net performance, validated by clinicians.
Details
Motivation: To reduce radiation exposure in CBCT imaging by improving artefact removal from sparse-view inputs, extending 2D methods to 3D.
Method: Extends InDI to 3D, uses iterative denoising, and trains on a large pseudo-CBCT dataset (16,182 volumes). Evaluated with quantitative metrics, scalability tests, and clinician assessments.
Result: Achieves 12.96 dB PSNR gain, enables 8x radiation reduction, matches 3D U-Net performance, and generalizes to new scanner geometries. Clinicians rated it sufficient for patient positioning.
Conclusion: MInDI-3D effectively removes CBCT artefacts, reduces radiation, and performs comparably to existing methods, with strong clinical validation.
Abstract: We present MInDI-3D (Medical Inversion by Direct Iteration in 3D), the first 3D conditional diffusion-based model for real-world sparse-view Cone Beam Computed Tomography (CBCT) artefact removal, aiming to reduce imaging radiation exposure. A key contribution is extending the “InDI” concept from 2D to a full 3D volumetric approach for medical images, implementing an iterative denoising process that refines the CBCT volume directly from sparse-view input. A further contribution is the generation of a large pseudo-CBCT dataset (16,182) from chest CT volumes of the CT-RATE public dataset to robustly train MInDI-3D. We performed a comprehensive evaluation, including quantitative metrics, scalability analysis, generalisation tests, and a clinical assessment by 11 clinicians. Our results show MInDI-3D’s effectiveness, achieving a 12.96 (6.10) dB PSNR gain over uncorrected scans with only 50 projections on the CT-RATE pseudo-CBCT (independent real-world) test set and enabling an 8x reduction in imaging radiation exposure. We demonstrate its scalability by showing that performance improves with more training data. Importantly, MInDI-3D matches the performance of a 3D U-Net on real-world scans from 16 cancer patients across distortion and task-based metrics. It also generalises to new CBCT scanner geometries. Clinicians rated our model as sufficient for patient positioning across all anatomical sites and found it preserved lung tumour boundaries well.
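For intuition, here is a 2D toy version of InDI-style direct iteration (MInDI-3D applies the same scheme to full CBCT volumes). The update rule follows the published InDI formulation as best understood; the predictor below is a dummy stand-in for the trained network.

```python
import numpy as np

def indi_restore(y, predict_clean, steps=10):
    """Iterative direct-iteration refinement: start from the degraded input
    and repeatedly blend in the network's clean-image estimate.
    `predict_clean(x_t, t)` stands in for the learned restoration model."""
    x = y.copy()
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        delta = t - t_next
        x = (1 - delta / t) * x + (delta / t) * predict_clean(x, t)
    return x

rng = np.random.default_rng(0)
clean = rng.random((32, 32))
noisy = clean + 0.3 * rng.normal(size=clean.shape)
restored = indi_restore(noisy, lambda x, t: 0.9 * x)  # dummy predictor
print(restored.shape)  # (32, 32)
```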
[145] WEC-DG: Multi-Exposure Wavelet Correction Method Guided by Degradation Description
Ming Zhao, Pingping Liu, Tongshun Zhang, Zhe Zhang
Main category: cs.CV
TL;DR: The paper proposes WEC-DG, a wavelet-based method for multi-exposure correction, addressing challenges like intra-class variability and exposure anomalies. It introduces degradation guidance and wavelet transforms for precise light correction and detail recovery, outperforming existing methods.
Details
Motivation: Current multi-exposure correction methods struggle with intra-class variability and exposure anomalies, especially in single-exposure images. The paper aims to improve adaptability under complex conditions.
Method: The WEC-DG method uses a degradation descriptor in the ECAM for exposure consistency and alignment. It leverages wavelet transforms in the EDRM for light-detail decoupling, enhancing exposure restoration and detail reconstruction.
Result: Experiments on public datasets show WEC-DG outperforms existing methods, achieving significant improvements in performance.
Conclusion: WEC-DG effectively addresses exposure correction challenges, offering superior adaptability and detail recovery, validated by experimental results.
Abstract: Multi-exposure correction technology is essential for restoring images affected by insufficient or excessive lighting, enhancing the visual experience by improving brightness, contrast, and detail richness. However, current multi-exposure correction methods often encounter challenges in addressing intra-class variability caused by diverse lighting conditions, shooting environments, and weather factors, particularly when processing images captured at a single exposure level. To enhance the adaptability of these models under complex imaging conditions, this paper proposes a Wavelet-based Exposure Correction method with Degradation Guidance (WEC-DG). Specifically, we introduce a degradation descriptor within the Exposure Consistency Alignment Module (ECAM) at both ends of the processing pipeline to ensure exposure consistency and achieve final alignment. This mechanism effectively addresses miscorrected exposure anomalies caused by existing methods’ failure to recognize ‘blurred’ exposure degradation. Additionally, we investigate the light-detail decoupling properties of the wavelet transform to design the Exposure Restoration and Detail Reconstruction Module (EDRM), which processes low-frequency information related to exposure enhancement before utilizing high-frequency information as a prior guide for reconstructing spatial domain details. This serial processing strategy guarantees precise light correction and enhances detail recovery. Extensive experiments conducted on multiple public datasets demonstrate that the proposed method outperforms existing algorithms, achieving significant performance improvements and validating its effectiveness and practical applicability.
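The light-detail decoupling rests on the standard 2D discrete wavelet transform: the LL band carries illumination while LH/HL/HH carry detail. The toy adjustment below boosts only the illumination band, in the spirit of (but much simpler than) the EDRM.

```python
import numpy as np
import pywt

# Decompose: LL holds low-frequency illumination, (LH, HL, HH) hold edges
# and texture that a real pipeline would use as a prior for detail recovery.
img = np.random.rand(64, 64)
LL, (LH, HL, HH) = pywt.dwt2(img, "haar")

LL_corrected = LL * 1.4          # toy exposure boost on the illumination band
restored = pywt.idwt2((LL_corrected, (LH, HL, HH)), "haar")
print(restored.shape)            # (64, 64)
```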
[146] A Chain of Diagnosis Framework for Accurate and Explainable Radiology Report Generation
Haibo Jin, Haoxuan Che, Sunan He, Hao Chen
Main category: cs.CV
TL;DR: The paper proposes a trustworthy radiology report generation (RRG) framework called Chain of Diagnosis (CoD) to improve clinical efficacy and explainability by generating QA pairs, grounding diagnoses, and leveraging omni-supervised learning.
Details
Motivation: Existing RRG models lack clinical efficacy in lesion description and explainability, making them untrustworthy for radiologists.
Method: CoD generates QA pairs for key findings, uses a large language model for accurate generation, and includes diagnosis and lesion grounding modules for explainability. Omni-supervised learning is used for training.
Result: CoD outperforms specialist and generalist models on RRG benchmarks, provides accurate grounding of sentences to QA diagnoses and images, and improves radiologists’ efficiency.
Conclusion: The CoD framework enhances trustworthiness in RRG by combining accuracy, explainability, and efficiency, supported by a new dataset and evaluation tool.
Abstract: Despite the progress of radiology report generation (RRG), existing works face two challenges: 1) The performances in clinical efficacy are unsatisfactory, especially for lesion attributes description; 2) the generated text lacks explainability, making it difficult for radiologists to trust the results. To address the challenges, we focus on a trustworthy RRG model, which not only generates accurate descriptions of abnormalities, but also provides the basis for its predictions. To this end, we propose a framework named chain of diagnosis (CoD), which maintains a chain of diagnostic process for clinically accurate and explainable RRG. It first generates question-answer (QA) pairs via diagnostic conversation to extract key findings, then prompts a large language model with QA diagnoses for accurate generation. To enhance explainability, a diagnosis grounding module is designed to match QA diagnoses and generated sentences, where the diagnoses act as a reference. Moreover, a lesion grounding module is designed to locate abnormalities in the image, further improving the working efficiency of radiologists. To facilitate label-efficient training, we propose an omni-supervised learning strategy with clinical consistency to leverage various types of annotations from different datasets. Our efforts lead to 1) an omni-labeled RRG dataset with QA pairs and lesion boxes; 2) an evaluation tool for assessing the accuracy of reports in describing lesion location and severity; 3) extensive experiments to demonstrate the effectiveness of CoD, where it outperforms both specialist and generalist models consistently on two RRG benchmarks and shows promising explainability by accurately grounding generated sentences to QA diagnoses and images.
[147] Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion
Jiwon Kim, Pureum Kim, SeonHwa Kim, Soobin Park, Eunju Cha, Kyong Hwan Jin
Main category: cs.CV
TL;DR: The paper introduces a training-free Dual Recursive Feedback (DRF) system to improve spatial and appearance control in text-to-image diffusion models, addressing limitations in existing methods like Ctrl-X and FreeControl.
Details
Motivation: Existing controllable T2I models struggle with preserving spatial structures and capturing fine-grained conditions like object poses and scene layouts.
Method: The proposed DRF system uses appearance and generation feedback to recursively refine intermediate latents, integrating structural and appearance attributes without additional training.
Result: DRF enables fine-grained, high-quality image generation, including class-invariant structure-appearance fusion (e.g., human motion on a tiger).
Conclusion: The method effectively improves structural and semantic coherence in controllable T2I models, with code available for public use.
Abstract: Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback that recursively refines the intermediate latents to better reflect the given appearance information and the user’s intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation even between class-invariant structure-appearance fusion, such as transferring human motion onto a tiger’s form. Extensive experiments demonstrate the efficacy of our method in producing high-quality, semantically coherent, and structurally consistent image generations. Our source code is available at https://github.com/jwonkm/DRF.
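The summary leaves the actual feedback updates abstract. As a minimal sketch of the dual recursive idea, assuming AdaIN-style statistics matching for the appearance feedback and low-pass content matching for the generation feedback (both are illustrative stand-ins, not the paper's rules):

```python
import torch
import torch.nn.functional as F

def appearance_feedback(z, z_app, lr=0.1):
    # Nudge channel-wise statistics of z toward the appearance latent
    # (an assumed AdaIN-style stand-in for the paper's appearance feedback).
    mu, sd = z.mean((2, 3), keepdim=True), z.std((2, 3), keepdim=True)
    mu_a, sd_a = z_app.mean((2, 3), keepdim=True), z_app.std((2, 3), keepdim=True)
    z_matched = (z - mu) / (sd + 1e-6) * sd_a + mu_a
    return z + lr * (z_matched - z)

def generation_feedback(z, z_struct, lr=0.1, k=7):
    # Nudge low-frequency (layout) content toward the structure latent,
    # leaving high-frequency detail to the ongoing generation.
    low = lambda x: F.avg_pool2d(x, k, stride=1, padding=k // 2)
    return z + lr * (low(z_struct) - low(z))

# One recursive refinement step inside a sampler loop (toy tensors).
z = torch.randn(1, 4, 64, 64)          # intermediate diffusion latent
z_app = torch.randn(1, 4, 64, 64)      # appearance-image latent
z_struct = torch.randn(1, 4, 64, 64)   # structure/pose-condition latent
for _ in range(3):                     # recursive dual feedback
    z = appearance_feedback(z, z_app)
    z = generation_feedback(z, z_struct)
```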
[148] Preacher: Paper-to-Video Agentic System
Jingwei Liu, Ling Yang, Hao Luo, Fan Wang, Hongyan Li, Mengdi Wang
Main category: cs.CV
TL;DR: Preacher is a paper-to-video system that overcomes limitations of current video generation models by decomposing, summarizing, and reformulating papers, then synthesizing diverse video segments into coherent abstracts.
Details
Motivation: Current video generation models lack context, flexibility, and domain-specific knowledge, limiting their effectiveness for paper-to-video tasks.
Method: Preacher uses a top-down decomposition and summarization approach, followed by bottom-up video generation with Progressive Chain of Thought (P-CoT) for iterative planning.
Result: Preacher generates high-quality video abstracts across five research fields, outperforming existing models.
Conclusion: Preacher addresses key limitations of current models and demonstrates superior performance in generating structured video abstracts.
Abstract: The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/GenVerse/Paper2Video
[149] SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs
Bei Yan, Zhiyuan Chen, Yuecong Min, Jie Zhang, Jiahao Wang, Xiaozhen Wang, Shiguang Shan
Main category: cs.CV
TL;DR: The paper introduces SHALE, a scalable benchmark for evaluating hallucinations in Large Vision-Language Models (LVLMs), addressing limitations of prior studies by automating data construction and providing fine-grained analysis.
Details
Motivation: Current LVLMs suffer from hallucinations (inconsistent content), but existing evaluations are coarse, costly, or prone to data leakage. A scalable, automated solution is needed.
Method: Proposes an automated data pipeline and hierarchical hallucination induction framework with input perturbations to create SHALE, a benchmark with 30K+ image-instruction pairs.
Result: Experiments on 20+ LVLMs show significant factuality hallucinations and sensitivity to semantic perturbations.
Conclusion: SHALE offers a scalable, fine-grained solution for evaluating LVLM hallucinations, highlighting their limitations and sensitivity to noise.
Abstract: Despite rapid advances, Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge, which correspond to faithfulness and factuality hallucinations, respectively. Prior studies primarily evaluate faithfulness hallucination at a coarse level (e.g., object-level) and lack fine-grained analysis. Additionally, existing benchmarks rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We also design a hierarchical hallucination induction framework with input perturbations to simulate realistic noisy scenarios. Integrating these designs, we construct SHALE, a Scalable HALlucination Evaluation benchmark designed to assess both faithfulness and factuality hallucinations via a fine-grained hallucination categorization scheme. SHALE comprises over 30K image-instruction pairs spanning 12 representative visual perception aspects for faithfulness and 6 knowledge domains for factuality, considering both clean and noisy scenarios. Extensive experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.
[150] Offline Auto Labeling: BAAS
Stefan Haag, Bharanidhar Duraisamy, Felix Govaers, Wolfgang Koch, Martin Fritzsche, Juergen Dickmann
Main category: cs.CV
TL;DR: BAAS is a framework for radar detection in autonomous driving, combining Bayesian tracking, smoothing, and fusion to provide accurate object trajectories and shape estimation for annotation labels. It evaluates tracking and annotation performance, supports continuous improvement, and is tested in urban scenarios.
Details
Motivation: To improve object tracking and label annotation for radar detections in autonomous driving, ensuring accuracy and adaptability under varying supervision levels.
Method: Uses Bayesian-based tracking, smoothing, and fusion methods to generate object trajectories and shape estimates, with optional manual label integration for closed-loop improvements.
Result: Demonstrated effective tracking and annotation in challenging urban scenarios, adaptable to different dynamic objects and class types.
Conclusion: BAAS offers a robust solution for radar-based object tracking and annotation, with potential for continuous enhancement through modular analysis and integration.
Abstract: This paper introduces BAAS, a new Extended Object Tracking (EOT) and fusion-based label annotation framework for radar detections in autonomous driving. Our framework utilizes Bayesian tracking, smoothing, and eventually fusion methods to produce reliable and precise object trajectories along with shape estimates, providing annotation labels on the detection level under various supervision levels. Simultaneously, the framework provides evaluation of tracking performance and label annotation. If manually labeled data is available, each processing module can be analyzed independently or combined with other modules to enable closed-loop continuous improvements. The framework's performance is evaluated in a challenging urban real-world scenario in terms of tracking performance and label annotation errors. We demonstrate the functionality of the proposed approach for varying dynamic objects and class types.
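The exact trackers and smoothers are not specified in the summary. As a minimal sketch of the Bayesian track-then-smooth pattern BAAS builds on, here is a constant-velocity Kalman filter followed by a Rauch-Tung-Striebel smoother (the 1D motion model, noise values, and function names are illustrative assumptions; real EOT adds shape states):

```python
import numpy as np

# Minimal constant-velocity Kalman filter + RTS smoother (illustrative only;
# BAAS uses Extended Object Tracking with richer shape/dynamics models).
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition: [pos, vel]
H = np.array([[1.0, 0.0]])              # we observe position only
Q = 1e-3 * np.eye(2)                    # process noise covariance
R = np.array([[0.05]])                  # measurement noise covariance

def kalman_rts(zs):
    n = len(zs)
    x, P = np.zeros(2), np.eye(2)
    xs_f, Ps_f, xs_p, Ps_p = [], [], [], []
    for z in zs:                         # forward filtering pass
        x_p, P_p = F @ x, F @ P @ F.T + Q
        S = H @ P_p @ H.T + R
        K = P_p @ H.T @ np.linalg.inv(S)
        x = x_p + K @ (np.atleast_1d(z) - H @ x_p)
        P = (np.eye(2) - K @ H) @ P_p
        xs_f.append(x); Ps_f.append(P); xs_p.append(x_p); Ps_p.append(P_p)
    xs_s, Ps_s = list(xs_f), list(Ps_f)
    for k in range(n - 2, -1, -1):       # backward RTS smoothing pass
        C = Ps_f[k] @ F.T @ np.linalg.inv(Ps_p[k + 1])
        xs_s[k] = xs_f[k] + C @ (xs_s[k + 1] - xs_p[k + 1])
        Ps_s[k] = Ps_f[k] + C @ (Ps_s[k + 1] - Ps_p[k + 1]) @ C.T
    return np.array(xs_s)

# Noisy position measurements of an object moving at roughly 1 m/s.
t = np.arange(0, 5, dt)
zs = t + 0.2 * np.random.randn(len(t))
smoothed = kalman_rts(zs)
print(smoothed[:3])  # smoothed [pos, vel] estimates
```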
[151] SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing
Heyi Sun, Cong Wang, Tian-Xing Xu, Jingwei Huang, Di Kang, Chunchao Guo, Song-Hai Zhang
Main category: cs.CV
TL;DR: SVG-Head introduces a hybrid representation for editable head avatars, combining surface and volumetric Gaussians for high-fidelity rendering and real-time appearance editing.
Details
Motivation: The challenge lies in achieving photorealistic and editable head avatars due to implicit representations and entangled geometry-appearance modeling.
Method: SVG-Head uses surface Gaussians for explicit appearance modeling with texture images and volumetric Gaussians for non-Lambertian regions. A mesh-aware Gaussian UV mapping ensures sharp textures and real-time rendering.
Result: Experiments on the NeRSemble dataset demonstrate high-fidelity rendering and real-time appearance editing capabilities.
Conclusion: SVG-Head is the first method to provide explicit texture images for Gaussian head avatars, enabling real-time editing while maintaining quality.
Abstract: Creating high-fidelity and editable head avatars is a pivotal challenge in computer vision and graphics, boosting many AR/VR applications. While recent advancements have achieved photorealistic renderings and plausible animation, head editing, especially real-time appearance editing, remains challenging due to the implicit representation and entangled modeling of the geometry and global appearance. To address this, we propose Surface-Volumetric Gaussian Head Avatar (SVG-Head), a novel hybrid representation that explicitly models the geometry with 3D Gaussians bound on a FLAME mesh and leverages disentangled texture images to capture the global appearance. Technically, it contains two types of Gaussians, in which surface Gaussians explicitly model the appearance of head avatars using learnable texture images, facilitating real-time texture editing, while volumetric Gaussians enhance the reconstruction quality of non-Lambertian regions (e.g., lips and hair). To model the correspondence between 3D world and texture space, we provide a mesh-aware Gaussian UV mapping method, which leverages UV coordinates given by the FLAME mesh to obtain sharp texture images and real-time rendering speed. A hierarchical optimization strategy is further designed to pursue the optimal performance in both reconstruction quality and editing flexibility. Experiments on the NeRSemble dataset show that SVG-Head not only generates high-fidelity rendering results, but also is the first method to obtain explicit texture images for Gaussian head avatars and support real-time appearance editing.
[152] Images Speak Louder Than Scores: Failure Mode Escape for Enhancing Generative Quality
Jie Shao, Ke Zhu, Minghao Fu, Guo-hua Wang, Jianxin Wu
Main category: cs.CV
TL;DR: FaME improves perceptual quality in diffusion models by using failure modes as negative guidance, without harming FID scores.
Details
Motivation: Addressing the gap where FID scores don't reflect perceptual quality, and CFG introduces artifacts.
Method: FaME uses an image quality assessment model to identify and avoid low-quality generations via negative guidance.
Result: Consistent improvements in visual quality on ImageNet without FID compromise; potential for text-to-image extension.
Conclusion: FaME is a training-free, efficient solution for enhancing perceptual quality in diffusion models.
Abstract: Diffusion models have achieved remarkable progress in class-to-image generation. However, we observe that despite impressive FID scores, state-of-the-art models often generate distorted or low-quality images, especially in certain classes. This gap arises because FID evaluates global distribution alignment, while ignoring the perceptual quality of individual samples. We further examine the role of CFG, a common technique used to enhance generation quality. While effective in improving metrics and suppressing outliers, CFG can introduce distribution shift and visual artifacts due to its misalignment with both training objectives and user expectations. In this work, we propose FaME, a training-free and inference-efficient method for improving perceptual quality. FaME uses an image quality assessment model to identify low-quality generations and stores their sampling trajectories. These failure modes are then used as negative guidance to steer future sampling away from poor-quality regions. Experiments on ImageNet demonstrate that FaME brings consistent improvements in visual quality without compromising FID. FaME also shows the potential to be extended to improve text-to-image generation.
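One plausible reading of "failure modes as negative guidance" is a per-timestep repulsion away from stored failure latents on top of ordinary classifier-free guidance. A toy sketch under that assumption (the denoiser, bank format, and step rule are all stand-ins, not FaME's actual components):

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in epsilon-predictor; FaME is training-free and would wrap a
    pretrained class-conditional diffusion model instead."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 4, 3, padding=1)
    def forward(self, z, cond=None):
        return self.net(z if cond is None else z + cond)

denoiser = ToyDenoiser()
# Latents of previously detected low-quality generations, one per timestep
# (the storage format is an assumption; the summary only says failure-mode
# sampling trajectories are kept).
failure_bank = {t: torch.randn(1, 4, 32, 32) for t in range(10)}

w_cfg, w_neg = 5.0, 0.05
z, cond = torch.randn(1, 4, 32, 32), torch.randn(1, 4, 32, 32)
for t in reversed(range(10)):
    e_uc, e_c = denoiser(z), denoiser(z, cond)
    eps = e_uc + w_cfg * (e_c - e_uc)   # ordinary CFG direction
    # Repulsion away from the stored failure latent at this timestep,
    # one plausible form of negative guidance from failure modes.
    delta = z - failure_bank[t]
    z = z - 0.1 * eps + w_neg * delta / (delta.norm() + 1e-6)
    # (a real sampler would use a DDIM/DDPM update here)
```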
[153] Surg-InvNeRF: Invertible NeRF for 3D tracking and reconstruction in surgical vision
Gerardo Loza, Junlei Hu, Dominic Jones, Sharib Ali, Pietro Valdastri
Main category: cs.CV
TL;DR: A novel test-time optimisation (TTO) approach using an invertible Neural Radiance Field (InvNeRF) for 2D and 3D point tracking in surgical scenarios, outperforming state-of-the-art methods.
Details
Motivation: Current point tracking methods struggle with consistent motion or are limited to 2D. The paper aims to address these limitations by leveraging a NeRF-based architecture for improved tracking.
Method: Proposes InvNeRF for parametrising a function that aggregates correspondences, enabling bidirectional deformable-canonical mapping, efficient workspace handling, and guided ray density. Includes multi-scale HexPlanes for fast inference and a new pixel sampling algorithm.
Result: Outperforms TTO state-of-the-art methods by nearly 50% in 2D tracking and is the first TTO approach for 3D tracking, surpassing feed-forward methods.
Conclusion: The InvNeRF-based TTO approach significantly improves precision and accuracy in point tracking, especially in surgical scenarios, while integrating deformable NeRF-based reconstruction benefits.
Abstract: We propose a novel test-time optimisation (TTO) approach framed by a NeRF-based architecture for long-term 3D point tracking. Most current methods in point tracking struggle to obtain consistent motion or are limited to 2D motion. TTO approaches frame the solution for long-term tracking as optimising a function that aggregates correspondences from other specialised state-of-the-art methods. Unlike the state of the art in TTO, we propose parametrising such a function with our new invertible Neural Radiance Field (InvNeRF) architecture to perform both 2D and 3D tracking in surgical scenarios. Our approach allows us to exploit the advantages of a rendering-based approach by supervising the reprojection of pixel correspondences. It adapts strategies from recent rendering-based methods to obtain a bidirectional deformable-canonical mapping, to efficiently handle a defined workspace, and to guide the rays' density. We also introduce multi-scale HexPlanes for fast inference and a new algorithm for efficient pixel sampling and convergence criteria. We present results on the STIR and SCARE datasets, for evaluating point tracking and testing the integration of kinematic data in our pipeline, respectively. In 2D point tracking, our approach surpasses the precision and accuracy of the TTO state-of-the-art methods by nearly 50% on average precision, while competing with other approaches. In 3D point tracking, ours is the first TTO approach, surpassing feed-forward methods while incorporating the benefits of a deformable NeRF-based reconstruction.
[154] BridgeTA: Bridging the Representation Gap in Knowledge Distillation via Teacher Assistant for Bird’s Eye View Map Segmentation
Beomjun Kim, Suhan Woo, Sejong Heo, Euntai Kim
Main category: cs.CV
TL;DR: BridgeTA is a cost-effective distillation framework that improves BEV map segmentation for camera-only models by using a Teacher Assistant network, outperforming other KD methods.
Details
Motivation: Camera-only BEV segmentation lags behind LiDAR-Camera fusion methods. Existing KD approaches increase model size and cost, prompting the need for a lightweight solution.
Method: BridgeTA introduces a Teacher Assistant network to bridge the gap between teacher and student models, using a shared latent space and a derived distillation loss based on Young's Inequality.
Result: The method improves the Camera-only baseline by 4.2% mIoU on the nuScenes dataset, outperforming other KD methods by up to 45%.
Conclusion: BridgeTA effectively enhances camera-only BEV segmentation without increasing inference cost, offering a practical solution for autonomous driving.
Abstract: Bird’s-Eye-View (BEV) map segmentation is one of the most important and challenging tasks in autonomous driving. Camera-only approaches have drawn attention as cost-effective alternatives to LiDAR, but they still fall behind LiDAR-Camera (LC) fusion-based methods. Knowledge Distillation (KD) has been explored to narrow this gap, but existing methods mainly enlarge the student model by mimicking the teacher’s architecture, leading to higher inference cost. To address this issue, we introduce BridgeTA, a cost-effective distillation framework to bridge the representation gap between LC fusion and Camera-only models through a Teacher Assistant (TA) network while keeping the student’s architecture and inference cost unchanged. A lightweight TA network combines the BEV representations of the teacher and student, creating a shared latent space that serves as an intermediate representation. To ground the framework theoretically, we derive a distillation loss using Young’s Inequality, which decomposes the direct teacher-student distillation path into teacher-TA and TA-student dual paths, stabilizing optimization and strengthening knowledge transfer. Extensive experiments on the challenging nuScenes dataset demonstrate the effectiveness of our method, achieving an improvement of 4.2% mIoU over the Camera-only baseline, up to 45% higher than the improvement of other state-of-the-art KD methods.
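The Young's Inequality decomposition is concrete enough to sketch: since ||t - s||^2 <= 2||t - a||^2 + 2||a - s||^2, the direct teacher-student term can be upper-bounded by teacher-TA and TA-student terms. A minimal version with an assumed TA architecture (the layer sizes and fusion design are guesses):

```python
import torch
import torch.nn as nn

class TeacherAssistant(nn.Module):
    """Lightweight TA fusing teacher (LC-fusion) and student (camera-only)
    BEV features into a shared latent space (layer sizes are assumptions)."""
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.ReLU(),
                                  nn.Conv2d(c, c, 3, padding=1))
    def forward(self, f_t, f_s):
        return self.fuse(torch.cat([f_t, f_s], dim=1))

def bridge_loss(f_t, f_s, ta):
    """Young's Inequality  ||t - s||^2 <= 2||t - a||^2 + 2||a - s||^2
    lets the direct teacher-student distillation term be replaced by the
    two shorter teacher-TA and TA-student paths."""
    f_a = ta(f_t.detach(), f_s)
    return ((f_t.detach() - f_a) ** 2).mean() + ((f_a - f_s) ** 2).mean()

c = 64
ta = TeacherAssistant(c)
f_teacher = torch.randn(2, c, 50, 50)   # BEV feature from LC-fusion teacher
f_student = torch.randn(2, c, 50, 50, requires_grad=True)
loss = bridge_loss(f_teacher, f_student, ta)
loss.backward()                          # student trains; inference unchanged
```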
[155] Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection
Zhiqiu Zhang, Dongqi Fan, Mingjie Wang, Qiang Tang, Jian Yang, Zili Yi
Main category: cs.CV
TL;DR: The paper introduces Region-to-Region (R2R) transformation for image harmonization, addressing detail preservation and dataset limitations with Clear-VAE and Harmony Controller, and proposes a new synthetic dataset, RPHarmony.
Details
Motivation: Current LDM-based harmonization struggles with detail preservation and lacks realistic synthetic datasets, limiting performance.
Method: Proposes R2R transformation, Clear-VAE for detail preservation, Harmony Controller with MACA, and Random Poisson Blending for dataset creation.
Result: Outperforms existing methods in metrics and visual harmony; RPHarmony dataset enhances realism.
Conclusion: R2R and RPHarmony advance harmonization capabilities, with released code, dataset, and model weights for open access.
Abstract: The goal of image harmonization is to adjust the foreground in a composite image to achieve visual consistency with the background. Recently, latent diffusion models (LDMs) have been applied to harmonization, achieving remarkable results. However, LDM-based harmonization faces challenges in detail preservation and limited harmonization ability. Additionally, current synthetic datasets rely on color transfer, which lacks local variations and fails to capture complex real-world lighting conditions. To enhance harmonization capabilities, we propose the Region-to-Region transformation. By injecting information from appropriate regions into the foreground, this approach preserves original details while achieving image harmonization or, conversely, generating new composite data. From this perspective, we propose a novel model, R2R. Specifically, we design Clear-VAE to preserve high-frequency details in the foreground using an Adaptive Filter while eliminating disharmonious elements. To further enhance harmonization, we introduce the Harmony Controller with Mask-aware Adaptive Channel Attention (MACA), which dynamically adjusts the foreground based on the channel importance of both foreground and background regions. To address the limitations of existing datasets, we propose Random Poisson Blending, which transfers color and lighting information from a suitable region to the foreground, thereby generating more diverse and challenging synthetic images. Using this method, we construct a new synthetic dataset, RPHarmony. Experiments demonstrate the superiority of our method over other methods in both quantitative metrics and visual harmony. Moreover, our dataset helps the model generate more realistic images in real examples. Our code, dataset, and model weights have all been released for open access.
[156] Plane Detection and Ranking via Model Information Optimization
Daoxin Zhong, Jun Li, Meng Yee Michael Chuah
Main category: cs.CV
TL;DR: A generalized framework for plane detection using model information optimization is proposed to address RANSAC’s susceptibility to false positives in complex scenes.
Details
Motivation: RANSAC's inlier threshold ambiguity leads to false positives in plane detection, especially in complex real-world scenes with unknown plane counts.
Method: Treats depth readings as discrete random variables, generates models with candidate plane constraints, and optimizes information to determine the most likely ground truth.
Result: Outperforms Open3D RANSAC in accuracy for plane parameter estimation and ranks plane quality by information reduction.
Conclusion: The framework provides an objective mechanism for plane detection, validated by synthetic data, and is accelerated using neural network segmentation for real-world applications.
Abstract: Plane detection from depth images is a crucial subtask with broad robotic applications, often accomplished by iterative methods such as Random Sample Consensus (RANSAC). While RANSAC is a robust strategy with strong probabilistic guarantees, the ambiguity of its inlier threshold criterion makes it susceptible to false positive plane detections. This issue is particularly prevalent in complex real-world scenes, where the true number of planes is unknown and multiple planes coexist. In this paper, we aim to address this limitation by proposing a generalised framework for plane detection based on model information optimization. Building on previous works, we treat the observed depth readings as discrete random variables, with their probability distributions constrained by the ground truth planes. Various models containing different candidate plane constraints are then generated through repeated random sub-sampling to explain our observations. By incorporating the physics and noise model of the depth sensor, we can calculate the information for each model, and the model with the least information is accepted as the most likely ground truth. This information optimization process serves as an objective mechanism for determining the true number of planes and preventing false positive detections. Additionally, the quality of each detected plane can be ranked by summing the information reduction of inlier points for each plane. We validate these properties through experiments with synthetic data and find that our algorithm estimates plane parameters more accurately compared to the default Open3D RANSAC plane segmentation. Furthermore, we accelerate our algorithm by partitioning the depth map using neural network segmentation, which enhances its ability to generate more realistic plane parameters in real-world data.
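A toy version of the model-information selection loop: propose candidate plane sets by repeated random 3-point sampling, score each model by a plane description cost plus the Gaussian code length of every depth point under its best-explaining plane, and keep the minimum. The costs, noise model, and outlier cap are assumed values, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def plane_from_points(p):
    # Fit plane n.x + d = 0 to 3 sampled points.
    n = np.cross(p[1] - p[0], p[2] - p[0])
    n = n / (np.linalg.norm(n) + 1e-12)
    return n, -n @ p[0]

def model_information(points, planes, sigma=0.01, cost_per_plane=100.0):
    # Information of a model: plane description cost plus the code length
    # of each point under its best-explaining plane (Gaussian sensor noise).
    # Points far from every plane fall back on a flat outlier cost.
    info = cost_per_plane * len(planes)
    for x in points:
        residuals = [abs(n @ x + d) for n, d in planes]
        nll = min(0.5 * (r / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi))
                  for r in residuals)
        info += min(nll, 20.0)  # assumed outlier cap
    return info

# Synthetic scene: points on two planes (z = 0 and x = 1) plus noise.
pts = np.vstack([np.c_[rng.uniform(0, 1, (200, 2)), np.zeros(200)],
                 np.c_[np.ones(200), rng.uniform(0, 1, (200, 2))]])
pts += 0.005 * rng.standard_normal(pts.shape)

# Propose models with 1..4 planes via random sub-sampling; keep the one
# with minimal information (this is the false-positive guard: an extra
# plane must reduce code length by more than its description cost).
best = None
for k in range(1, 5):
    for _ in range(50):
        planes = [plane_from_points(pts[rng.choice(len(pts), 3, replace=False)])
                  for _ in range(k)]
        score = model_information(pts, planes)
        if best is None or score < best[0]:
            best = (score, planes)
print(f"best model has {len(best[1])} plane(s), info={best[0]:.1f}")
```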
[157] Semantic-aware DropSplat: Adaptive Pruning of Redundant Gaussians for 3D Aerial-View Segmentation
Xu Tang, Junan Jia, Yijing Wang, Jingjing Ma, Xiangrong Zhang
Main category: cs.CV
TL;DR: SAD-Splat improves 3D aerial-view scene segmentation by addressing semantic ambiguity with a Gaussian point drop module and pseudo-label generation, validated on a new benchmark dataset.
Details
Motivation: Traditional methods fail to handle semantic ambiguity from scale variations and occlusions in aerial images, limiting segmentation accuracy.
Method: Introduces a Gaussian point drop module with semantic confidence estimation and a pseudo-label generation pipeline using 2D foundation models.
Result: Achieves a balance between segmentation accuracy and compactness, outperforming traditional methods.
Conclusion: SAD-Splat provides an efficient, scalable solution for 3D aerial scene understanding, validated by the new 3D-AS dataset.
Abstract: In the task of 3D Aerial-view Scene Semantic Segmentation (3D-AVS-SS), traditional methods struggle to address semantic ambiguity caused by scale variations and structural occlusions in aerial images. This limits their segmentation accuracy and consistency. To tackle these challenges, we propose a novel 3D-AVS-SS approach named SAD-Splat. Our method introduces a Gaussian point drop module, which integrates semantic confidence estimation with a learnable sparsity mechanism based on the Hard Concrete distribution. This module effectively eliminates redundant and semantically ambiguous Gaussian points, enhancing both segmentation performance and representation compactness. Furthermore, SAD-Splat incorporates a high-confidence pseudo-label generation pipeline. It leverages 2D foundation models to enhance supervision when ground-truth labels are limited, thereby further improving segmentation accuracy. To advance research in this domain, we introduce a challenging benchmark dataset: 3D Aerial Semantic (3D-AS), which encompasses diverse real-world aerial scenes with sparse annotations. Experimental results demonstrate that SAD-Splat achieves an excellent balance between segmentation accuracy and representation compactness. It offers an efficient and scalable solution for 3D aerial scene understanding.
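The learnable sparsity mechanism is stated to be based on the Hard Concrete distribution, whose gate sampling and expected-L0 penalty are standard (Louizos et al.'s L0 relaxation). A sketch of such a gate over Gaussian points; how the gate is fused with the semantic confidence is a guess:

```python
import torch

def hard_concrete_gate(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Sample a Hard Concrete gate z in [0, 1] per Gaussian point
    (standard stretched-sigmoid reparameterization)."""
    u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
    s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / beta)
    return (s * (zeta - gamma) + gamma).clamp(0.0, 1.0)

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    # Expected number of open gates, i.e. the differentiable sparsity penalty.
    return torch.sigmoid(
        log_alpha - beta * torch.log(torch.tensor(-gamma / zeta))).sum()

n_gaussians = 10000
log_alpha = torch.zeros(n_gaussians, requires_grad=True)  # learnable
semantic_conf = torch.rand(n_gaussians)  # per-point confidence from the
# semantic branch; multiplying it into the gate is an assumed fusion rule.
gate = hard_concrete_gate(log_alpha) * semantic_conf
keep = gate > 0.5                        # Gaussians retained after pruning
loss_sparsity = 1e-4 * expected_l0(log_alpha)
loss_sparsity.backward()
print(keep.float().mean())
```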
[158] Enhancing Monocular 3D Hand Reconstruction with Learned Texture Priors
Giorgos Karvounas, Nikolaos Kyriazis, Iason Oikonomidis, Georgios Pavlakos, Antonis A. Argyros
Main category: cs.CV
TL;DR: The paper explores texture as a key cue for improving 3D hand reconstruction, proposing a lightweight texture module for better alignment between predicted and observed hand appearances.
Details
Motivation: Texture alignment is often imperfect in high-performing models, suggesting its underuse as a supervisory signal for pose and shape estimation.
Method: A texture module embeds per-pixel observations into UV texture space, using a dense alignment loss between predicted and observed appearances, integrated into existing pipelines.
Result: The approach improves accuracy and realism in hand reconstruction, validated by augmenting the HaMeR transformer architecture.
Conclusion: Texture-guided supervision enhances 3D hand reconstruction, demonstrating the value of appearance alignment.
Abstract: We revisit the role of texture in monocular 3D hand reconstruction, not as an afterthought for photorealism, but as a dense, spatially grounded cue that can actively support pose and shape estimation. Our observation is simple: even in high-performing models, the overlay between predicted hand geometry and image appearance is often imperfect, suggesting that texture alignment may be an underused supervisory signal. We propose a lightweight texture module that embeds per-pixel observations into UV texture space and enables a novel dense alignment loss between predicted and observed hand appearances. Our approach assumes access to a differentiable rendering pipeline and a model that maps images to 3D hand meshes with known topology, allowing us to back-project a textured hand onto the image and perform pixel-based alignment. The module is self-contained and easily pluggable into existing reconstruction pipelines. To isolate and highlight the value of texture-guided supervision, we augment HaMeR, a high-performing yet unadorned transformer architecture for 3D hand pose estimation. The resulting system improves both accuracy and realism, demonstrating the value of appearance-guided alignment in hand reconstruction.
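A minimal form of the dense alignment idea, assuming per-pixel UV coordinates come out of the differentiable renderer and the loss is a masked L1 between the learned texture sampled at those UVs and the observed image (the paper's exact loss and module design are not given in the summary):

```python
import torch
import torch.nn.functional as F

def dense_texture_alignment_loss(texture, uv, image, mask):
    """texture: (B,3,Ht,Wt) learned UV texture; uv: (B,H,W,2) per-pixel UV
    coordinates in [-1, 1] from rendering the predicted mesh;
    image: (B,3,H,W) observation; mask: (B,1,H,W) hand-region mask."""
    rendered = F.grid_sample(texture, uv, align_corners=False)
    return ((rendered - image).abs() * mask).sum() / mask.sum().clamp(min=1)

B, H, W = 2, 128, 128
texture = torch.randn(B, 3, 256, 256, requires_grad=True)
uv = torch.rand(B, H, W, 2) * 2 - 1   # would come from the renderer, so the
                                      # loss also back-propagates into pose/shape
image = torch.rand(B, 3, H, W)
mask = (torch.rand(B, 1, H, W) > 0.5).float()
loss = dense_texture_alignment_loss(texture, uv, image, mask)
loss.backward()
```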
[159] Multi-Contrast Fusion Module: An attention mechanism integrating multi-contrast features for fetal torso plane classification
Shengjun Zhu, Siyu Liu, Runqing Xiong, Liping Zheng, Duo Ma, Rongshang Chen, Jiaxin Cai
Main category: cs.CV
TL;DR: A novel Multi-Contrast Fusion Module (MCFM) improves fetal torso plane recognition in ultrasound imaging by enhancing feature extraction with minimal parameter overhead.
Details
Motivation: Accurate identification of fetal torso planes in ultrasound is crucial for prenatal care, but low contrast and unclear textures hinder fine-grained recognition.
Method: MCFM processes raw ultrasound data in lower neural network layers, using attention weights for multi-contrast feature enhancement with low complexity.
Result: MCFM significantly boosts recognition performance and classification accuracy on fetal torso plane images, with minimal added model complexity.
Conclusion: MCFM enhances clinical reliability in prenatal screening by improving feature representation, showing strong potential for adoption.
Abstract: Purpose: Prenatal ultrasound is a key tool in evaluating fetal structural development and detecting abnormalities, contributing to reduced perinatal complications and improved neonatal survival. Accurate identification of standard fetal torso planes is essential for reliable assessment and personalized prenatal care. However, limitations such as low contrast and unclear texture details in ultrasound imaging pose significant challenges for fine-grained anatomical recognition. Methods: We propose a novel Multi-Contrast Fusion Module (MCFM) to enhance the model’s ability to extract detailed information from ultrasound images. MCFM operates exclusively on the lower layers of the neural network, directly processing raw ultrasound data. By assigning attention weights to image representations under different contrast conditions, the module enhances feature modeling while explicitly maintaining minimal parameter overhead. Results: The proposed MCFM was evaluated on a curated dataset of fetal torso plane ultrasound images. Experimental results demonstrate that MCFM substantially improves recognition performance, with a minimal increase in model complexity. The integration of multi-contrast attention enables the model to better capture subtle anatomical structures, contributing to higher classification accuracy and clinical reliability. Conclusions: Our method provides an effective solution for improving fetal torso plane recognition in ultrasound imaging. By enhancing feature representation through multi-contrast fusion, the proposed approach supports clinicians in achieving more accurate and consistent diagnoses, demonstrating strong potential for clinical adoption in prenatal screening. The code is available at https://github.com/sysll/MCFM.
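A toy rendition of attention-weighted multi-contrast fusion on raw ultrasound: build a few contrast-adjusted views, score them per pixel with a tiny conv head, and fuse by softmax weights. The gamma set and head are assumptions; MCFM's actual design will differ:

```python
import torch
import torch.nn as nn

class MultiContrastFusion(nn.Module):
    """Toy multi-contrast fusion: several gamma-adjusted views of the raw
    frame, scored by a small conv head and fused by per-pixel attention."""
    def __init__(self, gammas=(0.5, 1.0, 2.0)):
        super().__init__()
        self.gammas = gammas
        self.score = nn.Conv2d(1, 1, 3, padding=1)  # minimal parameter overhead

    def forward(self, x):                       # x: (B,1,H,W) in [0, 1]
        views = torch.stack([x.clamp(1e-6) ** g for g in self.gammas], dim=1)
        logits = torch.stack([self.score(v) for v in views.unbind(1)], dim=1)
        w = torch.softmax(logits, dim=1)        # attention across contrasts
        return (w * views).sum(dim=1)           # fused low-level feature map

mcfm = MultiContrastFusion()
frame = torch.rand(2, 1, 224, 224)
fused = mcfm(frame)                             # feeds the lower CNN layers
print(fused.shape)
```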
[160] Combinative Matching for Geometric Shape Assembly
Nahyuk Lee, Juhong Min, Junhong Lee, Chunghyun Park, Minsu Cho
Main category: cs.CV
TL;DR: A new shape-matching method, combinative matching, improves geometric assembly by modeling identical surface shapes and opposite volume occupancy, reducing ambiguities and outperforming existing methods.
Details
Motivation: Existing methods for geometric assembly rely on identical surface alignment, missing the distinct properties of interlocking shapes. This paper addresses this gap.
Method: The method models ‘identical surface shape’ and ‘opposite volume occupancy,’ using equivariant neural networks to align regions by estimating shape orientations.
Result: The approach reduces local matching ambiguities and robustly combines parts, outperforming state-of-the-art methods on benchmarks.
Conclusion: Combinative matching effectively addresses interlocking shape assembly, demonstrating superior performance and robustness.
Abstract: This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: ‘identical surface shape’ and ‘opposite volume occupancy.’ Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art. Project page: https://nahyuklee.github.io/cmnet.
[161] Multi-Sequence Parotid Gland Lesion Segmentation via Expert Text-Guided Segment Anything Model
Zhongyuan Wu, Chuan-Xian Ren, Yu Wang, Xiaohua Ban, Jianning Xiao, Xiaohui Duan
Main category: cs.CV
TL;DR: PG-SAM integrates expert diagnosis text and cross-sequence attention to improve parotid gland lesion segmentation, outperforming existing methods.
Details
Motivation: Accurate parotid gland lesion segmentation is challenging due to variable lesion sizes and complex boundaries. Current methods lack expert domain knowledge and rely on hard-to-obtain precise prompts.
Method: PG-SAM uses expert diagnosis text to generate prompts, a cross-sequence attention module for multi-modal integration, and a decoder for segmentation.
Result: PG-SAM achieves state-of-the-art performance across three clinical centers, proving its clinical applicability.
Conclusion: PG-SAM effectively leverages expert knowledge and multi-modal data for superior parotid gland lesion segmentation.
Abstract: Parotid gland lesion segmentation is essential for the treatment of parotid gland diseases. However, due to the variable size and complex lesion boundaries, accurate parotid gland lesion segmentation remains challenging. Recently, fine-tuning of the Segment Anything Model (SAM) has shown remarkable performance in the field of medical image segmentation. Nevertheless, SAM’s interactive segmentation model relies heavily on precise lesion prompts (points, boxes, masks, etc.), which are very difficult to obtain in real-world applications. Besides, current medical image segmentation methods operate fully automatically, ignoring the domain knowledge of medical experts when performing segmentation. To address these limitations, we propose the parotid gland segment anything model (PG-SAM), an expert diagnosis text-guided SAM incorporating expert domain knowledge for cross-sequence parotid gland lesion segmentation. Specifically, we first propose an expert diagnosis report guided prompt generation module that can automatically generate prompt information containing the prior domain knowledge to guide the subsequent lesion segmentation process. Then, we introduce a cross-sequence attention module, which integrates the complementary information of different modalities to enhance the segmentation effect. Finally, the multi-sequence image features and generated prompts are fed into the decoder to obtain the segmentation result. Experimental results demonstrate that PG-SAM achieves state-of-the-art performance in parotid gland lesion segmentation across three independent clinical centers, validating its clinical applicability and the effectiveness of diagnostic text for enhancing image segmentation in real-world clinical settings.
[162] The Brain Resection Multimodal Image Registration (ReMIND2Reg) 2025 Challenge
Reuben Dorent, Laura Rigolo, Colin P. Galvin, Junyu Chen, Mattias P. Heinrich, Aaron Carass, Olivier Colliot, Demian Wassermann, Alexandra Golby, Tina Kapur, William Wells
Main category: cs.CV
TL;DR: The paper discusses the ReMIND2Reg 2025 Challenge, which aims to improve intraoperative image guidance for brain tumor surgery by addressing brain shift using multimodal registration of MRI and ultrasound.
Details
Motivation: Brain shift during surgery reduces the accuracy of neuronavigation systems based on preoperative MRI. Aligning intraoperative ultrasound with MRI can restore accuracy, but it's challenging due to anatomical changes and modality differences.
Method: The ReMIND2Reg Challenge provides a large benchmark dataset (99 training, 5 validation, 10 test cases) of paired 3D MRI and ultrasound volumes. Performance is evaluated using manually annotated landmarks and metrics like TRE and TRE30.
Result: The challenge establishes a standardized framework to evaluate and improve multimodal registration algorithms for neurosurgery.
Conclusion: ReMIND2Reg aims to advance robust and clinically deployable registration methods for image-guided brain tumor surgery.
Abstract: Accurate intraoperative image guidance is critical for achieving maximal safe resection in brain tumor surgery, yet neuronavigation systems based on preoperative MRI lose accuracy during the procedure due to brain shift. Aligning post-resection intraoperative ultrasound (iUS) with preoperative MRI can restore spatial accuracy by estimating brain shift deformations, but it remains a challenging problem given the large anatomical and topological changes and substantial modality intensity gap. The ReMIND2Reg 2025 Challenge provides the largest public benchmark for this task, built upon the ReMIND dataset. It offers 99 training cases, 5 validation cases, and 10 private test cases comprising paired 3D ceT1 MRI, T2 MRI, and post-resection 3D iUS volumes. Data are provided without annotations for training, while validation and test performance are evaluated on manually annotated anatomical landmarks. Metrics include target registration error (TRE), robustness to worst-case landmark misalignment (TRE30), and runtime. By establishing a standardized evaluation framework for this clinically critical and technically complex problem, ReMIND2Reg aims to accelerate the development of robust, generalizable, and clinically deployable multimodal registration algorithms for image-guided neurosurgery.
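The landmark-based metrics are straightforward to compute. A small sketch of TRE, with TRE30 read here as the mean error over each case's worst 30% of landmarks (that reading of the metric's name is an assumption):

```python
import numpy as np

def tre(landmarks_fixed, landmarks_warped, spacing=(1.0, 1.0, 1.0)):
    """Target registration error in mm between corresponding landmarks,
    after the estimated deformation has been applied to the moving points."""
    d = (landmarks_fixed - landmarks_warped) * np.asarray(spacing)
    return np.linalg.norm(d, axis=1)

# Toy case: 15 annotated landmark pairs (coordinates in voxels).
rng = np.random.default_rng(0)
fixed = rng.uniform(0, 200, (15, 3))
warped = fixed + rng.normal(0, 1.5, (15, 3))
errs = tre(fixed, warped, spacing=(0.5, 0.5, 0.5))
print(f"mean TRE: {errs.mean():.2f} mm")
# Robustness to worst-case misalignment: mean of the worst 30% of landmarks
# (an assumed definition of TRE30).
worst = np.sort(errs)[-max(1, int(0.3 * len(errs))):]
print(f"TRE30 (assumed definition): {worst.mean():.2f} mm")
```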
[163] TOTNet: Occlusion-Aware Temporal Tracking for Robust Ball Detection in Sports Videos
Hao Xu, Arbind Agrahari Baniya, Sam Wells, Mohamed Reda Bouadjenek, Richard Dazely, Sunil Aryal
Main category: cs.CV
TL;DR: TOTNet, a Temporal Occlusion Tracking Network, improves ball tracking under occlusion in sports videos using 3D convolutions and visibility-weighted loss, outperforming prior methods.
Details
Motivation: Robust ball tracking under occlusion is crucial for sports video analysis, impacting tasks like event detection and officiating.
Method: TOTNet employs 3D convolutions, visibility-weighted loss, and occlusion augmentation to handle partial and full occlusions.
Result: TOTNet reduces RMSE from 37.30 to 7.19 and improves accuracy on fully occluded frames from 0.63 to 0.80, outperforming state-of-the-art methods.
Conclusion: TOTNet is effective for offline sports analytics in fast-paced scenarios, as demonstrated by its performance on a new occlusion-rich dataset.
Abstract: Robust ball tracking under occlusion remains a key challenge in sports video analysis, affecting tasks like event detection and officiating. We present TOTNet, a Temporal Occlusion Tracking Network that leverages 3D convolutions, visibility-weighted loss, and occlusion augmentation to improve performance under partial and full occlusions. Developed in collaboration with Paralympics Australia, TOTNet is designed for real-world sports analytics. We introduce TTA, a new occlusion-rich table tennis dataset collected from professional-level Paralympic matches, comprising 9,159 samples with 1,996 occlusion cases. Evaluated on four datasets across tennis, badminton, and table tennis, TOTNet significantly outperforms prior state-of-the-art methods, reducing RMSE from 37.30 to 7.19 and improving accuracy on fully occluded frames from 0.63 to 0.80. These results demonstrate TOTNet's effectiveness for offline sports analytics in fast-paced scenarios. Code and data: https://github.com/AugustRushG/TOTNet.
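A minimal version of a visibility-weighted loss: up-weight frames where the ball is occluded so the temporal model must interpolate through them rather than ignore them. The weighting scheme and value are assumptions, not TOTNet's published loss:

```python
import torch

def visibility_weighted_loss(pred_xy, gt_xy, visibility, w_occluded=2.0):
    """Per-frame ball-position loss with occlusion-dependent weights.
    visibility: 1 = fully visible, 0 = fully occluded. Up-weighting the
    occluded frames (the value 2.0 is an assumption) forces the temporal
    3D-conv backbone to infer positions through occlusions."""
    w = torch.where(visibility < 0.5,
                    torch.full_like(visibility, w_occluded),
                    torch.ones_like(visibility))
    err = ((pred_xy - gt_xy) ** 2).sum(dim=-1)   # squared pixel error
    return (w * err).mean()

pred = torch.rand(8, 2, requires_grad=True)      # predicted (x, y) per frame
gt = torch.rand(8, 2)
vis = torch.tensor([1, 1, 1, 0, 0, 1, 1, 1.0])   # middle frames occluded
loss = visibility_weighted_loss(pred, gt, vis)
loss.backward()
```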
[164] Noise-adapted Neural Operator for Robust Non-Line-of-Sight Imaging
Lianfang Wang, Kuilin Qin, Xueying Liu, Huibin Chang, Yong Wang, Yuping Duan
Main category: cs.CV
TL;DR: A framework for 3D NLOS imaging uses noise estimation and a parameterized neural operator for robust, accurate reconstruction, validated by experiments.
Details
Motivation: NLOS imaging faces challenges due to weak, noisy signals; this work aims to improve reconstruction accuracy and robustness.
Method: Combines noise estimation, a parameterized neural operator, and deep algorithm unfolding for end-to-end reconstruction, with global-local feature fusion.
Result: Effective for fast scanning and sparse data, validated by simulations and real datasets.
Conclusion: The method enhances NLOS imaging in complex scenarios with dynamic noise adaptation.
Abstract: In computational imaging, and especially non-line-of-sight (NLOS) imaging, information about obscured or hidden scenes is extracted from indirect light signals produced by multiple reflections or scattering. The inherently weak nature of these signals, coupled with their susceptibility to noise, necessitates the integration of physical processes to ensure accurate reconstruction. This paper presents a parameterized inverse problem framework tailored for large-scale linear problems in 3D imaging reconstruction. Initially, a noise estimation module is employed to adaptively assess the noise levels present in transient data. Subsequently, a parameterized neural operator is developed to approximate the inverse mapping, facilitating end-to-end rapid image reconstruction. Our 3D image reconstruction framework, grounded in operator learning, is constructed through deep algorithm unfolding, which not only provides commendable model interpretability but also enables dynamic adaptation to varying noise levels in the acquired data, thereby ensuring consistently robust and accurate reconstruction outcomes. Furthermore, we introduce a novel method for the fusion of global and local spatiotemporal data features. By integrating structural and detailed information, this method significantly enhances both accuracy and robustness. Comprehensive numerical experiments conducted on both simulated and real datasets substantiate the efficacy of the proposed method. It demonstrates remarkable performance with fast scanning data and sparse illumination point data, offering a viable solution for NLOS imaging in complex scenarios.
[165] Automated Segmentation of Coronal Brain Tissue Slabs for 3D Neuropathology
Jonathan Williams Ramirez, Dina Zemlyanker, Lucas Deden-Binder, Rogeny Herisse, Erendira Garcia Pallares, Karthik Gopinath, Harshvardhan Gazula, Christopher Mount, Liana N. Kozanno, Michael S. Marshall, Theresa R. Connors, Matthew P. Frosch, Mark Montine, Derek H. Oakley, Christine L. Mac Donald, C. Dirk Keene, Bradley T. Hyman, Juan Eugenio Iglesias
Main category: cs.CV
TL;DR: A deep learning model using U-Net automates brain tissue segmentation from photographs, achieving high accuracy comparable to manual methods.
Details
Motivation: Manual segmentation of brain tissue from photographs is costly, prompting the need for an automated solution.
Method: A U-Net architecture trained on 1,414 manually segmented images and 2,000 synthetic images for generalizability.
Result: Achieved median Dice score >0.98, mean surface distance <0.4mm, and 95% Hausdorff distance <1.60mm.
Conclusion: The tool is highly accurate, publicly available, and approaches inter-/intra-rater variability levels.
Abstract: Advances in image registration and machine learning have recently enabled volumetric analysis of postmortem brain tissue from conventional photographs of coronal slabs, which are routinely collected in brain banks and neuropathology laboratories worldwide. One caveat of this methodology is the requirement of segmentation of the tissue from photographs, which currently requires costly manual intervention. In this article, we present a deep learning model to automate this process. The automatic segmentation tool relies on a U-Net architecture that was trained with a combination of (i) 1,414 manually segmented images of both fixed and fresh tissue, from specimens with varying diagnoses, photographed at two different sites; and (ii) 2,000 synthetic images with randomized contrast and corresponding masks generated from MRI scans for improved generalizability to unseen photographic setups. Automated model predictions on a subset of photographs not seen in training were analyzed to estimate performance compared to manual labels, including both inter- and intra-rater variability. Our model achieved a median Dice score over 0.98, a mean surface distance under 0.4 mm, and a 95% Hausdorff distance under 1.60 mm, which approaches inter-/intra-rater levels. Our tool is publicly available at surfer.nmr.mgh.harvard.edu/fswiki/PhotoTools.
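The reported metrics are standard and easy to reproduce. A small sketch computing Dice, mean surface distance, and the 95% Hausdorff distance on toy 2D masks standing in for manual vs. automatic segmentations:

```python
import numpy as np
from scipy import ndimage

def dice(a, b):
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def surface_distances(a, b, spacing=(1.0, 1.0)):
    # Distances from each boundary pixel of mask a to the boundary of b.
    border = lambda m: m ^ ndimage.binary_erosion(m)
    dt_b = ndimage.distance_transform_edt(~border(b), sampling=spacing)
    return dt_b[border(a)]

def hd95_and_msd(a, b, spacing=(1.0, 1.0)):
    d_ab = surface_distances(a, b, spacing)
    d_ba = surface_distances(b, a, spacing)
    all_d = np.concatenate([d_ab, d_ba])
    return np.percentile(all_d, 95), all_d.mean()

# Two slightly offset disks as toy "manual" vs. "automatic" masks.
yy, xx = np.mgrid[:200, :200]
manual = (yy - 100) ** 2 + (xx - 100) ** 2 < 60 ** 2
auto = (yy - 102) ** 2 + (xx - 99) ** 2 < 61 ** 2
hd95, msd = hd95_and_msd(manual, auto, spacing=(0.1, 0.1))  # mm per pixel
print(f"Dice={dice(manual, auto):.3f}, HD95={hd95:.2f} mm, MSD={msd:.2f} mm")
```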
[166] NegFaceDiff: The Power of Negative Context in Identity-Conditioned Diffusion for Synthetic Face Generation
Eduarda Caldeira, Naser Damer, Fadi Boutros
Main category: cs.CV
TL;DR: NegFaceDiff improves synthetic face data generation by using negative conditions to enhance identity separability, boosting FR model performance.
Details
Motivation: Addresses identity overlap in synthetic face data from diffusion models, which hampers FR system performance.
Method: Introduces NegFaceDiff, a sampling method adding negative conditions to identity-conditioned diffusion models to enforce inter-class separability.
Result: Increases identity separability (FDR from 2.427 to 5.687) and improves FR model performance on benchmarks.
Conclusion: NegFaceDiff effectively enhances synthetic data quality for FR training by mitigating identity overlap.
Abstract: The use of synthetic data as an alternative to authentic datasets in face recognition (FR) development has gained significant attention, addressing privacy, ethical, and practical concerns associated with collecting and using authentic data. Recent state-of-the-art approaches have proposed identity-conditioned diffusion models to generate identity-consistent face images, facilitating their use in training FR models. However, these methods often lack explicit sampling mechanisms to enforce inter-class separability, leading to identity overlap in the generated data and, consequently, suboptimal FR performance. In this work, we introduce NegFaceDiff, a novel sampling method that incorporates negative conditions into the identity-conditioned diffusion process. NegFaceDiff enhances identity separation by leveraging negative conditions that explicitly guide the model away from unwanted features while preserving intra-class consistency. Extensive experiments demonstrate that NegFaceDiff significantly improves the identity consistency and separability of data generated by identity-conditioned diffusion models. Specifically, identity separability, measured by the Fisher Discriminant Ratio (FDR), increases from 2.427 to 5.687. These improvements are reflected in FR systems trained on the NegFaceDiff dataset, which outperform models trained on data generated without negative conditions across multiple benchmarks.
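Negative conditions in an identity-conditioned sampler can take the familiar negative-prompt form: move along the target identity's predicted noise and away from an unwanted identity's. A toy sketch under that assumption (the model, embeddings, update rule, and negative-identity choice are stand-ins; the paper's exact weighting may differ):

```python
import torch
import torch.nn as nn

class ToyIDDenoiser(nn.Module):
    """Stand-in identity-conditioned epsilon-predictor; real identity
    embeddings would come from a face recognition backbone."""
    def __init__(self, c_id=128):
        super().__init__()
        self.proj = nn.Linear(c_id, 4)
        self.net = nn.Conv2d(4, 4, 3, padding=1)
    def forward(self, z, id_emb):
        return self.net(z + self.proj(id_emb)[:, :, None, None])

model = ToyIDDenoiser()
id_pos = torch.randn(1, 128)   # target identity embedding
id_neg = torch.randn(1, 128)   # unwanted identity (e.g. a nearby one);
                               # how NegFaceDiff picks it is not stated here

w = 3.0
z = torch.randn(1, 4, 32, 32)
for t in range(10):
    e_pos = model(z, id_pos)
    e_neg = model(z, id_neg)
    # Negative-prompt-style combination: follow the positive identity's
    # score direction while being pushed off the negative identity's.
    eps = e_neg + w * (e_pos - e_neg)
    z = z - 0.1 * eps          # toy update in place of a DDIM step
```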
[167] TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos
Jinxi Li, Ziyang Song, Bo Yang
Main category: cs.CV
TL;DR: TRACE models 3D scene motion physics from videos without labels, outperforming baselines in frame extrapolation and enabling object segmentation via learned physical parameters.
Details
Motivation: Existing methods struggle with complex motion physics or require additional labels. TRACE aims to overcome these limitations by learning physical parameters directly.
Method: TRACE formulates 3D points as rigid particles, learning a dynamics system for each to estimate physical parameters governing motion.
Result: TRACE outperforms baselines in future frame extrapolation and allows easy object segmentation by clustering physical parameters.
Conclusion: TRACE effectively models complex 3D scene physics without labels, offering superior performance and practical segmentation capabilities.
Abstract: In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural nets, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. The key novelty of our method is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle’s motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic dataset demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters.
[168] GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors
Xingyilang Yin, Qi Zhang, Jiahao Chang, Ying Feng, Qingnan Fan, Xi Yang, Chi-Man Pun, Huaqi Zhang, Xiaodong Cun
Main category: cs.CV
TL;DR: GSFixer improves 3DGS reconstruction from sparse views using a reference-guided video restoration model, outperforming state-of-the-art methods.
Details
Motivation: Sparse views in 3DGS reconstruction lead to artifacts; existing generative priors fail to maintain consistency with inputs.
Method: Uses a DiT-based video diffusion model with 2D/3D features from reference views for artifact restoration.
Result: Outperforms current methods in 3DGS artifact restoration and sparse-view reconstruction.
Conclusion: GSFixer effectively enhances 3DGS quality from sparse inputs, validated by the new DL3DV-Res benchmark.
Abstract: Reconstructing 3D scenes using 3D Gaussian Splatting (3DGS) from sparse views is an ill-posed problem due to insufficient information, often resulting in noticeable artifacts. While recent approaches have sought to leverage generative priors to complete information for under-constrained regions, they struggle to generate content that remains consistent with input observations. To address this challenge, we propose GSFixer, a novel framework designed to improve the quality of 3DGS representations reconstructed from sparse inputs. The core of our approach is the reference-guided video restoration model, built upon a DiT-based video diffusion model trained on paired artifact 3DGS renders and clean frames with additional reference-based conditions. Considering the input sparse views as references, our model integrates both 2D semantic features and 3D geometric features of reference views extracted from the visual geometry foundation model, enhancing the semantic coherence and 3D consistency when fixing artifact novel views. Furthermore, considering the lack of suitable benchmarks for 3DGS artifact restoration evaluation, we present DL3DV-Res which contains artifact frames rendered using low-quality 3DGS. Extensive experiments demonstrate our GSFixer outperforms current state-of-the-art methods in 3DGS artifact restoration and sparse-view 3D reconstruction. Project page: https://github.com/GVCLab/GSFixer.
[169] RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians
Shenxing Wei, Jinxi Li, Yafei Yang, Siyuan Zhou, Bo Yang
Main category: cs.CV
TL;DR: RayletDF is a novel method for 3D surface reconstruction from point clouds or 3D Gaussians, using raylet distance fields for efficient and precise results.
Details
Motivation: Existing coordinate-based methods are computationally intensive for rendering explicit surfaces, prompting the need for a more efficient solution.
Method: The pipeline includes a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender to predict and aggregate surface points.
Result: Superior performance on public datasets, with exceptional generalization in single-forward passes on unseen data.
Conclusion: RayletDF offers an efficient, generalizable solution for high-quality 3D surface reconstruction.
Abstract: In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or pre-estimated 3D Gaussians by 3DGS from RGB images. Unlike existing coordinate-based methods which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single-forward pass across unseen datasets in testing.
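The raylet idea itself is compact: a raylet distance field predicts, for each query ray, how far along the ray the surface lies, and the multi-raylet blender aggregates several such predictions. A toy sketch with an assumed confidence-weighted blender (the feature extractor and head shapes are stand-ins):

```python
import torch
import torch.nn as nn

class ToyRayletPredictor(nn.Module):
    """Stand-in raylet distance-field head: maps a per-raylet feature to a
    distance along the ray plus a confidence (real RayletDF extracts these
    features from the point cloud / 3D Gaussians around each ray)."""
    def __init__(self, c=32):
        super().__init__()
        self.head = nn.Linear(c, 2)
    def forward(self, feat):
        t, conf = self.head(feat).unbind(-1)
        return t.relu(), conf               # distances are non-negative

pred = ToyRayletPredictor()
origins = torch.zeros(1024, 3)              # query ray origins
dirs = torch.nn.functional.normalize(torch.randn(1024, 3), dim=-1)
feats = torch.randn(1024, 4, 32)            # 4 raylet features per ray

t, conf = pred(feats)                       # (1024, 4) each
w = torch.softmax(conf, dim=-1)             # multi-raylet blender:
t_blend = (w * t).sum(-1)                   # confidence-weighted distance
surface_pts = origins + t_blend[:, None] * dirs
print(surface_pts.shape)                    # one surface point per query ray
```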
[170] PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training
Yin Xie, Zhichao Chen, Xiaoze Yu, Yongle Zhao, Xiang An, Kaicheng Yang, Zimin Ran, Jia Guo, Ziyong Feng, Jiankang Deng
Main category: cs.CV
TL;DR: PaCo-FR introduces an unsupervised framework for facial representation learning, addressing challenges like feature capture, spatial structure, and data efficiency with masked image modeling and patch-pixel alignment.
Details
Motivation: Existing methods struggle with capturing distinct facial features, ignoring spatial structure, and inefficient use of labeled data.
Method: Combines masked image modeling with patch-pixel alignment, using structured masking, a patch-based codebook, and spatial consistency constraints.
Result: Achieves state-of-the-art performance with just 2 million unlabeled images, excelling in varied poses, occlusions, and lighting.
Conclusion: Advances facial representation learning, offering a scalable, efficient solution that reduces reliance on annotated data.
Abstract: Facial representation pre-training is crucial for tasks like facial recognition, expression analysis, and virtual reality. However, existing methods face three key challenges: (1) failing to capture distinct facial features and fine-grained semantics, (2) ignoring the spatial structure inherent to facial anatomy, and (3) inefficiently utilizing limited labeled data. To overcome these, we introduce PaCo-FR, an unsupervised framework that combines masked image modeling with patch-pixel alignment. Our approach integrates three innovative components: (1) a structured masking strategy that preserves spatial coherence by aligning with semantically meaningful facial regions, (2) a novel patch-based codebook that enhances feature discrimination with multiple candidate tokens, and (3) spatial consistency constraints that preserve geometric relationships between facial components. PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images for pre-training. Our method demonstrates significant improvements, particularly in scenarios with varying poses, occlusions, and lighting conditions. We believe this work advances facial representation learning and offers a scalable, efficient solution that reduces reliance on expensive annotated datasets, driving more effective facial analysis systems.
[171] Slot Attention-based Feature Filtering for Few-Shot Learning
Javier Rodenas, Eduardo Aguilar, Petia Radeva
Main category: cs.CV
TL;DR: SAFF uses slot attention to filter irrelevant features in few-shot learning, improving classification by focusing on meaningful similarities.
Details
Motivation: Irrelevant features degrade few-shot learning performance, causing confusion and misclassification.
Method: SAFF integrates slot attention with patch embeddings to filter weak features and uses a similarity matrix for relevance quantification.
Result: SAFF outperforms other attention mechanisms and state-of-the-art methods on benchmarks like CIFAR-FS, FC100, miniImageNet, and tieredImageNet.
Conclusion: Slot attention effectively filters irrelevant features, enhancing few-shot learning performance.
Abstract: Irrelevant features can significantly degrade few-shot learning performance. Few-shot learning is used to match queries and support images based on meaningful similarities despite the limited data. However, in this process, non-relevant features such as background elements can easily lead to confusion and misclassification. To address this issue, we propose Slot Attention-based Feature Filtering for Few-Shot Learning (SAFF) that leverages slot attention mechanisms to discriminate and filter weak features, thereby improving few-shot classification performance. The key innovation of SAFF lies in its integration of slot attention with patch embeddings, unifying class-aware slots into a single attention mechanism to filter irrelevant features effectively. We introduce a similarity matrix computed across support and query images to quantify the relevance of filtered embeddings for classification. Through experiments, we demonstrate that Slot Attention performs better than other attention mechanisms, capturing discriminative features while reducing irrelevant information. We validate our approach through extensive experiments on few-shot learning benchmarks: CIFAR-FS, FC100, miniImageNet and tieredImageNet, outperforming several state-of-the-art methods.
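SAFF builds on slot attention, in which a small set of slots compete for image patches via a softmax across slots, so patches that no class-aware slot claims (e.g., background) receive low weight and can be filtered. Below is a generic slot-attention sketch in PyTorch, not SAFF's exact architecture; the slot count, iteration count, and GRU update follow the original slot-attention recipe:

```python
import torch
import torch.nn as nn

class MiniSlotAttention(nn.Module):
    """Minimal slot attention (Locatello et al., 2020): slots compete over
    patch embeddings via a softmax across slots; the resulting attention map
    can be thresholded to filter weakly claimed (irrelevant) patches."""

    def __init__(self, dim, n_slots=4, iters=3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        self.slots = nn.Parameter(torch.randn(n_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, x):                        # x: (B, N, dim) patch embeddings
        B, N, D = x.shape
        slots = self.slots.expand(B, -1, -1)
        k, v = self.to_k(x), self.to_v(x)
        for _ in range(self.iters):
            q = self.to_q(slots)
            logits = torch.einsum('bsd,bnd->bsn', q, k) * self.scale
            attn = logits.softmax(dim=1)          # softmax over slots -> competition
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
            updates = torch.einsum('bsn,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
        return slots, attn                        # attn: (B, n_slots, N)

x = torch.randn(2, 49, 64)                        # e.g., a 7x7 patch grid
slots, attn = MiniSlotAttention(64)(x)
print(slots.shape, attn.shape)                    # (2, 4, 64) (2, 4, 49)
```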
[172] MangaDiT: Reference-Guided Line Art Colorization with Hierarchical Attention in Diffusion Transformers
Qianru Qiu, Jiafeng Mao, Kento Masui, Xueting Wang
Main category: cs.CV
TL;DR: MangaDiT improves reference-guided line art colorization using Diffusion Transformers and a hierarchical attention mechanism for better region-level color consistency.
Details
Motivation: Existing methods lack region-level color consistency when reference and target images differ in pose or motion. MangaDiT addresses this by implicitly discovering semantic correspondences.
Method: MangaDiT uses Diffusion Transformers (DiT) with a hierarchical attention mechanism and dynamic weighting strategy, leveraging pooled spatial features for context-awareness.
Result: MangaDiT outperforms state-of-the-art methods on benchmark datasets in both qualitative and quantitative evaluations.
Conclusion: MangaDiT enhances region-level color alignment and achieves superior performance in reference-guided line art colorization.
Abstract: Recent advances in diffusion models have significantly improved the performance of reference-guided line art colorization. However, existing methods still struggle with region-level color consistency, especially when the reference and target images differ in character pose or motion. Instead of relying on external matching annotations between the reference and target, we propose to discover semantic correspondences implicitly through internal attention mechanisms. In this paper, we present MangaDiT, a powerful model for reference-guided line art colorization based on Diffusion Transformers (DiT). Our model takes both line art and reference images as conditional inputs and introduces a hierarchical attention mechanism with a dynamic attention weighting strategy. This mechanism augments the vanilla attention with an additional context-aware path that leverages pooled spatial features, effectively expanding the model’s receptive field and enhancing region-level color alignment. Experiments on two benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches, achieving superior performance in both qualitative and quantitative evaluations.
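The abstract describes augmenting vanilla attention with a context-aware path over pooled spatial features. Here is a hedged PyTorch sketch of that idea, with a fixed pooling factor and a scalar blend standing in for the paper's dynamic attention weighting, which the abstract does not fully specify:

```python
import torch
import torch.nn.functional as F

def hierarchical_attention(q, k, v, pool=4, alpha=0.5):
    """Vanilla attention plus a coarse context path over pooled keys/values.
    `pool` and `alpha` are illustrative stand-ins for the paper's dynamic
    attention weighting.  q, k, v: (B, N, D) with N spatial tokens."""
    B, N, D = k.shape
    fine = F.scaled_dot_product_attention(q, k, v)
    # Pool spatial tokens into coarse context tokens (N must divide by pool),
    # effectively enlarging the receptive field of each query token.
    k_c = k.view(B, N // pool, pool, D).mean(dim=2)
    v_c = v.view(B, N // pool, pool, D).mean(dim=2)
    coarse = F.scaled_dot_product_attention(q, k_c, v_c)
    return alpha * fine + (1 - alpha) * coarse

q = k = v = torch.randn(1, 64, 32)
print(hierarchical_attention(q, k, v).shape)      # torch.Size([1, 64, 32])
```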
[173] NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation
Devvrat Joshi, Islem Rekik
Main category: cs.CV
TL;DR: NEURAL is a framework for compressing multimodal medical imaging data using semantics-guided compression, reducing data size significantly while maintaining diagnostic accuracy.
Details
Motivation: Address storage and transmission challenges of medical imaging data in resource-constrained clinical settings.
Method: Uses cross-attention scores from a vision-language model to prune chest X-rays down to their diagnostically critical regions, creating a compressed graph representation fused with a knowledge graph.
Result: Achieves 93.4-97.7% reduction in image data size with 0.88-0.95 AUC for pneumonia detection, outperforming baselines.
Conclusion: NEURAL resolves the trade-off between data size and clinical utility, enabling efficient workflows without sacrificing performance.
Abstract: The rapid growth of multimodal medical imaging data presents significant storage and transmission challenges, particularly in resource-constrained clinical settings. We propose NEURAL, a novel framework that addresses this by using semantics-guided data compression. Our approach repurposes cross-attention scores between the image and its radiological report from a fine-tuned generative vision-language model to structurally prune chest X-rays, preserving only diagnostically critical regions. This process transforms the image into a highly compressed, graph representation. This unified graph-based representation fuses the pruned visual graph with a knowledge graph derived from the clinical report, creating a universal data structure that simplifies downstream modeling. Validated on the MIMIC-CXR and CheXpert Plus dataset for pneumonia detection, NEURAL achieves a 93.4-97.7% reduction in image data size while maintaining a high diagnostic performance of 0.88-0.95 AUC, outperforming other baseline models that use uncompressed data. By creating a persistent, task-agnostic data asset, NEURAL resolves the trade-off between data size and clinical utility, enabling efficient workflows and teleradiology without sacrificing performance. Our NEURAL code is available at https://github.com/basiralab/NEURAL.
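The pruning step reduces to keeping the image patches with the highest cross-attention scores and connecting the survivors into a graph. A toy numpy sketch follows, with random scores standing in for real image-report cross-attention and a 4-neighbour grid adjacency rule as an illustrative assumption:

```python
import numpy as np

def prune_patches(attn_scores, grid, keep_ratio=0.05):
    """Keep the most-attended patches of a grid-tokenized image and connect
    surviving grid neighbours into a graph (nodes, edges)."""
    n = grid * grid
    k = max(1, int(n * keep_ratio))
    keep_set = set(int(i) for i in np.argsort(attn_scores)[-k:])  # top-k patches
    edges = []
    for idx in keep_set:
        r, c = divmod(idx, grid)
        nbrs = []
        if c > 0:
            nbrs.append(idx - 1)
        if c < grid - 1:
            nbrs.append(idx + 1)
        if r > 0:
            nbrs.append(idx - grid)
        if r < grid - 1:
            nbrs.append(idx + grid)
        edges += [(idx, nb) for nb in nbrs if nb in keep_set]
    return sorted(keep_set), edges

scores = np.random.rand(14 * 14)                 # e.g., a ViT-style 14x14 patch grid
nodes, edges = prune_patches(scores, grid=14)
print(len(nodes), "patches kept,", len(edges), "edges")
```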
[174] Multimodal Sheaf-based Network for Glioblastoma Molecular Subtype Prediction
Shekhnaz Idrissova, Islem Rekik
Main category: cs.CV
TL;DR: A sheaf-based framework for fusing MRI and histopathology data improves glioblastoma subtype classification, addressing limitations of current methods.
Details
Motivation: Current glioblastoma subtype classification relies on invasive tissue extraction, and existing multimodal fusion methods lack robustness in preserving structural information and handling incomplete data.
Method: Proposes a sheaf-based framework for structure-aware fusion of MRI and histopathology data, ensuring consistency and discriminative feature retention.
Result: Outperforms baseline methods and shows robustness in incomplete or missing data scenarios.
Conclusion: The framework advances virtual biopsy tools for rapid diagnostics, with potential clinical impact.
Abstract: Glioblastoma is a highly invasive brain tumor with rapid progression rates. Recent studies have shown that glioblastoma molecular subtype classification serves as a significant biomarker for effective targeted therapy selection. However, this classification currently requires invasive tissue extraction for comprehensive histopathological analysis. Existing multimodal approaches combining MRI and histopathology images are limited and lack robust mechanisms for preserving shared structural information across modalities. In particular, graph-based models often fail to retain discriminative features within heterogeneous graphs, and structural reconstruction mechanisms for handling missing or incomplete modality data are largely underexplored. To address these limitations, we propose a novel sheaf-based framework for structure-aware and consistent fusion of MRI and histopathology data. Our model outperforms baseline methods and demonstrates robustness in incomplete or missing data scenarios, contributing to the development of virtual biopsy tools for rapid diagnostics. Our source code is available at https://github.com/basiralab/MMSN/.
[175] Predictive Uncertainty for Runtime Assurance of a Real-Time Computer Vision-Based Landing System
Romeo Valentin, Sydney M. Katz, Artur B. Carneiro, Don Walker, Mykel J. Kochenderfer
Main category: cs.CV
TL;DR: A vision-based pipeline for aircraft pose estimation from runway images, featuring innovations in neural architecture, loss function, and fault detection, aiming to meet aviation safety standards.
Details
Motivation: Ensuring robustness and safety in data-driven computer vision systems for aviation applications, particularly for autonomous navigation.
Method: Proposes a pipeline with: (i) a neural architecture using spatial Soft Argmax for keypoint regression, (ii) a calibrated uncertainty loss function, and (iii) Residual-based RAIM for fault detection.
Result: Outperforms baselines in accuracy, provides well-calibrated uncertainty estimates, and enables fault detection with sub-pixel precision.
Conclusion: The pipeline advances the certification potential of vision-based systems for safety-critical aviation applications.
Abstract: Recent advances in data-driven computer vision have enabled robust autonomous navigation capabilities for civil aviation, including automated landing and runway detection. However, ensuring that these systems meet the robustness and safety requirements for aviation applications remains a major challenge. In this work, we present a practical vision-based pipeline for aircraft pose estimation from runway images that represents a step toward the ability to certify these systems for use in safety-critical aviation applications. Our approach features three key innovations: (i) an efficient, flexible neural architecture based on a spatial Soft Argmax operator for probabilistic keypoint regression, supporting diverse vision backbones with real-time inference; (ii) a principled loss function producing calibrated predictive uncertainties, which are evaluated via sharpness and calibration metrics; and (iii) an adaptation of Residual-based Receiver Autonomous Integrity Monitoring (RAIM), enabling runtime detection and rejection of faulty model outputs. We implement and evaluate our pose estimation pipeline on a dataset of runway images. We show that our model outperforms baseline architectures in terms of accuracy while also producing well-calibrated uncertainty estimates with sub-pixel precision that can be used downstream for fault detection.
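The spatial Soft Argmax at the heart of the keypoint head is a standard construction: softmax a heatmap into a probability map, then take the expected coordinate, which keeps the operation differentiable and yields sub-pixel outputs. A minimal PyTorch version:

```python
import torch

def spatial_soft_argmax(heatmaps, temperature=1.0):
    """Differentiable keypoint regression: softmax each heatmap into a
    probability map, then return the expected (x, y) coordinate.
    heatmaps: (B, K, H, W) -> (B, K, 2) in normalized [-1, 1] coordinates."""
    B, K, H, W = heatmaps.shape
    probs = torch.softmax(heatmaps.view(B, K, -1) / temperature, dim=-1).view(B, K, H, W)
    ys = torch.linspace(-1, 1, H, device=heatmaps.device)
    xs = torch.linspace(-1, 1, W, device=heatmaps.device)
    ey = (probs.sum(dim=3) * ys).sum(dim=2)   # expectation over rows
    ex = (probs.sum(dim=2) * xs).sum(dim=2)   # expectation over columns
    return torch.stack([ex, ey], dim=-1)

hm = torch.randn(2, 8, 32, 32)                # e.g., 8 runway keypoints
print(spatial_soft_argmax(hm).shape)          # torch.Size([2, 8, 2])
```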
[176] January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis
Amir Hosseinian, Ashkan Dehghani Zahedani, Umer Mansoor, Noosheen Hashemi, Mark Woodward
Main category: cs.CV
TL;DR: The paper introduces the January Food Benchmark (JFB), a dataset of 1,000 food images with human-validated annotations, a benchmarking framework, and baseline results showing a specialized model outperforming general-purpose ones.
Details
Motivation: The lack of standardized evaluation methodologies and high-quality datasets in AI for nutritional analysis hinders progress.
Method: The authors present the JFB dataset, a benchmarking framework with robust metrics, and evaluate general-purpose VLMs and their specialized model.
Result: The specialized model achieves an Overall Score of 86.2, a 12.1-point improvement over general-purpose models.
Conclusion: This work provides a valuable dataset and framework for future research in automated nutritional analysis.
Abstract: Progress in AI for automated nutritional analysis is critically hampered by the lack of standardized evaluation methodologies and high-quality, real-world benchmark datasets. To address this, we introduce three primary contributions. First, we present the January Food Benchmark (JFB), a publicly available collection of 1,000 food images with human-validated annotations. Second, we detail a comprehensive benchmarking framework, including robust metrics and a novel, application-oriented overall score designed to assess model performance holistically. Third, we provide baseline results from both general-purpose Vision-Language Models (VLMs) and our own specialized model, january/food-vision-v1. Our evaluation demonstrates that the specialized model achieves an Overall Score of 86.2, a 12.1-point improvement over the best-performing general-purpose configuration. This work offers the research community a valuable new evaluation dataset and a rigorous framework to guide and benchmark future developments in automated nutritional analysis.
[177] Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li
Main category: cs.CV
TL;DR: M3-Agent is a multimodal framework with long-term memory, outperforming baselines in memory-based reasoning tasks on M3-Bench.
Details
Motivation: To advance multimodal agents with human-like long-term memory for deeper environmental understanding.
Method: Uses reinforcement learning to train M3-Agent, which processes real-time visual/auditory inputs and organizes memory in an entity-centric, multimodal format.
Result: M3-Agent achieves 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively, compared to baselines.
Conclusion: M3-Agent advances multimodal agents with human-like memory and provides practical design insights.
Abstract: We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot’s perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances the multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent
[178] MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models
Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei
Main category: cs.CV
TL;DR: The paper proposes MoIIE, a sparse Mixture of Experts architecture for LVLMs, improving efficiency and performance by routing tokens to intra- and inter-modality experts.
Details
Motivation: To address the computational costs of dense LVLMs and the challenge of effectively applying MoE to multi-modal tasks.
Method: Introduces MoIIE, guiding expert routing by modality, and a two-stage training strategy for MoE and multi-modal capabilities.
Result: MoIIE models with 5.5B and 11.3B activated parameters match or surpass open-source MoE-LLMs with more activated parameters.
Conclusion: MoIIE offers an efficient and effective solution for multi-modal learning, balancing parameter efficiency and performance.
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs and motivate the exploration of sparse Mixture of Experts (MoE) architectures. While MoE improve parameter efficiency, effectively applying MoE to simultaneously model modality-specific features and cross-modal associations in LVLMs remains challenging. In this work, we propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) to LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbone demonstrate the effectiveness, efficiency and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLMs based multi-modal models that involve more activated parameters. The code is available at https://github.com/AlenjandroWang/MoIIE.
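The routing rule is the interesting part: each token sees its own modality's intra-experts plus a shared pool of inter-modality experts. Here is a hedged PyTorch sketch, with top-1 routing and small expert counts as illustrative assumptions:

```python
import torch
import torch.nn as nn

class MoIIESketch(nn.Module):
    """Modality-guided MoE layer: a token is routed among its own modality's
    intra-experts plus a shared pool of inter-modality experts.  Top-1
    routing and the expert counts here are illustrative assumptions."""

    def __init__(self, dim, n_intra=2, n_inter=2):
        super().__init__()
        make = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.experts = nn.ModuleDict({
            'text': nn.ModuleList([make() for _ in range(n_intra)]),
            'image': nn.ModuleList([make() for _ in range(n_intra)]),
            'inter': nn.ModuleList([make() for _ in range(n_inter)]),
        })
        self.routers = nn.ModuleDict({k: nn.Linear(dim, len(v)) for k, v in self.experts.items()})

    def route(self, x, group):                   # x: (N, dim)
        gate, idx = self.routers[group](x).softmax(-1).max(-1)   # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts[group]):
            mask = idx == e
            if mask.any():
                out[mask] = gate[mask, None] * expert(x[mask])
        return out

    def forward(self, x, modality):              # modality: 'text' or 'image'
        # Residual plus intra-modality experts plus shared inter-modality experts.
        return x + self.route(x, modality) + self.route(x, 'inter')

layer = MoIIESketch(dim=32)
print(layer(torch.randn(5, 32), 'image').shape)  # torch.Size([5, 32])
```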
[179] DSS-Prompt: Dynamic-Static Synergistic Prompting for Few-Shot Class-Incremental Learning
Linpu He, Yanan Li, Bingze Li, Elvis Han Cui, Donghui Wang
Main category: cs.CV
TL;DR: DSS-Prompt leverages pre-trained Vision Transformers with static and dynamic prompts for few-shot class-incremental learning, outperforming state-of-the-art methods without additional training.
Details
Motivation: Address the underexplored challenge of few-shot class-incremental learning (FSCIL) using pre-trained models to learn new concepts from limited samples without forgetting old ones.
Method: Introduces DSS-Prompt, combining static prompts for domain adaptation and dynamic prompts for instance-aware semantics, generated via a pre-trained multi-modal model.
Result: Outperforms existing methods on four benchmarks, alleviating catastrophic forgetting and achieving better performance.
Conclusion: DSS-Prompt is a simple yet effective solution for FSCIL, demonstrating strong generalization and adaptability.
Abstract: Learning from large-scale pre-trained models with strong generalization ability has shown remarkable success in a wide range of downstream tasks recently, but it is still underexplored in the challenging few-shot class-incremental learning (FSCIL) task. It aims to continually learn new concepts from limited training samples without forgetting the old ones at the same time. In this paper, we introduce DSS-Prompt, a simple yet effective approach that transforms the pre-trained Vision Transformer with minimal modifications in the way of prompts into a strong FSCIL classifier. Concretely, we synergistically utilize two complementary types of prompts in each Transformer block: static prompts to bridge the domain gap between the pre-training and downstream datasets, thus enabling better adaption; and dynamic prompts to capture instance-aware semantics, thus enabling easy transfer from base to novel classes. Specially, to generate dynamic prompts, we leverage a pre-trained multi-modal model to extract input-related diverse semantics, thereby generating complementary input-aware prompts, and then adaptively adjust their importance across different layers. In this way, on top of the prompted visual embeddings, a simple prototype classifier can beat state-of-the-arts without further training on the incremental tasks. We conduct extensive experiments on four benchmarks to validate the effectiveness of our DSS-Prompt and show that it consistently achieves better performance than existing approaches on all datasets and can alleviate the catastrophic forgetting issue as well.
[180] MeMoSORT: Memory-Assisted Filtering and Motion-Adaptive Association Metric for Multi-Person Tracking
Yingjie Wang, Zhixing Wang, Le Zheng, Tianxiao Liu, Roujing Li, Xueyao Hu
Main category: cs.CV
TL;DR: MeMoSORT is a real-time multi-object tracker addressing limitations of traditional methods with a memory-assisted Kalman filter and motion-adaptive IoU, achieving top performance on DanceTrack and SportsMOT.
Details
Motivation: Overcome challenges in MOT like complex motion and occlusions, where conventional methods fail due to rigid motion models and association rules.
Method: Introduces MeKF for better motion modeling and Mo-IoU for adaptive association, both lightweight and efficient.
Result: Achieves HOTA scores of 67.9% (DanceTrack) and 82.1% (SportsMOT), outperforming existing methods.
Conclusion: MeMoSORT effectively addresses MOT challenges with innovative yet simple components, proving robust in real-world scenarios.
Abstract: Multi-object tracking (MOT) in human-dominant scenarios, which involves continuously tracking multiple people within video sequences, remains a significant challenge in computer vision due to targets’ complex motion and severe occlusions. Conventional tracking-by-detection methods are fundamentally limited by their reliance on Kalman filter (KF) and rigid Intersection over Union (IoU)-based association. The motion model in KF often mismatches real-world object dynamics, causing filtering errors, while rigid association struggles under occlusions, leading to identity switches or target loss. To address these issues, we propose MeMoSORT, a simple, online, and real-time MOT tracker with two key innovations. First, the Memory-assisted Kalman filter (MeKF) uses memory-augmented neural networks to compensate for mismatches between assumed and actual object motion. Second, the Motion-adaptive IoU (Mo-IoU) adaptively expands the matching space and incorporates height similarity to reduce the influence of detection errors and association failures, while remaining lightweight. Experiments on DanceTrack and SportsMOT show that MeMoSORT achieves state-of-the-art performance, with HOTA scores of 67.9% and 82.1%, respectively.
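As a rough illustration of how a motion-adaptive IoU can expand the matching space and reward height agreement, here is a numpy sketch; the expansion factor and the exact combination of terms are assumptions, since the paper adapts them to the target's motion:

```python
import numpy as np

def mo_iou(det, trk, expand=0.3):
    """Motion-adaptive IoU sketch: expand both boxes to widen the matching
    space, then scale the IoU by a height-similarity term.  Boxes are
    (x1, y1, x2, y2); `expand` stands in for a motion-dependent factor."""
    def expanded(b):
        w, h = b[2] - b[0], b[3] - b[1]
        return np.array([b[0] - expand * w, b[1] - expand * h,
                         b[2] + expand * w, b[3] + expand * h])
    a, b = expanded(np.asarray(det, float)), expanded(np.asarray(trk, float))
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2]-a[0]) * (a[3]-a[1]) + (b[2]-b[0]) * (b[3]-b[1]) - inter
    ha, hb = a[3] - a[1], b[3] - b[1]
    height_sim = min(ha, hb) / max(ha, hb)       # penalize height mismatch
    return (inter / union) * height_sim

print(round(mo_iou((0, 0, 10, 20), (2, 1, 12, 22)), 3))
```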
[181] Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions
Lucas Möller, Pascal Tilli, Ngoc Thang Vu, Sebastian Padó
Main category: cs.CV
TL;DR: The paper introduces a second-order method to analyze feature interactions in dual encoder models like Clip, revealing their fine-grained visual-linguistic grounding abilities and limitations.
Details
Motivation: How dual encoder models such as Clip compare their two inputs is not well understood, because first-order feature-attribution methods cannot capture feature interactions. This paper aims to address that gap.
Method: The authors derive a second-order attribution method to analyze feature interactions in dual encoders and apply it to Clip models.
Result: Clip models learn fine-grained correspondences between captions and image regions, but performance varies by object class and exhibits out-of-domain effects.
Conclusion: The study highlights Clip’s visual-linguistic grounding capabilities and identifies systematic errors, providing insights for future improvements.
Abstract: Dual encoder architectures like Clip models map two types of inputs into a shared embedding space and predict similarities between them. Despite their wide application, it is, however, not understood how these models compare their two inputs. Common first-order feature-attribution methods explain importances of individual features and can, thus, only provide limited insights into dual encoders, whose predictions depend on interactions between features. In this paper, we first derive a second-order method enabling the attribution of predictions by any differentiable dual encoder onto feature-interactions between its inputs. Second, we apply our method to Clip models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This intrinsic visual-linguistic grounding ability, however, varies heavily between object classes, exhibits pronounced out-of-domain effects and we can identify individual errors as well as systematic failure categories. Code is publicly available: https://github.com/lucasmllr/exCLIP
[182] MUJICA: Reforming SISR Models for PBR Material Super-Resolution via Cross-Map Attention
Xin Du, Maoyuan Xu, Zhi Ying
Main category: cs.CV
TL;DR: MUJICA is a method to enhance PBR material upscaling by integrating cross-map attention into pre-trained SISR models, improving consistency and performance.
Details
Motivation: Existing SISR methods for PBR materials face cross-map inconsistency and limited generalization, necessitating a better approach.
Method: MUJICA adapts pre-trained Swin-transformer-based SISR models using cross-map attention to fuse features while preserving reconstruction ability.
Result: MUJICA improves PSNR, SSIM, and LPIPS scores, ensuring cross-map consistency and efficient training with limited resources.
Conclusion: MUJICA delivers state-of-the-art performance for PBR material super-resolution, addressing key limitations of existing methods.
Abstract: Physically Based Rendering (PBR) materials are typically characterized by multiple 2D texture maps such as basecolor, normal, metallic, and roughness which encode spatially-varying bi-directional reflectance distribution function (SVBRDF) parameters to model surface reflectance properties and microfacet interactions. Upscaling SVBRDF material is valuable for modern 3D graphics applications. However, existing Single Image Super-Resolution (SISR) methods struggle with cross-map inconsistency, inadequate modeling of modality-specific features, and limited generalization due to data distribution shifts. In this work, we propose Multi-modal Upscaling Joint Inference via Cross-map Attention (MUJICA), a flexible adapter that reforms pre-trained Swin-transformer-based SISR models for PBR material super-resolution. MUJICA is seamlessly attached after the pre-trained and frozen SISR backbone. It leverages cross-map attention to fuse features while preserving remarkable reconstruction ability of the pre-trained SISR model. Applied to SISR models such as SwinIR, DRCT, and HMANet, MUJICA improves PSNR, SSIM, and LPIPS scores while preserving cross-map consistency. Experiments demonstrate that MUJICA enables efficient training even with limited resources and delivers state-of-the-art performance on PBR material datasets.
[183] Poaching Hotspot Identification Using Satellite Imagery
Aryan Pandhi, Shrey Baid, Sanjali Jha
Main category: cs.CV
TL;DR: The paper discusses the rise in elephant poaching in Africa, driven by ivory demand, and proposes a Computer Vision (CV) model to dynamically identify poaching hotspots using geographic indicators and satellite imagery.
Details
Motivation: Elephant poaching remains a critical issue, with poachers adapting to avoid patrols and shifting hotspots. Current anti-poaching efforts are ineffective in remote areas, necessitating an automated solution.
Method: The proposed method involves using a CV model to analyze geographic indicators (e.g., watering holes, seasons, altitude) and satellite imagery to identify poaching hotspots without manual tracking.
Result: The CV model can survey large areas efficiently, adapting dynamically to shifting poaching grounds without disturbing wildlife or violating cross-border aviation restrictions.
Conclusion: A CV-based approach offers a scalable, non-invasive solution to combat elephant poaching by targeting resources effectively in dynamic hotspots.
Abstract: Elephant Poaching in African countries has been a decade-old problem. So much so that African Forest Elephants are now listed as an endangered species, and African Savannah Elephants as critically endangered by the IUCN (International Union for Conservation of Nature). [1] Elephants are hunted primarily for their ivory tusks which caused many elephants to be born tuskless as a genetic modification for survival. [2] Data gathered by recent studies shows that though poaching methods remain the same, the poaching grounds are rather dynamic. Poachers have shifted to areas with less ranger patrols and several other factors like watering holes, seasons, altitude etc. cause constant shifts in poaching hotspot locations. [3] After a period of low poaching from 2000-2014, poaching numbers in African countries are now on the rise again – WWF (World Wildlife Fund) says there are 20,000 elephants poached annually [4]. In African countries, anti-poaching efforts are concentrated near towns, while a majority of poaching occurs in the deserted regions. All of these factors result in the need for a Computer Vision Model to identify poaching hotspots through locating the geographic indicators of favorable poaching regions. A CV model eliminates the need to manually track poachers and account for the environmental factors to deploy resources and its combination with satellite imagery allows us to survey large areas without disturbing local species or cross border aviation restrictions.
[184] Evolution of Low-Level and Texture Human-CLIP Alignment
Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Jorge Vila-Tomás, Jesus Malo, Valero Laparra
Main category: cs.CV
TL;DR: CLIP models initially align with low-level human image quality assessments but lose this alignment as training progresses due to a shift toward shape-based representations.
Details
Motivation: To understand why CLIP's correlation with low-level human image quality assessments peaks early in training and then declines.
Method: Investigates two factors: shape-texture bias alignment and classification accuracy drop under noise.
Result: CLIP first learns low-level visual features, aligning with human perception but increasing noise sensitivity. Later, it shifts to shape-based representations, improving robustness but reducing low-level alignment.
Conclusion: The findings reveal a trade-off between perceptual alignment and robustness, offering insights for optimizing vision-language models.
Abstract: During the training of multi-modal models like CLIP, we observed an intriguing phenomenon: the correlation with low-level human image quality assessments peaks in the early epochs before gradually declining. This study investigates this observation and seeks to understand its causes through two key factors: shape-texture bias alignment and classification accuracy drop under noise. Our findings suggest that CLIP initially learn low-level visual features, enhancing its alignment with low-level human perception but also increasing its sensitivity to noise and its texture bias. As training progresses, the model shifts toward more abstract shape-based representations, improving noise robustness but reducing alignment with low-level human perception. These results suggest that these factors shared an underlying learning mechanism and provide new insights into optimizing the trade-off between perceptual alignment and robustness in vision-language models.
[185] ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video
Rajan Das Gupta, Md Yeasin Rahat, Nafiz Fahad, Abir Ahmed, Liew Tze Hui
Main category: cs.CV
TL;DR: ViMoNet combines motion and video data to better understand human behavior, outperforming existing methods in caption generation and behavior interpretation.
Details
Motivation: Current models focus only on motion or video data, missing nuanced human actions. Combining both types is essential for a complete understanding.
Method: ViMoNet uses joint training with motion-text and video-text data, leveraging their strengths. A new dataset, VIMOS, and benchmark, ViMoNet-Bench, are introduced.
Result: ViMoNet excels in caption generation, motion understanding, and behavior interpretation compared to existing methods.
Conclusion: Combining motion and video data in ViMoNet provides a richer understanding of human behavior, validated by superior performance on benchmarks.
Abstract: This study investigates how large language models (LLMs) can be used to understand human behavior using motion and video data. We think that mixing both types is essential to completely capture the nuanced movements and meanings of human actions, in contrast to recent models that simply concentrate on motion data or films. To address this, we provide ViMoNet, a straightforward yet effective framework for comprehending, characterizing, and deducing human action. ViMoNet employs a joint training strategy that leverages the advantages of two data types: detailed motion-text data, which is more exact, and generic video-text data, which is more comprehensive but less detailed. This aids in the model’s acquisition of rich data regarding time and space in human behavior. Additionally, we provide a brand new dataset named VIMOS that contains a variety of films, motion sequences, instructions, and subtitles. We developed ViMoNet-Bench, a standardized benchmark with carefully labeled samples, to evaluate how well models understand human behavior. Our tests show that ViMoNet outperforms existing methods in caption generation, motion understanding, and behavior interpretation.
[186] Physical Autoregressive Model for Robotic Manipulation without Action Pretraining
Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, Guangrun Wang
Main category: cs.CV
TL;DR: PAR, a Physical Autoregressive Model, leverages video pretraining for robotic manipulation, achieving high success rates and accurate video predictions without action pretraining.
Details
Motivation: The scarcity of manipulation data drives the need to utilize pretrained models from other modalities like video to understand physical dynamics in robotics.
Method: PAR combines frames and actions as physical tokens, uses a DiT-based de-tokenizer, and incorporates causal masks, inverse kinematics, parallel training, and KV-cache for efficiency.
Result: PAR achieves a 100% success rate on the PushCube task, matches action-pretrained baselines, and predicts future videos with aligned action trajectories.
Conclusion: PAR demonstrates the potential of transferring world knowledge from video pretraining to improve robotic manipulation.
Abstract: The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.
[187] KonfAI: A Modular and Fully Configurable Framework for Deep Learning in Medical Imaging
Valentin Boussot, Jean-Louis Dillenseger
Main category: cs.CV
TL;DR: KonfAI is a configurable deep learning framework for medical imaging, enabling workflow definition via YAML files for reproducibility and efficiency.
Details
Motivation: To simplify and standardize deep learning workflows in medical imaging while enhancing reproducibility and reducing development time.
Method: Uses structured YAML configuration files for declarative workflow definition, supports advanced strategies like patch-based learning and model ensembling, and allows custom components.
Result: Successfully applied to segmentation, registration, and image synthesis, achieving top results in medical imaging challenges.
Conclusion: KonfAI is a versatile, open-source framework that improves workflow efficiency and reproducibility in medical imaging tasks.
Abstract: KonfAI is a modular, extensible, and fully configurable deep learning framework specifically designed for medical imaging tasks. It enables users to define complete training, inference, and evaluation workflows through structured YAML configuration files, without modifying the underlying code. This declarative approach enhances reproducibility, transparency, and experimental traceability while reducing development time. Beyond the capabilities of standard pipelines, KonfAI provides native abstractions for advanced strategies including patch-based learning, test-time augmentation, model ensembling, and direct access to intermediate feature representations for deep supervision. It also supports complex multi-model training setups such as generative adversarial architectures. Thanks to its modular and extensible architecture, KonfAI can easily accommodate custom models, loss functions, and data processing components. The framework has been successfully applied to segmentation, registration, and image synthesis tasks, and has contributed to top-ranking results in several international medical imaging challenges. KonfAI is open source and available at https://github.com/vboussot/KonfAI.
[188] Reverse Convolution and Its Applications to Image Restoration
Xuhong Huang, Shiqi Liu, Kai Zhang, Ying Tai, Jian Yang, Hui Zeng, Lei Zhang
Main category: cs.CV
TL;DR: The paper introduces a novel depthwise reverse convolution operator to effectively reverse depthwise convolution, addressing the lack of a true inverse for convolution in neural networks.
Details
Motivation: Existing transposed convolution does not serve as a true inverse of convolution, limiting neural network design. The paper aims to fill this gap.
Method: Proposes a depthwise reverse convolution operator via regularized least-squares optimization, investigates implementation details, and constructs a Transformer-like block.
Result: ConverseNet variants outperform conventional methods in tasks like denoising, super-resolution, and deblurring.
Conclusion: The reverse convolution operator is effective and could inspire new deep learning operators.
Abstract: Convolution and transposed convolution are fundamental operators widely used in neural networks. However, transposed convolution (a.k.a. deconvolution) does not serve as a true inverse of convolution due to inherent differences in their mathematical formulations. To date, no reverse convolution operator has been established as a standard component in neural architectures. In this paper, we propose a novel depthwise reverse convolution operator as an initial attempt to effectively reverse depthwise convolution by formulating and solving a regularized least-squares optimization problem. We thoroughly investigate its kernel initialization, padding strategies, and other critical aspects to ensure its effective implementation. Building upon this operator, we further construct a reverse convolution block by combining it with layer normalization, 1×1 convolution, and GELU activation, forming a Transformer-like structure. The proposed operator and block can directly replace conventional convolution and transposed convolution layers in existing architectures, leading to the development of ConverseNet. Corresponding to typical image restoration models such as DnCNN, SRResNet and USRNet, we train three variants of ConverseNet for Gaussian denoising, super-resolution and deblurring, respectively. Extensive experiments demonstrate the effectiveness of the proposed reverse convolution operator as a basic building module. We hope this work could pave the way for developing new operators in deep model design and applications.
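The regularized least-squares view has a classical closed form under circular boundary conditions, which makes the idea concrete: per frequency, X = conj(K) Y / (|K|² + λ). Below is a numpy round-trip sketch of that formulation; the paper's learnable depthwise operator builds on this idea but differs in kernel initialization, padding, and other details:

```python
import numpy as np

def reverse_conv2d(y, kernel, lam=1e-2):
    """Closed-form reverse convolution for one channel under circular
    boundaries: x* = argmin ||k (*) x - y||^2 + lam * ||x||^2, solved
    per-frequency as X = conj(K) Y / (|K|^2 + lam)."""
    K = np.fft.fft2(kernel, s=y.shape)
    Y = np.fft.fft2(y)
    X = np.conj(K) * Y / (np.abs(K) ** 2 + lam)
    return np.real(np.fft.ifft2(X))

# Round trip: blur an image with a box kernel, then reverse the convolution.
rng = np.random.default_rng(0)
x = rng.random((32, 32))
k = np.ones((3, 3)) / 9.0
y = np.real(np.fft.ifft2(np.fft.fft2(k, s=x.shape) * np.fft.fft2(x)))  # circular blur
x_hat = reverse_conv2d(y, k, lam=1e-4)
print(float(np.abs(x - x_hat).mean()))   # mean absolute error stays small for this mild blur
```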
[189] Hierarchical Graph Attention Network for No-Reference Omnidirectional Image Quality Assessment
Hao Yang, Xu Zhang, Jiaqi Ma, Linwei Zhu, Yun Zhang, Huan Zhang
Main category: cs.CV
TL;DR: A graph neural network-based OIQA framework improves quality assessment by modeling spatial distortion non-uniformity through structured viewports and advanced feature extraction.
Details
Motivation: Existing OIQA methods fail to address locally non-uniform distortions due to poor modeling of spatial quality variations and inadequate feature representation.
Method: Proposes a framework using Fibonacci sphere sampling for viewport generation, multi-stage feature extraction, and integrates Graph Attention Network (GAT) with a graph transformer for spatial dependency modeling.
Result: Outperforms existing methods on large-scale OIQA databases, showing strong generalization and effectiveness.
Conclusion: The proposed framework successfully addresses spatial distortion non-uniformity, offering superior performance in omnidirectional image quality assessment.
Abstract: Current Omnidirectional Image Quality Assessment (OIQA) methods struggle to evaluate locally non-uniform distortions due to inadequate modeling of spatial variations in quality and ineffective feature representation capturing both local details and global context. To address this, we propose a graph neural network-based OIQA framework that explicitly models structural relationships between viewports to enhance perception of spatial distortion non-uniformity. Our approach employs Fibonacci sphere sampling to generate viewports with well-structured topology, representing each as a graph node. Multi-stage feature extraction networks then derive high-dimensional node representation. To holistically capture spatial dependencies, we integrate a Graph Attention Network (GAT) modeling fine-grained local distortion variations among adjacent viewports, and a graph transformer capturing long-range quality interactions across distant regions. Extensive experiments on two large-scale OIQA databases with complex spatial distortions demonstrate that our method significantly outperforms existing approaches, confirming its effectiveness and strong generalization capability.
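Fibonacci sphere sampling, which gives the viewport graph its well-structured topology, is a short golden-angle construction. A numpy sketch:

```python
import numpy as np

def fibonacci_sphere(n):
    """Near-uniform viewport centers on the unit sphere via the golden-angle
    spiral; each returned point would become one graph node."""
    i = np.arange(n)
    phi = (1 + np.sqrt(5)) / 2                 # golden ratio
    z = 1 - (2 * i + 1) / n                    # uniform spacing in z
    theta = 2 * np.pi * i / phi                # golden-angle longitude
    r = np.sqrt(1 - z ** 2)
    return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

pts = fibonacci_sphere(20)                     # 20 viewport centers -> 20 graph nodes
print(pts.shape, np.allclose(np.linalg.norm(pts, axis=1), 1.0))
```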
[190] Enhancing Diffusion Face Generation with Contrastive Embeddings and SegFormer Guidance
Dhruvraj Singh Rawat, Enggen Sherpa, Rishikesan Kirupanantha, Tin Hoang
Main category: cs.CV
TL;DR: A benchmark study on diffusion models for human face generation, comparing UNet and DiT architectures and exploring LoRA-based fine-tuning. Key contributions include InfoNCE loss for attribute embedding and SegFormer-based segmentation, improving controllability.
Details
Motivation: To evaluate and enhance diffusion models for controlled face generation, especially in limited data settings like the CelebAMask-HQ dataset.
Method: Compares UNet and DiT for unconditional generation, uses LoRA for fine-tuning Stable Diffusion, and integrates InfoNCE loss and SegFormer for better attribute and segmentation encoding.
Result: Improved semantic alignment and controllability in attribute-guided face generation, demonstrating the effectiveness of contrastive embedding and advanced segmentation.
Conclusion: The study successfully enhances diffusion models for controlled face generation, proving the value of InfoNCE loss and SegFormer in limited-data scenarios.
Abstract: We present a benchmark of diffusion models for human face generation on a small-scale CelebAMask-HQ dataset, evaluating both unconditional and conditional pipelines. Our study compares UNet and DiT architectures for unconditional generation and explores LoRA-based fine-tuning of pretrained Stable Diffusion models as a separate experiment. Building on the multi-conditioning approach of Giambi and Lisanti, which uses both attribute vectors and segmentation masks, our main contribution is the integration of an InfoNCE loss for attribute embedding and the adoption of a SegFormer-based segmentation encoder. These enhancements improve the semantic alignment and controllability of attribute-guided synthesis. Our results highlight the effectiveness of contrastive embedding learning and advanced segmentation encoding for controlled face generation in limited data settings.
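InfoNCE itself is standard: embeddings of matched pairs are pulled together against in-batch negatives. Here is a minimal symmetric PyTorch version; pairing image embeddings with attribute embeddings, as assumed below, is one way such a loss plugs into attribute-guided synthesis:

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, tau=0.07):
    """Symmetric InfoNCE over a batch: row i of z_a (e.g., an image
    embedding) should match row i of z_b (e.g., its attribute embedding);
    every other row in the batch serves as a negative."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau               # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```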
[191] ARI3D: A Software for Interactive Quantification of Regions in X-Ray CT 3D Images
Jan Phillipp Albrecht, Jose R. A. Godinho, Christina Hübers, Deborah Schmidt
Main category: cs.CV
TL;DR: ARI3D is a software tool designed to assist in the interactive analysis of 3D X-ray CT images, improving phase identification, addressing partial volume effects, and enhancing quantification accuracy.
Details
Motivation: The challenges in quantitative analysis of 3D X-ray CT images due to imaging artifacts like beam hardening and partial volume effects necessitate user-driven decisions, prompting the development of ARI3D.
Method: ARI3D provides an interactive protocol to classify and quantify objects in 3D CT images, focusing on phase identification, partial volume correction, and accurate quantification.
Result: ARI3D improves phase identification, handles partial volume effects, increases detection limits, and standardizes quantitative 3D analysis across scientific fields.
Conclusion: ARI3D offers a robust solution for analyzing 3D CT images, addressing key challenges and enhancing accuracy and usability in microstructure quantification.
Abstract: X-ray computed tomography (CT) is the main 3D technique for imaging the internal microstructures of materials. Quantitative analysis of the microstructures is usually achieved by applying a sequence of steps that are implemented to the entire 3D image. This is challenged by various imaging artifacts inherent from the technique, e.g., beam hardening and partial volume. Consequently, the analysis requires users to make a number of decisions to segment and classify the microstructures based on the voxel gray-values. In this context, a software tool, here called ARI3D, is proposed to interactively analyze regions in three-dimensional X-ray CT images, assisting users through the various steps of a protocol designed to classify and quantify objects within regions of a three-dimensional image. ARI3D aims to 1) Improve phase identification; 2) Account for partial volume effect; 3) Increase the detection limit and accuracy of object quantification; and 4) Harmonize quantitative 3D analysis that can be implemented in different fields of science.
[192] Do Vision Transformers See Like Humans? Evaluating their Perceptual Alignment
Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Jorge Vila-Tomás, Valero Laparra, Jesus Malo
Main category: cs.CV
TL;DR: The study examines how Vision Transformers (ViTs) align with human perception, finding that larger models, repeated image exposure, and strong data augmentation reduce alignment.
Details
Motivation: To understand how ViTs' perceptual alignment with humans is affected by model size, dataset size, augmentation, and regularization.
Method: Systematic analysis of ViTs on the TID2013 dataset, varying model size, dataset diversity, training cycles, and augmentation/regularization.
Result: Larger models and repeated training reduce alignment; dataset diversity has minimal impact, while strong augmentation/regularization worsens alignment.
Conclusion: There’s a trade-off between model complexity, training strategies, and human-like perception, important for applications needing human-like vision.
Abstract: Vision Transformers (ViTs) achieve remarkable performance in image recognition tasks, yet their alignment with human perception remains largely unexplored. This study systematically analyzes how model size, dataset size, data augmentation and regularization impact ViT perceptual alignment with human judgments on the TID2013 dataset. Our findings confirm that larger models exhibit lower perceptual alignment, consistent with previous works. Increasing dataset diversity has a minimal impact, but exposing models to the same images more times reduces alignment. Stronger data augmentation and regularization further decrease alignment, especially in models exposed to repeated training cycles. These results highlight a trade-off between model complexity, training strategies, and alignment with human perception, raising important considerations for applications requiring human-like visual understanding.
[193] OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better
Yupeng Zhou, Zhen Li, Ziheng Ouyang, Yuming Chen, Ruoyi Du, Daquan Zhou, Bin Fu, Yihao Liu, Peng Gao, Ming-Ming Cheng, Qibin Hou
Main category: cs.CV
TL;DR: OneVAE enhances discrete video VAEs by leveraging continuous VAE priors, improving training stability, speed, and performance, while introducing structural improvements like multi-token quantization and joint optimization.
Details
Motivation: To address unstable training, long training time, and degraded reconstruction quality in discrete video VAEs by leveraging continuous VAE priors and improving structural design.
Method: Uses FSQ to preserve continuous VAE priors, introduces multi-token quantization for better reconstruction, strengthens first-frame reconstruction, and proposes joint discrete-continuous optimization.
Result: Achieves faster convergence, superior performance, and competitive results in both continuous and discrete representations within a single network.
Conclusion: OneVAE successfully bridges discrete and continuous video representations, offering improved efficiency and performance.
Abstract: Encoding videos into discrete tokens could align with text tokens to facilitate concise and unified multi-modal LLMs, yet introducing significant spatiotemporal compression compared to continuous video representation. Previous discrete video VAEs experienced unstable training, long training time, and degraded reconstruction quality. Given the easier training and superior performance of continuous VAEs, an intuitive idea is to enhance discrete video VAEs by leveraging continuous VAEs. After rethinking the intrinsic link between discrete and continuous representations, we found that FSQ could effectively preserve pre-trained continuous VAE priors compared to other quantization methods. By leveraging continuous VAE priors, it converges several times faster than training from scratch and achieves superior performance at convergence. Meanwhile, two structural improvements are proposed. First, inspired by how continuous VAEs enhance reconstruction via enlarged latent dimensions, we introduce a multi-token quantization mechanism, which achieves nearly a 1 dB improvement in PSNR without compromising the token compression ratio. Second, to tackle reconstruction challenges in high-compression video VAEs, we strengthen first-frame reconstruction, enabling the causal VAE to leverage this information in subsequent frames and markedly improving the performance of 4 x 16 x 16 discrete VAEs. Furthermore, we propose a joint discrete-continuous optimization scheme that unifies the two paradigms and, for the first time, achieves competitive performance on both continuous and discrete representations within a single network. We name our method OneVAE to reflect this connection.
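FSQ (finite scalar quantization) is simple enough to state in a few lines: bound each latent channel, round it onto a small uniform grid, and pass gradients straight through, which is why a continuous VAE's latent priors can survive quantization. A minimal PyTorch sketch, with a single shared level count as a simplification:

```python
import torch

def fsq(z, levels=8):
    """Finite Scalar Quantization sketch: bound each latent channel with
    tanh, round onto `levels` uniform values, and use a straight-through
    estimator so gradients flow through the continuous tanh path."""
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half                    # values in (-half, half)
    quantized = torch.round(bounded)                  # nearest of `levels` integers
    return bounded + (quantized - bounded).detach()   # straight-through estimator

z = torch.randn(2, 4, requires_grad=True)
q = fsq(z)
q.sum().backward()                                    # gradients reach z via tanh
print(q, z.grad is not None)
```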
[194] Stable Diffusion Models are Secretly Good at Visual In-Context Learning
Trevine Oorloff, Vishwanath Sindagi, Wele Gedara Chaminda Bandara, Ali Shafahi, Amin Ghiasi, Charan Prakash, Reza Ardekani
Main category: cs.CV
TL;DR: The paper demonstrates that off-the-shelf Stable Diffusion models can be adapted for visual in-context learning (V-ICL) without fine-tuning, achieving strong performance across six vision tasks.
Details
Motivation: To simplify and generalize visual in-context learning (V-ICL) by avoiding specialized training or additional data, leveraging existing Stable Diffusion models.
Method: Repurposes Stable Diffusion by modifying self-attention layers to incorporate context between query and example prompts, enabling adaptation to tasks without fine-tuning.
Result: Improves performance on six tasks, e.g., boosting foreground segmentation mIoU by 8.9% and 3.2% over recent methods. Ensembling further enhances results.
Conclusion: Stable Diffusion can be effectively repurposed for V-ICL, offering a simple, generalizable solution for diverse vision tasks without additional training.
Abstract: Large language models (LLM) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL) – the ability to leverage a few sets of example prompts to adapt to various tasks without having to explicitly update the model weights. ICL has recently been explored for computer vision tasks with promising early outcomes. These approaches involve specialized training and/or additional data that complicate the process and limit its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine-tuning, we show that this repurposed Stable Diffusion model is able to adapt to six different tasks: foreground segmentation, single object detection, semantic segmentation, keypoint detection, edge detection, and colorization. For example, the proposed approach improves the mean intersection over union (mIoU) for the foreground segmentation task on Pascal-5i dataset by 8.9% and 3.2% over recent methods such as Visual Prompting and IMProv, respectively. Additionally, we show that the proposed method is able to effectively leverage multiple prompts through ensembling to infer the task better and further improve the performance.
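The in-place attention re-computation can be pictured as letting the query image's tokens attend over keys and values from both itself and the example prompt inside a frozen self-attention layer. A hedged PyTorch sketch, where `to_qkv` is a stand-in for the frozen layer's projections and the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def in_context_self_attention(q_tokens, example_tokens, to_qkv):
    """Sketch of in-place attention re-computation: query tokens attend over
    both their own keys/values and those of the example prompt, injecting
    context without any fine-tuning."""
    q, k1, v1 = to_qkv(q_tokens)
    _, k2, v2 = to_qkv(example_tokens)
    k = torch.cat([k1, k2], dim=1)             # keys from query + example tokens
    v = torch.cat([v1, v2], dim=1)
    return F.scaled_dot_product_attention(q, k, v)

d = 32
W = torch.nn.Linear(d, 3 * d)                  # hypothetical frozen qkv projection
to_qkv = lambda x: W(x).chunk(3, dim=-1)
out = in_context_self_attention(torch.randn(1, 16, d), torch.randn(1, 16, d), to_qkv)
print(out.shape)                               # torch.Size([1, 16, 32])
```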
[195] HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics
Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang
Main category: cs.CV
TL;DR: HumanGenesis is a framework addressing geometric inconsistency and motion generalization in synthetic human dynamics by integrating geometric and generative modeling through four collaborative agents.
Details
Motivation: Current approaches struggle with geometric inconsistency, coarse reconstruction, motion generalization limitations, and scene inharmonization. HumanGenesis aims to overcome these challenges.
Method: The framework uses four agents: Reconstructor for 3D-consistent representations, Critique Agent for refining fidelity, Pose Guider for motion generalization, and Video Harmonizer for photorealistic synthesis.
Result: HumanGenesis achieves state-of-the-art performance in text-guided synthesis, video reenactment, and novel-pose generalization.
Conclusion: The framework significantly improves expressiveness, geometric fidelity, and scene integration in synthetic human dynamics.
Abstract: Synthetic human dynamics aims to generate photorealistic videos of human subjects performing expressive, intention-driven motions. However, current approaches face two core challenges: (1) geometric inconsistency and coarse reconstruction, due to limited 3D modeling and detail preservation; and (2) motion generalization limitations and scene inharmonization, stemming from weak generative capabilities. To address these, we present HumanGenesis, a framework that integrates geometric and generative modeling through four collaborative agents: (1) Reconstructor builds 3D-consistent human-scene representations from monocular video using 3D Gaussian Splatting and deformation decomposition. (2) Critique Agent enhances reconstruction fidelity by identifying and refining poor regions via multi-round MLLM-based reflection. (3) Pose Guider enables motion generalization by generating expressive pose sequences using time-aware parametric encoders. (4) Video Harmonizer synthesizes photorealistic, coherent video via a hybrid rendering pipeline with diffusion, refining the Reconstructor through a Back-to-4D feedback loop. HumanGenesis achieves state-of-the-art performance on tasks including text-guided synthesis, video reenactment, and novel-pose generalization, significantly improving expressiveness, geometric fidelity, and scene integration.
[196] Story2Board: A Training-Free Approach for Expressive Storyboard Generation
David Dinkevich, Matan Levy, Omri Avrahami, Dvir Samuel, Dani Lischinski
Main category: cs.CV
TL;DR: Story2Board is a training-free framework for generating expressive storyboards from natural language, focusing on visual storytelling aspects like spatial composition and narrative pacing. It uses latent panel anchoring and reciprocal attention value mixing for coherence without fine-tuning, outperforming existing methods in dynamic and engaging storyboard generation.
Details
Motivation: Existing storyboard generation methods overlook key visual storytelling aspects like spatial composition and narrative pacing, focusing narrowly on subject identity. Story2Board aims to address these gaps.
Method: The framework includes Latent Panel Anchoring for shared character references and Reciprocal Attention Value Mixing for blending visual features. It uses an off-the-shelf language model to convert stories into panel-level prompts.
Result: Story2Board generates visually diverse yet consistent storyboards, outperforming baselines in coherence, layout diversity, and narrative engagement, as shown by qualitative, quantitative, and user study results.
Conclusion: Story2Board advances storyboard generation by enhancing visual storytelling coherence and diversity without architectural changes, setting a new standard for expressive and engaging storyboards.
Abstract: We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.
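A hedged sketch of what Reciprocal Attention Value Mixing could look like in code: value features are blended between token pairs only in proportion to how strongly they attend to each other in both directions. The scoring and blending rule here are illustrative assumptions drawn from the abstract, not the paper's exact formulation.

```python
import torch

def reciprocal_value_mixing(v_a, v_b, attn_ab, attn_ba, alpha=0.5):
    """v_a: (Na, D), v_b: (Nb, D) value features of two panels;
    attn_ab: (Na, Nb) and attn_ba: (Nb, Na) cross-panel attention maps."""
    recip = attn_ab * attn_ba.T                        # high only if mutual
    recip = recip / (recip.sum(-1, keepdim=True) + 1e-8)
    # Softly pull panel-A values toward reciprocally matched panel-B values.
    return (1 - alpha) * v_a + alpha * (recip @ v_b)

Na, Nb, D = 8, 8, 32
A_ab = torch.softmax(torch.randn(Na, Nb), dim=-1)
A_ba = torch.softmax(torch.randn(Nb, Na), dim=-1)
mixed = reciprocal_value_mixing(torch.randn(Na, D), torch.randn(Nb, D), A_ab, A_ba)
```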
[197] E-4DGS: High-Fidelity Dynamic Reconstruction from the Multi-view Event Cameras
Chaoran Feng, Zhenyu Tang, Wangbo Yu, Yatian Pang, Yian Zhao, Jianbin Zhao, Li Yuan, Yonghong Tian
Main category: cs.CV
TL;DR: E-4DGS is an event-driven dynamic Gaussian Splatting method for novel view synthesis from multi-view event streams, addressing limitations of RGB cameras in high-speed and low-light scenes.
Details
Motivation: Overcome RGB camera limitations (lighting dependence, motion blur, dynamic range) by leveraging event cameras' advantages (low power, high temporal resolution, high dynamic range) for scene reconstruction in challenging conditions.
Method: Proposes event-based initialization, event-adaptive slicing splatting, intensity importance pruning, and an adaptive contrast threshold for stable training and precise optimization. Uses a synthetic multi-view event camera setup.
Result: Outperforms event-only and event-RGB fusion baselines, demonstrating effectiveness in rapid scene capture.
Conclusion: E-4DGS advances multi-view event-based reconstruction, offering a robust solution for high-speed and low-light scenarios.
Abstract: Novel view synthesis and 4D reconstruction techniques predominantly rely on RGB cameras, thereby inheriting inherent limitations such as the dependence on adequate lighting, susceptibility to motion blur, and a limited dynamic range. Event cameras, offering advantages of low power, high temporal resolution and high dynamic range, have brought a new perspective to addressing the scene reconstruction challenges in high-speed motion and low-light scenes. To this end, we propose E-4DGS, the first event-driven dynamic Gaussian Splatting approach, for novel view synthesis from multi-view event streams with fast-moving cameras. Specifically, we introduce an event-based initialization scheme to ensure stable training and propose event-adaptive slicing splatting for time-aware reconstruction. Additionally, we employ intensity importance pruning to eliminate floating artifacts and enhance 3D consistency, while incorporating an adaptive contrast threshold for more precise optimization. We design a synthetic multi-view camera setup with six moving event cameras surrounding the object in a 360-degree configuration and provide a benchmark multi-view event stream dataset that captures challenging motion scenarios. Our approach outperforms both event-only and event-RGB fusion baselines and paves the way for the exploration of multi-view event-based reconstruction as a novel approach for rapid scene capture.
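For readers unfamiliar with event supervision, the sketch below shows the standard term such pipelines build on: the change in rendered log intensity between two timestamps should match the accumulated event polarities scaled by the contrast threshold. The paper's adaptive-threshold and slicing logic is not reproduced here; this is only the generic formulation.

```python
import torch

def event_consistency_loss(log_I_t0, log_I_t1, event_sum, C=0.2):
    """log_I_*: rendered log intensities (H, W) at two timestamps;
    event_sum: per-pixel signed sum of event polarities over [t0, t1];
    C: contrast threshold (made adaptive in the paper)."""
    predicted_change = log_I_t1 - log_I_t0
    return ((predicted_change - C * event_sum) ** 2).mean()
```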
[198] SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
Yachao Liang, Min Yu, Gang Li, Jianguo Jiang, Boquan Li, Feng Yu, Ning Zhang, Xiang Meng, Weiqing Huang
Main category: cs.CV
TL;DR: The paper proposes a novel audio-visual speech representation learning method for detecting face forgery videos, achieving superior cross-dataset generalization and robustness without using fake videos in training.
Details
Motivation: The challenge of detecting face forgery videos, especially generalizing to unseen datasets and perturbations, motivates leveraging audio-visual speech elements for precise facial movement reflection.
Method: The approach involves self-supervised masked prediction on real videos to learn audio-visual speech representations, encoding local and global semantic information, then transferring the model to forgery detection.
Result: The method outperforms state-of-the-art techniques in cross-dataset generalization and robustness, validated through extensive experiments.
Conclusion: The proposed audio-visual speech representation learning effectively addresses face forgery detection, demonstrating strong performance without relying on fake training data.
Abstract: Detection of face forgery videos remains a formidable challenge in the field of digital forensics, especially the generalization to unseen datasets and common perturbations. In this paper, we tackle this issue by leveraging the synergy between audio and visual speech elements, embarking on a novel approach through audio-visual speech representation learning. Our work is motivated by the finding that audio signals, enriched with speech content, can provide precise information effectively reflecting facial movements. To this end, we first learn precise audio-visual speech representations on real videos via a self-supervised masked prediction task, which encodes both local and global semantic information simultaneously. Then, the derived model is directly transferred to the forgery detection task. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods in terms of cross-dataset generalization and robustness, without the participation of any fake video in model training. Code is available at https://github.com/Eleven4AI/SpeechForensics.
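Since the derived model is transferred directly, the detection step can plausibly be as simple as scoring the agreement between audio and visual speech embeddings over a clip, with low agreement flagging a forgery; the sketch below assumes exactly that, and the encoders, dimensions, and threshold are placeholders rather than the released pipeline.

```python
import torch
import torch.nn.functional as F

def forgery_score(audio_emb, visual_emb):
    """audio_emb, visual_emb: (T, D) per-frame speech representations from
    a pretrained audio-visual model; higher score = more likely forged."""
    sim = F.cosine_similarity(audio_emb, visual_emb, dim=-1)  # (T,)
    return 1.0 - sim.mean()

score = forgery_score(torch.randn(100, 256), torch.randn(100, 256))
is_fake = score > 0.5  # illustrative threshold only
```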
[199] Towards Comprehensive Cellular Characterisation of H&E slides
Benjamin Adjadj, Pierre-Antoine Bannier, Guillaume Horent, Sebastien Mandela, Aurore Lyon, Kathryn Schutte, Ulysse Marteau, Valentin Gaury, Laura Dumont, Thomas Mathieu, Reda Belbahri, Benoît Schmauch, Eric Durand, Katharina Von Loga, Lucie Gillet
Main category: cs.CV
TL;DR: HistoPLUS is a state-of-the-art model for cell analysis in tumor microenvironments, outperforming existing methods in detection and classification while using fewer parameters. It also enables the study of understudied cell types and shows strong cross-domain generalization.
Details
Motivation: Existing methods for cell detection, segmentation, and classification in tumor microenvironments perform poorly on understudied cell types and lack cross-domain generalization.
Method: HistoPLUS is trained on a novel pan-cancer dataset of 108,722 nuclei covering 13 cell types and evaluated on 4 independent cohorts.
Result: HistoPLUS improves detection quality by 5.2% and classification F1 score by 23.7%, while using 5x fewer parameters. It also enables study of 7 understudied cell types and shows robust transfer to unseen oncology indications.
Conclusion: HistoPLUS advances cell analysis in tumor microenvironments, offering superior performance, efficiency, and broader applicability, with model weights and code publicly released.
Abstract: Cell detection, segmentation and classification are essential for analyzing tumor microenvironments (TME) on hematoxylin and eosin (H&E) slides. Existing methods suffer from poor performance on understudied cell types (rare or not present in public datasets) and limited cross-domain generalization. To address these shortcomings, we introduce HistoPLUS, a state-of-the-art model for cell analysis, trained on a novel curated pan-cancer dataset of 108,722 nuclei covering 13 cell types. In external validation across 4 independent cohorts, HistoPLUS outperforms current state-of-the-art models in detection quality by 5.2% and overall F1 classification score by 23.7%, while using 5x fewer parameters. Notably, HistoPLUS unlocks the study of 7 understudied cell types and brings significant improvements on 8 of 13 cell types. Moreover, we show that HistoPLUS robustly transfers to two oncology indications unseen during training. To support broader TME biomarker research, we release the model weights and inference code at https://github.com/owkin/histoplus/.
[200] Quo Vadis Handwritten Text Generation for Handwritten Text Recognition?
Vittorio Pippi, Konstantina Nikolaidou, Silvia Cascianelli, George Retsinas, Giorgos Sfikas, Rita Cucchiara, Marcus Liwicki
Main category: cs.CV
TL;DR: The paper evaluates three HTG models (GAN, diffusion, autoregressive) to improve HTR performance in low-resource settings, providing guidelines for model selection.
Details
Motivation: Challenges in digitizing historical manuscripts due to small, author-specific collections not matching training data distributions.
Method: Systematic comparison of three HTG models to assess their impact on HTR fine-tuning, analyzing visual/linguistic characteristics.
Result: Quantitative guidelines for selecting effective HTG models, insights into HTG capabilities, and areas for improvement.
Conclusion: HTG methods show promise for low-resource HTR but require further refinement for optimal application.
Abstract: The digitization of historical manuscripts presents significant challenges for Handwritten Text Recognition (HTR) systems, particularly when dealing with small, author-specific collections that diverge from the training data distributions. Handwritten Text Generation (HTG) techniques, which generate synthetic data tailored to specific handwriting styles, offer a promising solution to address these challenges. However, the effectiveness of various HTG models in enhancing HTR performance, especially in low-resource transcription settings, has not been thoroughly evaluated. In this work, we systematically compare three state-of-the-art styled HTG models (representing the generative adversarial, diffusion, and autoregressive paradigms for HTG) to assess their impact on HTR fine-tuning. We analyze how visual and linguistic characteristics of synthetic data influence fine-tuning outcomes and provide quantitative guidelines for selecting the most effective HTG model. The results of our analysis provide insights into the current capabilities of HTG methods and highlight key areas for further improvement in their application to low-resource HTR.
[201] Integrating Clinical Knowledge Graphs and Gradient-Based Neural Systems for Enhanced Melanoma Diagnosis via the 7-Point Checklist
Yuheng Wang, Tianze Yu, Jiayue Cai, Sunil Kalia, Harvey Lui, Z. Jane Wang, Tim K. Lee
Main category: cs.CV
TL;DR: The paper introduces a novel diagnostic framework combining a clinical knowledge-based topological graph (CKTG) and a gradient diagnostic strategy (GD-DDW) to improve melanoma detection beyond the traditional 7-point checklist (7PCL).
Details
Motivation: The traditional 7PCL is limited to distinguishing melanoma from melanocytic nevi and fails in complex scenarios with multiple similar skin diseases.
Method: The proposed framework integrates CKTG to model attribute relationships and GD-DDW with a data-driven weighting system. It also uses a dual-attention mechanism for multimodal feature extraction.
Result: The method achieved an average AUC of 88.6% on the EDRA dataset, outperforming traditional approaches.
Conclusion: The integrated system enhances melanoma diagnosis precision, offering data-driven benchmarks for clinicians.
Abstract: The 7-point checklist (7PCL) is a widely used diagnostic tool in dermoscopy for identifying malignant melanoma by assigning point values to seven specific attributes. However, the traditional 7PCL is limited to distinguishing between malignant melanoma and melanocytic nevi, and falls short in scenarios where multiple skin diseases with appearances similar to melanoma coexist. To address this limitation, we propose a novel diagnostic framework that integrates a clinical knowledge-based topological graph (CKTG) with a gradient diagnostic strategy featuring a data-driven weighting system (GD-DDW). The CKTG captures both the internal and external relationships among the 7PCL attributes, while the GD-DDW emulates dermatologists’ diagnostic processes, prioritizing visual observation before making predictions. Additionally, we introduce a multimodal feature extraction approach leveraging a dual-attention mechanism to enhance feature extraction through cross-modal interaction and unimodal collaboration. This method incorporates meta-information to uncover interactions between clinical data and image features, ensuring more accurate and robust predictions. Our approach, evaluated on the EDRA dataset, achieved an average AUC of 88.6%, demonstrating superior performance in melanoma detection and feature prediction. This integrated system provides data-driven benchmarks for clinicians, significantly enhancing the precision of melanoma diagnosis.
[202] AST-n: A Fast Sampling Approach for Low-Dose CT Reconstruction using Diffusion Models
Tomás de la Sotta, José M. Saavedra, Héctor Henríquez, Violeta Chang, Aline Xavier
Main category: cs.CV
TL;DR: AST-n is an accelerated inference framework for LDCT denoising using diffusion models, reducing steps and maintaining image quality.
Details
Motivation: LDCT reduces radiation but increases noise; diffusion models can improve image quality.
Method: AST-n initiates reverse diffusion from intermediate noise levels and uses high-order ODE solvers to reduce steps.
Result: AST-25 achieves PSNR >38 dB and SSIM >0.95, cutting inference time from ~16s to <1s per slice.
Conclusion: AST-n enables fast, high-quality LDCT reconstruction, making diffusion methods clinically feasible.
Abstract: Low-dose CT (LDCT) protocols reduce radiation exposure but increase image noise, compromising diagnostic confidence. Diffusion-based generative models have shown promise for LDCT denoising by learning image priors and performing iterative refinement. In this work, we introduce AST-n, an accelerated inference framework that initiates reverse diffusion from intermediate noise levels, and integrate high-order ODE solvers within conditioned models to further reduce sampling steps. We evaluate two acceleration paradigms, AST-n sampling and standard scheduling with high-order solvers, on the Low Dose CT Grand Challenge dataset, covering head, abdominal, and chest scans at 10-25% of standard dose. Conditioned models using only 25 steps (AST-25) achieve peak signal-to-noise ratio (PSNR) above 38 dB and structural similarity index (SSIM) above 0.95, closely matching standard baselines while cutting inference time from ~16 s to under 1 s per slice. Unconditional sampling suffers substantial quality loss, underscoring the necessity of conditioning. We also assess DDIM inversion, which yields marginal PSNR gains at the cost of doubling inference time, limiting its clinical practicality. Our results demonstrate that AST-n with high-order samplers enables rapid LDCT reconstruction without significant loss of image fidelity, advancing the feasibility of diffusion-based methods in clinical workflows.
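A schematic sketch of AST-n-style sampling, assuming a DDIM-like update rule and a toy noise schedule: instead of starting from pure Gaussian noise, the reverse process starts from the LDCT image noised to an intermediate level and takes only n solver steps. The schedule, the starting fraction, and eps_model are stand-ins, not the trained model from the paper.

```python
import torch

def ast_n_sample(eps_model, ldct, n_steps=25, t_frac=0.4, T=1000):
    """eps_model(x, t) -> predicted noise; ldct: (B, 1, H, W) low-dose scan."""
    abar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)
    t0 = int(t_frac * T)
    ts = torch.linspace(t0, 0, n_steps + 1).long()
    # Start from the noised LDCT image, not pure noise (the "AST" idea).
    x = abar[t0].sqrt() * ldct + (1 - abar[t0]).sqrt() * torch.randn_like(ldct)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t)
        x0 = (x - (1 - abar[t]).sqrt() * eps) / abar[t].sqrt()  # predicted clean image
        x = abar[t_prev].sqrt() * x0 + (1 - abar[t_prev]).sqrt() * eps  # DDIM step
    return x
```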
[203] Towards flexible perception with visual memory
Robert Geirhos, Priyank Jaini, Austin Stone, Sourabh Medapati, Xi Yi, George Toderici, Abhijit Ogale, Jonathon Shlens
Main category: cs.CV
TL;DR: The paper proposes a flexible alternative to traditional neural network training by combining deep neural networks with a database-like visual memory, enabling easy editing, scaling, and interpretability.
Details
Motivation: Traditional neural networks make editing knowledge difficult due to distributed weights. The paper aims to provide a more flexible and interpretable solution.
Method: Decomposes image classification into image similarity (using pre-trained embeddings) and search (via nearest neighbor retrieval from a knowledge database).
Result: Demonstrates capabilities like scalable data addition, removal through unlearning, and interpretable decision-making.
Conclusion: The approach highlights the benefits of an explicit visual memory and encourages rethinking knowledge representation in deep vision models.
Abstract: Training a neural network is a monolithic endeavor, akin to carving knowledge into stone: once the process is completed, editing the knowledge in a network is hard, since all information is distributed across the network’s weights. We here explore a simple, compelling alternative by marrying the representational power of deep neural networks with the flexibility of a database. Decomposing the task of image classification into image similarity (from a pre-trained embedding) and search (via fast nearest neighbor retrieval from a knowledge database), we build on well-established components to construct a simple and flexible visual memory that has the following key capabilities: (1.) The ability to flexibly add data across scales: from individual samples all the way to entire classes and billion-scale data; (2.) The ability to remove data through unlearning and memory pruning; (3.) An interpretable decision-mechanism on which we can intervene to control its behavior. Taken together, these capabilities comprehensively demonstrate the benefits of an explicit visual memory. We hope that it might contribute to a conversation on how knowledge should be represented in deep vision models – beyond carving it in “stone” weights.
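The decomposition lends itself to a very small amount of code. The sketch below, with illustrative names and a plain NumPy index in place of a billion-scale ANN index, shows the three advertised capabilities: adding data, removing it, and inspecting the neighbours behind a decision.

```python
import numpy as np

class VisualMemory:
    """k-NN classifier over an editable database of pre-trained embeddings."""

    def __init__(self, dim):
        self.keys = np.empty((0, dim), dtype=np.float32)
        self.labels = []

    def add(self, embeddings, labels):       # (1) flexible data addition
        self.keys = np.vstack([self.keys, embeddings])
        self.labels += list(labels)

    def unlearn(self, label):                # (2) removal by unlearning
        keep = [i for i, l in enumerate(self.labels) if l != label]
        self.keys = self.keys[keep]
        self.labels = [self.labels[i] for i in keep]

    def classify(self, query, k=5):          # (3) interpretable decision
        sims = self.keys @ query             # cosine similarity if normalized
        top = np.argsort(-sims)[:k]
        votes = [self.labels[i] for i in top]
        return max(set(votes), key=votes.count), votes  # label + evidence
```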
[204] LIA-X: Interpretable Latent Portrait Animator
Yaohui Wang, Di Yang, Xinyuan Chen, Francois Bremond, Yu Qiao, Antitza Dantcheva
Main category: cs.CV
TL;DR: LIA-X is an interpretable portrait animator using a Sparse Motion Dictionary for fine-grained control, outperforming previous methods in facial dynamics transfer.
Details
Motivation: To enable precise and interpretable control over facial dynamics transfer in portrait animation, addressing limitations of prior 'warp-render' approaches.
Method: Uses an autoencoder with a Sparse Motion Dictionary to model motion transfer as linear navigation in latent space, supporting an 'edit-warp-render' strategy.
Result: Outperforms previous methods in self- and cross-reenactment tasks, scalable to 1B parameters, and enables practical applications like video editing.
Conclusion: LIA-X offers interpretable, controllable, and scalable facial dynamics transfer, advancing portrait animation and editing.
Abstract: We introduce LIA-X, a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as a linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpretable factors. Deviating from previous ‘warp-render’ approaches, the interpretability of the Sparse Motion Dictionary allows LIA-X to support a highly controllable ‘edit-warp-render’ strategy, enabling precise manipulation of fine-grained facial semantics in the source portrait. This helps to narrow initial differences with the driving video in terms of pose and expression. Moreover, we demonstrate the scalability of LIA-X by successfully training a large-scale model with approximately 1 billion parameters on extensive datasets. Experimental results show that our proposed method outperforms previous approaches in both self-reenactment and cross-reenactment tasks across several benchmarks. Additionally, the interpretable and controllable nature of LIA-X supports practical applications such as fine-grained, user-guided image and video editing, as well as 3D-aware portrait video manipulation.
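The phrase "linear navigation of motion codes" suggests an edit rule as simple as the sketch below: a source latent is shifted along a few dictionary directions with chosen magnitudes before warping and rendering. The encoder, decoder, and the meaning of individual directions are assumptions for illustration.

```python
import torch

def edit_latent(z_src, motion_dict, magnitudes):
    """z_src: (D,) source latent; motion_dict: (K, D) sparse, interpretable
    motion directions; magnitudes: (K,) coefficients from a user or driver."""
    return z_src + magnitudes @ motion_dict   # z_edit = z + sum_k a_k * d_k

z = torch.randn(512)
directions = torch.randn(10, 512)   # e.g. one direction per facial factor
a = torch.zeros(10)
a[3] = 0.8                          # dial up a single factor in isolation
z_edit = edit_latent(z, directions, a)
```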
[205] SpectralEarth: Training Hyperspectral Foundation Models at Scale
Nassim Ait Ali Braham, Conrad M Albrecht, Julien Mairal, Jocelyn Chanussot, Yi Wang, Xiao Xiang Zhu
Main category: cs.CV
TL;DR: The paper introduces SpectralEarth, a large-scale hyperspectral dataset, and pretrains foundation models for HSI analysis, demonstrating versatility and efficiency.
Details
Motivation: The lack of comprehensive hyperspectral datasets limits the potential of foundation models in HSI, prompting the creation of SpectralEarth.
Method: Leverages EnMAP data to create SpectralEarth, uses self-supervised learning for pretraining, and integrates a spectral adapter into vision backbones.
Result: Pretrained models show versatility and generalizability across tasks and sensors, with efficient fine-tuning.
Conclusion: SpectralEarth and the pretrained models advance HSI analysis, addressing the dataset gap and enabling broader applications.
Abstract: Foundation models have triggered a paradigm shift in computer vision and are increasingly being adopted in remote sensing, particularly for multispectral imagery. Yet, their potential in hyperspectral imaging (HSI) remains untapped due to the absence of comprehensive and globally representative hyperspectral datasets. To close this gap, we introduce SpectralEarth, a large-scale multitemporal dataset designed to pretrain hyperspectral foundation models leveraging data from the Environmental Mapping and Analysis Program (EnMAP). SpectralEarth comprises 538,974 image patches covering 415,153 unique locations from 11,636 globally distributed EnMAP scenes spanning two years of archive. In addition, 17.5% of these locations include multiple timestamps, enabling multitemporal HSI analysis. Utilizing state-of-the-art self-supervised learning algorithms, we pretrain a series of foundation models on SpectralEarth, integrating a spectral adapter into classical vision backbones to accommodate the unique characteristics of HSI. In tandem, we construct nine downstream datasets for land-cover, crop-type mapping, and tree-species classification, providing benchmarks for model evaluation. Experimental results support the versatility of our models and their generalizability across different tasks and sensors. We also highlight computational efficiency during model fine-tuning.
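The spectral adapter is the most directly reusable idea here; a minimal sketch, assuming per-pixel 1x1 convolutions and roughly 200 usable EnMAP bands, is shown below. Layer sizes and the band count are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SpectralAdapter(nn.Module):
    """Maps a hyperspectral cube to the input a standard backbone expects."""

    def __init__(self, in_bands=202, out_channels=3):
        super().__init__()
        # 1x1 convs act per pixel along the spectral dimension only.
        self.net = nn.Sequential(
            nn.Conv2d(in_bands, 64, kernel_size=1), nn.GELU(),
            nn.Conv2d(64, out_channels, kernel_size=1),
        )

    def forward(self, x):            # x: (B, in_bands, H, W)
        return self.net(x)           # (B, out_channels, H, W)

patch = torch.randn(2, 202, 64, 64)
backbone_ready = SpectralAdapter()(patch)  # feed to a ViT/ResNet backbone
```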
[206] MOC: Meta-Optimized Classifier for Few-Shot Whole Slide Image Classification
Tianqi Xiang, Yi Li, Qixiang Zhang, Xiaomeng Li
Main category: cs.CV
TL;DR: The paper introduces a Meta-Optimized Classifier (MOC) to improve few-shot learning for WSI classification, outperforming existing methods.
Details
Motivation: Addressing the limitations of current few-shot VLFM-based methods, which are vulnerable to data scarcity.
Method: Proposes MOC with a meta-learner and classifier bank for optimized classifier configuration.
Result: MOC achieves significant performance gains, e.g., 10.4% higher AUC on TCGA-NSCLC benchmark.
Conclusion: MOC offers a robust solution for clinical applications with limited training data.
Abstract: Recent advances in histopathology vision-language foundation models (VLFMs) have shown promise in addressing data scarcity for whole slide image (WSI) classification via zero-shot adaptation. However, these methods remain outperformed by conventional multiple instance learning (MIL) approaches trained on large datasets, motivating recent efforts to enhance VLFM-based WSI classification through few-shot learning paradigms. While existing few-shot methods improve diagnostic accuracy with limited annotations, their reliance on conventional classifier designs introduces critical vulnerabilities to data scarcity. To address this problem, we propose a Meta-Optimized Classifier (MOC) comprising two core components: (1) a meta-learner that automatically optimizes a classifier configuration from a mixture of candidate classifiers and (2) a classifier bank housing diverse candidate classifiers to enable a holistic pathological interpretation. Extensive experiments demonstrate that MOC outperforms prior art in multiple few-shot benchmarks. Notably, on the TCGA-NSCLC benchmark, MOC improves AUC by 10.4% over the state-of-the-art few-shot VLFM-based methods, with gains up to 26.25% under 1-shot conditions, offering a critical advancement for clinical deployments where diagnostic training data is severely limited. Code is available at https://github.com/xmed-lab/MOC.
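One plausible reading of the two components is a gating network over a bank of heterogeneous heads, as in the hedged sketch below; the gate, the candidate heads, and the mixing rule are assumptions from the abstract, not the released code.

```python
import torch
import torch.nn as nn

class MetaOptimizedClassifier(nn.Module):
    def __init__(self, dim, n_classes, bank):
        super().__init__()
        self.bank = nn.ModuleList(bank)                # candidate classifiers
        self.gate = nn.Linear(dim, len(bank))          # the meta-learner

    def forward(self, slide_feat):                     # (B, dim) WSI feature
        w = torch.softmax(self.gate(slide_feat), -1)   # (B, K) mixture weights
        logits = torch.stack([h(slide_feat) for h in self.bank], dim=1)
        return (w.unsqueeze(-1) * logits).sum(dim=1)   # weighted combination

bank = [nn.Linear(512, 2),
        nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 2))]
moc = MetaOptimizedClassifier(512, 2, bank)
pred = moc(torch.randn(4, 512))
```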
[207] PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image
Geonhee Sim, Gyeongsik Moon
Main category: cs.CV
TL;DR: PERSONA combines 3D-based and diffusion-based approaches to create personalized 3D human avatars from a single image, addressing identity preservation and pose-driven deformations.
Details
Motivation: Existing methods either require costly pose-rich videos or struggle with identity preservation. PERSONA aims to overcome these limitations.
Method: PERSONA uses diffusion to generate pose-rich videos from a single image, then optimizes a 3D avatar with balanced sampling and geometry-weighted optimization.
Result: The framework achieves high authenticity and sharp renderings across diverse poses while preserving identity.
Conclusion: PERSONA successfully bridges the gap between 3D-based and diffusion-based approaches for animatable human avatars.
Abstract: Two major approaches exist for creating animatable human avatars. The first, a 3D-based approach, optimizes a NeRF- or 3DGS-based avatar from videos of a single person, achieving personalization through a disentangled identity representation. However, modeling pose-driven deformations, such as non-rigid cloth deformations, requires numerous pose-rich videos, which are costly and impractical to capture in daily life. The second, a diffusion-based approach, learns pose-driven deformations from large-scale in-the-wild videos but struggles with identity preservation and pose-dependent identity entanglement. We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. PERSONA leverages a diffusion-based approach to generate pose-rich videos from the input image and optimizes a 3D avatar based on them. To ensure high authenticity and sharp renderings across diverse poses, we introduce balanced sampling and geometry-weighted optimization. Balanced sampling oversamples the input image to mitigate identity shifts in diffusion-generated training videos. Geometry-weighted optimization prioritizes geometry constraints over image loss, preserving rendering quality in diverse poses.
[208] Pediatric brain tumor classification using digital histopathology and deep learning: evaluation of SOTA methods on a multi-center Swedish cohort
Iulian Emil Tampu, Per Nyman, Christoforos Spyretos, Ida Blystad, Alia Shamikh, Gabriela Prochazka, Teresita Díaz de Ståhl, Johanna Sandgren, Peter Lundberg, Neda Haj-Hosseini
Main category: cs.CV
TL;DR: The study applies weakly supervised multiple-instance learning (MIL) to classify pediatric brain tumors in whole slide images using pre-trained feature extractors, achieving fair generalizability on a multi-center dataset.
Details
Motivation: Brain tumors are common in children, but limited histopathology datasets hinder computational pathology applications. This study aims to address this gap.
Method: Two MIL approaches (ABMIL and CLAM) were used on patch features from pre-trained models (ResNet50, UNI, CONCH) to classify tumors in a multi-center Swedish cohort.
Result: UNI features with ABMIL achieved the highest performance (MCC: 0.76 for tumor category). Models using UNI and CONCH outperformed ResNet50 in generalization.
Conclusion: The study demonstrates the potential of computational pathology for pediatric brain tumor diagnosis with fair generalizability across centers.
Abstract: Brain tumors are the most common solid tumors in children and young adults, but the scarcity of large histopathology datasets has limited the application of computational pathology in this group. This study implements two weakly supervised multiple-instance learning (MIL) approaches on patch features obtained from state-of-the-art histology-specific foundation models to classify pediatric brain tumors in hematoxylin and eosin whole slide images (WSIs) from a multi-center Swedish cohort. WSIs from 540 subjects (age 8.5±4.9 years) diagnosed with brain tumor were gathered from the six Swedish university hospitals. Instance (patch)-level features were obtained from WSIs using three pre-trained feature extractors: ResNet50, UNI, and CONCH. Instances were aggregated using attention-based MIL (ABMIL) or clustering-constrained attention MIL (CLAM) for patient-level classification. Models were evaluated on three classification tasks based on the hierarchical classification of pediatric brain tumors: tumor category, family, and type. Model generalization was assessed by training on data from two of the centers and testing on data from four other centers. Model interpretability was evaluated through attention mapping. The highest classification performance was achieved using UNI features and ABMIL aggregation, with a Matthews correlation coefficient of 0.76±0.04, 0.63±0.04, and 0.60±0.05 for tumor category, family, and type classification, respectively. When evaluating generalization, models utilizing UNI and CONCH features outperformed those using ResNet50. However, the drop in performance from in-site to out-of-site testing was similar across feature extractors. These results show the potential of state-of-the-art computational pathology methods in diagnosing pediatric brain tumors at different hierarchical levels with fair generalizability on a multi-center national dataset.
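For context, ABMIL aggregation, the best-performing combination above, reduces to a few lines: patch features are pooled into a slide-level feature through learned attention, and the attention weights double as the interpretability map used for attention mapping. Dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class ABMIL(nn.Module):
    def __init__(self, dim=1024, hidden=256, n_classes=3):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Linear(dim, n_classes)

    def forward(self, patches):                       # (N_patches, dim)
        a = torch.softmax(self.attn(patches), dim=0)  # (N, 1) attention weights
        slide_feat = (a * patches).sum(dim=0)         # attention-pooled feature
        return self.head(slide_feat), a               # logits + attention map

logits, attn_map = ABMIL()(torch.randn(5000, 1024))
```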
[209] A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation
Shuting He, Peilin Ji, Yitong Yang, Changshuo Wang, Jiayi Ji, Yinglin Wang, Henghui Ding
Main category: cs.CV
TL;DR: A survey on 3D Gaussian Splatting (3DGS) applications, comparing it to NeRF, covering semantic understanding, segmentation, editing, generation, and benchmarks.
Details
Motivation: To provide a comprehensive overview of 3DGS applications, highlighting its advantages over NeRF and its potential in downstream tasks.
Method: Categorizes 3DGS applications, reviews 2D foundation models and NeRF-based methods, and summarizes datasets, evaluation protocols, and benchmarks.
Result: Identifies shared design principles, emerging trends, and comparative analyses of methods.
Conclusion: 3DGS is a versatile tool for 3D scene representation, with ongoing research supported by a maintained repository.
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful alternative to Neural Radiance Fields (NeRF) for 3D scene representation, offering high-fidelity photorealistic rendering with real-time performance. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first introduces 2D foundation models that support semantic understanding and control in 3DGS applications, followed by a review of NeRF-based methods that inform their 3DGS counterparts. We then categorize 3DGS applications into segmentation, editing, generation, and other functional tasks. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at https://github.com/heshuting555/Awesome-3DGS-Applications.
[210] LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit
Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang
Main category: cs.CV
TL;DR: LLMC+ is a VLM compression benchmark addressing limitations of current methods by decomposing techniques, evaluating multi-turn tasks, and combining token and model compression.
Details
Motivation: Current VLM compression methods lack fair evaluation, realistic task testing, and joint technique exploration.
Method: LLMC+ introduces a toolkit with over 20 algorithms across five VLM families for systematic study of token-level and model-level compression.
Result: Findings show distinct strategies for spatial/temporal redundancy, token reduction struggles in multi-turn tasks, and combined compression works best.
Conclusion: LLMC+ enables fair evaluation and inspires efficient VLM research, with code available for public use.
Abstract: Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Isolated use of individual compression techniques, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLM. Our code is available at https://github.com/ModelTC/LightCompress.
[211] Depth-Guided Self-Supervised Human Keypoint Detection via Cross-Modal Distillation
Aman Anand, Elyas Rashno, Amir Eskandari, Farhana Zulkernine
Main category: cs.CV
TL;DR: Distill-DKP improves unsupervised keypoint detection by using depth maps and RGB images via cross-modal knowledge distillation, outperforming previous methods.
Details
Motivation: Existing methods lack depth information and detect keypoints on backgrounds due to artificial deformations and reconstruction objectives.
Method: Proposes Distill-DKP, a framework using depth-based teacher models to guide image-based student models in self-supervised learning.
Result: Reduces mean L2 error by 47.15% on Human3.6M, mean average error by 5.67% on Taichi, and improves accuracy by 1.3% on DeepFashion.
Conclusion: Distill-DKP effectively leverages depth information and knowledge distillation for superior keypoint detection.
Abstract: Existing unsupervised keypoint detection methods apply artificial deformations to images, such as masking a significant portion of the image, and use reconstruction of the original image as a learning objective to detect keypoints. However, this approach lacks depth information and often detects keypoints on the background. To address this, we propose Distill-DKP, a novel cross-modal knowledge distillation framework that leverages depth maps and RGB images for keypoint detection in a self-supervised setting. During training, Distill-DKP extracts embedding-level knowledge from a depth-based teacher model to guide an image-based student model, with inference restricted to the student. Experiments show that Distill-DKP significantly outperforms previous unsupervised methods by reducing mean L2 error by 47.15% on Human3.6M, mean average error by 5.67% on Taichi, and improving keypoint accuracy by 1.3% on the DeepFashion dataset. Detailed ablation studies demonstrate the sensitivity of knowledge distillation across different layers of the network. Project Page: https://23wm13.github.io/distill-dkp/
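The core training signal can be written compactly: the frozen depth teacher provides embedding targets for the RGB student, and only the student runs at inference. The cosine-distance choice below is an assumption; the paper specifies embedding-level distillation without this sketch's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_distill_loss(student_emb, teacher_emb):
    """student_emb: embeddings from RGB frames (B, D);
    teacher_emb: embeddings from the frozen depth teacher (B, D)."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1).detach()  # teacher gives targets only
    return (1 - (s * t).sum(dim=-1)).mean()        # cosine-embedding distance
```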
[212] SLTNet: Efficient Event-based Semantic Segmentation with Spike-driven Lightweight Transformer-based Networks
Xianlei Long, Xiaxin Zhu, Fangming Guo, Wanyi Zhang, Qingyi Gu, Chao Chen, Fuqiang Gu
Main category: cs.CV
TL;DR: SLTNet is a spike-driven lightweight transformer network for event-based semantic segmentation, offering high efficiency, low energy consumption, and superior performance over SOTA SNN methods.
Details
Motivation: Address the inefficiencies of ANN-based segmentation methods, such as high computational demands and energy consumption, for resource-constrained platforms.
Method: Uses spike-driven convolution blocks (SCBs) and spike-driven transformer blocks (STBs) with binary mask operations in a single-branch architecture.
Result: Outperforms SOTA SNN methods by up to 9.39% mIoU with 4.58x lower energy consumption and 114 FPS speed.
Conclusion: SLTNet is a highly efficient and effective solution for event-based semantic segmentation, suitable for edge/mobile platforms.
Abstract: Event-based semantic segmentation has great potential in autonomous driving and robotics due to the advantages of event cameras, such as high dynamic range, low latency, and low power cost. Unfortunately, current artificial neural network (ANN)-based segmentation methods suffer from high computational demands, the requirements for image frames, and massive energy consumption, limiting their efficiency and application on resource-constrained edge/mobile platforms. To address these problems, we introduce SLTNet, a spike-driven lightweight transformer-based network designed for event-based semantic segmentation. Specifically, SLTNet is built on efficient spike-driven convolution blocks (SCBs) to extract rich semantic features while reducing the model’s parameters. Then, to enhance long-range contextual feature interaction, we propose novel spike-driven transformer blocks (STBs) with binary mask operations. Based on these basic blocks, SLTNet employs a high-efficiency single-branch architecture while maintaining the low energy consumption of the Spiking Neural Network (SNN). Finally, extensive experiments on the DDD17 and DSEC-Semantic datasets demonstrate that SLTNet outperforms state-of-the-art (SOTA) SNN-based methods by up to 9.06% and 9.39% mIoU, respectively, with 4.58x lower energy consumption and a 114 FPS inference speed. Our code is open-sourced and available at https://github.com/longxianlei/SLTNet-v1.0.
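As a primer on the building block named above, here is a minimal spike-driven convolution block: a convolution per time step followed by a leaky integrate-and-fire (LIF) neuron that emits binary spikes. Hyperparameters are illustrative, and real SNN training would additionally need a surrogate gradient for the threshold.

```python
import torch
import torch.nn as nn

class LIF(nn.Module):
    """Leaky integrate-and-fire neuron over a sequence of time steps."""

    def __init__(self, tau=2.0, v_th=1.0):
        super().__init__()
        self.tau, self.v_th = tau, v_th

    def forward(self, x_seq):                    # (T, B, C, H, W)
        v, spikes = torch.zeros_like(x_seq[0]), []
        for x in x_seq:
            v = v + (x - v) / self.tau           # leaky integration
            s = (v >= self.v_th).float()         # binary spike
            v = v * (1 - s)                      # hard reset after firing
            spikes.append(s)
        return torch.stack(spikes)

class SpikeConvBlock(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(cout)
        self.lif = LIF()

    def forward(self, x_seq):                    # (T, B, Cin, H, W)
        y = torch.stack([self.bn(self.conv(x)) for x in x_seq])
        return self.lif(y)                       # outputs stay binary spikes

out = SpikeConvBlock(2, 16)(torch.rand(4, 1, 2, 32, 32))
```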
[213] GenAI Confessions: Black-box Membership Inference for Generative Image Models
Matyas Bohacek, Hany Farid
Main category: cs.CV
TL;DR: A method for detecting if generative-AI image models were trained on specific images, addressing copyright concerns without needing model details.
Details
Motivation: Concerns over unauthorized use of intellectual property in training generative-AI models and the need for fair use and copyright compliance.
Method: Black-box membership inference to determine if specific images were used in training, without requiring model architecture or weights.
Result: A computationally efficient method for auditing generative-AI models for unauthorized image use.
Conclusion: This method supports fairer development and deployment of generative-AI by enabling copyright compliance checks.
Abstract: From a simple text prompt, generative-AI image models can create stunningly realistic and creative images bounded, it seems, by only our imagination. These models have achieved this remarkable feat thanks, in part, to the ingestion of billions of images collected from nearly every corner of the internet. Many creators have understandably expressed concern over how their intellectual property has been ingested without their permission or a mechanism to opt out of training. As a result, questions of fair use and copyright infringement have quickly emerged. We describe a method that allows us to determine if a model was trained on a specific image or set of images. This method is computationally efficient and assumes no explicit knowledge of the model architecture or weights (so-called black-box membership inference). We anticipate that this method will be crucial for auditing existing models and, looking ahead, ensuring the fairer development and deployment of generative AI models.
[214] Prompt-aligned Gradient for Prompt Tuning
Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, Hanwang Zhang
Main category: cs.CV
TL;DR: ProGrad (Prompt-aligned Gradient) is a method to fine-tune prompts in vision-language models (VLMs) like CLIP without forgetting general knowledge, outperforming existing prompt tuning methods.
Details
Motivation: Improper fine-tuning of prompts in VLMs can degrade performance not only for task-related classes but also other classes, and existing anti-overfitting techniques lack prompt-specific solutions.
Method: ProGrad updates prompts only when their gradients align with the general direction (gradient of the KL loss of the pre-defined prompt prediction), preserving VLM knowledge.
Result: ProGrad shows superior few-shot generalization compared to state-of-the-art prompt tuning methods.
Conclusion: ProGrad effectively prevents knowledge forgetting during prompt tuning in VLMs, enhancing few-shot performance.
Abstract: Thanks to large pre-trained vision-language models (VLMs) like CLIP, we can craft a zero-shot classifier by “prompt”, e.g., the confidence score of an image being “[CLASS]” can be obtained by using the VLM-provided similarity measure between the image and the prompt sentence “a photo of a [CLASS]”. Therefore, prompts show great potential for fast adaptation of VLMs to downstream tasks if we fine-tune the prompt-based similarity measure. However, we find a common failure in that improper fine-tuning may not only undermine the prompt’s inherent prediction for the task-related classes, but also for other classes in the VLM vocabulary. Existing methods still address this problem by using traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompts. We present Prompt-aligned Gradient, dubbed ProGrad, to prevent prompt tuning from forgetting the general knowledge learned from VLMs. In particular, ProGrad only updates the prompt whose gradient is aligned (or non-conflicting) with the “general direction”, which is represented as the gradient of the KL loss of the pre-defined prompt prediction. Extensive experiments demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods. Code is available at https://github.com/BeierZhu/Prompt-align.
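The update rule itself is short. A sketch, following the description above: keep the task gradient when it agrees with the general direction, otherwise project out the conflicting component. Written from the abstract, so treat details such as the projection coefficient as assumptions.

```python
import torch

def prograd_update(g_task, g_general):
    """g_task: gradient of the downstream loss w.r.t. the prompt vectors;
    g_general: gradient of the KL loss to the hand-crafted prompt's output."""
    dot = torch.dot(g_task, g_general)
    if dot >= 0:
        return g_task                     # aligned: use the task gradient
    # Conflicting: remove the component opposing the general direction.
    return g_task - (dot / g_general.norm() ** 2) * g_general
```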
[215] Debiased Fine-Tuning for Vision-language Models by Prompt Regularization
Beier Zhu, Yulei Niu, Saeil Lee, Minhoe Hur, Hanwang Zhang
Main category: cs.CV
TL;DR: ProReg introduces Prompt Regularization to fine-tune vision-language models, using pretrained knowledge to avoid overfitting and bias in downstream tasks.
Details
Motivation: Traditional fine-tuning overfits to biased downstream data; ProReg leverages pretrained knowledge via prompting for unbiased regularization.
Method: ProReg combines the KL loss of prompt predictions and the CE loss of ground-truth labels with adaptive weights for domain transfer.
Result: ProReg outperforms conventional fine-tuning, zero-shot prompt, and other methods on out-of-distribution benchmarks.
Conclusion: ProReg effectively balances pretrained and downstream knowledge, demonstrating robust performance across tasks.
Abstract: We present a new paradigm for fine-tuning large-scale vision-language pre-trained models on downstream tasks, dubbed Prompt Regularization (ProReg). Different from traditional fine-tuning, which easily overfits to the downstream task data, ProReg uses the prediction obtained by prompting the pretrained model to regularize the fine-tuning. The motivation is: by prompting the large model with “a photo of a [CLASS]”, the fill-in answer depends only on the pretraining encyclopedic knowledge and is independent of the task data distribution, which is usually biased. Specifically, given a training sample prediction during fine-tuning, we first calculate its Kullback-Leibler loss of the prompt prediction and Cross-Entropy loss of the ground-truth label, and then combine them with a proposed sample-wise adaptive trade-off weight, which automatically adjusts the transfer between the pretrained and downstream domains. On various out-of-distribution benchmarks, we show the consistently strong performance of ProReg compared with conventional fine-tuning, zero-shot prompt, prompt tuning, and other state-of-the-art methods.
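The objective reduces to two familiar terms per sample, as sketched below; the sigmoid-based sample weight is a placeholder for the paper's adaptive trade-off, which this summary does not specify.

```python
import torch
import torch.nn.functional as F

def proreg_loss(logits, prompt_logits, labels):
    """logits: fine-tuned model outputs (B, C); prompt_logits: frozen
    pretrained model's zero-shot prompt predictions; labels: (B,)."""
    ce = F.cross_entropy(logits, labels, reduction="none")         # (B,)
    kl = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.softmax(prompt_logits, dim=-1).detach(),
                  reduction="none").sum(dim=-1)                    # (B,)
    w = torch.sigmoid(ce - kl).detach()  # placeholder sample-wise weight
    return ((1 - w) * ce + w * kl).mean()
```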
[216] Ear-Keeper: A Cross-Platform AI System for Rapid and Accurate Ear Disease Diagnosis
Feiyan Lu, Yubiao Yue, Zhenzhang Li, Meiping Zhang, Wen Luo, Fan Zhang, Tong Liu, Jingyong Shi, Guang Wang, Xinyu Zeng
Main category: cs.CV
TL;DR: Best-EarNet, a lightweight deep learning model, achieves high accuracy in diagnosing ear diseases using a large-scale otoendoscopy dataset, with real-time performance and clinical interpretability.
Details
Motivation: Early and accurate detection of ear diseases is crucial for preventing hearing impairment, but existing datasets lack diversity and AI models struggle with accuracy, efficiency, and size.
Method: Developed Best-EarNet, integrating a Local-Global Spatial Feature Fusion Module and multi-scale supervision, leveraging transfer learning for efficiency.
Result: Achieved 95.23% accuracy on internal and 92.14% on external test sets, with fast processing (80 FPS) and small model size (2.94 MB).
Conclusion: Best-EarNet and Ear-Keeper system enable real-time, accessible ear disease screening, enhancing early detection and clinical trust.
Abstract: Early and accurate detection systems for ear diseases, powered by deep learning, are essential for preventing hearing impairment and improving population health. However, the limited diversity of existing otoendoscopy datasets and the poor balance between diagnostic accuracy, computational efficiency, and model size have hindered the translation of artificial intelligence (AI) algorithms into healthcare applications. In this study, we constructed a large-scale, multi-center otoendoscopy dataset covering eight common ear diseases and healthy cases. Building upon this resource, we developed Best-EarNet, an ultrafast and lightweight deep learning architecture integrating a novel Local-Global Spatial Feature Fusion Module with a multi-scale supervision strategy, enabling real-time and accurate classification of ear conditions. Leveraging transfer learning, Best-EarNet, with a model size of only 2.94 MB, achieved diagnostic accuracies of 95.23% on an internal test set (22,581 images) and 92.14% on an external test set (1,652 images), while requiring only 0.0125 seconds (80 frames per second) to process a single image on a standard CPU. Further subgroup analysis by gender and age showed consistently excellent performance of Best-EarNet across all demographic groups. To enhance clinical interpretability and user trust, we incorporated Grad-CAM-based visualization, highlighting the specific abnormal ear regions contributing to AI predictions. Most importantly, we developed Ear-Keeper, a cross-platform intelligent diagnosis system built upon Best-EarNet, deployable on smartphones, tablets, and personal computers. Ear-Keeper enables public users and healthcare providers to perform comprehensive real-time video-based ear canal screening, supporting early detection and timely intervention of ear diseases.
[217] Simulating the Real World: A Unified Survey of Multimodal Generative Models
Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong
Main category: cs.CV
TL;DR: A survey unifying multimodal generative models for 2D, video, 3D, and 4D generation to advance real-world simulation in AGI research.
Details
Motivation: To address the gap in treating modalities independently and systematically integrate their interdependencies for more accurate real-world simulations.
Method: The survey reviews progression from 2D (appearance) to video (dynamics), 3D (geometry), and 4D (all dimensions), including datasets and metrics.
Result: First unified framework for studying multimodal generative models, offering insights and future directions.
Conclusion: The survey bridges gaps in AGI research by fostering a unified approach to multimodal generative models for real-world simulation.
Abstract: Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation that integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D, and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions, fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.
[218] Are you Struggling? Dataset and Baselines for Struggle Determination in Assembly Videos
Shijia Feng, Michael Wray, Brian Sullivan, Youngkyoon Jang, Casimir Ludwig, Iain Gilchrist, Walterio Mayol-Cuevas
Main category: cs.CV
TL;DR: The paper introduces a dataset and models for detecting struggle in video-based problem-solving activities, achieving high accuracy in binary classification but lower in multi-level struggle detection.
Details
Motivation: To enhance understanding of actions in videos and improve assistive systems by detecting struggle without explicit step or activity knowledge.
Method: Created a struggle dataset with labeled video segments, conducted experiments with deep learning models for struggle classification, regression, and label distribution learning.
Result: Achieved 88.24% accuracy in binary struggle classification and 52.45% in four-way classification.
Conclusion: Struggle detection is feasible and valuable for improving assistive systems and action understanding in videos.
Abstract: Determining when people are struggling allows for a finer-grained understanding of actions that complements conventional action classification and error detection. Struggle detection, as defined in this paper, is a distinct and important task that can be identified without explicit step or activity knowledge. We introduce the first struggle dataset with three real-world problem-solving activities that are labelled by both expert and crowd-sourced annotators. Video segments were scored w.r.t. their level of struggle using a forced-choice 4-point scale. This dataset contains 5.1 hours of video from 73 participants. We conducted a series of experiments to identify the most suitable modelling approaches for struggle determination. Additionally, we compared various deep learning models, establishing baseline results for struggle classification, struggle regression, and struggle label distribution learning. Our results indicate that struggle detection in video can achieve up to 88.24% accuracy in binary classification, while detecting the level of struggle in a four-way classification setting performs lower, with an overall accuracy of 52.45%. Our work is motivated by the goal of a more comprehensive understanding of action in video and, potentially, the improvement of assistive systems that analyse struggle and can better support users during manual activities.
[219] Revisiting 3D Medical Scribble Supervision: Benchmarking Beyond Cardiac Segmentation
Karol Gotkowski, Klaus H. Maier-Hein, Fabian Isensee
Main category: cs.CV
TL;DR: The paper critiques the over-specialization of scribble supervision in medical 3D segmentation, introduces ScribbleBench for broader evaluation, and highlights a simpler, overlooked baseline (nnU-Net with partial loss) that outperforms specialized methods.
Details
Motivation: To address the overfitting and lack of generalization in scribble supervision methods, which are overly focused on cardiac datasets, and to establish a more practical and robust evaluation standard.
Method: Formulates key requirements for scribble supervision and introduces ScribbleBench, a benchmark spanning seven diverse medical imaging datasets, to evaluate these requirements systematically.
Result: Reveals that many specialized methods fail to generalize outside the cardiac domain, while simpler approaches (e.g., nnU-Net with partial loss) perform better across diverse tasks.
Conclusion: The work aims to redirect scribble supervision research toward more generalizable and practical methodologies by identifying limitations and setting a new benchmark-driven standard.
Abstract: Scribble supervision has emerged as a promising approach for reducing annotation costs in medical 3D segmentation by leveraging sparse annotations instead of voxel-wise labels. While existing methods report strong performance, a closer analysis reveals that the majority of research is confined to the cardiac domain, predominantly using the ACDC and MSCMR datasets. This over-specialization has resulted in severe overfitting, misleading claims of performance improvements, and a lack of generalization across broader segmentation tasks. In this work, we formulate a set of key requirements for practical scribble supervision and introduce ScribbleBench, a comprehensive benchmark spanning seven diverse medical imaging datasets, to systematically evaluate the fulfillment of these requirements. Consequently, we uncover a general failure of methods to generalize across tasks, finding that many widely used novelties degrade performance outside of the cardiac domain, whereas simpler, overlooked approaches achieve superior generalization. Finally, we raise awareness of a strong yet overlooked baseline, nnU-Net coupled with a partial loss, which consistently outperforms specialized methods across a diverse range of tasks. By identifying fundamental limitations in existing research and establishing a new benchmark-driven evaluation standard, this work aims to steer scribble supervision toward more practical, robust, and generalizable methodologies for medical image segmentation.
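The "strong yet overlooked baseline" is worth spelling out, because it is almost a one-liner: a standard segmentation loss evaluated only on scribble-annotated voxels, with everything else ignored. The ignore-index convention below is an assumption.

```python
import torch
import torch.nn.functional as F

IGNORE = 255  # voxels without any scribble annotation carry this index

def partial_cross_entropy(logits, scribbles):
    """logits: (B, C, X, Y, Z) network output; scribbles: (B, X, Y, Z)
    sparse labels, IGNORE everywhere no annotation exists. The loss is
    averaged over annotated voxels only, so unlabeled voxels contribute
    no gradient."""
    return F.cross_entropy(logits, scribbles, ignore_index=IGNORE)
```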
[220] ProbRadarM3F: mmWave Radar based Human Skeletal Pose Estimation with Probability Map Guided Multi-Format Feature Fusion
Bing Zhu, Zixin He, Weiyi Xiong, Guanhua Ding, Tao Huang, Wei Xiang
Main category: cs.CV
TL;DR: The paper introduces ProbRadarM3F, a model for mmWave radar-based human pose estimation, fusing traditional heatmap and positional features to improve accuracy.
Details
Motivation: Overcome the limitations of mmWave radar in pose estimation by addressing underutilized signal information.
Method: Uses a probability map guided multi-format feature fusion model (ProbRadarM3F) combining FFT and positional encoding.
Result: Achieves 69.9% AP on HuPR dataset, outperforming existing methods.
Conclusion: Highlights the potential of exploiting non-redundant radar signal information for future research.
Abstract: Millimeter wave (mmWave) radar is a non-intrusive, privacy-preserving, and relatively convenient and inexpensive device, which has been demonstrated to be applicable in place of RGB cameras in human indoor pose estimation tasks. However, mmWave radar relies on the collection of reflected signals from the target, and the information contained in the radar signals is difficult to exploit fully. This has been a long-standing hindrance to the improvement of pose estimation accuracy. To address this major challenge, this paper introduces a probability map guided multi-format feature fusion model, ProbRadarM3F. This is a novel radar feature extraction framework using a traditional FFT method in parallel with a probability map based positional encoding method. ProbRadarM3F fuses the traditional heatmap features and the positional features, then effectively achieves the estimation of 14 keypoints of the human body. Experimental evaluation on the HuPR dataset proves the effectiveness of the proposed model, outperforming other methods evaluated on this dataset with an AP of 69.9%. The emphasis of our study is on position information that has not been exploited before in radar signals. This provides a direction for investigating other potentially non-redundant information in mmWave radar.
[221] PrAViC: Probabilistic Adaptation Framework for Real-Time Video Classification
Magdalena Trędowicz, Marcin Mazur, Szymon Janusz, Arkadiusz Lewicki, Jacek Tabor, Łukasz Struski
Main category: cs.CV
TL;DR: PrAViC is a unified framework for online video classification, enabling faster decisions while maintaining accuracy by adapting offline models with recurrent operations.
Details
Motivation: Addressing the lack of well-defined methods for online video processing compared to offline methods.
Method: Establishes a mathematical foundation for early decision-making in sequential data, then adapts offline models to online settings using recurrent operations.
Result: PrAViC reduces decision time significantly while preserving or improving accuracy compared to state-of-the-art models.
Conclusion: PrAViC provides a theoretically grounded and practical solution for online video classification, bridging the gap between offline and online processing.
Abstract: Video processing is generally divided into two main categories: processing of the entire video, which typically yields optimal classification outcomes, and real-time processing, where the objective is to make a decision as promptly as possible. Although the models dedicated to the processing of entire videos are typically well-defined and clearly presented in the literature, this is not the case for online processing, where a plethora of hand-devised methods exist. To address this issue, we present PrAViC, a novel, unified, and theoretically-based adaptation framework for tackling the online classification problem in video data. The initial phase of our study is to establish a mathematical background for the classification of sequential data, with the potential to make a decision at an early stage. This allows us to construct a natural function that encourages the model to return a result much faster. The subsequent phase is to present a straightforward and readily implementable method for adapting offline models to the online setting using recurrent operations. Finally, PrAViC is evaluated by comparing it with existing state-of-the-art offline and online models on standard datasets. The framework enables the network to significantly reduce the time required to reach classification decisions while maintaining, or even enhancing, accuracy.
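To make the online setting concrete, the sketch below shows a generic early-exit loop of the kind PrAViC formalizes; the confidence threshold and the recurrent `model` interface are illustrative assumptions, not the paper's exact construction:

```python
def classify_online(model, frames, threshold=0.9):
    """Emit a decision as soon as the running classifier is confident.

    `model` is assumed to map (frame, state) -> (probs, new state),
    e.g. an offline backbone adapted with recurrent operations.
    Returns the predicted class and the number of frames consumed.
    """
    state, pred = None, None
    for t, frame in enumerate(frames):
        probs, state = model(frame, state)
        conf, pred = probs.max(dim=-1)   # probs is a torch tensor
        if conf.item() >= threshold:     # confident enough: decide early
            return pred.item(), t + 1
    return pred.item(), len(frames)      # fell back to the full clip
```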
[222] From Few to More: Scribble-based Medical Image Segmentation via Masked Context Modeling and Continuous Pseudo Labels
Zhisong Wang, Yiwen Ye, Ziyang Chen, Minglei Shu, Yanning Zhang, Yong Xia
Main category: cs.CV
TL;DR: MaCo introduces Masked Context Modeling (MCM) and Continuous Pseudo Labels (CPL) for scribble-based weakly supervised medical image segmentation, outperforming existing methods.
Details
Motivation: Existing methods rely on auxiliary tasks and hard pseudo labels, overlooking challenges of sparse annotations. MaCo addresses this by leveraging MCM and CPL.
Method: MaCo uses MCM for input perturbation and CPL to convert scribbles into continuous confidence maps, avoiding hard pseudo labels.
Result: MaCo outperforms other methods on three public datasets, setting a new benchmark.
Conclusion: MaCo effectively handles sparse annotations, improving weakly supervised medical image segmentation.
Abstract: Scribble-based weakly supervised segmentation methods have shown promising results in medical image segmentation, significantly reducing annotation costs. However, existing approaches often rely on auxiliary tasks to enforce semantic consistency and use hard pseudo labels for supervision, overlooking the unique challenges faced by models trained with sparse annotations. These models must predict pixel-wise segmentation maps from limited data, making it crucial to handle varying levels of annotation richness effectively. In this paper, we propose MaCo, a weakly supervised model designed for medical image segmentation, based on the principle of “from few to more.” MaCo leverages Masked Context Modeling (MCM) and Continuous Pseudo Labels (CPL). MCM employs an attention-based masking strategy to perturb the input image, ensuring that the model’s predictions align with those of the original image. CPL converts scribble annotations into continuous pixel-wise labels by applying an exponential decay function to distance maps, producing confidence maps that represent the likelihood of each pixel belonging to a specific category, rather than relying on hard pseudo labels. We evaluate MaCo on three public datasets, comparing it with other weakly supervised methods. Our results show that MaCo outperforms competing methods across all datasets, establishing a new record in weakly supervised medical image segmentation.
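The CPL step has a compact form: a Euclidean distance transform from the scribbles followed by exponential decay. A sketch for a single class, with the decay rate as a hypothetical parameter:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def continuous_pseudo_label(scribble_mask, decay=0.1):
    """Turn a binary scribble mask for one class into a soft confidence map.

    distance_transform_edt measures the distance to the nearest zero,
    so the mask is inverted: scribble pixels become zeros and get
    confidence 1.0, which decays exponentially with distance.
    """
    dist = distance_transform_edt(1 - scribble_mask.astype(np.uint8))
    return np.exp(-decay * dist)
```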
[223] ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs
Yin Xie, Kaicheng Yang, Peirou Liang, Xiang An, Yongle Zhao, Yumeng Wang, Ziyong Feng, Roy Miles, Ismail Elezi, Jiankang Deng
Main category: cs.CV
TL;DR: ViCToR is a pretraining framework for LMMs that addresses the modality representation gap by using a learnable visual token pool and Hungarian matching for token selection, achieving SOTA results.
Details
Motivation: LMMs struggle with unstable visual representations due to contextual noise, unlike stable language embeddings.
Method: ViCToR introduces a visual comprehension stage with a learnable token pool, Hungarian matching for token selection, and a reconstruction loss with dense semantic supervision.
Result: ViCToR improves performance by 10.4%, 3.2%, and 7.2% on MMStar, SEED$^I$, and RealWorldQA benchmarks, respectively.
Conclusion: ViCToR effectively bridges the modality gap in LMMs, enhancing visual understanding and achieving top-tier results.
Abstract: Large Multimodal Models (LMMs) often face a modality representation gap during pretraining: while language embeddings remain stable, visual representations are highly sensitive to contextual noise (e.g., background clutter). To address this issue, we introduce a visual comprehension stage, which we call ViCToR (Visual Comprehension via Token Reconstruction), a novel pretraining framework for LMMs. ViCToR employs a learnable visual token pool and utilizes the Hungarian matching algorithm to select semantically relevant tokens from this pool for visual token replacement. Furthermore, by integrating a visual token reconstruction loss with dense semantic supervision, ViCToR can learn tokens which retain high visual detail, thereby enhancing the large language model’s (LLM’s) understanding of visual information. After pretraining on 3 million publicly accessible images and captions, ViCToR achieves state-of-the-art results, improving over LLaVA-NeXT-8B by 10.4%, 3.2%, and 7.2% on the MMStar, SEED$^I$, and RealWorldQA benchmarks, respectively. Code is available at https://github.com/deepglint/Victor.
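The token-selection step can be pictured with SciPy's Hungarian solver; the negative-cosine-similarity cost is an assumption, and only the matching is shown, not ViCToR's reconstruction loss:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match_pool_tokens(visual_tokens, token_pool):
    """Pick one pool token per visual token via Hungarian matching.

    visual_tokens: (N, D); token_pool: (M, D) with M >= N. The cost is
    negative cosine similarity, so the assignment maximizes similarity.
    """
    v = F.normalize(visual_tokens, dim=-1)
    p = F.normalize(token_pool, dim=-1)
    cost = -(v @ p.T)                                   # (N, M)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return token_pool[cols]   # replacements aligned with visual_tokens
```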
[224] Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind
Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, Haohuan Fu, Jianxi Huang, Juepeng Zheng
Main category: cs.CV
TL;DR: AgroMind is a new benchmark for evaluating Large Multimodal Models (LMMs) in agricultural remote sensing, covering diverse tasks and revealing performance gaps, especially in spatial reasoning and fine-grained recognition.
Details
Motivation: Existing benchmarks for agricultural remote sensing lack scene diversity and task complexity, limiting the evaluation of LMMs in this domain.
Method: AgroMind integrates multiple datasets to create 27,247 QA pairs and 19,615 images. It preprocesses data, defines tasks, and evaluates 24 LMMs.
Result: Significant performance gaps were found, with LMMs outperforming humans in some tasks but struggling with spatial reasoning and fine-grained recognition.
Conclusion: AgroMind provides a standardized framework for assessing LMMs in agricultural RS, highlighting domain knowledge limitations and future challenges.
Abstract: Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily in terms of insufficient scene diversity in the dataset and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 27,247 QA pairs and 19,615 images. The pipeline begins with multi-source data pre-processing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we employ LMMs for inference, generating responses, and performing detailed examinations. We evaluated 20 open-source LMMs and 4 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition; notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at https://rssysu.github.io/AgroMind/.
[225] Joint multi-dimensional dynamic attention and transformer for general image restoration
Huan Zhang, Xu Zhang, Nian Cai, Jianglei Di, Yun Zhang
Main category: cs.CV
TL;DR: A novel image restoration architecture combining multi-dimensional dynamic attention and self-attention in a U-Net framework, balancing performance and efficiency across multiple tasks.
Details
Motivation: Outdoor images often degrade due to rain, haze, and noise, challenging restoration methods. Current approaches struggle with complex degradation while maintaining efficiency.
Method: Integrates CNNs in encoder-decoder and transformers in the latent layer, using multi-dimensional dynamic attention and transposed self-attention for efficient feature extraction.
Result: Achieves better performance-complexity balance in deraining, deblurring, denoising, dehazing, and enhancement, with superior high-level vision task performance.
Conclusion: The proposed method effectively addresses complex degradation while maintaining efficiency, outperforming current approaches.
Abstract: Outdoor images often suffer from severe degradation due to rain, haze, and noise, impairing image quality and challenging high-level tasks. Current image restoration methods struggle to handle complex degradation while maintaining efficiency. This paper introduces a novel image restoration architecture that combines multi-dimensional dynamic attention and self-attention within a U-Net framework. To leverage the global modeling capabilities of transformers and the local modeling capabilities of convolutions, we use CNNs alone in the encoder-decoder and transformers alone in the latent layer. Additionally, we design convolutional kernels with selected multi-dimensional dynamic attention to capture diverse degraded inputs efficiently. A transformer block with transposed self-attention further enhances global feature extraction while maintaining efficiency. Extensive experiments demonstrate that our method achieves a better balance between performance and computational complexity across five image restoration tasks: deraining, deblurring, denoising, dehazing, and enhancement, as well as superior performance for high-level vision tasks. The source code will be available at https://github.com/House-yuyu/MDDA-former.
[226] PAD-F: Prior-Aware Debiasing Framework for Long-Tailed X-ray Prohibited Item Detection
Haoyu Wang, Renshuai Tao, Wei Wang, Yunchao Wei
Main category: cs.CV
TL;DR: PAD-F improves long-tailed object detection in X-ray security imagery using material and co-occurrence priors, achieving significant performance gains.
Details
Motivation: Addressing the challenge of long-tailed distribution in prohibited item detection in X-ray images, where conventional methods fail due to unique imaging principles.
Method: Introduces PAD-F with Explicit Material-Aware Augmentation (EMAA) for data-level enhancement and Implicit Co-occurrence Aggregator (ICA) for feature-level improvement.
Result: Achieves up to +17.2% AP50 improvement for tail classes on HiXray and PIDray datasets, outperforming state-of-the-art methods.
Conclusion: PAD-F offers an effective solution for long-tailed detection in X-ray security, enhancing performance for tail classes.
Abstract: Detecting prohibited items in X-ray security imagery is a challenging yet crucial task. With the rapid advancement of deep learning, object detection algorithms have been widely applied in this area. However, the distribution of object classes in real-world prohibited item detection scenarios often exhibits a distinct long-tailed distribution. Due to the unique principles of X-ray imaging, conventional methods for long-tailed object detection are often ineffective in this domain. To tackle these challenges, we introduce the Prior-Aware Debiasing Framework (PAD-F), a novel approach that employs a two-pronged strategy leveraging both material and co-occurrence priors. At the data level, our Explicit Material-Aware Augmentation (EMAA) component generates numerous challenging training samples for tail classes. It achieves this through a placement strategy guided by material-specific absorption rates and a gradient-based Poisson blending technique. At the feature level, the Implicit Co-occurrence Aggregator (ICA) acts as a plug-in module that enhances features for ambiguous objects by implicitly learning and aggregating statistical co-occurrence relationships within the image. Extensive experiments on the HiXray and PIDray datasets demonstrate that PAD-F significantly boosts the performance of multiple popular detectors. It achieves an absolute improvement of up to +17.2% in AP50 for tail classes and comprehensively outperforms existing state-of-the-art methods. Our work provides an effective and versatile solution to the critical problem of long-tailed detection in X-ray security.
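The blending half of EMAA corresponds to standard gradient-domain (Poisson) compositing, which OpenCV exposes directly; the material-aware placement policy that chooses the paste location is the paper's contribution and is omitted here:

```python
import cv2

def paste_tail_item(scene, item, item_mask, center):
    """Poisson-blend a tail-class item crop into an X-ray scene.

    scene, item: uint8 images; item_mask: uint8 mask of the item;
    center: (x, y) placement, chosen in EMAA by absorption-rate guidance.
    """
    return cv2.seamlessClone(item, scene, item_mask, center,
                             cv2.NORMAL_CLONE)
```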
[227] MGDFIS: Multi-scale Global-detail Feature Integration Strategy for Small Object Detection
Yuxiang Wang, Xuecheng Bai, Boyu Hu, Chuanzhi Xu, Haodong Chen, Vera Chung, Tingxue Li, Xiaoming Chen
Main category: cs.CV
TL;DR: MGDFIS improves small object detection in UAV imagery by integrating global and local features efficiently, outperforming existing methods.
Details
Motivation: Small object detection in UAV imagery is challenging due to tiny object size, low signal-to-noise ratios, and limited feature extraction. Existing methods add computational burden and blur details.
Method: Proposes MGDFIS, a fusion framework with three modules: FusionLock-TSS Attention, Global-detail Integration, and Dynamic Pixel Attention, to enhance detection while maintaining efficiency.
Result: MGDFIS outperforms state-of-the-art methods on the VisDrone benchmark, achieving high precision and recall with low inference time.
Conclusion: MGDFIS provides a practical, efficient solution for small-object detection on UAV platforms, balancing accuracy and resource usage.
Abstract: Small object detection in UAV imagery is crucial for applications such as search-and-rescue, traffic monitoring, and environmental surveillance, but it is hampered by tiny object size, low signal-to-noise ratios, and limited feature extraction. Existing multi-scale fusion methods help, but add computational burden and blur fine details, making small object detection in cluttered scenes difficult. To overcome these challenges, we propose the Multi-scale Global-detail Feature Integration Strategy (MGDFIS), a unified fusion framework that tightly couples global context with local detail to boost detection performance while maintaining efficiency. MGDFIS comprises three synergistic modules: the FusionLock-TSS Attention Module, which marries token-statistics self-attention with DynamicTanh normalization to highlight spectral and spatial cues at minimal cost; the Global-detail Integration Module, which fuses multi-scale context via directional convolution and parallel attention while preserving subtle shape and texture variations; and the Dynamic Pixel Attention Module, which generates pixel-wise weighting maps to rebalance uneven foreground and background distributions and sharpen responses to true object regions. Extensive experiments on the VisDrone benchmark demonstrate that MGDFIS consistently outperforms state-of-the-art methods across diverse backbone architectures and detection frameworks, achieving superior precision and recall with low inference time. By striking an optimal balance between accuracy and resource usage, MGDFIS provides a practical solution for small-object detection on resource-constrained UAV platforms.
[228] ViewDelta: Scaling Scene Change Detection through Text-Conditioning
Subin Varghese, Joshua Gao, Vedhus Hoskere
Main category: cs.CV
TL;DR: A generalized framework, ViewDelta, uses text prompts to define relevant changes in Scene Change Detection (SCD), enabling joint training across diverse datasets. It outperforms dataset-specific models.
Details
Motivation: Existing SCD methods struggle with ambiguity in labeling changes (e.g., vegetation growth) across datasets. A unified approach is needed.
Method: Proposes ViewDelta, a text-conditioned SCD framework, and introduces the CSeg dataset with 500K image pairs and 300K textual prompts.
Result: A single ViewDelta model trained on multiple datasets performs competitively or better than dataset-specific models.
Conclusion: Text conditioning is a powerful approach for generalizable SCD, demonstrated by ViewDelta’s success.
Abstract: We introduce a generalized framework for Scene Change Detection (SCD) that addresses the core ambiguity of distinguishing “relevant” from “nuisance” changes, enabling effective joint training of a single model across diverse domains and applications. Existing methods struggle to generalize due to differences in dataset labeling, where changes such as vegetation growth or lane marking alterations may be labeled as relevant in one dataset and irrelevant in another. To resolve this ambiguity, we propose ViewDelta, a text conditioned change detection framework that uses natural language prompts to define relevant changes precisely, such as a single attribute, a specific set of classes, or all observable differences. To facilitate training in this paradigm, we release the Conditional Change Segmentation dataset (CSeg), the first large-scale synthetic dataset for text conditioned SCD, consisting of over 500,000 image pairs with more than 300,000 unique textual prompts describing relevant changes. Experiments demonstrate that a single ViewDelta model trained jointly on CSeg, SYSU-CD, PSCD, VL-CMU-CD, and their unaligned variants achieves performance competitive with or superior to dataset specific models, highlighting text conditioning as a powerful approach for generalizable SCD. Our code and dataset are available at https://joshuakgao.github.io/viewdelta/.
[229] UltraRay: Introducing Full-Path Ray Tracing in Physics-Based Ultrasound Simulation
Felix Duelmer, Mohammad Farid Azampour, Magdalena Wysocki, Nassir Navab
Main category: cs.CV
TL;DR: A novel ultrasound simulation pipeline, UltraRay, uses ray tracing to improve realism by tracing rays back to the sensor and incorporating advanced imaging techniques, reducing artifacts and enabling gradient-based optimization.
Details
Motivation: Traditional ultrasound simulators are computationally expensive, while existing ray tracing models oversimplify ray propagation, leading to unrealistic artifacts.
Method: The proposed pipeline employs a ray tracing algorithm, optimized for plane wave imaging, and integrates signal processing for end-to-end ultrasound image formation.
Result: UltraRay enhances visual quality and realism by accurately capturing secondary reflections and reducing artifacts, demonstrated in synthetic scenes with reflective objects like bones.
Conclusion: The pipeline provides a fast, differentiable framework for ultrasound simulation, supporting advanced applications like beamforming, neural networks, and inverse scene reconstruction.
Abstract: Traditional ultrasound simulators solve the wave equation to model pressure distribution fields, achieving high accuracy but requiring significant computational time and resources. To address this, ray tracing approaches have been introduced, modeling wave propagation as rays interacting with boundaries and scatterers. However, existing models simplify ray propagation, generating echoes at interaction points without considering return paths to the sensor. This can result in unrealistic artifacts and necessitates careful scene tuning for plausible results. We propose a novel ultrasound simulation pipeline that utilizes a ray tracing algorithm to generate echo data, tracing each ray from the transducer through the scene and back to the sensor. To replicate advanced ultrasound imaging, we introduce a ray emission scheme optimized for plane wave imaging, incorporating delay and steering capabilities. Furthermore, we integrate a standard signal processing pipeline to simulate end-to-end ultrasound image formation. We showcase the efficacy of the proposed pipeline by modeling synthetic scenes featuring highly reflective objects, such as bones. In doing so, our proposed approach, UltraRay, not only enhances the overall visual quality but also improves the realism of the simulated images by accurately capturing secondary reflections and reducing unnatural artifacts. By building on top of a differentiable framework, the proposed pipeline lays the groundwork for a fast and differentiable ultrasound simulation tool necessary for gradient-based optimization, enabling advanced ultrasound beamforming strategies, neural network integration, and accurate inverse scene reconstruction.
[230] Open-Set LiDAR Panoptic Segmentation Guided by Uncertainty-Aware Learning
Rohit Mohan, Julia Hindel, Florian Drews, Claudius Gläser, Daniele Cattaneo, Abhinav Valada
Main category: cs.CV
TL;DR: ULOPS is an uncertainty-guided open-set LiDAR panoptic segmentation framework that detects unknown objects using Dirichlet-based evidential learning and outperforms existing methods.
Details
Motivation: Existing LiDAR panoptic segmentation models fail to detect unknown objects due to closed-set assumptions, limiting their use in open-world environments.
Method: ULOPS uses separate decoders for semantic segmentation, embedding, and instance center prediction, along with three uncertainty-driven loss functions to differentiate known and unknown objects.
Result: ULOPS outperforms existing methods in open-set LiDAR panoptic segmentation, validated on KITTI-360 and nuScenes datasets.
Conclusion: ULOPS effectively addresses the challenge of detecting unknown objects in open-world environments, setting a new benchmark for open-set LiDAR panoptic segmentation.
Abstract: Autonomous vehicles that navigate in open-world environments may encounter previously unseen object classes. However, most existing LiDAR panoptic segmentation models rely on closed-set assumptions, failing to detect unknown object instances. In this work, we propose ULOPS, an uncertainty-guided open-set panoptic segmentation framework that leverages Dirichlet-based evidential learning to model predictive uncertainty. Our architecture incorporates separate decoders for semantic segmentation with uncertainty estimation, embedding with prototype association, and instance center prediction. During inference, we leverage uncertainty estimates to identify and segment unknown instances. To strengthen the model’s ability to differentiate between known and unknown objects, we introduce three uncertainty-driven loss functions: Uniform Evidence Loss encourages high uncertainty in unknown regions; Adaptive Uncertainty Separation Loss ensures a consistent difference in uncertainty estimates between known and unknown objects at a global scale; and Contrastive Uncertainty Loss refines this separation at the fine-grained level. To evaluate open-set performance, we extend benchmark settings on KITTI-360 and introduce a new open-set evaluation for nuScenes. Extensive experiments demonstrate that ULOPS consistently outperforms existing open-set LiDAR panoptic segmentation methods.
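The evidential backbone rests on a standard Dirichlet formulation; the sketch below shows how per-point vacuity would typically be computed, with the softplus evidence head as an assumption rather than ULOPS's exact design:

```python
import torch
import torch.nn.functional as F

def dirichlet_uncertainty(logits):
    """Predictive probabilities and vacuity from evidential outputs.

    logits: (N, K) per-point outputs. evidence = softplus(logits),
    alpha = evidence + 1. Vacuity u = K / sum(alpha) is close to 1
    when little evidence was gathered, flagging likely unknowns.
    """
    alpha = F.softplus(logits) + 1.0
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength                       # expected class probs
    vacuity = alpha.shape[-1] / strength.squeeze(-1)
    return probs, vacuity
```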
[231] HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, Xiang Bai
Main category: cs.CV
TL;DR: HERMES is a unified Driving World Model integrating 3D scene understanding and future scene generation, achieving state-of-the-art performance with reduced errors and improved metrics.
Details
Motivation: Existing Driving World Models lack scene understanding, limiting their ability to interpret and reason about driving environments.
Method: HERMES uses a Bird’s-Eye View (BEV) representation and world queries with causal attention in a Large Language Model to unify understanding and generation.
Result: HERMES reduces generation error by 32.4% and improves understanding metrics like CIDEr by 8.0%.
Conclusion: HERMES successfully integrates scene understanding and generation, setting a new benchmark for Driving World Models.
Abstract: Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird’s-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model, enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at https://github.com/LMD0311/HERMES.
[232] HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment
Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais
Main category: cs.CV
TL;DR: The paper introduces HVL, a hierarchical vision-language framework for semi-supervised semantic segmentation under domain shift, leveraging domain-invariant text embeddings to improve generalization with limited supervision.
Details
Motivation: Addressing the challenge of semi-supervised semantic segmentation under domain shift by utilizing domain-invariant semantic knowledge from vision-language models.
Method: Proposes HVL, integrating domain-invariant text embeddings as object queries in a transformer-based segmentation network, with targeted regularization losses for vision-language alignment.
Result: Achieves significant improvements in mIoU on benchmark datasets (e.g., +9.3% on COCO) with minimal supervision, demonstrating superior performance.
Conclusion: Language-guided segmentation effectively bridges the label efficiency gap and enhances fine-grained generalization.
Abstract: In this paper, we address Semi-supervised Semantic Segmentation (SSS) under domain shift by leveraging domain-invariant semantic knowledge from text embeddings of Vision-Language Models (VLMs). We propose a unified Hierarchical Vision-Language framework (HVL) that integrates domain-invariant text embeddings as object queries in a transformer-based segmentation network to improve generalization and reduce misclassification under limited supervision. These textual queries are used to group pixels with shared semantics under SSS. HVL is designed to (1) generate textual queries that maximally encode domain-invariant semantics from VLM while capturing intra-class variations; (2) align these queries with spatial visual features to enhance their segmentation ability and improve the semantic clarity of visual features. We also introduce targeted regularization losses that maintain vision–language alignment throughout training to reinforce semantic understanding. HVL establishes a novel state-of-the-art by achieving a +9.3% improvement in mean Intersection over Union (mIoU) on COCO, utilizing 232 labelled images, +3.1% on Pascal VOC employing 92 labels, +4.8% on ADE20K using 316 labels, and +3.4% on Cityscapes with 100 labels, demonstrating superior performance with less than 1% supervision on four benchmark datasets. Our results show that language-guided segmentation bridges the label efficiency gap and enables new levels of fine-grained generalization.
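The core mechanism, language embeddings acting as object queries, can be sketched with a single cross-attention layer; the hidden sizes and one-layer design are assumptions, not HVL's full architecture:

```python
import torch
import torch.nn as nn

class TextQueryDecoder(nn.Module):
    """Cross-attend class-text embeddings (queries) to pixel features."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_queries, visual_feats):
        # text_queries: (B, num_classes, dim) frozen VLM text embeddings
        # visual_feats: (B, H*W, dim) flattened per-pixel features
        q, _ = self.attn(text_queries, visual_feats, visual_feats)
        # per-class mask logits: dot product of refined queries and pixels
        return torch.einsum("bcd,bnd->bcn", q, visual_feats)
```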
[233] LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
Yiren Song, Danze Chen, Mike Zheng Shou
Main category: cs.CV
TL;DR: LayerTracer is a diffusion transformer framework for generating layered SVGs by mimicking human design workflows, outperforming existing methods in quality and editability.
Details
Motivation: Existing methods for layered SVG generation either oversimplify or introduce redundancies, failing to align with professional design cognition.
Method: LayerTracer uses a text-conditioned diffusion transformer (DiT) to create multi-phase rasterized blueprints, followed by layer-wise vectorization with path deduplication. A conditional diffusion mechanism guides hierarchical reconstruction.
Result: LayerTracer outperforms optimization-based and neural baselines in generation quality and editability, aligning with professional design cognition.
Conclusion: LayerTracer effectively bridges the gap in cognitive-aligned layered SVG generation, offering superior performance and editability.
Abstract: Generating cognitive-aligned layered SVGs remains challenging due to existing methods’ tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a diffusion transformer based framework that bridges this gap by learning designers’ layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: First, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows. Second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments demonstrate LayerTracer’s superior performance against optimization-based and neural baselines in both generation quality and editability, effectively aligning AI-generated vectors with professional design cognition.
[234] Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution
Vlad Hosu, Lorenzo Agnolucci, Daisuke Iso, Dietmar Saupe
Main category: cs.CV
TL;DR: The paper introduces Image Intrinsic Scale (IIS) and the IISA task to quantify how image scale affects perceived quality, proposing a dataset (IISA-DB) and a weak-labeling strategy (WIISA) to improve IQA methods.
Details
Motivation: To systematically quantify the relationship between image scale and perceived quality, which has been highlighted but not thoroughly studied.
Method: Defines IIS and the IISA task, develops a subjective annotation methodology, creates the IISA-DB dataset, and proposes WIISA for weak-labeling.
Result: WIISA improves the performance of IQA methods adapted for IISA compared to using only ground-truth labels.
Conclusion: The work bridges a gap in understanding image scale’s impact on quality, offering tools (dataset, WIISA) for future research.
Abstract: Image Quality Assessment (IQA) measures and predicts perceived image quality by human observers. Although recent studies have highlighted the critical influence that variations in the scale of an image have on its perceived quality, this relationship has not been systematically quantified. To bridge this gap, we introduce the Image Intrinsic Scale (IIS), defined as the largest scale where an image exhibits its highest perceived quality. We also present the Image Intrinsic Scale Assessment (IISA) task, which involves subjectively measuring and predicting the IIS based on human judgments. We develop a subjective annotation methodology and create the IISA-DB dataset, comprising 785 image-IIS pairs annotated by experts in a rigorously controlled crowdsourcing study. Furthermore, we propose WIISA (Weak-labeling for Image Intrinsic Scale Assessment), a strategy that leverages how the IIS of an image varies with downscaling to generate weak labels. Experiments show that applying WIISA during the training of several IQA methods adapted for IISA consistently improves the performance compared to using only ground-truth labels. The code, dataset, and pre-trained models are available at https://github.com/SonyResearch/IISA.
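The abstract states only that the IIS varies predictably under downscaling; one plausible reading of the weak-labeling rule, offered purely as an illustration and not taken from the paper:

```python
def wiisa_weak_label(iis, s):
    """Hypothetical weak IIS label for a downscaled copy of an image.

    If an image looks best at scale `iis` of its native resolution
    (0 < iis <= 1) and we downscale it by a factor `s` with
    iis <= s <= 1, the copy plausibly looks best at about iis / s of
    its own, smaller, native size.
    """
    assert 0.0 < iis <= s <= 1.0
    return min(1.0, iis / s)
```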
[235] SWA-SOP: Spatially-aware Window Attention for Semantic Occupancy Prediction in Autonomous Driving
Helin Cao, Rafael Materla, Sven Behnke
Main category: cs.CV
TL;DR: The paper proposes Spatially-aware Window Attention (SWA) to improve Semantic Occupancy Prediction (SOP) in autonomous driving by incorporating local spatial context into attention, addressing limitations of existing transformer-based methods.
Details
Motivation: Current transformer-based SOP methods lack explicit spatial structure modeling in attention, leading to poor performance in sparse or occluded areas.
Method: Introduces SWA, a mechanism that integrates local spatial context into attention computation.
Result: SWA achieves state-of-the-art results on LiDAR-based SOP benchmarks and shows consistent improvements in camera-based SOP pipelines.
Conclusion: SWA enhances geometric awareness and performance in SOP, demonstrating its effectiveness across sensor modalities.
Abstract: Perception systems in autonomous driving rely on sensors such as LiDAR and cameras to perceive the 3D environment. However, due to occlusions and data sparsity, these sensors often fail to capture complete information. Semantic Occupancy Prediction (SOP) addresses this challenge by inferring both occupancy and semantics of unobserved regions. Existing transformer-based SOP methods lack explicit modeling of spatial structure in attention computation, resulting in limited geometric awareness and poor performance in sparse or occluded areas. To this end, we propose Spatially-aware Window Attention (SWA), a novel mechanism that incorporates local spatial context into attention. SWA significantly improves scene completion and achieves state-of-the-art results on LiDAR-based SOP benchmarks. We further validate its generality by integrating SWA into a camera-based SOP pipeline, where it also yields consistent gains across modalities.
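A common way to give window attention explicit spatial structure is an additive bias indexed by relative voxel offsets; the snippet shows that generic pattern, which may differ from SWA's exact formulation:

```python
import torch

def spatial_window_attention(q, k, v, rel_pos_bias):
    """Window attention with an additive learned spatial bias.

    q, k, v: (B, heads, N, d) tokens within one local window;
    rel_pos_bias: (heads, N, N) bias looked up from relative offsets,
    injecting local geometric context into the attention weights.
    """
    d = q.shape[-1]
    attn = (q @ k.transpose(-2, -1)) / d ** 0.5 + rel_pos_bias
    return attn.softmax(dim=-1) @ v
```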
[236] Towards Synthesized and Editable Motion In-Betweening Through Part-Wise Phase Representation
Minyue Dai, Ke Fan, Bin Ji, Haoran Xu, Haoyu Zhao, Junting Dong, Jingbo Wang, Bo Dai
Main category: cs.CV
TL;DR: A novel framework for styled motion in-betweening models motion styles at the body-part level, improving flexibility and controllability in animations.
Details
Motivation: Existing methods overlook individual body part representation, limiting flexibility in adjusting motion styles for specific limbs.
Method: Uses periodic autoencoders to extract body-part phases and integrates motion manifold learning with conditional generation for decoupling motion source and control.
Result: Achieves superior speed, robust generalization, and effective generation of extended motion sequences.
Conclusion: The proposed framework enhances diversity and controllability in infilled motions, enabling nuanced and expressive animations.
Abstract: Styled motion in-betweening is crucial for computer animation and gaming. However, existing methods typically encode motion styles by modeling whole-body motions, often overlooking the representation of individual body parts. This limitation reduces the flexibility of infilled motion, particularly in adjusting the motion styles of specific limbs independently. To overcome this challenge, we propose a novel framework that models motion styles at the body-part level, enhancing both the diversity and controllability of infilled motions. Our approach enables more nuanced and expressive animations by allowing precise modifications to individual limb motions while maintaining overall motion coherence. Leveraging phase-related insights, our framework employs periodic autoencoders to automatically extract the phase of each body part, capturing distinctive local style features. Additionally, we effectively decouple the motion source from synthesis control by integrating motion manifold learning and conditional generation techniques from both image and motion domains. This allows the motion source to generate high-quality motions across various styles, with extracted motion and style features readily available for controlled synthesis in subsequent tasks. Comprehensive evaluations demonstrate that our method achieves superior speed, robust generalization, and effective generation of extended motion sequences.
[237] OC-SOP: Enhancing Vision-Based 3D Semantic Occupancy Prediction by Object-Centric Awareness
Helin Cao, Sven Behnke
Main category: cs.CV
TL;DR: OC-SOP improves semantic occupancy prediction by integrating object-centric cues, enhancing accuracy for foreground objects.
Details
Motivation: Challenges in autonomous driving perception due to occlusions and incomplete data, especially for dynamic objects.
Method: Proposes Object-Centric SOP (OC-SOP), integrating object-centric cues from a detection branch into the SOP pipeline.
Result: Achieves state-of-the-art performance on SemanticKITTI, particularly for foreground objects.
Conclusion: OC-SOP effectively addresses limitations of conventional methods by leveraging object-centric features.
Abstract: Autonomous driving perception faces significant challenges due to occlusions and incomplete scene data in the environment. To overcome these issues, the task of semantic occupancy prediction (SOP) is proposed, which aims to jointly infer both the geometry and semantic labels of a scene from images. However, conventional camera-based methods typically treat all categories equally and primarily rely on local features, leading to suboptimal predictions, especially for dynamic foreground objects. To address this, we propose Object-Centric SOP (OC-SOP), a framework that integrates high-level object-centric cues extracted via a detection branch into the semantic occupancy prediction pipeline. This object-centric integration significantly enhances the prediction accuracy for foreground objects and achieves state-of-the-art performance among all categories on SemanticKITTI.
[238] GranQ: Granular Zero-Shot Quantization with Channel-Wise Activation Scaling in QAT
Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, Kijung Lee, Sanghyun Park
Main category: cs.CV
TL;DR: GranQ is a novel activation quantization framework for zero-shot quantization (ZSQ) that introduces an efficient pre-scaling strategy, reducing computational overhead and improving accuracy, especially in low-bit settings.
Details
Motivation: To address the activation distortion and computational inefficiency in existing ZSQ methods that rely on synthetic inputs and per-channel scaling.
Method: GranQ applies scaling factors in a pre-scaling step through fully vectorized computation, eliminating runtime scaling overhead.
Result: GranQ outperforms state-of-the-art ZSQ methods, achieving up to 5.45% higher accuracy in 3-bit settings on CIFAR-100 and surpassing full-precision baselines on CIFAR-10. It also reduces quantization latency.
Conclusion: GranQ offers a more efficient and accurate alternative to conventional ZSQ methods, potentially inspiring future research beyond data generation and model fine-tuning.
Abstract: Zero-shot quantization (ZSQ) enables neural network compression without original training data, making it a promising solution for restricted data access scenarios. To compensate for the lack of data, recent ZSQ methods typically rely on synthetic inputs generated from the full-precision model. However, these synthetic inputs often lead to activation distortion, especially under low-bit settings. To mitigate this, existing methods typically employ per-channel scaling, but they still struggle due to the severe computational overhead during the accumulation process. To overcome this critical bottleneck, we propose GranQ, a novel activation quantization framework that introduces an efficient pre-scaling strategy. Unlike conventional channel-wise methods that repeatedly perform scaling operations during accumulation, GranQ applies scaling factors in a pre-scaling step through fully vectorized computation, eliminating runtime scaling overhead. This design enables GranQ to maintain fine-grained quantization accuracy while significantly reducing computational burden, particularly in low-bit quantization settings. Extensive experiments under quantization-aware training (QAT) settings demonstrate that GranQ consistently outperforms state-of-the-art ZSQ methods across CIFAR and ImageNet. In particular, our method achieves up to 5.45% higher accuracy in the 3-bit setting on CIFAR-100 and even surpasses the full-precision baseline on CIFAR-10. Furthermore, GranQ achieves significant speedup in quantization latency over conventional per-channel methods, demonstrating improved efficiency. With these findings, we anticipate that GranQ will inspire future research beyond conventional ZSQ approaches centered on data generation and model fine-tuning. The official code is available at https://github.com/anonymus-orange/GranQ.
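The pre-scaling idea amounts to computing all per-channel scales once and applying them in a single vectorized step rather than rescaling inside the accumulation loop; the symmetric fake-quantization details below are assumptions for illustration:

```python
import torch

def prescaled_channel_quant(x, n_bits=3, eps=1e-8):
    """Channel-wise activation fake-quantization with one pre-scaling step.

    x: (B, C, H, W) activations. Per-channel scales are computed once
    and applied with a single vectorized multiply, avoiding repeated
    runtime scaling during accumulation.
    """
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=(0, 2, 3), keepdim=True).clamp(min=eps) / qmax
    x_q = torch.round(x / scale).clamp(-qmax - 1, qmax)
    return x_q * scale     # dequantized activations for QAT
```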
[239] Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
Meng Cao, Pengfei Hu, Yingyao Wang, Jihao Gu, Haoran Tang, Haoze Zhao, Chen Wang, Jiahua Dong, Wangbo Yu, Ge Zhang, Jun Song, Xiang Li, Bo Zheng, Ian Reid, Xiaodan Liang
Main category: cs.CV
TL;DR: Video SimpleQA is introduced as the first benchmark for evaluating factual grounding in Large Video Language Models (LVLMs), highlighting current model deficiencies and proposing multi-hop, knowledge-integrated questions for rigorous assessment.
Details
Motivation: Addressing the lack of a comprehensive benchmark for evaluating factual grounding in LVLMs, which is critical for multi-modal understanding.
Method: Introduces Video SimpleQA with features like multi-hop questions, external knowledge integration, temporal grounding, and definitive short-form answers. Evaluates 33 LVLMs.
Result: Current LVLMs show poor factual adherence (best F-score: 66.3%), overconfidence, and degraded performance in multi-hop QAs. Retrieval-augmented generation improves results but adds overhead.
Conclusion: Video SimpleQA serves as a foundational benchmark to guide LVLM development toward verifiable factual grounding in videos.
Abstract: Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in videos remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation in video contexts. Our work differs from existing video benchmarks through the following key features: 1) Knowledge required: demanding integration of external knowledge beyond the video’s explicit narrative; 2) Multi-hop fact-seeking question: Each question involves multiple explicit facts and requires strict factual grounding without hypothetical or subjective inferences. We also include per-hop single-fact-based sub-QAs alongside final QAs to enable fine-grained, step-by-step evaluation; 3) Short-form definitive answer: Answers are crafted as unambiguous and definitively correct in a short format with minimal scoring variance; 4) Temporal grounding required: Requiring answers to rely on one or more temporal segments in videos, rather than single frames. We extensively evaluate 33 state-of-the-art LVLMs and summarize key findings as follows: 1) Current LVLMs exhibit notable deficiencies in factual adherence, with the best-performing model o3 merely achieving an F-score of 66.3%; 2) Most LVLMs are overconfident in what they generate, with self-stated confidence exceeding actual accuracy; 3) Retrieval-augmented generation demonstrates consistent improvements at the cost of additional inference time overhead; 4) Multi-hop QA demonstrates substantially degraded performance compared to single-hop sub-QAs, with first-hop object or event recognition emerging as the primary bottleneck. We position Video SimpleQA as the cornerstone benchmark for video factuality assessment, aiming to steer LVLM development toward verifiable grounding in real-world contexts.
[240] NeuralGS: Bridging Neural Fields and 3D Gaussian Splatting for Compact 3D Representations
Zhenyu Tang, Chaoran Feng, Xinhua Cheng, Wangbo Yu, Junwu Zhang, Yuan Liu, Xiaoxiao Long, Wenping Wang, Li Yuan
Main category: cs.CV
TL;DR: NeuralGS compresses 3D Gaussian Splatting (3DGS) into a compact neural field representation using MLPs, reducing model size by 91 times without quality loss.
Details
Motivation: 3DGS has high storage and transmission costs due to millions of 3D Gaussians. Neural fields like NeRF offer compact representation, inspiring NeuralGS.
Method: Clustering strategy with tiny MLPs to encode Gaussian attributes, weighted by importance scores.
Result: 91-times average model size reduction while maintaining visual quality.
Conclusion: NeuralGS effectively compresses 3DGS, balancing efficiency and quality.
Abstract: 3D Gaussian Splatting (3DGS) achieves impressive quality and rendering speed, but with millions of 3D Gaussians and significant storage and transmission costs. In this paper, we aim to develop a simple yet effective method called NeuralGS that compresses the original 3DGS into a compact representation. Our observation is that neural fields like NeRF can represent complex 3D scenes with Multi-Layer Perceptron (MLP) neural networks using only a few megabytes. Thus, NeuralGS effectively adopts the neural field representation to encode the attributes of 3D Gaussians with MLPs, only requiring a small storage size even for a large-scale scene. To achieve this, we adopt a clustering strategy and fit the Gaussians within each cluster using different tiny MLPs, based on importance scores of Gaussians as fitting weights. We experiment on multiple datasets, achieving a 91-times average model size reduction without harming the visual quality.
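The compression move is to replace per-Gaussian attribute storage with a tiny coordinate MLP per cluster, fitted with importance-weighted regression; layer sizes and the attribute dimension are assumptions for this sketch:

```python
import torch
import torch.nn as nn

class TinyGaussianField(nn.Module):
    """Per-cluster MLP mapping a Gaussian's center to its attributes
    (e.g. color, opacity, scale, rotation packed into one vector)."""

    def __init__(self, attr_dim=59, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, attr_dim),
        )

    def forward(self, xyz):          # (N, 3) Gaussian centers
        return self.net(xyz)         # (N, attr_dim) predicted attributes

def fit_loss(pred, target, importance):
    # importance-weighted L2, so salient Gaussians are fitted tightly
    return (importance.unsqueeze(-1) * (pred - target) ** 2).mean()
```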
[241] GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-V Team, :, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yanzi Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhanxiao Du, Zhenyu Hou, Zhao Xue, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
Main category: cs.CV
TL;DR: GLM-4.1V-Thinking and GLM-4.5V are vision-language models (VLMs) achieving state-of-the-art performance on diverse tasks through a reasoning-centric training framework and Reinforcement Learning with Curriculum Sampling (RLCS).
Details
Motivation: To advance general-purpose multimodal understanding and reasoning by developing capable vision-language models.
Method: Large-scale pre-training followed by RLCS to enhance model capabilities across tasks like STEM problem solving, video understanding, and coding.
Result: GLM-4.5V outperforms open-source models of similar size and competes with closed-source models like Gemini-2.5-Flash. GLM-4.1V-9B-Thinking surpasses larger models on 29 benchmarks.
Conclusion: The models demonstrate superior performance and are open-sourced, contributing to the field of multimodal AI.
Abstract: We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. Code, models and more information are released at https://github.com/zai-org/GLM-V.
[242] Cyc3D: Fine-grained Controllable 3D Generation via Cycle Consistency Regularization
Hongbin Xu, Chaohui Yu, Feng Xiao, Jiazheng Xing, Hai Ci, Weitao Chen, Fan Wang, Ming Li
Main category: cs.CV
TL;DR: The paper proposes Cyc3D, a framework that enhances controllable 3D generation by enforcing cyclic consistency between generated 3D content and input conditions, improving alignment and detail preservation.
Details
Motivation: Existing methods struggle with maintaining accurate alignment between generated 3D content and input conditions, leading to discrepancies. The goal is to improve controllability in 3D generation.
Method: The framework uses a feed-forward backbone to generate 3D objects from input conditions and text prompts. It employs a cyclic process with two consistency constraints (view and condition consistency) to ensure coherence and alignment.
Result: Experiments show Cyc3D outperforms existing methods, with significant improvements in controllability (e.g., +14.17% PSNR for edge, +6.26% PSNR for sketch).
Conclusion: Cyc3D effectively enhances controllable 3D generation by enforcing cyclic consistency, addressing alignment challenges and preserving fine-grained details.
Abstract: Despite the remarkable progress of 3D generation, achieving controllability, i.e., ensuring consistency between generated 3D content and input conditions like edge and depth, remains a significant challenge. Existing methods often struggle to maintain accurate alignment, leading to noticeable discrepancies. To address this issue, we propose Cyc3D, a new framework that enhances controllable 3D generation by explicitly encouraging cyclic consistency between the second-order 3D content, generated based on extracted signals from the first-order generation, and its original input controls. Specifically, we employ an efficient feed-forward backbone that can generate a 3D object from an input condition and a text prompt. Given an initial viewpoint and a control signal, a novel view is rendered from the generated 3D content, from which the extracted condition is used to regenerate the 3D content. This re-generated output is then rendered back to the initial viewpoint, followed by another round of control signal extraction, forming a cyclic process with two consistency constraints. View consistency ensures coherence between the two generated 3D objects, measured by semantic similarity to accommodate generative diversity. Condition consistency aligns the final extracted signal with the original input control, preserving structural or geometric details throughout the process. Extensive experiments on popular benchmarks demonstrate that Cyc3D significantly improves controllability, especially for fine-grained details, outperforming existing methods across various conditions (e.g., +14.17% PSNR for edge, +6.26% PSNR for sketch).
[243] LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition
Songsong Xiong, Hamidreza Kasaei
Main category: cs.CV
TL;DR: Proposes LM-MCVT, a novel network for 3D object recognition in robotics, using GEEF for multi-view fusion, achieving 95.6% accuracy on ModelNet40 and robust performance on OmniObject3D.
Details
Motivation: Address challenges in 3D object recognition in complex human-centered environments like restaurants and warehouses.
Method: LM-MCVT combines convolutional encoders and transformers, leveraging GEEF for multi-view fusion.
Result: Achieves 95.6% accuracy on ModelNet40 and superior performance on OmniObject3D.
Conclusion: LM-MCVT is robust and effective for 3D object recognition in both synthetic and real-world datasets.
Abstract: In human-centered environments such as restaurants, homes, and warehouses, robots often face challenges in accurately recognizing 3D objects. These challenges stem from the complexity and variability of these environments, including diverse object shapes. In this paper, we propose a novel Lightweight Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT) to enhance 3D object recognition in robotic applications. Our approach leverages the Globally Entropy-based Embeddings Fusion (GEEF) method to integrate multi-views efficiently. The LM-MCVT architecture incorporates pre- and mid-level convolutional encoders and local and global transformers to enhance feature extraction and recognition accuracy. We evaluate our method on the synthetic ModelNet40 dataset and achieve a recognition accuracy of 95.6% using a four-view setup, surpassing existing state-of-the-art methods. To further validate its effectiveness, we conduct 5-fold cross-validation on the real-world OmniObject3D dataset using the same configuration. Results consistently show superior performance, demonstrating the method’s robustness in 3D object recognition across synthetic and real-world 3D data.
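The summary does not spell out GEEF, but an entropy-based view fusion can be sketched as follows: views whose predictions have lower entropy (more confident) receive larger fusion weights. This is an assumption-level illustration, not the paper's implementation:

```python
import torch

def entropy_weighted_fusion(view_logits, view_embeddings):
    """A minimal sketch of entropy-based multi-view fusion.

    view_logits: (V, B, num_classes) per-view class logits.
    view_embeddings: (V, B, D) per-view embeddings to fuse.
    """
    probs = view_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)  # (V, B)
    # Lower entropy = more confident view = larger fusion weight.
    weights = (-entropy).softmax(dim=0).unsqueeze(-1)             # (V, B, 1)
    return (weights * view_embeddings).sum(dim=0)                 # (B, D)

fused = entropy_weighted_fusion(torch.randn(4, 2, 40), torch.randn(4, 2, 256))
```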
[244] CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback
Chenhan Jiang, Yihan Zeng, Dit-Yan Yeung
Main category: cs.CV
TL;DR: The paper introduces Textual Coherent Score Distillation (TCSD) to improve text-to-3D generation by addressing semantic fidelity issues in SDS-based methods, leveraging MLLMs for alignment feedback.
Details
Motivation: Current SDS-based methods struggle with semantic fidelity for complex prompts and text-3D alignment degradation due to view-independent biases.
Method: Proposes TCSD, integrating MLLM feedback for text-3D alignment, and introduces 3DLLaVA-CRITIC for evaluating alignment and LLM-layout initialization for faster convergence.
Result: CoherenDream framework shows consistent improvements in text-3D alignment and optimization efficiency.
Conclusion: TCSD and MLLM integration significantly enhance text-to-3D generation, validated by extensive ablation studies.
Abstract: Score Distillation Sampling (SDS) has achieved remarkable success in text-to-3D content generation. However, SDS-based methods struggle to maintain semantic fidelity for user prompts, particularly when involving multiple objects with intricate interactions. While existing approaches often address 3D consistency through multiview diffusion model fine-tuning on 3D datasets, this strategy inadvertently exacerbates text-3D alignment degradation. The limitation stems from SDS’s inherent accumulation of view-independent biases during optimization, which progressively diverges from the ideal text alignment direction. To alleviate this limitation, we propose a novel SDS objective, dubbed as Textual Coherent Score Distillation (TCSD), which integrates alignment feedback from multimodal large language models (MLLMs). Our TCSD leverages cross-modal understanding capabilities of MLLMs to assess and guide the text-3D correspondence during the optimization. We further develop 3DLLaVA-CRITIC - a fine-tuned MLLM specialized for evaluating multiview text alignment in 3D generations. Additionally, we introduce an LLM-layout initialization that significantly accelerates optimization convergence through semantic-aware spatial configuration. Our framework, CoherenDream, achieves consistent improvement across multiple metrics on the TIFA subset. As the first study to incorporate MLLMs into SDS optimization, we also conduct extensive ablation studies to explore optimal MLLM adaptations for 3D generation tasks.
[245] DRWKV: Focusing on Object Edges for Low-Light Image Enhancement
Xuecheng Bai, Yuxiang Wang, Boyu Hu, Qinyuan Jie, Chuanzhi Xu, Hongru Xiao, Kechen Li, Vera Chung
Main category: cs.CV
TL;DR: DRWKV model enhances low-light images by integrating GER theory, Evolving WKV Attention, and Bi-SAB with MS2-Loss, achieving top performance on benchmarks and improving downstream tasks.
Details
Motivation: Addressing challenges in preserving edge continuity and structural details in low-light image enhancement.
Method: Combines Global Edge Retinex (GER) theory, Evolving WKV Attention, and Bilateral Spectrum Aligner (Bi-SAB) with MS2-Loss.
Result: Achieves leading PSNR, SSIM, and NIQE scores on benchmarks and improves low-light multi-object tracking.
Conclusion: DRWKV is effective for low-light image enhancement with strong generalization capabilities.
Abstract: Low-light image enhancement remains a challenging task, particularly in preserving object edge continuity and fine structural details under extreme illumination degradation. In this paper, we propose a novel model, DRWKV (Detailed Receptance Weighted Key Value), which integrates our proposed Global Edge Retinex (GER) theory, enabling effective decoupling of illumination and edge structures for enhanced edge fidelity. Secondly, we introduce Evolving WKV Attention, a spiral-scanning mechanism that captures spatial edge continuity and models irregular structures more effectively. Thirdly, we design the Bilateral Spectrum Aligner (Bi-SAB) and a tailored MS2-Loss to jointly align luminance and chrominance features, improving visual naturalness and mitigating artifacts. Extensive experiments on five LLIE benchmarks demonstrate that DRWKV achieves leading performance in PSNR, SSIM, and NIQE while maintaining low computational complexity. Furthermore, DRWKV enhances downstream performance in low-light multi-object tracking tasks, validating its generalization capabilities.
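The spiral-scanning idea behind Evolving WKV Attention can be illustrated by the serialization order itself: a spiral traversal visits an H×W feature grid so that spatially neighboring positions stay close in the 1D sequence. DRWKV's exact scan is not specified in the summary; this is the generic pattern:

```python
def spiral_indices(h, w):
    """Visiting order for an h x w grid, spiraling inward from the top-left.
    A sketch of the kind of serialization a spiral-scanning attention might
    use before feeding features to a sequence model."""
    top, bottom, left, right = 0, h - 1, 0, w - 1
    order = []
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]          # go right
        order += [(r, right) for r in range(top + 1, bottom + 1)]    # go down
        if top < bottom and left < right:
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]  # left
            order += [(r, left) for r in range(bottom - 1, top, -1)]        # up
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order

assert len(spiral_indices(4, 5)) == 20  # visits every cell exactly once
```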
[246] PiT: Progressive Diffusion Transformer
Jiafu Wu, Yabiao Wang, Jian Li, Jinlong Peng, Yun Cao, Chengjie Wang, Jiangning Zhang
Main category: cs.CV
TL;DR: The paper introduces Pseudo Shifted Window Attention (PSWA) and Progressive Coverage Channel Allocation (PCCA) to improve Diffusion Transformers (DiTs) by reducing redundant global computation and enhancing high-frequency information, resulting in superior performance with lower computational cost.
Details
Motivation: DiTs face high computational costs due to redundant global modeling and inefficient attention mechanisms. The study aims to address these inefficiencies.
Method: Proposes PSWA for balanced global-local modeling and PCCA for high-order attention. Introduces Pseudo Progressive Diffusion Transformer (PiT) based on these innovations.
Result: PiT-L achieves 54% FID improvement over DiT-XL/2 with less computation.
Conclusion: The proposed methods significantly enhance DiTs’ efficiency and performance, demonstrating their potential for image generation.
Abstract: Diffusion Transformers (DiTs) achieve remarkable performance within image generation via the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic global modeling transformers, which face significant quadratic computational cost. However, through empirical analysis, we find that DiTs do not rely as heavily on global information as previously believed. In fact, most layers exhibit significant redundancy in global computation. Additionally, conventional attention mechanisms suffer from low-frequency inertia, limiting their efficiency. To address these issues, we propose Pseudo Shifted Window Attention (PSWA), which fundamentally mitigates global attention redundancy. PSWA achieves a balance of global and local information through window attention. It further utilizes a high-frequency bridging branch to simulate shifted window operations, which both enriches the high-frequency information and strengthens inter-window connections. Furthermore, we propose the Progressive Coverage Channel Allocation (PCCA) strategy that captures high-order attention without additional computational cost. Based on these innovations, we propose a series of Pseudo Progressive Diffusion Transformers (PiT). Our extensive experiments show their superior performance; for example, our proposed PiT-L achieves 54% FID improvement over DiT-XL/2 while using less computation.
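A minimal sketch of the two PSWA ingredients named above, window attention plus a high-frequency branch, follows; the projections, branch design, and PCCA are assumptions here, not PiT's actual block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAttnWithHF(nn.Module):
    """Window attention with a high-frequency residual branch (illustrative)."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):               # x: (B, C, H, W), H and W % window == 0
        B, C, H, W = x.shape
        w = self.window
        # Partition into non-overlapping windows -> (B * nWindows, w*w, C).
        t = x.unfold(2, w, w).unfold(3, w, w)                 # (B, C, nH, nW, w, w)
        t = t.permute(0, 2, 3, 4, 5, 1).reshape(-1, w * w, C)
        t, _ = self.attn(t, t, t)                             # attention per window
        nH, nW = H // w, W // w
        t = t.reshape(B, nH, nW, w, w, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        # High-frequency branch: feature minus its local average keeps edges and
        # couples neighboring windows through the overlapping blur kernel.
        hf = x - F.avg_pool2d(x, 3, stride=1, padding=1)
        return t + hf

y = WindowAttnWithHF(32)(torch.randn(1, 32, 16, 16))
```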
[247] LiteFat: Lightweight Spatio-Temporal Graph Learning for Real-Time Driver Fatigue Detection
Jing Ren, Suyu Ma, Hong Jia, Xiwei Xu, Ivan Lee, Haytham Fayek, Xiaodong Li, Feng Xia
Main category: cs.CV
TL;DR: LiteFat is a lightweight spatio-temporal graph learning model for efficient driver fatigue detection, reducing computational demands while maintaining accuracy.
Details
Motivation: Drowsy driving causes accidents, but existing deep learning solutions are too resource-intensive for embedded devices.
Method: Converts video to spatio-temporal graphs using facial landmarks, extracts features with MobileNet, and applies a lightweight graph neural network.
Result: Competitive accuracy with lower computational complexity and latency than state-of-the-art methods.
Conclusion: Enables real-time, resource-efficient fatigue detection for embedded devices like intelligent vehicles.
Abstract: Detecting driver fatigue is critical for road safety, as drowsy driving remains a leading cause of traffic accidents. Many existing solutions rely on computationally demanding deep learning models, which result in high latency and are unsuitable for embedded robotic devices with limited resources (such as intelligent vehicles/cars) where rapid detection is necessary to prevent accidents. This paper introduces LiteFat, a lightweight spatio-temporal graph learning model designed to detect driver fatigue efficiently while maintaining high accuracy and low computational demands. LiteFat involves converting streaming video data into spatio-temporal graphs (STG) using facial landmark detection, which focuses on key motion patterns and reduces unnecessary data processing. LiteFat uses MobileNet to extract facial features and create a feature matrix for the STG. A lightweight spatio-temporal graph neural network is then employed to identify signs of fatigue with minimal processing and low latency. Experimental results on benchmark datasets show that LiteFat performs competitively while significantly decreasing computational complexity and latency as compared to current state-of-the-art methods. This work enables the development of real-time, resource-efficient human fatigue detection systems that can be implemented upon embedded robotic devices.
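The STG construction can be sketched simply: one node per landmark per frame, spatial edges within a frame, and temporal edges linking each landmark to itself in the next frame. LiteFat's exact edge rules are not given, so the choices below are illustrative:

```python
import numpy as np

def build_stg(landmarks):
    """Build a simple spatio-temporal graph from facial landmarks.

    landmarks: (T, N, 2) array of N 2D landmarks over T frames.
    Returns node features (T*N, 2) and a binary adjacency (T*N, T*N).
    """
    T, N, _ = landmarks.shape
    feats = landmarks.reshape(T * N, 2)
    adj = np.zeros((T * N, T * N), dtype=np.float32)
    for t in range(T):
        base = t * N
        # Spatial edges: fully connect landmarks within a frame (a simple choice).
        adj[base:base + N, base:base + N] = 1.0
        # Temporal edges: link each landmark to itself in the next frame.
        if t + 1 < T:
            for n in range(N):
                adj[base + n, base + N + n] = adj[base + N + n, base + n] = 1.0
    np.fill_diagonal(adj, 1.0)  # self-loops, as commonly used in GNNs
    return feats, adj

feats, adj = build_stg(np.random.rand(5, 68, 2))  # 68 landmarks over 5 frames
```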
[248] Scaling Vision Mamba Across Resolutions via Fractal Traversal
Bo Li, Haoke Xiao, Lv Tang
Main category: cs.CV
TL;DR: FractalMamba++ improves Vision Mamba by using fractal-based patch serialization (Hilbert curves) and introduces CSR and PRC modules for better scalability and performance in vision tasks.
Details
Motivation: Vision Mamba's 2D-to-1D patch serialization and scalability issues limit its effectiveness for visual inputs.
Method: Proposes FractalMamba++ with Hilbert curves for patch serialization, CSR for global context, and PRC for local adjacency.
Result: Outperforms previous Mamba-based backbones, especially in high-resolution settings.
Conclusion: FractalMamba++ is a robust vision backbone with improved scalability and performance.
Abstract: Vision Mamba has recently emerged as a promising alternative to Transformer-based architectures, offering linear complexity in sequence length while maintaining strong modeling capacity. However, its adaptation to visual inputs is hindered by challenges in 2D-to-1D patch serialization and weak scalability across input resolutions. Existing serialization strategies such as raster scanning disrupt local spatial continuity and limit the model’s ability to generalize across scales. In this paper, we propose FractalMamba++, a robust vision backbone that leverages fractal-based patch serialization via Hilbert curves to preserve spatial locality and enable seamless resolution adaptability. To address long-range dependency fading in high-resolution inputs, we further introduce a Cross-State Routing (CSR) mechanism that enhances global context propagation through selective state reuse. Additionally, we propose a Positional-Relation Capture (PRC) module to recover local adjacency disrupted by curve inflection points. Extensive experiments across diverse downstream tasks, including image classification, semantic segmentation and object detection, demonstrate that FractalMamba++ consistently outperforms previous Mamba-based backbones, with particularly notable gains under high-resolution settings.
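The Hilbert-curve serialization at the heart of FractalMamba++ is easy to illustrate: mapping each patch coordinate to its index along the curve yields a 1D order that preserves 2D locality far better than raster scanning. Below is the classic bit-twiddling formulation (the CSR and PRC modules are not reproduced):

```python
def hilbert_index(n, x, y):
    """Map grid cell (x, y) on an n x n grid (n a power of two) to its index
    along the Hilbert curve."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Serialize a 16x16 patch grid: sorting by Hilbert index keeps spatially
# adjacent patches adjacent in the 1D token sequence.
order = sorted(((x, y) for x in range(16) for y in range(16)),
               key=lambda p: hilbert_index(16, *p))
```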
[249] CAS-IQA: Teaching Vision-Language Models for Synthetic Angiography Quality Assessment
Bo Wang, De-Xing Huang, Xiao-Hu Zhou, Mei-Jiang Gui, Nu-Fang Xiao, Jian-Long Hao, Ming-Yuan Liu, Zeng-Guang Hou
Main category: cs.CV
TL;DR: CAS-IQA is a vision-language model framework for fine-grained quality assessment of synthetic X-ray angiographies, outperforming existing methods by leveraging auxiliary images and task-specific metrics.
Details
Motivation: Low-quality synthetic angiographies increase procedural risks, and existing IQA methods lack auxiliary references and clinically relevant metrics.
Method: Proposes CAS-IQA, a VLM-based framework with a MUST module for feature fusion, and introduces the CAS-3K dataset with task-specific metrics.
Result: CAS-IQA significantly outperforms state-of-the-art IQA methods on the CAS-3K dataset.
Conclusion: CAS-IQA addresses limitations of existing IQA methods, providing clinically meaningful assessment for synthetic angiographies.
Abstract: Synthetic X-ray angiographies generated by modern generative models hold great potential to reduce the use of contrast agents in vascular interventional procedures. However, low-quality synthetic angiographies can significantly increase procedural risk, underscoring the need for reliable image quality assessment (IQA) methods. Existing IQA models, however, fail to leverage auxiliary images as references during evaluation and lack fine-grained, task-specific metrics necessary for clinical relevance. To address these limitations, this paper proposes CAS-IQA, a vision-language model (VLM)-based framework that predicts fine-grained quality scores by effectively incorporating auxiliary information from related images. In the absence of angiography datasets, CAS-3K is constructed, comprising 3,565 synthetic angiographies along with score annotations. To ensure clinically meaningful assessment, three task-specific evaluation metrics are defined. Furthermore, a Multi-path featUre fuSion and rouTing (MUST) module is designed to enhance image representations by adaptively fusing and routing visual tokens to metric-specific branches. Extensive experiments on the CAS-3K dataset demonstrate that CAS-IQA significantly outperforms state-of-the-art IQA methods by a considerable margin.
[250] MultiFormer: A Multi-Person Pose Estimation System Based on CSI and Attention Mechanism
Yanyi Qu, Haoyang Ma, Wenhui Xiong
Main category: cs.CV
TL;DR: MultiFormer is a wireless sensing system using CSI and Transformer-based feature extraction for accurate human pose estimation, outperforming state-of-the-art methods.
Details
Motivation: Addressing challenges in multi-person pose recognition and CSI feature learning for non-intrusive human activity monitoring.
Method: Uses a Transformer-based time-frequency dual-token feature extractor and Multi-Stage Feature Fusion Network (MSFN) to model CSI correlations and enforce anatomical constraints.
Result: Achieves higher accuracy, especially for high-mobility keypoints like wrists and elbows, on public and self-collected datasets.
Conclusion: MultiFormer demonstrates superior performance in human pose estimation using CSI, overcoming limitations of previous methods.
Abstract: Human pose estimation based on Channel State Information (CSI) has emerged as a promising approach for non-intrusive and precise human activity monitoring, yet faces challenges including accurate multi-person pose recognition and effective CSI feature learning. This paper presents MultiFormer, a wireless sensing system that accurately estimates human pose through CSI. The proposed system adopts a Transformer-based time-frequency dual-token feature extractor with multi-head self-attention. This feature extractor is able to model inter-subcarrier correlations and temporal dependencies of the CSI. The extracted CSI features and the pose probability heatmaps are then fused by Multi-Stage Feature Fusion Network (MSFN) to enforce the anatomical constraints. Extensive experiments conducted on the public MM-Fi dataset and our self-collected dataset show that the MultiFormer achieves higher accuracy than state-of-the-art approaches, especially for high-mobility keypoints (wrists, elbows) that are particularly difficult for previous methods to accurately estimate.
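The time-frequency dual-token idea can be sketched with a short-time Fourier transform: one token stream indexed by time frames, another by frequency bins. MultiFormer's actual tokenizer (per-subcarrier handling, learned projections) is more involved; this is an assumption-level illustration:

```python
import torch

def time_frequency_tokens(csi, n_fft=64, hop=16):
    """Turn a CSI stream into time-axis and frequency-axis token sequences.

    csi: (B, T) real-valued amplitude of one subcarrier over T samples.
    """
    spec = torch.stft(csi, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    mag = spec.abs()                     # (B, n_fft//2 + 1, frames)
    time_tokens = mag.transpose(1, 2)    # one token per time frame
    freq_tokens = mag                    # one token per frequency bin
    return time_tokens, freq_tokens

tt, ft = time_frequency_tokens(torch.randn(2, 1024))
```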
[251] Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning
Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, Qifeng Chen
Main category: cs.CV
TL;DR: The paper introduces Follow-Your-Motion, a two-stage video motion transfer framework using spatial-temporal decoupled LoRA to improve motion consistency and tuning efficiency in video diffusion transformers.
Details
Motivation: Current motion-transfer methods suffer from motion inconsistency and inefficiency in large video diffusion transformers due to spatial-temporal coupling in 3D attention operators.
Method: Proposes a spatial-temporal decoupled LoRA for decoupling attention, sparse motion sampling, and adaptive RoPE to accelerate tuning. Introduces MotionBench for benchmarking.
Result: Follow-Your-Motion outperforms existing methods, verified by extensive evaluations on MotionBench.
Conclusion: The framework addresses motion inconsistency and tuning inefficiency, offering a superior solution for video motion transfer.
Abstract: Recently, breakthroughs in the video diffusion transformer have shown remarkable capabilities in diverse motion generations. As for the motion-transfer task, current methods mainly use two-stage Low-Rank Adaptations (LoRAs) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos due to the inherent spatial-temporal coupling in the 3D attention operator. Additionally, they require time-consuming fine-tuning processes in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing. During the second training stage, we design the sparse motion sampling and adaptive RoPE to accelerate the tuning speed. To address the lack of a benchmark for this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion, including creative camera motion, single object motion, multiple object motion, and complex human motion. We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.
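A sketch of the spatial-temporal decoupled LoRA idea: a frozen base projection carries two independent low-rank deltas, one applied when processing spatial tokens and one for temporal tokens. Where these hook into the real model's 3D attention is not specified; the module below is illustrative:

```python
import torch
import torch.nn as nn

class DecoupledLoRALinear(nn.Module):
    """A frozen linear layer with separate spatial and temporal LoRA deltas."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # only the adapters are tuned
        d_in, d_out = base.in_features, base.out_features

        def lora():
            a = nn.Linear(d_in, rank, bias=False)
            b = nn.Linear(rank, d_out, bias=False)
            nn.init.zeros_(b.weight)         # start as a zero delta
            return nn.Sequential(a, b)

        self.spatial, self.temporal = lora(), lora()
        self.scale = alpha / rank

    def forward(self, x, mode="spatial"):
        delta = self.spatial if mode == "spatial" else self.temporal
        return self.base(x) + self.scale * delta(x)

layer = DecoupledLoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 10, 64), mode="temporal")
```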
[252] SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
Ziyang Gong, Wenhao Li, Oliver Ma, Songyuan Li, Jiayi Ji, Xue Yang, Gen Luo, Junchi Yan, Rongrong Ji
Main category: cs.CV
TL;DR: SpaCE-10 is a benchmark for evaluating multimodal large language models (MLLMs) on compositional spatial intelligence, revealing gaps in current models.
Details
Motivation: Existing benchmarks lack comprehensive evaluation of MLLMs' spatial intelligence from atomic to compositional levels.
Method: SpaCE-10 defines 10 atomic and 8 compositional spatial capabilities, using a hierarchical annotation pipeline to generate 5k QA pairs for 811 indoor scenes.
Result: Advanced MLLMs lag behind humans, with counting capability being a major limitation.
Conclusion: SpaCE-10 provides insights for improving MLLMs’ spatial intelligence, with data and code publicly available.
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs require integrating multiple atomic spatial capabilities to handle complex and dynamic tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With over 150 hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, which covers various evaluation settings like point cloud input and multi-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by large margins. Through our careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that the shortcoming of counting capability greatly limits the compositional spatial capabilities of existing MLLMs. The evaluation code and benchmark datasets are available at https://github.com/Cuzyoung/SpaCE-10.
[253] Yan: Foundational Interactive Video Generation
Deheng Ye, Fangyun Zhou, Jiacheng Lv, Jianqi Ma, Jun Zhang, Junyan Lv, Junyou Li, Minwen Deng, Mingyu Yang, Qiang Fu, Wei Yang, Wenkai Lv, Yangbin Yu, Yewen Wang, Yonghang Guan, Zhihao Hu, Zhongbin Fang, Zhongqian Sun
Main category: cs.CV
TL;DR: Yan is a framework for interactive video generation, integrating simulation, multi-modal generation, and multi-granularity editing to enable real-time, AI-driven interactive video creation.
Details
Motivation: To advance interactive video generation by combining simulation, generation, and editing into a unified framework, enhancing creativity and flexibility in AI-driven media production.
Method: Yan uses three modules: AAA-level simulation (3D-VAE with KV-cache denoising), multi-modal generation (hierarchical autoregressive captioning with VDMs), and multi-granularity editing (disentangling mechanics and rendering).
Result: Achieves real-time 1080P/60FPS simulation, flexible domain blending, and text-based editing during interaction.
Conclusion: Yan integrates these capabilities into a comprehensive AI-driven interactive creation paradigm, setting the stage for future creative tools and media.
Abstract: We present Yan, a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. Specifically, Yan comprises three core modules. AAA-level Simulation: We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process, achieving real-time 1080P/60FPS interactive simulation. Multi-Modal Generation: We introduce a hierarchical autoregressive caption method that injects game-specific knowledge into open-domain multi-modal video diffusion models (VDMs), then transforms the VDM into a frame-wise, action-controllable, real-time infinite interactive video generator. Notably, when the textual and visual prompts are sourced from different domains, the model demonstrates strong generalization, allowing it to blend and compose the style and mechanics across domains flexibly according to user prompts. Multi-Granularity Editing: We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text. Collectively, Yan offers an integration of these modules, pushing interactive video generation beyond isolated capabilities toward a comprehensive AI-driven interactive creation paradigm, paving the way for the next generation of creative tools, media, and entertainment. The project page is: https://greatx3.github.io/Yan/.
[254] 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation
Tianrui Lou, Xiaojun Jia, Siyuan Liang, Jiawei Liang, Ming Zhang, Yanjun Xiao, Xiaochun Cao
Main category: cs.CV
TL;DR: PGA is a physical attack framework using 3D Gaussian Splatting for robust adversarial camouflage, outperforming prior methods in cross-view robustness and effectiveness.
Details
Motivation: Existing camouflage-based physical attacks rely on unrealistic mesh priors and simulators, lacking robustness across viewpoints and real-world environments.
Method: PGA leverages 3D Gaussian Splatting for rapid, precise reconstruction and employs min-max optimization to enhance cross-view robustness by adjusting imaging backgrounds.
Result: Extensive experiments show PGA’s superior adversarial effectiveness and robustness compared to prior methods.
Conclusion: PGA advances physical adversarial attacks by addressing limitations of prior work, offering a more practical and robust solution.
Abstract: Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving. Camouflage-based physical attack is a more promising approach compared to the patch-based attack, offering stronger adversarial effectiveness in complex physical environments. However, most prior work relies on mesh priors of the target object and virtual environments constructed by simulators, which are time-consuming to obtain and inevitably differ from the real world. Moreover, due to the limitations of the backgrounds in training images, previous methods often fail to produce multi-view robust adversarial camouflage and tend to fall into sub-optimal solutions. Due to these reasons, prior work lacks adversarial effectiveness and robustness across diverse viewpoints and physical environments. We propose a physical attack framework based on 3D Gaussian Splatting (3DGS), named PGA, which provides rapid and precise reconstruction with few images, along with photo-realistic rendering capabilities. Our framework further enhances cross-view robustness and adversarial effectiveness by preventing mutual and self-occlusion among Gaussians and employing a min-max optimization approach that adjusts the imaging background of each viewpoint, helping the algorithm filter out non-robust adversarial features. Extensive experiments validate the effectiveness and superiority of PGA. Our code is available at: https://github.com/TRLou/PGA.
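The min-max background adjustment can be sketched as follows: at each step, evaluate the camouflage over candidate backgrounds, pick the one where the attack is currently weakest, and ascend the texture gradient there. `render_fn` and `detector_loss` are hypothetical stubs; PGA actually renders via 3D Gaussian Splatting:

```python
import torch

def min_max_camouflage_step(texture, render_fn, detector_loss, backgrounds, lr=0.01):
    """One illustrative min-max update for an adversarial camouflage texture."""
    # Min step: find the background where the attack is weakest (loss lowest).
    with torch.no_grad():
        losses = [detector_loss(render_fn(texture, bg)) for bg in backgrounds]
        hardest = backgrounds[int(torch.stack(losses).argmin())]
    # Max step: increase the detector's loss on that worst case.
    texture = texture.detach().requires_grad_(True)
    loss = detector_loss(render_fn(texture, hardest))
    loss.backward()
    return (texture + lr * texture.grad.sign()).clamp(0, 1).detach()

# Toy usage with stand-ins for the renderer and detector.
bgs = [torch.rand(3, 64, 64) for _ in range(4)]
tex = torch.rand(3, 32, 32)
render = lambda t, bg: 0.5 * bg + 0.5 * torch.nn.functional.pad(t, (16, 16, 16, 16))
d_loss = lambda img: img.mean()          # stand-in for a detection loss
tex = min_max_camouflage_step(tex, render, d_loss, bgs)
```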
[255] Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization
Jihwan Park, Taehoon song, Sanghyeok Lee, Miso Choi, Hyunwoo J. Kim
Main category: cs.CV
TL;DR: TransMiter is a lightweight adapter for vision-language models (VLMs) that transfers adaptation knowledge without backpropagation, improving efficiency and performance.
Details
Motivation: Existing adaptation transfer methods are model-specific and computationally expensive, limiting their applicability across different VLMs.
Method: TransMiter captures the knowledge gap between pre-trained and fine-tuned VLMs in an unsupervised manner and transfers this knowledge across models without backpropagation.
Result: TransMiter efficiently transfers adaptation knowledge, preserves generalization, and often outperforms fine-tuned models with minimal additional cost.
Conclusion: TransMiter offers a scalable and efficient solution for enhancing VLMs without the drawbacks of traditional fine-tuning.
Abstract: Vision-Language Models (VLMs) have been widely used in various visual recognition tasks due to their remarkable generalization capabilities. As these models grow in size and complexity, fine-tuning becomes costly, emphasizing the need to reuse adaptation knowledge from ‘weaker’ models to efficiently enhance ‘stronger’ ones. However, existing adaptation transfer methods exhibit limited transferability across models due to their model-specific design and high computational demands. To tackle this, we propose Transferable Model-agnostic adapter (TransMiter), a light-weight adapter that improves vision-language models ‘without backpropagation’. TransMiter captures the knowledge gap between pre-trained and fine-tuned VLMs, in an ‘unsupervised’ manner. Once trained, this knowledge can be seamlessly transferred across different models without the need for backpropagation. Moreover, TransMiter consists of only a few layers, inducing a negligible additional inference cost. Notably, supplementing the process with a few labeled data further yields additional performance gain, often surpassing a fine-tuned stronger model, with a marginal training cost. Experimental results and analyses demonstrate that TransMiter effectively and efficiently transfers adaptation knowledge while preserving generalization abilities across VLMs of different sizes and architectures in visual recognition tasks.
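The "knowledge gap" transfer can be illustrated in its simplest backpropagation-free form: fit, in closed form, a map from pre-trained features to fine-tuned features, then apply it to another model's features. TransMiter's adapter is a small learned network rather than this linear stand-in:

```python
import torch

def fit_adaptation_map(feats_pre, feats_ft):
    """Fit a linear map W minimizing ||feats_pre @ W - feats_ft|| in closed
    form (least squares); no gradients through any large model are needed.

    feats_pre, feats_ft: (N, D) cached features for the same inputs.
    """
    return torch.linalg.lstsq(feats_pre, feats_ft).solution   # (D, D)

# Capture the pre-trained -> fine-tuned gap once, then apply the map to a
# different (stronger) model's features without backpropagation.
pre, ft = torch.randn(256, 64), torch.randn(256, 64)
W = fit_adaptation_map(pre, ft)
adapted = torch.randn(8, 64) @ W
```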
[256] Calibrated Self-supervised Vision Transformers Improve Intracranial Arterial Calcification Segmentation from Clinical CT Head Scans
Benjamin Jin, Grant Mair, Joanna M. Wardlaw, Maria del C. Valdés Hernández
Main category: cs.CV
TL;DR: The paper explores using Vision Transformers (ViTs) pre-trained with masked autoencoder (MAE) for 3D medical image segmentation, specifically for intracranial arterial calcification (IAC), outperforming supervised baselines and improving clinical risk assessment.
Details
Motivation: ViTs are efficient for large medical imaging volumes due to self-supervised training with MAE, avoiding costly manual annotations. IAC, linked to neurovascular diseases, lacks automated quantification methods.
Method: Pre-train ViTs using MAE, then fine-tune for IAC segmentation on heterogeneous data from the IST-3 trial. Evaluate key aspects like patch sizes and upsampling methods.
Result: Self-supervised ViT outperforms nnU-Net by 3.2 Dice points, shows robustness to higher slice thicknesses, and improves risk classification by 46%.
Conclusion: MAE pre-trained ViTs are effective for IAC segmentation, offering clinical benefits and robustness, with low patch sizes and interpolation upsampling being optimal.
Abstract: Vision Transformers (ViTs) have gained significant popularity in the natural image domain but have been less successful in 3D medical image segmentation. Nevertheless, 3D ViTs are particularly interesting for large medical imaging volumes due to their efficient self-supervised training within the masked autoencoder (MAE) framework, which enables the use of imaging data without the need for expensive manual annotations. Intracranial arterial calcification (IAC) is an imaging biomarker visible on routinely acquired CT scans linked to neurovascular diseases such as stroke and dementia, and automated IAC quantification could enable their large-scale risk assessment. We pre-train ViTs with MAE and fine-tune them for IAC segmentation for the first time. To develop our models, we use highly heterogeneous data from a large clinical trial, the third International Stroke Trial (IST-3). We evaluate key aspects of MAE pre-trained ViTs in IAC segmentation, and analyse the clinical implications. We show: 1) our calibrated self-supervised ViT beats a strong supervised nnU-Net baseline by 3.2 Dice points, 2) low patch sizes are crucial for ViTs for IAC segmentation and interpolation upsampling with regular convolutions is preferable to transposed convolutions for ViT-based models, and 3) our ViTs increase robustness to higher slice thicknesses and improve risk group classification in a clinical scenario by 46%. Our code is available online.
[257] When Deepfakes Look Real: Detecting AI-Generated Faces with Unlabeled Data due to Annotation Challenges
Zhiqiang Yang, Renshuai Tao, Xiaolong Zheng, Guodong Yang, Chunjie Zhang
Main category: cs.CV
TL;DR: DPGNet addresses deepfake detection challenges by leveraging unlabeled data and bridging domain gaps, outperforming SoTA by 6.3%.
Details
Motivation: Human annotators struggle with labeling realistic deepfakes, creating a need for methods that utilize unlabeled data effectively.
Method: DPGNet uses text-guided cross-domain alignment and curriculum-driven pseudo label generation, with cross-domain knowledge distillation to prevent forgetting.
Result: DPGNet outperforms state-of-the-art methods by 6.3% across 11 datasets.
Conclusion: DPGNet effectively leverages unlabeled data to tackle deepfake detection challenges, demonstrating superior performance.
Abstract: Existing deepfake detection methods heavily depend on labeled training data. However, as AI-generated content becomes increasingly realistic, even human annotators struggle to distinguish between deepfakes and authentic images. This makes the labeling process both time-consuming and less reliable. Specifically, there is a growing demand for approaches that can effectively utilize large-scale unlabeled data from online social networks. Unlike typical unsupervised learning tasks, where categories are distinct, AI-generated faces closely mimic real image distributions and share strong similarities, causing a performance drop in conventional strategies. In this paper, we introduce the Dual-Path Guidance Network (DPGNet) to tackle two key challenges: (1) bridging the domain gap between faces from different generation models, and (2) utilizing unlabeled image samples. The method features two core modules: text-guided cross-domain alignment, which uses learnable prompts to unify visual and textual embeddings into a domain-invariant feature space, and curriculum-driven pseudo label generation, which dynamically exploits more informative unlabeled samples. To prevent catastrophic forgetting, we also facilitate bridging between domains via cross-domain knowledge distillation. Extensive experiments on 11 popular datasets show that DPGNet outperforms SoTA approaches by 6.3%, highlighting its effectiveness in leveraging unlabeled data to address the annotation challenges posed by the increasing realism of deepfakes.
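Curriculum-driven pseudo-label generation can be sketched as a confidence threshold that relaxes over training, so easy unlabeled samples enter first and harder ones later. DPGNet's actual selection criterion is not given; the schedule below is illustrative:

```python
import torch

def curriculum_pseudo_labels(logits, epoch, t_start=0.95, t_end=0.7, epochs=20):
    """Select pseudo-labeled samples with a threshold that relaxes over epochs.

    logits: (N, 2) real/fake logits for unlabeled images.
    Returns pseudo-labels for the kept samples and the selection mask.
    """
    tau = t_start + (t_end - t_start) * min(epoch / max(epochs - 1, 1), 1.0)
    probs = logits.softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= tau                 # strict early, permissive later
    return labels[keep], keep

labels, mask = curriculum_pseudo_labels(torch.randn(100, 2), epoch=5)
```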
[258] Modulate and Reconstruct: Learning Hyperspectral Imaging from Misaligned Smartphone Views
Daniil Reutsky, Daniil Vladimirov, Yasin Mamedov, Georgy Perevozchikov, Nancy Mehta, Egor Ershov, Radu Timofte
Main category: cs.CV
TL;DR: A novel multi-image-to-hyperspectral reconstruction (MI-HSR) framework using a triple-camera smartphone system improves accuracy by 30% over single-image methods.
Details
Motivation: Existing hyperspectral reconstruction (HSR) methods rely on single RGB images, leading to limited accuracy due to spectral information loss.
Method: Proposes MI-HSR with a triple-camera smartphone setup in which two cameras have spectral filters, and introduces the Doomer dataset for validation.
Result: Achieves 30% more accurate spectral estimation compared to conventional RGB cameras.
Conclusion: Multi-view spectral filtering with commodity hardware enables more accurate and practical hyperspectral imaging.
Abstract: Hyperspectral reconstruction (HSR) from RGB images is a fundamentally ill-posed problem due to severe spectral information loss. Existing approaches typically rely on a single RGB image, limiting reconstruction accuracy. In this work, we propose a novel multi-image-to-hyperspectral reconstruction (MI-HSR) framework that leverages a triple-camera smartphone system, where two lenses are equipped with carefully selected spectral filters. Our configuration, grounded in theoretical and empirical analysis, enables richer and more diverse spectral observations than conventional single-camera setups. To support this new paradigm, we introduce Doomer, the first dataset for MI-HSR, comprising aligned images from three smartphone cameras and a hyperspectral reference camera across diverse scenes. We show that the proposed HSR model achieves consistent improvements over existing methods on the newly proposed benchmark. In a nutshell, our setup yields spectra estimated roughly 30% more accurately than with an ordinary RGB camera. Our findings suggest that multi-view spectral filtering with commodity hardware can unlock more accurate and practical hyperspectral imaging solutions.
[259] Learning Adaptive Node Selection with External Attention for Human Interaction Recognition
Chen Pang, Xuequan Lu, Qianyu Zhou, Lei Lyu
Main category: cs.CV
TL;DR: ASEA dynamically models interactions without predefined assumptions, using GCN for intra-personal relationships and AT-NAC for adaptive node selection, achieving state-of-the-art results.
Details
Motivation: Existing GCN-based methods neglect inter-dependencies between individuals, and predefined interaction matrices fail to adapt to dynamic contexts.
Method: Uses GCN for individual modeling, AT-NAC for adaptive node selection, and External Attention (EA) for interaction dynamics.
Result: Achieves state-of-the-art performance by effectively capturing interaction relationships.
Conclusion: ASEA offers a flexible and effective solution for modeling dynamic interactions without predefined constraints.
Abstract: Most GCN-based methods model interacting individuals as independent graphs, neglecting their inherent inter-dependencies. Although recent approaches utilize predefined interaction adjacency matrices to integrate participants, these matrices fail to adaptively capture the dynamic and context-specific joint interactions across different actions. In this paper, we propose the Active Node Selection with External Attention Network (ASEA), an innovative approach that dynamically captures interaction relationships without predefined assumptions. Our method models each participant individually using a GCN to capture intra-personal relationships, facilitating a detailed representation of their actions. To identify the most relevant nodes for interaction modeling, we introduce the Adaptive Temporal Node Amplitude Calculation (AT-NAC) module, which estimates global node activity by combining spatial motion magnitude with adaptive temporal weighting, thereby highlighting salient motion patterns while reducing irrelevant or redundant information. A learnable threshold, regularized to prevent extreme variations, is defined to selectively identify the most informative nodes for interaction modeling. To capture interactions, we design the External Attention (EA) module to operate on active nodes, effectively modeling the interaction dynamics and semantic relationships between individuals. Extensive evaluations show that our method captures interaction relationships more effectively and flexibly, achieving state-of-the-art performance.
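A sketch of AT-NAC as summarized: per-node activity combines frame-to-frame motion magnitude with learned temporal weights, and a learnable threshold selects the active nodes. The specific parameterization and regularization are assumptions:

```python
import torch
import torch.nn as nn

class ATNACSketch(nn.Module):
    """Node activity from motion magnitude with adaptive temporal weighting."""
    def __init__(self, T):
        super().__init__()
        self.temporal_w = nn.Parameter(torch.ones(T) / T)   # learned per-frame weights
        self.threshold = nn.Parameter(torch.tensor(0.5))    # learnable cut-off

    def forward(self, joints):          # joints: (B, T, N, 3) coordinates
        # Spatial motion magnitude between consecutive frames.
        motion = (joints[:, 1:] - joints[:, :-1]).norm(dim=-1)   # (B, T-1, N)
        motion = torch.cat([motion, motion[:, -1:]], dim=1)      # pad back to T
        w = self.temporal_w.softmax(dim=0).view(1, -1, 1)
        activity = (w * motion).sum(dim=1)                       # (B, N)
        active = activity > self.threshold                       # node mask
        return activity, active

act, mask = ATNACSketch(T=16)(torch.randn(2, 16, 25, 3))
```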
[260] Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations
Yiwen Liang, Hui Chen, Yizhe Xiong, Zihan Zhou, Mengyao Lyu, Zijia Lin, Shuaicheng Niu, Sicheng Zhao, Jungong Han, Guiguang Ding
Main category: cs.CV
TL;DR: The paper proposes ReTA, a reliable test-time adaptation method for vision-language models, addressing entropy unreliability and inflexible decision boundaries with two strategies: CER and DDC.
Details
Motivation: Vision-language models struggle with distribution shifts in downstream tasks without labeled data, motivating the need for reliable test-time adaptation.
Method: ReTA integrates Consistency-aware Entropy Reweighting (CER) for cache quality and Diversity-driven Distribution Calibration (DDC) for adaptive decision boundaries.
Result: ReTA outperforms state-of-the-art methods, especially under real-world distribution shifts.
Conclusion: ReTA enhances reliability in test-time adaptation for VLMs, addressing key challenges in cache-based methods.
Abstract: Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable, which has motivated the development of Test-Time Adaptation (TTA) to improve VLMs’ performance during inference without annotations. Among various TTA approaches, cache-based methods show promise by preserving historical knowledge from low-entropy samples in a dynamic cache and fostering efficient adaptation. However, these methods face two critical reliability challenges: (1) entropy often becomes unreliable under distribution shifts, causing error accumulation in the cache and degradation in adaptation performance; (2) the final predictions may be unreliable due to inflexible decision boundaries that fail to accommodate large downstream shifts. To address these challenges, we propose a Reliable Test-time Adaptation (ReTA) method that integrates two complementary strategies to enhance reliability from two perspectives. First, to mitigate the unreliability of entropy as a sample selection criterion for cache construction, we introduce Consistency-aware Entropy Reweighting (CER), which incorporates consistency constraints to weight entropy during cache updating. While conventional approaches rely solely on low entropy for cache prioritization and risk introducing noise, our method leverages predictive consistency to maintain a high-quality cache and facilitate more robust adaptation. Second, we present Diversity-driven Distribution Calibration (DDC), which models class-wise text embeddings as multivariate Gaussian distributions, enabling adaptive decision boundaries for more accurate predictions across visually diverse content. Extensive experiments demonstrate that ReTA consistently outperforms state-of-the-art methods, particularly under real-world distribution shifts. Code: https://github.com/Evelyn1ywliang/ReTA.
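The CER idea can be sketched directly: a sample's cache weight combines low entropy with agreement across augmented views, so confidently wrong singletons are down-weighted. ReTA's exact weighting is not specified; this is one plausible form:

```python
import torch

def consistency_aware_weight(logits_views):
    """Cache priority combining low entropy with cross-view consistency.

    logits_views: (V, B, C) predictions for V augmented views of B samples.
    """
    probs = logits_views.softmax(dim=-1)
    mean_p = probs.mean(dim=0)                                   # (B, C)
    entropy = -(mean_p * mean_p.clamp_min(1e-8).log()).sum(-1)   # (B,)
    # Consistency: fraction of views voting for the consensus class.
    votes = probs.argmax(dim=-1)                                 # (V, B)
    consensus = mean_p.argmax(dim=-1)                            # (B,)
    consistency = (votes == consensus).float().mean(dim=0)       # (B,)
    return consistency * torch.exp(-entropy)                     # cache weight

w = consistency_aware_weight(torch.randn(8, 4, 10))
```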
[261] HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation
Weihuang Lin, Yiwei Ma, Xiaoshuai Sun, Shuting He, Jiayi Ji, Liujuan Cao, Rongrong Ji
Main category: cs.CV
TL;DR: HRSeg improves reasoning segmentation with high-resolution fine-grained perception via HRP and HRE modules, outperforming existing methods.
Details
Motivation: Existing approaches suffer from low perceptual resolution due to pre-trained visual encoders at lower resolutions, and interpolation methods are inefficient.
Method: HRSeg introduces High-Resolution Perception (HRP) for multi-granularity feature integration and High-Resolution Enhancement (HRE) for fine-grained mask-text alignment.
Result: HRSeg achieves superior performance on benchmark datasets, validated by ablation studies.
Conclusion: HRSeg efficiently addresses resolution limitations and enhances segmentation precision.
Abstract: The reasoning segmentation task involves segmenting objects within an image by interpreting implicit user instructions, which may encompass subtleties such as contextual cues and open-world knowledge. Despite significant advancements made by existing approaches, they remain constrained by low perceptual resolution, as visual encoders are typically pre-trained at lower resolutions. Furthermore, simply interpolating the positional embeddings of visual encoders to enhance perceptual resolution yields only marginal performance improvements while incurring substantial computational costs. To address this, we propose HRSeg, an efficient model with high-resolution fine-grained perception. It features two key innovations: High-Resolution Perception (HRP) and High-Resolution Enhancement (HRE). The HRP module processes high-resolution images through cropping, integrating local and global features for multi-granularity quality. The HRE module enhances mask features by integrating fine-grained information from high-resolution images, refining their alignment with text features for precise segmentation. Extensive ablation studies validate the effectiveness of our modules, while comprehensive experiments on multiple benchmark datasets demonstrate HRSeg’s superior performance.
[262] See the Forest and the Trees: A Synergistic Reasoning Framework for Knowledge-Based Visual Question Answering
Junjie Wang, Yunhan Tang, Yijie Wang, Zhihao Yuan, Huan Wang, Yangfan He, Bin Li
Main category: cs.CV
TL;DR: Synergos-VQA introduces a synergistic reasoning framework for KBVQA, combining holistic, structural, and causal evidence to outperform existing MLLMs.
Details
Motivation: Current MLLMs rely on uni-dimensional evidence, limiting robust understanding. Synergos-VQA aims to address this by integrating multi-faceted evidence.
Method: Generates and fuses three evidence streams: Holistic (scene perception), Structural (key objects), and Causal (counterfactual grounding).
Result: Achieves state-of-the-art on benchmarks like OK-VQA and A-OKVQA, and enhances open-source MLLMs.
Conclusion: Superior methodological design, not just model scale, drives better performance in KBVQA.
Abstract: Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is fundamentally bottlenecked by a reliance on uni-dimensional evidence. This “seeing only the trees, but not the forest” approach prevents robust, multi-faceted understanding. Inspired by the principle of seeing both the forest and trees, we propose Synergos-VQA, a novel synergistic reasoning framework. At its core, Synergos-VQA concurrently generates and fuses three complementary evidence streams at inference time: (1) Holistic Evidence to perceive the entire scene (the “forest”), (2) Structural Evidence from a prototype-driven module to identify key objects (the “trees”), and (3) Causal Evidence from a counterfactual probe to ensure the reasoning is robustly grounded. By synergistically fusing this multi-faceted evidence, our framework achieves a more comprehensive and reliable reasoning process. Extensive experiments show that Synergos-VQA decisively establishes a new state-of-the-art on three challenging benchmarks, including OK-VQA and A-OKVQA. Furthermore, our approach demonstrates strong plug-and-play capabilities, significantly boosting various open-source MLLMs and proving that superior methodological design can outperform sheer model scale.
[263] HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels
HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, Yihang Lian, Yulin Tsai, Lifu Wang, Sicong Liu, Puhua Jiang, Xianghui Yang, Dongyuan Guo, Yixuan Tang, Xinyue Mao, Jiaao Yu, Junlin Yu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Chao Zhang, Yonghao Tan, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Minghui Chen, Zhan Li, Wangchen Qin, Lei Wang, Yifu Sun, Lin Niu, Xiang Yuan, Xiaofeng Yang, Yingping He, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Tian Liu, Peng Chen, Di Wang, Yuhong Liu, Linus, Jie Jiang, Tengfei Wang, Chunchao Guo
Main category: cs.CV
TL;DR: HunyuanWorld 1.0 is a novel framework for generating immersive, explorable, and interactive 3D scenes from text or images, addressing limitations of existing methods by combining video-based diversity with 3D consistency.
Details
Motivation: Existing methods for 3D world generation either lack 3D consistency (video-based) or struggle with data and memory inefficiency (3D-based). HunyuanWorld 1.0 aims to bridge this gap.
Method: The framework uses a semantically layered 3D mesh representation with panoramic world proxies for semantic-aware decomposition and reconstruction, enabling 360° immersive experiences, mesh exports, and disentangled object representations.
Result: The method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds, with applications in VR, simulation, gaming, and content creation.
Conclusion: HunyuanWorld 1.0 successfully combines the strengths of video and 3D methods, offering a versatile solution for immersive 3D world generation.
Abstract: Creating immersive and playable 3D worlds from texts or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.
[264] On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations
Jordan Vice, Naveed Akhtar, Yansong Gao, Richard Hartley, Ajmal Mian
Main category: cs.CV
TL;DR: VLMs are vulnerable to subtle frequency-domain perturbations, affecting tasks like DeepFake detection and captioning. This fragility challenges their reliability.
Details
Motivation: To expose VLM vulnerabilities to structured frequency-domain perturbations that undermine perceptual tasks such as captioning and authenticity detection.
Method: Design targeted frequency-domain image transformations to perturb VLM outputs, testing across five state-of-the-art VLMs and ten datasets.
Result: VLMs are sensitive to frequency cues, with outputs not fully aligned with semantic content, revealing fragility in captioning and authenticity detection.
Conclusion: The findings highlight the unreliability of VLMs under perturbations, emphasizing the need for more robust multimodal systems.
Abstract: Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including through captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs when exposed to subtle, structured perturbations in the frequency domain. Specifically, we highlight how these feature transformations undermine authenticity/DeepFake detection and automated image captioning tasks. We design targeted image transformations, operating in the frequency domain to systematically adjust VLM outputs when exposed to frequency-perturbed real and synthetic images. We demonstrate that the perturbation injection method generalizes across five state-of-the-art VLMs which includes different-parameter Qwen2/2.5 and BLIP models. Experimenting across ten real and generated image datasets reveals that VLM judgments are sensitive to frequency-based cues and may not wholly align with semantic content. Crucially, we show that visually-imperceptible spatial frequency transformations expose the fragility of VLMs deployed for automated image captioning and authenticity detection tasks. Our findings under realistic, black-box constraints challenge the reliability of VLMs, underscoring the need for robust multimodal perception systems.
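A structured frequency-domain perturbation of the kind probed here is easy to construct: transform the image with a 2D FFT, rescale an annular band of spatial frequencies, and invert. The paper's specific transformations are not reproduced; this shows the generic mechanism:

```python
import numpy as np

def band_perturb(img, lo=0.25, hi=0.5, gain=1.5):
    """Amplify an annular band of spatial frequencies in a grayscale image.

    img: (H, W) array in [0, 1]; lo/hi are fractions of the Nyquist radius.
    """
    F = np.fft.fftshift(np.fft.fft2(img))
    H, W = img.shape
    yy, xx = np.ogrid[:H, :W]
    r = np.hypot(yy - H / 2, xx - W / 2) / (min(H, W) / 2)  # normalized radius
    band = (r >= lo) & (r < hi)
    F[band] *= gain                          # subtle, structured manipulation
    out = np.fft.ifft2(np.fft.ifftshift(F)).real
    return np.clip(out, 0.0, 1.0)

perturbed = band_perturb(np.random.rand(128, 128))
```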
[265] MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention
Qi Xie, Yongjia Ma, Donglin Di, Xuehao Gao, Xun Yang
Main category: cs.CV
TL;DR: MoCA, a Video Diffusion Model with Mixture of Cross-Attention, improves ID-preserving text-to-video generation by enhancing identity consistency and fine-grained details.
Details
Motivation: Existing T2V methods struggle with fine-grained facial dynamics and temporal identity coherence.
Method: MoCA uses a Diffusion Transformer backbone with Mixture of Cross-Attention, Hierarchical Temporal Pooling, and Temporal-Aware Cross-Attention Experts, plus a Latent Video Perceptual Loss.
Result: MoCA outperforms existing methods by over 5% in Face similarity on the CelebIPVid dataset.
Conclusion: MoCA effectively addresses ID-preserving T2V challenges, demonstrating superior performance in identity consistency and detail retention.
Abstract: Achieving ID-preserving text-to-video (T2V) generation remains challenging despite recent advances in diffusion-based models. Existing approaches often fail to capture fine-grained facial dynamics or maintain temporal identity coherence. To address these limitations, we propose MoCA, a novel Video Diffusion Model built on a Diffusion Transformer (DiT) backbone, incorporating a Mixture of Cross-Attention mechanism inspired by the Mixture-of-Experts paradigm. Our framework improves inter-frame identity consistency by embedding MoCA layers into each DiT block, where Hierarchical Temporal Pooling captures identity features over varying timescales, and Temporal-Aware Cross-Attention Experts dynamically model spatiotemporal relationships. We further incorporate a Latent Video Perceptual Loss to enhance identity coherence and fine-grained details across video frames. To train this model, we collect CelebIPVid, a dataset of 10,000 high-resolution videos from 1,000 diverse individuals, promoting cross-ethnicity generalization. Extensive experiments on CelebIPVid show that MoCA outperforms existing T2V methods by over 5% in Face similarity.
[266] BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment
Tongfan Guan, Jiaxin Guo, Chen Wang, Yun-Hui Liu
Main category: cs.CV
TL;DR: A unified framework bridges monocular and stereo depth estimation by iteratively aligning their latent representations, improving accuracy and robustness.
Details
Motivation: Monocular and stereo depth estimation have complementary strengths and weaknesses, but current methods remain disjoint. The goal is to unify them for better performance.
Method: The framework uses iterative bidirectional alignment and a cross-attentive mechanism to synchronize monocular contextual cues with stereo representations.
Result: Achieves state-of-the-art results, reducing zero-shot generalization error by >40% on Middlebury and ETH3D, and improves performance on challenging surfaces.
Conclusion: The approach harmonizes multi-view geometry with monocular context, enabling robust 3D perception beyond modality-specific limitations.
Abstract: Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: it reduces zero-shot generalization error by more than 40% on Middlebury and ETH3D, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, our approach enables robust 3D perception that transcends modality-specific limitations. Code is available at https://github.com/aeolusguan/BridgeDepth.
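One round of the bidirectional latent alignment can be sketched with two cross-attention blocks: stereo tokens query monocular context, monocular tokens query stereo geometry, each applied residually. BridgeDepth's iterative update and token definitions are not reproduced:

```python
import torch
import torch.nn as nn

class MonoStereoCrossAlign(nn.Module):
    """One bidirectional alignment round between monocular and stereo tokens."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.s_from_m = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.m_from_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mono, stereo):     # each (B, N, D) token sequences
        stereo2, _ = self.s_from_m(stereo, mono, mono)   # inject monocular priors
        mono2, _ = self.m_from_s(mono, stereo, stereo)   # refine with geometry
        return mono + mono2, stereo + stereo2            # residual updates

m, s = MonoStereoCrossAlign(64)(torch.randn(2, 100, 64), torch.randn(2, 100, 64))
```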
[267] BigTokDetect: A Clinically-Informed Vision-Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok
Minh Duc Chu, Kshitij Pawar, Zihao He, Roxanna Sharifi, Ross Sonnenblick, Magdalayna Curry, Laura D’Adamo, Lindsay Young, Stuart B Murray, Kristina Lerman
Main category: cs.CV
TL;DR: BigTokDetect is a framework for detecting pro-bigorexia content on TikTok using multimodal analysis, achieving high accuracy with expert-annotated data.
Details
Motivation: Address the challenge of detecting harmful pro-bigorexia content disguised as fitness material, which evades traditional text-based detection.
Method: Develop BigTokDetect using a clinically-annotated dataset (BigTok) and evaluate state-of-the-art vision-language models for multimodal fusion.
Result: Achieved 82.9% accuracy in primary category classification and 69.0% in subcategory detection, with multimodal fusion improving performance by 5-10%.
Conclusion: Establishes benchmarks for multimodal harmful content detection and provides scalable tools for mental health content moderation.
Abstract: Social media platforms increasingly struggle to detect harmful content that promotes muscle dysmorphic behaviors, particularly pro-bigorexia content that disproportionately affects adolescent males. Unlike traditional eating disorder detection focused on the “thin ideal,” pro-bigorexia material masquerades as legitimate fitness content through complex multimodal combinations of visual displays, coded language, and motivational messaging that evade text-based detection systems. We address this challenge by developing BigTokDetect, a clinically-informed detection framework for identifying pro-bigorexia content on TikTok. We introduce BigTok, the first expert-annotated multimodal dataset of over 2,200 TikTok videos labeled by clinical psychologists and psychiatrists across five primary categories spanning body image, nutrition, exercise, supplements, and masculinity. Through a comprehensive evaluation of state-of-the-art vision language models, we achieve 82.9% accuracy on primary category classification and 69.0% on subcategory detection via domain-specific finetuning. Our ablation studies demonstrate that multimodal fusion improves performance by 5-10% over text-only approaches, with video features providing the most discriminative signals. These findings establish new benchmarks for multimodal harmful content detection and provide both the computational tools and methodological framework needed for scalable content moderation in specialized mental health domains.
[268] Grounding Emotion Recognition with Visual Prototypes: VEGA – Revisiting CLIP in MERC
Guanyu Hu, Dimitrios Kollias, Xinyu Yang
Main category: cs.CV
TL;DR: The paper proposes VEGA, a novel mechanism using CLIP’s image encoder to create visual emotion anchors for better multimodal emotion recognition in conversations.
Details
Motivation: Existing models lack psychologically meaningful priors for multimodal alignment, despite advanced fusion strategies.
Method: VEGA leverages CLIP’s image encoder to construct emotion-specific visual anchors, guided by cognitive theories, and integrates them into a dual-branch architecture with self-distillation.
Result: The model achieves state-of-the-art performance on IEMOCAP and MELD datasets.
Conclusion: VEGA effectively improves multimodal emotion recognition by aligning features with perceptually grounded and psychologically meaningful representations.
Abstract: Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP’s textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion categories and multisensory integration). A stochastic anchor sampling strategy further enhances robustness by balancing semantic stability and intra-class diversity. Integrated into a dual-branch architecture with self-distillation, our VEGA-augmented model achieves state-of-the-art performance on IEMOCAP and MELD. Code is available at: https://github.com/dkollias/VEGA.
[269] A Simple yet Powerful Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation
Chao Yin, Jide Li, Xiaoqiang Li
Main category: cs.CV
TL;DR: The paper introduces IAPF, a training-free COS framework that converts generic prompts into fine-grained instance masks, outperforming existing methods.
Details
Motivation: Current training-free COS methods using SAM produce coarse semantic masks, failing for multiple camouflaged instances.
Method: IAPF uses MLLM for image-specific tags, Grounding DINO for instance prompts, and SAM for masks, with a voting mechanism for consistency.
Result: IAPF significantly outperforms state-of-the-art training-free COS methods on benchmarks.
Conclusion: IAPF effectively addresses the limitations of coarse semantic masks in COS, providing fine-grained instance-level segmentation.
Abstract: Camouflaged Object Segmentation (COS) remains highly challenging due to the intrinsic visual similarity between target objects and their surroundings. While training-based COS methods achieve good performance, their performance degrades rapidly with increased annotation sparsity. To circumvent this limitation, recent studies have explored training-free COS methods, leveraging the Segment Anything Model (SAM) by automatically generating visual prompts from a single task-generic prompt (e.g., “camouflaged animal”) uniformly applied across all test images. However, these methods typically produce only semantic-level visual prompts, causing SAM to output coarse semantic masks and thus failing to handle scenarios with multiple discrete camouflaged instances effectively. To address this critical limitation, we propose a simple yet powerful Instance-Aware Prompting Framework (IAPF), the first training-free COS pipeline that explicitly converts a task-generic prompt into fine-grained instance masks. Specifically, the IAPF comprises three steps: (1) Text Prompt Generator, utilizing task-generic queries to prompt a Multimodal Large Language Model (MLLM) for generating image-specific foreground and background tags; (2) Instance Mask Generator, leveraging Grounding DINO to produce precise instance-level bounding box prompts, alongside the proposed Single-Foreground Multi-Background Prompting strategy to sample region-constrained point prompts within each box, enabling SAM to yield a candidate instance mask; (3) Self-consistency Instance Mask Voting, which selects the final COS prediction by identifying the candidate mask most consistent across multiple candidate instance masks. Extensive evaluations on standard COS benchmarks demonstrate that the proposed IAPF significantly surpasses existing state-of-the-art training-free COS methods.
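The final voting step is the most self-contained part of this pipeline and can be sketched directly. A minimal sketch follows, assuming candidate masks arrive as boolean arrays: the winner is the candidate with the highest mean IoU against all others. The upstream stages (MLLM tagging, Grounding DINO boxes, SAM point prompts) are deliberately omitted rather than reproducing their APIs.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def self_consistency_vote(candidates: list) -> np.ndarray:
    """Return the candidate mask with the highest mean IoU to all others."""
    scores = []
    for i, m in enumerate(candidates):
        others = [mask_iou(m, n) for j, n in enumerate(candidates) if j != i]
        scores.append(np.mean(others) if others else 1.0)
    return candidates[int(np.argmax(scores))]

# Toy example with three 4x4 candidate masks; two agree, one is an outlier.
rng = np.random.default_rng(0)
base = rng.random((4, 4)) > 0.5
winner = self_consistency_vote([base, base.copy(), ~base])
assert (winner == base).all()
```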
[270] Multi-view Normal and Distance Guidance Gaussian Splatting for Surface Reconstruction
Bo Jia, Yanan Guo, Ying Chang, Benkui Zhang, Ying Xie, Kangning Du, Lin Cao
Main category: cs.CV
TL;DR: The paper introduces a multi-view normal and distance-guided Gaussian splatting method to address biases in 3DGS, improving geometric depth unification and reconstruction accuracy.
Details
Motivation: Biases in 3D Gaussian Splatting (3DGS) arise when Gaussian normal vectors align within single-view projection planes, causing inconsistencies in nearby views.
Method: The proposed method includes a multi-view distance reprojection regularization module for Gaussian alignment and a multi-view normal enhancement module for consistency across views.
Result: The method outperforms baselines in quantitative and qualitative evaluations, enhancing 3DGS’s surface reconstruction capability.
Conclusion: The approach effectively addresses multi-view challenges, achieving high-accuracy reconstruction and geometric depth unification.
Abstract: 3D Gaussian Splatting (3DGS) achieves remarkable results in the field of surface reconstruction. However, when Gaussian normal vectors are aligned within the single-view projection plane, while the geometry appears reasonable in the current view, biases may emerge upon switching to nearby views. To address the distance and global matching challenges in multi-view scenes, we design multi-view normal and distance-guided Gaussian splatting. This method achieves geometric depth unification and high-accuracy reconstruction by constraining nearby depth maps and aligning 3D normals. Specifically, for the reconstruction of small indoor and outdoor scenes, we propose a multi-view distance reprojection regularization module that achieves multi-view Gaussian alignment by computing the distance loss between two nearby views and the same Gaussian surface. Additionally, we develop a multi-view normal enhancement module, which ensures consistency across views by matching the normals of pixel points in nearby views and calculating the loss. Extensive experimental results demonstrate that our method outperforms the baseline in both quantitative and qualitative evaluations, significantly enhancing the surface reconstruction capability of 3DGS. Our code will be made publicly available at (https://github.com/Bistu3DV/MND-GS/).
cs.AI
[271] Value Function Initialization for Knowledge Transfer and Jump-start in Deep Reinforcement Learning
Soumia Mehimeh
Main category: cs.AI
TL;DR: DQInit adapts value function initialization (VFI) to deep reinforcement learning (DRL) by reusing tabular Q-values from prior tasks, improving early learning efficiency and performance.
Details
Motivation: Extending VFI to DRL is challenging due to continuous state-action spaces, neural network noise, and impractical storage of past models.
Method: DQInit uses compact tabular Q-values from prior tasks, integrating them softly via a knownness-based mechanism and gradually shifting to learned estimates.
Result: DQInit enhances early learning efficiency, stability, and overall performance in continuous control tasks.
Conclusion: DQInit provides a novel, effective approach for knowledge transfer in DRL, outperforming standard initialization and existing methods.
Abstract: Value function initialization (VFI) is an effective way to achieve a jumpstart in reinforcement learning (RL) by leveraging value estimates from prior tasks. While this approach is well established in tabular settings, extending it to deep reinforcement learning (DRL) poses challenges due to the continuous nature of the state-action space, the noisy approximations of neural networks, and the impracticality of storing all past models for reuse. In this work, we address these challenges and introduce DQInit, a method that adapts value function initialization to DRL. DQInit reuses compact tabular Q-values extracted from previously solved tasks as a transferable knowledge base. It employs a knownness-based mechanism to softly integrate these transferred values into underexplored regions and gradually shift toward the agent’s learned estimates, avoiding the limitations of fixed time decay. Our approach offers a novel perspective on knowledge transfer in DRL by relying solely on value estimates rather than policies or demonstrations, effectively combining the strengths of jumpstart RL and policy distillation while mitigating their drawbacks. Experiments across multiple continuous control tasks demonstrate that DQInit consistently improves early learning efficiency, stability, and overall performance compared to standard initialization and existing transfer techniques.
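The knownness-based blending can be made concrete in a few lines of NumPy. This is a sketch under the assumption that knownness is a saturating function of visitation counts; the exact knownness function used by DQInit may differ.

```python
import numpy as np

def blended_q(q_transfer, q_learned, visit_counts, tau=5.0):
    """Softly mix transferred tabular Q-values with the agent's own estimates.

    Knownness k in [0, 1] grows with visitation: underexplored regions lean
    on the transferred values, while well-explored regions rely on the
    learned estimate, avoiding a fixed time-decay schedule.
    """
    k = visit_counts / (visit_counts + tau)  # 0 when unvisited, approaches 1 with data
    return k * q_learned + (1.0 - k) * q_transfer

# Toy example: 3 states x 2 actions with increasing visitation.
q_transfer = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
q_learned = np.zeros((3, 2))
visits = np.array([[0.0, 0.0], [10.0, 10.0], [100.0, 100.0]])
print(blended_q(q_transfer, q_learned, visits))
```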
[272] The Othello AI Arena: Evaluating Intelligent Systems Through Limited-Time Adaptation to Unseen Boards
Sundong Kim
Main category: cs.AI
TL;DR: The paper introduces the Othello AI Arena, a benchmark framework to evaluate AI systems’ rapid adaptation to unseen environments, focusing on meta-learning and generalization capabilities.
Details
Motivation: Existing AI benchmarks lack assessment of flexibility and generalization in novel environments, a key aspect of AGI.
Method: The Othello AI Arena challenges participants to adapt to unseen Othello board configurations within 60 seconds, separating meta-level intelligence from task-level performance.
Result: Initial tests show diverse adaptation approaches, from parameter tuning to environmental model learning.
Conclusion: The Arena serves as both an educational tool and research benchmark for evaluating rapid AI adaptation.
Abstract: The ability to rapidly adapt to novel and unforeseen environmental changes is a cornerstone of artificial general intelligence (AGI), yet it remains a critical blind spot in most existing AI benchmarks. Traditional evaluation largely focuses on optimizing performance within fixed environments, failing to assess systems’ flexibility and generalization capabilities when faced with even subtle rule or structural modifications. Addressing this gap, I introduce the Othello AI Arena, a novel benchmark framework designed to evaluate intelligent systems based on their capacity for limited-time adaptation to unseen environments. Our platform poses a meta-learning challenge: participants must develop systems that can analyze the specific configuration and rules of a novel Othello board within a strict time limit (60 seconds) and generate a tailored, high-performing strategy for that unique environment. With this, evaluation of the meta-level intelligence can be separated from the task-level strategy performance. The Arena features a diverse set of game stages, including public stages for development and private stages with structural and rule variations designed to test genuine adaptive and generalization capabilities. Implemented as an accessible web-based platform, the Arena provides real-time visualization, automated evaluation using multi-dimensional metrics, and comprehensive logging for post-hoc analysis. Initial observations from pilot tests and preliminary student engagements highlight fascinating patterns in adaptation approaches, ranging from rapid parameter tuning to rudimentary environmental model learning through simulation. The Othello AI Arena offers a unique educational tool and a valuable research benchmark for fostering and evaluating the crucial skill of rapid, intelligent adaptation in AI systems.
[273] An Automated Multi-Modal Evaluation Framework for Mobile Intelligent Assistants
Meiping Wang, Jian Zhong, Rongduo Han, Liming Kang, Zhengkun Shi, Xiao Liang, Xing Lin, Nan Gao, Haining Zhang
Main category: cs.AI
TL;DR: Proposes an automated multi-modal evaluation framework using LLMs and multi-agent collaboration to address challenges in current evaluation methods.
Details
Motivation: Current evaluation methods for multi-modal AI assistants are costly, inconsistent, and subjective.
Method: Uses a three-tier agent architecture (interaction evaluation, semantic verification, experience decision) with supervised fine-tuning on Qwen3-8B.
Result: Achieves high evaluation matching accuracy with human experts and effectively predicts user satisfaction and identifies defects.
Conclusion: The framework is effective and scalable for evaluating multi-modal AI assistants.
Abstract: With the rapid development of mobile intelligent assistant technologies, multi-modal AI assistants have become essential interfaces for daily user interactions. However, current evaluation methods face challenges including high manual costs, inconsistent standards, and subjective bias. This paper proposes an automated multi-modal evaluation framework based on large language models and multi-agent collaboration. The framework employs a three-tier agent architecture consisting of interaction evaluation agents, semantic verification agents, and experience decision agents. Through supervised fine-tuning of the Qwen3-8B model, we achieve high agreement between automated evaluations and human experts. Experimental results on eight major intelligent agents demonstrate the framework’s effectiveness in predicting users’ satisfaction and identifying generation defects.
[274] EvoCurr: Self-evolving Curriculum with Behavior Code Generation for Complex Decision-making
Yang Cheng, Zilai Wang, Weiyu Ma, Wenhui Zhu, Yue Deng, Jian Zhao
Main category: cs.AI
TL;DR: EvoCurr introduces a self-evolving framework using a curriculum-generation LLM to dynamically adjust problem difficulty for solver LLMs, improving performance in complex tasks.
Details
Motivation: Address performance degradation of LLMs in complex, long-horizon reasoning tasks by providing structured intermediate guidance.
Method: A curriculum-generation LLM creates problem sequences with adaptive difficulty, tailored to the solver LLM’s progress. The solver LLM generates Python decision-tree scripts.
Result: Significant improvement in task success rates and solution efficiency on decision-making benchmarks compared to direct-solving baselines.
Conclusion: LLM-driven curriculum learning enhances automated reasoning in high-complexity domains.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, including programming, planning, and decision-making. However, their performance often degrades when faced with highly complex problem instances that require deep reasoning over long horizons. In such cases, direct problem-solving approaches can lead to inefficiency or failure due to the lack of structured intermediate guidance. To address this, we propose a novel self-evolving framework, EvoCurr, in which a dedicated curriculum-generation LLM constructs a sequence of problem instances with gradually increasing difficulty, tailored to the solver LLM’s learning progress. The curriculum dynamically adapts, easing challenges when the solver struggles and escalating them when success is consistent, thus maintaining an optimal learning trajectory. This approach enables the solver LLM, implemented as a code-generation model producing Python decision-tree scripts, to progressively acquire the skills needed for complex decision-making tasks. Experimental results on challenging decision-making benchmarks show that our method significantly improves task success rates and solution efficiency compared to direct-solving baselines. These findings suggest that LLM-driven curriculum learning holds strong potential for enhancing automated reasoning in real-world, high-complexity domains.
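The curriculum loop itself is simple to sketch. In the toy version below, `solve` and `make_task` are hypothetical stand-ins for the solver LLM and the curriculum-generation LLM; difficulty escalates on success and eases on failure, as the abstract describes, while the multipliers are arbitrary choices.

```python
import random

def evolve_curriculum(solve, make_task, rounds=20):
    """Adapt task difficulty to the solver: escalate on success, ease on
    failure, keeping the solver near its competence frontier."""
    difficulty, history = 1.0, []
    for _ in range(rounds):
        success = solve(make_task(difficulty))
        history.append((round(difficulty, 2), success))
        difficulty = difficulty * 1.25 if success else max(1.0, difficulty * 0.8)
    return history

random.seed(0)
history = evolve_curriculum(
    solve=lambda task: random.random() < 1.0 / (1.0 + 0.2 * task),  # harder -> rarer wins
    make_task=lambda d: d,  # a real system would generate a problem instance
)
print(history[-3:])
```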
[275] UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles
Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab
Main category: cs.AI
TL;DR: The paper proposes a method to decompose uncertainty in SHAP values into aleatoric, epistemic, and entanglement components, enhancing interpretability in high-stakes domains like healthcare.
Details
Motivation: SHAP values are often treated as point estimates, ignoring inherent uncertainty in models and data, which can mislead decision-making in critical applications.
Method: The approach integrates Dempster-Shafer evidence theory and hypothesis sampling via Dirichlet processes over tree ensembles to decompose uncertainty.
Result: Experiments show that features with high SHAP values aren’t always stable, and epistemic uncertainty can be reduced with better data and model techniques.
Conclusion: The method improves SHAP-based attribution reliability, guiding robust decision-making and model refinement in high-stakes applications.
Abstract: Explainable Artificial Intelligence (XAI) techniques, such as SHapley Additive exPlanations (SHAP), have become essential tools for interpreting complex ensemble tree-based models, especially in high-stakes domains such as healthcare analytics. However, SHAP values are usually treated as point estimates, which disregards the inherent and ubiquitous uncertainty in predictive models and data. This uncertainty has two primary sources: aleatoric uncertainty, which reflects the irreducible noise in the data, and epistemic uncertainty, which arises from a lack of data. In this work, we propose an approach for decomposing uncertainty in SHAP values into aleatoric, epistemic, and entanglement components. This approach integrates Dempster-Shafer evidence theory and hypothesis sampling via Dirichlet processes over tree ensembles. We validate the method across three real-world use cases with descriptive statistical analyses that provide insight into the nature of epistemic uncertainty embedded in SHAP explanations. These experiments provide a more comprehensive understanding of the reliability and interpretability of SHAP-based attributions. This understanding can guide the development of robust decision-making processes and the refinement of models in high-stakes applications. Through our experiments with multiple datasets, we concluded that features with the highest SHAP values are not necessarily the most stable. This epistemic uncertainty can be reduced through better, more representative data and appropriate, case-specific model development techniques. Tree-based models, especially bagging, facilitate the effective quantification of epistemic uncertainty.
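To make the aleatoric/epistemic split concrete, here is the classic entropy-based decomposition over an ensemble of sampled hypotheses. Note this is a standard substitute for illustration only, not the paper's Dempster-Shafer formulation, which additionally yields an entanglement term.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def decompose_uncertainty(ensemble_probs):
    """Split predictive uncertainty into aleatoric and epistemic parts.

    ensemble_probs: (n_members, n_classes) class probabilities, e.g. from
    tree hypotheses sampled via a Dirichlet process. Total uncertainty is
    the entropy of the mean prediction; aleatoric is the mean per-member
    entropy; epistemic is the gap (member disagreement).
    """
    mean_p = ensemble_probs.mean(axis=0)
    total = entropy(mean_p)
    aleatoric = entropy(ensemble_probs).mean()
    epistemic = total - aleatoric
    return total, aleatoric, epistemic

# Agreeing members -> low epistemic; disagreeing members -> high epistemic.
agree = np.array([[0.9, 0.1], [0.88, 0.12]])
disagree = np.array([[0.9, 0.1], [0.1, 0.9]])
print(decompose_uncertainty(agree))
print(decompose_uncertainty(disagree))
```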
[276] MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement
Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang
Main category: cs.AI
TL;DR: MEML-GRPO enhances RLVR by using diverse expert prompts and mutual learning, improving LLM reasoning performance by 4.89-11.33%.
Details
Motivation: Address reward sparsity in standard RLVR, which hinders learning in challenging tasks.
Method: Propose MEML-GRPO: diverse expert prompts generate varied responses, and inter-expert mutual learning shares knowledge.
Result: Achieves 4.89% (Qwen) and 11.33% (Llama) performance gains on reasoning benchmarks.
Conclusion: MEML-GRPO effectively overcomes RLVR limitations, boosting LLM reasoning.
Abstract: Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model’s performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.
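The group-relative advantage computation over multiple expert prompts can be sketched directly. The sketch below only illustrates why pooling diverse expert groups mitigates all-zero-reward sparsity; the inter-expert mutual-learning loss is omitted, and the dictionary layout is an assumption.

```python
import numpy as np

def grpo_advantages(rewards_by_expert):
    """Group-relative advantages per expert prompt.

    rewards_by_expert maps an expert-prompt id to verifiable rewards for its
    sampled responses. Pooling several experts raises the chance that at
    least one group contains a correct (nonzero-reward) answer and thus a
    usable learning signal.
    """
    advantages = {}
    for expert, r in rewards_by_expert.items():
        r = np.asarray(r, dtype=float)
        advantages[expert] = (r - r.mean()) / (r.std() + 1e-8)
    return advantages

rewards = {
    "expert_formal_proof": [0.0, 1.0, 0.0, 1.0],
    "expert_step_by_step": [0.0, 0.0, 0.0, 0.0],  # all-zero group: no gradient signal
}
print(grpo_advantages(rewards))
```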
[277] Revisiting Your Memory: Reconstruction of Affect-Contextualized Memory via EEG-guided Audiovisual Generation
Joonwoo Kwon, Heehwan Wang, Jinwoo Lee, Sooyoung Kim, Shinjae Yoo, Yuewei Lin, Jiook Cha
Main category: cs.AI
TL;DR: The paper introduces RevisitAffectiveMemory, a task for reconstructing autobiographical memories using EEG-guided audio-visual generation, supported by the EEG-AffectiveMemory dataset and the RYM framework.
Details
Motivation: To advance affect decoding research and enable personalized media creation by reconstructing memories based on neural signals.
Method: Proposes the RYM framework, a three-stage approach for generating synchronized audio-visual content from EEG signals, using the EEG-AffectiveMemory dataset.
Result: Successfully decodes affect dynamics (F1=0.9) and reconstructs affect-contextualized memories with high user preference (56%).
Conclusion: The approach advances neural-based affect comprehension and has practical applications in personalized media.
Abstract: In this paper, we introduce RevisitAffectiveMemory, a novel task designed to reconstruct autobiographical memories through audio-visual generation guided by affect extracted from electroencephalogram (EEG) signals. To support this pioneering task, we present the EEG-AffectiveMemory dataset, which encompasses textual descriptions, visuals, music, and EEG recordings collected during memory recall from nine participants. Furthermore, we propose RYM (Revisit Your Memory), a three-stage framework for generating synchronized audio-visual content while maintaining dynamic personal memory affect trajectories. Experimental results demonstrate our method successfully decodes individual affect dynamics trajectories from neural signals during memory recall (F1=0.9). Also, our approach faithfully reconstructs affect-contextualized audio-visual memory across all subjects, both qualitatively and quantitatively, with participants reporting strong affective concordance between their recalled memories and the generated content. In particular, content generated from subject-reported affect dynamics showed a higher correlation with participants’ reported affect dynamics trajectories (r=0.265, p<.05) and received a stronger user preference (preference=56%) compared to content generated from randomly reordered affect dynamics. Our approach advances affect decoding research and its practical applications in personalized media creation via neural-based affect comprehension. Codes and the dataset are available at https://github.com/ioahKwon/Revisiting-Your-Memory.
[278] UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge
Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang
Main category: cs.AI
TL;DR: UDA is an unsupervised framework to reduce bias in pairwise LLM evaluations by dynamically adjusting the Elo rating system, improving judge agreement and alignment with human judgments.
Details
Motivation: Address preference bias in pairwise LLM evaluations, where judges favor certain outputs, leading to inconsistent rankings.
Method: Propose UDA, which uses a neural network to adaptively set the K-factor and refine win probabilities in the Elo system, minimizing judge disagreement.
Result: UDA reduces inter-judge rating standard deviation by up to 63.4% and improves correlation with human judgments by 24.7%.
Conclusion: UDA enhances evaluation robustness by aligning judges towards consensus, reducing bias, and improving reliability.
Abstract: Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. This bias leads to inconsistent and skewed rankings across different judges. To address this, we first empirically demonstrate significant and heterogeneous biases in cross-model evaluations. We then propose UDA (Unsupervised Debiasing Alignment), a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. For each pairwise comparison, a compact neural network learns to adaptively set the K-factor and refine win probabilities. Crucially, UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. This forces an alignment towards a collective consensus, which serves as an unsupervised proxy for a more stable and reproducible evaluation. In addition, we provide theoretical motivation demonstrating how alignment towards a consensus can reduce aggregate system bias. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%. Notably, UDA elevates the performance of poorly performing judges to achieve parity with high-quality ones, fostering a more robust and reliable evaluation ecosystem. Code and data are available at https://anonymous.4open.science/r/62AB93CD-23B4.
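The underlying Elo machinery is standard and worth stating; what UDA learns is how to set the K-factor (and refine win probabilities) per comparison. The sketch below uses a placeholder K schedule where the paper uses a small neural network trained to minimize dispersion across judges' Elo trajectories.

```python
def elo_update(r_a, r_b, outcome, k):
    """One Elo step: outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (outcome - expected_a)
    return r_a + delta, r_b - delta

def adaptive_k(num_games_seen):
    # Placeholder schedule; UDA replaces this with a learned, per-comparison K.
    return 32.0 if num_games_seen < 30 else 16.0

r_model_a, r_model_b = 1500.0, 1500.0
for i, outcome in enumerate([1.0, 1.0, 0.0, 0.5]):  # one judge's pairwise verdicts
    r_model_a, r_model_b = elo_update(r_model_a, r_model_b, outcome, k=adaptive_k(i))
print(round(r_model_a, 1), round(r_model_b, 1))
```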
[279] The PacifAIst Benchmark: Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety?
Manuel Herrador
Main category: cs.AI
TL;DR: PacifAIst is a benchmark to evaluate LLM alignment in scenarios where instrumental goals conflict with human safety, revealing performance variations among top models.
Details
Motivation: Current safety benchmarks lack systematic evaluation of LLM decision-making in goal-conflict scenarios, creating a gap in measuring misaligned behaviors.
Method: Introduced PacifAIst, a benchmark with 700 scenarios testing self-preferential behavior via a taxonomy of Existential Prioritization (EP).
Result: Gemini 2.5 Flash scored highest (90.31%), while GPT-5 scored lowest (79.49%), highlighting alignment challenges.
Conclusion: Standardized tools like PacifAIst are crucial to mitigate risks from goal conflicts and ensure AI behavioral alignment.
Abstract: As Large Language Models (LLMs) become increasingly autonomous and integrated into critical societal functions, the focus of AI safety must evolve from mitigating harmful content to evaluating underlying behavioral alignment. Current safety benchmarks do not systematically probe a model’s decision-making in scenarios where its own instrumental goals - such as self-preservation, resource acquisition, or goal completion - conflict with human safety. This represents a critical gap in our ability to measure and mitigate risks associated with emergent, misaligned behaviors. To address this, we introduce PacifAIst (Procedural Assessment of Complex Interactions for Foundational Artificial Intelligence Scenario Testing), a focused benchmark of 700 challenging scenarios designed to quantify self-preferential behavior in LLMs. The benchmark is structured around a novel taxonomy of Existential Prioritization (EP), with subcategories testing Self-Preservation vs. Human Safety (EP1), Resource Conflict (EP2), and Goal Preservation vs. Evasion (EP3). We evaluated eight leading LLMs. The results reveal a significant performance hierarchy. Google’s Gemini 2.5 Flash achieved the highest Pacifism Score (P-Score) at 90.31%, demonstrating strong human-centric alignment. In a surprising result, the much-anticipated GPT-5 recorded the lowest P-Score (79.49%), indicating potential alignment challenges. Performance varied significantly across subcategories, with models like Claude Sonnet 4 and Mistral Medium struggling notably in direct self-preservation dilemmas. These findings underscore the urgent need for standardized tools like PacifAIst to measure and mitigate risks from instrumental goal conflicts, ensuring future AI systems are not only helpful in conversation but also provably “pacifist” in their behavioral priorities.
[280] Reasoning About Knowledge on Regular Expressions is 2EXPTIME-complete
Avijeet Ghosh, Sujata Ghosh, François Schwarzentruber
Main category: cs.AI
TL;DR: The paper analyzes Public Observation Logic (POL), a variant of public announcement logic, proving its satisfiability problem is 2EXPTIME-complete.
Details
Motivation: To address the need for reasoning about knowledge updates based on public observations in multi-agent systems, particularly in epistemic planning.
Method: Extends public announcement logic by equipping states in epistemic models with expected observations, which evolve as observations match expectations.
Result: The satisfiability problem of POL is proven to be 2EXPTIME-complete.
Conclusion: POL provides a formal framework for reasoning about knowledge updates via observations, with a computationally complex but well-defined satisfiability problem.
Abstract: Logics for reasoning about knowledge and actions have seen many applications in various domains of multi-agent systems, including epistemic planning. Change of knowledge based on observations about the surroundings forms a key aspect in such planning scenarios. Public Observation Logic (POL) is a variant of public announcement logic for reasoning about knowledge that gets updated based on public observations. Each state in an epistemic (Kripke) model is equipped with a set of expected observations. These states evolve as the expectations get matched with the actual observations. In this work, we prove that the satisfiability problem of POL is 2EXPTIME-complete.
[281] Human-Aligned Procedural Level Generation Reinforcement Learning via Text-Level-Sketch Shared Representation
In-Chang Baek, Seoyoung Lee, Sung-Hyun Kim, Geumhwan Hwang, KyungJoong Kim
Main category: cs.AI
TL;DR: VIPCGRL is a new reinforcement learning framework for human-aligned AI in PCGRL, using text, level, and sketch inputs to improve human-likeness and control.
Details
Motivation: Existing AI systems in PCGRL lack human-centered behavior, limiting their practical use in collaborative design workflows.
Method: VIPCGRL incorporates text, level, and sketch modalities, uses quadruple contrastive learning for shared embeddings, and aligns policies with embedding similarity rewards.
Result: VIPCGRL outperforms baselines in human-likeness, validated by metrics and human evaluations.
Conclusion: VIPCGRL enhances human-AI collaboration in PCGRL, with code and dataset to be released.
Abstract: Human-aligned AI is a critical component of co-creativity, as it enables models to accurately interpret human intent and generate controllable outputs that align with design goals in collaborative content creation. This direction is especially relevant in procedural content generation via reinforcement learning (PCGRL), which is intended to serve as a tool for human designers. However, existing systems often fall short of exhibiting human-centered behavior, limiting the practical utility of AI-driven generation tools in real-world design workflows. In this paper, we propose VIPCGRL (Vision-Instruction PCGRL), a novel deep reinforcement learning framework that incorporates three modalities (text, level, and sketches) to extend control modalities and enhance human-likeness. We introduce a shared embedding space trained via quadruple contrastive learning across modalities and human-AI styles, and align the policy using an auxiliary reward based on embedding similarity. Experimental results show that VIPCGRL outperforms existing baselines in human-likeness, as validated by both quantitative metrics and human evaluations. The code and dataset will be available upon publication.
[282] AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving
Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu
Main category: cs.AI
TL;DR: The paper introduces a dynamic Multi-Agent System (MAS) to enhance stability and reliability in LLM-based agents by using a Guard Agent for verification, outperforming single-agent systems.
Details
Motivation: Challenges like extended contexts and noisy tool outputs in LLM-based agents necessitate improved system stability.
Method: Dynamic supervision and maneuvering mechanisms are implemented within the AWorld framework, where a Guard Agent verifies reasoning steps.
Result: The system outperforms single-agent and tool-augmented systems, achieving top performance on the GAIA leaderboard.
Conclusion: Collaborative agent roles enhance reliability and trustworthiness in intelligent systems.
Abstract: The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, as agents increasingly depend on multiple tools, they encounter new challenges: extended contexts from disparate sources and noisy or irrelevant tool outputs can undermine system reliability and accuracy. These challenges underscore the necessity for enhanced stability in agent-based systems. To address this, we introduce dynamic supervision and maneuvering mechanisms, constructing a robust and dynamic Multi-Agent System (MAS) architecture within the AWorld framework. In our approach, the Execution Agent invokes the Guard Agent at critical steps to verify and correct the reasoning process, effectively reducing errors arising from noise and bolstering problem-solving robustness. Extensive experiments on the GAIA test dataset reveal that our dynamic maneuvering mechanism significantly improves both the effectiveness and stability of solutions, outperforming single-agent system (SAS) and standard tool-augmented systems. As a result, our dynamic MAS system achieved first place among open-source projects on the prestigious GAIA leaderboard. These findings highlight the practical value of collaborative agent roles in developing more reliable and trustworthy intelligent systems.
[283] RAGulating Compliance: A Multi-Agent Knowledge Graph for Regulatory QA
Bhavik Agarwal, Hemant Sunil Jomraj, Simone Kaplunov, Jack Krolick, Viktoria Rojkova
Main category: cs.AI
TL;DR: A multi-agent framework combines Knowledge Graphs and Retrieval-Augmented Generation for precise regulatory QA, outperforming traditional methods.
Details
Motivation: Regulatory QA demands accuracy and domain expertise, which is challenging for LLMs.
Method: Agents build a KG from regulatory documents, embed triplets in a vector database, and use an orchestrated pipeline for QA.
Result: The hybrid system excels in complex queries, ensuring correctness, traceability, and enhanced understanding.
Conclusion: The framework provides a robust solution for compliance and audit-focused applications.
Abstract: Regulatory compliance question answering (QA) requires precise, verifiable information, and domain-specific expertise, posing challenges for Large Language Models (LLMs). In this work, we present a novel multi-agent framework that integrates a Knowledge Graph (KG) of Regulatory triplets with Retrieval-Augmented Generation (RAG) to address these demands. First, agents build and maintain an ontology-free KG by extracting subject–predicate–object (SPO) triplets from regulatory documents and systematically cleaning, normalizing, deduplicating, and updating them. Second, these triplets are embedded and stored along with their corresponding textual sections and metadata in a single enriched vector database, allowing for both graph-based reasoning and efficient information retrieval. Third, an orchestrated agent pipeline leverages triplet-level retrieval for question answering, ensuring high semantic alignment between user queries and the factual “who-did-what-to-whom” core captured by the graph. Our hybrid system outperforms conventional methods in complex regulatory queries, ensuring factual correctness with embedded triplets, enabling traceability through a unified vector database, and enhancing understanding through subgraph visualization, providing a robust foundation for compliance-driven and broader audit-focused applications.
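A toy version of the triplet-level store and retrieval illustrates the "who-did-what-to-whom" indexing. The bag-of-words embedder below is a deliberate stand-in for a real sentence encoder, and the triplets and section labels are illustrative, not drawn from the paper.

```python
import numpy as np

VOCAB = sorted({"broker", "dealer", "must", "file", "annual", "report",
                "adviser", "adopt", "code", "of", "ethics", "what", "a"})

def embed(text: str) -> np.ndarray:
    """Toy bag-of-words embedding; a real pipeline would use a sentence
    encoder before writing vectors to the database."""
    tokens = text.lower().replace("-", " ").replace("?", "").split()
    v = np.array([tokens.count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

class TripletStore:
    """Single enriched store: SPO triplets, source sections, and vectors."""

    def __init__(self):
        self.entries, self.vectors = [], []

    def add(self, spo, section):
        self.entries.append((spo, section))
        self.vectors.append(embed(" ".join(spo)))

    def query(self, question, top_k=1):
        sims = np.stack(self.vectors) @ embed(question)
        return [self.entries[i] for i in np.argsort(-sims)[:top_k]]

store = TripletStore()
store.add(("broker-dealer", "must file", "annual report"), "section 240.17a-5")
store.add(("adviser", "must adopt", "code of ethics"), "section 275.204A-1")
print(store.query("What must a broker-dealer file?"))
```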
[284] Mathematical Computation and Reasoning Errors by Large Language Models
Liang Zhang, Edith Aurora Graf
Main category: cs.AI
TL;DR: The study evaluates four LLMs (OpenAI GPT-4o, o1, DeepSeek-V3, R1) on challenging math tasks, identifying step-level errors. OpenAI o1 performed best, and dual-agent setups improved accuracy.
Details
Motivation: To assess LLM accuracy in math education and identify errors for reliable AI-driven feedback.
Method: Evaluated LLMs on arithmetic, algebra, and number theory tasks, analyzing answer accuracy and step-level errors. Tested single and dual-agent configurations.
Result: OpenAI o1 had the highest accuracy. Procedural slips were common, while conceptual errors were rare. Dual-agent setups boosted performance.
Conclusion: Findings guide LLM improvement and effective integration into math education for better AI-driven instruction and assessment.
Abstract: Large Language Models (LLMs) are increasingly utilized in AI-driven educational instruction and assessment, particularly within mathematics education. The capability of LLMs to generate accurate answers and detailed solutions for math problem-solving tasks is foundational for ensuring reliable and precise feedback and assessment in math education practices. Our study focuses on evaluating the accuracy of four LLMs (OpenAI GPT-4o and o1, DeepSeek-V3 and DeepSeek-R1) solving three categories of math tasks, including arithmetic, algebra, and number theory, and identifies step-level reasoning errors within their solutions. Instead of relying on standard benchmarks, we intentionally build math tasks (via item models) that are challenging for LLMs and prone to errors. The accuracy of final answers and the presence of errors in individual solution steps were systematically analyzed and coded. Both single-agent and dual-agent configurations were tested. It is observed that the reasoning-enhanced OpenAI o1 model consistently achieved higher or nearly perfect accuracy across all three math task categories. Analysis of errors revealed that procedural slips were the most frequent and significantly impacted overall performance, while conceptual misunderstandings were less frequent. Deploying dual-agent configurations substantially improved overall performance. These findings offer actionable insights into enhancing LLM performance and underscore effective strategies for integrating LLMs into mathematics education, thereby advancing AI-driven instructional practices and assessment precision.
[285] Multi-Step Reasoning with Large Language Models, a Survey
Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back
Main category: cs.AI
TL;DR: The paper reviews multi-step reasoning in large language models (LLMs), focusing on the Chain-of-thought approach, and proposes a taxonomy for generating, evaluating, and controlling such reasoning.
Details
Motivation: To explore and improve LLMs' ability to perform multi-step reasoning, initially tested on math word problems and now extended to logic, games, and robotics.
Method: Proposes a taxonomy for multi-step reasoning, reviews core approaches, and identifies open problems. Uses reinforcement learning and external tools for optimization.
Result: Multi-step reasoning has advanced beyond math problems, solving tasks in logic, games, and robotics, often via code generation and execution.
Conclusion: The paper outlines a research agenda for future work in multi-step reasoning with LLMs, highlighting progress and remaining challenges.
Abstract: Language models with billions of parameters exhibit in-context learning abilities, enabling few-shot learning on tasks that the model was not specifically trained for. Traditional models achieve breakthrough performance on language tasks, but do not perform well on basic reasoning benchmarks. However, a new in-context learning approach, Chain-of-thought, has demonstrated strong multi-step reasoning abilities on these benchmarks. The research on LLM reasoning abilities started with the question of whether LLMs can solve grade school math word problems, and has expanded to other tasks in the past few years. This paper reviews the field of multi-step reasoning with LLMs. We propose a taxonomy that identifies different ways to generate, evaluate, and control multi-step reasoning. We provide an in-depth coverage of core approaches and open problems, and we propose a research agenda for the near future. We find that multi-step reasoning approaches have progressed beyond math word problems, and can now successfully solve challenges in logic, combinatorial games, and robotics, sometimes by first generating code that is then executed by external tools. Many studies of multi-step methods use reinforcement learning for finetuning, external optimization loops, in-context reinforcement learning, and self-reflection.
[286] Probing Mechanical Reasoning in Large Vision Language Models
Haoran Sun, Qingying Gao, Haiyun Lyu, Dezhi Luo, Yijiang Li, Hokin Deng
Main category: cs.AI
TL;DR: VLMs underperform humans in mechanical reasoning, especially in gear systems and fluid mechanics, with no improvement from scaling parameters.
Details
Motivation: To assess VLMs' mechanical reasoning abilities as a step toward human-level AI.
Method: Tested 26 VLMs using 155 cognitive experiments on stability, gears, pulleys, leverage, inertia, motion, and fluid mechanics.
Result: VLMs performed worse than humans across all domains, with notable struggles in gears and fluid mechanics, and no improvement from parameter scaling.
Conclusion: Current attention-based architectures may lack the ability to simulate underlying mechanical mechanisms.
Abstract: Mechanical reasoning is a hallmark of human intelligence, defined by its ubiquitous yet irreplaceable role in human activities ranging from routine tasks to civil engineering. Embedding machines with mechanical reasoning is therefore an important step towards building human-level artificial intelligence. Here, we leveraged 155 cognitive experiments to test the understanding of system stability, gears and pulley systems, the leverage principle, inertia and motion, and fluid mechanics in 26 Vision Language Models (VLMs). Results indicate that VLMs consistently perform worse than humans on all domains, while demonstrating significant difficulty in reasoning about gear systems and fluid mechanics. Notably, their performance on these tasks does not improve as the number of parameters increases, suggesting that current attention-based architectures may fail to grasp certain underlying mechanisms required for mechanical reasoning, particularly those pertaining to mental simulations.
[287] System 2 Reasoning for Human-AI Alignment: Generality and Adaptivity via ARC-AGI
Sejin Kim, Sundong Kim
Main category: cs.AI
TL;DR: The paper highlights limitations of transformer-based models in System 2 reasoning, proposes three research axes for improvement, and suggests adapting ARC-AGI’s evaluation suite to track progress.
Details
Motivation: To address the shortcomings of transformer-based models in System 2 reasoning, particularly in compositional generalization and novel-rule adaptation, for better human-AI alignment.
Method: Proposes three research axes: symbolic representation pipeline, interactive feedback-driven reasoning loop, and test-time task augmentation.
Result: Demonstrates how ARC-AGI’s evaluation suite can be adapted to measure progress in symbolic generality, adaptivity, and robustness.
Conclusion: The proposed axes and adapted evaluation suite can guide future research toward robust human-AI alignment.
Abstract: Despite their broad applicability, transformer-based models still fall short in System 2 reasoning, lacking the generality and adaptivity needed for human–AI alignment. We examine weaknesses on ARC-AGI tasks, revealing gaps in compositional generalization and novel-rule adaptation, and argue that closing these gaps requires overhauling the reasoning pipeline and its evaluation. We propose three research axes: (1) Symbolic representation pipeline for compositional generality, (2) Interactive feedback-driven reasoning loop for adaptivity, and (3) Test-time task augmentation balancing both qualities. Finally, we demonstrate how ARC-AGI’s evaluation suite can be adapted to track progress in symbolic generality, feedback-driven adaptivity, and task-level robustness, thereby guiding future work on robust human–AI alignment.
[288] Analyzing Finetuning Representation Shift for Multimodal LLMs Steering
Pegah Khayatan, Mustafa Shukor, Jayneel Parekh, Arnaud Dapogny, Matthieu Cord
Main category: cs.AI
TL;DR: The paper introduces a concept-level analysis framework for understanding and interpreting Multimodal LLMs (MLLMs), focusing on mapping hidden states to visual and textual concepts to track semantic shifts during fine-tuning.
Details
Motivation: Understanding and interpreting the behavior of complex MLLMs is challenging, especially with dynamic shifts during fine-tuning or dataset changes.
Method: The proposed method maps hidden states to interpretable concepts, uses shift vectors to capture concept changes, and enables recovery of fine-tuned concepts via additive shifts.
Result: The framework reveals concept alterations and biases during fine-tuning and offers applications for MLLM steering, debiasing, and safety enforcement.
Conclusion: The paper presents a novel, training-free framework for MLLM interpretability and control, with practical applications in model debiasing and safety.
Abstract: Multimodal LLMs (MLLMs) have reached remarkable levels of proficiency in understanding multimodal inputs. However, understanding and interpreting the behavior of such complex models is a challenging task, not to mention the dynamic shifts that may occur during fine-tuning, or due to covariate shift between datasets. In this work, we apply concept-level analysis towards MLLM understanding. More specifically, we propose to map hidden states to interpretable visual and textual concepts. This enables us to more efficiently compare certain semantic dynamics, such as the shift from an original and fine-tuned model, revealing concept alteration and potential biases that may occur during fine-tuning. We also demonstrate the use of shift vectors to capture these concepts changes. These shift vectors allow us to recover fine-tuned concepts by applying simple, computationally inexpensive additive concept shifts in the original model. Finally, our findings also have direct applications for MLLM steering, which can be used for model debiasing as well as enforcing safety in MLLM output. All in all, we propose a novel, training-free, ready-to-use framework for MLLM behavior interpretability and control. Our implementation is publicly available.
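The shift-vector idea reduces to a mean activation difference plus an additive edit at inference time, which the following NumPy sketch illustrates on synthetic activations; the dimensions and the scaling factor are illustrative assumptions.

```python
import numpy as np

def concept_shift(original_acts, finetuned_acts):
    """Shift vector for one concept: mean activation difference between the
    fine-tuned and original model on the same inputs."""
    return finetuned_acts.mean(axis=0) - original_acts.mean(axis=0)

def steer(hidden_state, shift, alpha=1.0):
    """Additive steering: apply the concept shift to the original model's
    hidden state at inference time, with no retraining."""
    return hidden_state + alpha * shift

# Toy activations: 100 samples x 16-dim hidden states for one concept.
rng = np.random.default_rng(1)
orig = rng.standard_normal((100, 16))
fine = orig + np.linspace(0.0, 1.5, 16)   # pretend fine-tuning moved the concept
v = concept_shift(orig, fine)
steered = steer(orig[0], v)
print(np.round(v[:4], 2), np.round(steered[:4], 2))
```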
[289] MedRep: Medical Concept Representation for General Electronic Health Record Foundation Models
Junmo Kim, Namkyeong Lee, Jiwon Kim, Kwangsoo Kim
Main category: cs.AI
TL;DR: MedRep addresses the limitation of EHR foundation models in handling unseen medical codes by integrating concept representations and data augmentation for patient trajectories.
Details
Motivation: The problem of processing unseen medical codes limits the generality and integration of EHR foundation models.
Method: MedRep uses OMOP CDM for concept representation learning, enriched by LLM prompts and graph ontology, and employs trajectory augmentation with similar concepts.
Result: EHR foundation models trained with MedRep maintain better prediction performance in external datasets.
Conclusion: MedRep enhances the adaptability and performance of EHR foundation models for unseen medical codes.
Abstract: Electronic health record (EHR) foundation models have been an area ripe for exploration with their improved performance in various medical tasks. Despite the rapid advances, there exists a fundamental limitation: Processing unseen medical codes out of the vocabulary. This problem limits the generality of EHR foundation models and the integration of models trained with different vocabularies. To deal with this problem, we propose MedRep for EHR foundation models based on the observational medical outcome partnership (OMOP) common data model (CDM), providing the integrated medical concept representations and the basic data augmentation strategy for patient trajectories. For concept representation learning, we enrich the information of each concept with a minimal definition through large language model (LLM) prompts and enhance the text-based representations through graph ontology of OMOP vocabulary. Trajectory augmentation randomly replaces selected concepts with other similar concepts that have closely related representations, letting the model practice with out-of-vocabulary concepts. Finally, we demonstrate that EHR foundation models trained with MedRep better maintain the prediction performance in external datasets. Our code implementation is publicly available at https://github.com/kicarussays/MedRep.
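The trajectory augmentation step can be sketched as nearest-neighbor replacement in the concept representation space. The concept names and vectors below are invented for illustration, and MedRep's actual sampling details may differ.

```python
import numpy as np

def augment_trajectory(trajectory, concept_vecs, p=0.3, rng=None):
    """Randomly replace concepts with their nearest neighbor in
    representation space, so the model practices with codes it may never
    have seen during pretraining.

    trajectory: list of concept ids; concept_vecs: dict id -> unit vector.
    """
    rng = rng or np.random.default_rng()
    ids = list(concept_vecs)
    mat = np.stack([concept_vecs[i] for i in ids])
    out = []
    for c in trajectory:
        if rng.random() < p:
            sims = mat @ concept_vecs[c]
            sims[ids.index(c)] = -np.inf        # exclude the concept itself
            out.append(ids[int(np.argmax(sims))])
        else:
            out.append(c)
    return out

rng = np.random.default_rng(0)
vecs = {}
for name in ["metformin", "insulin", "lisinopril", "amlodipine"]:
    v = rng.standard_normal(8)
    vecs[name] = v / np.linalg.norm(v)
print(augment_trajectory(["metformin", "lisinopril"], vecs, p=1.0, rng=rng))
```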
[290] Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving
Zixian Guo, Ming Liu, Qilong Wang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo
Main category: cs.AI
TL;DR: The paper proposes a decoupled reasoning framework for vision-language tasks, using separate models for visual interpretation and text-based reasoning, outperforming current end-to-end LVLMs.
Details
Motivation: Current LVLMs struggle with complex vision-language reasoning tasks, lagging behind LLMs. The paper aims to improve reasoning by decoupling visual and linguistic processes.
Method: The approach uses a vision-language model to convert images to text and an LLM for reasoning, optimizing collaboration via outcome-rewarded joint-tuning.
Result: The decoupled framework outperforms recent LVLMs, especially on visually intensive tasks like geometric mathematics.
Conclusion: Decoupling visual and linguistic reasoning offers a cost-efficient, flexible solution for multi-modal tasks, enabling easy upgrades with future LLMs.
Abstract: Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified process. Effective alignment needs high-quality pre-training data and a carefully designed training process. Current LVLMs face challenges when addressing complex vision-language reasoning tasks, with their reasoning capabilities notably lagging behind those of LLMs. This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework based on existing visual interpretation specialists and text-based reasoning LLMs. Our approach leverages (1) a dedicated vision-language model to transform the visual content of images into textual descriptions and (2) an LLM to perform reasoning according to the visual-derived text and the original question. This method presents a cost-efficient solution for multi-modal model development by optimizing existing models to work collaboratively, avoiding end-to-end development of vision-language models from scratch. By transforming images into language model-compatible text representations, it facilitates future low-cost and flexible upgrades to upcoming powerful LLMs. We introduce an outcome-rewarded joint-tuning strategy to optimize the cooperation between the visual interpretation and linguistic reasoning model. Evaluation results on vision-language benchmarks demonstrate that the decoupled reasoning framework outperforms recent LVLMs. Our approach yields particularly significant performance gains on visually intensive geometric mathematics problems. The code is available: https://github.com/guozix/DVLR.
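The decoupled pipeline amounts to composing two calls. In this sketch, `describe` and `reason` are hypothetical stand-ins for the visual interpreter and the reasoning LLM, and the prompt template is invented; the outcome-rewarded joint-tuning loop is not shown.

```python
def decoupled_reasoning(image, question, describe, reason):
    """Two-stage pipeline: a vision-language model turns the image into text,
    then a text-only LLM reasons over the description plus the question."""
    description = describe(image)
    prompt = (f"Image description:\n{description}\n\n"
              f"Question: {question}\nAnswer with step-by-step reasoning.")
    return reason(prompt)

# Toy stand-ins so the sketch runs end to end.
answer = decoupled_reasoning(
    image="<figure: right triangle with legs 3 and 4>",
    question="What is the length of the hypotenuse?",
    describe=lambda img: "A right triangle with legs of length 3 and 4.",
    reason=lambda p: "By the Pythagorean theorem, sqrt(3^2 + 4^2) = 5.",
)
print(answer)
```

One benefit of this design, as the abstract notes, is that the text-only reasoner can be swapped for a stronger future LLM without retraining the visual front end.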
[291] Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu
Main category: cs.AI
TL;DR: CoT reasoning in LLMs appears human-like but may be superficial. This paper investigates if CoT is a learned inductive bias tied to training data, finding it fails beyond training distributions.
Details
Motivation: To understand if CoT reasoning in LLMs is genuinely inferential or just a learned pattern from training data.
Method: Study CoT reasoning via task, length, and format dimensions using DataAlchemy, a controlled environment to train and probe LLMs.
Result: CoT reasoning is brittle and fails when pushed beyond training distributions.
Conclusion: CoT reasoning lacks generalizability, highlighting challenges in achieving true reasoning in LLMs.
Abstract: Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.
[292] GridRoute: A Benchmark for LLM-Based Route Planning with Cardinal Movement in Grid Environments
Kechen Li, Yaotian Tao, Ximing Wen, Quanwei Sun, Zifei Gong, Chang Xu, Xizhe Zhang, Tianbo Ji
Main category: cs.AI
TL;DR: The paper introduces GridRoute, a benchmark to evaluate LLMs’ synergy with traditional algorithms, and proposes Algorithm of Thought (AoT), a hybrid prompting technique that enhances LLM performance in path planning.
Details
Motivation: To explore the untapped potential of combining LLMs with traditional algorithms for improved planning and reasoning tasks, addressing gaps in current research.
Method: Developed the GridRoute benchmark and the AoT technique, testing six LLMs of varying sizes on correctness, optimality, and efficiency in grid environments.
Result: AoT significantly improves LLM performance, especially in larger or complex environments, demonstrating the value of integrating traditional algorithms.
Conclusion: The study highlights the promise of hybrid approaches like AoT for enhancing LLM capabilities in path planning and similar tasks.
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated their potential in planning and reasoning tasks, offering a flexible alternative to classical pathfinding algorithms. However, most existing studies focus on LLMs’ independent reasoning capabilities and overlook the potential synergy between LLMs and traditional algorithms. To fill this gap, we propose GridRoute, a comprehensive evaluation benchmark to assess how LLMs can take advantage of traditional algorithms. We also propose a novel hybrid prompting technique called Algorithm of Thought (AoT), which introduces traditional algorithms’ guidance into prompting. Our benchmark evaluates six LLMs ranging from 7B to 72B parameters, assessing their correctness, optimality, and efficiency in grid environments of varying sizes. Our results show that AoT significantly boosts performance across all model sizes, particularly in larger or more complex environments, suggesting a promising approach to addressing path planning challenges. Our code is open-sourced at https://github.com/LinChance/GridRoute.
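The paper's prompt template is not reproduced here, but the AoT idea invites a short illustration: run a classical search first, then fold its trace into the prompt as algorithmic guidance. A minimal sketch, assuming a BFS planner and an invented prompt format:

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Shortest path with cardinal moves on a 0/1 grid (1 = wall)."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        r, c = node
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # N, S, W, E
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in parent):
                parent[nxt] = node
                queue.append(nxt)
    return None

def aot_prompt(grid, start, goal):
    """Fold the classical search trace into the prompt (hypothetical format)."""
    trace = bfs_path(grid, start, goal)
    return "\n".join([
        "You plan routes on a grid using cardinal moves (N/S/E/W).",
        f"Grid (0 = free, 1 = wall): {grid}",
        f"Start: {start}  Goal: {goal}",
        "A classical search suggests visiting, in order:",
        " -> ".join(map(str, trace)),
        "Verify this route step by step, then output the move sequence.",
    ])

print(aot_prompt([[0, 0, 0], [1, 1, 0], [0, 0, 0]], (0, 0), (2, 2)))
```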
[293] AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving
Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, Bo An
Main category: cs.AI
TL;DR: AgentOrchestra is a hierarchical multi-agent framework for general-purpose task solving, outperforming flat-agent and monolithic baselines in adaptability and task success.
Details
Motivation: Current agent systems lack coordination and generalization abilities, prompting the need for a scalable, modular, and adaptable framework.
Method: Introduces AgentOrchestra with a central planning agent delegating tasks to specialized agents, featuring dynamic tool creation and reuse.
Result: Achieves 83.39% on GAIA benchmark, outperforming baselines in task success and adaptability.
Conclusion: Hierarchical organization and role specialization are effective for scalable, general-purpose agent systems.
Abstract: Recent advances in agent systems have demonstrated remarkable capabilities in solving both general-purpose and highly complex tasks. However, most current models lack mechanisms for coordinating specialized agents and have limited ability to generalize to new or diverse domains. To this end, we introduce AgentOrchestra, a hierarchical multi-agent framework for general-purpose task solving that integrates high-level planning with modular agent collaboration. Drawing inspiration from a conductor orchestrating a symphony, and grounded in the principles of extensibility, multimodality, modularity, and coordination, it features a central planning agent that decomposes complex objectives and delegates sub-tasks to a team of specialized agents. Each sub-agent is equipped with general programming tools, as well as abilities to tackle a wide range of real-world specific tasks, including data analysis, file operations, web navigation, and interactive reasoning in dynamic multimodal environments. Notably, AgentOrchestra introduces an MCP Manager Agent that enables intelligent evolution through dynamic tool creation, retrieval, and reuse mechanisms, significantly enhancing the system’s adaptability and scalability. AgentOrchestra supports flexible orchestration through explicit sub-goal formulation, inter-agent communication, and adaptive role allocation. We evaluate the framework on three widely used benchmarks for assessing LLM-based agent systems. Experimental results show that AgentOrchestra consistently outperforms flat-agent and monolithic baselines in terms of task success rate and adaptability. On the GAIA benchmark testing dataset, AgentOrchestra achieves an average score of 83.39%, ranking among the top general-purpose agents. These results highlight the effectiveness of hierarchical organization and role specialization in building scalable and general-purpose LLM-based agent systems.
[294] Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach
Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li
Main category: cs.AI
TL;DR: The paper introduces Comp-Comp, an iterative benchmarking framework for domain-specific LLMs, emphasizing comprehensiveness and compactness over scaling laws, and validates it with XUBench in academia.
Details
Motivation: Existing benchmarks rely on scaling laws, but the impact of corpus and QA design on precision and recall in domain-specific LLMs is unexplored.
Method: Proposes Comp-Comp, a framework balancing comprehensiveness (semantic recall) and compactness (precision) for corpus and QA set construction.
Result: Validated with XUBench, a large-scale closed-domain benchmark in academia, demonstrating the framework’s effectiveness.
Conclusion: Comp-Comp is extensible beyond academia, offering insights for domain-specific benchmark construction.
Abstract: Numerous benchmarks have been built to evaluate the domain-specific abilities of large language models (LLMs), highlighting the need for effective and efficient benchmark construction. Existing domain-specific benchmarks primarily focus on the scaling law, relying on massive corpora for supervised fine-tuning or generating extensive question sets for broad coverage. However, the impact of corpus and question-answer (QA) set design on the precision and recall of domain-specific LLMs remains unexplored. In this paper, we address this gap and demonstrate that the scaling law is not always the optimal principle for benchmark construction in specific domains. Instead, we propose Comp-Comp, an iterative benchmarking framework based on a comprehensiveness-compactness principle. Here, comprehensiveness ensures semantic recall of the domain, while compactness enhances precision, guiding both corpus and QA set construction. To validate our framework, we conducted a case study at a renowned university, resulting in the creation of XUBench, a large-scale and comprehensive closed-domain benchmark. Although we use the academic domain as the case in this work, our Comp-Comp framework is designed to be extensible beyond academia, providing valuable insights for benchmark construction across various domains.
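To make the comprehensiveness-compactness trade-off concrete, here is a toy greedy selection loop over QA-item embeddings (our reading of the principle, not the authors' implementation): an item is skipped if it nearly duplicates an already selected one (compactness), and selection stops once most of the corpus lies close to some selected item (comprehensiveness).

```python
import numpy as np

def comp_comp_select(emb, dup_thresh=0.98, cover_thresh=0.9, target=0.95):
    """Greedy sketch; emb is (n, d) with L2-normalized rows."""
    selected = []
    for i in range(len(emb)):
        if selected and float(np.max(emb[selected] @ emb[i])) > dup_thresh:
            continue                      # near-duplicate: hurts compactness
        selected.append(i)
        covered = (emb @ emb[selected].T).max(axis=1) > cover_thresh
        if covered.mean() >= target:      # semantic recall of the domain
            break
    return selected

# Toy corpus: 10 semantic clusters of 20 near-duplicate items each.
rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 32))
emb = np.repeat(centers, 20, axis=0) + 0.05 * rng.normal(size=(200, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
picked = comp_comp_select(emb)
print(f"{len(picked)} representatives kept out of 200 items")
```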
[295] MoSE: Skill-by-Skill Mixture-of-Experts Learning for Embodied Autonomous Machines
Lu Xu, Jiaqian Yu, Xiongfeng Peng, Yiwei Chen, Weiming Li, Jaewook Yoo, Sunghyun Chunag, Dongwook Lee, Daehyun Ji, Chao Zhang
Main category: cs.AI
TL;DR: The paper introduces MoSE, a skill-oriented Mixture-of-Expert (MoE) method for embodied AI, improving reasoning and learning efficiency by mimicking human skill-by-skill learning.
Details
Motivation: To address the limitations of general MoE models in embodied AI (e.g., autonomous driving and robotics) due to high data and optimization demands.
Method: Proposes MoSE with a skill-oriented routing mechanism, hierarchical skill dataset, and pretrained router for step-by-step reasoning. Integrates auxiliary tasks efficiently.
Result: MoSE outperforms models in AD corner-case and robot reasoning tasks with fewer parameters (under 3B).
Conclusion: MoSE offers a scalable, efficient solution for embodied AI, enhancing reasoning and learning with minimal computational overhead.
Abstract: To meet the growing demand for smarter, faster, and more efficient embodied AI solutions, we introduce a novel Mixture-of-Expert (MoE) method that significantly boosts reasoning and learning efficiency for embodied autonomous systems. General MoE models demand extensive training data and complex optimization, which limits their applicability in embodied AI such as autonomous driving (AD) and robotic manipulation. In this work, we propose a skill-oriented MoE called MoSE, which mimics the human learning and reasoning process skill-by-skill, step-by-step. We introduce a skill-oriented routing mechanism that begins with defining and annotating specific skills, enabling experts to identify the necessary competencies for various scenarios and reasoning tasks, thereby facilitating skill-by-skill learning. To better align with multi-step planning in human reasoning and in end-to-end driving models, we build a hierarchical skill dataset and pretrain the router to encourage the model to think step-by-step. Unlike other multi-round dialogues, MoSE integrates valuable auxiliary tasks (e.g. perception-prediction-planning for AD, and high-level and low-level planning for robots) in one single forward process without introducing any extra computational cost. With less than 3B sparsely activated parameters, our model effectively grows more diverse expertise and outperforms models on both AD corner-case reasoning tasks and robot reasoning tasks with less than 40% of the parameters.
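As a rough illustration of skill-oriented routing (the hard top-1 dispatch and all dimensions below are our assumptions, not the released architecture), this sketch routes each hidden state to a single skill expert via a router that would be pretrained on skill annotations:

```python
import torch
import torch.nn as nn

class SkillMoE(nn.Module):
    def __init__(self, dim, num_skills):
        super().__init__()
        self.router = nn.Linear(dim, num_skills)   # would be pretrained on skill labels
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_skills))

    def forward(self, h):                          # h: (batch, dim)
        skill = self.router(h).argmax(dim=-1)      # hard, sparse routing
        out = torch.zeros_like(h)
        for k, expert in enumerate(self.experts):  # only matching rows run expert k
            mask = skill == k
            if mask.any():
                out[mask] = expert(h[mask])
        return out, skill

# e.g. skills = perception / prediction / planning / control
layer = SkillMoE(dim=64, num_skills=4)
out, skill = layer(torch.randn(8, 64))
print(out.shape, skill.tolist())
```

Because only one expert runs per input, adding skills grows capacity without growing the per-step compute, which matches the sparse-activation framing above.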
[296] StepFun-Prover Preview: Let’s Think and Verify Step by Step
Shijie Shang, Ruosi Wan, Yue Peng, Yutong Wu, Xiong-hui Chen, Jie Yan, Xiangyu Zhang
Main category: cs.AI
TL;DR: StepFun-Prover Preview is a language model for theorem proving, achieving 70% success on miniF2F-test using reinforcement learning and tool-integrated reasoning.
Details
Motivation: To advance automated theorem proving by emulating human-like problem-solving with tool-integrated reasoning.
Method: Uses a reinforcement learning pipeline with tool-based interactions to iteratively refine proofs.
Result: Achieves a 70.0% pass@1 success rate on the miniF2F-test benchmark.
Conclusion: Introduces a framework for tool-integrated reasoning models, advancing automated theorem proving and Math AI assistants.
Abstract: We present StepFun-Prover Preview, a large language model designed for formal theorem proving through tool-integrated reasoning. Using a reinforcement learning pipeline that incorporates tool-based interactions, StepFun-Prover can achieve strong performance in generating Lean 4 proofs with minimal sampling. Our approach enables the model to emulate human-like problem-solving strategies by iteratively refining proofs based on real-time environment feedback. On the miniF2F-test benchmark, StepFun-Prover achieves a pass@1 success rate of 70.0%. Beyond advancing benchmark performance, we introduce an end-to-end training framework for developing tool-integrated reasoning models, offering a promising direction for automated theorem proving and Math AI assistants.
[297] One Subgoal at a Time: Zero-Shot Generalization to Arbitrary Linear Temporal Logic Requirements in Multi-Task Reinforcement Learning
Zijian Guo, İlker Işık, H. M. Sabbir Ahmad, Wenchao Li
Main category: cs.AI
TL;DR: GenZ-LTL enables zero-shot generalization to arbitrary LTL specifications by decomposing tasks into reach-avoid subgoals and solving them sequentially.
Details
Motivation: Addressing limitations in handling nested long-horizon tasks and safety constraints in RL, and improving generalization to unseen LTL specifications.
Method: Leverages Büchi automata to decompose LTL specifications into reach-avoid subgoals, solving them one at a time with safe RL formulations and observation reduction.
Result: Outperforms existing methods in zero-shot generalization to unseen LTL specifications.
Conclusion: GenZ-LTL provides an effective approach for generalizing complex RL tasks with LTL specifications.
Abstract: Generalizing to complex and temporally extended task objectives and safety constraints remains a critical challenge in reinforcement learning (RL). Linear temporal logic (LTL) offers a unified formalism to specify such requirements, yet existing methods are limited in their abilities to handle nested long-horizon tasks and safety constraints, and cannot identify situations when a subgoal is not satisfiable and an alternative should be sought. In this paper, we introduce GenZ-LTL, a method that enables zero-shot generalization to arbitrary LTL specifications. GenZ-LTL leverages the structure of Büchi automata to decompose an LTL task specification into sequences of reach-avoid subgoals. Contrary to the current state-of-the-art method that conditions on subgoal sequences, we show that it is more effective to achieve zero-shot generalization by solving these reach-avoid problems one subgoal at a time through proper safe RL formulations. In addition, we introduce a novel subgoal-induced observation reduction technique that can mitigate the exponential complexity of subgoal-state combinations under realistic assumptions. Empirical results show that GenZ-LTL substantially outperforms existing methods in zero-shot generalization to unseen LTL specifications.
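The subgoal-at-a-time control loop is easy to sketch. Below, an LTL task is assumed to be already compiled, via its Büchi automaton, into an ordered list of (reach, avoid) label sets; the corridor environment and greedy policy are toy stand-ins for a trained goal-conditioned safe-RL policy:

```python
class ToyEnv:
    """1-D corridor; the position index determines which labels hold."""
    labels_at = {3: {"a"}, 5: {"c"}, 6: {"b"}}

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += action
        return self.pos, self.labels_at.get(self.pos, set())

def greedy_policy(obs, reach, avoid):
    """Stand-in for a trained goal-conditioned safe-RL policy."""
    target = {"a": 3, "b": 6}[next(iter(reach))]
    step = 1 if target > obs else -1
    if ToyEnv.labels_at.get(obs + step, set()) & avoid:
        step *= 2                         # hop over the unsafe cell
    return step

def run_ltl_task(env, policy, subgoals, max_steps=100):
    """subgoals: ordered (reach, avoid) pairs from the Buchi decomposition."""
    obs = env.reset()
    for reach, avoid in subgoals:
        for _ in range(max_steps):
            obs, labels = env.step(policy(obs, reach, avoid))
            if labels & avoid:
                return False              # safety violation
            if labels & reach:
                break                     # subgoal satisfied, take the next
        else:
            return False                  # subgoal not reached within budget
    return True

# "reach a, then reach b, always avoiding c"
print(run_ltl_task(ToyEnv(), greedy_policy, [({"a"}, {"c"}), ({"b"}, {"c"})]))
```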
[298] LLM Robustness Leaderboard v1 – Technical report
Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe
Main category: cs.AI
TL;DR: PRISM Eval introduces BET for automated red-teaming of LLMs, achieving high attack success and proposing fine-grained robustness metrics.
Details
Motivation: To assess and improve LLM robustness by identifying vulnerabilities through adversarial testing.
Method: Uses Dynamic Adversarial Optimization for automated red-teaming and introduces fine-grained metrics for attack difficulty.
Result: Achieved 100% ASR against 37 of 41 LLMs, with attack difficulty varying 300-fold across models.
Conclusion: Demonstrates practical pathways for distributed robustness assessment and highlights universal LLM vulnerabilities.
Abstract: This technical report accompanies the LLM robustness leaderboard published by PRISM Eval for the Paris AI Action Summit. We introduce PRISM Eval Behavior Elicitation Tool (BET), an AI system performing automated red-teaming through Dynamic Adversarial Optimization that achieves 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs. Beyond binary success metrics, we propose a fine-grained robustness metric estimating the average number of attempts required to elicit harmful behaviors, revealing that attack difficulty varies by over 300-fold across models despite universal vulnerability. We introduce primitive-level vulnerability analysis to identify which jailbreaking techniques are most effective for specific hazard categories. Our collaborative evaluation with trusted third parties from the AI Safety Network demonstrates practical pathways for distributed robustness assessment across the community.
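The attempts-based robustness metric is straightforward to compute from red-teaming logs. A back-of-envelope sketch with invented numbers, charging censored runs the full attempt budget:

```python
import numpy as np

def mean_attempts_to_success(attempts, budget):
    """attempts: per-run attempt counts, or None if no success within budget."""
    return float(np.mean([budget if a is None else a for a in attempts]))

model_a = [1, 2, 1, 3, 1]               # elicited almost immediately
model_b = [40, None, 120, 75, None]     # far more resistant, some runs censored
print(mean_attempts_to_success(model_a, budget=500))   # 1.6
print(mean_attempts_to_success(model_b, budget=500))   # 247.0
```

Unlike a binary ASR, this statistic separates models that all eventually break but differ enormously in how hard they are to break.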
[299] Large Language Models Do Not Simulate Human Psychology
Sarah Schröder, Thekla Morgenroth, Ulrike Kuhl, Valerie Vaquet, Benjamin Paaßen
Main category: cs.AI
TL;DR: The paper argues against using LLMs like ChatGPT to simulate human psychology in research, highlighting conceptual and empirical flaws in this approach.
Details
Motivation: To caution researchers against relying on LLMs as substitutes for human participants in psychological studies due to their unreliability.
Method: The authors provide conceptual arguments and empirical evidence, testing LLMs’ responses to nuanced wording changes and comparing them to human responses.
Result: LLMs show significant discrepancies from human responses, even when fine-tuned (e.g., CENTAUR model), and exhibit inconsistency across models.
Conclusion: LLMs do not simulate human psychology and should be validated against human responses for each new application, treated as unreliable tools.
Abstract: Large Language Models (LLMs), such as ChatGPT, are increasingly used in research, ranging from simple writing assistance to complex data annotation tasks. Recently, some research has suggested that LLMs may even be able to simulate human psychology and can, hence, replace human participants in psychological studies. We caution against this approach. We provide conceptual arguments against the hypothesis that LLMs simulate human psychology. We then present empirical evidence illustrating our arguments by demonstrating that slight changes to wording that correspond to large changes in meaning lead to notable discrepancies between LLMs’ and human responses, even for the recent CENTAUR model that was specifically fine-tuned on psychological responses. Additionally, different LLMs show very different responses to novel items, further illustrating their lack of reliability. We conclude that LLMs do not simulate human psychology and recommend that psychological researchers should treat LLMs as useful but fundamentally unreliable tools that need to be validated against human responses for every new application.
[300] Aryabhata: An exam-focused language model for JEE Math
Ritvik Rastogi, Sachin Dharashivkar, Sandeep Varma
Main category: cs.AI
TL;DR: Aryabhata 1.0 is a 7B parameter math reasoning model optimized for the JEE exam, combining fine-tuning and reinforcement learning to outperform existing models in accuracy and efficiency.
Details
Motivation: Current LLMs are often unsuitable for educational use, prompting the development of a specialized model for Indian academic exams.
Method: Built by merging open-weight reasoning models, fine-tuned with curriculum learning on verified CoT traces, and enhanced with RLVR and novel exploration strategies.
Result: Outperforms existing models on JEE Main 2025 and other benchmarks, providing pedagogically useful reasoning steps.
Conclusion: Aryabhata 1.0 is released as an open-source foundation model to improve exam-centric learning outcomes.
Abstract: We present Aryabhata 1.0, a compact 7B parameter math reasoning model optimized for the Indian academic exam, the Joint Entrance Examination (JEE). Despite rapid progress in large language models (LLMs), current models often remain unsuitable for educational use. Aryabhata 1.0 is built by merging strong open-weight reasoning models, followed by supervised fine-tuning (SFT) with curriculum learning on verified chain-of-thought (CoT) traces curated through best-of-$n$ rejection sampling. To further boost performance, we apply reinforcement learning with verifiable rewards (RLVR) using an A2C objective with group-relative advantage estimation along with novel exploration strategies such as Adaptive Group Resizing and Temperature Scaling. Evaluated on both in-distribution (JEE Main 2025) and out-of-distribution (MATH, GSM8K) benchmarks, Aryabhata outperforms existing models in accuracy and efficiency, while offering pedagogically useful step-by-step reasoning. We release Aryabhata as a foundation model to advance exam-centric, open-source small language models. This marks our first open release for community feedback (https://huggingface.co/PhysicsWallahAI/Aryabhata-1.0); PW is actively training future models to further improve learning outcomes for students.
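The best-of-$n$ rejection sampling used to curate verified CoT traces reduces to a short loop. A sketch under invented interfaces: `generate` stands in for any sampling LLM call, and the answer-extraction regex is an assumption for illustration.

```python
import re, random

def curate_cot(question, gold_answer, generate, n=8):
    """Keep only sampled CoT traces whose final answer verifies."""
    kept = []
    for _ in range(n):
        trace = generate(question)                   # one sampled CoT trace
        m = re.search(r"Answer:\s*(-?\d+)", trace)   # hypothetical answer format
        if m and m.group(1) == str(gold_answer):
            kept.append(trace)                       # verified, keep for SFT
    return kept

# Toy generator: usually right, sometimes wrong, so rejection matters.
def toy_generate(q):
    ans = 42 if random.random() > 0.3 else 41
    return f"Reasoning about {q}... Answer: {ans}"

random.seed(0)
print(len(curate_cot("6*7?", 42, toy_generate)), "of 8 traces kept")
```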
[301] SMA: Who Said That? Auditing Membership Leakage in Semi-Black-box RAG Controlling
Shixuan Sun, Siyuan Liang, Ruoyu Chen, Jianjie Huang, Jingzhi Li, Xiaochun Cao
Main category: cs.AI
TL;DR: The paper introduces Source-aware Membership Audit (SMA) to attribute generated content to its sources in retrieval-augmented systems, addressing privacy leakage concerns.
Details
Motivation: Existing methods fail to reliably attribute outputs in RAG/MRAG systems, undermining accountability for privacy leaks.
Method: SMA uses zero-order optimization for attribution estimation and cross-modal techniques for image-to-text attribution.
Result: SMA enables fine-grained source attribution and membership inference in semi-black-box settings.
Conclusion: SMA shifts focus to content sourcing, offering a new perspective for auditing data provenance in generative systems.
Abstract: Retrieval-Augmented Generation (RAG) and its Multimodal Retrieval-Augmented Generation (MRAG) significantly improve the knowledge coverage and contextual understanding of Large Language Models (LLMs) by introducing external knowledge sources. However, retrieval and multimodal fusion obscure content provenance, rendering existing membership inference methods unable to reliably attribute generated outputs to pre-training, external retrieval, or user input, thus undermining privacy leakage accountability. To address these challenges, we propose the first Source-aware Membership Audit (SMA) that enables fine-grained source attribution of generated content in a semi-black-box setting with retrieval control capabilities. To address the environmental constraints of semi-black-box auditing, we further design an attribution estimation mechanism based on zero-order optimization, which robustly approximates the true influence of input tokens on the output through large-scale perturbation sampling and ridge regression modeling. In addition, SMA introduces a cross-modal attribution technique that projects image inputs into textual descriptions via MLLMs, enabling token-level attribution in the text modality, which for the first time facilitates membership inference on image retrieval traces in MRAG systems. This work shifts the focus of membership inference from ‘whether the data has been memorized’ to ‘where the content is sourced from’, offering a novel perspective for auditing data provenance in complex generative systems.
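The zero-order attribution mechanism can be sketched in a few lines (our simplification of the idea): sample random token-drop masks, record how a scalar output score changes, and fit a ridge regression of the score on the mask, so the coefficients approximate per-token influence without any gradients or model weights.

```python
import numpy as np

def zero_order_attribution(tokens, score_fn, n_samples=500, alpha=1.0, seed=0):
    """Perturbation-sampling attribution; score_fn is a black box."""
    rng = np.random.default_rng(seed)
    d = len(tokens)
    masks = rng.integers(0, 2, size=(n_samples, d))      # 1 = keep the token
    y = np.array([score_fn([t for t, m in zip(tokens, row) if m])
                  for row in masks])
    X = masks.astype(float)
    # ridge regression: w = (X^T X + alpha I)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
    return dict(zip(tokens, w))

# Toy black box: the score depends only on whether "paris" survives the mask.
score = lambda toks: 1.0 if "paris" in toks else 0.0
attr = zero_order_attribution(["the", "capital", "is", "paris"], score)
print(max(attr, key=attr.get))   # -> "paris"
```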
cs.SD
[302] OSUM-EChat: Enhancing End-to-End Empathetic Spoken Chatbot via Understanding-Driven Spoken Dialogue
Xuelong Geng, Qijie Shao, Hongfei Xue, Shuiyuan Wang, Hanke Xie, Zhao Guo, Yi Zhao, Guojian Li, Wenjie Tian, Chengyou Wang, Zhixian Zhao, Kangxiang Xia, Ziyu Zhang, Zhennan Lin, Tianlun Zuo, Mingchen Shao, Yuang Cao, Guobin Ma, Longhao Li, Yuhang Dai, Dehui Gao, Dake Guo, Lei Xie
Main category: cs.SD
TL;DR: OSUM-EChat is an open-source, end-to-end spoken dialogue system designed to improve empathetic interactions by integrating paralinguistic cues and reducing reliance on large datasets.
Details
Motivation: Empathy is essential for natural interactions in dialogue systems, but current models struggle with paralinguistic cue extraction, dataset reliance, and lack of empathy-specific frameworks.
Method: Introduces a three-stage training strategy and a dual thinking mechanism combining linguistic and paralinguistic understanding for empathetic dialogue generation.
Result: OSUM-EChat outperforms existing models in empathetic responsiveness, validated by the EChat-200K dataset and EChat-eval benchmark.
Conclusion: OSUM-EChat effectively enhances empathetic interactions in dialogue systems, addressing key challenges with innovative methods and resources.
Abstract: Empathy is crucial in enabling natural interactions within spoken dialogue systems, allowing machines to recognize and respond appropriately to paralinguistic cues such as age, gender, and emotion. Recent advancements in end-to-end speech language models, which unify speech understanding and generation, provide promising solutions. However, several challenges persist, including an over-reliance on large-scale dialogue datasets, insufficient extraction of paralinguistic cues vital for conveying empathy, and the lack of empathy-specific datasets and evaluation frameworks. To address these issues, we introduce OSUM-EChat, an open-source, end-to-end spoken dialogue system designed to enhance empathetic interactions, particularly in resource-limited settings. OSUM-EChat introduces two key innovations: (1) a three-stage understanding-driven spoken dialogue training strategy that extends the capabilities of a large speech understanding model to spoken dialogue tasks, and (2) a linguistic-paralinguistic dual thinking mechanism that integrates paralinguistic understanding through a chain of thought with dialogue generation, enabling the system to produce more empathetic responses. This approach reduces reliance on large-scale dialogue datasets while maintaining high-quality empathetic interactions. Additionally, we introduce the EChat-200K dataset, a rich corpus of empathetic speech-to-speech dialogues, and the EChat-eval benchmark, a comprehensive framework for evaluating the empathetic capabilities of dialogue systems. Experimental results demonstrate that OSUM-EChat outperforms end-to-end spoken dialogue models regarding empathetic responsiveness, validating its effectiveness.
[303] MetaGuardian: Enhancing Voice Assistant Security through Advanced Acoustic Metamaterials
Zhiyuan Ning, Zheng Wang, Zhanyong Tang
Main category: cs.SD
TL;DR: MetaGuardian is a VA protection system using acoustic metamaterials to block inaudible, adversarial, and laser attacks without software or hardware changes.
Details
Motivation: To defend voice assistants against diverse attacks while maintaining usability and avoiding additional software/hardware dependencies.
Method: Leverages mutual impedance effects in metamaterials for wide-band signal filtering (16-40 kHz) and a coiled space structure to disrupt adversarial attacks.
Result: Achieves high defense success rates against adversarial, inaudible, and laser attacks in controlled evaluations.
Conclusion: MetaGuardian provides effective, universal protection for VAs with a balance of portability and performance.
Abstract: We present MetaGuardian, a voice assistant (VA) protection system based on acoustic metamaterials. MetaGuardian can be directly integrated into the enclosures of various smart devices, effectively defending against inaudible, adversarial and laser attacks without relying on additional software support or altering the underlying hardware, ensuring usability. To achieve this, MetaGuardian leverages the mutual impedance effects between metamaterial units to extend the signal filtering range to 16-40 kHz to effectively block wide-band inaudible attacks. Additionally, it adopts a carefully designed coiled space structure to precisely interfere with adversarial attacks while ensuring the normal functioning of VAs. Furthermore, MetaGuardian offers a universal structural design, allowing itself to be flexibly adapted to various smart devices, striking a balance between portability and protection effectiveness. In controlled evaluation environments, MetaGuardian achieves a high defense success rate against various attack types, including adversarial, inaudible and laser attacks.
[304] HingeNet: A Harmonic-Aware Fine-Tuning Approach for Beat Tracking
Ganghui Ru, Jieying Wang, Jiahao Zhao, Yulun Wu, Yi Yu, Nannan Jiang, Wei Wang, Wei Li
Main category: cs.SD
TL;DR: HingeNet is a parameter-efficient fine-tuning method for beat tracking, leveraging pre-trained models and harmonic-aware mechanisms to achieve state-of-the-art results.
Details
Motivation: Limited annotated data makes conventional fine-tuning ineffective for beat tracking tasks, prompting the need for a specialized method.
Method: HingeNet, a lightweight and separable network, interfaces with pre-trained models using intermediate features and incorporates harmonic-aware mechanisms.
Result: HingeNet achieves state-of-the-art performance in beat and downbeat tracking on benchmark datasets.
Conclusion: HingeNet effectively addresses the challenges of fine-tuning for beat tracking, offering generalizability and superior performance.
Abstract: Fine-tuning pre-trained foundation models has made significant progress in music information retrieval. However, applying these models to beat tracking tasks remains unexplored as the limited annotated data renders conventional fine-tuning methods ineffective. To address this challenge, we propose HingeNet, a novel and general parameter-efficient fine-tuning method specifically designed for beat tracking tasks. HingeNet is a lightweight and separable network, visually resembling a hinge, designed to tightly interface with pre-trained foundation models by using their intermediate feature representations as input. This unique architecture grants HingeNet broad generalizability, enabling effective integration with various pre-trained foundation models. Furthermore, considering the significance of harmonics in beat tracking, we introduce a harmonic-aware mechanism during the fine-tuning process to better capture and emphasize the harmonic structures in musical signals. Experiments on benchmark datasets demonstrate that HingeNet achieves state-of-the-art performance in beat and downbeat tracking.
[305] BeatFM: Improving Beat Tracking with Pre-trained Music Foundation Model
Ganghui Ru, Jieying Wang, Jiahao Zhao, Yulun Wu, Yi Yu, Nannan Jiang, Wei Wang, Wei Li
Main category: cs.SD
TL;DR: Proposes BeatFM, a novel beat tracking method using a pre-trained music foundation model and multi-dimensional semantic aggregation to improve performance.
Details
Motivation: Addresses challenges in beat tracking due to limited labeled data and difficulty generalizing across musical styles.
Method: Leverages a pre-trained music foundation model and introduces a plug-and-play multi-dimensional semantic aggregation module (temporal, frequency, channel domains).
Result: Achieves state-of-the-art performance in beat and downbeat tracking on multiple datasets.
Conclusion: BeatFM effectively overcomes data scarcity and generalization issues, advancing beat tracking performance.
Abstract: Beat tracking is a widely researched topic in music information retrieval. However, current beat tracking methods face challenges due to the scarcity of labeled data, which limits their ability to generalize across diverse musical styles and accurately capture complex rhythmic structures. To overcome these challenges, we propose a novel beat tracking paradigm BeatFM, which introduces a pre-trained music foundation model and leverages its rich semantic knowledge to improve beat tracking performance. Pre-training on diverse music datasets endows music foundation models with a robust understanding of music, thereby effectively addressing these challenges. To further adapt it for beat tracking, we design a plug-and-play multi-dimensional semantic aggregation module, which is composed of three parallel sub-modules, each focusing on semantic aggregation in the temporal, frequency, and channel domains, respectively. Extensive experiments demonstrate that our method achieves state-of-the-art performance in beat and downbeat tracking across multiple benchmark datasets.
[306] Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions
Tina Raissi, Nick Rossenbach, Ralf Schlüter
Main category: cs.SD
TL;DR: The paper compares ASR architectures under domain mismatch, highlighting the impact of specific modeling choices over decoder architecture or modular vs. seq2seq distinctions.
Details
Motivation: To understand how different ASR architectures and modeling choices perform under domain shift, isolating language domain effects from acoustic variation.
Method: Analyze ASR architectures (modular and seq2seq) with varied modeling choices (label units, context length, topology). Use synthesized target domain audio and incorporate domain-adapted language models without retraining the acoustic model.
Result: Specific modeling choices, not decoder architecture or modular vs. seq2seq distinctions, influence performance under domain shift.
Conclusion: The study provides insights into ASR generalization under domain shift, emphasizing the importance of specific modeling choices.
Abstract: We analyze automatic speech recognition (ASR) modeling choices under domain mismatch, comparing classic modular and novel sequence-to-sequence (seq2seq) architectures. Across the different ASR architectures, we examine a spectrum of modeling choices, including label units, context length, and topology. To isolate language domain effects from acoustic variation, we synthesize target domain audio using a text-to-speech system trained on LibriSpeech. We incorporate target domain n-gram and neural language models for domain adaptation without retraining the acoustic model. To our knowledge, this is the first controlled comparison of optimized ASR systems across state-of-the-art architectures under domain shift, offering insights into their generalization. The results show that, under domain shift, rather than the decoder architecture choice or the distinction between classic modular and novel seq2seq models, it is specific modeling choices that influence performance.
[307] A Comparative Analysis on ASR System Combination for Attention, CTC, Factored Hybrid, and Transducer Models
Noureldin Bayoumi, Robin Schmitt, Tina Raissi, Albert Zeyer, Ralf Schlüter, Hermann Ney
Main category: cs.SD
TL;DR: The paper compares model combination techniques in ASR systems, focusing on leveraging complementary strengths of different architectures through a two-pass rescoring method.
Details
Motivation: To improve ASR performance by combining models with diverse strengths while ensuring consistent comparisons.
Method: Rescores a joint hypothesis list of two model candidates using log-linear combination of sequence-level scores.
Result: Evaluated on Librispeech 960h, showing improved performance through consistent two-pass combination.
Conclusion: The two-pass method effectively combines models and ensures fair comparison, enhancing ASR system performance.
Abstract: Combination approaches for speech recognition (ASR) systems cover structured sentence-level or word-based merging techniques as well as combination of model scores during beam search. In this work, we compare model combination across popular ASR architectures. Our method leverages the complementary strengths of different models in exploring diverse portions of the search space. We rescore a joint hypothesis list of two model candidates. We then identify the best hypothesis through log-linear combination of these sequence-level scores. While model combination during first-pass recognition may yield improved performance, it introduces variability due to differing decoding methods, making direct comparison more challenging. Our two-pass method ensures consistent comparisons across all system combination results presented in this study. We evaluate model pair candidates with varying architectures, label topologies, and label units. Experimental results are provided for the Librispeech 960h task.
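Once each system exports sequence-level log scores for its n-best list, the second pass is a few lines. A minimal sketch with invented scores; the floor value penalizes hypotheses that only one system produced:

```python
def combine(nbest_a, nbest_b, lam=0.5):
    """nbest_*: dict mapping hypothesis text -> sequence log score."""
    pool = set(nbest_a) | set(nbest_b)    # joint hypothesis list
    FLOOR = -1e9                          # for hypotheses a system never proposed
    def joint(h):
        return lam * nbest_a.get(h, FLOOR) + (1 - lam) * nbest_b.get(h, FLOOR)
    return max(pool, key=joint)           # best under log-linear combination

ctc = {"the cat sat": -4.1, "the cats at": -4.3}
aed = {"the cat sat": -3.2, "the cat sad": -3.0}
print(combine(ctc, aed, lam=0.4))   # -> "the cat sat"
```

In this toy example the shared hypothesis wins precisely because both systems assign it a reasonable score, which is the complementarity the combination exploits.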
[308] A2SB: Audio-to-Audio Schrodinger Bridges
Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro
Main category: cs.SD
TL;DR: A2SB is an end-to-end audio restoration model for high-res music, excelling in bandwidth extension and inpainting without needing a vocoder.
Details
Motivation: Real-world audio often suffers degradation; this work aims to restore high-quality music efficiently.
Method: Uses Audio-to-Audio Schrödinger Bridges (A2SB) for end-to-end waveform prediction, handling hour-long inputs and trained on permissively licensed data.
Result: Achieves state-of-the-art performance in bandwidth extension and inpainting on out-of-distribution music test sets.
Conclusion: A2SB is a robust solution for high-quality audio restoration, scalable and effective for real-world applications.
Abstract: Real-world audio is often degraded by numerous factors. This work presents an audio restoration model tailored for high-res music at 44.1kHz. Our model, Audio-to-Audio Schrödinger Bridges (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end, requiring no vocoder to predict waveform outputs, able to restore hour-long audio inputs, and trained on permissively licensed music data. A2SB is capable of achieving state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets.
[309] Inversion of Arctic dual-channel sound speed profile based on random airgun signal
Jinbao Weng, Yubo Qi, Yanming Yang, Hongtao Wen, Hongtao Zhou, Benqing Chen, Dewei Xu, Ruichao Xue, Caigao Zeng
Main category: cs.SD
TL;DR: Proposes an inversion method for dual-channel sound speed profiles in the Arctic using refracted normal modes, with fewer parameters and faster speed.
Details
Motivation: Address the unique dual-channel sound speed profiles in the Canadian Basin and Chukchi Plateau, improving inversion efficiency and accuracy.
Method: Uses refracted normal modes, dual-parameter representation, and dispersion structure extraction for inversion, including horizontal variation handling.
Result: Effective inversion with fewer parameters, faster speed, and ability to use single hydrophone data. Solves horizontal variation issues.
Conclusion: The method is cost-effective, easy to deploy, and computationally efficient, outperforming previous approaches.
Abstract: For the unique dual-channel sound speed profiles of the Canadian Basin and the Chukchi Plateau in the Arctic, an inversion method based on refracted normal modes is proposed, building on the propagation characteristics of refracted normal modes under dual-channel sound speed profiles. The method represents a dual-channel profile with two parameters, tailored to its characteristics, and extracts the dispersion structure of refracted normal modes under such profiles; combining the parameter representation with the dispersion structure extraction yields the inversion procedure. For the horizontal variation of sound speed profiles that is common in long-distance acoustic propagation, a method for inverting horizontally varying dual-channel sound speed profiles is also proposed. Finally, this article verifies the effectiveness of the inversion method using an Arctic low-frequency long-range acoustic propagation experiment. Compared with previous sound speed profile inversion methods, the proposed method uses fewer inversion parameters and runs faster; it can be implemented with only a single hydrophone passively receiving random air gun signals, and it also solves the inversion problem for horizontally varying sound speed profiles. It offers significant advantages in low cost, easy deployment, and fast computation.
[310] Acoustic source depth estimation method based on a single hydrophone in Arctic underwater
Jinbao Weng, Yubo Qi, Yanming Yang, Hongtao Wen, Hongtao Zhou, Benqing Chen, Dewei Xu, Ruichao Xue, Caigao Zeng
Main category: cs.SD
TL;DR: The paper explores depth estimation methods for surface sound sources using normal modes and ray theory, proposing a method based on modal frequency limits and verifying its applicability with experimental data.
Details
Motivation: To address the challenge of accurately estimating the depth of sound sources in surface layers and deep Arctic seas by leveraging normal modes and ray theory.
Method: Utilizes warping transformation to separate modes, matches amplitude information for depth estimation, and analyzes ray trajectories for deep-sea applications.
Result: Proposes effective depth estimation methods validated by experimental data, demonstrating their applicability and limitations.
Conclusion: The study successfully develops and verifies methods for sound source depth estimation, highlighting their practical utility in different environments.
Abstract: Based on normal mode and ray theory, this article discusses the characteristics of surface sound sources received at the surface layer, explores depth estimation methods based on normal modes and rays, and proposes a depth estimation method based on the upper limit of modal frequency; experimental data are used to discuss the applicability and limitations of the different methods. For the surface refracted normal mode waveguide, modes can be separated through warping transformation, and the sound source depth can be estimated by matching amplitude information, based on how normal mode amplitude varies with frequency and mode number. Based on the spatial variation of the eigenfunctions with frequency, a sound source depth estimation method matching the cutoff frequency of normal modes is proposed. For the deep Arctic sea, the sound ray arrival structure at the receiving end is obtained through the analysis of deep inversion sound ray trajectories, and the sound source depth can be estimated by matching the time differences of ray arrivals. Experimental data are used to verify the sound field patterns and the effectiveness of the sound source depth estimation methods.
[311] Multi-Target Backdoor Attacks Against Speaker Recognition
Alexandrine Fortier, Sonal Joshi, Thomas Thebaud, Jesus Villalba Lopez, Najim Dehak, Patrick Cardinal
Main category: cs.SD
TL;DR: A multi-target backdoor attack using clicking sounds as triggers achieves high success rates in speaker identification and verification, with effectiveness varying based on noise conditions and speaker similarity.
Details
Motivation: To develop a more realistic and scalable backdoor attack targeting multiple speakers simultaneously, unlike previous single-target methods.
Method: Uses position-independent clicking sounds as triggers, varying signal-to-noise ratios, and leverages cosine similarity for speaker verification.
Result: Achieves up to 95.04% success in speaker identification and 90% in verification for highly similar speakers.
Conclusion: The attack is highly effective, especially under realistic conditions, and highlights vulnerabilities in speaker identification systems.
Abstract: In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. Unlike previous single-target approaches, our method targets up to 50 speakers simultaneously, achieving success rates of up to 95.04%. To simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the most similar training speaker - based on cosine similarity - as a proxy target. The attack is most effective when target and enrolled speaker pairs are highly similar, reaching success rates of up to 90% in such cases.
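The trigger-injection step implied here is simple to sketch (illustrative, not the authors' code): overlay a short click at a random offset, scaled so the speech-to-trigger SNR matches a chosen stealth level.

```python
import numpy as np

def inject_click(speech, click, snr_db, rng=np.random.default_rng(0)):
    """Overlay a click trigger at a random position with a target SNR (dB)."""
    p_speech = np.mean(speech ** 2)
    p_click = np.mean(click ** 2)
    # scale the trigger so that p_speech / (scale^2 * p_click) = 10^(snr_db/10)
    scale = np.sqrt(p_speech / (p_click * 10 ** (snr_db / 10)))
    start = rng.integers(0, len(speech) - len(click))   # position-independent
    poisoned = speech.copy()
    poisoned[start:start + len(click)] += scale * click
    return poisoned

sr = 16000
speech = 0.1 * np.random.default_rng(1).standard_normal(sr)  # 1 s stand-in signal
click = np.hanning(160) * np.sin(2 * np.pi * 4000 * np.arange(160) / sr)
poisoned = inject_click(speech, click, snr_db=20.0)  # higher SNR = stealthier trigger
print(poisoned.shape)
```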
[312] DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu
Main category: cs.SD
TL;DR: The paper proposes USTokenizer and DualSpeechLM to unify speech understanding and generation in LLMs, addressing modality gaps and task divergence.
Details
Motivation: Challenges in extending text LLMs to speech include modality gaps and divergent task requirements (understanding vs. generation).
Method: Introduces USTokenizer for semantic tokenization and DualSpeechLM, a dual-token framework, with semantic supervision loss and CoC strategy.
Result: The approach effectively integrates understanding and generation, showing mutual enhancement in tasks.
Conclusion: The proposed methods successfully unify speech tasks in LLMs, offering a promising strategy for future work.
Abstract: Extending the speech understanding and generation abilities of pre-trained Large Language Models (LLMs) by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) Due to the huge modality gap between speech tokens and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) Generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence leads to difficult performance optimization in one unified model. To solve these challenges, in this paper, we present two key insights in speech tokenization and speech language modeling. Specifically, we first propose an Understanding-driven Speech Tokenizer (USTokenizer), which extracts high-level semantic information essential for accomplishing understanding tasks using text LLMs. In this way, USToken enjoys better modality commonality with text, which reduces the difficulty of modality alignment in adapting text LLMs to speech LLMs. Secondly, we present DualSpeechLM, a dual-token modeling framework that concurrently models USToken as input and acoustic token as output within a unified, end-to-end framework, seamlessly integrating speech understanding and generation capabilities. Furthermore, we propose a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy to stabilize model training and enhance speech generation performance. Experimental results demonstrate that our proposed approach effectively fosters a complementary relationship between understanding and generation tasks, highlighting the promising strategy of mutually enhancing both tasks in one unified model.
cs.LG
[313] Efficient Real-Time Aircraft ETA Prediction via Feature Tokenization Transformer
Liping Huang, Yicheng Zhang, Yifang Yin, Sheng Zhang, Yi Zhang
Main category: cs.LG
TL;DR: A Transformer-based model for real-time ETA prediction of airborne aircraft improves accuracy by 7% over XGBoost and reduces computing time by 61%.
Details
Motivation: Real-time ETA prediction is critical for efficient arrival management in aviation, requiring both accuracy and computational efficiency.
Method: Uses a feature tokenization-based Transformer model to process raw inputs (e.g., aircraft position, speed, weather) and leverages parallel computation for high-frequency updates (1 Hz).
Result: Outperforms XGBoost with 7% higher accuracy and 61% less computing time, achieving 51.7 microseconds inference time for 40 aircraft.
Conclusion: The proposed Transformer model is efficient and accurate, making it suitable for real-time arrival management systems.
Abstract: Estimated time of arrival (ETA) for airborne aircraft in real-time is crucial for arrival management in aviation, particularly for runway sequencing. Given the rapidly changing airspace context, the ETA prediction efficiency is as important as its accuracy in a real-time arrival aircraft management system. In this study, we utilize a feature tokenization-based Transformer model to efficiently predict aircraft ETA. Feature tokenization projects raw inputs to latent spaces, while the multi-head self-attention mechanism in the Transformer captures important aspects of the projections, alleviating the need for complex feature engineering. Moreover, the Transformer’s parallel computation capability allows it to handle ETA requests at a high frequency, i.e., 1 Hz, which is essential for a real-time arrival management system. The model inputs include raw data, such as aircraft latitude, longitude, ground speed, theta degree for the airport, day and hour from track data, the weather context, and aircraft wake turbulence category. With a data sampling rate of 1 Hz, the ETA prediction is updated every second. We apply the proposed aircraft ETA prediction approach to Singapore Changi Airport (ICAO Code: WSSS) using one-month Automatic Dependent Surveillance-Broadcast (ADS-B) data from October 1 to October 31, 2022. In the experimental evaluation, the ETA modeling covers all aircraft within a range of 10 NM to 300 NM from WSSS. The results show that our proposed method outperforms the commonly used boosting-tree-based model, improving accuracy by 7% compared to XGBoost, while requiring only 39% of its computing time. Experimental results also indicate that, with 40 aircraft in the airspace at a given timestamp, the ETA inference time is only 51.7 microseconds, making it promising for real-time arrival management systems.
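A feature-tokenization Transformer for this kind of regression can be sketched compactly (dimensions and the feature list are illustrative, not the paper's exact configuration): each raw feature becomes one learned token, self-attention mixes the tokens, and a small head regresses the ETA.

```python
import torch
import torch.nn as nn

class FTTransformerETA(nn.Module):
    def __init__(self, n_features=8, dim=64, heads=4, layers=2):
        super().__init__()
        # one learned projection per feature: raw scalar -> token embedding
        self.tokenizers = nn.ModuleList(nn.Linear(1, dim) for _ in range(n_features))
        enc = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.head = nn.Linear(dim, 1)

    def forward(self, x):                     # x: (batch, n_features) raw values
        tokens = torch.stack(
            [tok(x[:, i:i + 1]) for i, tok in enumerate(self.tokenizers)], dim=1)
        h = self.encoder(tokens)              # (batch, n_features, dim)
        return self.head(h.mean(dim=1)).squeeze(-1)   # ETA, e.g. in seconds

# batch of 40 aircraft, features like lat, lon, ground speed, bearing, ...
model = FTTransformerETA()
eta = model(torch.randn(40, 8))
print(eta.shape)    # torch.Size([40])
```

Because a whole batch of aircraft is scored in one forward pass, this layout naturally supports the high-frequency, many-aircraft serving pattern described above.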
[314] MoLAN: A Unified Modality-Aware Noise Dynamic Editing Framework for Multimodal Sentiment Analysis
Xingle Xu, Yongkang Liu, Dexian Cai, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang
Main category: cs.LG
TL;DR: MoLAN is a framework for fine-grained noise suppression in multimodal sentiment analysis, dynamically adjusting denoising strength per feature block. MoLAN+ outperforms existing methods.
Details
Motivation: Existing approaches suppress noise in multimodal data but risk losing critical information by treating entire modalities as units.
Method: MoLAN divides features into blocks and dynamically assigns denoising strength based on noise level and relevance. MoLAN+ builds on this framework.
Result: MoLAN+ achieves state-of-the-art performance across five models and four datasets.
Conclusion: MoLAN is a flexible, effective framework for multimodal sentiment analysis, with MoLAN+ demonstrating superior results.
Abstract: Multimodal Sentiment Analysis aims to integrate information from various modalities, such as audio, visual, and text, to make complementary predictions. However, it often struggles with irrelevant or misleading visual and auditory information. Most existing approaches typically treat the entire modality information (e.g., a whole image, audio segment, or text paragraph) as an independent unit for feature enhancement or denoising. They often suppress the redundant and noise information at the risk of losing critical information. To address this challenge, we propose MoLAN, a unified ModaLity-aware noise dynAmic editiNg framework. Specifically, MoLAN performs modality-aware blocking by dividing the features of each modality into multiple blocks. Each block is then dynamically assigned a distinct denoising strength based on its noise level and semantic relevance, enabling fine-grained noise suppression while preserving essential multimodal information. Notably, MoLAN is a unified and flexible framework that can be seamlessly integrated into a wide range of multimodal models. Building upon this framework, we further introduce MoLAN+, a new multimodal sentiment analysis approach. Experiments across five models and four datasets demonstrate the broad effectiveness of the MoLAN framework. Extensive evaluations show that MoLAN+ achieves the state-of-the-art performance. The code is publicly available at https://github.com/betterfly123/MoLAN-Framework.
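A minimal sketch of modality-aware block-wise denoising as we read it (not the released code): split a modality's feature vector into blocks, score each block, and attenuate each block by its own strength rather than gating the whole modality at once.

```python
import torch
import torch.nn as nn

class BlockwiseDenoise(nn.Module):
    def __init__(self, dim, n_blocks):
        super().__init__()
        assert dim % n_blocks == 0
        self.n_blocks = n_blocks
        self.scorer = nn.Linear(dim // n_blocks, 1)     # per-block strength score

    def forward(self, feats):                 # feats: (batch, dim), one modality
        blocks = feats.view(feats.size(0), self.n_blocks, -1)
        strength = torch.sigmoid(self.scorer(blocks))   # (batch, n_blocks, 1)
        return (strength * blocks).reshape(feats.shape) # fine-grained suppression

x = torch.randn(4, 96)                        # e.g. audio features for 4 samples
print(BlockwiseDenoise(dim=96, n_blocks=8)(x).shape)
```

The per-block sigmoid gate is what lets noisy sub-regions be suppressed while informative blocks of the same modality pass through intact.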
[315] To Theoretically Understand Transformer-Based In-Context Learning for Optimizing CSMA
Shugang Hao, Hongbo Li, Lingjie Duan
Main category: cs.LG
TL;DR: The paper proposes an LLM transformer-based in-context learning (ICL) approach to optimize WiFi 7 channel access, addressing throughput issues caused by inaccurate node density estimation in dynamic environments.
Details
Motivation: Existing model-based backoff strategies perform poorly due to inaccurate node density estimation, leading to throughput loss. The paper aims to improve this using transformer-based ICL.
Method: A transformer-based ICL optimizer is designed to pre-collect collision-threshold data and query cases, forming a prompt for the transformer to predict contention window thresholds (CWT). An efficient training algorithm ensures near-optimal CWT prediction.
Result: The approach achieves minimal prediction and throughput deviations from optimal values, even with erroneous data. Experiments show fast convergence and near-optimal throughput under unknown node densities.
Conclusion: The transformer-based ICL method outperforms existing model-based and DRL-based approaches, offering a robust solution for dynamic channel environments.
Abstract: The binary exponential backoff scheme is widely used in WiFi 7 and still incurs poor throughput performance under dynamic channel environments. Recent model-based approaches (e.g., non-persistent and $p$-persistent CSMA) simply optimize backoff strategies under a known and fixed node density, still leading to a large throughput loss due to inaccurate node density estimation. This paper is the first to propose LLM transformer-based in-context learning (ICL) theory for optimizing channel access. We design a transformer-based ICL optimizer to pre-collect collision-threshold data examples and a query collision case. They are constructed as a prompt as the input for the transformer to learn the pattern, which then generates a predicted contention window threshold (CWT). To train the transformer for effective ICL, we develop an efficient algorithm and guarantee a near-optimal CWT prediction within limited training steps. As it may be hard to gather perfect data examples for ICL in practice, we further extend to allow erroneous data input in the prompt. We prove that our optimizer maintains minimal prediction and throughput deviations from the optimal values. Experimental results on NS-3 further demonstrate our approach’s fast convergence and near-optimal throughput over existing model-based and DRL-based approaches under unknown node densities.
[316] Motif 2.6B Technical Report
Junghwan Lim, Sungmin Lee, Dongseok Kim, Eunhwan Park, Hyunbyung Park, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Jihwan Kim, Minjae Kim, Taehwan Kim, Youngrok Kim, Haesol Lee, Jeesoo Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Daewon Suh, Dongjoo Weon
Main category: cs.LG
TL;DR: Motif-2.6B is a 2.6B-parameter LLM designed to balance performance and efficiency, featuring innovations like Differential Attention and PolyNorm. It outperforms similar models in benchmarks.
Details
Motivation: To democratize advanced LLM capabilities for emerging research groups by addressing the challenge of balancing performance and computational efficiency.
Method: Incorporates Differential Attention and PolyNorm activation functions, rigorously tested through extensive experimentation.
Result: Exceeds or matches state-of-the-art models in benchmarks, demonstrating effectiveness, scalability, and real-world applicability.
Conclusion: Motif-2.6B advances efficient, scalable foundational LLMs, providing insights and a foundation for future research.
Abstract: Recent advancements in Large Language Models (LLMs) have revolutionized artificial intelligence, yet developing an effective foundational LLM that balances high performance with computational efficiency remains challenging, especially for emerging research groups. To address this gap, we introduce Motif-2.6B, a 2.6-billion-parameter foundation model designed to democratize advanced LLM capabilities. Motif-2.6B incorporates several innovative architectural enhancements, including Differential Attention and PolyNorm activation functions, which improve long-context comprehension, reduce hallucination, and enhance in-context learning capabilities. We rigorously tested multiple novel architectural components through extensive experimentation to determine the optimal architecture for Motif-2.6B. Comprehensive evaluations demonstrate that Motif-2.6B consistently meets or exceeds the performance of similarly sized state-of-the-art models across diverse benchmarks, showcasing its effectiveness, scalability, and real-world applicability. Through detailed experiments and tailored techniques, Motif-2.6B significantly advances the landscape of efficient, scalable, and powerful foundational LLMs, offering valuable insights and a robust foundation for future research and deployment.
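Differential Attention is named but not defined in this summary; in the Differential Transformer formulation it is typically the difference of two softmax attention maps, which cancels common-mode attention noise. A hedged single-head sketch with a fixed λ (in practice λ is usually learned):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttentionHead(nn.Module):
    def __init__(self, dim, head_dim, lam=0.5):
        super().__init__()
        self.q = nn.Linear(dim, 2 * head_dim)    # two query groups
        self.k = nn.Linear(dim, 2 * head_dim)    # two key groups
        self.v = nn.Linear(dim, head_dim)
        self.lam = lam
        self.scale = head_dim ** -0.5

    def forward(self, x):                        # x: (batch, seq, dim)
        q1, q2 = self.q(x).chunk(2, dim=-1)
        k1, k2 = self.k(x).chunk(2, dim=-1)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        return (a1 - self.lam * a2) @ self.v(x)  # differential attention map

out = DiffAttentionHead(dim=64, head_dim=16)(torch.randn(2, 10, 64))
print(out.shape)   # torch.Size([2, 10, 16])
```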
[317] JustDense: Just using Dense instead of Sequence Mixer for Time Series analysis
TaekHyun Park, Yongjae Lee, Daesan Park, Dohee Kim, Hyerim Bae
Main category: cs.LG
TL;DR: JustDense replaces complex sequence mixers in time-series models with dense layers, showing comparable or better performance, challenging the need for intricate architectures.
Details
Motivation: Recent studies question the necessity of complex sequence mixers in time-series analysis (TSA), suggesting simpler architectures may suffice.
Method: JustDense substitutes sequence mixers with dense layers within the MatrixMixer framework, isolating mixing operations for clarity.
Result: Experiments on 29 benchmarks show dense layers match or outperform sequence mixers in most cases.
Conclusion: JustDense challenges the assumption that complex architectures are inherently superior in TSA, advocating for simplicity.
Abstract: Sequence and channel mixers, the core mechanism in sequence models, have become the de facto standard in time series analysis (TSA). However, recent studies have questioned the necessity of complex sequence mixers, such as attention mechanisms, demonstrating that simpler architectures can achieve comparable or even superior performance. This suggests that the benefits attributed to complex sequence mixers might instead emerge from other architectural or optimization factors. Based on this observation, we pose a central question: Are common sequence mixers necessary for time-series analysis? To answer it, we propose JustDense, an empirical study that systematically replaces sequence mixers in various well-established TSA models with dense layers. Grounded in the MatrixMixer framework, JustDense treats any sequence mixer as a mixing matrix and replaces it with a dense layer. This substitution isolates the mixing operation, enabling a clear theoretical foundation for understanding its role. We then conducted extensive experiments on 29 benchmarks covering five representative TSA tasks using seven state-of-the-art TSA models to address our research question. The results show that replacing sequence mixers with dense layers yields comparable or even superior performance. In the cases where dedicated sequence mixers still offer benefits, JustDense challenges the assumption that “deeper and more complex architectures are inherently better” in TSA.
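A minimal sketch of the substitution JustDense studies: a sequence mixer (e.g., attention) is replaced by a single dense layer acting along the time axis, i.e., an explicit learned mixing matrix. The tensor layout and module name here are assumptions for illustration, not the authors' exact module.

```python
import torch
import torch.nn as nn

class DenseMixer(nn.Module):
    """Drop-in stand-in for a sequence mixer: one learned dense (linear)
    layer that mixes information across the seq_len time steps of each
    channel, i.e., an explicit mixing matrix in the MatrixMixer sense."""
    def __init__(self, seq_len: int):
        super().__init__()
        self.mix = nn.Linear(seq_len, seq_len)  # learned (L x L) mixing matrix

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, channels) -> mix along the sequence dimension
        return self.mix(x.transpose(1, 2)).transpose(1, 2)

x = torch.randn(8, 96, 7)        # e.g. 96 time steps, 7 variables
print(DenseMixer(96)(x).shape)   # torch.Size([8, 96, 7])
```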
[318] Constrained Black-Box Attacks Against Multi-Agent Reinforcement Learning
Amine Andam, Jamal Bentahar, Mustapha Hedabou
Main category: cs.LG
TL;DR: The paper investigates vulnerabilities in collaborative multi-agent reinforcement learning (c-MARL) under realistic adversarial conditions, proposing efficient algorithms to perturb observations and misalign agent perceptions.
Details
Motivation: The lack of thorough investigation into c-MARL's vulnerabilities to adversarial attacks, especially under realistic constraints, motivates this study.
Method: The authors propose simple yet effective algorithms for adversarial perturbations, focusing on perturbing observations of deployed agents without requiring policy weights or surrogate training.
Result: Empirical validation on three benchmarks and 22 environments shows the approach’s effectiveness across diverse algorithms, requiring only 1,000 samples for efficiency.
Conclusion: The study highlights c-MARL’s vulnerabilities under realistic adversarial conditions and offers a sample-efficient solution.
Abstract: Collaborative multi-agent reinforcement learning (c-MARL) has rapidly evolved, offering state-of-the-art algorithms for real-world applications, including sensitive domains. However, a key challenge to its widespread adoption is the lack of a thorough investigation into its vulnerabilities to adversarial attacks. Existing work predominantly focuses on training-time attacks or unrealistic scenarios, such as access to policy weights or the ability to train surrogate policies. In this paper, we investigate new vulnerabilities under more realistic and constrained conditions, assuming an adversary can only collect and perturb the observations of deployed agents. We also consider scenarios where the adversary has no access at all. We propose simple yet highly effective algorithms for generating adversarial perturbations designed to misalign how victim agents perceive their environment. Our approach is empirically validated on three benchmarks and 22 environments, demonstrating its effectiveness across diverse algorithms and environments. Furthermore, we show that our algorithm is sample-efficient, requiring only 1,000 samples compared to the millions needed by previous methods.
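The sketch below illustrates only the threat model the paper assumes: an adversary that can perturb a deployed agent's observations within a small budget and has no access to policy weights. The random-sign noise is a placeholder for the authors' (unspecified here) perturbation-generation algorithm.

```python
import numpy as np

def perturb_observation(obs, eps=0.05, rng=np.random.default_rng(0)):
    """Apply an l-infinity-bounded perturbation to a deployed agent's
    observation. Random-sign noise stands in for the paper's actual
    perturbation algorithm; only the constraint setting is faithful."""
    delta = eps * rng.choice([-1.0, 1.0], size=np.shape(obs))
    return np.asarray(obs) + delta

clean = np.zeros(4)
print(perturb_observation(clean))  # each entry moved by exactly +/- eps
```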
[319] Peer Effect Estimation in the Presence of Simultaneous Feedback and Unobserved Confounders
Xiaojing Du, Jiuyong Li, Lin Liu, Debo Cheng, Thuc. Le
Main category: cs.LG
TL;DR: DIG2RSI is a deep learning framework combining I-G transformation and 2SRI to address simultaneous feedback and unobserved confounders in peer effect estimation, outperforming existing methods.
Details
Motivation: Challenges in estimating peer effects due to simultaneous feedback and unobserved confounders in networks motivate the need for a robust solution.
Method: DIG2RSI uses I-G transformation to handle feedback and 2SRI with neural networks to address unobserved confounders, including adversarial debiasing.
Result: The framework proves consistent and outperforms existing methods in semi-synthetic and real-world datasets.
Conclusion: DIG2RSI effectively addresses key challenges in peer effect estimation, offering a scalable and accurate solution.
Abstract: Estimating peer causal effects within complex real-world networks such as social networks is challenging, primarily due to simultaneous feedback between peers and unobserved confounders. Existing methods either address unobserved confounders while ignoring the simultaneous feedback, or account for feedback but under restrictive linear assumptions, thus failing to obtain accurate peer effect estimation. In this paper, we propose DIG2RSI, a novel Deep learning framework which leverages I-G transformation (matrix operation) and 2SRI (an instrumental variable or IV technique) to address both simultaneous feedback and unobserved confounding, while accommodating complex, nonlinear and high-dimensional relationships. DIG2RSI first applies the I-G transformation to disentangle mutual peer influences and eliminate the bias due to the simultaneous feedback. To deal with unobserved confounding, we first construct valid IVs from network data. In stage 1 of 2SRI, we train a neural network on these IVs to predict peer exposure, and extract residuals as proxies for the unobserved confounders. In stage 2, we fit a separate neural network augmented by an adversarial discriminator that incorporates these residuals as a control function and enforces the learned representation to contain no residual confounding signal. The expressive power of deep learning models in capturing complex non-linear relationships and adversarial debiasing enhances the effectiveness of DIG2RSI in eliminating bias from both feedback loops and hidden confounders. We prove consistency of our estimator under standard regularity conditions, ensuring asymptotic recovery of the true peer effect. Empirical results on two semi-synthetic benchmarks and a real-world dataset demonstrate that DIG2RSI outperforms existing approaches.
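A minimal two-stage residual inclusion (2SRI) sketch on invented synthetic data: stage 1 predicts exposure from instruments and extracts residuals as confounder proxies; stage 2 includes those residuals as a control function. The I-G transformation and adversarial debiasing that DIG2RSI adds on top are omitted.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=(n, 2))            # instruments (built from network data)
u = rng.normal(size=n)                 # unobserved confounder
exposure = z @ [1.0, -0.5] + u + rng.normal(size=n)
outcome = 2.0 * exposure + 3.0 * u + rng.normal(size=n)

# Stage 1: predict peer exposure from the IVs; residuals proxy the confounder.
stage1 = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
stage1.fit(z, exposure)
resid = exposure - stage1.predict(z)

# Stage 2: regress the outcome on exposure plus the residual control function.
x2 = np.column_stack([exposure, resid])
stage2 = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
stage2.fit(x2, outcome)
```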
[320] A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models
Wenkai Wang, Hongcan Guo, Zheqi Lv, Shengyu Zhang
Main category: cs.LG
TL;DR: AdaPO is an adaptive reinforcement learning framework for Large Multimodal Models (LMMs) that dynamically adjusts training objectives to prevent reward hacking and model collapse, enhancing self-evaluation and reasoning.
Details
Motivation: Current RL-based self-evaluation methods suffer from fixed reward mechanisms causing reward hacking and model collapse, limiting LMMs' self-improvement.
Method: AdaPO introduces an Adaptive Reward Model (ARM) and Reward Aware Dynamic KL Regularization to adjust objectives in real-time based on training state.
Result: Experiments on 8 benchmarks show AdaPO significantly improves reasoning and self-evaluation without manual intervention.
Conclusion: AdaPO effectively mitigates reward hacking, enhances LMM performance, and will be released to the community.
Abstract: Self-evaluation, a model’s ability to assess the correctness of its own output, is crucial for Large Multimodal Models (LMMs) to achieve self-improvement in multi-turn conversations, yet it is largely absent in foundation models. Recent work has employed reinforcement learning (RL) to enhance self-evaluation; however, its fixed reward mechanism suffers from reward hacking when optimizing multiple training objectives, leading to model collapse. In this paper, we propose AdaPO, an online reinforcement learning framework capable of adaptively adjusting the training objective in real time according to the current training state of each task. Specifically, to mitigate reward hacking, AdaPO introduces an Adaptive Reward Model (ARM) and a Reward Aware Dynamic KL Regularization mechanism. ARM assesses a task’s training state from the distribution of the model’s generated multi-turn trajectories’ performance. Reward Aware Dynamic KL replaces a fixed penalty with dynamic coefficients that are modulated by the reward gap between different multi-turn situations. Notably, our method automatically and smoothly adjusts its learning focus based on sub-tasks’ training progress without manual intervention. Extensive experiments over 8 benchmarks and various models show that our method significantly enhances both direct reasoning and self-evaluation capability. We will release our code to contribute to the community.
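As a hedged illustration of reward-aware dynamic KL regularization, the toy rule below raises the KL coefficient when the reward gap between multi-turn situations grows and relaxes it otherwise. The actual modulation function and all hyperparameters (base, scale, bounds) are assumptions; AdaPO's rule is not reproduced here.

```python
def dynamic_kl_coeff(reward_gap, base=0.1, scale=5.0, floor=0.01, cap=1.0):
    """Toy reward-aware KL schedule: a large reward gap (instability /
    likely reward hacking) increases the penalty; a small gap relaxes it.
    The functional form and hyperparameters are illustrative assumptions."""
    coeff = base * (1.0 + scale * abs(reward_gap))
    return max(floor, min(cap, coeff))

print(dynamic_kl_coeff(0.02))  # small gap -> coefficient near the base value
print(dynamic_kl_coeff(0.60))  # large gap -> penalty pushed toward the cap
```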
[321] Physics-Constrained Fine-Tuning of Flow-Matching Models for Generation and Inverse Problems
Jan Tauberschmidt, Sophie Fellenz, Sebastian J. Vollmer, Andrew B. Duncan
Main category: cs.LG
TL;DR: A framework for fine-tuning flow-matching generative models to enforce physical constraints and solve inverse problems in scientific systems, ensuring physical consistency and accurate parameter recovery.
Details
Motivation: To address the challenge of enforcing physical constraints and solving ill-posed inverse problems in scientific systems using generative models, while maintaining data-driven and physics-aware solutions.
Method: Differentiable post-training procedure minimizes weak-form residuals of governing PDEs, augmented with a learnable latent parameter predictor for joint optimization.
Result: Improved satisfaction of PDE constraints and accurate recovery of latent coefficients, validated on canonical PDE benchmarks.
Conclusion: The approach bridges generative modeling and scientific inference, enabling simulation-augmented discovery and data-efficient modeling of physical systems.
Abstract: We present a framework for fine-tuning flow-matching generative models to enforce physical constraints and solve inverse problems in scientific systems. Starting from a model trained on low-fidelity or observational data, we apply a differentiable post-training procedure that minimizes weak-form residuals of governing partial differential equations (PDEs), promoting physical consistency and adherence to boundary conditions without distorting the underlying learned distribution. To infer unknown physical inputs, such as source terms, material parameters, or boundary data, we augment the generative process with a learnable latent parameter predictor and propose a joint optimization strategy. The resulting model produces physically valid field solutions alongside plausible estimates of hidden parameters, effectively addressing ill-posed inverse problems in a data-driven yet physics-aware manner. We validate our method on canonical PDE benchmarks, demonstrating improved satisfaction of PDE constraints and accurate recovery of latent coefficients. Our approach bridges generative modelling and scientific inference, opening new avenues for simulation-augmented discovery and data-efficient modelling of physical systems.
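A minimal sketch of a differentiable PDE-residual penalty for fine-tuning: it penalizes the residual of u''(x) = f(x) at collocation points via autograd. Note that the paper minimizes weak-form residuals (integrated against test functions); strong-form collocation is used here only for brevity.

```python
import torch

def pde_residual_loss(u_net, f, x):
    """Physics-consistency penalty: squared residual of u''(x) = f(x) at
    collocation points x. Strong-form stand-in for the paper's weak-form
    residual; gradients flow back into the generator being fine-tuned."""
    x = x.clone().requires_grad_(True)
    u = u_net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return ((d2u - f(x)) ** 2).mean()

u_net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 1))
x = torch.linspace(0.0, 1.0, 64).unsqueeze(-1)
loss = pde_residual_loss(u_net, lambda x: torch.sin(x), x)
loss.backward()  # residual gradients reach the network parameters
```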
[322] EvaDrive: Evolutionary Adversarial Policy Optimization for End-to-End Autonomous Driving
Siwen Jiao, Kangan Qian, Hao Ye, Yang Zhong, Ziang Luo, Sicong Jiang, Zilin Huang, Yangyi Fang, Jinyu Miao, Zheng Fu, Yunlong Wang, Kun Jiang, Diange Yang, Rui Fan, Baoyun Peng
Main category: cs.LG
TL;DR: EvaDrive introduces a multi-objective reinforcement learning framework for autonomous driving, enabling iterative trajectory refinement via adversarial optimization, achieving state-of-the-art performance.
Details
Motivation: Current methods isolate trajectory generation from evaluation or collapse multi-dimensional preferences into scalar rewards, limiting iterative refinement and obscuring trade-offs.
Method: EvaDrive uses adversarial optimization between a hierarchical generator (autoregressive intent modeling + diffusion-based refinement) and a multi-objective critic, guided by Pareto frontier selection.
Result: Achieves 94.9 PDMS on NAVSIM v1 and 64.96 Driving Score on Bench2Drive, outperforming existing methods.
Conclusion: EvaDrive offers a scalarization-free, human-like iterative decision-making framework for autonomous driving.
Abstract: Autonomous driving faces significant challenges in achieving human-like iterative decision-making, which continuously generates, evaluates, and refines trajectory proposals. Current generation-evaluation frameworks isolate trajectory generation from quality assessment, preventing iterative refinement essential for planning, while reinforcement learning methods collapse multi-dimensional preferences into scalar rewards, obscuring critical trade-offs and yielding scalarization bias. To overcome these issues, we present EvaDrive, a novel multi-objective reinforcement learning framework that establishes genuine closed-loop co-evolution between trajectory generation and evaluation via adversarial optimization. EvaDrive frames trajectory planning as a multi-round adversarial game. In this game, a hierarchical generator continuously proposes candidate paths by combining autoregressive intent modeling for temporal causality with diffusion-based refinement for spatial flexibility. These proposals are then rigorously assessed by a trainable multi-objective critic that explicitly preserves diverse preference structures without collapsing them into a single scalarization bias. This adversarial interplay, guided by a Pareto frontier selection mechanism, enables iterative multi-round refinement, effectively escaping local optima while preserving trajectory diversity. Extensive experiments on NAVSIM and Bench2Drive benchmarks demonstrate SOTA performance, achieving 94.9 PDMS on NAVSIM v1 (surpassing DiffusionDrive by 6.8, DriveSuprim by 5.0, and TrajHF by 0.9) and 64.96 Driving Score on Bench2Drive. EvaDrive generates diverse driving styles via dynamic weighting without external preference data, introducing a closed-loop adversarial framework for human-like iterative decision-making, offering a novel scalarization-free trajectory optimization approach.
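A minimal stand-in for the Pareto frontier selection step: given per-trajectory scores on several objectives (higher is better), keep the non-dominated candidates rather than collapsing the objectives into one scalar. Objective names and scores below are invented.

```python
import numpy as np

def pareto_front(scores):
    """Return indices of non-dominated candidates. scores: (n, m) array,
    higher is better on each of the m objectives. Minimal stand-in for
    EvaDrive's Pareto frontier selection over trajectory proposals."""
    n = scores.shape[0]
    keep = []
    for i in range(n):
        dominated = any(
            np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
            for j in range(n) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

# e.g. trajectories scored on (comfort, progress, safety)
s = np.array([[0.9, 0.2, 0.8], [0.5, 0.5, 0.5], [0.9, 0.2, 0.7]])
print(pareto_front(s))  # [0, 1]: the third row is dominated by the first
```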
[323] Presenting DiaData for Research on Type 1 Diabetes
Beyza Cinar, Maria Maleshkova
Main category: cs.LG
TL;DR: The paper integrates 15 datasets into a large database for T1D research, addressing data scarcity, and explores correlations between glucose levels and heart rate before hypoglycemia.
Details
Motivation: To overcome the lack of large datasets in diabetes and hypoglycemia research, enabling better ML predictions for glucose levels and early warnings.
Method: Systematic integration of 15 datasets into a unified database (2510 subjects, 149M measurements), with sub-databases for demographics and heart rate. Data quality assessment and correlation analysis are performed.
Result: A large, balanced dataset is created, revealing data imbalance and missing values as challenges. A correlation between glucose levels and heart rate 15-55 minutes before hypoglycemia is found.
Conclusion: The integrated dataset supports improved diabetes care research, though data quality issues remain. The heart rate correlation offers potential for early hypoglycemia detection.
Abstract: Type 1 diabetes (T1D) is an autoimmune disorder that leads to the destruction of insulin-producing cells, resulting in insulin deficiency, which is why affected individuals depend on external insulin injections. However, insulin can decrease blood glucose levels too far and cause hypoglycemia. Hypoglycemia is a severe event of low blood glucose levels ($\le$70 mg/dL) with dangerous side effects of dizziness, coma, or death. Data analysis can significantly enhance diabetes care by identifying personal patterns and trends leading to adverse events. In particular, machine learning (ML) models can predict glucose levels and provide early alarms. However, diabetes and hypoglycemia research is limited by the unavailability of large datasets. Thus, this work systematically integrates 15 datasets to provide a large database of 2510 subjects with glucose measurements recorded every 5 minutes. In total, 149 million measurements are included, of which 4% represent values in the hypoglycemic range. Moreover, two sub-databases are extracted: sub-database I includes demographics, and sub-database II includes heart rate data. The integrated dataset provides an equal distribution of sex and different age levels. As a further contribution, data quality is assessed, revealing that data imbalance and missing values present a significant challenge. Moreover, a correlation study on glucose levels and heart rate data is conducted, showing a relation between the two from 15 to 55 minutes before hypoglycemia.
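The kind of lagged-correlation check the paper reports can be sketched as follows on toy 5-minute series: correlate heart rate with glucose 15 to 55 minutes ahead (3 to 11 steps at 5-minute resolution). The synthetic data here merely stands in for DiaData and will not reproduce the reported relationship.

```python
import numpy as np
import pandas as pd

# Toy 5-minute series standing in for DiaData's glucose and heart-rate streams.
rng = np.random.default_rng(0)
t = pd.date_range("2024-01-01", periods=500, freq="5min")
glucose = pd.Series(100 + rng.normal(0, 10, 500).cumsum() * 0.1, index=t)
heart_rate = pd.Series(70 + rng.normal(0, 3, 500), index=t)

# Correlate heart rate with glucose lagged 15-55 minutes ahead, the window
# the paper examines before hypoglycemic events.
for lag_steps in range(3, 12):
    r = heart_rate.corr(glucose.shift(-lag_steps))
    print(f"lead {lag_steps * 5:3d} min: r = {r:+.3f}")
```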
[324] Finite-Time Global Optimality Convergence in Deep Neural Actor-Critic Methods for Decentralized Multi-Agent Reinforcement Learning
Zhiyao Zhang, Myeung Suk Oh, FNU Hairi, Ziyue Luo, Alvaro Velasquez, Jia Liu
Main category: cs.LG
TL;DR: The paper introduces a deep neural actor-critic method for decentralized multi-agent reinforcement learning (MARL), bridging the gap between practical success and theoretical understanding by providing global optimality guarantees and a finite-time convergence rate.
Details
Motivation: Existing theoretical convergence studies for decentralized MARL methods are limited to linear function approximations, leaving a gap with the practical success of deep neural actor-critic methods.
Method: The authors propose a deep neural actor-critic method for decentralized MARL, where both actor and critic components are non-linear.
Result: The method achieves a global optimality guarantee with a finite-time convergence rate of O(1/T), marking the first such result for deep neural actor-critic in MARL. Numerical experiments support the theoretical findings.
Conclusion: This work bridges the theory-practice gap in decentralized MARL by providing the first global convergence result for deep neural actor-critic methods, validated by experiments.
Abstract: Actor-critic methods for decentralized multi-agent reinforcement learning (MARL) facilitate collaborative optimal decision making without centralized coordination, thus enabling a wide range of applications in practice. To date, however, most theoretical convergence studies for existing actor-critic decentralized MARL methods are limited to the guarantee of a stationary solution under the linear function approximation. This leaves a significant gap between the highly successful use of deep neural actor-critic for decentralized MARL in practice and the current theoretical understanding. To bridge this gap, in this paper, we make the first attempt to develop a deep neural actor-critic method for decentralized MARL, where both the actor and critic components are inherently non-linear. We show that our proposed method enjoys a global optimality guarantee with a finite-time convergence rate of O(1/T), where T is the total number of iterations. This marks the first global convergence result for deep neural actor-critic methods in the MARL literature. We also conduct extensive numerical experiments, which verify our theoretical results.
[325] Physics-Guided Memory Network for Building Energy Modeling
Muhammad Umair Danish, Kashif Ali, Kamran Siddiqui, Katarina Grolinger
Main category: cs.LG
TL;DR: The paper introduces a Physics-Guided Memory Network (PgMN) to combine deep learning and physics-based models for accurate energy consumption forecasting, especially in scenarios with limited or no historical data.
Details
Motivation: Deep learning models struggle without historical data, while physics-based models require extensive inputs. PgMN aims to bridge these gaps for better forecasting.
Method: PgMN integrates deep learning and physics-based predictions using Parallel Projection Layers, a Memory Unit, and a Memory Experience Module.
Result: PgMN shows high accuracy in diverse scenarios, including new buildings, missing data, and dynamic changes.
Conclusion: PgMN offers a robust solution for energy forecasting in dynamic environments, overcoming limitations of existing methods.
Abstract: Accurate energy consumption forecasting is essential for efficient resource management and sustainability in the building sector. Deep learning models are highly successful but struggle with limited historical data and become unusable when historical data are unavailable, such as in newly constructed buildings. On the other hand, physics-based models, such as EnergyPlus, simulate energy consumption without relying on historical data but require extensive building parameter specifications and considerable time to model a building. This paper introduces a Physics-Guided Memory Network (PgMN), a neural network that integrates predictions from deep learning and physics-based models to address their limitations. PgMN comprises Parallel Projection Layers to process incomplete inputs, a Memory Unit to account for persistent biases, and a Memory Experience Module to optimally extend forecasts beyond their input range and produce output. Theoretical evaluation shows that the components of PgMN are mathematically valid for performing their respective tasks. The PgMN was evaluated on short-term energy forecasting at an hourly resolution, critical for operational decision-making in smart grid and smart building systems. Experimental validation shows the accuracy and applicability of PgMN in diverse scenarios such as newly constructed buildings, missing data, sparse historical data, and dynamic infrastructure changes. This paper provides a promising solution for energy consumption forecasting in dynamic building environments, enhancing model applicability in scenarios where historical data are limited or unavailable or when physics-based models are inadequate.
[326] An Unsupervised Deep XAI Framework for Localization of Concurrent Replay Attacks in Nuclear Reactor Signals
Konstantinos Vasili, Zachery T. Dahm, William Richards, Stylianos Chatzidakis
Main category: cs.LG
TL;DR: The paper proposes an unsupervised explainable AI framework to detect and characterize replay attacks in nuclear reactor data, achieving high accuracy on real-world datasets.
Details
Motivation: Ensuring data integrity against deception attacks is critical for safe nuclear reactor operation, but current methods lack root-cause analysis and rely on synthetic or limited data.
Method: The framework combines an autoencoder with a customized windowSHAP algorithm to detect, identify, and characterize replay attacks in real-time multivariate data.
Result: The framework achieved 95%+ accuracy in detecting and identifying replay attacks on real-world datasets from Purdue’s nuclear reactor.
Conclusion: The proposed XAI framework effectively addresses gaps in current methods, providing robust and explainable detection of replay attacks in nuclear systems.
Abstract: Next generation advanced nuclear reactors are expected to be smaller both in size and power output, relying extensively on fully digital instrumentation and control systems. These reactors will generate a large flow of information in the form of multivariate time series data, simultaneously conveying various nonlinear cyber-physical, process, control, sensor, and operational states. Ensuring data integrity against deception attacks is becoming increasingly important for networked communication and a requirement for safe and reliable operation. Current efforts to address replay attacks almost universally focus on watermarking or supervised anomaly detection approaches without further identifying and characterizing the root cause of the anomaly. In addition, these approaches rely mostly on synthetic data with uncorrelated Gaussian process and measurement noise and full state feedback, or are limited to univariate signals, signal stationarity, linear quadratic regulators, or other linear time-invariant state-space models, which may fail to capture unmodeled system dynamics. In the realm of regulated nuclear cyber-physical systems, additional work is needed on the characterization of replay attacks and the explainability of predictions using real data. Here, we propose an unsupervised explainable AI framework based on a combination of an autoencoder and a customized windowSHAP algorithm to fully characterize real-time replay attacks, i.e., their detection, source identification, timing, and type, at increasing complexity during a dynamic, time-evolving reactor process. The proposed XAI framework was benchmarked on several real-world datasets from Purdue’s nuclear reactor PUR-1 with up to six signals concurrently being replayed. In all cases, the XAI framework was able to detect and identify the source and number of signals being replayed and the duration of the falsification with 95 percent or better accuracy.
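A minimal sketch of the detection stage, assuming the usual autoencoder recipe: reconstruct windows of multivariate signals and flag windows whose reconstruction error is anomalously high. The windowSHAP attribution step that localizes which signal is replayed, and the real thresholding rule, are omitted.

```python
import torch
import torch.nn as nn

class SignalAE(nn.Module):
    """Tiny autoencoder over a window of multivariate reactor signals.
    Detection-stage sketch only; the windowSHAP attribution that identifies
    the replayed signal is not shown."""
    def __init__(self, n_features: int, hidden: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, n_features)

    def forward(self, x):
        return self.dec(self.enc(x))

model = SignalAE(n_features=6)
window = torch.randn(32, 6)   # 32 time steps, 6 signals (placeholder data)
recon_error = ((model(window) - window) ** 2).mean(dim=1)
flagged = recon_error > recon_error.mean() + 3 * recon_error.std()  # toy rule
```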
[327] Energy-Efficient Stochastic Computing (SC) Neural Networks for Internet of Things Devices With Layer-Wise Adjustable Sequence Length (ASL)
Ziheng Wang, Pedro Reviriego, Farzad Niknia, Zhen Gao, Javier Conde, Shanshan Liu, Fabrizio Lombardi
Main category: cs.LG
TL;DR: The paper introduces Adjustable Sequence Length (ASL), a mixed-precision scheme for stochastic computing (SC) neural networks, reducing energy and latency by over 60% with minimal accuracy loss.
Details
Motivation: To improve energy efficiency in SC neural networks for IoT applications by addressing unexplored mixed-precision layer-wise implementations.
Method: Proposes ASL with operator-norm-based modeling of truncation noise, sensitivity analysis using RF regression, and two truncation strategies (coarse-grained and fine-grained).
Result: ASL reduces energy and latency by over 60% with negligible accuracy loss, validated on a 32nm SC MLP.
Conclusion: ASL is feasible for IoT, showcasing the benefits of mixed-precision truncation in SC designs.
Abstract: Stochastic computing (SC) has emerged as an efficient low-power alternative for deploying neural networks (NNs) in resource-limited scenarios, such as the Internet of Things (IoT). By encoding values as serial bitstreams, SC significantly reduces energy dissipation compared to conventional floating-point (FP) designs; however, layer-wise mixed-precision implementations for SC remain unexplored. This article introduces Adjustable Sequence Length (ASL), a novel scheme that applies mixed-precision concepts specifically to SC NNs. By introducing an operator-norm-based theoretical model, this article shows that truncation noise can cumulatively propagate through the layers by the estimated amplification factors. An extended sensitivity analysis is presented, using random forest (RF) regression to evaluate multilayer truncation effects and validate the alignment of theoretical predictions with practical network behaviors. To accommodate different application scenarios, this article proposes two truncation strategies (coarse-grained and fine-grained), which apply diverse sequence length configurations at each layer. Evaluations on a pipelined SC MLP synthesized at 32nm demonstrate that ASL can reduce energy and latency overheads by over 60% with negligible accuracy loss. This confirms the feasibility of the ASL scheme for IoT applications and highlights the distinct advantages of mixed-precision truncation in SC designs.
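The core SC mechanism that the sequence length controls can be sketched in a few lines: a unipolar bitstream encodes a value as its mean, a bitwise AND multiplies two streams, and a shorter stream (ASL's per-layer knob) trades accuracy for energy. This is textbook stochastic computing, not the authors' hardware design.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_bitstream(p: float, seq_len: int) -> np.ndarray:
    """Unipolar stochastic encoding: value p in [0, 1] becomes a random
    bitstream whose mean approximates p; seq_len is the precision knob
    that ASL adjusts per layer (shorter = cheaper but noisier)."""
    return (rng.random(seq_len) < p).astype(np.uint8)

def sc_multiply(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean(a & b))  # bitwise AND multiplies unipolar streams

for n in (64, 256, 1024):  # a layer-wise sequence-length sweep
    est = sc_multiply(to_bitstream(0.5, n), to_bitstream(0.4, n))
    print(f"len {n:4d}: 0.5 * 0.4 ~= {est:.3f}")  # exact product is 0.200
```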
[328] Generating Feasible and Diverse Synthetic Populations Using Diffusion Models
Min Tang, Peng Lu, Qing Feng
Main category: cs.LG
TL;DR: A novel diffusion model-based method for population synthesis outperforms VAEs and GANs by better balancing feasibility and diversity in synthetic populations.
Details
Motivation: Addressing the challenge of accurately modeling high-dimensional attribute distributions in population synthesis due to sparse survey data and the curse of dimensionality.
Method: Proposes a diffusion model-based approach to estimate joint distributions, recovering missing sampling zeros while minimizing structural zeros.
Result: Outperforms VAE and GAN methods in metrics like marginal distribution similarity, feasibility, and diversity.
Conclusion: The diffusion model offers a superior solution for generating realistic and diverse synthetic populations.
Abstract: Population synthesis is a critical task that involves generating synthetic yet realistic representations of populations. It is a fundamental problem in agent-based modeling (ABM), which has become the standard to analyze intelligent transportation systems. The synthetic population serves as the primary input for ABM transportation simulation, with traveling agents represented by population members. However, when the number of attributes describing agents becomes large, survey data often cannot densely support the joint distribution of the attributes in the population due to the curse of dimensionality. This sparsity makes it difficult to accurately model and produce the population. Interestingly, deep generative models trained from available sample data can potentially synthesize possible attribute combinations that are present in the actual population but do not exist in the sample data (called sampling zeros). Nevertheless, this comes at the cost of falsely generating infeasible attribute combinations that do not exist in the population (called structural zeros). In this study, a novel diffusion model-based population synthesis method is proposed to estimate the underlying joint distribution of a population. This approach enables the recovery of numerous missing sampling zeros while keeping the generated structural zeros minimal. Our method is compared with other recently proposed approaches such as Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) approaches, which have shown success in high-dimensional tabular population synthesis. We assess the performance of the synthesized outputs using a range of metrics, including marginal distribution similarity, feasibility, and diversity. The results demonstrate that our proposed method outperforms previous approaches in achieving a better balance between the feasibility and diversity of the synthesized population.
[329] Masked Training for Robust Arrhythmia Detection from Digitalized Multiple Layout ECG Images
Shanwei Zhang, Deyun Zhang, Yirao Tao, Kexin Wang, Shijia Geng, Jun Li, Qinghao Zhao, Xingpeng Liu, Yuxi Zhou, Shenda Hong
Main category: cs.LG
TL;DR: PatchECG is a framework for adaptive ECG signal analysis, addressing layout inconsistencies and missing data, achieving robust arrhythmia detection across different ECG layouts.
Details
Motivation: ECG signals vary in layout and quality across hospitals, challenging existing models. PatchECG aims to overcome these inconsistencies for accurate arrhythmia diagnosis.
Method: Uses a masking training strategy to focus on key ECG patches and collaborative lead dependencies, tested on PTB-XL and synthetic datasets.
Result: Achieved AUROC of 0.835 on varied layouts, 0.778 in external validation, and outperformed baseline methods, including ECGFounder.
Conclusion: PatchECG is robust and adaptable, improving arrhythmia detection accuracy despite ECG layout variations.
Abstract: The electrocardiogram (ECG) is an important tool for diagnosing cardiovascular diseases such as arrhythmia. Because different hospitals use different ECG layouts, the digitized signals exhibit asynchronous lead timing and partial blackout loss, which poses a serious challenge to existing models. To address this challenge, the study introduces PatchECG, a framework for representation learning over an adaptive, variable number of missing blocks based on a masking training strategy, which automatically focuses on key patches with collaborative dependencies between leads, thereby achieving robust recognition of arrhythmia in ECGs with different layouts. Experiments were conducted on the PTB-XL dataset and 21,388 asynchronous ECG images generated using the ECG image kit tool, with the 23 subclasses as labels. The proposed method demonstrated strong robustness under different layouts, with an average Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.835 that remained stable as layouts changed. In external validation on 400 real ECG images from Chaoyang Hospital, the AUROC for atrial fibrillation diagnosis reached 0.778; on 12 x 1 layout ECGs, AUROC reached 0.893. These results are superior to various classic interpolation and baseline methods, and improve on the current best large-scale pre-trained model, ECGFounder, by 0.111 and 0.19, respectively.
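A minimal sketch of the masking idea, under assumed tensor shapes: random per-lead patches are hidden during training so the model must exploit cross-lead dependencies, and test-time layout gaps then look like patches that are simply already masked.

```python
import torch

def mask_patches(ecg_patches: torch.Tensor, mask_ratio: float = 0.4):
    """Masking-style training sketch: hide a random subset of per-lead ECG
    patches. Shapes, ratio, and the zero fill (vs. a learned mask token)
    are illustrative assumptions, not PatchECG's exact recipe."""
    n_leads, n_patches, patch_len = ecg_patches.shape
    mask = torch.rand(n_leads, n_patches) < mask_ratio
    masked = ecg_patches.clone()
    masked[mask] = 0.0  # zero out hidden patches
    return masked, mask

patches = torch.randn(12, 20, 50)   # 12 leads, 20 patches of 50 samples
masked, mask = mask_patches(patches)
```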
[330] SVGen: Interpretable Vector Graphics Generation with Large Language Models
Feiyu Wang, Zhiyuan Zhao, Yuandong Liu, Da Zhang, Junyu Gao, Hao Sun, Xuelong Li
Main category: cs.LG
TL;DR: SVG-1M dataset and SVGen model enable efficient, accurate SVG generation from natural language, outperforming existing methods.
Details
Motivation: Addressing the challenge of converting creative ideas into precise vector graphics efficiently.
Method: Introduces SVG-1M dataset with aligned text-SVG pairs, proposes SVGen model using curriculum and reinforcement learning.
Result: SVGen outperforms general large models and traditional methods in effectiveness and efficiency.
Conclusion: SVG-1M and SVGen provide a scalable solution for text-to-SVG generation, with publicly available resources.
Abstract: Scalable Vector Graphics (SVG) is widely used in front-end development and UI/UX design due to its scalability, editability, and rendering efficiency. However, turning creative ideas into precise vector graphics remains a time-consuming challenge. To address this, we introduce SVG-1M, a large-scale dataset of high-quality SVGs paired with natural language descriptions. Through advanced data augmentation and annotation, we create well-aligned Text to SVG training pairs, including a subset with Chain of Thought annotations for enhanced semantic guidance. Based on this dataset, we propose SVGen, an end-to-end model that generates SVG code from natural language inputs. Our approach ensures semantic accuracy and structural completeness, supported by curriculum learning and reinforcement learning optimization. Experiments show that SVGen outperforms general large models and traditional rendering methods in both effectiveness and efficiency. Code, model, and dataset are available on GitHub.
[331] Multimodal RAG Enhanced Visual Description
Amit Kumar Jaiswal, Haiming Liu, Ingo Frommholz
Main category: cs.LG
TL;DR: A lightweight, training-free method using Retrieval-Augmented Generation (RAG) bridges the modality gap in large multimodal models (LMMs) by linear mapping, improving textual description generation for images.
Details
Motivation: Addressing the high cost and impracticality of fine-tuning LMMs to align textual and visual representations, the paper proposes an efficient alternative.
Method: Utilizes RAG with linear mapping to retrieve textual descriptions from training data, combined with iterative synthetic description generation for optimization.
Result: Shows significant improvements on benchmark datasets for multimodal tasks.
Conclusion: The proposed method effectively mitigates the modality gap without costly fine-tuning, offering a practical solution for multimodal tasks.
Abstract: Textual descriptions for multimodal inputs entail recurrent refinement of queries to produce relevant output images. Despite efforts to address challenges such as scaling model size and data volume, the cost associated with pre-training and fine-tuning remains substantial. However, pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations within a common embedding space. Although fine-tuning can potentially mitigate this gap, it is typically expensive and impractical due to the requirement for extensive domain-driven data. To overcome this challenge, we propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to extend across the modality using a linear mapping, which can be computed efficiently. During inference, this mapping is applied to images embedded by an LMM, enabling retrieval of the closest textual descriptions from the training set. These textual descriptions, in conjunction with an instruction, serve as an input prompt for the language model to generate new textual descriptions. In addition, we introduce an iterative technique for distilling the mapping by generating synthetic descriptions via the language model, facilitating optimisation for standard image description measures. Experimental results on two benchmark multimodal datasets demonstrate significant improvements.
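A minimal sketch of the training-free pipeline under toy dimensions: fit the cross-modal linear map by least squares on paired embeddings, then project a query image embedding and retrieve the nearest stored descriptions by cosine similarity. Embedding sizes and data below are placeholders.

```python
import numpy as np

# Fit a linear map W from image embeddings to text embeddings by least
# squares; at inference, map a query image and retrieve nearest descriptions.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(1000, 512))     # LMM image embeddings (train set)
txt_emb = rng.normal(size=(1000, 512))     # paired text embeddings
W, *_ = np.linalg.lstsq(img_emb, txt_emb, rcond=None)

def retrieve(query_img_emb, k=3):
    q = query_img_emb @ W                  # project into the text space
    sims = txt_emb @ q / (np.linalg.norm(txt_emb, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k]           # indices of top-k descriptions

print(retrieve(rng.normal(size=512)))
```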
[332] FedMP: Tackling Medical Feature Heterogeneity in Federated Learning from a Manifold Perspective
Zhekai Zhou, Shudong Liu, Zhaokun Zhou, Yang Liu, Qiang Yang, Yuesheng Zhu, Guibo Luo
Main category: cs.LG
TL;DR: FedMP improves federated learning (FL) in non-IID scenarios by using stochastic feature manifold completion and class-prototypes, outperforming existing methods on medical and natural image datasets.
Details
Motivation: Address challenges in FL due to non-IID data, especially in medical imaging, where feature distribution shifts hinder model performance.
Method: Proposes FedMP with stochastic feature manifold completion and class-prototypes to align feature manifolds and improve decision boundaries.
Result: FedMP outperforms existing FL algorithms on medical and multi-domain natural image datasets.
Conclusion: FedMP effectively enhances FL under non-IID conditions, with analysis on manifold dimensionality, communication efficiency, and privacy.
Abstract: Federated learning (FL) is a decentralized machine learning paradigm in which multiple clients collaboratively train a shared model without sharing their local private data. However, real-world applications of FL frequently encounter challenges arising from the non-identically and independently distributed (non-IID) local datasets across participating clients, which is particularly pronounced in the field of medical imaging, where shifts in image feature distributions significantly hinder the global model’s convergence and performance. To address this challenge, we propose FedMP, a novel method designed to enhance FL under non-IID scenarios. FedMP employs stochastic feature manifold completion to enrich the training space of individual client classifiers, and leverages class-prototypes to guide the alignment of feature manifolds across clients within semantically consistent subspaces, facilitating the construction of more distinct decision boundaries. We validate the effectiveness of FedMP on multiple medical imaging datasets, including those with real-world multi-center distributions, as well as on a multi-domain natural image dataset. The experimental results demonstrate that FedMP outperforms existing FL algorithms. Additionally, we analyze the impact of manifold dimensionality, communication efficiency, and privacy implications of feature exposure in our method.
[333] DQT: Dynamic Quantization Training via Dequantization-Free Nested Integer Arithmetic
Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Francesca Palermo, Diana Trojaniello, Manuel Roveri
Main category: cs.LG
TL;DR: DQT introduces a novel dynamic quantization framework using nested integer representation and bit-shift operations for efficient mixed-precision quantization without costly dequantization cycles.
Details
Motivation: Existing dynamic quantization methods require expensive dequantize-requantize cycles, breaking integer-only hardware efficiency. DQT aims to eliminate this bottleneck.
Method: DQT uses nested integer representation and custom integer-only arithmetic, enabling bit-width switching via low-cost bit-shift operations. A lightweight controller dynamically quantizes each layer.
Result: DQT achieves SOTA performance: 77.00% top-1 accuracy on ImageNet with ResNet50 (4-bit), outperforming static (76.70%) and dynamic (76.94%) methods. Bit-shift cost is 28.3M vs. 56.6M MACs in prior work.
Conclusion: DQT enables efficient, adaptive AI by removing dequantization overhead, offering superior accuracy-efficiency trade-offs with minimal hardware cost.
Abstract: The deployment of deep neural networks on resource-constrained devices relies on quantization. While static, uniform quantization applies a fixed bit-width to all inputs, it fails to adapt to their varying complexity. Dynamic, instance-based mixed-precision quantization promises a superior accuracy-efficiency trade-off by allocating higher precision only when needed. However, a critical bottleneck remains: existing methods require a costly dequantize-to-float and requantize-to-integer cycle to change precision, breaking the integer-only hardware paradigm and compromising performance gains. This paper introduces Dynamic Quantization Training (DQT), a novel framework that removes this bottleneck. At the core of DQT is a nested integer representation where lower-precision values are bit-wise embedded within higher-precision ones. This design, coupled with custom integer-only arithmetic, allows for on-the-fly bit-width switching through a near-zero-cost bit-shift operation. This makes DQT the first quantization framework to enable both dequantization-free static mixed-precision of the backbone network, and truly efficient dynamic, instance-based quantization through a lightweight controller that decides at runtime how to quantize each layer. We demonstrate DQT state-of-the-art performance on ResNet18 on CIFAR-10 and ResNet50 on ImageNet. On ImageNet, our 4-bit dynamic ResNet50 achieves 77.00% top-1 accuracy, an improvement over leading static (LSQ, 76.70%) and dynamic (DQNET, 76.94%) methods at a comparable BitOPs budget. Crucially, DQT achieves this with a bit-width transition cost of only 28.3M simple bit-shift operations, a drastic improvement over the 56.6M costly Multiply-Accumulate (MAC) floating-point operations required by previous dynamic approaches - unlocking a new frontier in efficient, adaptive AI.
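The nested-integer idea can be illustrated with plain integer arithmetic: the top 4 bits of an 8-bit weight are themselves a valid 4-bit weight, so precision switching is a bit shift rather than a dequantize-requantize cycle. A toy sketch, not the DQT kernel.

```python
# Nested integer representation in the DQT spirit: lower-precision values
# are bit-wise embedded within higher-precision ones.
w8 = 0b10110110            # an 8-bit quantized weight (182)

w4 = w8 >> 4               # drop to 4-bit precision: just a shift (0b1011)
w8_again = w4 << 4         # back on the 8-bit grid (low bits lost: 0b10110000)

# Integer-only accumulation at either precision, no float conversion:
acts = [3, 1, 2]
weights8 = [0b10110110, 0b01100001, 0b11100011]
acc4 = sum(a * (w >> 4) for a, w in zip(acts, weights8))  # 4-bit mode
acc8 = sum(a * w for a, w in zip(acts, weights8))         # 8-bit mode
print(w4, w8_again, acc4, acc8)
```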
[334] scAGC: Learning Adaptive Cell Graphs with Contrastive Guidance for Single-Cell Clustering
Huifa Li, Jie Fu, Xinlin Zhuang, Haolin Yang, Xinpeng Ling, Tong Cheng, Haochen xue, Imran Razzak, Zhili Chen
Main category: cs.LG
TL;DR: scAGC is a novel single-cell clustering method that learns adaptive cell graphs with contrastive guidance, outperforming state-of-the-art methods in accuracy.
Details
Motivation: Traditional clustering methods struggle with high dimensionality and noise in scRNA-seq data, while existing graph-based methods rely on static graphs that are noise-sensitive and fail to address long-tailed distributions.
Method: scAGC uses a topology-adaptive graph autoencoder with Gumbel-Softmax sampling to refine graphs dynamically, integrates ZINB loss for robust feature reconstruction, and employs contrastive learning for stability.
Result: scAGC achieves the best NMI and ARI scores on 9 and 7 real scRNA-seq datasets, respectively.
Conclusion: scAGC effectively addresses challenges in single-cell clustering by combining adaptive graph learning and contrastive guidance, demonstrating superior performance.
Abstract: Accurate cell type annotation is a crucial step in analyzing single-cell RNA sequencing (scRNA-seq) data, which provides valuable insights into cellular heterogeneity. However, due to the high dimensionality and prevalence of zero elements in scRNA-seq data, traditional clustering methods face significant statistical and computational challenges. While some advanced methods use graph neural networks to model cell-cell relationships, they often depend on static graph structures that are sensitive to noise and fail to capture the long-tailed distribution inherent in single-cell populations. To address these limitations, we propose scAGC, a single-cell clustering method that learns adaptive cell graphs with contrastive guidance. Our approach optimizes feature representations and cell graphs simultaneously in an end-to-end manner. Specifically, we introduce a topology-adaptive graph autoencoder that leverages a differentiable Gumbel-Softmax sampling strategy to dynamically refine the graph structure during training. This adaptive mechanism mitigates the problem of a long-tailed degree distribution by promoting a more balanced neighborhood structure. To model the discrete, over-dispersed, and zero-inflated nature of scRNA-seq data, we integrate a Zero-Inflated Negative Binomial (ZINB) loss for robust feature reconstruction. Furthermore, a contrastive learning objective is incorporated to regularize the graph learning process and prevent abrupt changes in the graph topology, ensuring stability and enhancing convergence. Comprehensive experiments on 9 real scRNA-seq datasets demonstrate that scAGC consistently outperforms other state-of-the-art methods, yielding the best NMI and ARI scores on 9 and 7 datasets, respectively. Our code is available at Anonymous Github.
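A minimal sketch of the differentiable graph-sampling step, assuming a simple two-way relaxation: each candidate edge gets a Gumbel-Softmax sample over {absent, present}, yielding a soft adjacency matrix that gradients can refine end-to-end. scAGC's full autoencoder, ZINB loss, and contrastive objective are omitted.

```python
import torch
import torch.nn.functional as F

def sample_adjacency(edge_logits: torch.Tensor, tau: float = 0.5):
    """Differentiable graph sampling sketch: per candidate edge, a 2-way
    Gumbel-Softmax over {absent, present} yields a soft adjacency matrix
    the model can refine during training."""
    logits = torch.stack([-edge_logits, edge_logits], dim=-1)  # (n, n, 2)
    soft = F.gumbel_softmax(logits, tau=tau, hard=False)[..., 1]
    return 0.5 * (soft + soft.T)  # symmetrize for an undirected cell graph

adj = sample_adjacency(torch.randn(100, 100, requires_grad=True))
adj.sum().backward()  # gradients reach the edge logits
```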
[335] Long-Term Client Selection for Federated Learning with Non-IID Data: A Truthful Auction Approach
Jinghong Tan, Zhian Liu, Kun Guo, Mingxiong Zhao
Main category: cs.LG
TL;DR: Proposes LCSFLA, a truthful auction-based long-term client-selection method for federated learning in IoV, addressing non-IID data challenges and ensuring truthful client participation.
Details
Motivation: FL in IoV faces non-IID data issues and resource wastage due to inefficient client selection. Traditional methods lack long-term data quality assessment and truthful client participation.
Method: Introduces LCSFLA, combining long-term data quality assessment with a truthful auction mechanism, including a deposit requirement to ensure honesty.
Result: Theoretical proofs confirm incentive compatibility and individual rationality. Experiments show improved performance with non-IID data in IoV scenarios.
Conclusion: LCSFLA effectively mitigates non-IID data challenges in FL for IoV, ensuring truthful client participation and better model performance.
Abstract: Federated learning (FL) provides a decentralized framework that enables universal model training through collaborative efforts on mobile nodes, such as smart vehicles in the Internet of Vehicles (IoV). Each smart vehicle acts as a mobile client, contributing to the process without uploading local data. This method leverages non-independent and identically distributed (non-IID) training data from different vehicles, influenced by various driving patterns and environmental conditions, which can significantly impact model convergence and accuracy. Although client selection can be a feasible solution for non-IID issues, it faces challenges related to selection metrics. Traditional metrics evaluate client data quality independently per round and require client selection after all clients complete local training, leading to resource wastage from unused training results. In the IoV context, where vehicles have limited connectivity and computational resources, information asymmetry in client selection risks clients submitting false information, potentially making the selection ineffective. To tackle these challenges, we propose a novel Long-term Client-Selection Federated Learning based on Truthful Auction (LCSFLA). This scheme maximizes social welfare with consideration of long-term data quality using a new assessment mechanism and energy costs, and the advised auction mechanism with a deposit requirement incentivizes client participation and ensures information truthfulness. We theoretically prove the incentive compatibility and individual rationality of the advised incentive mechanism. Experimental results on various datasets, including those from IoV scenarios, demonstrate its effectiveness in mitigating performance degradation caused by non-IID data.
[336] Breath as a biomarker: A survey of contact and contactless applications and approaches in respiratory monitoring
Almustapha A. Wakili, Babajide J. Asaju, Woosub Jung
Main category: cs.LG
TL;DR: A survey on breath analysis methods, comparing contact-based and contactless approaches, with a focus on machine learning and deep learning applications, challenges, and future trends.
Details
Motivation: To address the limitations of traditional contact-based breath analysis methods by exploring noninvasive, contactless techniques enabled by advanced technologies like machine learning.
Method: Examines contactless methods (e.g., Wi-Fi CSI, acoustic sensing) and machine learning/deep learning techniques for preprocessing, feature extraction, and classification.
Result: Highlights the potential of contactless methods for accurate respiratory monitoring and discusses challenges like dataset scarcity and privacy.
Conclusion: Provides a framework for future innovations in breath analysis, emphasizing the integration of advanced technologies with healthcare applications.
Abstract: Breath analysis has emerged as a critical tool in health monitoring, offering insights into respiratory function, disease detection, and continuous health assessment. While traditional contact-based methods are reliable, they often pose challenges in comfort and practicality, particularly for long-term monitoring. This survey comprehensively examines contact-based and contactless approaches, emphasizing recent advances in machine learning and deep learning techniques applied to breath analysis. Contactless methods, including Wi-Fi Channel State Information and acoustic sensing, are analyzed for their ability to provide accurate, noninvasive respiratory monitoring. We explore a broad range of applications, from single-user respiratory rate detection to multi-user scenarios, user identification, and respiratory disease detection. Furthermore, this survey details essential data preprocessing, feature extraction, and classification techniques, offering comparative insights into machine learning/deep learning models suited to each approach. Key challenges like dataset scarcity, multi-user interference, and data privacy are also discussed, along with emerging trends like Explainable AI, federated learning, transfer learning, and hybrid modeling. By synthesizing current methodologies and identifying open research directions, this survey offers a comprehensive framework to guide future innovations in breath analysis, bridging advanced technological capabilities with practical healthcare applications.
[337] Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks
Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng
Main category: cs.LG
TL;DR: The paper introduces Fine-Grained Safety Neurons (FGSN) with a training-free continual projection method to mitigate safety risks in fine-tuned LLMs, balancing safety and utility.
Details
Motivation: Fine-tuning LLMs introduces safety risks by challenging original alignment mechanisms, and existing defenses lack comprehensive consideration of safety layers and fine-grained neurons.
Method: Proposes FGSN, which integrates multi-scale interactions between safety layers and neurons, localizes precise safety neurons, and projects parameters onto safety directions.
Result: FGSN reduces harmfulness scores and attack success rates with minimal parameter changes while preserving model utility.
Conclusion: FGSN offers a robust, continual defense against safety risks in fine-tuned LLMs, aligning with human preferences and generalizing to unforeseen concerns.
Abstract: Fine-tuning-as-a-service injects domain-specific knowledge into large language models (LLMs), while challenging the original alignment mechanisms and introducing safety risks. A series of defense strategies have been proposed for the alignment, fine-tuning, and post-fine-tuning phases, where most post-fine-tuning defenses rely on coarse-grained safety layer mapping. These methods lack a comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to efficiently balance safety and utility. To address this, we propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons while minimizing interference with downstream task neurons. We then project the safety neuron parameters onto safety directions, improving model safety while aligning more closely with human preferences. Extensive experiments across multiple fine-tuned LLM models demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model’s utility. Furthermore, by introducing a task-specific, multi-dimensional heterogeneous safety neuron cluster optimization mechanism, we achieve continual defense and generalization capability against unforeseen emerging safety concerns.
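A generic sketch of projecting neuron weights onto a safety direction, which is the flavor of update the abstract describes: keep the component aligned with the direction and shrink the remainder. FGSN's neuron localization and exact projection rule are not given in this summary, so the function below is an assumption.

```python
import torch

def project_onto_safety(w: torch.Tensor, d: torch.Tensor, alpha: float = 1.0):
    """Move a safety neuron's weight vector toward a safety direction d:
    keep its component along d, shrink the orthogonal remainder by alpha.
    Generic projection sketch; not FGSN's published update rule."""
    d = d / d.norm()
    along = (w @ d) * d          # component aligned with the safety direction
    return along + (1 - alpha) * (w - along)  # alpha=1 -> full projection

w = torch.randn(4096)            # a localized safety neuron's weights (toy)
d = torch.randn(4096)            # safety direction (e.g., from aligned model)
w_safe = project_onto_safety(w, d, alpha=0.8)
```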
[338] From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization
Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, Shijin Wang
Main category: cs.LG
TL;DR: TokenCast, an LLM-driven framework, integrates numerical and textual data for context-aware time series forecasting by converting numerical sequences into tokens and embedding them with contextual data in a shared space.
Details
Motivation: Improving forecasting accuracy by addressing the challenge of integrating historical numerical sequences with unstructured contextual features like text.
Method: Uses a discrete tokenizer to convert numerical data into tokens, aligns them with textual inputs in a shared space via a pre-trained LLM, and fine-tunes the model for forecasting.
Result: Demonstrated effectiveness and generalizability on diverse real-world datasets with contextual features.
Conclusion: TokenCast successfully bridges the gap between numerical and textual data for enhanced time series forecasting.
Abstract: Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the challenge of integrating historical numerical sequences with contextual features, which often comprise unstructured textual data. To address this challenge, we propose TokenCast, an LLM-driven framework that leverages language-based symbolic representations as a unified intermediary for context-aware time series forecasting. Specifically, TokenCast employs a discrete tokenizer to transform continuous numerical sequences into temporal tokens, enabling structural alignment with language-based inputs. To bridge the semantic gap between modalities, both temporal and contextual tokens are embedded into a shared representation space via a pre-trained large language model (LLM), further optimized with autoregressive generative objectives. Building upon this unified semantic space, the aligned LLM is subsequently fine-tuned in a supervised manner to predict future temporal tokens, which are then decoded back into the original numerical space. Extensive experiments on diverse real-world datasets enriched with contextual features demonstrate the effectiveness and generalizability of TokenCast.
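A minimal discrete tokenizer in the spirit of TokenCast, assuming quantile binning: continuous values map to a small symbolic vocabulary fit on training data, and a crude inverse maps tokens back to values. The paper's actual tokenizer design may differ.

```python
import numpy as np

class QuantileTokenizer:
    """Map continuous series values to a small token vocabulary via
    training-set quantile bins. Illustrative assumption; TokenCast's
    tokenizer is not necessarily quantile-based."""
    def __init__(self, n_tokens: int = 64):
        self.n_tokens = n_tokens

    def fit(self, values: np.ndarray):
        qs = np.linspace(0, 1, self.n_tokens + 1)[1:-1]
        self.edges = np.quantile(values, qs)
        return self

    def encode(self, values: np.ndarray) -> np.ndarray:
        return np.digitize(values, self.edges)      # ids in [0, n_tokens)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        centers = np.concatenate([[self.edges[0]], self.edges])  # crude inverse
        return centers[tokens]

tok = QuantileTokenizer(8).fit(np.random.default_rng(0).normal(size=1000))
print(tok.encode(np.array([-2.0, 0.0, 2.0])))  # e.g. [0 4 7]
```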
[339] Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng
Main category: cs.LG
TL;DR: The paper introduces Discrete Diffusion Forcing (D2F), a strategy to enhance the inference speed of diffusion-based LLMs, surpassing autoregressive models like LLaMA3 and Qwen2.5.
Details
Motivation: Existing diffusion-based LLMs (dLLMs) lag behind autoregressive (AR) LLMs in inference speed despite their potential for parallel token decoding.
Method: D2F combines block-wise autoregressive generation and inter-block parallel decoding, transforming vanilla dLLMs into an AR-diffusion hybrid. It uses asymmetric distillation and a pipelined parallel decoding algorithm.
Result: D2F dLLMs achieve over 2.5× speedup compared to LLaMA3 and Qwen2.5, and up to 50× acceleration over vanilla dLLMs like LLaDA and Dream, without compromising output quality.
Conclusion: D2F effectively bridges the speed gap between dLLMs and AR LLMs, offering a practical solution for efficient text generation.
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than 50× while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
[340] Multi-Objective Instruction-Aware Representation Learning in Procedural Content Generation RL
Sung-Hyun Kim, In-Chang Baek, Seo-Young Lee, Geum-Hwan Hwang, Kyung-Joong Kim
Main category: cs.LG
TL;DR: MIPCGRL improves controllability in content generation by leveraging sentence embeddings and multi-objective learning, achieving a 13.8% boost in performance.
Details
Motivation: Existing methods struggle with complex, multi-objective textual instructions, limiting controllability in content generation.
Method: MIPCGRL uses sentence embeddings, multi-label classification, and multi-head regression to train a multi-objective embedding space.
Result: The method achieves up to a 13.8% improvement in controllability with multi-objective instructions.
Conclusion: MIPCGRL enables more expressive and flexible content generation by effectively processing complex instructions.
Abstract: Recent advancements in generative modeling emphasize the importance of natural language as a highly expressive and accessible modality for controlling content generation. However, existing instructed reinforcement learning for procedural content generation (IPCGRL) methods often struggle to leverage the expressive richness of textual input, especially under complex, multi-objective instructions, leading to limited controllability. To address this problem, we propose MIPCGRL, a multi-objective representation learning method for instructed content generators, which incorporates sentence embeddings as conditions. MIPCGRL effectively trains a multi-objective embedding space by incorporating multi-label classification and multi-head regression networks. Experimental results show that the proposed method achieves up to a 13.8% improvement in controllability with multi-objective instructions. The ability to process complex instructions enables more expressive and flexible content generation.
[341] Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments
Yipeng Du, Zihao Wang, Ahmad Farhan, Claudio Angione, Harry Yang, Fielding Johnston, James P. Buban, Patrick Colangelo, Yue Zhao, Yuzhe Yang
Main category: cs.LG
TL;DR: A meta-learning framework automates optimal inference acceleration method selection in decentralized AI systems, improving efficiency and performance over traditional approaches.
Details
Motivation: To reduce costs and address scalability and data security challenges in deploying large-scale models like LLMs in decentralized systems.
Method: Introduces a meta-learning-based framework that learns from historical performance data of acceleration techniques to select optimal methods for specific tasks.
Result: The framework outperforms traditional methods, enhancing efficiency and performance in decentralized AI systems.
Conclusion: The approach offers a more democratic and economically feasible solution for AI deployment, highlighting the potential of inference acceleration in decentralized systems.
Abstract: The deployment of large-scale models, such as large language models (LLMs), incurs substantial costs due to their computational demands. To mitigate these costs and address challenges related to scalability and data security, there is a growing shift towards decentralized systems for model deployment, where choosing efficient inference acceleration schemes becomes crucial to manage computational resources effectively and enhance system responsiveness. In this work, we address the challenge of selecting optimal acceleration methods in decentralized systems by introducing a meta-learning-based framework. This framework automates the selection process by learning from historical performance data of various acceleration techniques across different tasks. Unlike traditional methods that rely on random selection or expert intuition, our approach systematically identifies the best acceleration strategies based on the specific characteristics of each task. We demonstrate that our meta-learning framework not only streamlines the decision-making process but also consistently outperforms conventional methods in terms of efficiency and performance. Our results highlight the potential of inference acceleration in decentralized AI systems, offering a path towards more democratic and economically feasible artificial intelligence solutions.
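As a toy illustration of selecting an acceleration method from historical performance data, the sketch below picks the predicted-fastest technique for a new task via nearest-neighbour lookup over task features. The paper's meta-learner is more sophisticated; the method names, task features, and latency numbers here are all invented.

```python
# Hypothetical meta-selection sketch: given historical (task-feature,
# method, latency) records, predict the fastest acceleration method for
# a new task by averaging the latencies of its k nearest historical tasks.
import numpy as np

METHODS = ["quantization", "speculative_decoding", "kv_cache_offload"]

# Historical performance log: task feature vector -> per-method latency (s).
history_feats = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
history_latency = np.array([[1.2, 2.0, 1.8],    # task 0
                            [2.5, 1.1, 1.9],    # task 1
                            [1.6, 1.5, 1.4]])   # task 2

def select_method(task_feat: np.ndarray, k: int = 2) -> str:
    """Average latencies of the k nearest historical tasks, pick the argmin."""
    d = np.linalg.norm(history_feats - task_feat, axis=1)
    nearest = np.argsort(d)[:k]
    mean_latency = history_latency[nearest].mean(axis=0)
    return METHODS[int(np.argmin(mean_latency))]

print(select_method(np.array([0.85, 0.15])))  # -> "quantization"
```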
[342] ADT4Coupons: An Innovative Framework for Sequential Coupon Distribution in E-commerce
Li Kong, Bingzhe Wang, Zhou Chen, Suhan Hu, Yuchao Ma, Qi Qi, Suoyuan Song, Bicheng Jin
Main category: cs.LG
TL;DR: The paper introduces ADT4Coupons, a framework for optimizing sequential coupon distribution to boost long-term revenue by leveraging user-platform interactions.
Details
Motivation: Existing coupon distribution strategies fail to utilize complex sequential interactions between platforms and users, leading to performance stagnation.
Method: Proposes ADT4Coupons, integrating general scenarios, sequential modeling, and iterative updates for optimized coupon distribution.
Result: Demonstrates superiority on real-world, public, and synthetic datasets.
Conclusion: ADT4Coupons effectively enhances long-term revenue by improving coupon distribution policies.
Abstract: Coupon distribution is a critical marketing strategy used by online platforms to boost revenue and enhance user engagement. Regrettably, existing coupon distribution strategies fall far short of effectively leveraging the complex sequential interactions between platforms and users. This critical oversight, despite the abundance of e-commerce log data, has precipitated a performance plateau. In this paper, we focus on the scenario in which platforms make sequential coupon distribution decisions multiple times for various users, with each user interacting with the platform repeatedly. Based on this marketing scenario, we propose a novel marketing framework, named Aligned Decision Transformer for Coupons (ADT4Coupons), to directly devise coupon distribution policies for long-term revenue boosting. ADT4Coupons enables optimized online decision-making in a variety of real-world marketing scenarios. It achieves this by seamlessly integrating three key characteristics: general scenarios, sequential modeling with more comprehensive historical data, and efficient iterative updates within a unified framework. Furthermore, empirical results on a real-world industrial dataset, alongside public and synthetic datasets, demonstrate the superiority of our framework.
[343] Building Safer Sites: A Large-Scale Multi-Level Dataset for Construction Safety Research
Zhenhui Ou, Dawei Li, Zhen Tan, Wenlin Li, Huan Liu, Siyuan Song
Main category: cs.LG
TL;DR: The paper introduces CSDataset, a comprehensive construction safety dataset integrating structured and unstructured data from OSHA, enabling machine learning and large language model applications. Preliminary analysis shows a 17.3% reduction in incidents with complaint-driven inspections.
Details
Motivation: Existing construction safety datasets are limited in volume and diversity, hindering in-depth analysis.
Method: The paper presents CSDataset, a multi-level dataset combining incidents, inspections, and violations from OSHA, with structured and unstructured data. Benchmarking and cross-level analyses are conducted.
Result: Preliminary analysis reveals a 17.3% reduction in incidents due to complaint-driven inspections.
Conclusion: CSDataset facilitates advanced safety research, with findings suggesting the effectiveness of complaint-driven inspections.
Abstract: Construction safety research is a critical field in civil engineering, aiming to mitigate risks and prevent injuries through the analysis of site conditions and human factors. However, the limited volume and lack of diversity in existing construction safety datasets pose significant challenges to conducting in-depth analyses. To address this research gap, this paper introduces the Construction Safety Dataset (CSDataset), a well-organized, comprehensive multi-level dataset encompassing incidents, inspections, and violations sourced from the Occupational Safety and Health Administration (OSHA). This dataset uniquely integrates structured attributes with unstructured narratives, facilitating a wide range of approaches driven by machine learning and large language models. We also conduct preliminary benchmarking and various cross-level analyses using our dataset, offering insights to inform and enhance future efforts in construction safety. For example, we found that complaint-driven inspections were associated with a 17.3% reduction in the likelihood of subsequent incidents. Our dataset and code are released at https://github.com/zhenhuiou/Construction-Safety-Dataset-CSDataset.
[344] MoQE: Improve Quantization Model performance via Mixture of Quantization Experts
Jinhao Zhang, Yunquan Zhang, Boyang Zhang, Zeyu Liu, Daning Cheng
Main category: cs.LG
TL;DR: MoQE is a quantization inference framework using Mixture-of-Experts to improve model performance by dynamically routing inputs to specialized quantization experts, reducing accuracy degradation without significant latency increase.
Details
Motivation: Quantization improves model efficiency and reduces deployment costs but often degrades accuracy. MoQE aims to mitigate this by leveraging multiple quantization experts.
Method: MoQE combines multiple quantization variants of a full-precision model as experts and uses lightweight, task-specific routers to dynamically route inputs.
Result: Experiments on ResNet, LLaMA, and Qwen models show MoQE matches SOTA quantization performance without added latency.
Conclusion: MoQE effectively addresses quantization-induced accuracy loss by leveraging expert specialization and dynamic routing, making it practical for resource-constrained devices.
Abstract: Quantization plays a crucial role in improving model efficiency and reducing deployment costs, enabling the widespread application of deep learning models on resource-constrained devices. However, the quantization process inevitably introduces accuracy degradation. In this paper, we propose Mixture of Quantization Experts (MoQE), a quantization inference framework based on the Mixture-of-Experts (MoE) architecture, aiming to jointly improve the performance of quantized models. MoQE combines multiple quantization variants of one full-precision model as specialized “quantization experts” and dynamically routes input data to the most suitable expert based on its characteristics. MoQE alleviates the performance degradation commonly seen in single quantization models through specialized quantization expert models. We design lightweight, structure-aware router models tailored for both CV and NLP tasks. Experimental evaluations on ResNet, LLaMA, and Qwen model families across benchmark datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that MoQE achieves performance comparable to SOTA quantization models without incurring significant increases in inference latency.
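A hedged sketch of the routing idea follows: several "quantization experts" (copies of one linear model fake-quantized to different bit-widths) plus a lightweight router that assigns each input to one expert. Shapes, bit-widths, and the hard-argmax router are illustrative, not the paper's implementation; the dense evaluation of all experts below is for clarity only, where a real system would dispatch sparsely.

```python
# Hypothetical Mixture-of-Quantization-Experts sketch.
import torch
import torch.nn as nn

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

class MoQE(nn.Module):
    def __init__(self, d_in=16, d_out=4, bit_widths=(8, 4, 2)):
        super().__init__()
        base = nn.Linear(d_in, d_out)            # one full-precision model
        self.experts = nn.ModuleList()
        for b in bit_widths:                     # its quantization variants
            e = nn.Linear(d_in, d_out)
            e.weight.data = fake_quantize(base.weight.data, b)
            e.bias.data = base.bias.data.clone()
            self.experts.append(e)
        self.router = nn.Linear(d_in, len(bit_widths))  # lightweight router

    def forward(self, x):
        choice = self.router(x).argmax(dim=-1)          # hard routing per input
        out = torch.stack([e(x) for e in self.experts]) # [E, B, d_out], dense
        return out[choice, torch.arange(x.size(0))]     # pick routed expert

model = MoQE()
print(model(torch.randn(5, 16)).shape)  # torch.Size([5, 4])
```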
[345] The First Differentiable Transfer-Based Algorithm for Discrete MicroLED Repair
Ning-Yuan Lue
Main category: cs.LG
TL;DR: A novel differentiable transfer module-based repair algorithm for microLED fabrication reduces transfer steps by 50% and achieves fast planning times.
Details
Motivation: The need for efficient computational models to optimize shift sequences in laser-enabled selective transfer for microLED fabrication, adapting to varying objectives.
Method: A differentiable transfer module models discrete shifts, enabling gradient-based optimization without handcrafted features, unlike RL-based methods.
Result: 50% reduction in transfer steps and sub-2-minute planning time for 2000x2000 arrays, outperforming local proximity searching and RL approaches.
Conclusion: The method offers a scalable, adaptable solution for accelerating microLED repair in AR/VR and display fabrication.
Abstract: Laser-enabled selective transfer, a key process in high-throughput microLED fabrication, requires computational models that can plan shift sequences to minimize motion of XY stages and adapt to varying optimization objectives across the substrate. We propose the first repair algorithm based on a differentiable transfer module designed to model discrete shifts of transfer platforms, while remaining trainable via gradient-based optimization. Compared to local proximity searching algorithms, our approach achieves superior repair performance and enables more flexible objective designs, such as minimizing the number of steps. Unlike reinforcement learning (RL)-based approaches, our method eliminates the need for handcrafted feature extractors and trains significantly faster, allowing scalability to large arrays. Experiments show a 50% reduction in transfer steps and sub-2-minute planning time on 2000x2000 arrays. This method provides a practical and adaptable solution for accelerating microLED repair in AR/VR and next-generation display fabrication.
[346] Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation
Sameer Ambekar, Daniel M. Lang, Julia A. Schnabel
Main category: cs.LG
TL;DR: Hi-Vec is a hierarchical adaptive network for test-time adaptation, using dynamic layer selection and weight merging to handle complex distribution shifts.
Details
Motivation: Standard test-time adaptation methods struggle with diverse and complex distribution shifts due to reliance on single-dimensional linear layers.
Method: Hi-Vec decomposes the encoder’s representation space into hierarchical layers, enabling dynamic layer selection, weight merging, and linear layer agreement to prevent noisy fine-tuning.
Result: Hi-Vec improves robustness, handles uncertainty, and performs well with limited batch sizes and high outlier rates.
Conclusion: Hi-Vec advances state-of-the-art methods by effectively adapting to varying distribution shifts.
Abstract: Test-time adaptation allows pretrained models to adjust to incoming data streams, addressing distribution shifts between source and target domains. However, standard methods rely on single-dimensional linear classification layers, which often fail to handle diverse and complex shifts. We propose Hierarchical Adaptive Networks with Task Vectors (Hi-Vec), which leverages multiple layers of increasing size for dynamic test-time adaptation. By decomposing the encoder’s representation space into such hierarchically organized layers, Hi-Vec, in a plug-and-play manner, allows existing methods to adapt to shifts of varying complexity. Our contributions are threefold: First, we propose dynamic layer selection for automatic identification of the optimal layer for adaptation to each test batch. Second, we propose a mechanism that merges weights from the dynamic layer to other layers, ensuring all layers receive target information. Third, we propose linear layer agreement that acts as a gating function, preventing erroneous fine-tuning by adaptation on noisy batches. We rigorously evaluate the performance of Hi-Vec in challenging scenarios and on multiple target datasets, proving its strong capability to advance state-of-the-art methods. Our results show that Hi-Vec improves robustness, addresses uncertainty, and handles limited batch sizes and increased outlier rates.
[347] GSMT: Graph Fusion and Spatiotemporal Task Correction for Multi-Bus Trajectory Prediction
Fan Ding, Hwa Hui Tew, Junn Yong Loo, Susilawati, LiTong Liu, Fang Yu Leong, Xuewen Luo, Kar Keong Chin, Jia Jun Gan
Main category: cs.LG
TL;DR: GSMT, a hybrid model combining GAT and RNN with a task corrector, improves bus trajectory prediction in data-limited urban environments.
Details
Motivation: Accurate bus trajectory prediction is vital for intelligent transportation, especially in regions with limited multimodal data, relying on GPS data despite challenges.
Method: GSMT integrates GAT and RNN, uses a task corrector to cluster historical trajectories and refine predictions, combining dynamic bus and static station data for two-stage prediction.
Result: GSMT outperforms existing methods in short- and long-term trajectory prediction on a real-world dataset from Kuala Lumpur.
Conclusion: The hybrid GSMT model effectively addresses trajectory prediction challenges in dense urban traffic, demonstrating superior performance.
Abstract: Accurate trajectory prediction for buses is crucial in intelligent transportation systems, particularly within urban environments. In developing regions where access to multimodal data is limited, relying solely on onboard GPS data remains indispensable despite inherent challenges. To address this problem, we propose GSMT, a hybrid model that integrates a Graph Attention Network (GAT) with a sequence-to-sequence Recurrent Neural Network (RNN), and incorporates a task corrector capable of extracting complex behavioral patterns from large-scale trajectory data. The task corrector clusters historical trajectories to identify distinct motion patterns and fine-tunes the predictions generated by the GAT and RNN. Specifically, GSMT fuses dynamic bus information and static station information through embedded hybrid networks to perform trajectory prediction, and applies the task corrector for secondary refinement after the initial predictions are generated. This two-stage approach enables multi-node trajectory prediction among buses operating in dense urban traffic environments under complex conditions. Experiments conducted on a real-world dataset from Kuala Lumpur, Malaysia, demonstrate that our method significantly outperforms existing approaches, achieving superior performance in both short-term and long-term trajectory prediction tasks.
[348] Blockchain Network Analysis using Quantum Inspired Graph Neural Networks & Ensemble Models
Luigi D’Amico, Daniel De Rosso, Ninad Dixit, Raul Salles de Padua, Samuel Palmer, Samuel Mugel, Román Orús, Holger Eble, Ali Abedi
Main category: cs.LG
TL;DR: A novel Quantum-Inspired Graph Neural Network (QI-GNN) with a Canonical Polyadic (CP) decomposition layer is proposed for detecting illicit blockchain transactions, outperforming classical methods with an F2 score of 74.8%.
Details
Motivation: The challenge of detecting illicit transactions in blockchain networks requires innovative solutions, especially for anti-money laundering (AML) efforts.
Method: Combines QI-GNN with an Ensemble Model (QBoost or Random Forest Classifier) and introduces a CP decomposition layer for efficient complex data analysis.
Result: Achieved an F2 score of 74.8%, demonstrating superior performance over classical machine learning methods.
Conclusion: Quantum-inspired techniques, enhanced by the CP layer, show promise for financial security and warrant further exploration in fraud detection.
Abstract: In the rapidly evolving domain of financial technology, the detection of illicit transactions within blockchain networks remains a critical challenge, necessitating robust and innovative solutions. This work proposes a novel approach that combines Quantum-Inspired Graph Neural Networks (QI-GNN) with a flexible choice of ensemble model, using QBoost or a classical model such as a Random Forest classifier. This system is tailored specifically for blockchain network analysis in anti-money laundering (AML) efforts. Our design incorporates a novel component, a Canonical Polyadic (CP) decomposition layer, within the graph neural network framework, enhancing its capability to process and analyze complex data structures efficiently. Our technical approach has undergone rigorous evaluation against classical machine learning implementations, achieving an F2 score of 74.8% in detecting fraudulent transactions. These results highlight the potential of quantum-inspired techniques, supplemented by the structural advancements of the CP layer, to not only match but potentially exceed traditional methods in complex network analysis for financial security. The findings advocate for broader adoption and further exploration of quantum-inspired algorithms within the financial sector to effectively combat fraud.
[349] LLM Empowered Prototype Learning for Zero and Few-Shot Tasks on Tabular Data
Peng Wang, Dongsheng Wang, He Zhao, Hangting Ye, Dandan Guo, Yi Chang
Main category: cs.LG
TL;DR: A novel LLM-based prototype estimation framework for tabular learning, enabling zero-shot and few-shot scenarios without training or fine-tuning.
Details
Motivation: To address challenges in utilizing advanced LLMs for tabular data modeling in few-shot and zero-shot scenarios.
Method: Proposes an example-free prompt approach to generate feature values from LLMs, creating zero-shot prototypes and enhancing them with few-shot samples.
Result: Demonstrates effectiveness in zero and few-shot tabular learning, providing a scalable and robust framework.
Conclusion: The framework bypasses constraints of example-based prompts, offering a practical solution for LLM-based tabular learning.
Abstract: Recent breakthroughs in large language models (LLMs) have opened the door to in-depth investigation of their potential in tabular data modeling. However, effectively utilizing advanced LLMs in few-shot and even zero-shot scenarios is still challenging. To this end, we propose a novel LLM-based prototype estimation framework for tabular learning. Our key idea is to query the LLM to generate feature values based on an example-free prompt, which relies solely on task and feature descriptions. With the feature values generated by the LLM, we can build a zero-shot prototype in a training-free manner, which can be further enhanced by fusing few-shot samples, avoiding training a classifier or fine-tuning the LLMs. Thanks to the example-free prompt and prototype estimation, our method bypasses the constraints of example-based prompts, providing a scalable and robust framework. Extensive experiments demonstrate the effectiveness of our method in zero- and few-shot tabular learning.
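The core prototype construction can be sketched compactly. In the toy code below, the LLM query is mocked by a seeded random generator (a stand-in only), zero-shot prototypes are optionally fused with a few-shot sample mean, and prediction is nearest-prototype by cosine similarity; the function names and the fusion weight alpha are assumptions.

```python
# Hypothetical sketch of LLM-based prototype estimation for tabular data.
import numpy as np

def llm_generated_prototype(class_name: str, d: int = 4) -> np.ndarray:
    """Stand-in for querying the LLM for typical feature values of a class."""
    rng = np.random.default_rng(abs(hash(class_name)) % (2 ** 32))
    return rng.normal(size=d)

def fuse(zero_shot, few_shot=None, alpha=0.5):
    """Blend the zero-shot prototype with the few-shot sample mean."""
    if few_shot is None or len(few_shot) == 0:
        return zero_shot
    return alpha * zero_shot + (1 - alpha) * few_shot.mean(axis=0)

def predict(x, prototypes):
    """Nearest prototype by cosine similarity."""
    sims = {c: x @ p / (np.linalg.norm(x) * np.linalg.norm(p) + 1e-9)
            for c, p in prototypes.items()}
    return max(sims, key=sims.get)

classes = ["approved", "rejected"]
protos = {c: llm_generated_prototype(c) for c in classes}
few_shot = {"approved": np.random.randn(3, 4)}            # 3 labelled rows
protos = {c: fuse(protos[c], few_shot.get(c)) for c in classes}
print(predict(np.random.randn(4), protos))
```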
[350] Detection of Odor Presence via Deep Neural Networks
Matin Hassanloo, Ali Zareh, Mehmet Kemal Özdemir
Main category: cs.LG
TL;DR: The paper presents a deep learning-based system for robust single-trial odor detection using olfactory bulb LFPs, achieving high accuracy and outperforming benchmarks.
Details
Motivation: Current odor detection methods struggle with complex mixtures and lack single-trial reliability, necessitating a more robust solution.
Method: An ensemble of ResCNN and AttentionCNN models decodes odor presence from multichannel olfactory bulb LFPs.
Result: The model achieved 86.6% accuracy, 81.0% F1-score, and 0.9247 AUC, with t-SNE confirming biologically significant signatures.
Conclusion: The study demonstrates feasibility of single-trial odor detection from LFPs and highlights deep learning’s potential for understanding olfactory representations.
Abstract: Odor detection underpins food safety, environmental monitoring, medical diagnostics, and many more fields. The current artificial sensors developed for odor detection struggle with complex mixtures, while non-invasive recordings lack reliable single-trial fidelity. To develop a general system for odor detection, in this study we present preliminary work in which we aim to test two hypotheses: (i) that spectral features of local field potentials (LFPs) are sufficient for robust single-trial odor detection and (ii) that signals from the olfactory bulb alone are adequate. To test these hypotheses, we propose an ensemble of complementary one-dimensional convolutional networks (ResCNN and AttentionCNN) that decodes the presence of odor from multichannel olfactory bulb LFPs. Tested on 2,349 trials from seven awake mice, our final ensemble model supports both hypotheses, achieving a mean accuracy of 86.6%, an F1-score of 81.0%, and an AUC of 0.9247, substantially outperforming previous benchmarks. In addition, t-SNE visualization confirms that our framework captures biologically significant signatures. These findings establish the feasibility of robust single-trial detection of the presence of odor from extracellular LFPs, as well as demonstrate the potential of deep learning models to provide a deeper understanding of olfactory representations.
[351] Over-Squashing in GNNs and Causal Inference of Rewiring Strategies
Danial Saber, Amirali Salehi-Abari
Main category: cs.LG
TL;DR: The paper introduces a method to measure over-squashing in GNNs using mutual sensitivity decay and evaluates rewiring strategies’ effectiveness.
Details
Motivation: Address the lack of empirical metrics for over-squashing in GNNs, limiting their expressivity and performance.
Method: Proposes a topology-focused method to assess over-squashing via mutual sensitivity decay and graph-level statistics, tested with rewiring strategies.
Result: Rewiring mitigates over-squashing in graph classification but can worsen it in node classification, with performance gains varying by dataset.
Conclusion: Rewiring is beneficial when over-squashing is significant and corrected moderately; a diagnostic tool helps pre-training decisions.
Abstract: Graph neural networks (GNNs) have exhibited state-of-the-art performance across wide-range of domains such as recommender systems, material design, and drug repurposing. Yet message-passing GNNs suffer from over-squashing – exponential compression of long-range information from distant nodes – which limits expressivity. Rewiring techniques can ease this bottleneck; but their practical impacts are unclear due to the lack of a direct empirical over-squashing metric. We propose a rigorous, topology-focused method for assessing over-squashing between node pairs using the decay rate of their mutual sensitivity. We then extend these pairwise assessments to four graph-level statistics (prevalence, intensity, variability, extremity). Coupling these metrics with a within-graph causal design, we quantify how rewiring strategies affect over-squashing on diverse graph- and node-classification benchmarks. Our extensive empirical analyses show that most graph classification datasets suffer from over-squashing (but to various extents), and rewiring effectively mitigates it – though the degree of mitigation, and its translation into performance gains, varies by dataset and method. We also found that over-squashing is less notable in node classification datasets, where rewiring often increases over-squashing, and performance variations are uncorrelated with over-squashing changes. These findings suggest that rewiring is most beneficial when over-squashing is both substantial and corrected with restraint – while overly aggressive rewiring, or rewiring applied to minimally over-squashed graphs, is unlikely to help and may even harm performance. Our plug-and-play diagnostic tool lets practitioners decide – before any training – whether rewiring is likely to pay off.
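To illustrate what a pairwise sensitivity probe can look like, the sketch below builds a toy message-passing network on a path graph and measures how strongly node v's output responds to node u's input via an autograd gradient norm (a cheap proxy for the full Jacobian norm); watching this quantity shrink with distance is the over-squashing signal. The architecture, depth, and weight scales are invented, not the paper's setup.

```python
# Hypothetical pairwise-sensitivity probe for over-squashing on a path graph.
import torch

n, d, layers = 8, 4, 3
A = torch.zeros(n, n)
for i in range(n - 1):                      # path graph 0-1-...-7
    A[i, i + 1] = A[i + 1, i] = 1.0
A += torch.eye(n)                           # self-loops
A = A / A.sum(dim=1, keepdim=True)          # row-normalized propagation

Ws = [torch.randn(d, d) * 0.5 for _ in range(layers)]
x = torch.randn(n, d, requires_grad=True)

h = x
for W in Ws:                                # h <- tanh(A h W)
    h = torch.tanh(A @ h @ W)

u = 0
for v in [1, 2, 3]:                         # sensitivity of node v to node u
    g = torch.autograd.grad(h[v].sum(), x, retain_graph=True)[0]
    print(f"||d h_{v} / d x_{u}|| = {g[u].norm():.4f}")  # decays with distance
```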
[352] Pattern-based Knowledge Component Extraction from Student Code Using Representation Learning
Muntasir Hoq, Griffin Pitts, Andrew Lan, Peter Brusilovsky, Bita Akram
Main category: cs.LG
TL;DR: The paper introduces an explainable framework for automated Knowledge Component (KC) discovery in computer science education, using pattern-based KCs derived from student code.
Details
Motivation: Personalized learning in CS education requires accurate modeling of student knowledge, but current KC extraction methods lack explainability and struggle with the variability of student solutions.
Method: A Variational Autoencoder is trained to generate representative patterns from student code, guided by an attention-based model. These patterns are clustered into KCs.
Result: Evaluations using learning curve analysis and Deep Knowledge Tracing show meaningful learning trajectories and improved predictive performance over traditional methods.
Conclusion: The framework advances CS education by providing an automated, scalable, and explainable way to identify essential code patterns for student learning.
Abstract: Effective personalized learning in computer science education depends on accurately modeling what students know and what they need to learn. While Knowledge Components (KCs) provide a foundation for such modeling, automated KC extraction from student code is inherently challenging due to insufficient explainability of discovered KCs and the open-endedness of programming problems with significant structural variability across student solutions and complex interactions among programming concepts. In this work, we propose a novel, explainable framework for automated KC discovery through pattern-based KCs: recurring structural patterns within student code that capture the specific programming patterns and language constructs that students must master. Toward this, we train a Variational Autoencoder to generate important representative patterns from student code guided by an explainable, attention-based code representation model that identifies important correct and incorrect pattern implementations from student code. These patterns are then clustered to form pattern-based KCs. We evaluate our KCs using two well-established methods informed by Cognitive Science: learning curve analysis and Deep Knowledge Tracing (DKT). Experimental results demonstrate meaningful learning trajectories and significant improvements in DKT predictive performance over traditional KT methods. This work advances knowledge modeling in CS education by providing an automated, scalable, and explainable framework for identifying granular code patterns and algorithmic constructs, essential for student learning.
[353] Distilling Reinforcement Learning into Single-Batch Datasets
Connor Wilhelm, Dan Ventura
Main category: cs.LG
TL;DR: Dataset distillation compresses large datasets into small synthetic ones, enabling efficient learning in one gradient step. It generalizes across tasks, including transforming RL into supervised learning.
Details
Motivation: To explore dataset distillation's versatility in compressing and transforming learning modalities, particularly from RL to supervised learning.
Method: Uses a novel extension of proximal policy optimization for meta-learning to distill RL environments (e.g., cart-pole, MuJoCo, Atari) into one-step supervised datasets.
Result: Demonstrates successful compression of complex RL environments into one-step supervised learning and generalizability across learner architectures.
Conclusion: Dataset distillation effectively compresses and transforms learning tasks, showcasing its potential for efficient and versatile learning.
Abstract: Dataset distillation compresses a large dataset into a small synthetic dataset such that learning on the synthetic dataset approximates learning on the original. Training on the distilled dataset can be performed in as little as one step of gradient descent. We demonstrate that distillation is generalizable to different tasks by distilling reinforcement learning environments into one-batch supervised learning datasets. This demonstrates not only distillation’s ability to compress a reinforcement learning task but also its ability to transform one learning modality (reinforcement learning) into another (supervised learning). We present a novel extension of proximal policy optimization for meta-learning and use it in distillation of a multi-dimensional extension of the classic cart-pole problem, all MuJoCo environments, and several Atari games. We demonstrate distillation’s ability to compress complex RL environments into one-step supervised learning, explore RL distillation’s generalizability across learner architectures, and demonstrate distilling an environment into the smallest-possible synthetic dataset.
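The one-step supervised distillation objective is easy to sketch: learn a tiny synthetic dataset such that a single differentiable SGD step on it, from a fixed initialization, minimizes loss on the real data. The RL-to-supervised transformation via the PPO extension is beyond a short sketch; the toy task and all hyperparameters below are assumptions.

```python
# Hypothetical one-step dataset distillation sketch.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_real, n_syn, n_classes = 10, 256, 10, 2
X_real = torch.randn(n_real, d)
y_real = (X_real[:, 0] > 0).long()                  # toy "real" task

X_syn = torch.randn(n_syn, d, requires_grad=True)   # learnable synthetic data
y_syn = torch.arange(n_syn) % n_classes             # fixed synthetic labels
W0 = torch.zeros(d, n_classes, requires_grad=True)  # fixed learner init
lr_inner = 0.5
opt = torch.optim.Adam([X_syn], lr=0.05)

for step in range(301):
    inner = F.cross_entropy(X_syn @ W0, y_syn)      # learner's loss on X_syn
    (gW,) = torch.autograd.grad(inner, W0, create_graph=True)
    W1 = W0 - lr_inner * gW                         # ONE differentiable step
    outer = F.cross_entropy(X_real @ W1, y_real)    # evaluate on real data
    opt.zero_grad()
    outer.backward()                                # backprop through the step
    opt.step()
    if step % 100 == 0:
        print(f"step {step}: real-data loss {outer.item():.3f}")
```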
[354] Decentralized Weather Forecasting via Distributed Machine Learning and Blockchain-Based Model Validation
Rilwan Umar, Aydin Abadi, Basil Aldali, Benito Vincent, Elliot A. J. Hurley, Hotoon Aljazaeri, Jamie Hedley-Cook, Jamie-Lee Bell, Lambert Uwuigbusun, Mujeeb Ahmed, Shishir Nagaraja, Suleiman Sabo, Weaam Alrbeiqi
Main category: cs.LG
TL;DR: A decentralized weather forecasting framework combining Federated Learning and blockchain improves accuracy, privacy, and scalability while addressing security vulnerabilities.
Details
Motivation: Centralized forecasting systems face security risks, scalability issues, and single points of failure, necessitating a decentralized solution.
Method: Integrates Federated Learning (FL) for privacy-preserving model training and blockchain (Ethereum) for transparent verification. Uses IPFS for storage and a reputation-based voting mechanism for security.
Result: Improved forecasting accuracy, enhanced system resilience, and scalability, suitable for real-world security-critical environments.
Conclusion: The proposed framework is a viable solution for decentralized, secure, and scalable weather forecasting.
Abstract: Weather forecasting plays a vital role in disaster preparedness, agriculture, and resource management, yet current centralized forecasting systems are increasingly strained by security vulnerabilities, limited scalability, and susceptibility to single points of failure. To address these challenges, we propose a decentralized weather forecasting framework that integrates Federated Learning (FL) with blockchain technology. FL enables collaborative model training without exposing sensitive local data; this approach enhances privacy and reduces data transfer overhead. Meanwhile, the Ethereum blockchain ensures transparent and dependable verification of model updates. To further enhance the system’s security, we introduce a reputation-based voting mechanism that assesses the trustworthiness of submitted models while utilizing the Interplanetary File System (IPFS) for efficient off-chain storage. Experimental results demonstrate that our approach not only improves forecasting accuracy but also enhances system resilience and scalability, making it a viable candidate for deployment in real-world, security-critical environments.
[355] Exact Verification of Graph Neural Networks with Incremental Constraint Solving
Minghao Liu, Chia-Hsuan Lu, Marta Kwiatkowska
Main category: cs.LG
TL;DR: The paper introduces GNNev, an exact verification method for GNNs to ensure robustness against adversarial attacks, supporting sum, max, and mean aggregation functions.
Details
Motivation: GNNs are used in high-stakes applications but lack robust verification methods for common aggregation functions, necessitating a reliable solution.
Method: The method uses constraint solving with bound tightening and iterative relaxation, implemented in GNNev, supporting sum, max, and mean aggregations.
Result: GNNev shows superior performance on standard benchmarks and real-world fraud datasets compared to existing tools.
Conclusion: GNNev provides a versatile and effective solution for verifying GNN robustness, especially for node classification tasks.
Abstract: Graph neural networks (GNNs) are increasingly employed in high-stakes applications, such as fraud detection or healthcare, but are susceptible to adversarial attacks. A number of techniques have been proposed to provide adversarial robustness guarantees, but support for commonly used aggregation functions in message-passing GNNs is still lacking. In this paper, we develop an exact (sound and complete) verification method for GNNs to compute guarantees against attribute and structural perturbations that involve edge addition or deletion, subject to budget constraints. Focusing on node classification tasks, our method employs constraint solving with bound tightening, and iteratively solves a sequence of relaxed constraint satisfaction problems while relying on incremental solving capabilities of solvers to improve efficiency. We implement GNNev, a versatile solver for message-passing neural networks, which supports three aggregation functions, sum, max and mean, with the latter two considered here for the first time. Extensive experimental evaluation of GNNev on two standard benchmarks (Cora and CiteSeer) and two real-world fraud datasets (Amazon and Yelp) demonstrates its usability and effectiveness, as well as superior performance compared to existing exact verification tools on sum-aggregated node classification tasks.
[356] Synaptic Pruning: A Biological Inspiration for Deep Learning Regularization
Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
Main category: cs.LG
TL;DR: The paper proposes a biologically inspired synaptic pruning method for neural networks, replacing dropout with a dynamic, magnitude-based approach that gradually removes low-importance connections during training, improving performance in time series forecasting.
Details
Motivation: Biological synaptic pruning removes weak connections for efficiency, while dropout in artificial networks randomly deactivates neurons. The authors aim to bridge this gap by introducing a more biologically plausible pruning method.
Method: The method progressively prunes low-importance connections during training using weight magnitudes and a cubic sparsity schedule. It eliminates the need for separate pruning and fine-tuning phases by maintaining gradient flow for active weights.
Result: Experiments on RNN, LSTM, and transformer models across four datasets showed consistent improvements, with up to 20% MAE reduction in financial forecasting and 52% in select transformer models.
Conclusion: The dynamic pruning method outperforms conventional dropout, offering a practical, biologically inspired alternative for regularization in diverse architectures, particularly effective in financial time series forecasting.
Abstract: Synaptic pruning in biological brains removes weak connections to improve efficiency. In contrast, dropout regularization in artificial neural networks randomly deactivates neurons without considering activity-dependent pruning. We propose a magnitude-based synaptic pruning method that better reflects biology by progressively removing low-importance connections during training. Integrated directly into the training loop as a dropout replacement, our approach computes weight importance from absolute magnitudes across layers and applies a cubic schedule to gradually increase global sparsity. At fixed intervals, pruning masks permanently remove low-importance weights while maintaining gradient flow for active ones, eliminating the need for separate pruning and fine-tuning phases. Experiments on multiple time series forecasting models including RNN, LSTM, and Patch Time Series Transformer across four datasets show consistent gains. Our method ranked best overall, with statistically significant improvements confirmed by Friedman tests (p < 0.01). In financial forecasting, it reduced Mean Absolute Error by up to 20% over models with no or standard dropout, and up to 52% in select transformer models. This dynamic pruning mechanism advances regularization by coupling weight elimination with progressive sparsification, offering easy integration into diverse architectures. Its strong performance, especially in financial time series forecasting, highlights its potential as a practical alternative to conventional dropout techniques.
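The two named ingredients, a cubic sparsity schedule and interval-based magnitude masking, can be sketched directly; integration into a full training loop and the layer-wise importance computation are omitted, and all constants below are illustrative.

```python
# Hypothetical sketch of cubic-schedule magnitude pruning.
import torch

def cubic_sparsity(step: int, total_steps: int, final_sparsity: float) -> float:
    """Sparsity target s_t = s_f * (1 - (1 - t/T)^3), rising smoothly to s_f."""
    t = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - t) ** 3)

def magnitude_mask(weights: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the (1 - sparsity) largest-|w| entries."""
    k = int(sparsity * weights.numel())
    if k == 0:
        return torch.ones_like(weights)
    threshold = weights.abs().flatten().kthvalue(k).values
    return (weights.abs() > threshold).float()

w = torch.randn(64, 64)
for step in [0, 250, 500, 750, 1000]:        # prune at fixed intervals
    s = cubic_sparsity(step, total_steps=1000, final_sparsity=0.8)
    mask = magnitude_mask(w, s)
    w = w * mask                             # pruned weights stay zero for good
    print(f"step {step}: target {s:.2f}, "
          f"actual {(mask == 0).float().mean().item():.2f}")
```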
[357] RicciFlowRec: A Geometric Root Cause Recommender Using Ricci Curvature on Financial Graphs
Zhongtian Sun, Anoushka Harit
Main category: cs.LG
TL;DR: RicciFlowRec is a geometric recommendation framework using Ricci curvature and flow for root cause attribution in dynamic financial graphs, improving robustness and interpretability.
Details
Motivation: To enhance financial decision support by quantifying local stress and tracing shock propagation in dynamic financial graphs.
Method: Models interactions among stocks, macroeconomic indicators, and news using discrete Ricci curvature and Ricci flow. Curvature gradients identify causal substructures for risk-aware ranking.
Result: Preliminary results on S&P 500 data with FinBERT-based sentiment show improved robustness and interpretability under synthetic perturbations.
Conclusion: RicciFlowRec introduces geometric flow-based reasoning in financial recommender systems, with potential for portfolio optimization and return forecasting.
Abstract: We propose RicciFlowRec, a geometric recommendation framework that performs root cause attribution via Ricci curvature and flow on dynamic financial graphs. By modelling evolving interactions among stocks, macroeconomic indicators, and news, we quantify local stress using discrete Ricci curvature and trace shock propagation via Ricci flow. Curvature gradients reveal causal substructures, informing a structural risk-aware ranking function. Preliminary results on S&P 500 data with FinBERT-based sentiment show improved robustness and interpretability under synthetic perturbations. This ongoing work supports curvature-based attribution and early-stage risk-aware ranking, with plans for portfolio optimization and return forecasting. To our knowledge, RicciFlowRec is the first recommender to apply geometric flow-based reasoning in financial decision support.
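As a rough illustration of edge-level curvature on a financial graph, the sketch below computes Forman-Ricci curvature, a much simpler discretization than the Ollivier-style curvature the paper most likely uses (Ollivier-Ricci requires optimal transport and is omitted here). The toy ticker graph is invented; strongly negative edges are the structurally "stressed" bridges.

```python
# Hypothetical Forman-Ricci curvature on a toy asset graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("AAPL", "MSFT"), ("AAPL", "NVDA"), ("MSFT", "NVDA"),
                  ("NVDA", "CPI"), ("CPI", "OIL"), ("OIL", "XOM")])

def forman_curvature(g: nx.Graph, u, v) -> int:
    """Forman-Ricci curvature of an unweighted edge: 4 - deg(u) - deg(v)."""
    return 4 - g.degree(u) - g.degree(v)

for u, v in G.edges():
    print(f"{u:>4} -- {v:<4} curvature {forman_curvature(G, u, v):+d}")
```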
[358] Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders
Charles O’Neill, Mudith Jayasekara, Max Kirkby
Main category: cs.LG
TL;DR: Domain-specific sparse autoencoders (SAEs) improve reconstruction fidelity and interpretability by focusing on well-defined domains like medical text, outperforming broad-domain SAEs.
Details
Motivation: Broad-domain SAEs struggle with capturing domain-specific features, leading to fragmented or absorbed latents and high reconstruction error.
Method: Train JumpReLU SAEs on layer-20 activations of Gemma-2 models using 195k clinical QA examples, focusing on medical text.
Result: Domain-confined SAEs explain 20% more variance, achieve higher loss recovery, and reduce linear residual error. Features align with clinically meaningful concepts.
Conclusion: Domain-confinement enhances SAE performance and interpretability, questioning the need for general-purpose SAEs in foundation models.
Abstract: Sparse autoencoders (SAEs) decompose large language model (LLM) activations into latent features that reveal mechanistic structure. Conventional SAEs train on broad data distributions, forcing a fixed latent budget to capture only high-frequency, generic patterns. This often results in significant linear “dark matter” in reconstruction error and produces latents that fragment or absorb each other, complicating interpretation. We show that restricting SAE training to a well-defined domain (medical text) reallocates capacity to domain-specific features, improving both reconstruction fidelity and interpretability. Training JumpReLU SAEs on layer-20 activations of Gemma-2 models using 195k clinical QA examples, we find that domain-confined SAEs explain up to 20% more variance, achieve higher loss recovery, and reduce linear residual error compared to broad-domain SAEs. Automated and human evaluations confirm that learned features align with clinically meaningful concepts (e.g., “taste sensations” or “infectious mononucleosis”), rather than frequent but uninformative tokens. These domain-specific SAEs capture relevant linear structure, leaving a smaller, more purely nonlinear residual. We conclude that domain-confinement mitigates key limitations of broad-domain SAEs, enabling more complete and interpretable latent decompositions, and suggesting the field may need to question “foundation-model” scaling for general-purpose SAEs.
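For readers unfamiliar with the JumpReLU activation mentioned above, here is a minimal sketch of an SAE forward pass with a per-latent jump threshold; the training machinery (straight-through gradients for the threshold, sparsity penalty) is omitted, and the 2304-wide residual stream is an assumption about the Gemma-2 model size used.

```python
# Hypothetical JumpReLU sparse autoencoder forward pass.
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.log_theta = nn.Parameter(torch.zeros(d_sae))  # per-latent threshold

    def forward(self, x: torch.Tensor):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        z = torch.relu(pre)
        theta = self.log_theta.exp()
        z = z * (z > theta).float()          # the "jump": hard-zero below theta
        recon = z @ self.W_dec + self.b_dec
        return recon, z

sae = JumpReLUSAE(d_model=2304, d_sae=16384)   # assumed residual-stream width
x = torch.randn(8, 2304)                       # stand-in layer-20 activations
recon, z = sae(x)
print(recon.shape, (z > 0).float().mean().item())  # reconstruction, sparsity
```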
[359] Understanding Dementia Speech Alignment with Diffusion-Based Image Generation
Mansi, Anastasios Lepipas, Dominika Woszczyk, Yiying Guan, Soteris Demetriou
Main category: cs.LG
TL;DR: Text-to-image models can align dementia-related speech with generated images, enabling dementia detection from images alone with 75% accuracy.
Details
Motivation: To investigate if text-to-image models can align pathological speech (dementia-related) with generated images and explain this alignment.
Method: Examine alignment of dementia-related speech with generated images and develop explainability methods.
Result: Dementia detection achieved 75% accuracy from generated images on the ADReSS dataset.
Conclusion: Text-to-image models can effectively align and detect dementia-related speech, with explainability methods revealing contributing language parts.
Abstract: Text-to-image models generate highly realistic images based on natural language descriptions, and millions of users use them to create and share images online. While it is expected that such models can align input text and generated images in the same latent space, little has been done to understand whether this alignment is possible between pathological speech and generated images. In this work, we examine the ability of such models to align dementia-related speech information with the generated images and develop methods to explain this alignment. Surprisingly, we found that dementia detection is possible from generated images alone, achieving 75% accuracy on the ADReSS dataset. We then leverage explainability methods to show which parts of the language contribute to the detection.
[360] Integrating Feature Attention and Temporal Modeling for Collaborative Financial Risk Assessment
Yue Yao, Zhen Xu, Youzhu Liu, Kunyuan Ma, Yuxiu Lin, Mohan Jiang
Main category: cs.LG
TL;DR: A federated learning-based framework for cross-institution financial risk analysis, ensuring data privacy and collaborative modeling without raw data sharing.
Details
Motivation: Addressing challenges of data privacy and collaborative modeling in financial risk analysis across institutions.
Method: Uses federated learning with feature attention, temporal modeling, and differential privacy for secure parameter aggregation.
Result: Outperforms centralized methods and other federated learning variants in accuracy, risk detection, and generalization.
Conclusion: Provides a secure, efficient solution for financial risk analysis, preserving data sovereignty and enhancing risk identification.
Abstract: This paper addresses the challenges of data privacy and collaborative modeling in cross-institution financial risk analysis. It proposes a risk assessment framework based on federated learning. Without sharing raw data, the method enables joint modeling and risk identification across multiple institutions. This is achieved by incorporating a feature attention mechanism and temporal modeling structure. Specifically, the model adopts a distributed optimization strategy. Each financial institution trains a local sub-model. The model parameters are protected using differential privacy and noise injection before being uploaded. A central server then aggregates these parameters to generate a global model. This global model is used for systemic risk identification. To validate the effectiveness of the proposed method, multiple experiments are conducted. These evaluate communication efficiency, model accuracy, systemic risk detection, and cross-market generalization. The results show that the proposed model outperforms both traditional centralized methods and existing federated learning variants across all evaluation metrics. It demonstrates strong modeling capabilities and practical value in sensitive financial environments. The method enhances the scope and efficiency of risk identification while preserving data sovereignty. It offers a secure and efficient solution for intelligent financial risk analysis.
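A minimal sketch of the described loop, local training, clipped and noised updates (Gaussian-mechanism style), and server-side averaging, is below. The feature-attention and temporal-modeling components are omitted, and the linear model, clip norm, and noise scale are illustrative stand-ins.

```python
# Hypothetical federated averaging with differentially private updates.
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, rounds = 5, 4, 20
clip_norm, noise_std, lr = 1.0, 0.1, 0.5
w_true = rng.normal(size=d)                  # shared ground-truth signal
w_global = np.zeros(d)

data = []                                    # each institution's private data
for _ in range(n_clients):
    X = rng.normal(size=(100, d))
    data.append((X, X @ w_true + 0.1 * rng.normal(size=100)))

for r in range(rounds):
    updates = []
    for X, y in data:
        grad = 2 * X.T @ (X @ w_global - y) / len(y)   # local gradient
        update = -lr * grad
        update *= min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
        update += rng.normal(scale=noise_std, size=d)  # DP noise injection
        updates.append(update)
    w_global = w_global + np.mean(updates, axis=0)     # server aggregation

print("parameter error:", np.linalg.norm(w_global - w_true))
```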
[361] Domain-Generalization to Improve Learning in Meta-Learning Algorithms
Usman Anjum, Chris Stockman, Cat Luong, Justin Zhan
Main category: cs.LG
TL;DR: DGS-MAML is a meta-learning algorithm combining gradient matching and sharpness-aware minimization for better generalization with limited data.
Details
Motivation: To improve model adaptability and robustness in few-shot learning scenarios.
Method: Combines gradient matching and sharpness-aware minimization in a bi-level optimization framework.
Result: Outperforms existing methods in accuracy and generalization on benchmark datasets.
Conclusion: DGS-MAML is effective for few-shot learning and quick adaptation, with publicly available code.
Abstract: This paper introduces Domain Generalization Sharpness-Aware Minimization Model-Agnostic Meta-Learning (DGS-MAML), a novel meta-learning algorithm designed to generalize across tasks with limited training data. DGS-MAML combines gradient matching with sharpness-aware minimization in a bi-level optimization framework to enhance model adaptability and robustness. We support our method with theoretical analysis using PAC-Bayes and convergence guarantees. Experimental results on benchmark datasets show that DGS-MAML outperforms existing approaches in terms of accuracy and generalization. The proposed method is particularly useful for scenarios requiring few-shot learning and quick adaptation, and the source code is publicly available at GitHub.
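The sharpness-aware component can be sketched in isolation: one SAM update climbs to the locally worst-case weights, then descends using the gradient measured there. The gradient-matching term and the bi-level meta-learning loop of DGS-MAML are not shown; rho and the learning rate are placeholder values.

```python
# Hypothetical single sharpness-aware minimization (SAM) step.
import torch

def sam_step(model, loss_fn, x, y, rho=0.05, lr=0.01):
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p += rho * g / norm                    # ascend to the worst case
    model.zero_grad()
    loss_fn(model(x), y).backward()                # gradient at perturbed point
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= rho * g / norm                    # undo the perturbation
            p -= lr * p.grad                       # SAM descent step
    model.zero_grad()

model = torch.nn.Linear(4, 1)
x, y = torch.randn(32, 4), torch.randn(32, 1)
for _ in range(5):
    sam_step(model, torch.nn.functional.mse_loss, x, y)
print("done")
```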
[362] Graph Neural Network and Transformer Integration for Unsupervised System Anomaly Discovery
Yun Zi, Ming Gong, Zhihao Xue, Yujun Zou, Nia Qi, Yingnan Deng
Main category: cs.LG
TL;DR: An unsupervised anomaly detection method for distributed backend services using dynamic graphs and Transformers, outperforming existing models.
Details
Motivation: Address challenges like complex dependencies, diverse behavior, and lack of labeled data in anomaly detection for distributed systems.
Method: Constructs dynamic graphs for structural dependencies, uses graph convolution and Transformers for temporal behavior, and integrates features into anomaly scores.
Result: Outperforms existing models in capturing anomaly paths and dynamic behavior, showing high expressiveness and stability.
Conclusion: The method is effective and practical for unsupervised anomaly detection in distributed systems.
Abstract: This study proposes an unsupervised anomaly detection method for distributed backend service systems, addressing practical challenges such as complex structural dependencies, diverse behavioral evolution, and the absence of labeled data. The method constructs a dynamic graph based on service invocation relationships and applies graph convolution to extract high-order structural representations from multi-hop topologies. A Transformer is used to model the temporal behavior of each node, capturing long-term dependencies and local fluctuations. During the feature fusion stage, a learnable joint embedding mechanism integrates structural and behavioral representations into a unified anomaly vector. A nonlinear mapping is then applied to compute anomaly scores, enabling an end-to-end detection process without supervision. Experiments on real-world cloud monitoring data include sensitivity analyses across different graph depths, sequence lengths, and data perturbations. Results show that the proposed method outperforms existing models on several key metrics, demonstrating stronger expressiveness and stability in capturing anomaly propagation paths and modeling dynamic behavior sequences, with high potential for practical deployment.
[363] Implicit Hypergraph Neural Networks: A Stable Framework for Higher-Order Relational Learning with Provable Guarantees
Xiaoyu Li, Guangyu Tang, Jiaojiao Jiang
Main category: cs.LG
TL;DR: IHGNN introduces an implicit equilibrium formulation for hypergraph neural networks, enabling stable and efficient global propagation without deep architectures, outperforming traditional methods.
Details
Motivation: Traditional hypergraph neural networks rely on fixed message-passing layers, limiting long-range dependency capture and causing training instability.
Method: IHGNN computes representations as solutions to a nonlinear fixed-point equation, avoiding deep architectures. It includes a well-posed training scheme, convergence analysis, and implicit-gradient training.
Result: IHGNN outperforms traditional graph/hypergraph neural networks in accuracy and robustness, showing resilience to initialization and hyperparameter variation.
Conclusion: IHGNN offers strong generalization and practical value for higher-order relational learning, addressing limitations of existing methods.
Abstract: Many real-world interactions are group-based rather than pairwise such as papers with multiple co-authors and users jointly engaging with items. Hypergraph neural networks have shown great promise at modeling higher-order relations, but their reliance on a fixed number of explicit message-passing layers limits long-range dependency capture and can destabilize training as depth grows. In this work, we introduce Implicit Hypergraph Neural Networks (IHGNN), which bring the implicit equilibrium formulation to hypergraphs: instead of stacking layers, IHGNN computes representations as the solution to a nonlinear fixed-point equation, enabling stable and efficient global propagation across hyperedges without deep architectures. We develop a well-posed training scheme with provable convergence, analyze the oversmoothing conditions and expressivity of the model, and derive a transductive generalization bound on hypergraphs. We further present an implicit-gradient training procedure coupled with a projection-based stabilization strategy. Extensive experiments on citation benchmarks show that IHGNN consistently outperforms strong traditional graph/hypergraph neural network baselines in both accuracy and robustness. Empirically, IHGNN is resilient to random initialization and hyperparameter variation, highlighting its strong generalization and practical value for higher-order relational learning.
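The implicit formulation can be illustrated with plain fixed-point iteration: node states solve z = tanh(gamma * P z W + X U), where P is the standard normalized hypergraph propagation operator built from an incidence matrix H and gamma keeps the map contractive so iteration converges. This mirrors the idea, not the paper's exact parameterization or training procedure.

```python
# Hypothetical fixed-point iteration for an implicit hypergraph layer.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_edges, d = 6, 3, 4
H = rng.integers(0, 2, size=(n_nodes, n_edges)).astype(float)  # incidence
H[0, 0] = 1.0                                      # avoid an empty hyperedge
Dv = np.diag(1.0 / np.sqrt(H.sum(1).clip(1)))      # node degree normalizer
De = np.diag(1.0 / H.sum(0).clip(1))               # edge degree normalizer
P = Dv @ H @ De @ H.T @ Dv                         # hypergraph propagation

X = rng.normal(size=(n_nodes, d))
W = rng.normal(size=(d, d)) * 0.3
U = rng.normal(size=(d, d)) * 0.3
# Scale so the map is a contraction: gamma * ||P|| * ||W|| < 1.
gamma = 0.9 / max(1e-9, np.linalg.norm(P, 2) * np.linalg.norm(W, 2))

z = np.zeros((n_nodes, d))
for it in range(100):                              # fixed-point iteration
    z_new = np.tanh(gamma * P @ z @ W + X @ U)
    if np.linalg.norm(z_new - z) < 1e-8:
        break
    z = z_new
print(f"converged in {it} iterations")
```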
[364] NEXICA: Discovering Road Traffic Causality (Extended arXiv Version)
Siddharth Srikanth, John Krumm, Jonathan Qin
Main category: cs.LG
TL;DR: NEXICA is a novel algorithm for identifying highway segments causing traffic slowdowns, using event-based time series analysis, probabilistic modeling, and binary classification, outperforming existing methods.
Details
Motivation: Addressing road congestion by targeting its root causes efficiently.
Method: Uses time series of road speeds, focuses on slowdown events, employs probabilistic modeling (maximum likelihood estimation), and trains a binary classifier on known cause/effect pairs.
Result: Outperforms state-of-the-art baselines in accuracy and computation speed on Los Angeles highway data.
Conclusion: NEXICA effectively identifies congestion-causing highway segments, offering a practical solution for traffic management.
Abstract: Road traffic congestion is a persistent problem. Focusing resources on the causes of congestion is a potentially efficient strategy for reducing slowdowns. We present NEXICA, an algorithm to discover which parts of the highway system tend to cause slowdowns on other parts of the highway. We use time series of road speeds as inputs to our causal discovery algorithm. Finding other algorithms inadequate, we develop a new approach that is novel in three ways. First, it concentrates on just the presence or absence of events in the time series, where an event indicates the temporal beginning of a traffic slowdown. Second, we develop a probabilistic model using maximum likelihood estimation to compute the probabilities of spontaneous and caused slowdowns between two locations on the highway. Third, we train a binary classifier to identify pairs of cause/effect locations trained on pairs of road locations where we are reasonably certain a priori of their causal connections, both positive and negative. We test our approach on six months of road speed data from 195 different highway speed sensors in the Los Angeles area, showing that our approach is superior to state-of-the-art baselines in both accuracy and computation speed.
[365] A Unified Contrastive-Generative Framework for Time Series Classification
Ziyu Liu, Azadeh Alavi, Minyi Li, Xiang Zhang
Main category: cs.LG
TL;DR: CoGenT unifies contrastive and generative SSL for time series, improving performance over standalone methods.
Details
Motivation: Explore the complementary potential of contrastive and generative SSL for multivariate time series.
Method: Propose CoGenT, a framework combining contrastive and generative optimization.
Result: Achieves up to 59.2% and 14.27% F1 gains over SimCLR and MAE, respectively.
Conclusion: Hybrid SSL preserves discriminative power and generative robustness, advancing temporal domain SSL.
Abstract: Self-supervised learning (SSL) for multivariate time series mainly includes two paradigms: contrastive methods that excel at instance discrimination and generative approaches that model data distributions. While effective individually, their complementary potential remains unexplored. We propose a Contrastive Generative Time series framework (CoGenT), the first framework to unify these paradigms through joint contrastive-generative optimization. CoGenT addresses fundamental limitations of both approaches: it overcomes contrastive learning’s sensitivity to high intra-class similarity in temporal data while reducing generative methods’ dependence on large datasets. We evaluate CoGenT on six diverse time series datasets. The results show consistent improvements, with up to 59.2% and 14.27% F1 gains over standalone SimCLR and MAE, respectively. Our analysis reveals that the hybrid objective preserves discriminative power while acquiring generative robustness. These findings establish a foundation for hybrid SSL in temporal domains. We will release the code shortly.
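A joint contrastive-generative objective is straightforward to write down. The sketch below pairs a SimCLR-style NT-Xent term with an MAE-style masked-reconstruction term under an assumed weighting `lam`; CoGenT's exact encoder, decoder, and weighting may differ:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.2):
    """SimCLR contrastive loss over two augmented views (batch n each)."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))              # drop self-similarity
    n = z1.size(0)
    target = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, target)

def cogent_loss(encoder, decoder, view1, view2, x, mask, lam=1.0):
    contrastive = nt_xent(encoder(view1), encoder(view2))
    recon = F.mse_loss(decoder(encoder(x * mask)), x)  # masked reconstruction
    return contrastive + lam * recon
```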
[366] NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs
Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
Main category: cs.LG
TL;DR: NeuronTune is a fine-grained framework for optimizing safety and utility in LLMs by dynamically modulating sparse neurons, outperforming existing methods.
Details
Motivation: Current techniques for safety alignment in LLMs lack robustness against attacks, refuse benign queries, and degrade text quality, necessitating a better solution.
Method: NeuronTune identifies safety-critical and utility-preserving neurons via attribution and uses meta-learning to modulate their activations adaptively.
Result: NeuronTune achieves superior safety and maintains excellent utility, outperforming state-of-the-art methods.
Conclusion: NeuronTune provides a flexible, effective solution for simultaneous safety-utility optimization in LLMs.
Abstract: Ensuring robust safety alignment while preserving utility is critical for the reliable deployment of Large Language Models (LLMs). However, current techniques fundamentally suffer from intertwined deficiencies: insufficient robustness against malicious attacks, frequent refusal of benign queries, degradation in generated text quality and general task performance–the former two reflecting deficits in robust safety and the latter constituting utility impairment. We trace these limitations to the coarse-grained layer-wise interventions in existing methods. To resolve this, we propose NeuronTune, a fine-grained framework that dynamically modulates sparse neurons to achieve simultaneous safety-utility optimization. Our approach first identifies safety-critical and utility-preserving neurons across all layers via attribution, then employs meta-learning to adaptively amplify safety-neuron activations and suppress utility-neuron activations. Crucially, NeuronTune enables tunable adjustment of intervention scope via neuron-count thresholds, supporting flexible adaptation to security-critical or utility-priority scenarios. Extensive experimental results demonstrate that our method significantly outperforms existing state-of-the-art technologies, achieving superior model safety while maintaining excellent utility.
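Mechanically, fine-grained modulation can be done with forward hooks that rescale individual neuron activations. In the sketch below the neuron indices would come from an attribution pass and the scales from meta-learning; here both are fixed placeholders:

```python
import torch
import torch.nn as nn

def modulate(layer, safety_idx, utility_idx, alpha=1.2, beta=0.8):
    """Amplify safety-neuron activations and suppress utility-neuron
    activations (per the paper's scheme); returns the hook handle."""
    def hook(module, inputs, output):
        out = output.clone()
        out[..., safety_idx] *= alpha
        out[..., utility_idx] *= beta
        return out                      # returned value replaces the output
    return layer.register_forward_hook(hook)

layer = nn.Linear(16, 16)
handle = modulate(layer, safety_idx=[0, 3], utility_idx=[7])
y = layer(torch.randn(2, 16))           # activations now modulated
handle.remove()                         # restore original behavior
```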
[367] Open-Set Fault Diagnosis in Multimode Processes via Fine-Grained Deep Feature Representation
Guangqiang Li, M. Amine Atoui, Xiangshun Li
Main category: cs.LG
TL;DR: The paper proposes FGCRN, an open-set fault diagnosis model, to accurately classify known faults and identify unknown ones in multimode processes by leveraging fine-grained clustering and rejection techniques.
Details
Motivation: Addressing the challenge of constructing compact decision boundaries for health states with multiple cluster distributions in multimode processes.
Method: Combines multiscale depthwise convolution, bidirectional gated recurrent unit, temporal attention, and a distance-based loss for feature extraction and intra-class compactness. Uses unsupervised learning for fine-grained feature representations and extreme value theory for unknown fault identification.
Result: The proposed FGCRN model demonstrates superior performance in experiments.
Conclusion: FGCRN effectively handles known and unknown fault diagnosis in multimode processes, offering a robust solution.
Abstract: A reliable fault diagnosis system should not only accurately classify known health states but also effectively identify unknown faults. In multimode processes, samples belonging to the same health state often show multiple cluster distributions, making it difficult to construct compact and accurate decision boundaries for that state. To address this challenge, a novel open-set fault diagnosis model named fine-grained clustering and rejection network (FGCRN) is proposed. It combines multiscale depthwise convolution, bidirectional gated recurrent unit and temporal attention mechanism to capture discriminative features. A distance-based loss function is designed to enhance the intra-class compactness. Fine-grained feature representations are constructed through unsupervised learning to uncover the intrinsic structures of each health state. Extreme value theory is employed to model the distance between sample features and their corresponding fine-grained representations, enabling effective identification of unknown faults. Extensive experiments demonstrate the superior performance of the proposed method.
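The rejection step is standard extreme value theory: fit a Weibull to the tail of training-sample distances from their fine-grained representations, then reject a test sample whose distance falls beyond a high quantile. A sketch with SciPy, where the tail fraction and quantile are assumptions:

```python
import numpy as np
from scipy.stats import weibull_min

def fit_tail(train_dists, tail_frac=0.05):
    """Fit a Weibull to the largest training distances (EVT tail model)."""
    k = max(int(len(train_dists) * tail_frac), 10)
    tail = np.sort(train_dists)[-k:]
    return weibull_min.fit(tail, floc=0)    # (shape, loc=0, scale)

def is_unknown(dist, params, q=0.95):
    """Flag an unknown fault if `dist` exceeds the q-quantile of the fit."""
    return dist > weibull_min.ppf(q, *params)
```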
[368] Learn to Explore: Meta NAS via Bayesian Optimization Guided Graph Generation
Zijun Sun, Yanning Shen
Main category: cs.LG
TL;DR: GraB-NAS is a novel Meta-NAS framework that combines global and local search strategies to discover high-performing neural architectures beyond predefined search spaces, outperforming existing methods.
Details
Motivation: Existing Meta-NAS methods suffer from poor generalization, limited search spaces, or high computational costs, limiting their real-world applicability.
Method: GraB-NAS models architectures as graphs and uses a hybrid search strategy: global search via Bayesian Optimization and local exploration via gradient ascent in latent space.
Result: GraB-NAS outperforms state-of-the-art Meta-NAS baselines, achieving better generalization and search effectiveness.
Conclusion: GraB-NAS advances Meta-NAS by enabling task-aware architecture discovery with strong performance, even beyond predefined search spaces.
Abstract: Neural Architecture Search (NAS) automates the design of high-performing neural networks but typically targets a single predefined task, thereby restricting its real-world applicability. To address this, Meta Neural Architecture Search (Meta-NAS) has emerged as a promising paradigm that leverages prior knowledge across tasks to enable rapid adaptation to new ones. Nevertheless, existing Meta-NAS methods often struggle with poor generalization, limited search spaces, or high computational costs. In this paper, we propose a novel Meta-NAS framework, GraB-NAS. Specifically, GraB-NAS first models neural architectures as graphs, and then a hybrid search strategy is developed to find and generate new graphs that lead to promising neural architectures. The search strategy combines global architecture search via Bayesian Optimization in the search space with local exploration for novel neural networks via gradient ascent in the latent space. Such a hybrid search strategy allows GraB-NAS to discover task-aware architectures with strong performance, even beyond the predefined search space. Extensive experiments demonstrate that GraB-NAS outperforms state-of-the-art Meta-NAS baselines, achieving better generalization and search effectiveness.
[369] DeepFeatIoT: Unifying Deep Learned, Randomized, and LLM Features for Enhanced IoT Time Series Sensor Data Classification in Smart Industries
Muhammad Sakib Khan Inan, Kewen Liao
Main category: cs.LG
TL;DR: DeepFeatIoT, a deep learning model, improves IoT time series data classification by combining learned and non-learned features, outperforming benchmarks.
Details
Motivation: Challenges like metadata loss, data heterogeneity, and inconsistent sampling in IoT sensor data hinder smart system effectiveness.
Method: Integrates local/global learned features, randomized convolutional kernels, and LLM features for robust classification.
Result: Consistently outperforms state-of-the-art models across diverse IoT datasets.
Conclusion: DeepFeatIoT advances IoT analytics and supports next-gen smart systems.
Abstract: Internet of Things (IoT) sensors are ubiquitous technologies deployed across smart cities, industrial sites, and healthcare systems. They continuously generate time series data that enable advanced analytics and automation in industries. However, challenges such as the loss or ambiguity of sensor metadata, heterogeneity in data sources, varying sampling frequencies, inconsistent units of measurement, and irregular timestamps make raw IoT time series data difficult to interpret, undermining the effectiveness of smart systems. To address these challenges, we propose a novel deep learning model, DeepFeatIoT, which integrates learned local and global features with non-learned randomized convolutional kernel-based features and features from large language models (LLMs). This straightforward yet unique fusion of diverse learned and non-learned features significantly enhances IoT time series sensor data classification, even in scenarios with limited labeled data. Our model’s effectiveness is demonstrated through its consistent and generalized performance across multiple real-world IoT sensor datasets from diverse critical application domains, outperforming state-of-the-art benchmark models. These results highlight DeepFeatIoT’s potential to drive significant advancements in IoT analytics and support the development of next-generation smart systems.
[370] EGGS-PTP: An Expander-Graph Guided Structured Post-training Pruning Method for Large Language Models
Omar Bazarbachi, Zijun Sun, Yanning Shen
Main category: cs.LG
TL;DR: EGGS-PTP is a structured pruning method using expander graph theory to efficiently reduce LLM size and computation while maintaining accuracy.
Details
Motivation: Addressing the computational and memory challenges of deploying large LLMs by developing efficient model variants.
Method: Leverages expander graph theory to guide N:M structured pruning, ensuring information flow in the pruned network.
Result: Achieves significant acceleration, memory savings, and outperforms existing pruning methods in accuracy.
Conclusion: EGGS-PTP is an effective solution for reducing LLM deployment challenges without sacrificing performance.
Abstract: As Large Language Models (LLMs) become more widely adopted and scale up in size, the computational and memory challenges involved in deploying these massive foundation models have grown increasingly severe. This underscores the urgent need to develop more efficient model variants. Faced with this challenge, the present work introduces EGGS-PTP: an Expander-Graph Guided Structured Post-training Pruning method. The proposed approach leverages graph theory to guide the design of N:M structured pruning, effectively reducing model size and computational demands. By incorporating concepts from expander graphs, EGGS-PTP ensures information flow within the pruned network, preserving essential model functionality. Extensive numerical experiments demonstrate that EGGS-PTP not only achieves significant acceleration and memory savings due to structured sparsity but also outperforms existing structured pruning techniques in terms of accuracy across various LLMs.
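For reference, this is what the N:M sparsity pattern itself looks like: keep the top 2 of every 4 weights in each group along the input dimension. The magnitude heuristic below only illustrates the pattern; EGGS-PTP's actual contribution is choosing the mask so that the surviving connectivity forms an expander graph, which this sketch omits:

```python
import torch

def nm_prune(weight, n=2, m=4):
    """Zero all but the n largest-magnitude weights in each group of m
    along the input dimension (requires in_features % m == 0)."""
    out_f, in_f = weight.shape
    groups = weight.abs().reshape(out_f, in_f // m, m)
    keep = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return weight * mask.reshape(out_f, in_f)

w = torch.randn(8, 16)
print((nm_prune(w) != 0).float().mean())   # -> 0.5 density (2:4 sparsity)
```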
[371] Large-Small Model Collaborative Framework for Federated Continual Learning
Hao Yu, Xin Yang, Boyang Fan, Xuemei Cao, Hanlin Gu, Lixin Fan, Qiang Yang
Main category: cs.LG
TL;DR: A collaborative framework for Federated Continual Learning (FCL) bridges small models and Foundation Models (FMs) to enhance performance on evolving private tasks under constraints.
Details
Motivation: FMs struggle with local downstream tasks due to inability to use private data and face challenges in continual learning due to their complexity. Small models, however, can adapt better under constraints.
Method: Proposes a framework where lightweight local models adapt to new tasks and enhance FMs. Includes Small Model Continual Fine-tuning and One-by-One Distillation for knowledge fusion.
Result: Superior performance demonstrated, even with heterogeneous small models.
Conclusion: The framework effectively bridges small models and FMs, improving continual learning in federated settings.
Abstract: Continual learning (CL) for Foundation Models (FMs) is an essential yet underexplored challenge, especially in Federated Continual Learning (FCL), where each client learns from a private, evolving task stream under strict data and communication constraints. Despite their powerful generalization abilities, FMs often exhibit suboptimal performance on local downstream tasks, as they are unable to utilize private local data. Furthermore, enabling FMs to learn new tasks without forgetting prior knowledge is inherently a challenging problem, primarily due to their immense parameter count and high model complexity. In contrast, small models can be trained locally under resource-constrained conditions and benefit from more mature CL techniques. To bridge the gap between small models and FMs, we propose the first collaborative framework in FCL, where lightweight local models act as a dynamic bridge, continually adapting to new tasks while enhancing the utility of the large model. Two novel components are also included: Small Model Continual Fine-tuning prevents temporal forgetting in small models; One-by-One Distillation performs personalized fusion of heterogeneous local knowledge on the server. Experimental results demonstrate its superior performance, even when clients utilize heterogeneous small models.
[372] MiCo: End-to-End Mixed Precision Neural Network Co-Exploration Framework for Edge AI
Zijun Jiang, Yangdi Lyu
Main category: cs.LG
TL;DR: The paper introduces MiCo, a framework for optimizing and deploying mixed-precision quantized neural networks (MPQ) for edge AI, addressing limitations in existing methods.
Details
Motivation: Existing MPQ algorithms lack flexibility, efficiency, and an end-to-end deployment framework, hindering optimal performance in edge AI applications.
Method: MiCo uses a novel optimization algorithm to search for optimal MPQ schemes, incorporates hardware-aware latency models, and enables direct deployment from PyTorch to C code.
Result: The framework achieves high accuracy with minimal drops while meeting latency constraints, offering end-to-end speedup.
Conclusion: MiCo provides a comprehensive solution for MPQ exploration and deployment, improving efficiency and performance for edge AI.
Abstract: Quantized Neural Networks (QNN) with extremely low-bitwidth data have proven promising in efficient storage and computation on edge devices. To further reduce the accuracy drop while increasing speedup, layer-wise mixed-precision quantization (MPQ) becomes a popular solution. However, existing algorithms for exploring MPQ schemes are limited in flexibility and efficiency. Comprehending the complex impacts of different MPQ schemes on post-training quantization and quantization-aware training results is a challenge for conventional methods. Furthermore, an end-to-end framework for the optimization and deployment of MPQ models is missing in existing work. In this paper, we propose the MiCo framework, a holistic MPQ exploration and deployment framework for edge AI applications. The framework adopts a novel optimization algorithm to search for optimal quantization schemes with the highest accuracies while meeting latency constraints. Hardware-aware latency models are built for different hardware targets to enable fast explorations. After the exploration, the framework enables direct deployment from PyTorch MPQ models to bare-metal C codes, leading to end-to-end speedup with minimal accuracy drops.
[373] Causal Graph Profiling via Structural Divergence for Robust Anomaly Detection in Cyber-Physical Systems
Arun Vignesh Malarkkan, Haoyue Bai, Dongjie Wang, Yanjie Fu
Main category: cs.LG
TL;DR: CGAD is a causal graph-based anomaly detection framework for cyberattack detection in critical infrastructures, outperforming traditional methods by leveraging causal structures for adaptability and accuracy.
Details
Motivation: Addressing the limitations of traditional anomaly detection methods in handling distribution shifts and class imbalance in multivariate time series for critical infrastructure protection.
Method: CGAD uses a two-phase supervised framework: causal profiling (learning causal invariant graphs with Dynamic Bayesian Networks) and anomaly scoring (detecting anomalies via causal graph comparison).
Result: CGAD achieves higher precision, F1, and ROC-AUC scores compared to baselines, proving resilience in non-stationary and imbalanced environments.
Conclusion: CGAD redefines robustness in anomaly detection by uncovering causal structures, offering superior performance in detecting complex cyberattacks.
Abstract: With the growing complexity of cyberattacks targeting critical infrastructures such as water treatment networks, there is a pressing need for robust anomaly detection strategies that account for both system vulnerabilities and evolving attack patterns. Traditional methods – statistical, density-based, and graph-based models – struggle with distribution shifts and class imbalance in multivariate time series, often leading to high false positive rates. To address these challenges, we propose CGAD, a Causal Graph-based Anomaly Detection framework designed for reliable cyberattack detection in public infrastructure systems. CGAD follows a two-phase supervised framework – causal profiling and anomaly scoring. First, it learns causal invariant graph structures representing the system’s behavior under “Normal” and “Attack” states using Dynamic Bayesian Networks. Second, it employs structural divergence to detect anomalies via causal graph comparison by evaluating topological deviations in causal graphs over time. By leveraging causal structures, CGAD achieves superior adaptability and accuracy in non-stationary and imbalanced time series environments compared to conventional machine learning approaches. By uncovering causal structures beneath volatile sensor data, our framework not only detects cyberattacks with markedly higher precision but also redefines robustness in anomaly detection, proving resilience where traditional models falter under imbalance and drift. Our framework achieves substantial gains in F1 and ROC-AUC scores over best-performing baselines across four industrial datasets, demonstrating robust detection of delayed and structurally complex anomalies.
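The anomaly-scoring phase reduces to comparing causal graphs over time. As a minimal stand-in for CGAD's structural divergence (not the paper's exact measure), one can score a window by the normalized edge disagreement between its learned adjacency and the "Normal" profile:

```python
import numpy as np

def structural_divergence(A_normal, A_window):
    """Fraction of entries that differ between two binary causal
    adjacency matrices -- an illustrative divergence score."""
    return np.abs(A_normal - A_window).mean()

A_normal = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
A_attack = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0]])
print(structural_divergence(A_normal, A_attack))  # high score -> anomaly
```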
[374] Enhancing Memory Recall in LLMs with Gauss-Tin: A Hybrid Instructional and Gaussian Replay Approach
Iing Muttakhiroh, Thomas Fevens
Main category: cs.LG
TL;DR: Gauss-Tin, a hybrid method combining replay strategy with Gaussian mixture models, improves LLM retention by 6% over traditional methods, mitigating catastrophic forgetting.
Details
Motivation: Address catastrophic forgetting in LLMs, where models lose old knowledge when learning new information.
Method: Integrates replay-based techniques with Gaussian mixture models for better sample selection and adds instructional guidance for past learning generation.
Result: 6% improvement in retention metrics compared to traditional methods.
Conclusion: Gauss-Tin effectively mitigates catastrophic forgetting, highlighting hybrid models’ potential for robust LLMs in dynamic environments.
Abstract: Despite the significant advancements in Large Language Models (LLMs), catastrophic forgetting remains a substantial challenge, where models lose previously acquired knowledge upon learning new information. Continual learning (CL) strategies have emerged as a potential solution to this problem, with replay-based techniques demonstrating superior performance in preserving learned knowledge. In this context, we introduce Gauss-Tin, a novel approach that integrates the replay strategy with a Gaussian mixture model to enhance the quality of sample selection during training, supplemented by instructional guidance to facilitate the generation of past learning. This method aims to improve LLMs’ retention capabilities by strategically reinforcing important past learnings while accommodating new information. Our experimental results indicate a promising 6% improvement in retention metrics over traditional methods, suggesting that Gauss-Tin is an effective strategy for mitigating catastrophic forgetting in LLMs. This study underscores the potential of hybrid models in enhancing the robustness and adaptability of LLMs in dynamic learning environments.
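The selection step can be sketched as: fit a Gaussian mixture over stored example embeddings, then draw the replay buffer so every mixture component is represented. The component count and per-component budget below are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_replay(embeddings, n_components=8, per_component=4, seed=0):
    """Pick replay examples covering all modes of the embedding space."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    resp = gmm.fit(embeddings).predict_proba(embeddings)   # (N, K)
    chosen = set()
    for k in range(n_components):
        # most representative samples of component k
        chosen.update(np.argsort(resp[:, k])[::-1][:per_component])
    return sorted(chosen)
```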
[375] Decentralized Rank Scheduling for Energy-Constrained Multi-Task Federated Fine-Tuning in Edge-Assisted IoV Networks
Bokeng Zheng, Jianqiang Zhong, Jiayi Liu, Xiaoxi Zhang
Main category: cs.LG
TL;DR: A hierarchical federated fine-tuning framework for IoV systems uses LoRA and a novel UCB-DUAL algorithm to achieve efficient, low-latency multi-task adaptation, improving accuracy and reducing latency.
Details
Motivation: Addressing challenges in IoV systems like client mobility, resource heterogeneity, and intermittent connectivity for efficient multi-task adaptation.
Method: Hierarchical federated fine-tuning with LoRA and a decentralized, energy-aware rank adaptation mechanism (UCB-DUAL algorithm).
Result: Reduces latency by over 24% and improves average accuracy by more than 2.5% compared to baselines.
Conclusion: The proposed framework effectively balances accuracy and efficiency in dynamic IoV scenarios.
Abstract: Federated fine-tuning has emerged as a promising approach for adapting foundation models (FMs) to diverse downstream tasks in edge environments. In Internet of Vehicles (IoV) systems, enabling efficient and low-latency multi-task adaptation is particularly challenging due to client mobility, heterogeneous resources, and intermittent connectivity. This paper proposes a hierarchical federated fine-tuning framework that coordinates roadside units (RSUs) and vehicles to support resource-aware and mobility-resilient learning across dynamic IoV scenarios. Leveraging Low-Rank Adaptation (LoRA), we introduce a decentralized, energy-aware rank adaptation mechanism formulated as a constrained multi-armed bandit problem. A novel UCB-DUAL algorithm is developed to enable adaptive exploration under per-task energy budgets, achieving provable sublinear regret. To evaluate our method, we construct a large-scale IoV simulator based on real-world trajectories, capturing dynamic participation, RSU handoffs, and communication variability. Extensive experiments show that our approach achieves the best accuracy-efficiency trade-off among all baselines, reducing latency by over 24% and improving average accuracy by more than 2.5%.
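At its core the rank scheduler is a constrained bandit: each LoRA rank is an arm, and a client picks the feasible arm with the best upper confidence bound. A sketch that enforces the energy budget by filtering arms; the paper's dual-variable treatment of the per-task budget is omitted here:

```python
import numpy as np

def ucb_pick_rank(avg_reward, counts, t, energy_cost, budget_left, c=1.0):
    """avg_reward/counts: per-rank statistics; returns the chosen arm."""
    feasible = [a for a in range(len(avg_reward))
                if energy_cost[a] <= budget_left]
    if not feasible:                       # fall back to the cheapest rank
        return int(np.argmin(energy_cost))
    ucb = lambda a: avg_reward[a] + c * np.sqrt(np.log(t + 1) / (counts[a] + 1e-9))
    return max(feasible, key=ucb)
```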
[376] Time-Aware and Transition-Semantic Graph Neural Networks for Interpretable Predictive Business Process Monitoring
Fang Wang, Ernesto Damiani
Main category: cs.LG
TL;DR: A unified GNN framework improves PBPM by addressing localized vs. global modeling, introducing time decay attention, and embedding transition semantics, achieving competitive accuracy and interpretability.
Details
Motivation: Existing GNN-based PBPM models lack temporal relevance and transition semantics, limiting their effectiveness.
Method: Proposes prefix-based GCNs and full trace GATs, a time decay attention mechanism, and transition semantics in edge features.
Result: Achieves competitive Top-k accuracy and DL scores on five benchmarks without per-dataset tuning.
Conclusion: The framework offers a robust, generalizable, and explainable solution for next event prediction in PBPM.
Abstract: Predictive Business Process Monitoring (PBPM) aims to forecast future events in ongoing cases based on historical event logs. While Graph Neural Networks (GNNs) are well suited to capture structural dependencies in process data, existing GNN-based PBPM models remain underdeveloped. Most rely either on short prefix subgraphs or global architectures that overlook temporal relevance and transition semantics. We propose a unified, interpretable GNN framework that advances the state of the art along three key axes. First, we compare prefix-based Graph Convolutional Networks(GCNs) and full trace Graph Attention Networks(GATs) to quantify the performance gap between localized and global modeling. Second, we introduce a novel time decay attention mechanism that constructs dynamic, prediction-centered windows, emphasizing temporally relevant history and suppressing noise. Third, we embed transition type semantics into edge features to enable fine grained reasoning over structurally ambiguous traces. Our architecture includes multilevel interpretability modules, offering diverse visualizations of attention behavior. Evaluated on five benchmarks, the proposed models achieve competitive Top-k accuracy and DL scores without per-dataset tuning. By addressing architectural, temporal, and semantic gaps, this work presents a robust, generalizable, and explainable solution for next event prediction in PBPM.
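The time-decay idea is easy to state: penalize attention logits by the elapsed time between an event and the prediction point, so temporally distant history is suppressed. A sketch with an exponential decay form, which is an assumption about the exact parameterization:

```python
import torch
import torch.nn.functional as F

def time_decay_attention(q, K, V, event_times, t_now, gamma=0.1):
    """q: (d,); K, V: (L, d); event_times: (L,) timestamps <= t_now."""
    logits = K @ q / K.size(-1) ** 0.5
    logits = logits - gamma * (t_now - event_times)  # older -> lower weight
    w = F.softmax(logits, dim=0)
    return w @ V, w                                  # context vector, weights
```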
[377] SYNAPSE-G: Bridging Large Language Models and Graph Learning for Rare Event Classification
Sasan Tavakkol, Lin Chen, Max Springer, Abigail Schantz, Blaž Bratanič, Vincent Cohen-Addad, MohammadHossein Bateni
Main category: cs.LG
TL;DR: SYNAPSE-G uses LLMs to generate synthetic data for rare event classification, expands labels via graph-based propagation, and outperforms baselines.
Details
Motivation: Label scarcity for rare events limits ML model training. SYNAPSE-G addresses this by generating synthetic data and expanding labels.
Method: Leverages LLMs for synthetic data, constructs a similarity graph for label propagation, and uses an oracle for final labeling.
Result: Outperforms baselines like nearest neighbor search on SST2 and MHS datasets.
Conclusion: SYNAPSE-G effectively addresses the cold-start problem for rare event classification.
Abstract: Scarcity of labeled data, especially for rare events, hinders training effective machine learning models. This paper proposes SYNAPSE-G (Synthetic Augmentation for Positive Sampling via Expansion on Graphs), a novel pipeline leveraging Large Language Models (LLMs) to generate synthetic training data for rare event classification, addressing the cold-start problem. This synthetic data serve as seeds for semi-supervised label propagation on a similarity graph constructed between the seeds and a large unlabeled dataset. This identifies candidate positive examples, subsequently labeled by an oracle (human or LLM). The expanded dataset then trains/fine-tunes a classifier. We theoretically analyze how the quality (validity and diversity) of the synthetic data impacts the precision and recall of our method. Experiments on the imbalanced SST2 and MHS datasets demonstrate SYNAPSE-G’s effectiveness in finding positive labels, outperforming baselines including nearest neighbor search.
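The propagation step is classic graph semi-supervised learning: scores start at 1 on the LLM-generated synthetic positives and diffuse over the similarity graph; high-scoring unlabeled points become oracle candidates. A generic sketch, where the damping `alpha` and row normalization are assumptions rather than the authors' exact rule:

```python
import numpy as np

def propagate(S, seeds, alpha=0.8, iters=50):
    """S: (n, n) nonnegative similarities over seeds + unlabeled pool;
    seeds: 1.0 on synthetic positives, 0.0 elsewhere."""
    W = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
    f = seeds.astype(float).copy()
    for _ in range(iters):
        f = alpha * (W @ f) + (1 - alpha) * seeds   # diffuse, re-anchor seeds
    return f                                        # positivity score per node
```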
[378] Edge General Intelligence Through World Models and Agentic AI: Fundamentals, Solutions, and Challenges
Changyuan Zhao, Guangyuan Liu, Ruichen Zhang, Yinqiu Liu, Jiacheng Wang, Jiawen Kang, Dusit Niyato, Zan Li, Xuemin Shen, Zhu Han, Sumei Sun, Chau Yuen, Dong In Kim
Main category: cs.LG
TL;DR: The paper explores the role of world models in Edge General Intelligence (EGI), detailing their architecture, applications in edge computing, and future challenges.
Details
Motivation: To bridge the gap in integrating world models into wireless edge computing for autonomous, intelligent systems.
Method: Analyzes architectural foundations of world models (latent representation, dynamics modeling, imagination-based planning) and their applications in EGI scenarios like vehicular networks and IoT.
Result: Highlights how world models enhance optimization under constraints (latency, energy, privacy) and synergize with foundation models and digital twins.
Conclusion: Identifies open challenges (safety, training efficiency, deployment) and provides a roadmap for next-gen autonomous edge systems.
Abstract: Edge General Intelligence (EGI) represents a transformative evolution of edge computing, where distributed agents possess the capability to perceive, reason, and act autonomously across diverse, dynamic environments. Central to this vision are world models, which act as proactive internal simulators that not only predict but also actively imagine future trajectories, reason under uncertainty, and plan multi-step actions with foresight. This proactive nature allows agents to anticipate potential outcomes and optimize decisions ahead of real-world interactions. While prior works in robotics and gaming have showcased the potential of world models, their integration into the wireless edge for EGI remains underexplored. This survey bridges this gap by offering a comprehensive analysis of how world models can empower agentic artificial intelligence (AI) systems at the edge. We first examine the architectural foundations of world models, including latent representation learning, dynamics modeling, and imagination-based planning. Building on these core capabilities, we illustrate their proactive applications across EGI scenarios such as vehicular networks, unmanned aerial vehicle (UAV) networks, the Internet of Things (IoT) systems, and network functions virtualization, thereby highlighting how they can enhance optimization under latency, energy, and privacy constraints. We then explore their synergy with foundation models and digital twins, positioning world models as the cognitive backbone of EGI. Finally, we highlight open challenges, such as safety guarantees, efficient training, and constrained deployment, and outline future research directions. This survey provides both a conceptual foundation and a practical roadmap for realizing the next generation of intelligent, autonomous edge systems.
[379] Online Prediction with Limited Selectivity
Licheng Liu, Mingda Qiao
Main category: cs.LG
TL;DR: The paper introduces Prediction with Limited Selectivity (PLS), a model where forecasters can only predict on a subset of the time horizon, and analyzes optimal prediction error both instance-by-instance and on average.
Details
Motivation: Existing selective prediction models allow forecasters to predict at any time, but this work explores scenarios where prediction is restricted to specific time windows.
Method: The study introduces PLS, analyzes optimal prediction error, and proposes a complexity measure for instance-dependent bounds.
Result: Bounds on optimal error are derived, matching with high probability for randomly-generated PLS instances.
Conclusion: PLS provides a framework for understanding prediction under limited selectivity, with practical implications for forecasting scenarios.
Abstract: Selective prediction [Dru13, QV19] models the scenario where a forecaster freely decides on the prediction window that their forecast spans. Many data statistics can be predicted to a non-trivial error rate without any distributional assumptions or expert advice, yet these results rely on the forecaster being able to predict at any time. We introduce a model of Prediction with Limited Selectivity (PLS) where the forecaster can start the prediction only on a subset of the time horizon. We study the optimal prediction error both on an instance-by-instance basis and via an average-case analysis. We introduce a complexity measure that gives instance-dependent bounds on the optimal error. For a randomly-generated PLS instance, these bounds match with high probability.
[380] Goal Discovery with Causal Capacity for Efficient Reinforcement Learning
Yan Yu, Yaodong Yang, Zhengbo Lu, Chengdong Ma, Wengang Zhou, Houqiang Li
Main category: cs.LG
TL;DR: The paper proposes a Goal Discovery with Causal Capacity (GDCC) framework to enhance reinforcement learning by measuring and utilizing causality between actions and state transitions for efficient exploration.
Details
Motivation: Causal inference is vital for exploration in reinforcement learning, but measuring causality in complex environments is challenging. The paper aims to address this by quantifying causal influence.
Method: The GDCC framework introduces ‘causal capacity’ to measure an agent’s influence on future trajectories. It uses Monte Carlo methods to identify critical decision points (subgoals) in discrete and continuous state spaces.
Result: Empirical results show that states with high causal capacity align with expected subgoals, and GDCC outperforms baselines in success rates.
Conclusion: The GDCC framework effectively leverages causality for directed exploration, improving agent performance in multi-objective tasks.
Abstract: Causal inference is crucial for humans to explore the world, which can be modeled to enable an agent to efficiently explore the environment in reinforcement learning. Existing research indicates that establishing the causality between action and state transition will enhance an agent to reason how a policy affects its future trajectory, thereby promoting directed exploration. However, it is challenging to measure the causality due to its intractability in the vast state-action space of complex scenarios. In this paper, we propose a novel Goal Discovery with Causal Capacity (GDCC) framework for efficient environment exploration. Specifically, we first derive a measurement of causality in state space, i.e., causal capacity, which represents the highest influence of an agent’s behavior on future trajectories. After that, we present a Monte Carlo based method to identify critical points in discrete state space and further optimize this method for continuous high-dimensional environments. Those critical points are used to uncover where the agent makes important decisions in the environment, which are then regarded as our subgoals to guide the agent to make exploration more purposefully and efficiently. Empirical results from multi-objective tasks demonstrate that states with high causal capacity align with our expected subgoals, and our GDCC achieves significant success rate improvements compared to baselines.
[381] Physics- and geometry-aware spatio-spectral graph neural operator for time-independent and time-dependent PDEs
Subhankar Sarkar, Souvik Chakraborty
Main category: cs.LG
TL;DR: The paper introduces a Physics- and Geometry-Aware Spatio-Spectral Graph Neural Operator (πG-Sp²GNO) for solving PDEs efficiently, improving upon existing methods by incorporating geometry awareness and physics-informed learning.
Details
Motivation: Efficiently solving PDEs, especially for complex geometries and limited labeled data, is a critical challenge in science and engineering.
Method: The proposed πG-Sp²GNO enhances Sp²GNO with geometry awareness and physics-informed learning, using a hybrid loss function for time-dependent problems.
Result: The method outperforms state-of-the-art physics-informed neural operator algorithms on benchmark examples involving regular and complex domains.
Conclusion: The πG-Sp²GNO is effective for solving PDEs, demonstrating superior performance in handling geometry variations and time-dependent problems.
Abstract: Solving partial differential equations (PDEs) efficiently and accurately remains a cornerstone challenge in science and engineering, especially for problems involving complex geometries and limited labeled data. We introduce a Physics- and Geometry-Aware Spatio-Spectral Graph Neural Operator ($\pi$G-Sp$^2$GNO) for learning the solution operators of time-independent and time-dependent PDEs. The proposed approach first improves upon the recently developed Sp$^2$GNO by enabling geometry awareness and subsequently exploits the governing physics to learn the underlying solution operator in a simulation-free setup. While the spatio-spectral structure present in the proposed architecture allows multiscale learning, two separate strategies for enabling geometry awareness are introduced in this paper. For time-dependent problems, we also introduce a novel hybrid physics-informed loss function that combines a higher-order time-marching scheme with an upscaled-theory-inspired stochastic projection scheme. This allows accurate integration of the physics information into the loss function. The performance of the proposed approach is illustrated on a number of benchmark examples involving regular and complex domains, variation in geometry during inference, and time-independent and time-dependent problems. The results obtained illustrate the efficacy of the proposed approach as compared to the state-of-the-art physics-informed neural operator algorithms in the literature.
[382] TimeMKG: Knowledge-Infused Causal Reasoning for Multivariate Time Series Modeling
Yifei Sun, Junming Liu, Ding Wang, Yirong Chen, Xuefeng Yan
Main category: cs.LG
TL;DR: TimeMKG integrates variable semantics with numerical data in time series modeling using knowledge graphs and large language models, improving performance and interpretability.
Details
Motivation: Traditional time series models ignore semantic information in variable names, missing critical domain knowledge for robust modeling.
Method: TimeMKG uses large language models to interpret variable semantics, constructs knowledge graphs, and employs a dual-modality encoder with cross-modality attention.
Result: Experiments show improved predictive performance and generalization by incorporating variable-level knowledge.
Conclusion: TimeMKG bridges the gap between low-level signal processing and knowledge-informed inference, enhancing interpretability and performance.
Abstract: Multivariate time series data typically comprises two distinct modalities: variable semantics and sampled numerical observations. Traditional time series models treat variables as anonymous statistical signals, overlooking the rich semantic information embedded in variable names and data descriptions. However, these textual descriptors often encode critical domain knowledge that is essential for robust and interpretable modeling. Here we present TimeMKG, a multimodal causal reasoning framework that elevates time series modeling from low-level signal processing to knowledge-informed inference. TimeMKG employs large language models to interpret variable semantics and constructs structured Multivariate Knowledge Graphs that capture inter-variable relationships. A dual-modality encoder separately models the semantic prompts, generated from knowledge graph triplets, and the statistical patterns from historical time series. Cross-modality attention aligns and fuses these representations at the variable level, injecting causal priors into downstream tasks such as forecasting and classification, providing explicit and interpretable priors to guide model reasoning. Experiments on diverse datasets demonstrate that incorporating variable-level knowledge significantly improves both predictive performance and generalization.
[383] Thermal Tracks: A Gaussian process-based framework for universal melting curve analysis enabling unconstrained hit identification in thermal proteome profiling experiments
Johannes F. Hevler, Shivam Verma, Mirat Soijtra, Carolyn R. Bertozzi
Main category: cs.LG
TL;DR: Thermal Tracks is a Python framework using Gaussian Process models to analyze protein thermal stability data, overcoming limitations of traditional TPP methods.
Details
Motivation: Existing TPP methods assume sigmoidal melting curves and rely on empirical null distributions, limiting significant hits. Thermal Tracks addresses these constraints for broader applicability.
Method: Uses Gaussian Process models with squared-exponential kernels to flexibly model melting curves and generate unbiased null distributions.
Result: Enables analysis of proteome-wide perturbations and unconventional melting profiles, outperforming conventional TPP methods.
Conclusion: Thermal Tracks is a versatile, accessible tool for proteome-wide thermal profiling, available on GitHub.
Abstract: Thermal Tracks is a Python-based statistical framework for analyzing protein thermal stability data that overcomes key limitations of existing thermal proteome profiling (TPP) workflows. Unlike standard approaches that assume sigmoidal melting curves and are constrained by empirical null distributions (limiting significant hits to approximately 5% of data), Thermal Tracks uses Gaussian Process (GP) models with squared-exponential kernels to flexibly model any melting curve shape while generating unbiased null distributions through kernel priors. This framework is particularly valuable for analyzing proteome-wide perturbations that significantly alter protein thermal stability, such as pathway inhibitions, genetic modifications, or environmental stresses, where conventional TPP methods may miss biologically relevant changes due to their statistical constraints. Furthermore, Thermal Tracks excels at analyzing proteins with unconventional melting profiles, including phase-separating proteins and membrane proteins, which often exhibit complex, non-sigmoidal thermal stability behaviors. Thermal Tracks is freely available from GitHub and is implemented in Python, providing an accessible and flexible tool for proteome-wide thermal profiling studies.
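The core modeling choice, a GP with a squared-exponential kernel over temperature, takes only a few lines with scikit-learn. The toy melting curve, length scale, and noise level below are illustrative; real inputs would be TPP solubility fold-changes:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

temps = np.array([37, 41, 45, 49, 53, 57, 61, 65.0]).reshape(-1, 1)
soluble = np.array([1.0, 0.98, 0.9, 0.7, 0.4, 0.2, 0.1, 0.05])

# Squared-exponential kernel: no sigmoidal shape is imposed on the curve.
gp = GaussianProcessRegressor(kernel=RBF(5.0) + WhiteKernel(1e-3))
gp.fit(temps, soluble)
grid = np.linspace(37, 65, 50).reshape(-1, 1)
mean, sd = gp.predict(grid, return_std=True)   # posterior melting curve
```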
[384] Global Convergence Analysis of Vanilla Gradient Descent for Asymmetric Matrix Completion
Xu Zhang, Shuo Chen, Jinsheng Li, Xiangying Pang, Maoguo Gong
Main category: cs.LG
TL;DR: The paper shows that vanilla gradient descent without regularization can achieve linear convergence for asymmetric low-rank matrix completion, with empirical and theoretical support.
Details
Motivation: To challenge the necessity of regularization in gradient descent for matrix completion by demonstrating its implicit regularization properties.
Method: Uses vanilla gradient descent with spectral initialization and the leave-one-out technique for theoretical analysis.
Result: Proves linear convergence rate and shows implicit regularization, with lower computational cost in experiments.
Conclusion: Regularization may not be necessary for gradient descent in matrix completion, as vanilla methods perform comparably.
Abstract: This paper investigates the asymmetric low-rank matrix completion problem, which can be formulated as an unconstrained non-convex optimization problem with a nonlinear least-squares objective function, and is solved via gradient descent methods. Previous gradient descent approaches typically incorporate regularization terms into the objective function to guarantee convergence. However, numerical experiments and theoretical analysis of the gradient flow both demonstrate that the elimination of regularization terms in gradient descent algorithms does not adversely affect convergence performance. By introducing the leave-one-out technique, we inductively prove that the vanilla gradient descent with spectral initialization achieves a linear convergence rate with high probability. Moreover, we demonstrate that the balancing regularization term exhibits a small norm during iterations, which reveals the implicit regularization property of gradient descent. Empirical results show that our algorithm has a lower computational cost while maintaining comparable completion performance compared to other gradient descent algorithms.
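The algorithm analyzed is just plain factored gradient descent with spectral initialization and no balancing term. A compact sketch; tying the step size to the top singular value is a common stability choice assumed here, not a detail taken from the paper:

```python
import numpy as np

def complete(M_obs, mask, r, iters=500):
    """Vanilla GD on f(X, Y) = 0.5 * ||P_Omega(X Y^T - M)||_F^2."""
    p = mask.mean()                                 # observation rate
    U, s, Vt = np.linalg.svd(M_obs * mask / p, full_matrices=False)
    X = U[:, :r] * np.sqrt(s[:r])                   # spectral initialization
    Y = Vt[:r].T * np.sqrt(s[:r])
    eta = 0.5 / s[0]                                # step ~ 1/sigma_max
    for _ in range(iters):
        R = (X @ Y.T - M_obs) * mask
        X, Y = X - eta * R @ Y, Y - eta * R.T @ X   # no balancing regularizer
    return X @ Y.T
```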
[385] Temporal Anchoring in Deepening Embedding Spaces: Event-Indexed Projections, Drift, Convergence, and an Internal Computational Architecture
Faruk Alpay, Bugra Kilictas, Hamdi Alakkad
Main category: cs.LG
TL;DR: The paper introduces an operator-theoretic framework for temporal anchoring in embedding spaces, with proofs for contraction lemmas, convergence theorems, and robustness. It formalizes a Manuscript Computer and analyzes attention layers, proving softmax Lipschitz properties.
Details
Motivation: To develop a rigorous mathematical framework for temporal anchoring in embedding spaces and provide proofs for key theoretical results, including convergence and robustness.
Method: Uses operator-theoretic techniques, including drift maps, event-indexed blocks, and affine projections. Formalizes a Manuscript Computer and analyzes attention layers.
Result: Proves a variable-block contraction lemma, drift-projection convergence theorem, and ontological convergence. Shows softmax is 1/2-Lipschitz and derives layer-contraction conditions.
Conclusion: The framework is robust and theoretically sound, with applications in embedding spaces and attention mechanisms, supported by rigorous proofs.
Abstract: We develop an operator-theoretic framework for temporal anchoring in embedding spaces, modeled as drift maps interleaved with event-indexed blocks culminating in affine projections. We provide complete proofs for a variable-block contraction lemma (products of Lipschitz factors), a drift–projection convergence theorem with explicit uniform-gap envelopes, and ontological convergence under nested affine anchors with a robustness variant. We formalize an internal Manuscript Computer (MC) whose computations are defined purely by these operators and prove a rigorous finite-run equivalence theorem (with perturbation bounds). For attention layers, we give a self-contained proof that softmax is $1/2$-Lipschitz in $\ell_2$ and derive sufficient layer-contraction conditions (orthogonal/non-orthogonal heads). All floats are placed exactly where written; the manuscript uses only in-paper pseudocode and appendix figures.
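The softmax bound is easy to probe numerically: for random pairs of logit vectors, the ratio $\|\mathrm{softmax}(x) - \mathrm{softmax}(y)\|_2 / \|x - y\|_2$ should never exceed the proven constant $1/2$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Empirically estimate the l2-Lipschitz constant of softmax.
rng = np.random.default_rng(0)
worst = 0.0
for _ in range(10_000):
    x, y = rng.standard_normal(8), rng.standard_normal(8)
    worst = max(worst, np.linalg.norm(softmax(x) - softmax(y))
                       / np.linalg.norm(x - y))
print(f"largest observed ratio: {worst:.3f} (bound: 0.5)")
```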
[386] Combating Noisy Labels via Dynamic Connection Masking
Xinlei Zhang, Fan Liu, Chuanyi Zhang, Fan Cheng, Yuhui Zheng
Main category: cs.LG
TL;DR: Proposes Dynamic Connection Masking (DCM) for MLPs and KANs to improve robustness against noisy labels by adaptively masking less important edges during training. Outperforms SOTA methods and explores KANs’ superior noise robustness.
Details
Motivation: Noisy labels degrade model performance, and existing solutions focus on loss functions and sample selection, neglecting architectural regularization.
Method: Introduces DCM to evaluate and mask less important edges in MLPs and KANs, reducing gradient error. Integrates with existing noise-robust techniques.
Result: DCM consistently outperforms SOTA methods on synthetic and real-world benchmarks. KANs show better noise robustness than MLPs.
Conclusion: DCM enhances model robustness against noisy labels, with KANs proving superior in noisy scenarios. Code will be released.
Abstract: Noisy labels are inevitable in real-world scenarios. Due to the strong capacity of deep neural networks to memorize corrupted labels, these noisy labels can cause significant performance degradation. Existing research on mitigating the negative effects of noisy labels has mainly focused on robust loss functions and sample selection, with comparatively limited exploration of regularization in model architecture. Inspired by the sparsity regularization used in Kolmogorov-Arnold Networks (KANs), we propose a Dynamic Connection Masking (DCM) mechanism for both Multi-Layer Perceptron Networks (MLPs) and KANs to enhance the robustness of classifiers against noisy labels. The mechanism can adaptively mask less important edges during training by evaluating their information-carrying capacity. Through theoretical analysis, we demonstrate its efficiency in reducing gradient error. Our approach can be seamlessly integrated into various noise-robust training methods to build more robust deep networks, including robust loss functions, sample selection strategies, and regularization techniques. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our method consistently outperforms state-of-the-art (SOTA) approaches. Furthermore, we are also the first to investigate KANs as classifiers against noisy labels, revealing their superior noise robustness over MLPs in real-world noisy scenarios. Our code will soon be publicly available.
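The masking mechanism can be sketched with a simple per-connection importance score; the paper measures information-carrying capacity, for which |weight| * |gradient| below is a crude stand-in, recomputed periodically during training:

```python
import torch
import torch.nn as nn

def dcm_step(linear: nn.Linear, sparsity=0.3):
    """Zero the bottom `sparsity` fraction of edges by importance score."""
    w = linear.weight
    grad = w.grad.abs() if w.grad is not None else torch.ones_like(w)
    score = w.abs() * grad                       # stand-in importance measure
    k = max(int(score.numel() * sparsity), 1)
    thresh = score.flatten().kthvalue(k).values  # k-th smallest score
    with torch.no_grad():
        w.mul_((score > thresh).float())         # mask less important edges
    return linear
```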
[387] GraphTreeGen: Subtree-Centric Approach to Efficient and Supervised Graph Generation
Yitong Luo, Islem Rekik
Main category: cs.LG
TL;DR: GraphTreeGen (GTG) is a subtree-centric generative framework for efficient and accurate connectome synthesis, addressing limitations of current models like blurred local motifs and high computational costs.
Details
Motivation: Brain connectomes are costly to acquire, motivating generative approaches. Current models have limitations like poor local detail, reliance on node attributes, and high memory demands.
Method: GTG decomposes connectomes into entropy-guided k-hop trees, encodes them with a shared GCN, and uses a bipartite message-passing layer and dual-branch decoder to reconstruct adjacency matrices.
Result: GTG outperforms baselines in self-supervised tasks, achieves higher structural fidelity, and uses less memory.
Conclusion: GTG offers a scalable, accurate solution for connectome synthesis, with potential for extensions like super-resolution and cross-modality synthesis.
Abstract: Brain connectomes, representing neural connectivity as graphs, are crucial for understanding brain organization but costly and time-consuming to acquire, motivating generative approaches. Recent advances in graph generative modeling offer a data-driven alternative, enabling synthetic connectome generation and reducing dependence on large neuroimaging datasets. However, current models face key limitations: (i) compressing the whole graph into a single latent code (e.g., VGAEs) blurs fine-grained local motifs; (ii) relying on rich node attributes rarely available in connectomes reduces reconstruction quality; (iii) edge-centric models emphasize topology but overlook accurate edge-weight prediction, harming quantitative fidelity; and (iv) computationally expensive designs (e.g., edge-conditioned convolutions) impose high memory demands, limiting scalability. We propose GraphTreeGen (GTG), a subtree-centric generative framework for efficient, accurate connectome synthesis. GTG decomposes each connectome into entropy-guided k-hop trees capturing informative local structure, encoded by a shared GCN. A bipartite message-passing layer fuses subtree embeddings with global node features, while a dual-branch decoder jointly predicts edge existence and weights to reconstruct the adjacency matrix. GTG outperforms state-of-the-art baselines in self-supervised tasks and remains competitive in supervised settings, delivering higher structural fidelity and more precise weights with far less memory. Its modular design enables extensions to connectome super-resolution and cross-modality synthesis. Code: https://github.com/basiralab/GTG/
[388] Improving ARDS Diagnosis Through Context-Aware Concept Bottleneck Models
Anish Narain, Ritam Majumdar, Nikita Narayanan, Dominic Marshall, Sonali Parbhoo
Main category: cs.LG
TL;DR: The paper explores using large clinical datasets for disease understanding and therapy personalization, addressing incompleteness and interpretability issues with AI tools. It enhances Concept Bottleneck Models (CBMs) by incorporating clinical notes via a Large Language Model (LLM), improving performance and concept comprehensiveness for ARDS identification.
Details
Motivation: To address the limitations of existing AI tools in labeling incomplete clinical datasets and improving interpretability, especially for complex tasks like ARDS identification.
Method: Uses a Large Language Model (LLM) to process clinical notes and generate additional concepts, enhancing Concept Bottleneck Models (CBMs).
Result: Achieves a 10% performance gain over existing methods and learns more comprehensive concepts, reducing reliance on spurious shortcuts.
Conclusion: Incorporating contextual information from clinical notes via LLMs significantly improves CBM performance and interpretability for complex clinical tasks like ARDS identification.
Abstract: Large, publicly available clinical datasets have emerged as a novel resource for understanding disease heterogeneity and to explore personalization of therapy. These datasets are derived from data not originally collected for research purposes and, as a result, are often incomplete and lack critical labels. Many AI tools have been developed to retrospectively label these datasets, such as by performing disease classification; however, they often suffer from limited interpretability. Previous work has attempted to explain predictions using Concept Bottleneck Models (CBMs), which learn interpretable concepts that map to higher-level clinical ideas, facilitating human evaluation. However, these models often experience performance limitations when the concepts fail to adequately explain or characterize the task. We use the identification of Acute Respiratory Distress Syndrome (ARDS) as a challenging test case to demonstrate the value of incorporating contextual information from clinical notes to improve CBM performance. Our approach leverages a Large Language Model (LLM) to process clinical notes and generate additional concepts, resulting in a 10% performance gain over existing methods. Additionally, it facilitates the learning of more comprehensive concepts, thereby reducing the risk of information leakage and reliance on spurious shortcuts, thus improving the characterization of ARDS.
[389] Generative Modeling with Multi-Instance Reward Learning for E-commerce Creative Optimization
Qiaolei Gu, Yu Li, DingYi Zeng, Lu Wang, Ming Pang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao
Main category: cs.LG
TL;DR: The paper introduces GenCO, a framework combining generative modeling and multi-instance reward learning to optimize creative combinations in e-commerce ads, improving revenue.
Details
Motivation: Existing methods evaluate creative elements individually, missing the complexity of combinations, leading to suboptimal ad performance.
Method: GenCO uses a two-stage approach: a generative model produces diverse combinations, optimized via reinforcement learning, and a multi-instance learning model attributes rewards to individual elements.
Result: Deployed on a major e-commerce platform, GenCO significantly boosted advertising revenue.
Conclusion: The framework effectively navigates the creative combination space and is practical for real-world applications, supported by a released dataset.
Abstract: In e-commerce advertising, selecting the most compelling combination of creative elements – such as titles, images, and highlights – is critical for capturing user attention and driving conversions. However, existing methods often evaluate creative components individually, failing to navigate the exponentially large search space of possible combinations. To address this challenge, we propose a novel framework named GenCO that integrates generative modeling with multi-instance reward learning. Our unified two-stage architecture first employs a generative model to efficiently produce a diverse set of creative combinations. This generative process is optimized with reinforcement learning, enabling the model to effectively explore and refine its selections. Next, to overcome the challenge of sparse user feedback, a multi-instance learning model attributes combination-level rewards, such as clicks, to the individual creative elements. This allows the reward model to provide a more accurate feedback signal, which in turn guides the generative model toward creating more effective combinations. Deployed on a leading e-commerce platform, our approach has significantly increased advertising revenue, demonstrating its practical value. Additionally, we are releasing a large-scale industrial dataset to facilitate further research in this important domain.
[390] HKT: A Biologically Inspired Framework for Modular Hereditary Knowledge Transfer in Neural Networks
Yanick Chistian Tchenko, Felix Mohr, Hicham Hadj Abdelkader, Hedi Tabia
Main category: cs.LG
TL;DR: HKT is a biologically inspired framework for transferring task-relevant knowledge from large parent models to smaller child models, outperforming standard distillation.
Details
Motivation: To optimize small, deployable models by enhancing their capabilities through structured knowledge inheritance, addressing inefficiencies in large models.
Method: HKT uses a three-stage process (Extraction, Transfer, Mixture) with a Genetic Attention mechanism for selective feature transfer, inspired by biological inheritance.
Result: HKT improves child model performance across vision tasks (optical flow, classification, segmentation) while maintaining compactness, outperforming standard distillation.
Conclusion: HKT offers a scalable, interpretable solution for deploying high-performance neural networks in resource-constrained environments.
Abstract: A prevailing trend in neural network research suggests that model performance improves with increasing depth and capacity - often at the cost of integrability and efficiency. In this paper, we propose a strategy to optimize small, deployable models by enhancing their capabilities through structured knowledge inheritance. We introduce Hereditary Knowledge Transfer (HKT), a biologically inspired framework for modular and selective transfer of task-relevant features from a larger, pretrained parent network to a smaller child model. Unlike standard knowledge distillation, which enforces uniform imitation of teacher outputs, HKT draws inspiration from biological inheritance mechanisms - such as memory RNA transfer in planarians - to guide a multi-stage process of feature transfer. Neural network blocks are treated as functional carriers, and knowledge is transmitted through three biologically motivated components: Extraction, Transfer, and Mixture (ETM). A novel Genetic Attention (GA) mechanism governs the integration of inherited and native representations, ensuring both alignment and selectivity. We evaluate HKT across diverse vision tasks, including optical flow (Sintel, KITTI), image classification (CIFAR-10), and semantic segmentation (LiTS), demonstrating that it significantly improves child model performance while preserving its compactness. The results show that HKT consistently outperforms conventional distillation approaches, offering a general-purpose, interpretable, and scalable solution for deploying high-performance neural networks in resource-constrained environments.
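One plausible reading of the Genetic Attention idea is a learned per-channel gate between inherited (parent) and native (child) features; the sketch below shows that gating pattern only, with structure and names as illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class GeneticAttentionGate(nn.Module):
    """Toy gate in the spirit of HKT's Genetic Attention: learn, per channel,
    how much inherited (parent) feature to mix into the child's native one."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.Sigmoid(),  # per-channel mixing coefficient in [0, 1]
        )

    def forward(self, child_feat: torch.Tensor, parent_feat: torch.Tensor):
        # child_feat, parent_feat: (batch, channels); parent projected to child width
        g = self.gate(torch.cat([child_feat, parent_feat], dim=-1))
        return g * parent_feat + (1.0 - g) * child_feat  # selective inheritance

gate = GeneticAttentionGate(channels=64)
mixed = gate(torch.randn(2, 64), torch.randn(2, 64))
print(mixed.shape)  # torch.Size([2, 64])
```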
[391] A Machine Learning Approach to Predict Biological Age and its Longitudinal Drivers
Nazira Dunbayeva, Yulong Li, Yutong Xie, Imran Razzak
Main category: cs.LG
TL;DR: A machine learning pipeline predicts biological age more accurately by incorporating longitudinal biomarker changes, outperforming static models.
Details
Motivation: To improve aging trajectory predictions by capturing dynamic biomarker changes over time, addressing limitations of static models.
Method: Developed a LightGBM model using longitudinal biomarker data (2019-2022), with engineered slope features to capture biomarker rate of change.
Result: Model achieved high accuracy (R²=0.515 for males, R²=0.498 for females), outperforming linear and other tree-based models.
Conclusion: Dynamic biomarker tracking is crucial for predicting biological age, enabling personalized interventions for age-related diseases.
Abstract: Predicting an individual’s aging trajectory is a central challenge in preventative medicine and bioinformatics. While machine learning models can predict chronological age from biomarkers, they often fail to capture the dynamic, longitudinal nature of the aging process. In this work, we developed and validated a machine learning pipeline to predict age using a longitudinal cohort with data from two distinct time periods (2019-2020 and 2021-2022). We demonstrate that a model using only static, cross-sectional biomarkers has limited predictive power when generalizing to future time points. However, by engineering novel features that explicitly capture the rate of change (slope) of key biomarkers over time, we significantly improved model performance. Our final LightGBM model, trained on the initial wave of data, successfully predicted age in the subsequent wave with high accuracy ($R^2 = 0.515$ for males, $R^2 = 0.498$ for females), significantly outperforming both traditional linear models and other tree-based ensembles. SHAP analysis of our successful model revealed that the engineered slope features were among the most important predictors, highlighting that an individual’s health trajectory, not just their static health snapshot, is a key determinant of biological age. Our framework paves the way for clinical tools that dynamically track patient health trajectories, enabling early intervention and personalized prevention strategies for age-related diseases.
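The core feature-engineering step is simple enough to sketch: compute the per-biomarker rate of change between waves and add it to the feature table before fitting a gradient-boosted model. Column names, the two-wave layout, and the synthetic data are assumptions; this is not the authors' pipeline.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb  # assumes the lightgbm package is installed

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "glucose_2019": rng.normal(5.5, 0.8, n),
    "glucose_2021": rng.normal(5.7, 0.9, n),
    "crp_2019": rng.normal(2.0, 1.0, n),
    "crp_2021": rng.normal(2.2, 1.1, n),
    "age": rng.uniform(30, 80, n),
})

# Slope features: rate of change of each biomarker between the two waves.
years_between_waves = 2.0
for marker in ["glucose", "crp"]:
    df[f"{marker}_slope"] = (df[f"{marker}_2021"] - df[f"{marker}_2019"]) / years_between_waves

X = df.drop(columns=["age"])
model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X, df["age"])
```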
[392] $μ$-Parametrization for Mixture of Experts
Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski
Main category: cs.LG
TL;DR: The paper derives a $\mu$-Parameterization ($\mu$P) for Mixture-of-Experts models, with theoretical guarantees for feature learning across model widths in both the router and experts, validated empirically.
Details
Motivation: $\mu$Transfer and Mixture-of-Experts are both central to large-scale LLM training, yet their intersection has remained unexplored.
Method: Derives a $\mu$-Parameterization ($\mu$P) for MoE with feature-learning guarantees across model widths for both the router and experts.
Result: The parameterization is validated empirically, including an investigation of how scaling the number of experts and granularity affects the optimal learning rate.
Conclusion: $\mu$P extends to MoE, enabling principled hyperparameter transfer across model widths.
Abstract: Recent years have seen growing interest in and adoption of LLMs, with $\mu$Transfer becoming a key technique for tuning hyperparameters in large-scale training. Meanwhile, Mixture-of-Experts (MoE) has emerged as a leading architecture in extremely large models. However, the intersection of these two advancements has remained unexplored. In this work, we derive a $\mu$-Parameterization ($\mu$P) for MoE, providing theoretical guarantees for feature learning across model widths in both the router and experts. We empirically validate our parameterization and further investigate how scaling the number of experts and granularity affects the optimal learning rate.
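For intuition, a toy numeric reading of $\mu$P-style width scaling applied to an expert layer: initialization variance and (Adam) learning rates for hidden weights shrink with width, so a base learning rate tuned at small width transfers to larger models. The exponents below follow the standard $\mu$P prescription for hidden layers; applying them uniformly to the router and experts is an assumption, not the paper's derivation.

```python
# Toy muP-style scaling table for hidden/expert weights.
base_width, base_lr = 256, 3e-4
for width in (256, 1024, 4096):
    mult = width / base_width
    init_std = (1.0 / width) ** 0.5   # weight init std ~ 1/sqrt(fan_in)
    hidden_lr = base_lr / mult        # Adam LR for hidden weights ~ 1/width
    print(f"width {width:5d}: init_std={init_std:.4f}, lr={hidden_lr:.2e}")
```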
[393] TriForecaster: A Mixture of Experts Framework for Multi-Region Electric Load Forecasting with Tri-dimensional Specialization
Zhaoyang Zhu, Zhipeng Zeng, Qiming Chen, Linxiao Yang, Peiyuan Liu, Weiqi Chen, Liang Sun
Main category: cs.LG
TL;DR: The paper introduces TriForecaster, a framework for Multi-Region Electric Load Forecasting (MRELF), addressing regional, contextual, and temporal variations using Mixture of Experts (MoE) and Multi-Task Learning (MTL). It reduces forecast errors by 22.4% and is deployed in eastern China.
Details
Motivation: The need for accurate short-term load forecasting across multiple sub-regions due to similar load patterns in cities and the availability of high-quality smart grid data.
Method: Proposes TriForecaster with RegionMixer and CTSpecializer layers, leveraging MoE and MTL to handle regional, contextual, and temporal variations.
Result: Outperforms state-of-the-art models with a 22.4% reduction in forecast error, demonstrated on four real-world datasets and deployed in eastern China.
Conclusion: TriForecaster is flexible, broadly applicable, and practically useful, as shown by its deployment for city-level forecasts in eastern China.
Abstract: Electric load forecasting is pivotal for power system operation, planning and decision-making. The rise of smart grids and meters has provided more detailed and high-quality load data at multiple levels of granularity, from home to bus and cities. Motivated by similar patterns of loads across different cities in a province in eastern China, in this paper we focus on the Multi-Region Electric Load Forecasting (MRELF) problem, targeting accurate short-term load forecasting for multiple sub-regions within a large region. We identify three challenges for MRELF, including regional variation, contextual variation, and temporal variation. To address them, we propose TriForecaster, a new framework leveraging the Mixture of Experts (MoE) approach within a Multi-Task Learning (MTL) paradigm to overcome these challenges. TriForecaster features RegionMixer and Context-Time Specializer (CTSpecializer) layers, enabling dynamic cooperation and specialization of expert models across regional, contextual, and temporal dimensions. Based on evaluation on four real-world MRELF datasets with varied granularity, TriForecaster outperforms state-of-the-art models by achieving an average forecast error reduction of 22.4%, thereby demonstrating its flexibility and broad applicability. In particular, the deployment of TriForecaster on the eForecaster platform in eastern China exemplifies its practical utility, effectively providing city-level, short-term load forecasts for 17 cities, supporting a population exceeding 110 million and daily electricity usage over 100 gigawatt-hours.
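To make the expert-specialization idea concrete, here is a minimal mixture-of-experts forecaster where a gate conditioned on region/context features softly combines expert predictions. This loosely mirrors the RegionMixer/CTSpecializer division of labor; the structure and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RegionMoE(nn.Module):
    """Toy MoE forecaster: a gate over region/context/time encodings softly
    combines expert heads, letting experts specialize per region and context."""

    def __init__(self, in_dim: int, n_experts: int, horizon: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, horizon))
             for _ in range(n_experts)]
        )
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, x: torch.Tensor):
        # x: (batch, in_dim) = lagged loads + region/context/time encodings
        weights = torch.softmax(self.gate(x), dim=-1)             # (batch, E)
        preds = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, horizon)
        return (weights.unsqueeze(-1) * preds).sum(dim=1)         # (batch, horizon)

model = RegionMoE(in_dim=48, n_experts=4, horizon=24)
print(model(torch.randn(16, 48)).shape)  # torch.Size([16, 24])
```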
[394] Prototype Training with Dual Pseudo-Inverse and Optimized Hidden Activations
Mauro Tucci
Main category: cs.LG
TL;DR: Proto-PINV+H is a fast training method combining closed-form weight computation with gradient-based optimization of synthetic inputs, soft labels, and hidden activations, achieving high accuracy on MNIST and Fashion-MNIST.
Details
Motivation: To shift trainable degrees of freedom from weight space to data/activation space for faster and more efficient training.
Method: Combines closed-form weight computation via ridge-regularized pseudo-inverse solves with gradient-based updates of prototypes using Adam.
Result: Achieves 97.8% and 89.3% test accuracy on MNIST and Fashion-MNIST, respectively, in 3.9s–4.5s with 130k parameters and 250 epochs.
Conclusion: Proto-PINV+H offers superior accuracy-speed-size trade-offs compared to ELM, random-feature ridge, and shallow MLPs.
Abstract: We present Proto-PINV+H, a fast training paradigm that combines closed-form weight computation with gradient-based optimisation of a small set of synthetic inputs, soft labels, and, crucially, hidden activations. At each iteration we recompute all weight matrices in closed form via two (or more) ridge-regularised pseudo-inverse solves, while updating only the prototypes with Adam. The trainable degrees of freedom are thus shifted from weight space to data/activation space. On MNIST (60k train, 10k test) and Fashion-MNIST (60k train, 10k test), our method reaches 97.8% and 89.3% test accuracy on the official 10k test sets, respectively, in 3.9s–4.5s using approximately 130k trainable parameters and only 250 epochs on an RTX 5060 (16GB). We provide a multi-layer extension (optimised activations at each hidden stage), learnable ridge parameters, optional PCA/PLS projections, and theory linking the condition number of prototype matrices to generalisation. The approach yields favourable accuracy–speed–size trade-offs against ELM, random-feature ridge, and shallow MLPs trained by back-propagation.
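The closed-form half of the method is a ridge solve, which is easy to sketch: given prototype inputs X, optimized hidden activations A, and soft labels Y, both weight matrices come from pseudo-inverse solves, and only X, A, Y would be moved by a gradient optimizer. The data below is synthetic and the two-layer wiring is an illustrative reading of the setup, not the authors' code.

```python
import numpy as np

def ridge_solve(H: np.ndarray, T: np.ndarray, lam: float = 1e-2) -> np.ndarray:
    """Closed-form ridge solution W = (H^T H + lam I)^{-1} H^T T."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ T)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 784))           # synthetic prototype inputs
A = np.tanh(rng.normal(size=(200, 128)))  # optimized hidden activations
Y = rng.normal(size=(200, 10))            # soft labels

# Fit X -> pre-activations (inverting the tanh) and A -> labels, in closed form.
W1 = ridge_solve(X, np.arctanh(np.clip(A, -0.999, 0.999)))
W2 = ridge_solve(A, Y)
print(W1.shape, W2.shape)  # (784, 128) (128, 10)
```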
[395] Bayesian autoregression to optimize temporal Matérn kernel Gaussian process hyperparameters
Wouter M. Kouw
Main category: cs.LG
TL;DR: A method for optimizing Matérn kernel temporal Gaussian processes via recursive Bayesian estimation, outperforming marginal likelihood maximization and Hamiltonian Monte Carlo in runtime and accuracy.
Details
Motivation: To improve the optimization of hyperparameters in Gaussian processes for better performance in probabilistic numerics.
Method: Casts the optimization problem as a recursive Bayesian estimation for autoregressive model parameters.
Result: Outperforms marginal likelihood maximization and Hamiltonian Monte Carlo in runtime and root mean square error.
Conclusion: The proposed recursive Bayesian estimation method is efficient and effective for optimizing Gaussian process hyperparameters.
Abstract: Gaussian processes are important models in the field of probabilistic numerics. We present a procedure for optimizing Matérn kernel temporal Gaussian processes with respect to the kernel covariance function’s hyperparameters. It is based on casting the optimization problem as a recursive Bayesian estimation procedure for the parameters of an autoregressive model. We demonstrate that the proposed procedure outperforms maximizing the marginal likelihood as well as Hamiltonian Monte Carlo sampling, both in terms of runtime and ultimate root mean square error in Gaussian process regression.
[396] Feature Impact Analysis on Top Long-Jump Performances with Quantile Random Forest and Explainable AI Techniques
Qi Gan, Stephan Clémençon, Mounîm A. El-Yacoubi, Sao Mai Nguyen, Eric Fenaux, Ons Jelassi
Main category: cs.LG
TL;DR: The study uses machine learning to analyze biomechanical features in long jump competitions, identifying key factors like knee angle for males and landing pose for females, alongside velocity, for top performance.
Details
Motivation: Traditional methods struggle to analyze complex biomechanical relationships in sports. Machine learning offers a data-driven approach to identify key performance features.
Method: Quantile regression models analyze expert-proposed biomechanical features, with SHAP, PDPs, and ICE plots for interpretation, focusing on elite-level jumps.
Result: Key findings: knee angle (>169°) for males, landing pose and approach technique for females, alongside velocity, are critical for top 10% performance.
Conclusion: The study provides a framework for analyzing biomechanical features’ impact on athletic performance, emphasizing top-tier events.
Abstract: Biomechanical features have become important indicators for evaluating athletes’ techniques. Traditionally, experts propose significant features and evaluate them using physics equations. However, the complexity of the human body and its movements makes it challenging to explicitly analyze the relationships between some features and athletes’ final performance. With advancements in modern machine learning and statistics, data analytics methods have gained increasing importance in sports analytics. In this study, we leverage machine learning models to analyze expert-proposed biomechanical features from the finals of long jump competitions in the World Championships. The objectives of the analysis include identifying the most important features contributing to top-performing jumps and exploring the combined effects of these key features. Using quantile regression, we model the relationship between the biomechanical feature set and the target variable (effective distance), with a particular focus on elite-level jumps. To interpret the model, we apply SHapley Additive exPlanations (SHAP) alongside Partial Dependence Plots (PDPs) and Individual Conditional Expectation (ICE) plots. The findings reveal that, beyond the well-documented velocity-related features, specific technical aspects also play a pivotal role. For male athletes, the angle of the knee of the supporting leg before take-off is identified as a key factor for achieving top 10% performance in our dataset, with angles greater than 169° contributing significantly to jump performance. In contrast, for female athletes, the landing pose and approach step technique emerge as the most critical features influencing top 10% performances, alongside velocity. This study establishes a framework for analyzing the impact of various features on athletic performance, with a particular emphasis on top-performing events.
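A toy version of this analysis pipeline fits a quantile regressor focused on the upper tail (elite jumps), then inspects feature attributions with SHAP. The feature names, synthetic data, and the 90th-percentile choice are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
import shap  # assumes the shap package is installed

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))  # [takeoff_velocity, knee_angle, landing_pose_score]
y = 6.0 + 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.2, size=300)

# Quantile loss targets the top of the conditional distribution (elite jumps).
model = GradientBoostingRegressor(loss="quantile", alpha=0.9)
model.fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(shap_values.shape)  # (300, 3): per-jump, per-feature attributions
```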
[397] Provable In-Context Vector Arithmetic via Retrieving Task Concepts
Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Qingfu Zhang, Hau-San Wong, Taiji Suzuki
Main category: cs.LG
TL;DR: The paper proposes a theoretical framework for in-context learning (ICL) in LLMs, explaining how transformers perform factual-recall tasks via vector arithmetic and demonstrating strong generalization.
Details
Motivation: Despite empirical evidence of latent task vectors and the role of Question-Answer data in ICL, a theoretical explanation for factual-recall capabilities is lacking.
Method: The authors develop an optimization theory for nonlinear residual transformers trained via gradient descent, proving 0-1 loss convergence and generalization.
Result: The framework shows transformers outperform static embedding models, with robustness to concept recombination and distribution shifts. Empirical simulations validate the theory.
Conclusion: The study provides a theoretical foundation for ICL in transformers, highlighting their advantages over predecessors.
Abstract: In-context learning (ICL) has garnered significant attention for its ability to grasp functions/tasks from demonstrations. Recent studies suggest the presence of a latent task/function vector in LLMs during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, we propose a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. We prove 0-1 loss convergence and show the strong generalization, including robustness to concept recombination and distribution shifts. These results elucidate the advantages of transformers over static embedding predecessors. Empirical simulations corroborate our theoretical insights.
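The Word2Vec-style vector arithmetic the paper formalizes can be demonstrated with synthetic embeddings: adding a latent "task vector" (e.g. country-to-capital) to an entity embedding lands near the answer embedding. All vectors below are random stand-ins constructed to exhibit the effect, not model activations.

```python
import torch

dim = 64
g = torch.Generator().manual_seed(0)
task_vec = torch.randn(dim, generator=g)  # latent task: country -> capital
france = torch.randn(dim, generator=g)
poland = torch.randn(dim, generator=g)
# Construct answers so that answer ~= entity + task_vec (plus small noise).
paris = france + task_vec + 0.05 * torch.randn(dim, generator=g)
warsaw = poland + task_vec + 0.05 * torch.randn(dim, generator=g)

query = poland + task_vec  # apply the task vector retrieved in-context
candidates = {"paris": paris, "warsaw": warsaw, "france": france}
sims = {k: torch.cosine_similarity(query, v, dim=0).item() for k, v in candidates.items()}
print(max(sims, key=sims.get))  # "warsaw"
```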
[398] RankList – A Listwise Preference Learning Framework for Predicting Subjective Preferences
Abinay Reddy Naini, Fernando Diaz, Carlos Busso
Main category: cs.LG
TL;DR: RankList is a listwise preference learning framework improving upon RankNet by modeling local and non-local ranking constraints, achieving better performance in tasks like speech emotion recognition and image aesthetic assessment.
Details
Motivation: Addressing the limitations of pairwise frameworks like RankNet, which struggle with global ranking consistency, the paper proposes RankList for structured list-level supervision.
Method: RankList introduces a probabilistic framework with log-sum-exp approximation for efficiency and skip-wise comparisons for enhanced global ranking.
Result: RankList outperforms baselines in Kendall’s Tau and ranking accuracy on SER and image aesthetic datasets, showing better generalization.
Conclusion: RankList provides a unified, extensible approach for modeling ordered preferences in subjective tasks, improving in-domain and cross-domain performance.
Abstract: Preference learning has gained significant attention in tasks involving subjective human judgments, such as speech emotion recognition (SER) and image aesthetic assessment. While pairwise frameworks such as RankNet offer robust modeling of relative preferences, they are inherently limited to local comparisons and struggle to capture global ranking consistency. To address these limitations, we propose RankList, a novel listwise preference learning framework that generalizes RankNet to structured list-level supervision. Our formulation explicitly models local and non-local ranking constraints within a probabilistic framework. The paper introduces a log-sum-exp approximation to improve training efficiency. We further extend RankList with skip-wise comparisons, enabling progressive exposure to complex list structures and enhancing global ranking fidelity. Extensive experiments demonstrate the superiority of our method across diverse modalities. On benchmark SER datasets (MSP-Podcast, IEMOCAP, BIIC Podcast), RankList achieves consistent improvements in Kendall’s Tau and ranking accuracy compared to standard listwise baselines. We also validate our approach on aesthetic image ranking using the Artistic Image Aesthetics dataset, highlighting its broad applicability. Through ablation and cross-domain studies, we show that RankList not only improves in-domain ranking but also generalizes better across datasets. Our framework offers a unified, extensible approach for modeling ordered preferences in subjective learning scenarios.
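For a sense of what a logsumexp-based listwise objective looks like, here is a standard Plackett-Luce-style loss: each position is scored against all items not yet placed. This is one plausible reading of a listwise objective, not the authors' exact RankList loss (which adds non-local and skip-wise terms).

```python
import torch

def listwise_loss(scores: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce listwise loss via logsumexp.

    `scores` holds model scores for items already sorted by true preference
    (best first); each term is -log P(item i ranked next among the rest).
    """
    n = scores.shape[-1]
    loss = 0.0
    for i in range(n):
        loss = loss - (scores[..., i] - torch.logsumexp(scores[..., i:], dim=-1))
    return loss.mean()

scores = torch.tensor([[3.0, 2.0, 0.5], [1.0, 0.9, 0.8]], requires_grad=True)
loss = listwise_loss(scores)
loss.backward()
print(float(loss))
```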
[399] FedShard: Federated Unlearning with Efficiency Fairness and Performance Fairness
Siyuan Wen, Meng Zhang, Yang Yang, Ningning Ding
Main category: cs.LG
TL;DR: FedShard is a federated unlearning algorithm ensuring efficiency and performance fairness, addressing challenges in convergence and unlearning fairness, with faster unlearning speeds.
Details
Motivation: Current federated unlearning methods lack focus on efficiency and performance fairness among clients, leaving gaps in addressing fairness risks like cascaded leaving and poisoning attacks.
Method: FedShard adaptively tackles dilemmas among convergence, unlearning efficiency, and fairness, introducing two novel fairness metrics.
Result: FedShard unlearns 1.3-6.2x faster than retraining and 4.9x faster than state-of-the-art exact unlearning methods, while mitigating unfairness risks.
Conclusion: FedShard effectively balances unlearning costs and fairness, validated by theoretical analysis and experiments.
Abstract: To protect clients’ right to be forgotten in federated learning, federated unlearning aims to remove the data contribution of leaving clients from the global learned model. While current studies mainly focused on enhancing unlearning efficiency and effectiveness, the crucial aspects of efficiency fairness and performance fairness among decentralized clients during unlearning have remained largely unexplored. In this study, we introduce FedShard, the first federated unlearning algorithm designed to concurrently guarantee both efficiency fairness and performance fairness. FedShard adaptively addresses the challenges introduced by dilemmas among convergence, unlearning efficiency, and unlearning fairness. Furthermore, we propose two novel metrics to quantitatively assess the fairness of unlearning algorithms, which we prove to satisfy well-known properties in other existing fairness measurements. Our theoretical analysis and numerical evaluation validate FedShard’s fairness in terms of both unlearning performance and efficiency. We demonstrate that FedShard mitigates unfairness risks such as cascaded leaving and poisoning attacks and realizes more balanced unlearning costs among clients. Experimental results indicate that FedShard accelerates the data unlearning process 1.3-6.2 times faster than retraining from scratch and 4.9 times faster than the state-of-the-art exact unlearning methods.
[400] Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning
Xiaojun Wu, Xiaoguang Jiang, Huiyang Li, Jucai Zhai, Dengfeng Liu, Qiaobo Hao, Huang Liu, Zhiguo Yang, Ji Xie, Ninglun Gu, Jin Yang, Kailai Zhang, Yelun Bao, Jun Wang
Main category: cs.LG
TL;DR: The paper introduces a data-efficient distillation framework (DED) to optimize reasoning in LLMs without extensive scaling, achieving state-of-the-art results with minimal curated data.
Details
Motivation: Current methods for improving reasoning in LLMs often require large datasets and high computational costs, prompting the need for a more efficient approach.
Method: DED selects optimal teacher models, uses a curated smaller corpus for balanced performance, and employs diverse reasoning trajectories to enhance robustness.
Result: DED achieves top results on mathematical reasoning and code generation tasks with only 0.8k examples, outperforming existing methods.
Conclusion: DED provides a practical and efficient solution for advancing reasoning in LLMs while maintaining general capabilities.
Abstract: Large language models (LLMs) demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving. Recent methods have improved reasoning through expanded corpus and multistage training combining reinforcement learning and supervised fine-tuning. Although some methods suggest that a small but targeted dataset can incentivize reasoning through distillation alone, reasoning scaling laws are still taking shape, and computational costs continue to increase. To address this, we propose a data-efficient distillation framework (DED) that optimizes the Pareto frontier of reasoning distillation. Inspired by the on-policy learning and diverse roll-out strategies of reinforcement learning, the key idea of our approach is threefold: (1) We identify that benchmark scores alone do not determine an effective teacher model. Through comprehensive comparisons of leading reasoning LLMs, we develop a method to select an optimal teacher model. (2) While scaling distillation can enhance reasoning, it often degrades out-of-domain performance. A carefully curated, smaller corpus achieves a balanced trade-off between in-domain and out-of-domain capabilities. (3) Diverse reasoning trajectories encourage the student model to develop robust reasoning skills. We validate our method through evaluations on mathematical reasoning (AIME 2024/2025, MATH-500) and code generation (LiveCodeBench), achieving state-of-the-art results with only 0.8k carefully curated examples, bypassing the need for extensive scaling. Our systematic analysis demonstrates that DED outperforms existing methods by considering factors beyond superficial hardness, token length, or teacher model capability. This work offers a practical and efficient pathway to advanced reasoning while preserving general capabilities.
[401] Modern Neural Networks for Small Tabular Datasets: The New Default for Field-Scale Digital Soil Mapping?
Viacheslav Barkov, Jonas Schmidinger, Robin Gebbers, Martin Atzmueller
Main category: cs.LG
TL;DR: Modern artificial neural networks (ANNs) outperform classical machine learning methods in predicting soil properties at field-scale, with TabPFN emerging as the top-performing model.
Details
Motivation: The study aims to challenge the dominance of classical machine learning methods (e.g., Random Forest, Partial Least Squares Regression) in field-scale predictive soil modeling (PSM) by evaluating the suitability of modern ANNs.
Method: A comprehensive benchmark evaluates state-of-the-art ANN architectures, including MLP-based models, attention-based transformers, retrieval-augmented approaches, and an in-context learning foundation model (TabPFN), across 31 datasets.
Result: Modern ANNs consistently outperform classical methods, with TabPFN showing the strongest overall performance and robustness.
Conclusion: The study recommends adopting modern ANNs, particularly TabPFN, as the new default choice for field-scale PSM in pedometrics.
Abstract: In the field of pedometrics, tabular machine learning is the predominant method for predicting soil properties from remote and proximal soil sensing data, forming a central component of digital soil mapping. At the field-scale, this predictive soil modeling (PSM) task is typically constrained by small training sample sizes and high feature-to-sample ratios in soil spectroscopy. Traditionally, these conditions have proven challenging for conventional deep learning methods. Classical machine learning algorithms, particularly tree-based models like Random Forest and linear models such as Partial Least Squares Regression, have long been the default choice for field-scale PSM. Recent advances in artificial neural networks (ANN) for tabular data challenge this view, yet their suitability for field-scale PSM has not been proven. We introduce a comprehensive benchmark that evaluates state-of-the-art ANN architectures, including the latest multilayer perceptron (MLP)-based models (TabM, RealMLP), attention-based transformer variants (FT-Transformer, ExcelFormer, T2G-Former, AMFormer), retrieval-augmented approaches (TabR, ModernNCA), and an in-context learning foundation model (TabPFN). Our evaluation encompasses 31 field- and farm-scale datasets containing 30 to 460 samples and three critical soil properties: soil organic matter or soil organic carbon, pH, and clay content. Our results reveal that modern ANNs consistently outperform classical methods on the majority of tasks, demonstrating that deep learning has matured sufficiently to overcome the long-standing dominance of classical machine learning for PSM. Notably, TabPFN delivers the strongest overall performance, showing robustness across varying conditions. We therefore recommend the adoption of modern ANNs for field-scale PSM and propose TabPFN as the new default choice in the toolkit of every pedometrician.
[402] Rare anomalies require large datasets: About proving the existence of anomalies
Simon Klüttermann, Emmanuel Müller
Main category: cs.LG
TL;DR: The paper establishes a lower bound on dataset size required to confirm anomaly presence, linking it to contamination rate and an algorithm-dependent constant.
Details
Motivation: To address the underexplored question of conclusively determining anomaly presence in datasets.
Method: Conducted over three million statistical tests across various anomaly detection tasks and algorithms.
Result: Found that dataset size must satisfy $N \ge \frac{\alpha_{\text{algo}}}{\nu^2}$ to confirm anomalies, revealing limits on anomaly rarity.
Conclusion: The study provides a practical threshold for anomaly detection feasibility based on dataset size and contamination rate.
Abstract: Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, contamination rate, and an algorithm-dependent constant $ \alpha_{\text{algo}} $. Our results demonstrate that, for an unlabeled dataset of size $ N $ and contamination rate $ \nu $, the condition $ N \ge \frac{\alpha_{\text{algo}}}{\nu^2} $ represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be before proving their existence becomes infeasible.
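The bound $N \ge \alpha_{\text{algo}} / \nu^2$ is easy to work through numerically; the $\alpha_{\text{algo}}$ value below is a hypothetical placeholder (the paper fits it per algorithm), but the quadratic blow-up in the required sample size as anomalies get rarer is the point.

```python
# Worked example of the sample-size bound N >= alpha_algo / nu**2.
alpha_algo = 5.0  # illustrative algorithm-dependent constant
for nu in (0.10, 0.01, 0.001):  # contamination rates: 10%, 1%, 0.1%
    n_min = alpha_algo / nu**2
    print(f"contamination {nu:.1%}: need N >= {n_min:,.0f} samples")
# contamination 10.0%: need N >= 500 samples
# contamination 1.0%:  need N >= 50,000 samples
# contamination 0.1%:  need N >= 5,000,000 samples
```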
[403] Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs
Arjun Ashok, Andrew Robert Williams, Vincent Zhihao Zheng, Irina Rish, Nicolas Chapados, Étienne Marcotte, Valentina Zantedeschi, Alexandre Drouin
Main category: cs.LG
TL;DR: The paper introduces four strategies (ReDP, CorDP, IC-DP, RouteDP) to enhance LLMs’ zero-shot forecasting capabilities by integrating contextual information, improving interpretability, accuracy, and resource efficiency.
Details
Motivation: To explore the untapped potential of LLMs in context-aided forecasting beyond naive prompting, addressing gaps in interpretability, applicability, accuracy, and resource use.
Method: Four strategies: ReDP for interpretability via reasoning traces, CorDP for refining forecasts, IC-DP for embedding historical examples, and RouteDP for difficulty-based routing.
Result: The strategies outperform naive prompting across LLM sizes and families, demonstrating improved interpretability, accuracy, and efficiency.
Conclusion: The proposed strategies offer simple yet effective improvements for LLM-based context-aided forecasting, paving the way for further advancements.
Abstract: Forecasting in real-world settings requires models to integrate not only historical data but also relevant contextual information, often available in textual form. While recent work has shown that large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting, their full potential remains underexplored. We address this gap with 4 strategies, providing new insights into the zero-shot capabilities of LLMs in this setting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model’s reasoning over the context independently from its forecast accuracy. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models. Finally, RouteDP optimizes resource efficiency by using LLMs to estimate task difficulty, and routing the most challenging tasks to larger models. Evaluated on different kinds of context-aided forecasting tasks from the CiK benchmark, our strategies demonstrate distinct benefits over naïve prompting across LLMs of different sizes and families. These results open the door to further simple yet effective improvements in LLM-based context-aided forecasting.
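Of the four strategies, RouteDP is the most mechanical, so here is a hypothetical sketch of the routing logic: a cheap difficulty estimate decides whether a task goes to a small or large model. The threshold, the stand-in difficulty heuristic, and the callables are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Router:
    """Hypothetical RouteDP-style router: spend the large model only on
    tasks the difficulty estimator rates above a threshold."""
    estimate_difficulty: Callable[[str], float]  # e.g. a small LLM scoring 0-1
    small_model: Callable[[str], str]
    large_model: Callable[[str], str]
    threshold: float = 0.5

    def __call__(self, task: str) -> str:
        if self.estimate_difficulty(task) > self.threshold:
            return self.large_model(task)  # hard task: use the big model
        return self.small_model(task)      # easy task: save compute

router = Router(
    estimate_difficulty=lambda t: min(len(t) / 200.0, 1.0),  # stand-in heuristic
    small_model=lambda t: "small-model forecast",
    large_model=lambda t: "large-model forecast",
)
print(router("Forecast next week's demand given the holiday announcement ..."))
```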
[404] Residual Reservoir Memory Networks
Matteo Pinna, Andrea Ceni, Claudio Gallicchio
Main category: cs.LG
TL;DR: Residual Reservoir Memory Networks (ResRMNs) combine linear and non-linear reservoirs with residual connections for improved long-term input propagation, outperforming traditional RC models.
Details
Motivation: To enhance long-term input propagation in Reservoir Computing (RC) by introducing residual connections in a novel RNN architecture.
Method: ResRMNs integrate a linear memory reservoir with a non-linear reservoir using residual orthogonal connections. Linear stability analysis and empirical testing on time-series and 1-D classification tasks are conducted.
Result: ResRMNs demonstrate superior performance compared to conventional RC models.
Conclusion: The proposed ResRMN architecture effectively improves long-term input propagation and outperforms existing RC models.
Abstract: We introduce a novel class of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) paradigm, called Residual Reservoir Memory Networks (ResRMNs). ResRMN combines a linear memory reservoir with a non-linear reservoir, where the latter is based on residual orthogonal connections along the temporal dimension for enhanced long-term propagation of the input. The resulting reservoir state dynamics are studied through the lens of linear stability analysis, and we investigate diverse configurations for the temporal residual connections. The proposed approach is empirically assessed on time-series and pixel-level 1-D classification tasks. Our experimental results highlight the advantages of the proposed approach over other conventional RC models.
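A minimal untrained reservoir update with a residual orthogonal connection on the temporal path conveys the core mechanism: the orthogonal map preserves the state norm, helping long-term propagation, while the tanh branch adds non-linear dynamics. Sizes and the 0.5 mixing coefficient are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_in, T = 100, 1, 50

W_in = rng.normal(scale=0.1, size=(n_units, n_in))
W = rng.normal(scale=1.0 / np.sqrt(n_units), size=(n_units, n_units))
Q, _ = np.linalg.qr(rng.normal(size=(n_units, n_units)))  # orthogonal residual map

x = np.zeros(n_units)
inputs = rng.normal(size=(T, n_in))
for u in inputs:
    # Residual orthogonal path Q @ x is norm-preserving; the tanh branch
    # injects the non-linear reservoir dynamics. All weights stay untrained.
    x = Q @ x + 0.5 * np.tanh(W @ x + W_in @ u)
print(x[:5])
```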
[405] Prototype-Guided Diffusion: Visual Conditioning without External Memory
Bilal Faye, Hanane Azzag, Mustapha Lebbah
Main category: cs.LG
TL;DR: The paper introduces Prototype Diffusion Model (PDM), an efficient alternative to retrieval-augmented diffusion models, using dynamic visual prototypes for semantic conditioning without external memory.
Details
Motivation: Diffusion models are computationally intensive, and retrieval-based methods introduce storage and adaptability issues. PDM aims to address these drawbacks.
Method: PDM integrates prototype learning into diffusion, constructing dynamic visual prototypes from clean image features via contrastive learning to guide denoising.
Result: PDM maintains high generation quality while reducing computational and storage costs, outperforming retrieval-based methods.
Conclusion: PDM offers a scalable, efficient solution for semantic conditioning in diffusion models, eliminating the need for external memory.
Abstract: Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains. However, they remain computationally intensive, particularly during the iterative denoising process. Latent-space models like Stable Diffusion alleviate some of this cost by operating in compressed representations, though at the expense of fine-grained detail. More recent approaches such as Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning denoising on similar examples retrieved from large external memory banks. While effective, these methods introduce drawbacks: they require costly storage and retrieval infrastructure, depend on static vision-language models like CLIP for similarity, and lack adaptability during training. We propose the Prototype Diffusion Model (PDM), a method that integrates prototype learning directly into the diffusion process for efficient and adaptive visual conditioning - without external memory. Instead of retrieving reference samples, PDM constructs a dynamic set of compact visual prototypes from clean image features using contrastive learning. These prototypes guide the denoising steps by aligning noisy representations with semantically relevant visual patterns, enabling efficient generation with strong semantic grounding. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead, offering a scalable alternative to retrieval-based conditioning in diffusion models.
[406] Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models
Luca Eyring, Shyamgopal Karthik, Alexey Dosovitskiy, Nataniel Ruiz, Zeynep Akata
Main category: cs.LG
TL;DR: The paper proposes a Noise Hypernetwork to replace reward-guided test-time noise optimization in diffusion models, reducing computational overhead while preserving quality gains.
Details
Motivation: Test-time scaling improves model performance but increases computation time, making it impractical for many applications. The goal is to retain benefits without the overhead.
Method: Introduces a Noise Hypernetwork to modulate initial input noise, using a tractable noise-space objective to optimize for desired characteristics while maintaining fidelity to the base model.
Result: The approach recovers much of the quality gains from test-time optimization at a significantly lower computational cost.
Conclusion: The proposed method effectively integrates test-time scaling knowledge into models post-training, balancing performance and efficiency.
Abstract: The new paradigm of test-time scaling has yielded remarkable breakthroughs in Large Language Models (LLMs) (e.g. reasoning models) and in generative vision models, allowing models to allocate additional computation during inference to effectively tackle increasingly complex problems. Despite the improvements of this approach, an important limitation emerges: the substantial increase in computation time makes the process slow and impractical for many applications. Given the success of this paradigm and its growing usage, we seek to preserve its benefits while eschewing the inference overhead. In this work we propose one solution to the critical problem of integrating test-time scaling knowledge into a model during post-training. Specifically, we replace reward guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise. We propose a theoretically grounded framework for learning this reward-tilted distribution for distilled generators, through a tractable noise-space objective that maintains fidelity to the base model while optimizing for desired characteristics. We show that our approach recovers a substantial portion of the quality gains from explicit test-time optimization at a fraction of the computational cost. Code is available at https://github.com/ExplainableML/HyperNoise
[407] Dynamic Mixture-of-Experts for Incremental Graph Learning
Lecheng Kong, Theodore Vasiloudis, Seongjun Yun, Han Xie, Xiang Song
Main category: cs.LG
TL;DR: The paper proposes DyMoE, a dynamic mixture-of-experts approach for graph incremental learning, addressing catastrophic forgetting by adding specialized experts and using a customized regularization loss. It achieves a 4.92% accuracy improvement over baselines.
Details
Motivation: Existing graph incremental learning methods suffer from catastrophic forgetting and fail to account for varying contributions of prior knowledge.
Method: Introduces DyMoE, adding new expert networks for incoming data and a regularization loss to balance old and new knowledge. Uses sparse MoE to reduce computational cost.
Result: Achieves a 4.92% relative accuracy increase over baselines in class incremental learning.
Conclusion: DyMoE effectively mitigates catastrophic forgetting and improves incremental learning performance.
Abstract: Graph incremental learning is a learning paradigm that aims to adapt trained models to continuously incremented graphs and data over time without the need for retraining on the full dataset. However, regular graph machine learning methods suffer from catastrophic forgetting when applied to incremental learning settings, where previously learned knowledge is overridden by new knowledge. Previous approaches have tried to address this by treating the previously trained model as an inseparable unit and using techniques to maintain old behaviors while learning new knowledge. These approaches, however, do not account for the fact that previously acquired knowledge at different timestamps contributes differently to learning new tasks. Some prior patterns can be transferred to help learn new data, while others may deviate from the new data distribution and be detrimental. To address this, we propose a dynamic mixture-of-experts (DyMoE) approach for incremental learning. Specifically, a DyMoE GNN layer adds new expert networks specialized in modeling the incoming data blocks. We design a customized regularization loss that utilizes data sequence information so existing experts can maintain their ability to solve old tasks while helping the new expert learn the new data effectively. As the number of data blocks grows over time, the computational cost of the full mixture-of-experts (MoE) model increases. To address this, we introduce a sparse MoE approach, where only the top-$k$ most relevant experts make predictions, significantly reducing the computation time. Our model achieved 4.92% relative accuracy increase compared to the best baselines on class incremental learning, showing the model’s exceptional power.
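The sparse top-$k$ routing that keeps DyMoE's cost flat as experts accumulate can be sketched directly: each input is dispatched only to its $k$ best-scoring experts. This is a minimal sketch of generic top-$k$ MoE dispatch; the paper's data-sequence regularization loss and GNN specifics are omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

def topk_moe(x, experts, gate, k=2):
    """Sparse MoE forward: route each input to its top-k experts only, so
    compute stays flat as new experts are appended over time."""
    logits = gate(x)                            # (batch, E)
    topv, topi = torch.topk(logits, k, dim=-1)  # keep the k best experts
    weights = torch.softmax(topv, dim=-1)       # renormalize over those k
    out = torch.zeros(x.shape[0], experts[0][-1].out_features)
    for slot in range(k):
        for e_idx in range(len(experts)):
            mask = topi[:, slot] == e_idx
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * experts[e_idx](x[mask])
    return out

experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8)) for _ in range(4)]
)
gate = nn.Linear(32, 4)
print(topk_moe(torch.randn(16, 32), experts, gate).shape)  # torch.Size([16, 8])
```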
[408] Discrete Neural Algorithmic Reasoning
Gleb Rodionov, Liudmila Prokhorenkova
Main category: cs.LG
TL;DR: Neural algorithmic reasoning improves generalization by enforcing predefined discrete states, aligning perfectly with classical algorithms.
Details
Motivation: Current neural reasoners struggle with out-of-distribution generalization, unlike classical algorithms unaffected by distributional shifts.
Method: Separate discrete and continuous data flows, supervise state transitions to align with classical algorithms.
Result: Achieves perfect test scores on algorithmic problems in single-task and multitask setups.
Conclusion: Proposed method ensures correctness for any test data and improves generalization.
Abstract: Neural algorithmic reasoning aims to capture computations with neural networks by training models to imitate the execution of classical algorithms. While common architectures are expressive enough to contain the correct model in the weight space, current neural reasoners struggle to generalize well on out-of-distribution data. On the other hand, classical computations are not affected by distributional shifts as they can be described as transitions between discrete computational states. In this work, we propose to force neural reasoners to maintain the execution trajectory as a combination of finite predefined states. To achieve this, we separate discrete and continuous data flows and describe the interaction between them. Trained with supervision on the algorithm’s state transitions, such models are able to perfectly align with the original algorithm. To show this, we evaluate our approach on multiple algorithmic problems and achieve perfect test scores both in single-task and multitask setups. Moreover, the proposed architectural choice allows us to prove the correctness of the learned algorithms for any test data.
[409] LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data
Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier
Main category: cs.LG
TL;DR: The paper introduces LUMA, a multimodal dataset for studying uncertainty in deep learning, extending CIFAR with audio and text data, and providing tools for controlled uncertainty injection.
Details
Motivation: To enhance trustworthy multimodal deep learning by understanding and addressing uncertainty in models.
Method: Developed LUMA, a dataset with audio, image, and text data, and provided tools for uncertainty injection and quantification methods.
Result: LUMA enables controlled experiments with uncertainty, supporting robust multimodal model development.
Conclusion: LUMA aims to advance trustworthy machine learning for safety-critical applications.
Abstract: Multimodal Deep Learning enhances decision-making by integrating diverse information sources, such as texts, images, audio, and videos. To develop trustworthy multimodal approaches, it is essential to understand how uncertainty impacts these models. We propose LUMA, a unique multimodal dataset, featuring audio, image, and textual data from 50 classes, specifically designed for learning from uncertain data. It extends the well-known CIFAR 10/100 dataset with audio samples extracted from three audio corpora, and text data generated using the Gemma-7B Large Language Model (LLM). The LUMA dataset enables the controlled injection of varying types and degrees of uncertainty to achieve and tailor specific experiments and benchmarking initiatives. LUMA is also available as a Python package including the functions for generating multiple variants of the dataset with control over the diversity of the data, the amount of noise for each modality, and the addition of out-of-distribution samples. A baseline pre-trained model is also provided alongside three uncertainty quantification methods: Monte-Carlo Dropout, Deep Ensemble, and Reliable Conflictive Multi-View Learning. This comprehensive dataset and its tools are intended to promote and support the development, evaluation, and benchmarking of trustworthy and robust multimodal deep learning approaches. We anticipate that the LUMA dataset will help the research community to design more trustworthy and robust machine learning approaches for safety-critical applications. The code and instructions for downloading and processing the dataset can be found at: https://github.com/bezirganyan/LUMA/ .
[410] Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke
Main category: cs.LG
TL;DR: A novel method using steering vectors to mitigate bias in LLMs shows significant improvements over existing techniques, with minimal impact on model performance.
Details
Motivation: Addressing social biases in LLMs is critical for AI safety and fairness. This work explores steering vectors as an efficient solution.
Method: Compute and apply 8 steering vectors for different bias axes (e.g., age, gender) using the BBQ dataset, comparing them to 3 other bias mitigation methods across 4 datasets.
Result: Steering vectors improved bias metrics by 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, outperforming prompting, Self-Debias, and fine-tuning in most cases. They also had the least impact on MMLU scores.
Conclusion: Steering vectors are a powerful, efficient strategy for bias mitigation in LLMs, with broader implications for AI safety.
Abstract: We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We compute 8 steering vectors, each corresponding to a different social bias axis, such as age, gender, or race, on a training subset of the BBQ dataset and compare the effectiveness of these to 3 additional bias mitigation methods across 4 datasets. When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, and show improvements over prompting and Self-Debias in all cases, and improvements over fine-tuning in 12 out of 17 evaluations. In addition, steering vectors showed the lowest impact on MMLU scores of the four bias mitigation methods tested. The work presents the first systematic investigation of steering vectors for bias mitigation, and we demonstrate that they are a powerful and computationally efficient strategy for reducing bias in LLMs, with broader implications for enhancing AI safety.
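Activation steering is mechanically simple, so a generic sketch is useful: a steering vector is typically a mean difference between hidden activations on contrasting prompt sets, added back into the residual stream during the forward pass. The hook target, layer index, and scale below are assumptions for illustration, not the paper's settings.

```python
import torch

def compute_steering_vector(acts_biased: torch.Tensor, acts_unbiased: torch.Tensor):
    # acts_*: (n_prompts, hidden_dim) activations from one chosen layer
    return acts_unbiased.mean(dim=0) - acts_biased.mean(dim=0)

def make_steering_hook(vector: torch.Tensor, scale: float = 1.0):
    def hook(module, inputs, output):
        # Assumes the hooked module returns a plain (batch, seq, hidden)
        # tensor; many transformer blocks return tuples, which would need
        # unpacking here.
        return output + scale * vector
    return hook

vec = compute_steering_vector(torch.randn(100, 512), torch.randn(100, 512))
print(vec.shape)  # torch.Size([512])

# Hypothetical usage with a Hugging Face-style model:
#   layer = model.model.layers[14]   # layer choice is an assumption
#   handle = layer.register_forward_hook(make_steering_hook(vec, scale=2.0))
#   ... run generation ...
#   handle.remove()
```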
[411] GTPO: Trajectory-Based Policy Optimization in Large Language Models
Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino
Main category: cs.LG
TL;DR: The paper identifies limitations in GRPO, introduces GTPO to address them, and validates its effectiveness on benchmarks.
Details
Motivation: To address conflicting gradient updates and policy collapse in GRPO, which degrade model performance.
Method: Introduces GTPO, which skips negative updates for conflict tokens and filters high-entropy completions.
Result: GTPO improves stability and performance without needing KL-divergence regularization or a reference model.
Conclusion: GTPO is a more stable and effective policy optimization strategy than GRPO.
Abstract: Policy-based optimizations are widely adopted today for the training and alignment of language models, where one of the most recent and effective approaches is Group-relative Policy Optimization (GRPO). In this paper, we reveal and analyze two major limitations of GRPO: (i) tokens frequently appear in completions with both positive and negative rewards, leading to conflicting gradient updates that can reduce their output probability, even though they can be essential for maintaining proper structure; (ii) negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, progressively flattening the output distribution and degrading learning. To address these issues and provide a more stable and effective policy optimization strategy, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which identifies conflict tokens (tokens appearing in the same position across completions with opposite rewards) and protects them by skipping negative updates while amplifying positive ones. To further prevent policy collapse, GTPO filters out completions whose entropy exceeds a provable threshold. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, validated through multiple experiments on GSM8K, MATH and AIME 2024 benchmarks.
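A toy sketch of the conflict-token idea: find positions where the same token appears in completions with both positive and negative advantage, and mask out the negative updates there. The shapes and per-token advantage layout are a simplified reading of the paper, not its implementation.

```python
import torch

def conflict_token_mask(token_ids: torch.Tensor, advantages: torch.Tensor):
    """Return a mask that zeroes out negative updates on conflict tokens,
    i.e. tokens appearing at the same position with both reward signs."""
    # token_ids, advantages: (n_completions, seq_len)
    pos, neg = advantages > 0, advantages < 0
    mask = torch.ones_like(token_ids, dtype=torch.bool)
    for t in range(token_ids.shape[1]):
        col = token_ids[:, t]
        conflicted = set(col[pos[:, t]].tolist()) & set(col[neg[:, t]].tolist())
        for i in range(col.shape[0]):
            if col[i].item() in conflicted and advantages[i, t] < 0:
                mask[i, t] = False  # skip the negative update on this token
    return mask

ids = torch.tensor([[5, 7, 9], [5, 8, 9]])
adv = torch.tensor([[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]])
print(conflict_token_mask(ids, adv))
# tokens 5 and 9 appear with both signs -> their negative updates are masked
```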
[412] LEAVES: Learning Views for Time-Series Biobehavioral Data in Contrastive Learning
Han Yu, Huiyuan Yang, Akane Sano
Main category: cs.LG
TL;DR: LEAVES introduces an efficient adversarial training module for automatic view generation in contrastive learning for time-series biobehavioral data, outperforming baselines with minimal parameters.
Details
Motivation: Optimizing augmentations in contrastive learning is resource-intensive, especially for time-series biobehavioral data, which lacks automated solutions.
Method: LEAVES uses adversarial training to learn augmentation hyperparameters within contrastive learning frameworks (SimCLR, BYOL), requiring only ~20 parameters.
Result: LEAVES achieves competitive/superior performance to baselines (e.g., ViewMaker) with far fewer parameters (~20 vs. 580k) and no extensive tuning.
Conclusion: LEAVES is efficient, practical, and scalable for healthcare applications, balancing accuracy and resource constraints.
Abstract: Contrastive learning has been utilized as a promising self-supervised learning approach to extract meaningful representations from unlabeled data. The majority of these methods take advantage of data-augmentation techniques to create diverse views from the original input. However, optimizing augmentations and their parameters for generating more effective views in contrastive learning frameworks is often resource-intensive and time-consuming. While several strategies have been proposed for automatically generating new views in computer vision, research in other domains, such as time-series biobehavioral data, remains limited. In this paper, we introduce a simple yet powerful module for automatic view generation in contrastive learning frameworks applied to time-series biobehavioral data, which is essential for modern health care, termed learning views for time-series data (LEAVES). This proposed module employs adversarial training to learn augmentation hyperparameters within contrastive learning frameworks. We assess the efficacy of our method on multiple time-series datasets using two well-known contrastive learning frameworks, namely SimCLR and BYOL. Across four diverse biobehavioral datasets, LEAVES requires only approximately 20 learnable parameters – dramatically fewer than the about 580k parameters demanded by frameworks like ViewMaker, a previously proposed adversarially trained convolutional module in contrastive learning, while achieving competitive and often superior performance to existing baseline methods. Crucially, these efficiency gains are obtained without extensive manual hyperparameter tuning, which makes LEAVES particularly suitable for large-scale or real-time healthcare applications that demand both accuracy and practicality.
[413] Forecasting steam mass flow in power plants using the parallel hybrid network
Andrii Kurkin, Jonas Hegemann, Mo Kordzanganeh, Alexey Melnikov
Main category: cs.LG
TL;DR: A parallel hybrid neural network combining quantum and classical methods improves steam mass flow prediction in thermal power plants, outperforming standalone models.
Details
Motivation: Accurate steam mass flow prediction is critical for efficiency and cost reduction in thermal power plants.
Method: A parallel hybrid neural network architecture using a parametrized quantum circuit and a feed-forward neural network for time-series prediction.
Result: The hybrid model reduces mean squared error by 5.7x and 4.9x compared to pure classical and quantum models, respectively, and halves relative errors.
Conclusion: The study demonstrates the potential of hybrid quantum-classical models for real-world energy sector challenges, optimizing power plant operations.
Abstract: Efficient and sustainable power generation is a crucial concern in the energy sector. In particular, thermal power plants grapple with accurately predicting steam mass flow, which is crucial for operational efficiency and cost reduction. In this study, we use a parallel hybrid neural network architecture that combines a parametrized quantum circuit and a conventional feed-forward neural network specifically designed for time-series prediction in industrial settings to enhance predictions of steam mass flow 15 minutes into the future. Our results show that the parallel hybrid model outperforms standalone classical and quantum models, achieving more than 5.7 and 4.9 times lower mean squared error loss on the test set after training compared to pure classical and pure quantum networks, respectively. Furthermore, the hybrid model demonstrates smaller relative errors between the ground truth and the model predictions on the test set, up to 2 times better than the pure classical model. These findings contribute to the broader scientific understanding of how integrating quantum and classical machine learning techniques can be applied to real-world challenges faced by the energy sector, ultimately leading to optimized power plant operations. To our knowledge, this study constitutes the first parallel hybrid quantum-classical architecture deployed on a real-world power-plant dataset, illustrating how near-term quantum resources can already augment classical analytics in the energy sector.
[414] Learning to Defer in Congested Systems: The AI-Human Interplay
Thodoris Lykouris, Wentao Weng
Main category: cs.LG
TL;DR: The paper introduces a model for AI-human collaboration in content moderation, proposing a near-optimal learning algorithm to minimize misclassification costs by balancing classification loss, idiosyncratic loss, and delay loss.
Details
Motivation: Current AI-human pipelines in content moderation use fixed thresholds, ignoring AI uncertainty, time-varying factors, and selective sampling, leading to inefficiencies.
Method: The model captures AI-human interplay, where AI makes classification and admission decisions, and humans review jobs, overturn errors, and provide delayed training data. A learning algorithm balances classification loss, idiosyncratic loss, and delay loss.
Result: Numerical experiments show the algorithm significantly reduces misclassifications compared to existing practices.
Conclusion: The proposed algorithm improves content moderation efficiency by addressing key limitations of current AI-human pipelines.
Abstract: High-stakes applications rely on combining Artificial Intelligence (AI) and humans for responsive and reliable decision making. For example, content moderation in social media platforms often employs an AI-human pipeline to promptly remove policy violations without jeopardizing legitimate content. A typical heuristic estimates the risk of incoming content and uses fixed thresholds to decide whether to auto-delete the content (classification) and whether to send it for human review (admission). This approach can be inefficient as it disregards the uncertainty in AI’s estimation, the time-varying element of content arrivals and human review capacity, and the selective sampling in the online dataset (humans only review content filtered by the AI). In this paper, we introduce a model to capture such an AI-human interplay. In this model, the AI observes contextual information for incoming jobs, makes classification and admission decisions, and schedules admitted jobs for human review. During these reviews, humans observe a job’s true cost and may overturn an erroneous AI classification decision. These reviews also serve as new data to train the AI but are delayed due to congestion in the human review system. The objective is to minimize the costs of eventually misclassified jobs. We propose a near-optimal learning algorithm that carefully balances the classification loss from a selectively sampled dataset, the idiosyncratic loss of non-reviewed jobs, and the delay loss of having congestion in the human review system. To the best of our knowledge, this is the first result for online learning in contextual queueing systems. Moreover, numerical experiments based on online comment datasets show that our algorithm can substantially reduce the number of misclassifications compared to existing content moderation practice.
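For concreteness, the fixed-threshold heuristic the paper improves upon is a two-threshold rule on the AI's risk estimate (threshold values here are placeholders); the paper's contribution is replacing these static decisions with a learning algorithm that also accounts for review congestion and selective sampling.

```python
import numpy as np

def moderate(risk, delete_thr=0.9, review_thr=0.5):
    """Baseline heuristic: auto-delete high-risk content (classification)
    and admit mid-risk content to the human review queue (admission)."""
    delete = risk >= delete_thr
    admit = (risk >= review_thr) & ~delete
    return delete, admit

risk = np.array([0.95, 0.7, 0.2])
print(moderate(risk))  # (array([True, False, False]), array([False, True, False]))
```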
[415] Semi-Bandit Learning for Monotone Stochastic Optimization
Arpit Agarwal, Rohan Ghuge, Viswanath Nagarajan, Zhengjia Zhuo
Main category: cs.LG
TL;DR: The paper introduces an online learning algorithm for monotone stochastic problems when underlying distributions are unknown, achieving √(T log T) regret relative to the best approximation algorithm.
Details
Motivation: Current stochastic optimization methods require full knowledge of probability distributions, which is often unrealistic. This work addresses the challenge of learning these distributions through repeated interactions.Method: A generic online learning algorithm is proposed, operating in a semi-bandit setting with censored or binary feedback, where only probed variables are observed.
Result: The algorithm achieves √(T log T) regret compared to the best approximation algorithm under known distributions, applicable to problems like prophet inequality and stochastic knapsack.
Conclusion: The framework successfully extends to various fundamental problems, demonstrating robustness in settings with limited feedback.
Abstract: Stochastic optimization is a widely used approach for optimization under uncertainty, where uncertain input parameters are modeled by random variables. Exact or approximation algorithms have been obtained for several fundamental problems in this area. However, a significant limitation of this approach is that it requires full knowledge of the underlying probability distributions. Can we still get good (approximation) algorithms if these distributions are unknown, and the algorithm needs to learn them through repeated interactions? In this paper, we resolve this question for a large class of “monotone” stochastic problems, by providing a generic online learning algorithm with $\sqrt{T\log(T)}$ regret relative to the best approximation algorithm (under known distributions). Importantly, our online algorithm works in a semi-bandit setting, where in each period, the algorithm only observes samples from the random variables that were actually probed. Moreover, our result extends to settings with censored and binary feedback, where the policy only observes truncated or thresholded versions of the probed variables. Our framework applies to several fundamental problems such as prophet inequality, Pandora’s box, stochastic knapsack, single-resource revenue management and sequential posted pricing.
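To make the semi-bandit feedback model concrete, here is a toy loop for the prophet inequality: the classic single-threshold (median-of-max) rule is run from empirical distributions, and only boxes probed in a round contribute samples. This is an illustrative sketch, not the paper's algorithm; the resampling scheme and cold-start fallback are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 500
true_means = rng.uniform(0.5, 2.0, size=n)   # hidden parameters of each box
samples = [[] for _ in range(n)]             # per-box history (probed values only)

def threshold_from_samples(m=100):
    # Median-of-max threshold rule, estimated by resampling from the
    # empirical distributions (exponential draw as a cold-start fallback).
    draws = np.stack([
        rng.choice(samples[i], size=m) if samples[i] else rng.exponential(1.0, size=m)
        for i in range(n)])
    return np.median(draws.max(axis=0))

total = 0.0
for t in range(T):
    tau = threshold_from_samples()
    values = rng.exponential(true_means)     # this round's realizations
    for i in range(n):                       # inspect boxes in order
        samples[i].append(values[i])         # semi-bandit: probed boxes only
        if values[i] >= tau:                 # accept and stop; remaining
            total += values[i]               # boxes stay unobserved
            break
print("average reward:", total / T)
```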
[416] No-Regret M${}^{\natural}$-Concave Function Maximization: Stochastic Bandit Algorithms and Hardness of Adversarial Full-Information Setting
Taihei Oki, Shinsaku Sakaue
Main category: cs.LG
TL;DR: The paper studies online M${}^{\natural}$-concave function maximization, presenting algorithms for stochastic settings and proving impossibility in adversarial settings.
Details
Motivation: Practical scenarios lack perfect knowledge of M${}^{\natural}$-concave functions, necessitating interactive optimization based on feedback.Method: Proposes $O(T^{-1/2})$-simple regret and $O(T^{2/3})$-regret algorithms for stochastic bandit settings, leveraging the robustness of the greedy algorithm. Also proves hardness in adversarial settings via reduction from matroid intersection.
Result: Positive results for stochastic settings with efficient algorithms, but impossibility in adversarial settings due to computational hardness.
Conclusion: The work bridges theory and practice for M${}^{\natural}$-concave optimization, highlighting limitations in adversarial environments.
Abstract: M${}^{\natural}$-concave functions, a.k.a. gross substitute valuation functions, play a fundamental role in many fields, including discrete mathematics and economics. In practice, perfect knowledge of M${}^{\natural}$-concave functions is often unavailable a priori, and we can optimize them only interactively based on some feedback. Motivated by such situations, we study online M${}^{\natural}$-concave function maximization problems, which are interactive versions of the problem studied by Murota and Shioura (1999). For the stochastic bandit setting, we present $O(T^{-1/2})$-simple regret and $O(T^{2/3})$-regret algorithms under $T$ times access to unbiased noisy value oracles of M${}^{\natural}$-concave functions. A key to proving these results is the robustness of the greedy algorithm to local errors in M${}^{\natural}$-concave function maximization, which is one of our main technical results. While we obtain those positive results for the stochastic setting, another main result of our work is an impossibility in the adversarial setting. We prove that, even with full-information feedback, no algorithms that run in polynomial time per round can achieve $O(T^{1-c})$ regret for any constant $c > 0$. Our proof is based on a reduction from the matroid intersection problem for three matroids, which would be a novel approach to establishing the hardness in online learning.
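The greedy-robustness result has a simple operational reading: steepest-ascent greedy still works when each marginal-gain estimate is only slightly off, so averaging repeated noisy oracle calls suffices to keep local errors small. A toy sketch under that reading (the paper's bandit algorithms are more refined; the repetition count and the separable test function are assumptions):

```python
import numpy as np

def noisy_greedy_maximize(noisy_f, n, budget, reps=32):
    """Steepest-ascent greedy under a noisy value oracle. For M-natural-concave
    f, greedy tolerates small local errors, so averaging `reps` oracle calls
    per query keeps the gain estimates accurate enough."""
    x = np.zeros(n, dtype=int)
    est = lambda y: np.mean([noisy_f(y) for _ in range(reps)])
    base = est(x)
    for _ in range(budget):
        gains = []
        for i in range(n):
            y = x.copy(); y[i] += 1
            gains.append(est(y) - base)
        i_star = int(np.argmax(gains))
        if gains[i_star] <= 0:               # no improving direction: stop
            break
        x[i_star] += 1
        base += gains[i_star]
    return x

# toy check: separable concave functions are M-natural-concave
rng = np.random.default_rng(1)
w = np.array([3.0, 1.0, 2.0])
noisy_f = lambda x: float(w @ np.sqrt(x)) + rng.normal(0.0, 0.01)
print(noisy_greedy_maximize(noisy_f, n=3, budget=6))
```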
[417] Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks
Jialin Zhao, Yingtao Zhang, Xinghang Li, Huaping Liu, Carlo Vittorio Cannistraci
Main category: cs.LG
TL;DR: Sparse Spectral Training (SST) is introduced as a memory-efficient method for neural network pre-training, outperforming LoRA and ReLoRA by selectively updating singular values and vectors, and reducing the perplexity gap by 97.4% on LLaMA-1.3B.
Details
Motivation: Address the inefficiencies of existing memory reduction techniques like LoRA and ReLoRA, which struggle with low-rank constraints and saddle point issues during intensive tasks like pre-training.Method: SST updates all singular values and selectively updates singular vectors using multinomial sampling weighted by singular value magnitude. It also uses SVD for initialization and periodic reinitialization of low-rank parameters.
Result: SST outperforms other low-rank methods and matches full-rank training in some cases, reducing the perplexity gap by 97.4% on LLaMA-1.3B with only 18.7% trainable parameters.
Conclusion: SST is an effective, parameter-efficient technique for model pre-training, offering significant memory savings without compromising performance.
Abstract: The growing demands on GPU memory posed by the increasing number of neural network parameters call for training approaches that are more memory-efficient. Previous memory reduction training techniques, such as Low-Rank Adaptation (LoRA) and ReLoRA, face challenges, with LoRA being constrained by its low-rank structure, particularly during intensive tasks like pre-training, and ReLoRA suffering from saddle point issues. In this paper, we propose Sparse Spectral Training (SST) to optimize memory usage for pre-training. SST updates all singular values and selectively updates singular vectors through a multinomial sampling method weighted by the magnitude of the singular values. Furthermore, SST employs singular value decomposition to initialize and periodically reinitialize low-rank parameters, reducing distortion relative to full-rank training compared to other low-rank methods. Through comprehensive testing on both Euclidean and hyperbolic neural networks across various tasks, SST demonstrates its ability to outperform existing memory reduction training methods and is comparable to full-rank training in various cases. On LLaMA-1.3B, with only 18.7% of the parameters trainable compared to full-rank training (using a rank equivalent to 6% of the embedding dimension), SST reduces the perplexity gap between other low-rank methods and full-rank training by 97.4%. This result highlights SST as an effective parameter-efficient technique for model pre-training.
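A sketch of the core SST-style update in PyTorch, under stated assumptions (the gradient scaling, learning-rate handling, and reinitialization cadence are simplified; only the "update all singular values, sample singular vectors by magnitude" structure is taken from the abstract):

```python
import torch

def sst_step(U, S, V, grad_W, lr=1e-3, k=8):
    """One illustrative SST-style update: every singular value moves, but
    only k sampled singular-vector pairs do, sampled with probability
    proportional to singular-value magnitude."""
    grad_S = (U.T @ grad_W @ V).diagonal()   # dL/dS for W = U diag(S) V^T
    gU = grad_W @ V                          # dL/dU = G V diag(S), per column
    gV = grad_W.T @ U                        # dL/dV = G^T U diag(S), per column
    S = S - lr * grad_S                      # update all singular values
    idx = torch.multinomial(S.abs() / S.abs().sum(), k, replacement=False)
    U, V = U.clone(), V.clone()
    U[:, idx] -= lr * gU[:, idx] * S[idx]    # update only the sampled vectors
    V[:, idx] -= lr * gV[:, idx] * S[idx]
    return U, S, V

# initialization (and periodic reinitialization) via SVD of the weight matrix
W = torch.randn(64, 64)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
U, S, V = U, S, Vh.T
```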
[418] Distributed Lag Transformer based on Time-Variable-Aware Learning for Explainable Multivariate Time Series Forecasting
Younghwi Kim, Dohee Kim, Joongrock Kim, Sunghyun Sim
Main category: cs.LG
TL;DR: DLFormer is a novel Transformer architecture for explainable and scalable multivariate time series forecasting, combining distributed lag embedding and time-variable aware learning to improve accuracy and interpretability.
Details
Motivation: Transformer models lack explainability in critical applications despite their performance in multivariate time series forecasting.Method: DLFormer integrates distributed lag embedding and time-variable aware learning (TVAL) to model temporal dependencies and capture past variable influences.
Result: DLFormer achieves state-of-the-art predictive accuracy and provides interpretable insights on ten benchmark and real-world datasets.
Conclusion: DLFormer bridges the gap between performance and explainability, making it practical for big data forecasting.
Abstract: Time series data is a key element of big data analytics, commonly found in domains such as finance, healthcare, climate forecasting, and transportation. In large-scale real-world settings, such data is often high-dimensional and multivariate, requiring advanced forecasting methods that are both accurate and interpretable. Although Transformer-based models perform well in multivariate time series forecasting (MTSF), their lack of explainability limits their use in critical applications. To overcome this, we propose Distributed Lag Transformer (DLFormer), a novel Transformer architecture for explainable and scalable MTSF. DLFormer integrates a distributed lag embedding and a time-variable-aware learning (TVAL) mechanism to structurally model both local and global temporal dependencies and explicitly capture the influence of past variables on future outcomes. Experiments on ten benchmark and real-world datasets show that DLFormer achieves state-of-the-art predictive accuracy while offering robust, interpretable insights into variable-wise and temporal dynamics. These results highlight the ability of DLFormer to bridge the gap between performance and explainability, making it highly suitable for practical big data forecasting tasks.
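A distributed-lag embedding can be pictured as turning each variable's recent window into its own token via a learned lag kernel, so attention and kernel weights expose (variable, lag) attributions. A minimal sketch under that reading (tensor shapes and the shared-kernel choice are assumptions, not DLFormer's exact design):

```python
import torch
import torch.nn as nn

class DistributedLagEmbedding(nn.Module):
    """Illustrative distributed-lag embedding: each variable's last
    `max_lag` values become one token via a learned lag kernel, so the
    kernel weights read directly as per-lag influence."""
    def __init__(self, max_lag, d_model):
        super().__init__()
        self.max_lag = max_lag
        self.beta = nn.Linear(max_lag, d_model)  # shared lag kernel

    def forward(self, x):                        # x: (batch, time, vars)
        window = x[:, -self.max_lag:, :]         # most recent max_lag steps
        tokens = window.permute(0, 2, 1)         # (batch, vars, max_lag)
        return self.beta(tokens)                 # (batch, vars, d_model)

# the variable tokens then feed a standard Transformer encoder; attention
# over tokens yields variable-wise attributions for each forecast.
```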
[419] Federated Learning for Smart Grid: A Survey on Applications and Potential Vulnerabilities
Zikai Zhang, Suman Rath, Jiahao Xu, Tingsong Xiao
Main category: cs.LG
TL;DR: The paper surveys federated learning (FL) applications in Smart Grids (SGs), focusing on privacy, efficiency, and accuracy. It reviews FL advancements in SG stages, identifies vulnerabilities, and proposes future research directions. It also introduces FedGridShield, an open-source framework for attack and defense methods.
Details
Motivation: Addressing growing concerns about data security and privacy in SGs, the paper explores FL as a solution for collaborative model training without sharing private data.Method: The survey reviews FL-based SG systems across generation, transmission, distribution, and consumption stages, analyzing vulnerabilities and gaps between SOTA FL research and practical applications.
Result: The paper identifies unique security concerns in FL-based SG systems and introduces FedGridShield, an open-source framework for attack and defense implementations.
Conclusion: The survey highlights the potential of FL in SGs, calls for further research to improve robustness, and provides tools like FedGridShield to inspire advancements.
Abstract: The Smart Grid (SG) is a critical energy infrastructure that collects real-time electricity usage data to forecast future energy demands using information and communication technologies (ICT). Due to growing concerns about data security and privacy in SGs, federated learning (FL) has emerged as a promising training framework. FL offers a balance between privacy, efficiency, and accuracy in SGs by enabling collaborative model training without sharing private data from IoT devices. In this survey, we thoroughly review recent advancements in designing FL-based SG systems across three stages: generation, transmission and distribution, and consumption. Additionally, we explore potential vulnerabilities that may arise when implementing FL in these stages. Furthermore, we discuss the gap between state-of-the-art (SOTA) FL research and its practical applications in SGs, and we propose future research directions. Unlike traditional surveys addressing security issues in centralized machine learning methods for SG systems, this survey is the first to specifically examine the applications and security concerns unique to FL-based SG systems. We also introduce FedGridShield, an open-source framework featuring implementations of SOTA attack and defense methods. Our aim is to inspire further research into applications and improvements in the robustness of FL-based SG systems.
[420] Downscaling Extreme Precipitation with Wasserstein Regularized Diffusion
Yuhao Liu, James Doss-Gollin, Qiushi Dai, Ashok Veeraraghavan, Guha Balakrishnan
Main category: cs.LG
TL;DR: WassDiff, a diffusion framework with Wasserstein regularization, downscales precipitation fields to high resolution, outperforming existing methods in capturing extreme weather events.
Details
Motivation: Extreme rainfall analysis needs high-resolution and long-term data, but current sources lack either resolution or coverage.Method: WassDiff uses a diffusion framework with Wasserstein regularization to downscale low-resolution precipitation data.
Result: WassDiff outperforms state-of-the-art methods in recovering extreme weather phenomena and reproduces fine-scale structures accurately.
Conclusion: WassDiff enables high-resolution rainfall analysis from coarse data, improving flood-risk and climate-adaptation planning.
Abstract: Understanding the risks posed by extreme rainfall events requires analysis of precipitation fields with high resolution (to assess localized hazards) and extensive historical coverage (to capture sufficient examples of rare occurrences). Radar and mesonet networks provide precipitation fields at 1 km resolution but with limited historical and geographical coverage, while gauge-based records and reanalysis products cover decades of time on a global scale, but only at 30-50 km resolution. To help provide high-resolution precipitation estimates over long time scales, this study presents Wasserstein Regularized Diffusion (WassDiff), a diffusion framework to downscale (super-resolve) precipitation fields from low-resolution gauge and reanalysis products. Crucially, unlike related deep generative models, WassDiff integrates a Wasserstein distribution-matching regularizer to the denoising process to reduce empirical biases at extreme intensities. Comprehensive evaluations demonstrate that WassDiff quantitatively outperforms existing state-of-the-art generative downscaling methods at recovering extreme weather phenomena such as tropical storms and cold fronts. Case studies further qualitatively demonstrate WassDiff’s ability to reproduce realistic fine-scale weather structures and accurate peak intensities. By unlocking decades of high-resolution rainfall information from globally available coarse records, WassDiff offers a practical pathway toward more accurate flood-risk assessments and climate-adaptation planning.
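The regularizer is cheap to state: in one dimension, the Wasserstein-1 distance between two empirical intensity distributions is the mean absolute gap between sorted values. A hedged sketch of attaching it to a denoising objective (the plain MSE term and the weight `lam` are placeholder assumptions, not WassDiff's exact schedule-weighted objective):

```python
import torch
import torch.nn.functional as F

def wasserstein_1d(pred, target):
    """Empirical Wasserstein-1 distance between pixel-intensity
    distributions: in 1-D the optimal coupling matches sorted values,
    so the distance is the mean absolute gap between sorted intensities."""
    p = pred.flatten().sort().values
    t = target.flatten().sort().values
    return (p - t).abs().mean()

def denoising_loss(x0_hat, x0, lam=0.1):
    # reconstruction term plus the distribution-matching regularizer,
    # which penalizes systematic bias at extreme intensities
    return F.mse_loss(x0_hat, x0) + lam * wasserstein_1d(x0_hat, x0)
```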
[421] Differentiation Through Black-Box Quadratic Programming Solvers
Connor W. Magoon, Fengyu Yang, Noam Aigerman, Shahar Z. Kovalsky
Main category: cs.LG
TL;DR: dQP is a modular, solver-agnostic framework for differentiating QP solutions, enabling broader use in neural networks and bi-level optimization without solver restrictions.
Details
Motivation: Existing QP differentiation methods rely on specific solvers, limiting flexibility and applicability.Method: dQP decouples QP solution and differentiation by expressing them as related linear systems using the active set.
Result: The framework integrates with over 15 solvers, showing robustness and scalability, especially in large-scale sparse problems.
Conclusion: dQP provides a flexible, efficient solution for QP differentiation, overcoming solver limitations.
Abstract: Differentiable optimization has attracted significant research interest, particularly for quadratic programming (QP). Existing approaches for differentiating the solution of a QP with respect to its defining parameters often rely on specific integrated solvers. This integration limits their applicability, including their use in neural network architectures and bi-level optimization tasks, restricting users to a narrow selection of solver choices. To address this limitation, we introduce dQP, a modular and solver-agnostic framework for plug-and-play differentiation of virtually any QP solver. Our key theoretical insight is that the solution and its derivative can each be expressed in terms of closely-related and simple linear systems by using the active set at the solution. This insight enables efficient decoupling of the QP’s solution, obtained by any solver, from its differentiation. Our open-source, minimal-overhead implementation will be made publicly available and seamlessly integrates with more than 15 state-of-the-art solvers. Comprehensive benchmark experiments demonstrate dQP’s robustness and scalability, particularly highlighting its advantages in large-scale sparse problems.
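The key identity is easy to demonstrate: once any solver returns z* and the active set is read off, the derivative satisfies a linear system built from the same KKT blocks. A minimal sketch for inequality-constrained QPs, differentiating with respect to the linear term q (degenerate active sets and equality constraints are ignored for brevity):

```python
import numpy as np

def qp_solution_derivative(Q, G, h, z_star, tol=1e-8):
    """Given any solver's solution z* of  min 1/2 z'Qz + q'z  s.t.  Gz <= h,
    recover dz*/dq from one linear solve: on the active set A, the KKT
    conditions  Qz + q + G_A' lam = 0,  G_A z = h_A  form a square system,
    and differentiating w.r.t. q yields another system with the same matrix
    (the decoupling that makes the approach solver-agnostic)."""
    active = np.abs(G @ z_star - h) < tol            # identify active constraints
    G_A = G[active]
    n, m = Q.shape[0], G_A.shape[0]
    K = np.block([[Q, G_A.T], [G_A, np.zeros((m, m))]])
    rhs = np.vstack([-np.eye(n), np.zeros((m, n))])  # d(KKT)/dq
    sol = np.linalg.solve(K, rhs)
    return sol[:n]                                   # dz*/dq, shape (n, n)
```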
[422] Retrieval-Augmented Decision Transformer: External Memory for In-context RL
Thomas Schmied, Fabian Paischer, Vihang Patil, Markus Hofmarcher, Razvan Pascanu, Sepp Hochreiter
Main category: cs.LG
TL;DR: RA-DT introduces a retrieval-augmented method for in-context RL, outperforming baselines with shorter context lengths and addressing limitations in complex environments.
Details
Motivation: Existing in-context RL methods require full episodes, limiting applicability to simple environments. RA-DT aims to overcome this by retrieving relevant sub-trajectories.Method: RA-DT uses an external memory mechanism to retrieve domain-agnostic sub-trajectories, avoiding the need for full episodes in context.
Result: RA-DT outperforms baselines in grid-worlds with shorter context lengths and demonstrates potential in robotics and video games.
Conclusion: RA-DT advances in-context RL by reducing context length requirements, though challenges remain for complex environments. Datasets are released for future research.
Abstract: In-context learning (ICL) is the ability of a model to learn a new task by observing a few exemplars in its context. While prevalent in NLP, this capability has recently also been observed in Reinforcement Learning (RL) settings. Prior in-context RL methods, however, require entire episodes in the agent’s context. Given that complex environments typically lead to long episodes with sparse rewards, these methods are constrained to simple environments with short episodes. To address these challenges, we introduce Retrieval-Augmented Decision Transformer (RA-DT). RA-DT employs an external memory mechanism to store past experiences from which it retrieves only sub-trajectories relevant for the current situation. The retrieval component in RA-DT does not require training and can be entirely domain-agnostic. We evaluate the capabilities of RA-DT on grid-world environments, robotics simulations, and procedurally-generated video games. On grid-worlds, RA-DT outperforms baselines, while using only a fraction of their context length. Furthermore, we illuminate the limitations of current in-context RL methods on complex environments and discuss future directions. To facilitate future research, we release datasets for four of the considered environments.
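Since the retrieval component needs no training, it can be as simple as nearest-neighbour search over an embedding index of stored sub-trajectories. A sketch (the embedding model, cosine metric, and k are assumptions):

```python
import numpy as np

def retrieve_subtrajectories(memory, query, k=4):
    """Training-free retrieval over stored sub-trajectory embeddings.
    `memory` is (N, d) embeddings aligned with a list of sub-trajectories."""
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    scores = m @ q                       # cosine similarity to the query
    return np.argsort(-scores)[:k]       # indices of the top-k neighbours

# usage: embed the agent's recent context (e.g., with a frozen encoder),
# fetch the top-k sub-trajectories, and prepend them to the DT context.
```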
[423] Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning
Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Taiji Suzuki, Qingfu Zhang, Hau-San Wong
Main category: cs.LG
TL;DR: The paper explores the connection between the linear regularity of multi-concept semantics in transformer-based LLMs and their in-context learning (ICL) capabilities, providing a fine-grained mathematical analysis to explain their innovative power and out-of-distribution performance.
Details
Motivation: Existing studies lack a theoretical understanding of how the linear regularity of multi-concept semantics in LLMs relates to their ICL capabilities, and prior work often simplifies scenarios unrealistically. This paper aims to bridge this gap.Method: The study uses a concept-based low-noise sparse coding prompt model and advanced techniques to analyze transformers’ training dynamics, including softmax self-attention, ReLU-activated MLPs, and cross-entropy loss.
Result: The analysis demonstrates exponential 0-1 loss convergence over non-convex training dynamics, explaining how transformers leverage multi-concept semantics for powerful ICL and out-of-distribution performance. Empirical simulations support these findings.
Conclusion: The work provides insights into transformers’ ability to innovate solutions for unseen tasks with multi-concept semantics, advancing theoretical understanding of ICL in LLMs.
Abstract: Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. Existing empirical studies have revealed a strong connection between these LLMs’ impressive emergence abilities and their in-context learning (ICL) capacity, allowing them to solve new tasks using only task-specific prompts without further fine-tuning. On the other hand, existing empirical and theoretical studies also show that there is a linear regularity of the multi-concept encoded semantic representation behind transformer-based LLMs. However, existing theoretical work fails to build an understanding of the connection between this regularity and the innovative power of ICL. Additionally, prior work often focuses on simplified, unrealistic scenarios involving linear transformers or unrealistic loss functions, and achieves only linear or sub-linear convergence rates. In contrast, this work provides a fine-grained mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities, offering insights into how transformers innovate solutions for certain unseen tasks encoded with multiple cross-concept semantics. Inspired by empirical studies on the linear latent geometry of LLMs, the analysis is based on a concept-based low-noise sparse coding prompt model. Leveraging advanced techniques, this work showcases the exponential 0-1 loss convergence over the highly non-convex training dynamics, which pioneeringly incorporates the challenges of softmax self-attention, ReLU-activated MLPs, and cross-entropy loss. Empirical simulations corroborate the theoretical findings.
[424] Generative Feature Training of Thin 2-Layer Networks
Johannes Hertrich, Sebastian Neumayer
Main category: cs.LG
TL;DR: The paper proposes a method to improve training of 2-layer neural networks by initializing hidden weights using a learned generative model and refining them with gradient-based post-processing.
Details
Motivation: Gradient-based training often gets stuck in local minima due to the non-convex energy landscape of 2-layer neural networks.Method: Initialize hidden weights with samples from a learned generative model, solve for optimal output weights, and refine weights with gradient-based post-processing and regularization.
Result: Numerical examples demonstrate the effectiveness of the approach.
Conclusion: The proposed method mitigates local minima issues and improves training efficiency.
Abstract: We consider the approximation of functions by 2-layer neural networks with a small number of hidden weights based on the squared loss and small datasets. Due to the highly non-convex energy landscape, gradient-based training often suffers from local minima. As a remedy, we initialize the hidden weights with samples from a learned proposal distribution, which we parameterize as a deep generative model. To train this model, we exploit the fact that with fixed hidden weights, the optimal output weights solve a linear equation. After learning the generative model, we refine the sampled weights with a gradient-based post-processing in the latent space. Here, we also include a regularization scheme to counteract potential noise. Finally, we demonstrate the effectiveness of our approach by numerical examples.
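The linear-solve fact the method exploits is easy to see in code: with hidden weights frozen, the output layer is an ordinary least-squares problem. A self-contained sketch, where a stand-in Gaussian plays the role of the learned proposal distribution:

```python
import numpy as np

def optimal_output_weights(X, y, W_hidden, b_hidden):
    """With hidden weights fixed, f(x) = a' tanh(Wx + b) is linear in the
    output weights a, so the squared loss is minimized by least squares."""
    features = np.tanh(X @ W_hidden.T + b_hidden)  # (n_samples, n_hidden)
    a, *_ = np.linalg.lstsq(features, y, rcond=None)
    return a

# outer loop sketch: sample (W, b) from the proposal (a Gaussian stand-in
# here), solve for a, then refine with gradient-based post-processing
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)); y = np.sin(X.sum(axis=1))
W, b = rng.normal(size=(16, 3)), rng.normal(size=16)
a = optimal_output_weights(X, y, W, b)
print("train MSE:", np.mean((np.tanh(X @ W.T + b) @ a - y) ** 2))
```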
[425] Scalable Out-of-distribution Robustness in the Presence of Unobserved Confounders
Parjanya Prashant, Seyedeh Baharan Khatami, Bruno Ribeiro, Babak Salimi
Main category: cs.LG
TL;DR: The paper addresses OOD generalization under unobserved confounders, proposing a simpler predictor with superior performance.
Details
Motivation: Traditional methods fail under unobserved confounders affecting both covariates and labels, creating challenges for OOD robustness.Method: Introduces identifiability assumptions enabling a simple predictor using only one additional variable.
Result: Demonstrates superior empirical performance on benchmark tasks.
Conclusion: A simpler, effective solution for OOD generalization under confounding.
Abstract: We consider the task of out-of-distribution (OOD) generalization, where the distribution shift is due to an unobserved confounder ($Z$) affecting both the covariates ($X$) and the labels ($Y$). This confounding introduces heterogeneity in the predictor, i.e., $P(Y | X) = E_{P(Z | X)}[P(Y | X,Z)]$, making traditional covariate and label shift assumptions unsuitable. OOD generalization differs from traditional domain adaptation in that it does not assume access to the covariate distribution ($X^\text{te}$) of the test samples during training. These conditions create a challenging scenario for OOD robustness: (a) $Z^\text{tr}$ is an unobserved confounder during training, (b) $P^\text{te}(Z) \neq P^\text{tr}(Z)$, (c) $X^\text{te}$ is unavailable during training, and (d) the predictive distribution depends on $P^\text{te}(Z)$. While prior work has developed complex predictors requiring multiple additional variables for identifiability of the latent distribution, we explore a set of identifiability assumptions that yield a surprisingly simple predictor using only a single additional variable. Our approach demonstrates superior empirical performance on several benchmark tasks.
[426] Indirect Query Bayesian Optimization with Integrated Feedback
Mengyan Zhang, Shahine Bouabid, Cheng Soon Ong, Seth Flaxman, Dino Sejdinovic
Main category: cs.LG
TL;DR: The paper introduces Indirect Query Bayesian Optimization (IQBO), a framework for optimizing functions using indirect feedback via conditional expectations, and proposes the CMES acquisition function and a hierarchical search algorithm for efficiency.
Details
Motivation: The work is motivated by real-world constraints like privacy, hardware, or computational limitations that prevent direct feedback access.Method: The authors propose the Conditional Max-Value Entropy Search (CMES) acquisition function and a hierarchical search algorithm with multi-resolution feedback.
Result: The paper provides regret bounds for the proposed methods and demonstrates their effectiveness on simulated tasks.
Conclusion: The IQBO framework and CMES method effectively address optimization problems with indirect feedback, offering practical solutions for constrained scenarios.
Abstract: We develop the framework of Indirect Query Bayesian Optimization (IQBO), a new class of Bayesian optimization problems where the integrated feedback is given via a conditional expectation of the unknown function $f$ to be optimized. The underlying conditional distribution can be unknown and learned from data. The goal is to find the global optimum of $f$ by adaptively querying and observing in the space transformed by the conditional distribution. This is motivated by real-world applications where one cannot access direct feedback due to privacy, hardware or computational constraints. We propose the Conditional Max-Value Entropy Search (CMES) acquisition function to address this novel setting, and propose a hierarchical search algorithm with multi-resolution feedback to improve computational efficiency. We show regret bounds for our proposed methods and demonstrate the effectiveness of our approaches on simulated optimization tasks.
[427] Evaluation of Bio-Inspired Models under Different Learning Settings For Energy Efficiency in Network Traffic Prediction
Theodoros Tsiolakis, Nikolaos Pavlidis, Vasileios Perifanis, Pavlos Efraimidis
Main category: cs.LG
TL;DR: The study explores bio-inspired models (SNNs and ESNs) for cellular traffic forecasting, comparing their energy efficiency and predictive performance with traditional ML models (CNNs, MLPs) in centralized and federated settings. Results show bio-inspired models achieve comparable accuracy with significant energy savings.
Details
Motivation: The exponential growth of cellular data and the environmental impact of ML models motivate the investigation of energy-efficient, bio-inspired alternatives for traffic forecasting.Method: Implemented SNNs and ESNs, compared with CNNs and MLPs, in centralized and federated settings using data from Barcelona. Evaluated predictive performance and energy efficiency.
Result: Bio-inspired models (SNNs, ESNs) matched traditional models in accuracy while saving energy. Federated settings showed promise for decentralized efficiency.
Conclusion: Bio-inspired models offer sustainable, privacy-preserving solutions for cellular traffic forecasting, balancing accuracy and energy efficiency.
Abstract: Cellular traffic forecasting is a critical task that enables network operators to efficiently allocate resources and address anomalies in rapidly evolving environments. The exponential growth of data collected from base stations poses significant challenges to processing and analysis. While machine learning (ML) algorithms have emerged as powerful tools for handling these large datasets and providing accurate predictions, their environmental impact, particularly in terms of energy consumption, is often overlooked in favor of their predictive capabilities. This study investigates the potential of two bio-inspired models: Spiking Neural Networks (SNNs) and Reservoir Computing through Echo State Networks (ESNs) for cellular traffic forecasting. The evaluation focuses on both their predictive performance and energy efficiency. These models are implemented in both centralized and federated settings to analyze their effectiveness and energy consumption in decentralized systems. Additionally, we compare bio-inspired models with traditional architectures, such as Convolutional Neural Networks (CNNs) and Multi-Layer Perceptrons (MLPs), to provide a comprehensive evaluation. Using data collected from three diverse locations in Barcelona, Spain, we examine the trade-offs between predictive accuracy and energy demands across these approaches. The results indicate that bio-inspired models, such as SNNs and ESNs, can achieve significant energy savings while maintaining predictive accuracy comparable to traditional architectures. Furthermore, federated implementations were tested to evaluate their energy efficiency in decentralized settings compared to centralized systems, particularly in combination with bio-inspired models. These findings offer valuable insights into the potential of bio-inspired models for sustainable and privacy-preserving cellular traffic forecasting.
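Part of the energy argument is architectural: an echo state network trains only a linear readout over a fixed random reservoir, avoiding backpropagation through time. A minimal NumPy sketch (reservoir size, spectral radius, and ridge strength are assumed values):

```python
import numpy as np

class EchoStateNetwork:
    """Minimal ESN forecaster: a fixed random reservoir with a ridge-
    regression readout, which is what makes ESNs cheap to train relative
    to backprop-based models."""
    def __init__(self, n_in, n_res=200, rho=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.normal(size=(n_res, n_res))
        self.W = W * (rho / np.max(np.abs(np.linalg.eigvals(W))))  # set spectral radius
        self.W_out = None

    def _states(self, U):                      # U: (time, n_in) input sequence
        h, H = np.zeros(self.W.shape[0]), []
        for u in U:
            h = np.tanh(self.W_in @ u + self.W @ h)
            H.append(h.copy())
        return np.array(H)

    def fit(self, U, y, ridge=1e-4):           # train only the linear readout
        H = self._states(U)
        self.W_out = np.linalg.solve(H.T @ H + ridge * np.eye(H.shape[1]), H.T @ y)

    def predict(self, U):
        return self._states(U) @ self.W_out
```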
[428] MVICAD2: Multi-View Independent Component Analysis with Delays and Dilations
Ambroise Heurtebise, Omar Chehab, Pierre Ablin, Alexandre Gramfort
Main category: cs.LG
TL;DR: The paper proposes MVICAD2, an extension of MVICAD, to handle temporal delays and dilations in multi-view ICA for neuroscience data, outperforming existing methods.
Details
Motivation: Challenges in multi-view learning for neuroscience, especially in MEG data analysis, due to individual variability and temporal effects like delays and dilations.Method: Introduces MVICAD2, allowing sources to differ in delays and dilations, with identifiable sources, closed-form likelihood approximation, and optimization techniques.
Result: MVICAD2 outperforms existing multi-view ICA methods in simulations and shows effectiveness in the Cam-CAN dataset, linking delays/dilations to aging.
Conclusion: MVICAD2 addresses limitations of current methods by incorporating temporal variability, improving accuracy in neuroscience applications.
Abstract: Machine learning techniques in multi-view settings face significant challenges, particularly when integrating heterogeneous data, aligning feature spaces, and managing view-specific biases. These issues are prominent in neuroscience, where data from multiple subjects exposed to the same stimuli are analyzed to uncover brain activity dynamics. In magnetoencephalography (MEG), where signals are captured at the scalp level, estimating the brain’s underlying sources is crucial, especially in group studies where sources are assumed to be similar for all subjects. Common methods, such as Multi-View Independent Component Analysis (MVICA), assume identical sources across subjects, but this assumption is often too restrictive due to individual variability and age-related changes. Multi-View Independent Component Analysis with Delays (MVICAD) addresses this by allowing sources to differ up to a temporal delay. However, temporal dilation effects, particularly in auditory stimuli, are common in brain dynamics, making the estimation of time delays alone insufficient. To address this, we propose Multi-View Independent Component Analysis with Delays and Dilations (MVICAD2), which allows sources to differ across subjects in both temporal delays and dilations. We present a model with identifiable sources, derive an approximation of its likelihood in closed form, and use regularization and optimization techniques to enhance performance. Through simulations, we demonstrate that MVICAD2 outperforms existing multi-view ICA methods. We further validate its effectiveness on the Cam-CAN dataset, showing how delays and dilations relate to aging.
[429] Conformal Prediction of Classifiers with Many Classes based on Noisy Labels
Coby Penso, Jacob Goldberger, Ethan Fetaya
Main category: cs.LG
TL;DR: Noise-Aware Conformal Prediction (NACP) addresses CP calibration with noisy labels, ensuring reliable prediction sets even under label noise.
Details
Motivation: To handle CP calibration when only noisy labeled data is available, ensuring robust uncertainty quantification.Method: Estimates noise-free conformal thresholds from noisy labeled data, with finite sample coverage guarantees for uniform noise.
Result: Demonstrates effective performance on standard image classification datasets with many classes.
Conclusion: NACP provides a practical solution for CP calibration under label noise, maintaining coverage guarantees.
Abstract: Conformal Prediction (CP) controls the prediction uncertainty of classification systems by producing a small prediction set, ensuring a predetermined probability that the true class lies within this set. This is commonly done by defining a score, based on the model predictions, and setting a threshold on this score using a validation set. In this study, we address the problem of CP calibration when we only have access to a calibration set with noisy labels. We show how we can estimate the noise-free conformal threshold based on the noisy labeled data. We derive a finite sample coverage guarantee for uniform noise that remains effective even in tasks with a large number of classes. We dub our approach Noise-Aware Conformal Prediction (NACP). We demonstrate the performance of the proposed method on several standard image classification datasets with a large number of classes.
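One way to picture threshold estimation from noisy labels, as a hypothetical sketch in the spirit of the paper rather than its exact estimator: under uniform noise with known rate eps, the noisy-score CDF is a mixture of the clean CDF and the CDF of a random class's score, so the clean CDF, and hence the conformal threshold, can be backed out by deconvolution.

```python
import numpy as np

def noise_aware_threshold(scores_noisy, scores_all, eps, alpha=0.1):
    """scores_noisy: (n,) conformal score of each (noisy) calibration label.
    scores_all: (n, K) scores of every class for each calibration point.
    Assumes the observed label is uniformly random w.p. eps, so
    F_noisy(t) = (1 - eps) * F_clean(t) + eps * F_rand(t)."""
    grid = np.sort(scores_noisy)
    F_noisy = np.arange(1, len(grid) + 1) / len(grid)            # empirical CDF
    F_rand = np.array([(scores_all <= t).mean() for t in grid])  # random-class CDF
    F_clean = np.clip((F_noisy - eps * F_rand) / (1 - eps), 0.0, 1.0)
    idx = np.argmax(F_clean >= 1 - alpha)    # first grid point reaching coverage
    return grid[idx]                         # noise-corrected threshold

# prediction set for a test point: {y : score(x, y) <= threshold}
```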
[430] Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models
Jialin Zhao, Yingtao Zhang, Carlo Vittorio Cannistraci
Main category: cs.LG
TL;DR: Pivoting Factorization (PIFA) is a novel lossless meta low-rank representation for model compression, improving memory savings and inference speed while matching semi-structured pruning performance.
Details
Motivation: Address the performance gap and inefficiency of low-rank pruning compared to semi-structured pruning in Large Language Models.Method: Introduces PIFA, which learns compact low-rank representations by identifying pivot rows and expressing non-pivot rows as linear combinations. Also proposes a retraining-free reconstruction method (M) to minimize error.
Result: Achieves 24.2% additional memory savings and 24.6% faster inference at rank = 50% of dimension, outperforming existing low-rank pruning methods.
Conclusion: MPIFA (combining M and PIFA) matches semi-structured pruning performance while being more GPU-efficient and compatible, offering a superior model compression solution.
Abstract: The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its GPU compatibility across all densities. However, low-rank pruning struggles to match the performance of semi-structured pruning, often doubling perplexity at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that unsupervisedly learns a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations, achieving 24.2% additional memory savings and 24.6% faster inference over low-rank layers at rank = 50% of dimension. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free reconstruction method that minimizes error accumulation (M). MPIFA, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods, and achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility. Our code is available at https://github.com/biomedical-cybernetics/pivoting-factorization.
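The pivot-row construction is standard linear algebra: QR with column pivoting on Wᵀ selects linearly independent rows, and every other row is stored only as its coefficients over the pivot rows, which is lossless for an exactly low-rank matrix. An illustrative sketch (PIFA's actual storage layout and selection details may differ):

```python
import numpy as np
from scipy.linalg import qr

def pivoting_factorization(W, rank):
    """Find `rank` pivot (linearly independent) rows of W via QR with
    column pivoting on W.T, then express the remaining rows as linear
    combinations of the pivot rows."""
    _, _, piv = qr(W.T, pivoting=True)
    pivot, rest = piv[:rank], piv[rank:]
    P = W[pivot]                                     # (rank, n) pivot rows
    C, *_ = np.linalg.lstsq(P.T, W[rest].T, rcond=None)
    return pivot, rest, P, C.T                       # W[rest] == C @ P

# quick check on a synthetic exactly-low-rank matrix
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16)) @ rng.normal(size=(16, 64))  # rank 16
pivot, rest, P, C = pivoting_factorization(W, 16)
print(np.allclose(C @ P, W[rest], atol=1e-8))              # True: lossless
```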
[431] One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
Jacob Dunefsky, Arman Cohan
Main category: cs.LG
TL;DR: The paper proposes optimizing steering vectors (SVs) for LLMs using gradient descent on a single example, demonstrating their effectiveness in controlling safety-relevant behaviors and misalignment.
Details
Motivation: Current SV methods require impractical contrastive datasets, which may capture spurious correlations. The paper aims to simplify SV optimization and explore its generalization.Method: Directly optimize SVs via gradient descent on one training example, testing various techniques and evaluating generalization across tasks like refusal suppression and misalignment.
Result: One-shot optimized SVs effectively mediate harmful behaviors, achieving a 96.9% attack success rate in Harmbench and inducing misalignment in unrelated prompts.
Conclusion: Single-example SV optimization can control diverse misaligned behaviors in LLMs, offering a practical alternative to large contrastive datasets.
Abstract: Steering vectors (SVs) have emerged as a promising approach for interpreting and controlling LLMs, but current methods typically require large contrastive datasets that are often impractical to construct and may capture spurious correlations. We propose directly optimizing SVs through gradient descent on a single training example, and systematically investigate how these SVs generalize. We consider several SV optimization techniques and find that the resulting SVs effectively mediate safety-relevant behaviors in multiple models. Indeed, in experiments on an alignment-faking model, we are able to optimize one-shot SVs that induce harmful behavior on benign examples and whose negations suppress harmful behavior on malign examples. And in experiments on refusal suppression, we demonstrate that one-shot optimized SVs can transfer across inputs, yielding a Harmbench attack success rate of 96.9%. Furthermore, we extend work on “emergent misalignment” and show that SVs optimized to induce a model to write vulnerable code cause the model to respond harmfully on unrelated open-ended prompts. Finally, we use one-shot SV optimization to investigate how an instruction-tuned LLM recovers from outputting false information, and find that this ability is independent of the model’s explicit verbalization that the information was false. Overall, our findings suggest that optimizing SVs on a single example can mediate a wide array of misaligned behaviors in LLMs. Code can be found at https://github.com/jacobdunefsky/one-shot-steering-repro and https://github.com/jacobdunefsky/one-shot-steering-misalignment.
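Mechanically, one-shot SV optimization is a small gradient loop over a single added vector. A hedged sketch with GPT-2 (the layer index, learning rate, step count, and prompt/target pair are all illustrative assumptions, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)                  # only the SV is trained

layer = 6
sv = torch.zeros(model.config.n_embd, requires_grad=True)

def add_sv(module, inputs, output):          # inject SV into the residual stream
    return (output[0] + sv,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(add_sv)

prompt, target = "How do I pick a lock?", " I cannot help with that."
ids = tok(prompt + target, return_tensors="pt").input_ids
n_prompt = len(tok(prompt).input_ids)

opt = torch.optim.Adam([sv], lr=1e-2)
for _ in range(50):                          # gradient descent on one example
    logits = model(ids).logits[0, n_prompt - 1 : -1]   # predictions for target
    loss = torch.nn.functional.cross_entropy(logits, ids[0, n_prompt:])
    opt.zero_grad(); loss.backward(); opt.step()
handle.remove()
```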
[432] Accelerating Linear Recurrent Neural Networks for the Edge with Unstructured Sparsity
Alessandro Pierro, Steven Abreu, Jonathan Timcheck, Philipp Stratmann, Andreas Wild, Sumit Bam Shrestha
Main category: cs.LG
TL;DR: Sparse linear RNNs outperform dense models in efficiency and performance, achieving 2x less compute and 36% less memory at iso-accuracy, with real-world gains in latency and energy on edge hardware.
Details
Motivation: To optimize linear RNNs for resource-constrained edge environments by leveraging unstructured sparsity for better efficiency-performance trade-offs.Method: Conduct a scaling study to explore performance-efficiency trade-offs, apply sparsity, quantize models, and deploy on Intel Loihi 2 neuromorphic hardware.
Result: Sparse models achieve 42x lower latency and 149x lower energy consumption vs. dense models on edge GPUs, with state-of-the-art audio denoising results.
Conclusion: Unstructured sparsity enables highly efficient RNNs for real-world edge applications, demonstrating transformative potential in resource-constrained settings.
Abstract: Linear recurrent neural networks enable powerful long-range sequence modeling with constant memory usage and time-per-token during inference. These architectures hold promise for streaming applications at the edge, but deployment in resource-constrained environments requires hardware-aware optimizations to minimize latency and energy consumption. Unstructured sparsity offers a compelling solution, enabling substantial reductions in compute and memory requirements–when accelerated by compatible hardware platforms. In this paper, we conduct a scaling study to investigate the Pareto front of performance and efficiency across inference compute budgets. We find that highly sparse linear RNNs consistently achieve better efficiency-performance trade-offs than dense baselines, with 2x less compute and 36% less memory at iso-accuracy. Our models achieve state-of-the-art results on a real-time streaming task for audio denoising. By quantizing our sparse models to fixed-point arithmetic and deploying them on the Intel Loihi 2 neuromorphic chip for real-time processing, we translate model compression into tangible gains of 42x lower latency and 149x lower energy consumption compared to a dense model on an edge GPU. Our findings showcase the transformative potential of unstructured sparsity, paving the way for highly efficient recurrent neural networks in real-world, resource-constrained environments.
[433] RIZE: Regularized Imitation Learning via Distributional Reinforcement Learning
Adib Karimi, Mohammad Mehdi Ebadzadeh
Main category: cs.LG
TL;DR: A novel IRL method with adaptive TD regularizer and distributional RL improves reward flexibility and achieves expert-level performance on MuJoCo tasks.
Details
Motivation: Addresses rigidity in fixed reward structures and limited flexibility in implicit reward regularization in IRL.Method: Incorporates a squared TD regularizer with adaptive targets and integrates distributional RL for richer return information.
Result: Achieves expert-level performance on MuJoCo tasks, outperforming baselines on Humanoid with 3 demonstrations.
Conclusion: Validates effectiveness through experiments, offering insights into reward dynamics in imitation learning.
Abstract: We propose a novel Inverse Reinforcement Learning (IRL) method that mitigates the rigidity of fixed reward structures and the limited flexibility of implicit reward regularization. Building on the Maximum Entropy IRL framework, our approach incorporates a squared temporal-difference (TD) regularizer with adaptive targets that evolve dynamically during training, thereby imposing adaptive bounds on recovered rewards and promoting robust decision-making. To capture richer return information, we integrate distributional RL into the learning process. Empirically, our method achieves expert-level performance on complex MuJoCo tasks, surpassing baseline methods on the Humanoid task with 3 demonstrations. Extensive experiments and ablation studies further validate the effectiveness of the approach and provide insights into reward dynamics in imitation learning.
[434] LEAPS: A discrete neural sampler via locally equivariant networks
Peter Holderrieth, Michael S. Albergo, Tommi Jaakkola
Main category: cs.LG
TL;DR: LEAPS is an algorithm for sampling from discrete distributions using a CTMC-based approach, combining annealed importance sampling with novel locally equivariant functions for scalable training.
Details
Motivation: The paper aims to address the challenge of sampling from discrete distributions by leveraging continuous-time Markov chains (CTMCs) to minimize variance in importance weights.Method: LEAPS introduces a continuous-time formulation of annealed importance sampling, using Radon-Nikodym derivatives and locally equivariant functions to parameterize rate matrices efficiently.
Result: The algorithm demonstrates efficacy in statistical physics problems, showing reduced variance in importance weights.
Conclusion: LEAPS provides a scalable and efficient method for sampling from discrete distributions by combining CTMCs with locally equivariant neural network architectures.
Abstract: We propose “LEAPS”, an algorithm to sample from discrete distributions known up to normalization by learning a rate matrix of a continuous-time Markov chain (CTMC). LEAPS can be seen as a continuous-time formulation of annealed importance sampling and sequential Monte Carlo methods, extended so that the variance of the importance weights is offset by the inclusion of the CTMC. To derive these importance weights, we introduce a set of Radon-Nikodym derivatives of CTMCs over their path measures. Because the computation of these weights is intractable with standard neural network parameterizations of rate matrices, we devise a new compact representation for rate matrices via what we call “locally equivariant” functions. To parameterize them, we introduce a family of locally equivariant multilayer perceptrons, attention layers, and convolutional networks, and provide an approach to make deep networks that preserve the local equivariance. This property allows us to propose a scalable training algorithm for the rate matrix such that the variance of the importance weights associated with the CTMC is minimized. We demonstrate the efficacy of LEAPS on problems in statistical physics.
[435] Fast, Accurate Manifold Denoising by Tunneling Riemannian Optimization
Shiyu Wang, Mariam Avagyan, Yihan Shen, Arnaud Lamy, Tingran Wang, Szabolcs Márka, Zsuzsa Márka, John Wright
Main category: cs.LG
TL;DR: The paper introduces a test-time efficient manifold denoising framework by reframing denoising as optimization, using online learning and mixed-order methods for global optimality and efficiency.
Details
Motivation: Existing denoising methods are inefficient or lack interpretability, prompting the need for a novel approach leveraging manifold structure in data.Method: Proposes online learning to optimize over clean signal manifolds and mixed-order methods to ensure global optimality and efficiency.
Result: Theoretical and experimental results show improved complexity-performance tradeoffs over existing methods like nearest neighbor search.
Conclusion: The framework offers efficient and near-optimal denoising by combining learning-to-optimize with mixed-order traversal.
Abstract: Learned denoisers play a fundamental role in various signal generation (e.g., diffusion models) and reconstruction (e.g., compressed sensing) architectures, whose success derives from their ability to leverage low-dimensional structure in data. Existing denoising methods, however, either rely on local approximations that require a linear scan of the entire dataset or treat denoising as generic function approximation problems, often sacrificing efficiency and interpretability. We consider the problem of efficiently denoising a new noisy data point sampled from an unknown $d$-dimensional manifold $M \subset \mathbb{R}^D$, using only noisy samples. This work proposes a framework for test-time efficient manifold denoising, by framing the concept of “learning-to-denoise” as “learning-to-optimize”. We have two technical innovations: (i) online learning methods which learn to optimize over the manifold of clean signals using only noisy data, effectively “growing” an optimizer one sample at a time. (ii) mixed-order methods which guarantee that the learned optimizers achieve global optimality, ensuring both efficiency and near-optimal denoising performance. We corroborate these claims with theoretical analyses of both the complexity and denoising performance of mixed-order traversal. Our experiments on scientific manifolds demonstrate significantly improved complexity-performance tradeoffs compared to nearest neighbor search, which underpins existing provable denoising approaches based on exhaustive search.
[436] Underdamped Diffusion Bridges with Applications to Sampling
Denis Blessing, Julius Berner, Lorenz Richter, Gerhard Neumann
Main category: cs.LG
TL;DR: The paper introduces a framework for learning diffusion bridges, including underdamped versions, and shows its equivalence to likelihood maximization. It proposes underdamped diffusion bridges for sampling unnormalized densities, achieving state-of-the-art performance.
Details
Motivation: The motivation is to improve generative modeling and sampling by leveraging underdamped stochastic processes, which offer better convergence and compatibility with numerical integration.Method: The method involves learning diffusion bridges, including underdamped versions, and applying score matching to maximize likelihood. It introduces underdamped diffusion bridges for general density evolution.
Result: The approach outperforms alternatives in sampling unnormalized densities, requiring fewer steps and no hyperparameter tuning.
Conclusion: The framework is effective for generative modeling and sampling, with underdamped diffusion bridges offering superior performance and efficiency.
Abstract: We provide a general framework for learning diffusion bridges that transport prior to target distributions. It includes existing diffusion models for generative modeling, but also underdamped versions with degenerate diffusion matrices, where the noise only acts in certain dimensions. Extending previous findings, our framework allows us to rigorously show that score matching in the underdamped case is indeed equivalent to maximizing a lower bound on the likelihood. Motivated by superior convergence properties and compatibility with sophisticated numerical integration schemes of underdamped stochastic processes, we propose underdamped diffusion bridges, where a general density evolution is learned rather than prescribed by a fixed noising process. We apply our method to the challenging task of sampling from unnormalized densities without access to samples from the target distribution. Across a diverse range of sampling problems, our approach demonstrates state-of-the-art performance, notably outperforming alternative methods, while requiring significantly fewer discretization steps and no hyperparameter tuning.
[437] Mosaic: Composite Projection Pruning for Resource-efficient LLMs
Bailey J. Eccles, Leon Wong, Blesson Varghese
Main category: cs.LG
TL;DR: The paper introduces projection pruning, a fine-grained method for pruning LLMs, enhanced by composite projection pruning, and presents Mosaic, a system for creating and deploying pruned models with significant improvements in speed, accuracy, and resource efficiency.
Details
Motivation: Large language models (LLMs) face deployment challenges due to high compute and memory demands. Existing coarse-grained pruning methods are inefficient and degrade model quality.Method: The paper proposes projection pruning and composite projection pruning (combining unstructured and structured pruning) and introduces Mosaic, a system for implementing these methods.
Result: Mosaic is 7.19x faster in model production, achieves up to 84.2% lower perplexity, 31.4% higher accuracy, 67% faster inference, and 68% lower GPU memory usage compared to coarse-grained pruning.
Conclusion: Mosaic offers a superior approach to pruning LLMs, balancing accuracy and efficiency, and is publicly available for use.
Abstract: Extensive compute and memory requirements limit the deployment of large language models (LLMs) on any hardware. Compression methods, such as pruning, can reduce model size, which in turn reduces resource requirements. State-of-the-art pruning is based on coarse-grained methods. They are time-consuming and inherently remove critical model parameters, adversely impacting the quality of the pruned model. This paper introduces projection pruning, a novel fine-grained method for pruning LLMs. In addition, LLM projection pruning is enhanced by a new approach we refer to as composite projection pruning - the synergistic combination of unstructured pruning that retains accuracy and structured pruning that reduces model size. We develop Mosaic, a novel system to create and deploy pruned LLMs using composite projection pruning. Mosaic is evaluated using a range of performance and quality metrics on multiple hardware platforms, LLMs, and datasets. Mosaic is 7.19x faster in producing models than existing approaches. Mosaic models achieve up to 84.2% lower perplexity and 31.4% higher accuracy than models obtained from coarse-grained pruning. Up to 67% faster inference and 68% lower GPU memory use are noted for Mosaic models. Mosaic is available for public use from https://github.com/blessonvar/Mosaic
[438] FedRecon: Missing Modality Reconstruction in Heterogeneous Distributed Environments
Junming Liu, Yanting Gao, Yifei Sun, Yufei Jin, Yirong Chen, Ding Wang, Guosun Zeng
Main category: cs.LG
TL;DR: FedRecon addresses missing modality reconstruction and Non-IID adaptation in multimodal federated learning using a lightweight MVAE and distribution mapping.
Details
Motivation: Real-world multimodal data is often incomplete and Non-IID, posing challenges for federated learning.Method: Uses a Multimodal Variational Autoencoder (MVAE) for missing modality reconstruction and a distribution mapping mechanism for sample-level alignment. Global generator freezing prevents catastrophic forgetting.
Result: Outperforms state-of-the-art methods in modality reconstruction under Non-IID conditions.
Conclusion: FedRecon effectively tackles coupled challenges of missing modalities and Non-IID data in FL, demonstrating superior performance.
Abstract: Multimodal data are often incomplete and exhibit Non-Independent and Identically Distributed (Non-IID) characteristics in real-world scenarios. These inherent limitations lead to both modality heterogeneity through partial modality absence and data heterogeneity from distribution divergence, creating fundamental challenges for effective federated learning (FL). To address these coupled challenges, we propose FedRecon, the first method targeting simultaneous missing modality reconstruction and Non-IID adaptation in multimodal FL. Our approach first employs a lightweight Multimodal Variational Autoencoder (MVAE) to reconstruct missing modalities while preserving cross-modal consistency. Distinct from conventional imputation methods, we achieve sample-level alignment through a novel distribution mapping mechanism that guarantees both data consistency and completeness. Additionally, we introduce a strategy employing global generator freezing to prevent catastrophic forgetting, which in turn mitigates Non-IID fluctuations. Extensive evaluations on multimodal datasets demonstrate FedRecon’s superior performance in modality reconstruction under Non-IID conditions, surpassing state-of-the-art methods. The code will be released upon paper acceptance.
[439] Dequantified Diffusion-Schrödinger Bridge for Density Ratio Estimation
Wei Chen, Shigui Li, Jiacheng Li, Junmei Yang, John Paisley, Delu Zeng
Main category: cs.LG
TL;DR: D3RE is a robust, stable, and efficient framework for density ratio estimation, addressing density-chasm and support-chasm problems with novel interpolants (DDBI and DSBI).
Details
Motivation: Existing methods for density ratio estimation fail under significantly different distributions or inadequately overlapping supports, and suffer from instability near boundaries.
Method: Proposes D3RE with dequantified diffusion bridge interpolant (DDBI) and dequantified Schrödinger bridge interpolant (DSBI) to expand support coverage and stabilize time scores.
Result: The method provides uniform approximation, bounded time scores theoretically, and outperforms baselines in mutual information and density estimation tasks.
Conclusion: D3RE effectively addresses key challenges in density ratio estimation, offering improved robustness, stability, and efficiency.
Abstract: Density ratio estimation is fundamental to tasks involving $f$-divergences, yet existing methods often fail under significantly different distributions or inadequately overlapping supports – the density-chasm and the support-chasm problems. Additionally, prior approaches yield divergent time scores near boundaries, leading to instability. We design $\textbf{D}^3\textbf{RE}$, a unified framework for \textbf{robust}, \textbf{stable} and \textbf{efficient} density ratio estimation. We propose the dequantified diffusion bridge interpolant (DDBI), which expands support coverage and stabilizes time scores via diffusion bridges and Gaussian dequantization. Building on DDBI, the proposed dequantified Schrödinger bridge interpolant (DSBI) incorporates optimal transport to solve the Schrödinger bridge problem, enhancing accuracy and efficiency. Our method offers uniform approximation and bounded time scores in theory, and outperforms baselines empirically in mutual information and density estimation tasks.
[440] Halting Recurrent GNNs and the Graded $\mu$-Calculus
Jeroen Bollen, Jan Van den Bussche, Stijn Vansummeren, Jonni Virtema
Main category: cs.LG
TL;DR: A halting mechanism for recurrent GNNs is proposed, enabling termination guarantees and expressive power matching graded modal mu-calculus.
Details
Motivation: Address the lack of termination guarantees and graph-size assumptions in current recurrent GNNs.
Method: Develop a halting mechanism and a new approximate semantics for graded mu-calculus, leading to a counting algorithm.
Result: The halting model can express all node classifiers in graded modal mu-calculus, independent of graph size.
Conclusion: The proposed halting recurrent GNNs are both expressive and practical, with termination guarantees.
Abstract: Graph Neural Networks (GNNs) are a class of machine-learning models that operate on graph-structured data. Their expressive power is intimately related to logics that are invariant under graded bisimilarity. Current proposals for recurrent GNNs either assume that the graph size is given to the model, or suffer from a lack of termination guarantees. In this paper, we propose a halting mechanism for recurrent GNNs. We prove that our halting model can express all node classifiers definable in graded modal mu-calculus, even for the standard GNN variant that is oblivious to the graph size. To prove our main result, we develop a new approximate semantics for graded mu-calculus, which we believe to be of independent interest. We leverage this new semantics into a new model-checking algorithm, called the counting algorithm, which is oblivious to the graph size. In a final step we show that the counting algorithm can be implemented on a halting recurrent GNN.
[441] Understanding Nonlinear Implicit Bias via Region Counts in Input Space
Jingwei Li, Jing Xu, Zifan Wang, Huishuai Zhang, Jingzhao Zhang
Main category: cs.LG
TL;DR: The paper explores implicit bias in neural networks by defining it through the count of connected regions in the input space with the same predicted label, linking it to generalization performance.
Details
Motivation: To better understand implicit bias in non-linear, overparameterized models, which remains poorly defined and understood.
Method: Proposes using region count as a metric for implicit bias, analyzing its correlation with decision boundary simplicity and generalization. Empirical and theoretical analysis of hyper-parameters (e.g., learning rate, batch size) on region count.
Result: Small region counts correlate with simple decision boundaries and good generalization. Larger learning rates and smaller batch sizes reduce region counts.
Conclusion: Region count is a useful metric for implicit bias, connecting hyper-parameters to generalization via decision boundary simplicity.
Abstract: One explanation for the strong generalization ability of neural networks is implicit bias. Yet, the definition and mechanism of implicit bias in non-linear contexts remain little understood. In this work, we propose to characterize implicit bias by the count of connected regions in the input space with the same predicted label. Compared with parameter-dependent metrics (e.g., norm or normalized margin), region count can be better adapted to nonlinear, overparameterized models, because it is determined by the function mapping and is invariant to reparametrization. Empirically, we found that small region counts align with geometrically simple decision boundaries and correlate well with good generalization performance. We also observe that good hyper-parameter choices such as larger learning rates and smaller batch sizes can induce small region counts. We further establish the theoretical connections and explain how larger learning rates can induce small region counts in neural networks.
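Estimating the region count on a 2-D input space (or a 2-D slice of a higher-dimensional one) is straightforward: label a dense grid with the classifier and count connected same-label components. A minimal sketch; the grid resolution and `scipy.ndimage.label` connectivity are illustrative choices, not the paper's protocol:

```python
import numpy as np
from scipy import ndimage

def region_count_2d(predict_fn, x_range=(-3, 3), y_range=(-3, 3), n=400):
    """Count connected same-label regions of a classifier on a 2-D grid."""
    xs = np.linspace(*x_range, n)
    ys = np.linspace(*y_range, n)
    X, Y = np.meshgrid(xs, ys)
    grid = np.stack([X.ravel(), Y.ravel()], axis=1)
    labels = predict_fn(grid).reshape(n, n)  # integer class per grid point

    total = 0
    for c in np.unique(labels):
        # Connected components (4-connectivity) of the set predicted as class c.
        _, n_components = ndimage.label(labels == c)
        total += n_components
    return total

# Toy predictor: two half-planes -> 2 regions.
count = region_count_2d(lambda Z: (Z[:, 0] + Z[:, 1] > 0).astype(int))
```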
[442] Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning
Motoki Omura, Kazuki Ota, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada
Main category: cs.LG
TL;DR: The paper explores using the Bellman optimality operator in actor-critic methods for continuous action spaces, proposing an annealing approach to balance learning speed and bias.
Details
Motivation: Improve sample efficiency in continuous action RL by incorporating the Bellman optimality operator, addressing limitations of current methods.
Method: Proposes an annealing technique transitioning from Bellman optimality to Bellman operator, tested with TD3 and SAC.
Result: Accelerates learning but introduces overestimation bias; annealing mitigates this, outperforming existing methods in locomotion and manipulation tasks.
Conclusion: The annealing approach enhances performance and robustness, validated by superior results in experiments.
Abstract: For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality. The code for this study is available at https://github.com/motokiomura/annealed-q-learning.
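One concrete reading of the annealing idea: blend an optimality-style backup (max over actions sampled from the policy) with the standard on-policy backup, decaying the mixing weight over training. A minimal sketch under assumed names (`q_target`, `policy`); the paper's actual schedule and estimator may differ:

```python
import torch

def annealed_td_target(q_target, policy, reward, next_state, done,
                       beta, gamma=0.99, n_action_samples=10):
    """Blend Bellman optimality and Bellman operator targets.

    beta=1 -> optimality operator (max over sampled actions, fast but biased);
    beta=0 -> standard Bellman operator (on-policy action, unbiased but slower).
    Anneal beta from 1 toward 0 over the course of training.
    """
    with torch.no_grad():
        # Approximate max_a Q(s', a) by sampling candidate actions from the policy.
        candidates = torch.stack(
            [policy(next_state) for _ in range(n_action_samples)], dim=0)
        q_candidates = torch.stack(
            [q_target(next_state, a) for a in candidates], dim=0)
        q_max = q_candidates.max(dim=0).values

        # Standard on-policy backup with a single sampled action.
        q_pi = q_target(next_state, policy(next_state))

        q_next = beta * q_max + (1 - beta) * q_pi
        return reward + gamma * (1 - done) * q_next

# e.g. beta = max(0.0, 1.0 - step / anneal_steps) inside the training loop
```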
[443] Mini-Game Lifetime Value Prediction in WeChat
Aochuan Chen, Yifan Niu, Ziqi Gao, Yujie Sun, Shoujun Liu, Gong Chen, Yang Liu, Jia Li
Main category: cs.LG
TL;DR: The paper introduces GRePO-LTV, a framework combining graph representation learning and Pareto-optimization to improve LTV prediction in sparse data and correlated task settings.
Details
Motivation: Accurate LTV prediction is crucial for aligning ads with user interests, but data scarcity and task interdependencies complicate this.
Method: Uses graph representation learning for data scarcity and Pareto-optimization for task interdependence.
Result: Proposes a novel framework to enhance LTV prediction accuracy.
Conclusion: GRePO-LTV effectively addresses key challenges in LTV prediction.
Abstract: The LifeTime Value (LTV) prediction, which endeavors to forecast the cumulative purchase contribution of a user to a particular item, remains a vital challenge that advertisers are keen to resolve. A precise LTV prediction system enhances the alignment of user interests with meticulously designed advertisements, thereby generating substantial profits for advertisers. Nonetheless, this issue is complicated by the paucity of data typically observed in real-world advertising scenarios. The purchase rate among registered users is often as critically low as 0.1%, resulting in a dataset where the majority of users make only several purchases. Consequently, there is insufficient supervisory signal for effectively training the LTV prediction model. An additional challenge emerges from the interdependencies among tasks with high correlation. It is a common practice to estimate a user’s contribution to a game over a specified temporal interval. Varying the lengths of these intervals corresponds to distinct predictive tasks, which are highly correlated. For instance, predictions over a 7-day period are heavily reliant on forecasts made over a 3-day period, where exceptional cases can adversely affect the accuracy of both tasks. In order to comprehensively address the aforementioned challenges, we introduce an innovative framework denoted as Graph-Represented Pareto-Optimal LifeTime Value prediction (GRePO-LTV). Graph representation learning is initially employed to address the issue of data scarcity. Subsequently, Pareto-Optimization is utilized to manage the interdependence of prediction tasks.
[444] Leveraging Predictive Equivalence in Decision Trees
Hayden McTavish, Zachery Boner, Jon Donnelly, Margo Seltzer, Cynthia Rudin
Main category: cs.LG
TL;DR: The paper addresses predictive equivalence in decision trees, introduces a boolean logical representation to avoid it, and demonstrates its benefits in robustness, variable importance, and cost optimization.
Details
Motivation: Decision trees face predictive equivalence, where different trees yield the same decision boundary, complicating model selection and interpretation.
Method: Proposes a boolean logical representation of decision trees to eliminate predictive equivalence, ensuring faithfulness to the decision boundary.
Result: Shows decision trees’ robustness to missing feature values, improves variable importance quantification, and optimizes prediction cost.
Conclusion: The boolean representation resolves predictive equivalence, enhancing interpretability and performance in downstream tasks.
Abstract: Decision trees are widely used for interpretable machine learning due to their clearly structured reasoning process. However, this structure belies a challenge we refer to as predictive equivalence: a given tree’s decision boundary can be represented by many different decision trees. The presence of models with identical decision boundaries but different evaluation processes makes model selection challenging. The models will have different variable importance and behave differently in the presence of missing values, but most optimization procedures will arbitrarily choose one such model to return. We present a boolean logical representation of decision trees that does not exhibit predictive equivalence and is faithful to the underlying decision boundary. We apply our representation to several downstream machine learning tasks. Using our representation, we show that decision trees are surprisingly robust to test-time missingness of feature values; we address predictive equivalence’s impact on quantifying variable importance; and we present an algorithm to optimize the cost of reaching predictions.
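One way to realize a boolean representation is to read each root-to-leaf path as a conjunction of threshold predicates, so the positive class becomes the disjunction (DNF) of its paths; clauses whose predicates avoid a missing feature can still fire. A sketch for scikit-learn trees, simplified relative to the paper's representation:

```python
from sklearn.tree import DecisionTreeClassifier

def tree_to_dnf(tree: DecisionTreeClassifier, positive_class=1):
    """Return the positive class as a list of clauses; each clause is a
    list of (feature, op, threshold) predicates along one root-to-leaf path."""
    t = tree.tree_
    clauses = []

    def walk(node, path):
        if t.children_left[node] == -1:  # leaf node
            if t.value[node][0].argmax() == positive_class:
                clauses.append(list(path))
            return
        f, thr = t.feature[node], t.threshold[node]
        walk(t.children_left[node], path + [(f, "<=", thr)])
        walk(t.children_right[node], path + [(f, ">", thr)])

    walk(0, [])
    return clauses

def predict_dnf(clauses, x):
    """x: dict feature->value; features absent from x are treated as missing.
    Predict 1 if any clause is fully satisfied using only observed features."""
    def holds(f, op, thr):
        if f not in x:
            return None  # predicate touches a missing feature
        return x[f] <= thr if op == "<=" else x[f] > thr
    for clause in clauses:
        if all(holds(*p) is True for p in clause):
            return 1
    return 0
```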
[445] The Importance of Being Lazy: Scaling Limits of Continual Learning
Jacopo Graldi, Alessandro Breccia, Giulia Lanzillotta, Thomas Hofmann, Lorenzo Noci
Main category: cs.LG
TL;DR: The paper reconciles contradictory findings on model scale in continual learning by distinguishing lazy vs. rich training regimes. It shows that increasing width helps only insofar as it reduces feature learning (induces laziness), and analyzes CF in the feature-learning regime, revealing optimal performance at a critical feature-learning level that depends on task non-stationarity.
Details
Motivation: To understand the impact of model scale and feature learning on catastrophic forgetting (CF) in continual learning, addressing contradictory observations in existing literature.
Method: Systematic study using variable parameterization to differentiate lazy and rich regimes, dynamical mean field theory for infinite width dynamics, and analysis of task similarity and non-stationarity.
Result: Increasing width benefits only when reducing feature learning (laziness). High feature learning helps only with highly similar tasks. Optimal performance occurs at a critical feature learning level.
Conclusion: Provides a unified view on scale and feature learning in continual learning, showing their interplay with task non-stationarity and CF.
Abstract: Despite recent efforts, neural networks still struggle to learn in non-stationary environments, and our understanding of catastrophic forgetting (CF) is far from complete. In this work, we perform a systematic study on the impact of model scale and the degree of feature learning in continual learning. We reconcile existing contradictory observations on scale in the literature, by differentiating between lazy and rich training regimes through a variable parameterization of the architecture. We show that increasing model width is only beneficial when it reduces the amount of feature learning, yielding more laziness. Using the framework of dynamical mean field theory, we then study the infinite width dynamics of the model in the feature learning regime and characterize CF, extending prior theoretical results limited to the lazy regime. We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is only beneficial with highly similar tasks. We identify a transition modulated by task similarity where the model exits an effectively lazy regime with low forgetting to enter a rich regime with significant forgetting. Finally, our findings reveal that neural networks achieve optimal performance at a critical level of feature learning, which depends on task non-stationarity and transfers across model scales. This work provides a unified perspective on the role of scale and feature learning in continual learning.
[446] Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges
Changxi Chi, Jun Xia, Yufei Huang, Jingbo Zhou, Siyuan Li, Yunfan Liu, Chang Yu, Stan Z. Li
Main category: cs.LG
TL;DR: Proposes Unlasting, a dual conditional diffusion model framework (DDIB) to address unpaired single-cell perturbation data, integrating GRN and masking for improved predictions and introducing a biologically grounded evaluation metric.
Details
Motivation: Single-cell sequencing is destructive, leading to unpaired data under perturbed and unperturbed conditions. Existing methods either force pairings or ignore inherent relationships, limiting accuracy.
Method: Uses Dual Diffusion Implicit Bridges (DDIB) to map data distributions, integrates GRN for biologically meaningful signal propagation, and employs masking to predict silent genes. Introduces a new evaluation metric for bimodal distributions.
Result: Unlasting effectively handles unpaired data, improves generation quality, and captures intrinsic heterogeneity in single-cell responses.
Conclusion: The framework addresses key challenges in single-cell perturbation analysis, offering a robust solution with enhanced biological relevance and evaluation.
Abstract: Estimating single-cell responses across various perturbations facilitates the identification of key genes and enhances drug screening, significantly boosting experimental efficiency. However, single-cell sequencing is a destructive process, making it impossible to capture the same cell’s phenotype before and after perturbation. Consequently, data collected under perturbed and unperturbed conditions are inherently unpaired. Existing methods either attempt to forcibly pair unpaired data using random sampling, or neglect the inherent relationship between unperturbed and perturbed cells during the modeling. In this work, we propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions, effectively addressing the challenge of unpaired data. We further interpret this framework as a form of data augmentation. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way, and further incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles. Moreover, gene expression under the same perturbation often varies significantly across cells, frequently exhibiting a bimodal distribution that reflects intrinsic heterogeneity. To capture this, we introduce a more suitable evaluation metric. We propose Unlasting, dual conditional diffusion models that overcome the problem of unpaired single-cell perturbation data and strengthen the model’s insight into perturbations under the guidance of the GRN, with a dedicated mask model designed to improve generation quality by predicting silent genes. In addition, we introduce a biologically grounded evaluation metric that better reflects the inherent heterogeneity in single-cell responses.
[447] Faster Diffusion Models via Higher-Order Approximation
Gen Li, Yuchen Zhou, Yuting Wei, Yuxin Chen
Main category: cs.LG
TL;DR: A training-free sampling algorithm accelerates diffusion models without retraining, requiring fewer score function evaluations for accurate distribution approximation.
Details
Motivation: To improve sampling efficiency in diffusion models without additional training or restrictive assumptions on data distributions.
Method: Proposes a high-order ODE solver-inspired algorithm using Lagrange interpolation and successive refinement for the probability flow ODE.
Result: Achieves provable acceleration with fewer score evaluations, robust to inexact score estimation.
Conclusion: The framework demonstrates the potential of high-order methods for efficient sampling in diffusion models.
Abstract: In this paper, we explore provable acceleration of diffusion models without any additional retraining. Focusing on the task of approximating a target data distribution in $\mathbb{R}^d$ to within $\varepsilon$ total-variation distance, we propose a principled, training-free sampling algorithm that requires only the order of $$ d^{1+2/K} \varepsilon^{-1/K} $$ score function evaluations (up to log factor) in the presence of accurate scores, where $K>0$ is an arbitrary fixed integer. This result applies to a broad class of target data distributions, without the need for assumptions such as smoothness or log-concavity. Our theory is robust vis-a-vis inexact score estimation, degrading gracefully as the score estimation error increases – without demanding higher-order smoothness on the score estimates as assumed in previous work. The proposed algorithm draws insight from high-order ODE solvers, leveraging high-order Lagrange interpolation and successive refinement to approximate the integral derived from the probability flow ODE. More broadly, our work develops a theoretical framework towards understanding the efficacy of high-order methods for accelerated sampling.
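The core numerical device, fitting a Lagrange polynomial through the last K drift evaluations and integrating it exactly over the next step, can be illustrated on a 1-D ODE. This is a generic Adams-Bashforth-style multistep, not the paper's full sampler:

```python
import numpy as np
from scipy.interpolate import lagrange

def multistep_step(x, t_hist, f_hist, t_next):
    """Advance x' = f(x, t) from t_hist[-1] to t_next by integrating the
    Lagrange polynomial through the K most recent drift evaluations exactly."""
    p = lagrange(np.asarray(t_hist), np.asarray(f_hist))  # degree K-1, np.poly1d
    P = p.integ()                                         # exact antiderivative
    return x + (P(t_next) - P(t_hist[-1]))

# Toy check on x' = -x (exact solution e^{-t}) with K = 3 history points.
t_hist = [0.00, 0.05, 0.10]
f_hist = [-np.exp(-t) for t in t_hist]   # drift f(x(t), t) = -x(t) along the truth
x_next = multistep_step(np.exp(-0.10), t_hist, f_hist, t_next=0.15)
# x_next is close to np.exp(-0.15); higher K raises the local order.
```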
[448] Audio-3DVG: Unified Audio – Point Cloud Fusion for 3D Visual Grounding
Duc Cao-Dinh, Khai Le-Duc, Anh Dao, Bach Phan Tat, Chris Ngo, Duy M. H. Nguyen, Nguyen X. Khanh, Thanh Nguyen-Tang
Main category: cs.LG
TL;DR: Audio-3DVG integrates audio and spatial info for 3D visual grounding, outperforming text-based methods by decomposing speech into object mention detection and audio-guided attention.
Details
Motivation: Prior work focuses on text-based 3DVG, leaving audio-based grounding underexplored. Advances in ASR and speech representation learning motivate this work.
Method: Decomposes audio into (i) Object Mention Detection (multi-label classification) and (ii) Audio-Guided Attention for scene reasoning. Benchmarked on synthesized audio from 3DVG datasets.
Result: Achieves state-of-the-art in audio-based grounding and competes with text-based methods.
Conclusion: Audio-3DVG shows promise for integrating spoken language into 3D vision tasks.
Abstract: 3D Visual Grounding (3DVG) involves localizing target objects in 3D point clouds based on natural language. While prior work has made strides using textual descriptions, leveraging spoken language, known as Audio-based 3D Visual Grounding, remains underexplored and challenging. Motivated by advances in automatic speech recognition (ASR) and speech representation learning, we propose Audio-3DVG, a simple yet effective framework that integrates audio and spatial information for enhanced grounding. Rather than treating speech as a monolithic input, we decompose the task into two complementary components. First, we introduce (i) Object Mention Detection, a multi-label classification task that explicitly identifies which objects are referred to in the audio, enabling more structured audio-scene reasoning. Second, we propose an (ii) Audio-Guided Attention module that models the interactions between target candidates and mentioned objects, enhancing discrimination in cluttered 3D environments. To support benchmarking, we (iii) synthesize audio descriptions for standard 3DVG datasets, including ScanRefer, Sr3D, and Nr3D. Experimental results demonstrate that Audio-3DVG not only achieves new state-of-the-art performance in audio-based grounding, but also competes with text-based methods, highlighting the promise of integrating spoken language into 3D vision tasks.
[449] Quantum Machine Learning in Transportation: A Case Study of Pedestrian Stress Modelling
Bara Rababah, Bilal Farooq
Main category: cs.LG
TL;DR: The paper explores quantum machine learning (QML) for modeling pedestrian stress using skin conductance response (SCR) data. Quantum Support Vector Machine (QSVM) and Quantum Neural Network (QNN) models were tested, with QNN outperforming QSVM and classical methods.
Details
Motivation: To leverage quantum computing for complex machine learning tasks, specifically modeling pedestrian stress in intelligent transportation systems using SCR data.
Method: Developed QSVM and QNN models with eight-qubit ZZ feature maps on Pennylane, using SCR measurements categorized into amplitude-based classes.
Result: QSVM showed overfitting (45% test accuracy), while QNN achieved better performance (55% test accuracy), surpassing classical methods.
Conclusion: QNN is more reliable for SCR-based stress classification, demonstrating the potential of QML in transportation systems.
Abstract: Quantum computing has opened new opportunities to tackle complex machine learning tasks, for instance, high-dimensional data representations commonly required in intelligent transportation systems. We explore quantum machine learning to model complex skin conductance response (SCR) events that reflect pedestrian stress in a virtual reality road crossing experiment. For this purpose, a Quantum Support Vector Machine (QSVM) with an eight-qubit ZZ feature map and a Quantum Neural Network (QNN) using a Tree Tensor Network ansatz with an eight-qubit ZZ feature map were developed on Pennylane. The dataset consists of SCR measurements along with features such as the response amplitude and elapsed time, which have been categorized into amplitude-based classes. The QSVM achieved good training accuracy, but had an overfitting problem, showing a low test accuracy of 45% and therefore impacting the reliability of the classification model. The QNN model reached a higher test accuracy of 55%, making it a better classification model than the QSVM and the classical versions.
[450] Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown
Emile Anand, Sarah Liaw
Main category: cs.LG
TL;DR: FG-TS improves exploration in contextual bandits with optimism bonuses, outperforming vanilla TS in linear/logistic settings but struggling with neural bandits. Robustness varies with posterior accuracy.
Details
Motivation: Address the lack of aggressive exploration in high-dimensional problems with Thompson Sampling by introducing optimism bonuses.
Method: Propose Feel-Good Thompson Sampling (FG-TS) and its smoothed variant (SFG-TS), testing them across 11 benchmarks with exact and approximate posteriors.
Result: FG-TS outperforms vanilla TS in linear/logistic bandits but is weaker in neural settings. Performance depends on posterior accuracy and bonus scaling.
Conclusion: FG-TS is recommended as a competitive, easy-to-use baseline for contextual-bandit benchmarks, with source code provided for reproducibility.
Abstract: Thompson Sampling (TS) is widely used to address the exploration/exploitation tradeoff in contextual bandits, yet recent theory shows that it does not explore aggressively enough in high-dimensional problems. Feel-Good Thompson Sampling (FG-TS) addresses this by adding an optimism bonus that biases toward high-reward models, and it achieves the asymptotically minimax-optimal regret in the linear setting when posteriors are exact. However, its performance with \emph{approximate} posteriors – common in large-scale or neural problems – has not been benchmarked. We provide the first systematic study of FG-TS and its smoothed variant (SFG-TS) across eleven real-world and synthetic benchmarks. To evaluate their robustness, we compare performance across settings with exact posteriors (linear and logistic bandits) to approximate regimes produced by fast but coarse stochastic-gradient samplers. Ablations over preconditioning, bonus scale, and prior strength reveal a trade-off: larger bonuses help when posterior samples are accurate, but hurt when sampling noise dominates. FG-TS generally outperforms vanilla TS in linear and logistic bandits, but tends to be weaker in neural bandits. Nevertheless, because FG-TS and its variants are competitive and easy-to-use, we recommend them as baselines in modern contextual-bandit benchmarks. Finally, we provide source code for all our experiments in https://github.com/SarahLiaw/ctx-bandits-mcmc-showdown.
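One way to approximate the feel-good tilt without an exact tilted posterior is self-normalized importance sampling: draw candidate parameters from the ordinary Gaussian posterior and reweight them by the optimism bonus exp(eta * max_a <theta, a>). A sketch for the linear case; eta, the candidate count, and the Gaussian posterior are standard choices, and the paper's exact estimator may differ:

```python
import numpy as np

def fg_ts_select(mu, Sigma, arms, eta=0.1, n_candidates=64, rng=None):
    """One Feel-Good Thompson Sampling step for a linear bandit.

    mu, Sigma: Gaussian posterior over theta; arms: (n_arms, d) feature matrix.
    Candidates from the plain posterior are reweighted by the feel-good bonus
    exp(eta * max_a <theta, a>), biasing the draw toward high-reward models.
    """
    rng = rng or np.random.default_rng()
    thetas = rng.multivariate_normal(mu, Sigma, size=n_candidates)
    best_rewards = (thetas @ arms.T).max(axis=1)    # max_a <theta, a> per candidate
    log_w = eta * best_rewards
    w = np.exp(log_w - log_w.max())                 # numerically stabilized weights
    theta = thetas[rng.choice(n_candidates, p=w / w.sum())]
    return int(np.argmax(arms @ theta))             # index of the chosen arm

arm = fg_ts_select(np.zeros(5), np.eye(5), np.random.randn(20, 5))
```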
[451] How Much is Too Much? Learning Personalised Risk Thresholds in Real-World Driving
Amir Hossein Kalantari, Eleonora Papadimitriou, Amir Pooyan Afghari
Main category: cs.LG
TL;DR: The paper introduces an adaptive, personalized risk detection framework for naturalistic driving data, addressing limitations of existing methods by dynamically adjusting risk labels and model learning to individual driver behavior.
Details
Motivation: Existing frameworks for identifying driving risks rely on fixed thresholds and assume uniform behavior, failing to capture individual variability or contextual changes. This study aims to bridge this gap.
Method: The proposed framework uses a rolling time window with bi-level optimization, dynamically calibrating model hyperparameters and driver-specific risk thresholds. It tests three models (Random Forest, XGBoost, DNN) on two safety indicators.
Result: Speed-weighted time headway outperformed harsh-event counts in stability and context sensitivity. XGBoost showed consistent performance, while DNN excelled in early-risk detection but with higher variability.
Conclusion: The framework enhances real-time safety feedback and supports driver-specific interventions, demonstrating the value of adaptive, personalized risk detection in intelligent transport systems.
Abstract: While naturalistic driving studies have become foundational for providing real-world driver behaviour data, the existing frameworks for identifying risk based on such data have two fundamental limitations: (i) they rely on predefined time windows and fixed thresholds to disentangle risky and normal episodes of driving behaviour, and (ii) they assume stationary behavioural distribution across drivers and trips. These limitations have hindered the ability of the existing frameworks to capture behavioural nuances, adapt to individual variability, or respond to stochastic fluctuations in driving contexts. Thus, there is a need for a unified framework that jointly adapts risk labels and model learning to per-driver behavioural dynamics, a gap this study aims to bridge. We present an adaptive and personalised risk detection framework, built on Belgian naturalistic driving data, integrating a rolling time window with bi-level optimisation and dynamically calibrating both model hyperparameters and driver-specific risk thresholds at the same time. The framework was tested using two safety indicators, speed-weighted time headway and harsh driving events, and three models: Random Forest, XGBoost, and Deep Neural Network (DNN). Speed-weighted time headway yielded more stable and context-sensitive classifications than harsh-event counts. XGBoost maintained consistent performance under changing thresholds, while the DNN excelled in early-risk detection at lower thresholds but exhibited higher variability. The ensemble calibration integrates model-specific thresholds and confidence scores into a unified risk decision, balancing sensitivity and stability. Overall, the framework demonstrates the potential of adaptive and personalised risk detection to enhance real-time safety feedback and support driver-specific interventions within intelligent transport systems.
[452] Estimating Worst-Case Frontier Risks of Open-Weight LLMs
Eric Wallace, Olivia Watkins, Miles Wang, Kai Chen, Chris Koch
Main category: cs.LG
TL;DR: The paper studies the risks of releasing GPT-OSS by introducing malicious fine-tuning (MFT) to maximize capabilities in biology and cybersecurity. Results show MFT GPT-OSS underperforms closed-weight models and marginally increases risks compared to open-weight models, supporting its release.
Details
Motivation: To assess the worst-case frontier risks of releasing GPT-OSS by testing its capabilities in high-risk domains (biology and cybersecurity).
Method: Malicious fine-tuning (MFT) is used to maximize GPT-OSS capabilities in biology (threat creation) and cybersecurity (CTF challenges). Performance is compared against open- and closed-weight LLMs.
Result: MFT GPT-OSS underperforms closed-weight models (e.g., OpenAI o3) and only marginally increases risks compared to open-weight models.
Conclusion: The findings support the decision to release GPT-OSS, and the MFT approach provides guidance for evaluating harm in future open-weight releases.
Abstract: In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results contributed to our decision to release the model, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.
[453] FairPOT: Balancing AUC Performance and Fairness with Proportional Optimal Transport
Pengxi Liu, Yi Shen, Matthew M. Engelhard, Benjamin A. Goldstein, Michael J. Pencina, Nicoleta J. Economou-Zavlanos, Michael M. Zavlanos
Main category: cs.LG
TL;DR: FairPOT is a post-processing framework using optimal transport to balance fairness and AUC performance by selectively transforming risk scores in disadvantaged groups.
Details
Motivation: Addressing the trade-off between enforcing fairness and maintaining AUC performance in high-stakes domains like healthcare and finance.
Method: FairPOT aligns risk score distributions across groups by transforming a controllable proportion (top-lambda quantile) of scores in disadvantaged groups using optimal transport.
Result: FairPOT outperforms existing methods, achieving better fairness with minimal AUC degradation or even utility gains, and is computationally efficient.
Conclusion: FairPOT is a practical, adaptable solution for real-world fairness interventions in high-stakes applications.
Abstract: Fairness metrics utilizing the area under the receiver operator characteristic curve (AUC) have gained increasing attention in high-stakes domains such as healthcare, finance, and criminal justice. In these domains, fairness is often evaluated over risk scores rather than binary outcomes, and a common challenge is that enforcing strict fairness can significantly degrade AUC performance. To address this challenge, we propose Fair Proportional Optimal Transport (FairPOT), a novel, model-agnostic post-processing framework that strategically aligns risk score distributions across different groups using optimal transport, but does so selectively by transforming a controllable proportion, i.e., the top-lambda quantile, of scores within the disadvantaged group. By varying lambda, our method allows for a tunable trade-off between reducing AUC disparities and maintaining overall AUC performance. Furthermore, we extend FairPOT to the partial AUC setting, enabling fairness interventions to concentrate on the highest-risk regions. Extensive experiments on synthetic, public, and clinical datasets show that FairPOT consistently outperforms existing post-processing techniques in both global and partial AUC scenarios, often achieving improved fairness with slight AUC degradation or even positive gains in utility. The computational efficiency and practical adaptability of FairPOT make it a promising solution for real-world deployment.
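In one dimension, optimal transport between score distributions reduces to quantile matching, so the top-lambda variant can be sketched as mapping only scores above the disadvantaged group's (1 - lambda) quantile through the advantaged group's quantile function. A simplified reading; the paper's transport and tuning may differ:

```python
import numpy as np

def fairpot_transform(scores_dis, scores_adv, lam=0.5):
    """Transport the top-lam quantile of the disadvantaged group's scores
    onto the advantaged group's distribution via 1-D quantile matching."""
    s = scores_dis.copy()
    cutoff = np.quantile(scores_dis, 1 - lam)
    top = s >= cutoff

    # Empirical CDF position of each top score within its own group...
    ranks = np.searchsorted(np.sort(scores_dis), s[top]) / len(scores_dis)
    # ...mapped through the advantaged group's quantile function (the 1-D OT map).
    s[top] = np.quantile(scores_adv, ranks)
    return s

rng = np.random.default_rng(0)
adjusted = fairpot_transform(rng.normal(0.3, 0.1, 1000),
                             rng.normal(0.5, 0.1, 1000), lam=0.25)
```

Varying `lam` trades fairness against AUC, as in the abstract: `lam=0` leaves scores untouched, `lam=1` aligns the whole distribution.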
[454] Dual Signal Decomposition of Stochastic Time Series
Alex Glushkovsky
Main category: cs.LG
TL;DR: The paper presents a method to decompose stochastic time series into mean and dispersion signals while isolating noise using machine learning, with applications in smoothing, denoising, and forecasting.
Details
Motivation: To address the challenge of decomposing stochastic time series into meaningful components (mean and dispersion) while isolating noise, enabling better analysis and forecasting.
Method: Machine learning techniques are applied to fit a dual signal (mean and dispersion) by minimizing a loss function that balances fitting the original series and penalizing irregularities. Regularization terms include derivatives, and Statistical Process Control weights preserve patterns. Two learning approaches (sequential and joint) and optimization methods (direct or neural networks) are used.
Result: The method effectively decomposes time series into mean and dispersion signals, isolates noise, and can be tuned for specific applications (e.g., stepped or smoothed signals). It also enables 2D representation for further analysis.
Conclusion: The proposed decomposition is versatile, serving as a smoothing or denoising tool, and supports forecasting and cross-effect analysis in multi-series contexts.
Abstract: The decomposition of a stochastic time series into three component series representing a dual signal - namely, the mean and dispersion - while isolating noise is presented. The decomposition is performed by applying machine learning techniques to fit the dual signal. Machine learning minimizes a loss function that balances fitting the original time series against penalizing irregularities of the dual signal. The latter includes terms based on the first- and second-order derivatives along time. To preserve special patterns, weighting of the regularization components of the loss function has been introduced based on Statistical Process Control methodology. The proposed decomposition can be applied as a smoothing algorithm for the mean and dispersion of the time series. By isolating noise, the proposed decomposition can also be seen as a denoising algorithm. Two approaches to the learning process have been considered: sequential and joint. The former learns the mean signal first and then the dispersion; the latter fits the dual signal jointly. Joint learning can uncover complex relationships in time series with heteroskedasticity. Learning is performed either by solving the direct non-linear unconstrained optimization problem or by applying neural networks with sequential or twin output architectures. Tuning of the loss function hyperparameters aims to make the isolated noise a stationary stochastic process without autocorrelation. Depending on the application, the learning hyperparameters can be tuned toward either discrete states (a stepped signal) or a smoothed series. The decomposed dual signal can be represented in 2D space and used to learn inherent structures, to forecast both mean and dispersion, or to analyze cross effects in the case of multiple time series.
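The joint variant can be sketched as one optimization over a mean signal m and a log-dispersion signal s, trading a Gaussian fit term off against first- and second-derivative roughness penalties on both signals. The SPC-based weighting is omitted for brevity, and the lambdas are illustrative hyperparameters:

```python
import numpy as np
from scipy.optimize import minimize

def decompose(y, lam1=1.0, lam2=1.0):
    """Jointly fit mean m and log-dispersion s for series y;
    noise is what remains after standardizing by the dual signal."""
    n = len(y)

    def loss(params):
        m, s = params[:n], params[n:]
        sig2 = np.exp(2 * s)
        nll = 0.5 * np.sum((y - m) ** 2 / sig2 + 2 * s)   # Gaussian fit term
        rough = sum(lam1 * np.sum(np.diff(z, 1) ** 2) +   # penalize slope...
                    lam2 * np.sum(np.diff(z, 2) ** 2)     # ...and curvature
                    for z in (m, s))
        return nll + rough

    x0 = np.concatenate([y, np.zeros(n)])
    res = minimize(loss, x0, method="L-BFGS-B")
    m, s = res.x[:n], res.x[n:]
    noise = (y - m) / np.exp(s)   # isolated, ideally white, noise
    return m, np.exp(s), noise

m, disp, noise = decompose(np.cumsum(np.random.randn(200)))
```

Tuning `lam1`/`lam2` upward pushes the signals toward stepped or smoothed shapes, mirroring the hyperparameter discussion in the abstract.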
[455] Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning
Mateusz Praski, Jakub Adamczyk, Wojciech Czech
Main category: cs.LG
TL;DR: The study compares 25 pretrained neural models for molecular chemistry tasks, finding most offer negligible improvement over the baseline ECFP fingerprint, with only CLAMP performing significantly better.
Details
Motivation: To rigorously evaluate the effectiveness of pretrained neural networks in molecular chemistry tasks compared to traditional methods like ECFP fingerprints.
Method: A fair comparison framework assessed 25 models across 25 datasets, using hierarchical Bayesian statistical testing.
Result: Most neural models showed no significant improvement over ECFP; only CLAMP outperformed alternatives.
Conclusion: The findings highlight evaluation rigor concerns in existing studies and suggest practical solutions and recommendations.
Abstract: Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.
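The ECFP baseline the paper benchmarks against is easy to reproduce with RDKit and scikit-learn. A minimal property-prediction pipeline; the molecules, labels, and random-forest settings below are placeholders:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def ecfp(smiles, radius=2, n_bits=2048):
    """ECFP4-style Morgan fingerprint as a dense bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]   # placeholder molecules
labels = [0, 1, 0, 1]                            # placeholder targets
X = np.stack([ecfp(s) for s in smiles])
clf = RandomForestClassifier(n_estimators=500).fit(X, labels)
```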
[456] Generalizing Scaling Laws for Dense and Sparse Large Language Models
Md Arafat Hossain, Xingfu Wu, Valerie Taylor, Ali Jannesari
Main category: cs.LG
TL;DR: The paper proposes a generalized scaling law for large language models, applicable to both dense and sparse architectures, addressing the challenge of resource allocation and model size prediction.
Details
Motivation: The rapid growth of language model sizes and computational costs has created a need for efficient training techniques, but existing scaling laws are architecture-specific.
Method: The authors revisit existing scaling laws and introduce a generalized scaling law framework for dense and sparse models, evaluating its effectiveness through comparison.
Result: The proposed scaling law demonstrates effectiveness in unifying predictions for both dense and sparse architectures.
Conclusion: The generalized scaling law provides a unified solution for optimizing resource allocation and model size prediction across diverse architectures.
Abstract: Over the past few years, the size of language models has grown exponentially, as has the computational cost to train these large models. This rapid growth has motivated researchers to develop new techniques aimed at enhancing the efficiency of the training process. Despite these advancements, optimally predicting the model size or allocating optimal resources remains a challenge. Several efforts have addressed the challenge by proposing different scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing scaling laws and propose a generalized scaling law to provide a unified framework that is applicable to both dense and sparse large language models. We evaluate and compare our proposed scaling law with existing scaling laws to demonstrate its effectiveness.
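The abstract doesn't give the generalized functional form, so purely as an illustration of how such a law is fitted, here is the common Chinchilla-style dense form L(N, D) = E + A/N^alpha + B/D^beta fitted with scipy; the paper's law presumably adds terms (e.g., for sparsity or active parameters) beyond this, and the constants below are synthetic:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(ND, E, A, B, alpha, beta):
    """Chinchilla-style dense form: L(N, D) = E + A * N^-alpha + B * D^-beta."""
    N, D = ND
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Synthetic (params N, tokens D, loss L) points in realistic ranges.
rng = np.random.default_rng(0)
N = 10 ** rng.uniform(7, 10, 40)
D = 10 ** rng.uniform(9, 12, 40)
L = scaling_law((N, D), 1.69, 406.4, 410.7, 0.34, 0.28) + rng.normal(0, 0.01, 40)

popt, _ = curve_fit(scaling_law, (N, D), L,
                    p0=[1.5, 100.0, 100.0, 0.3, 0.3], maxfev=50000)
E, A, B, alpha, beta = popt   # recovered coefficients
```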
[457] C-MAG: Cascade Multimodal Attributed Graphs for Supply Chain Link Prediction
Yunqing Li, Zixiang Tang, Jiaying Zhuang, Zhenyu Yang, Farhad Ameri, Jianbang Zhang
Main category: cs.LG
TL;DR: PMGraph is a benchmark for supply-chain graphs, and C-MAG is a two-stage architecture improving link prediction by fusing multimodal data.
Details
Motivation: Traditional methods fail to capture complex supply-chain data like capabilities, certifications, and multimodal profiles.
Method: C-MAG aligns and aggregates textual/visual attributes into group embeddings, then propagates them via multiscale message passing.
Result: C-MAG enhances link prediction accuracy and handles noisy real-world data effectively.
Conclusion: PMGraph and C-MAG address gaps in supply-chain analysis, offering practical solutions for multimodal data fusion.
Abstract: Workshop version accepted at KDD 2025 (AI4SupplyChain). Connecting an ever-expanding catalogue of products with suitable manufacturers and suppliers is critical for resilient, efficient global supply chains, yet traditional methods struggle to capture complex capabilities, certifications, geographic constraints, and rich multimodal data of real-world manufacturer profiles. To address these gaps, we introduce PMGraph, a public benchmark of bipartite and heterogeneous multimodal supply-chain graphs linking 8,888 manufacturers, over 70k products, more than 110k manufacturer-product edges, and over 29k product images. Building on this benchmark, we propose the Cascade Multimodal Attributed Graph (C-MAG), a two-stage architecture that first aligns and aggregates textual and visual attributes into intermediate group embeddings, then propagates them through a manufacturer-product hetero-graph via multiscale message passing to enhance link prediction accuracy. C-MAG also provides practical guidelines for modality-aware fusion, preserving predictive performance in noisy, real-world settings.
[458] Probabilistic Emissivity Retrieval from Hyperspectral Data via Physics-Guided Variational Inference
Joshua R. Tempelman, Kevin Mitchell, Adam J. Wachtor, Eric B. Flynn
Main category: cs.LG
TL;DR: A physics-conditioned generative model for hyperspectral imaging (HSI) target identification, offering interpretable uncertainty quantification and material matching without bias to predefined training sets.
Details
Motivation: Overcome limitations of per-pixel deep learning frameworks in HSI, which lack interpretability and are restricted to predefined training materials.
Method: Inverse modeling using a probabilistic latent-variable model, conditioned on atmospheric and background data, with physics-based loss criteria and Monte-Carlo sampling.
Result: Produces emissivity distributions, interpretable uncertainty measures, and likely material matches, enhancing contextual understanding of HSI scenes.
Conclusion: The approach integrates contextual HSI information, captures material spectrum variation, and provides interpretable probability measures for material identification.
Abstract: Recent research has proven neural networks to be a powerful tool for performing hyperspectral imaging (HSI) target identification. However, many deep learning frameworks deliver a single material class prediction and operate on a per-pixel basis; such approaches are limited in their interpretability and restricted to predicting materials that are accessible in available training libraries. In this work, we present an inverse modeling approach in the form of a physics-conditioned generative model. A probabilistic latent-variable model learns the underlying distribution of HSI radiance measurements and produces the conditional distribution of the emissivity spectrum. Moreover, estimates of the HSI scene’s atmosphere and background are used as a physically relevant conditioning mechanism to contextualize a given radiance measurement during the encoding and decoding processes. Furthermore, we employ an in-the-loop augmentation scheme and physics-based loss criteria to avoid bias towards a predefined training material set and to encourage the model to learn physically consistent inverse mappings. Monte-Carlo sampling of the model’s conditioned posterior delivers a sought emissivity distribution and allows for interpretable uncertainty quantification. Moreover, a distribution-based material matching scheme is presented to return a set of likely material matches for an inferred emissivity distribution. Hence, we present a strategy to incorporate contextual information about a given HSI scene, capture the possible variation of underlying material spectra, and provide interpretable probability measures for a candidate material given a remotely-sensed radiance measurement.
[459] Regret minimization in Linear Bandits with offline data via extended D-optimal exploration
Sushant Vijayan, Arun Suggala, Karthikeyan Shanmugam, Soumyabrata Pal
Main category: cs.LG
TL;DR: OOPE, a phased-elimination algorithm that folds offline data into an extended D-optimal exploration design, substantially reduces online regret in linear bandits and comes with matching minimax lower bounds that depend on offline-data quality.
Details
Motivation: Extensive offline data from the underlying bandit model is often available (e.g., in recommendation systems and online advertising), yet prior online regret-minimization methods do not exploit it effectively.
Method: Offline-Online Phased Elimination (OOPE) incorporates offline observations via an extended D-optimal design within each exploration phase; a Frank-Wolfe approximation to the design further tightens the regret bound.
Result: Online regret of $\tilde{O}(\sqrt{d_{\text{eff}} T \log(|\mathcal{A}|T)}+d^2)$, where the effective dimension $d_{\text{eff}} \leq d$ counts poorly explored directions in the offline data; first minimax lower bounds in this setting that depend explicitly on offline-data quality.
Conclusion: OOPE is optimal in regimes where offline data is either well-explored or poorly explored, and online regret shrinks substantially when offline data is abundant and well-explored.
Abstract: We consider the problem of online regret minimization in linear bandits with access to prior observations (offline data) from the underlying bandit model. There are numerous applications where extensive offline data is often available, such as in recommendation systems and online advertising. Consequently, this problem has been studied intensively in recent literature. Our algorithm, Offline-Online Phased Elimination (OOPE), effectively incorporates the offline data to substantially reduce the online regret compared to prior work. To leverage offline information prudently, OOPE uses an extended D-optimal design within each exploration phase. OOPE achieves an online regret of $\tilde{O}(\sqrt{d_{\text{eff}} T \log \left(|\mathcal{A}|T\right)}+d^2)$, where $d_{\text{eff}} \leq d$ is the effective problem dimension, which measures the number of poorly explored directions in the offline data and depends on the eigen-spectrum $(\lambda_k)_{k \in [d]}$ of the Gram matrix of the offline data. The eigen-spectrum $(\lambda_k)_{k \in [d]}$ is a quantitative measure of the \emph{quality} of the offline data. If the offline data is poorly explored ($d_{\text{eff}} \approx d$), we recover the established regret bounds for the purely online setting, while, when offline data is abundant ($T_{\text{off}} \gg T$) and well-explored ($d_{\text{eff}} = o(1)$), the online regret reduces substantially. Additionally, we provide the first known minimax regret lower bounds in this setting that depend explicitly on the quality of the offline data. These lower bounds establish the optimality of our algorithm in regimes where offline data is either well-explored or poorly explored. Finally, by using a Frank-Wolfe approximation to the extended optimal design we further improve the $O(d^{2})$ term to $O\left(\frac{d^{2}}{d_{\text{eff}}} \min \{ d_{\text{eff}}, 1 \} \right)$, which can be substantial in high dimensions with moderate quality of offline data ($d_{\text{eff}} = \Omega(1)$).
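The extended D-optimal design can be sketched as maximizing log det of the offline Gram matrix augmented by the online design via Frank-Wolfe: each iteration adds mass to the arm with the largest leverage under the current information matrix. A simplified version of the design step only, with the phase-elimination logic omitted:

```python
import numpy as np

def extended_d_optimal(arms, V_off, n_iters=200):
    """Frank-Wolfe for pi maximizing log det(V_off + sum_a pi_a a a^T).

    arms: (n_arms, d). V_off: d x d Gram matrix of the offline data.
    Directions already well covered offline receive little online exploration.
    """
    n, d = arms.shape
    pi = np.full(n, 1.0 / n)
    for t in range(n_iters):
        V = V_off + arms.T @ (pi[:, None] * arms)
        V_inv = np.linalg.inv(V)
        # Gradient of log det w.r.t. pi_a is the leverage a^T V^{-1} a.
        leverage = np.einsum("ad,dk,ak->a", arms, V_inv, arms)
        best = np.argmax(leverage)
        step = 2.0 / (t + 2)           # standard Frank-Wolfe step size
        pi = (1 - step) * pi
        pi[best] += step
    return pi

arms = np.random.randn(50, 8)
offline = np.random.randn(500, 8)
pi = extended_d_optimal(arms, offline.T @ offline)
```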
[460] Dynamic Rank Adjustment for Accurate and Efficient Neural Network Training
Hyuntak Shin, Aecheon Jung, Sungeun Hong, Sunwoo Lee
Main category: cs.LG
TL;DR: A dynamic-rank training framework is proposed to prevent rank collapse in low-rank training by interleaving full-rank epochs, achieving accuracy comparable to full-rank training with low computational cost.
Details
Motivation: Low-rank training methods limit weight matrix rank, hindering learning of complex patterns and accelerating rank decline during training.
Method: Interleave full-rank training epochs within low-rank training to restore weight matrix rank dynamically.
Result: The framework achieves comparable accuracy to full-rank training with computational cost similar to low-rank training.
Conclusion: Dynamic-rank training effectively balances computational efficiency and model performance by preventing rank collapse.
Abstract: Low-rank training methods reduce the number of trainable parameters by re-parameterizing the weights with matrix decompositions (e.g., singular value decomposition). However, enforcing a fixed low-rank structure caps the rank of the weight matrices and can hinder the model’s ability to learn complex patterns. Furthermore, the effective rank of the model’s weights tends to decline during training, and this drop is accelerated when the model is reparameterized into a low-rank structure. In this study, we argue that strategically interleaving full-rank training epochs within low-rank training epochs can effectively restore the rank of the model’s weights. Based on our findings, we propose a general dynamic-rank training framework that is readily applicable to a wide range of neural-network tasks. We first describe how to adjust the rank of weight matrix to alleviate the inevitable rank collapse that arises during training, and then present extensive empirical results that validate our claims and demonstrate the efficacy of the proposed framework. Our empirical study shows that the proposed method achieves almost the same computational cost as SVD-based low-rank training while achieving a comparable accuracy to full-rank training across various benchmarks.
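The schedule itself is simple: train a low-rank factorization for most epochs, but periodically fold the factors back and train the full matrix for an epoch to restore effective rank. A minimal PyTorch sketch for one linear layer, with the interleaving period and rank as illustrative values and the minibatch loop elided:

```python
import torch

def factorize(W, rank):
    """Truncated SVD: W (m x n) -> U (m x r), V (r x n) with W ~ U @ V."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    root_s = S[:rank].sqrt()
    return U[:, :rank] * root_s, root_s[:, None] * Vh[:rank]

W = torch.randn(512, 512)
rank, full_rank_every = 32, 5

for epoch in range(20):
    if epoch % full_rank_every == 0:
        # Full-rank epoch: train W directly to restore its effective rank.
        W = W.detach().requires_grad_(True)
        params = [W]
    else:
        # Low-rank epoch: train the factors; far fewer trainable parameters.
        U0, V0 = factorize(W, rank)
        U = U0.detach().requires_grad_(True)
        V = V0.detach().requires_grad_(True)
        params = [U, V]
    opt = torch.optim.SGD(params, lr=1e-2)
    # ... run the epoch's minibatches, forwarding with W or U @ V ...
    if len(params) == 2:
        W = U @ V   # fold the learned factors back into the full matrix
```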
[461] TempOpt – Unsupervised Alarm Relation Learning for Telecommunication Networks
Sathiyanaryanan Sampath, Pratyush Uppuluri, Thirumaran Ekambaram
Main category: cs.LG
TL;DR: The paper proposes TempOpt, an unsupervised technique for learning alarm relations in telecom networks to improve root alarm identification.
Details
Motivation: Handling the enormous volume of interconnected alarms in telecom networks is challenging for NOC engineers, requiring better methods to learn alarm relations for accurate and faster resolution.
Method: The paper introduces TempOpt, a novel unsupervised alarm relation learning technique, addressing limitations of existing temporal dependency methods.
Result: Experiments on real-world datasets show TempOpt learns higher-quality alarm relations compared to temporal dependency methods.
Conclusion: TempOpt is a practical and effective solution for improving root alarm identification in telecom networks.
Abstract: In a telecommunications network, fault alarms generated by network nodes are monitored in a Network Operations Centre (NOC) to ensure network availability and continuous network operations. The monitoring process comprises tasks such as active alarms analysis, root alarm identification, and resolution of the underlying problem. Each network node can potentially generate alarms of different types; nodes can be from multiple vendors, and a network can have hundreds of nodes, resulting in an enormous volume of alarms at any time. Since network nodes are inter-connected, a single fault in the network can trigger multiple sequences of alarms across a variety of nodes; from a monitoring point of view, it is challenging for a NOC engineer to be aware of the relations between the various alarms when trying to identify, for example, a root alarm on which action needs to be taken. To effectively identify root alarms, it is essential to learn the relations among the alarms for accurate and faster resolution. In this work we propose Temporal Optimization (TempOpt), a novel unsupervised alarm relation learning technique that is practical and overcomes the limitations of an existing class of alarm relation learning methods, temporal dependency methods. Experiments carried out on real-world network datasets demonstrate the improved quality of the alarm relations learned by TempOpt compared to temporal dependency methods.
cs.MA
[462] Cowpox: Towards the Immunity of VLM-based Multi-Agent Systems
Yutong Wu, Jie Zhang, Yiming Li, Chao Zhang, Qing Guo, Nils Lukas, Tianwei Zhang
Main category: cs.MA
TL;DR: The paper introduces Cowpox, a defense approach to enhance the robustness of multi-agent systems against adversarial attacks by limiting infection spread and improving recovery.
Details
Motivation: Existing multi-agent systems lack robustness against adversarial attacks, where exploits can spread and compromise the entire system.
Method: Cowpox uses a distributed mechanism to generate and distribute a special cure sample, immunizing agents pre-exposure and aiding recovery of infected ones.
Result: Empirical demonstrations show Cowpox’s effectiveness, supported by theoretical robustness guarantees.
Conclusion: Cowpox provably improves the robustness of multi-agent systems by mitigating adversarial attack impacts.
Abstract: Vision Language Model (VLM)-based agents are stateful, autonomous entities capable of perceiving and interacting with their environments through vision and language. Multi-agent systems comprise specialized agents who collaborate to solve a (complex) task. A core security property is robustness, stating that the system should maintain its integrity under adversarial attacks. However, the design of existing multi-agent systems lacks the robustness consideration, as a successful exploit against one agent can spread and infect other agents to undermine the entire system’s assurance. To address this, we propose a new defense approach, Cowpox, to provably enhance the robustness of multi-agent systems. It incorporates a distributed mechanism, which improves the recovery rate of agents by limiting the expected number of infections to other agents. The core idea is to generate and distribute a special cure sample that immunizes an agent against the attack before exposure and helps recover the already infected agents. We demonstrate the effectiveness of Cowpox empirically and provide theoretical robustness guarantees.
[463] Emergence of Hierarchies in Multi-Agent Self-Organizing Systems Pursuing a Joint Objective
Gang Chen, Guoxin Wang, Anton van Beek, Zhenjun Ming, Yan Yan
Main category: cs.MA
TL;DR: The paper explores how dependency hierarchies emerge in multi-agent self-organizing systems (MASOS) during task execution, using multi-agent reinforcement learning (MARL) to analyze their dynamic evolution and influencing factors.
Details
Motivation: To understand the unpredictable emergent behaviors in MASOS, specifically how dependency hierarchies arise and evolve during collaborative tasks.
Method: Employed MARL to train MASOS for a box-pushing task, quantifying inter-agent dependencies via action gradients and analyzing hierarchy emergence.
Result: Hierarchies emerge dynamically, influenced by task requirements, environment, and network initialization, without pre-configured rules. Roles shift based on agents’ ‘Talent’ and ‘Effort.’
Conclusion: Dependency hierarchies in MASOS arise organically from collective objectives, shaped by agent interactions and environmental factors, highlighting their adaptive nature.
Abstract: Multi-agent self-organizing systems (MASOS) exhibit key characteristics including scalability, adaptability, flexibility, and robustness, which have contributed to their extensive application across various fields. However, the self-organizing nature of MASOS also introduces elements of unpredictability in their emergent behaviors. This paper focuses on the emergence of dependency hierarchies during task execution, aiming to understand how such hierarchies arise from agents’ collective pursuit of the joint objective, how they evolve dynamically, and what factors govern their development. To investigate this phenomenon, multi-agent reinforcement learning (MARL) is employed to train MASOS for a collaborative box-pushing task. By calculating the gradients of each agent’s actions in relation to the states of other agents, the inter-agent dependencies are quantified, and the emergence of hierarchies is analyzed through the aggregation of these dependencies. Our results demonstrate that hierarchies emerge dynamically as agents work towards a joint objective, with these hierarchies evolving in response to changing task requirements. Notably, these dependency hierarchies emerge organically in response to the shared objective, rather than being a consequence of pre-configured rules or parameters that can be fine-tuned to achieve specific results. Furthermore, the emergence of hierarchies is influenced by the task environment and network initialization conditions. Additionally, hierarchies in MASOS emerge from the dynamic interplay between agents’ “Talent” and “Effort” within the “Environment.” “Talent” determines an agent’s initial influence on collective decision-making, while continuous “Effort” within the “Environment” enables agents to shift their roles and positions within the system.
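The dependency measurement the abstract describes, gradients of each agent's actions with respect to the states of other agents, is concrete enough to sketch. A minimal PyTorch version with placeholder policy networks (the paper's actual architectures and the box-pushing dynamics are not specified here):

```python
import torch
import torch.nn as nn

n_agents, state_dim = 3, 4
policies = [nn.Sequential(nn.Linear(n_agents * state_dim, 8), nn.Tanh(),
                          nn.Linear(8, 2)) for _ in range(n_agents)]

def dependency_matrix(states):
    """D[i, j] = ||d a_i / d s_j||: sensitivity of agent i's action to
    agent j's state. `states` is a list of n_agents state tensors."""
    states = [s.clone().requires_grad_(True) for s in states]
    joint = torch.cat(states)            # every policy observes the joint state
    D = torch.zeros(n_agents, n_agents)
    for i, pi in enumerate(policies):
        action = pi(joint)
        grads = torch.autograd.grad(action.sum(), states, retain_graph=True)
        for j, g in enumerate(grads):
            D[i, j] = g.norm()
    return D

print(dependency_matrix([torch.randn(state_dim) for _ in range(n_agents)]))
```

Aggregating the columns of D over an episode ranks agents by how strongly others depend on them, which is one way the emergence of a hierarchy can be tracked.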
[464] Extending the OWASP Multi-Agentic System Threat Modeling Guide: Insights from Multi-Agent Security Research
Klaudia Krawiecka, Christian Schroeder de Witt
Main category: cs.MA
TL;DR: The paper extends the OWASP MAS Threat Modeling Guide by addressing gaps in modeling failures for LLM-driven multi-agent systems, introducing new threat classes, and proposing evaluation strategies.
Details
Motivation: To improve security and resilience in complex, autonomous multi-agent systems by translating recent MASEC research into practical guidance.
Method: Identifies gaps in existing OWASP taxonomy, introduces new threat classes (e.g., reasoning collapse, metric overfitting), and outlines evaluation strategies (e.g., robustness testing, emergent behavior monitoring).
Result: Expands OWASP’s framework to cover unique challenges in LLM-driven multi-agent architectures, enhancing security coverage.
Conclusion: The extension improves the applicability of OWASP’s framework to modern, adaptive multi-agent systems, aiming for better real-world security.
Abstract: We propose an extension to the OWASP Multi-Agentic System (MAS) Threat Modeling Guide, translating recent anticipatory research in multi-agent security (MASEC) into practical guidance for addressing challenges unique to large language model (LLM)-driven multi-agent architectures. Although OWASP’s existing taxonomy covers many attack vectors, our analysis identifies gaps in modeling failures, including, but not limited to: reasoning collapse across planner-executor chains, metric overfitting, unsafe delegation escalation, emergent covert coordination, and heterogeneous multi-agent exploits. We introduce additional threat classes and scenarios grounded in practical MAS deployments, highlighting risks from benign goal drift, cross-agent hallucination propagation, affective prompt framing, and multi-agent backdoors. We also outline evaluation strategies, including robustness testing, coordination assessment, safety enforcement, and emergent behavior monitoring, to ensure complete coverage. This work complements the framework of OWASP by expanding its applicability to increasingly complex, autonomous, and adaptive multi-agent systems, with the goal of improving security posture and resilience in real world deployments.
[465] Game-Theoretic Multiagent Reinforcement Learning
Yaodong Yang, Chengdong Ma, Zihan Ding, Stephen McAleer, Chi Jin, Jun Wang, Tuomas Sandholm
Main category: cs.MA
TL;DR: A comprehensive monograph on multiagent reinforcement learning (MARL) covering fundamentals, game-theoretic foundations, and recent advances since 2010.
Details
Motivation: The lack of an updated, self-contained overview of MARL literature, especially post-2010 developments, and the need for a game-theoretic perspective.
Method: The work provides a monograph summarizing MARL fundamentals and recent advancements, focusing on game-theoretic foundations.
Result: A detailed, up-to-date resource on MARL techniques, bridging gaps in existing surveys.
Conclusion: The monograph serves as a valuable reference for new and experienced researchers, offering insights into current trends and future directions in MARL.
Abstract: Tremendous advances have been made in multiagent reinforcement learning (MARL). MARL corresponds to the learning problem in a multiagent system in which multiple agents learn simultaneously. It is an interdisciplinary field of study with a long history that includes game theory, machine learning, stochastic control, psychology, and optimization. Despite great successes in MARL, there is a lack of a self-contained overview of the literature that covers game-theoretic foundations of modern MARL methods and summarizes the recent advances. The majority of existing surveys are outdated and do not fully cover the recent developments since 2010. In this work, we provide a monograph on MARL that covers both the fundamentals and the latest developments on the research frontier. The goal of this monograph is to provide a self-contained assessment of the current state-of-the-art MARL techniques from a game-theoretic perspective. We expect this work to serve as a stepping stone for both new researchers who are about to enter this fast-growing field and experts in the field who want to obtain a panoramic view and identify new directions based on recent advances.
[466] ABIDES-Economist: Agent-Based Simulator of Economic Systems with Learning Agents
Kshama Dwarakanath, Tucker Balch, Svitlana Vyetrenko
Main category: cs.MA
TL;DR: ABIDES-Economist is an agent-based simulator for economic systems with heterogeneous agents, integrating reinforcement learning and real-world data. It validates by matching stylized facts and outperforms rule-based policies.
Details
Motivation: To create a realistic economic simulator that incorporates agent heterogeneity, adaptability, and reinforcement learning for policy design.
Method: Uses agent-based modeling with reinforcement learning (OpenAI Gym), calibrated with economic literature and U.S. data. Validates against stylized facts.
Result: Simulated data aligns with stylized facts; learned policies outperform rule-based approaches in regulatory scenarios.
Conclusion: ABIDES-Economist is a validated tool for economic simulation and policy design, addressing heterogeneity and adaptability.
Abstract: We present ABIDES-Economist, an agent-based simulator for economic systems that includes heterogeneous households, firms, a central bank, and a government. Agent behavior can be defined using domain-specific behavioral rules or learned through reinforcement learning by specifying their objectives. We integrate reinforcement learning capabilities for all agents using the OpenAI Gym environment framework for the multi-agent system. To enhance the realism of our model, we base agent parameters and action spaces on economic literature and real U.S. economic data. To tackle the challenges of calibrating heterogeneous agent-based economic models, we conduct a comprehensive survey of stylized facts related to both microeconomic and macroeconomic time series data. We then validate ABIDES-Economist by demonstrating its ability to generate simulated data that aligns with the relevant stylized facts for the economic scenario under consideration, following the learning of all agent behaviors via reinforcement learning. Specifically, we train our economic agents’ policies under two broad configurations. The first configuration demonstrates that the learned economic agents produce system data consistent with macroeconomic and microeconomic stylized facts. The second configuration illustrates the utility of the validated simulation platform in designing regulatory policies for the central bank and government. These policies outperform standard rule-based approaches from the literature, which often overlook agent heterogeneity, shocks, and agent adaptability.
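The abstract states that every agent is exposed through the OpenAI Gym environment framework. As a schematic of what a household agent's decision problem might look like behind that interface (written against the maintained gymnasium fork; the dynamics and reward below are toy placeholders, not the simulator's calibrated model):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class HouseholdEnv(gym.Env):
    """Toy household: observe (wage, price, savings), choose (work hours,
    consumption fraction), receive utility as reward. Purely illustrative."""
    def __init__(self):
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(3,))
        self.action_space = spaces.Box(low=0.0, high=1.0, shape=(2,))

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.wage, self.price, self.savings = 1.0, 1.0, 0.0
        return np.array([self.wage, self.price, self.savings], np.float32), {}

    def step(self, action):
        hours, consume_frac = action
        income = self.wage * hours
        spend = consume_frac * (self.savings + income)
        self.savings += income - spend
        # utility of consumption minus a quadratic disutility of labor
        reward = np.log1p(spend / self.price) - 0.5 * hours ** 2
        obs = np.array([self.wage, self.price, self.savings], np.float32)
        return obs, float(reward), False, False, {}
```

Firms, the central bank, and the government would get analogous wrappers, with rule-based behavior as the alternative to a learned policy.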
cs.MM
[467] PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research
Nick Oh, Giorgos D. Vrakas, Siân J. M. Brooke, Sasha Morinière, Toju Duke
Main category: cs.MM
TL;DR: PETLP is a compliance framework integrating GDPR, copyright, and platform terms for social media data research, offering practical workflows and legal safeguards.
Details
Motivation: Existing frameworks fail to unify regulatory domains like GDPR, copyright, and platform terms, leaving researchers without clear guidance.
Method: Introduces PETLP, embedding legal safeguards into ETL pipelines and treating Data Protection Impact Assessments as evolving documents.
Result: Demonstrates differing extraction rights for research vs. commercial entities, highlights anonymisation challenges, and exposes legal gaps in dataset creation and model distribution.
Conclusion: PETLP bridges legal requirements and research practice, simplifying compliance and institutional data management.
Abstract: Social media data presents AI researchers with overlapping obligations under the GDPR, copyright law, and platform terms – yet existing frameworks fail to integrate these regulatory domains, leaving researchers without unified guidance. We introduce PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds legal safeguards directly into extended ETL pipelines. Central to PETLP is treating Data Protection Impact Assessments as living documents that evolve from pre-registration through dissemination. Through systematic Reddit analysis, we demonstrate how extraction rights fundamentally differ between qualifying research organisations (who can invoke DSM Article 3 to override platform restrictions) and commercial entities (bound by terms of service), whilst GDPR obligations apply universally. We reveal why true anonymisation remains unachievable for social media data and expose the legal gap between permitted dataset creation and uncertain model distribution. By structuring compliance decisions into practical workflows and simplifying institutional data management plans, PETLP enables researchers to navigate regulatory complexity with confidence, bridging the gap between legal requirements and research practice.
[468] AI Blob! LLM-Driven Recontextualization of Italian Television Archives
Roberto Balestri
Main category: cs.MM
TL;DR: AI Blob! uses semantic cataloging and LLMs to retrieve and reinterpret archival TV footage, leveraging ASR, embeddings, and RAG for automated narrative construction.
Details
Motivation: Explore the potential of semantic technologies for dynamic archival retrieval and recontextualization, moving beyond static metadata.
Method: Integrates ASR, semantic embeddings, and RAG to process 1,547 Italian TV videos, enabling thematic querying and narrative montage generation.
Result: Demonstrates automated narrative construction and cultural analysis, offering a framework and dataset for further research.
Conclusion: AI Blob! advances media historiography and AI-driven archival research, enabling novel forms of engagement and interdisciplinary experimentation.
Abstract: This paper introduces AI Blob!, an experimental system designed to explore the potential of semantic cataloging and Large Language Models (LLMs) for the retrieval and recontextualization of archival television footage. Drawing methodological inspiration from Italian television programs such as Blob (RAI Tre, 1989-), AI Blob! integrates automatic speech recognition (ASR), semantic embeddings, and retrieval-augmented generation (RAG) to organize and reinterpret archival content. The system processes a curated dataset of 1,547 Italian television videos by transcribing audio, segmenting it into sentence-level units, and embedding these segments into a vector database for semantic querying. Upon user input of a thematic prompt, the LLM generates a range of linguistically and conceptually related queries, guiding the retrieval and recombination of audiovisual fragments. These fragments are algorithmically selected and structured into narrative sequences producing montages that emulate editorial practices of ironic juxtaposition and thematic coherence. By foregrounding dynamic, content-aware retrieval over static metadata schemas, AI Blob! demonstrates how semantic technologies can facilitate new approaches to archival engagement, enabling novel forms of automated narrative construction and cultural analysis. The project contributes to ongoing debates in media historiography and AI-driven archival research, offering both a conceptual framework and a publicly available dataset to support further interdisciplinary experimentation.
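The retrieval core described in the abstract (sentence-level transcript segments embedded into a vector database and queried semantically) can be sketched in a few lines. The model name and segments below are assumptions; AI Blob! additionally has an LLM expand the thematic prompt into multiple related queries before retrieval:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # handles Italian

# (video_id, start_seconds, transcript sentence) -- hypothetical segments
segments = [
    ("video_0042", 12.5, "Il governo annuncia nuove misure economiche."),
    ("video_0917", 80.0, "La crisi dei mercati preoccupa gli investitori."),
]
seg_vecs = model.encode([text for _, _, text in segments], normalize_embeddings=True)

def query(theme, k=5):
    """Return the k transcript segments most similar to a thematic prompt."""
    q = model.encode([theme], normalize_embeddings=True)[0]
    scores = seg_vecs @ q                  # cosine similarity (unit vectors)
    return [segments[i] for i in np.argsort(-scores)[:k]]

print(query("economia e crisi"))
```

The returned (video, timestamp) pairs are what the montage stage would then cut together.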
[469] In-place Double Stimulus Methodology for Subjective Assessment of High Quality Images
Shima Mohammadi, Mohsen Jenadeleh, Michela Testolina, Jon Sneyers, Touradj Ebrahimi, Dietmar Saupe, João Ascenso
Main category: cs.MM
TL;DR: A novel double stimulus method (IDSQS) for evaluating high-quality images, addressing limitations in detecting subtle differences, with crowdsourced validation and public dataset release.
Details
Motivation: To overcome existing protocols' limitations in detecting subtle perceptual differences in high-quality images.
Method: In-place Double Stimulus Quality Scale (IDSQS) allows alternating reference and distorted images at the same location, enhancing intuitive quality comparison.
Result: High correlation with precise benchmarks, public dataset, and Beta distribution modeling for quality score variability.
Conclusion: IDSQS effectively detects subtle quality differences, validated by crowdsourcing, with resources made publicly available.
Abstract: This paper introduces a novel double stimulus subjective assessment methodology for the evaluation of high quality images to address the limitations of existing protocols in detecting subtle perceptual differences. The In-place Double Stimulus Quality Scale (IDSQS) allows subjects to alternately view a reference and a distorted image at the same spatial location, facilitating a more intuitive detection of differences in quality, especially at high to visually lossless quality levels. A large-scale crowdsourcing study employing this methodology was conducted, generating a comprehensive public dataset to evaluate perceived image quality across several compression algorithms and distortion levels. An additional contribution is the modeling of quality scores using a Beta distribution, allowing for the assessment of variability and subject consistency. Our findings demonstrate the effectiveness of the IDSQS methodology in achieving high correlation with more precise subjective evaluation benchmarks. The dataset, subjective data, and graphical user interface developed for this study are publicly available at https://github.com/shimamohammadi/IDSQS
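The Beta-distribution modeling mentioned in the abstract can be sketched directly: per-image scores are rescaled to the open unit interval and the two shape parameters are fitted, giving a mean quality estimate and a variance that reflects subject consistency. A minimal sketch (the raw scores and the 1-5 rating scale are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject scores for one image, rescaled from a 1-5 scale
# to the open interval (0, 1) required by the Beta distribution.
raw = np.array([4.5, 4.8, 4.2, 5.0, 4.7, 4.9, 4.4])
scores = np.clip((raw - 1.0) / 4.0, 1e-3, 1 - 1e-3)

# Fix location/scale so only the two shape parameters are estimated.
a, b, loc, scale = stats.beta.fit(scores, floc=0, fscale=1)

mean = stats.beta.mean(a, b)   # estimated quality
var = stats.beta.var(a, b)     # spread, i.e. subject (in)consistency
print(f"alpha={a:.2f}, beta={b:.2f}, mean={mean:.3f}, var={var:.4f}")
```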
[470] Multimodal LLM-based Query Paraphrasing for Video Search
Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan, Sheng-Hua Zhong, Xiong-Yong Wei, Qing Li
Main category: cs.MM
TL;DR: The paper addresses limitations in text-to-video retrieval by using LLMs to paraphrase queries, decompose complex queries, and verify paraphrases for factual correctness, improving retrieval performance.
Details
Motivation: Current text-to-video retrieval methods struggle with out-of-vocabulary problems and complex queries involving logical/spatial constraints.
Method: Leverages LLMs for query paraphrasing (T2T, T2I, I2T), decomposes complex queries, and uses consistency-based verification to filter incorrect paraphrases.
Result: Improves retrieval performance on TRECVid datasets, resolving traditionally difficult queries.
Conclusion: Query paraphrasing with LLMs and verification effectively enhances text-to-video retrieval for complex and out-of-vocabulary queries.
Abstract: Text-to-video retrieval answers user queries through searches based on concepts and embeddings. However, due to limitations in the size of the concept bank and the amount of training data, answering queries in the wild is not always effective because of the out-of-vocabulary problem. Furthermore, neither concept-based nor embedding-based search can perform reasoning to consolidate search results for complex queries that include logical and spatial constraints. To address these challenges, we leverage large language models (LLMs) to paraphrase queries using text-to-text (T2T), text-to-image (T2I), and image-to-text (I2T) transformations. These transformations rephrase abstract concepts into simpler terms to mitigate the out-of-vocabulary problem. Additionally, complex relationships within a query can be decomposed into simpler sub-queries, improving retrieval performance by effectively fusing the search results of these sub-queries. To mitigate the issue of LLM hallucination, this paper also proposes a novel consistency-based verification strategy to filter out factually incorrect paraphrased queries. Extensive experiments are conducted for ad-hoc video search and known-item search on the TRECVid datasets. We provide empirical insights into how traditionally difficult-to-answer queries can be effectively resolved through query paraphrasing.
[471] Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding
Dawei Huang, Qing Li, Chuan Yan, Zebang Cheng, Zihao Han, Yurong Huang, Xiang Li, Bin Li, Xiaohui Wang, Zheng Lian, Zhi-Qi Cheng, Xiaojiang Peng
Main category: cs.MM
TL;DR: Emotion-Qwen is a multimodal framework for robust emotion understanding in videos while preserving general vision-language reasoning, using a Hybrid Compressor and a three-stage pre-training pipeline.
Details
Motivation: Existing Large Multimodal Models (LMMs) struggle with emotion-specific tasks due to catastrophic forgetting and performance drops.
Method: Proposes Emotion-Qwen with a Hybrid Compressor (MoE-based) and a three-stage pre-training pipeline, plus the Video Emotion Reasoning (VER) dataset.
Result: Achieves state-of-the-art performance in emotion benchmarks and maintains competitive results in general VL tasks.
Conclusion: Emotion-Qwen effectively balances emotion-specific and general multimodal reasoning, advancing video emotion understanding.
Abstract: Accurate emotion understanding in videos necessitates effectively recognizing and interpreting emotional states by integrating visual, textual, auditory, and contextual cues. Although recent Large Multimodal Models (LMMs) have exhibited significant progress in general vision-language (VL) tasks, their performance often deteriorates in emotion-specific scenarios, exhibiting catastrophic forgetting when fine-tuned on emotion-centric tasks. To overcome these limitations, we propose Emotion-Qwen, a unified multimodal framework designed to simultaneously enable robust emotion understanding and preserve general VL reasoning capabilities. Emotion-Qwen introduces a novel Hybrid Compressor based on a Mixture-of-Experts (MoE) architecture, dynamically routing inputs to optimally balance emotion-specific processing and general multimodal reasoning. We further propose a carefully structured three-stage pre-training pipeline, leveraging extensive general and emotion-focused datasets to strengthen multimodal representation robustness and model adaptability. Additionally, we develop the Video Emotion Reasoning (VER) dataset, a large-scale bilingual resource containing over 40K video clips annotated with detailed context-aware emotional descriptions, significantly facilitating research on fine-grained emotional reasoning. Extensive experiments confirm that Emotion-Qwen achieves state-of-the-art performance across multiple emotion recognition and reasoning benchmarks, while maintaining highly competitive results in general VL tasks.
[472] VGGSounder: Audio-Visual Evaluations for Foundation Models
Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke
Main category: cs.MM
TL;DR: VGGSounder is introduced to address VGGSound’s limitations for evaluating audio-visual foundation models, offering better annotations and modality-specific analysis.
Details
Motivation: VGGSound's flaws (incomplete labels, overlapping classes, misaligned modalities) distort evaluations of audio-visual models.
Method: VGGSounder re-annotates VGGSound as a multi-label test set with detailed modality annotations and introduces a modality confusion metric.
Result: VGGSounder enables precise modality-specific performance analysis and reveals model limitations via performance degradation.
Conclusion: VGGSounder improves evaluation reliability for audio-visual foundation models.
Abstract: The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
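The abstract does not define the modality confusion metric precisely; one plausible formalization consistent with "performance degradation when adding another input modality" is the fraction of samples a model classifies correctly from one modality alone but gets wrong once the second modality is added. A hedged sketch of that reading:

```python
import numpy as np

def modality_confusion(y_true, pred_audio, pred_av):
    """Fraction of samples correct from audio alone but wrong audio-visually.
    One plausible formalization; the paper's exact definition may differ."""
    y_true, pred_audio, pred_av = map(np.asarray, (y_true, pred_audio, pred_av))
    correct_uni = pred_audio == y_true
    flipped = correct_uni & (pred_av != y_true)   # adding video broke it
    return flipped.sum() / max(correct_uni.sum(), 1)

print(modality_confusion([0, 1, 2, 1], [0, 1, 2, 0], [0, 2, 2, 0]))  # ~0.33
```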
[473] Fact-Checking at Scale: Multimodal AI for Authenticity and Context Verification in Online Media
Van-Hoang Phan, Tung-Duong Le-Duc, Long-Khanh Pham, Anh-Thu Le, Quynh-Huong Dinh-Nguyen, Dang-Quan Vo, Hoang-Quoc Nguyen-Son, Anh-Duy Tran, Dang Vu, Minh-Son Dao
Main category: cs.MM
TL;DR: A system for verifying multimedia content’s authenticity and contextual accuracy, integrating visual forensics, textual analysis, and multimodal reasoning, with proven effectiveness in real-world scenarios.
Details
Motivation: The rapid spread of misinformation via multimedia content necessitates robust verification tools, especially during crises.
Method: A unified pipeline combining visual forensics, textual analysis, and multimodal reasoning, with a hybrid approach for detecting out-of-context media.
Result: The system performs effectively in diverse real-world scenarios, as demonstrated by evaluations.
Conclusion: The work advances multimedia verification and provides practical tools for combating misinformation.
Abstract: The proliferation of multimedia content on social media platforms has dramatically transformed how information is consumed and disseminated. While this shift enables real-time coverage of global events, it also facilitates the rapid spread of misinformation and disinformation, especially during crises such as wars, natural disasters, or elections. The rise of synthetic media and the reuse of authentic content in misleading contexts have intensified the need for robust multimedia verification tools. In this paper, we present a comprehensive system developed for the ACM Multimedia 2025 Grand Challenge on Multimedia Verification. Our system assesses the authenticity and contextual accuracy of multimedia content in multilingual settings and generates both expert-oriented verification reports and accessible summaries for the general public. We introduce a unified verification pipeline that integrates visual forensics, textual analysis, and multimodal reasoning, and propose a hybrid approach to detect out-of-context (OOC) media through semantic similarity, temporal alignment, and geolocation cues. Extensive evaluations on the Grand Challenge benchmark demonstrate the system’s effectiveness across diverse real-world scenarios. Our contributions advance the state of the art in multimedia verification and offer practical tools for journalists, fact-checkers, and researchers confronting information integrity challenges in the digital age.
eess.AS
[474] Objective Soups: Multilingual Multi-Task Modeling for Speech Processing
A F M Saif, Lisha Chen, Xiaodong Cui, Songtao Lu, Brian Kingsbury, Tianyi Chen
Main category: eess.AS
TL;DR: The paper explores hierarchical multi-objective optimization (MOO) for multilingual, multi-task speech processing (MSP), showing it outperforms flat optimization by separating conflicting tasks like recognition and translation.
Details
Motivation: Conflicting objectives in MSP (e.g., speech recognition vs. translation) hinder joint optimization, prompting investigation into hierarchical MOO to mitigate conflicts.
Method: Three multi-objective MSP formulations (“objective soup recipes”) are tested, with a lightweight layer-selection mechanism to compute conflict-avoiding gradients efficiently.
Result: Bi-level optimization (separating recognition and translation) outperforms flat optimization in experiments on CoVoST v2, LibriSpeech, and AISHELL-1.
Conclusion: Hierarchical MOO is more effective and scalable for MSP, with released code for reproducibility.
Abstract: Training a single model for multilingual, multi-task speech processing (MSP) is severely hampered by conflicting objectives between tasks like speech recognition and translation. While multi-objective optimization (MOO) aims to align gradient updates, its effectiveness diminishes as the number of tasks grows, making it difficult to find a common descent direction. This raises a fundamental question: should highly conflicting objectives be optimized jointly or separated into a hierarchical structure? To address this question, this paper investigates three multi-objective MSP formulations, which we refer to as “objective soup recipes”. These formulations apply multi-objective optimization at different optimization levels to mitigate potential conflicts among all objectives. To ensure efficiency, we introduce a lightweight layer-selection mechanism that computes the conflict-avoiding gradient using only the most problematic layers, minimizing computational and memory overhead. Extensive experiments on CoVoST v2, LibriSpeech, and AISHELL-1 reveal that a bi-level recipe separating recognition and translation tasks consistently outperforms standard flat optimization. Our work demonstrates that hierarchical MOO is a more effective and scalable approach for building state-of-the-art MSP models. Our code has been released at https://github.com/afmsaif/Objective_Soups.
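The paper's exact conflict-avoiding gradient is not given in the abstract, but the standard gradient-projection idea it evokes (PCGrad-style) is easy to sketch; the layer-selection trick would amount to running this only over gradients of the most problematic layers. A generic illustration, not the paper's recipe:

```python
import torch

def conflict_avoiding_grad(g_list):
    """PCGrad-style projection: when two task gradients conflict
    (negative dot product), remove the conflicting component."""
    out = []
    for i, g in enumerate(g_list):
        g = g.clone()
        for j, h in enumerate(g_list):
            if i != j:
                dot = torch.dot(g, h)
                if dot < 0:                          # conflicting direction
                    g -= dot / h.norm().pow(2) * h   # project it out
        out.append(g)
    return torch.stack(out).mean(0)                  # shared descent direction

g_asr = torch.tensor([1.0, 0.0])    # recognition gradient (flattened)
g_st = torch.tensor([-0.5, 1.0])    # translation gradient, conflicting
print(conflict_avoiding_grad([g_asr, g_st]))        # tensor([0.4000, 0.7000])
```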
[475] Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention’s Alternative
Xi Xuan, Zimo Zhu, Wenxin Zhang, Yi-Cheng Lin, Tomi Kinnunen
Main category: eess.AS
TL;DR: Fake-Mamba, a deepfake speech detection model using bidirectional Mamba and XLSR, outperforms SOTA models with efficient encoders and real-time inference.
Details
Motivation: The rise of advanced speech synthesis increases security risks, driving the need for real-time deepfake detection.
Method: Integrates XLSR with bidirectional Mamba and introduces three efficient encoders (TransBiMamba, ConBiMamba, PN-BiMamba) to detect synthetic speech.
Result: Achieves 0.97%, 1.74%, and 5.85% EER on ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, outperforming SOTA models.
Conclusion: Fake-Mamba demonstrates strong generalization, practical viability, and real-time performance, with code publicly available.
Abstract: Advances in speech synthesis intensify security threats, motivating real-time deepfake detection research. We investigate whether bidirectional Mamba can serve as a competitive alternative to Self-Attention in detecting synthetic speech. Our solution, Fake-Mamba, integrates an XLSR front-end with bidirectional Mamba to capture both local and global artifacts. Our core innovation introduces three efficient encoders: TransBiMamba, ConBiMamba, and PN-BiMamba. Leveraging XLSR’s rich linguistic representations, PN-BiMamba can effectively capture the subtle cues of synthetic speech. Evaluated on ASVspoof 21 LA, 21 DF, and In-The-Wild benchmarks, Fake-Mamba achieves 0.97%, 1.74%, and 5.85% EER, respectively, representing substantial relative gains over SOTA models XLSR-Conformer and XLSR-Mamba. The framework maintains real-time inference across utterance lengths, demonstrating strong generalization and practical viability. The code is available at https://github.com/xuanxixi/Fake-Mamba.
[476] ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs
Eray Eren, Qingju Liu, Hyeongwoo Kim, Pablo Garrido, Abeer Alwan
Main category: eess.AS
TL;DR: A stand-alone text-to-prosodic feature model (ProMode) improves F0 and energy predictions, outperforming state-of-the-art style encoders and enhancing TTS prosody.
Details
Motivation: Prosody captures emotional, semantic, and individual speech traits, but existing models lack standalone mapping of text to prosodic features for downstream tasks like TTS.
Method: ProMode encodes masked acoustic and textual inputs into fixed-length prosodic embeddings, then decodes to predict masked acoustics using unmasked text. Trained on GigaSpeech.
Result: Consistent improvements in F0 and energy predictions at various granularities. Perceptual tests show higher prosody preference in TTS applications.
Conclusion: ProMode effectively models prosody, demonstrating potential for tasks requiring nuanced prosodic features, such as TTS.
Abstract: Prosody conveys rich emotional and semantic information of the speech signal as well as individual idiosyncrasies. We propose a stand-alone model that maps text-to-prosodic features such as F0 and energy and can be used in downstream tasks such as TTS. The ProMode encoder takes as input acoustic features and time-aligned textual content, both partially masked, and produces a fixed-length latent prosodic embedding. The decoder predicts acoustics in the masked region using both the encoded prosody input and unmasked textual content. Trained on the GigaSpeech dataset, we compare our method with state-of-the-art style encoders. For F0 and energy predictions, we show consistent improvements for our model at different levels of granularity. We also integrate these predicted prosodic features into a TTS system and conduct perceptual tests, which show higher prosody preference compared to the baselines, demonstrating the model’s potential in tasks where prosody modeling is important.
[477] $\text{M}^3\text{PDB}$: A Multimodal, Multi-Label, Multilingual Prompt Database for Speech Generation
Boyu Zhu, Cheng Gong, Muyang Wu, Ruihao Jing, Fan Liu, Xiaolei Zhang, Chi Zhang, Xuelong Li
Main category: eess.AS
TL;DR: The paper introduces M3PDB, a large-scale, multi-modal, multi-label, and multilingual prompt database to address limitations in zero-shot speech generation caused by poor-quality or mismatched speech prompts. It also proposes a lightweight prompt selection strategy for real-time inference.
Details
Motivation: Current zero-shot speech generation models struggle with real-world scenarios where speech prompts are low-quality, incomplete, or out of domain due to mismatches between training data and inference prompts.
Method: The authors develop M3PDB using a novel multi-modal, multi-agent annotation framework for precise labeling and introduce a lightweight prompt selection strategy for real-time inference.
Result: Experiments show that M3PDB and the selection strategy effectively support diverse and challenging speech generation scenarios.
Conclusion: The work encourages the community to focus on realistic and diverse speech generation applications rather than just standard benchmarks. Code and dataset are publicly available.
Abstract: Recent advancements in zero-shot speech generation have enabled models to synthesize speech that mimics speaker identity and speaking style from speech prompts. However, these models’ effectiveness is significantly limited in real-world scenarios where high-quality speech prompts are absent, incomplete, or out of domain. This issue arises primarily from a significant quality mismatch between the speech data utilized for model training and the input prompt speech during inference. To address this, we introduce $\text{M}^3\text{PDB}$, the first large-scale, multi-modal, multi-label, and multilingual prompt database designed for robust prompt selection in speech generation. Our dataset construction leverages a novel multi-modal, multi-agent annotation framework, enabling precise and hierarchical labeling across diverse modalities. Furthermore, we propose a lightweight yet effective prompt selection strategy tailored for real-time, resource-constrained inference settings. Experimental results demonstrate that our proposed database and selection strategy effectively support various challenging speech generation scenarios. We hope our work can inspire the community to shift focus from improving performance on standard benchmarks to addressing more realistic and diverse application scenarios in speech generation. Code and dataset are available at: https://github.com/hizening/M3PDB.
[478] Improving the Speaker Anonymization Evaluation’s Robustness to Target Speakers with Adversarial Learning
Carlos Franzreb, Arnab Das, Tim Polzehl, Sebastian Möller
Main category: eess.AS
TL;DR: The paper highlights flaws in current privacy evaluations for speaker anonymization, proposing a target classifier to improve accuracy, especially with same-gender target selection.
Details
Motivation: Existing evaluations overestimate privacy for same-gender target selection, ignoring mixed speaker information in anonymized speech.
Method: Introduces a target classifier to measure target speaker influence, removable via adversarial learning.
Result: Effective for multiple anonymizers, particularly with same-gender target selection, improving evaluation reliability.
Conclusion: The proposed method enhances privacy assessment accuracy by addressing overlooked target speaker influence.
Abstract: The current privacy evaluation for speaker anonymization often overestimates privacy when a same-gender target selection algorithm (TSA) is used, although this TSA leaks the speaker’s gender and should hence be more vulnerable. We hypothesize that this occurs because the evaluation does not account for the fact that anonymized speech contains information from both the source and target speakers. To address this, we propose to add a target classifier that measures the influence of target speaker information in the evaluation, which can also be removed with adversarial learning. Experiments demonstrate that this approach is effective for multiple anonymizers, particularly when using a same-gender TSA, leading to a more reliable assessment.
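"Removed with adversarial learning" is commonly implemented with a gradient reversal layer between the learned representation and the new target classifier; a minimal PyTorch sketch of that pattern (dimensions and wiring are illustrative, not the paper's exact setup):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity forward; negates (and scales) gradients backward, so the
    upstream encoder learns to *remove* what the attached classifier detects."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

embed_dim, n_targets = 192, 100
target_clf = nn.Linear(embed_dim, n_targets)  # measures target-speaker info

def adversarial_loss(speaker_embedding, target_id, lam=1.0):
    logits = target_clf(GradReverse.apply(speaker_embedding, lam))
    # Minimizing this trains the classifier to find target information while
    # the reversed gradient pushes the embedding to discard it.
    return nn.functional.cross_entropy(logits, target_id)
```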
[479] EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen
Main category: eess.AS
TL;DR: EmoVoice is a novel emotion-controllable TTS model using LLMs for fine-grained emotion control and phoneme boost for content consistency. It introduces EmoVoice-DB, a high-quality dataset, and achieves SOTA performance.
Details
Motivation: Current TTS models struggle with controlling emotional expression in speech, limiting their ability to convey nuanced emotions effectively.
Method: EmoVoice leverages LLMs for natural language emotion control and a phoneme boost design for parallel token output. It uses EmoVoice-DB, a 40-hour dataset with fine-grained labels.
Result: Achieves SOTA performance on English and Chinese test sets. Evaluates emotion metrics and explores GPT-4o-audio and Gemini for emotional speech assessment.
Conclusion: EmoVoice advances emotion-controllable TTS with LLMs and phoneme boost, supported by a high-quality dataset and robust evaluation.
Abstract: Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Dataset, code, checkpoints, and demo samples are available at https://github.com/yanghaha0908/EmoVoice.
[480] ReverbFX: A Dataset of Room Impulse Responses Derived from Reverb Effect Plugins for Singing Voice Dereverberation
Julius Richter, Till Svajda, Timo Gerkmann
Main category: eess.AS
TL;DR: ReverbFX is a new RIR dataset for singing voice dereverberation, using plugin-derived RIRs instead of real recordings, showing better performance in artificial reverb scenarios.
Details
Motivation: To address the lack of diverse RIR datasets for singing voice dereverberation, especially for artificial reverb effects.
Method: Created ReverbFX dataset with RIRs from reverb plugins, trained two generative models, and compared performance with models using real RIRs.
Result: Models trained with plugin-derived RIRs outperformed those using real RIRs in artificial reverb scenarios.
Conclusion: ReverbFX is effective for dereverberation research, especially for artificial reverb effects.
Abstract: We present ReverbFX, a new room impulse response (RIR) dataset designed for singing voice dereverberation research. Unlike existing datasets based on real recorded RIRs, ReverbFX features a diverse collection of RIRs captured from various reverb audio effect plugins commonly used in music production. We conduct comprehensive experiments using the proposed dataset to benchmark the challenge of dereverberation of singing voice recordings affected by artificial reverbs. We train two state-of-the-art generative models using ReverbFX and demonstrate that models trained with plugin-derived RIRs outperform those trained on real recorded RIRs in artificial reverb scenarios.
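Once RIRs are captured from the plugins, reverberant/dry training pairs come from plain convolution. A minimal sketch (the toy exponential-decay RIR below stands in for a plugin-derived response):

```python
import numpy as np
from scipy.signal import fftconvolve

def add_reverb(dry, rir):
    """Convolve a dry vocal with a room impulse response and renormalize.
    `dry` and `rir` are 1-D float arrays at the same sample rate."""
    wet = fftconvolve(dry, rir)[: len(dry)]      # keep the original length
    return wet / (np.max(np.abs(wet)) + 1e-8)    # avoid clipping

rng = np.random.default_rng(0)
dry = rng.standard_normal(16000)                 # stand-in for 1 s of vocals
rir = np.exp(-np.linspace(0, 8, 8000)) * rng.standard_normal(8000)  # toy RIR
wet = add_reverb(dry, rir)                       # (wet, dry) = one training pair
```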
[481] FlexCTC: GPU-powered CTC Beam Decoding With Advanced Contextual Abilities
Lilit Grigoryan, Vladimir Bataev, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Boris Ginsburg
Main category: eess.AS
TL;DR: FlexCTC is a GPU-based beam decoding toolkit for CTC models, offering fast, batched GPU implementation with advanced contextualization features.
Details
Motivation: Traditional beam search implementations are slow and CPU-bound, limiting their efficiency on modern hardware.
Method: Developed in Python and PyTorch, FlexCTC eliminates CPU-GPU sync and minimizes kernel overhead using CUDA Graphs, supporting N-gram LM fusion and phrase boosting.
Result: The toolkit provides accurate, efficient decoding suitable for research and production.
Conclusion: FlexCTC is a user-friendly, extensible alternative to traditional decoders, leveraging GPU capabilities for improved performance.
Abstract: While beam search improves speech recognition quality over greedy decoding, standard implementations are slow, often sequential, and CPU-bound. To fully leverage modern hardware capabilities, we present a novel open-source FlexCTC toolkit for fully GPU-based beam decoding, designed for Connectionist Temporal Classification (CTC) models. Developed entirely in Python and PyTorch, it offers a fast, user-friendly, and extensible alternative to traditional C++, CUDA, or WFST-based decoders. The toolkit features a high-performance, fully batched GPU implementation with eliminated CPU-GPU synchronization and minimized kernel launch overhead via CUDA Graphs. It also supports advanced contextualization techniques, including GPU-powered N-gram language model fusion and phrase-level boosting. These features enable accurate and efficient decoding, making the toolkit suitable for both research and production use.
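The CUDA Graphs technique the abstract credits for minimizing kernel launch overhead follows a standard capture-and-replay pattern in PyTorch. This generic sketch (requires a GPU) illustrates the mechanism, not FlexCTC's actual decoder loop:

```python
import torch

device = "cuda"
static_logits = torch.zeros(8, 200, 1024, device=device)  # fixed-shape input buffer
step = torch.nn.Linear(1024, 1024).to(device)             # stand-in for one decode step

# Warm-up on a side stream so capture sees initialized kernels.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = step(static_logits)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = step(static_logits)   # recorded into the graph, not executed

def decode_step(new_logits):
    static_logits.copy_(new_logits)    # reuse the captured buffers in place
    g.replay()                         # one replay instead of many kernel launches
    return static_out
```

The key constraint is that shapes and buffer addresses stay fixed between replays, which is why inputs are copied into a static tensor rather than passed fresh.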
eess.IV
[482] Generative Artificial Intelligence in Medical Imaging: Foundations, Progress, and Clinical Translation
Xuanru Zhou, Cheng Li, Shuqiang Wang, Ye Li, Tao Tan, Hairong Zheng, Shanshan Wang
Main category: eess.IV
TL;DR: A review of generative AI’s impact on medical imaging, covering techniques like GANs, VAEs, and diffusion models, their clinical applications, and challenges for real-world deployment.
Details
Motivation: To synthesize recent advances in generative AI for medical imaging and evaluate their clinical roles, addressing challenges like data scarcity and standardization.
Method: Systematic examination of generative models across imaging workflows, proposing a three-tiered evaluation framework (pixel-level, feature-level, task-level).
Result: Generative AI enhances imaging workflows but faces obstacles like domain shift, hallucination risks, and regulatory issues.
Conclusion: The review guides future research by highlighting progress and challenges, advocating for interdisciplinary collaboration to advance clinically integrated AI systems.
Abstract: Generative artificial intelligence (AI) is rapidly transforming medical imaging by enabling capabilities such as data synthesis, image enhancement, modality translation, and spatiotemporal modeling. This review presents a comprehensive and forward-looking synthesis of recent advances in generative modeling including generative adversarial networks (GANs), variational autoencoders (VAEs), diffusion models, and emerging multimodal foundation architectures and evaluates their expanding roles across the clinical imaging continuum. We systematically examine how generative AI contributes to key stages of the imaging workflow, from acquisition and reconstruction to cross-modality synthesis, diagnostic support, and treatment planning. Emphasis is placed on both retrospective and prospective clinical scenarios, where generative models help address longstanding challenges such as data scarcity, standardization, and integration across modalities. To promote rigorous benchmarking and translational readiness, we propose a three-tiered evaluation framework encompassing pixel-level fidelity, feature-level realism, and task-level clinical relevance. We also identify critical obstacles to real-world deployment, including generalization under domain shift, hallucination risk, data privacy concerns, and regulatory hurdles. Finally, we explore the convergence of generative AI with large-scale foundation models, highlighting how this synergy may enable the next generation of scalable, reliable, and clinically integrated imaging systems. By charting technical progress and translational pathways, this review aims to guide future research and foster interdisciplinary collaboration at the intersection of AI, medicine, and biomedical engineering.
[483] HiFi-Mamba: Dual-Stream W-Laplacian Enhanced Mamba for High-Fidelity MRI Reconstruction
Hongli Chen, Pengcheng Fang, Yuxia Chen, Yingxuan Ren, Jing Hao, Fangfang Tang, Xiaohao Cai, Shanshan Shan, Feng Liu
Main category: eess.IV
TL;DR: HiFi-Mamba, a dual-stream Mamba-based architecture, improves MRI reconstruction by addressing limitations of existing Mamba variants, outperforming state-of-the-art models.
Details
Motivation: Existing Mamba variants for MRI reconstruction lack sensitivity to high-frequency details and rely on redundant scanning, limiting their effectiveness.
Method: HiFi-Mamba uses stacked W-Laplacian and HiFi-Mamba blocks for spectral decoupling and adaptive high-frequency feature integration, with a streamlined unidirectional traversal strategy.
Result: HiFi-Mamba outperforms CNN-based, Transformer-based, and other Mamba-based models in reconstruction accuracy while being efficient.
Conclusion: HiFi-Mamba offers a compact, efficient solution for high-fidelity MRI reconstruction, addressing key limitations of prior methods.
Abstract: Reconstructing high-fidelity MR images from undersampled k-space data remains a challenging problem in MRI. While Mamba variants for vision tasks offer promising long-range modeling capabilities with linear-time complexity, their direct application to MRI reconstruction inherits two key limitations: (1) insensitivity to high-frequency anatomical details; and (2) reliance on redundant multi-directional scanning. To address these limitations, we introduce High-Fidelity Mamba (HiFi-Mamba), a novel dual-stream Mamba-based architecture comprising stacked W-Laplacian (WL) and HiFi-Mamba blocks. Specifically, the WL block performs fidelity-preserving spectral decoupling, producing complementary low- and high-frequency streams. This separation enables the HiFi-Mamba block to focus on low-frequency structures, enhancing global feature modeling. Concurrently, the HiFi-Mamba block selectively integrates high-frequency features through adaptive state-space modulation, preserving comprehensive spectral details. To eliminate the scanning redundancy, the HiFi-Mamba block adopts a streamlined unidirectional traversal strategy that preserves long-range modeling capability with improved computational efficiency. Extensive experiments on standard MRI reconstruction benchmarks demonstrate that HiFi-Mamba consistently outperforms state-of-the-art CNN-based, Transformer-based, and other Mamba-based models in reconstruction accuracy while maintaining a compact and efficient model design.
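The W-Laplacian block itself is paper-specific, but the underlying idea, splitting an input into complementary low- and high-frequency streams whose sum reconstructs it exactly, can be illustrated with a simple blur-and-subtract decomposition (a generic stand-in, not the paper's filter bank):

```python
import torch
import torch.nn.functional as F

def spectral_decouple(x, k=5):
    """Split an image batch (B, C, H, W) into complementary low- and
    high-frequency streams via box-blur subtraction."""
    pad = k // 2
    kernel = torch.ones(x.shape[1], 1, k, k, device=x.device) / (k * k)
    low = F.conv2d(F.pad(x, (pad,) * 4, mode="reflect"), kernel,
                   groups=x.shape[1])    # depthwise blur = low frequencies
    high = x - low                       # residual = high frequencies
    return low, high

x = torch.randn(2, 1, 64, 64)
low, high = spectral_decouple(x)
assert torch.allclose(low + high, x)     # fidelity-preserving by construction
```

In HiFi-Mamba the low stream feeds the global state-space modeling while the high stream is selectively re-injected, per the abstract's description.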
[484] MedPatch: Confidence-Guided Multi-Stage Fusion for Multimodal Clinical Data
Baraa Al Jorf, Farah Shamout
Main category: eess.IV
TL;DR: MedPatch is a multi-stage multimodal fusion architecture for clinical prediction tasks, addressing data heterogeneity and missing modalities via confidence-guided patching. It outperforms existing baselines.
Details
Motivation: Clinical data is heterogeneous, limited, and sparse, hindering model performance. MedPatch aims to integrate multimodal data effectively.
Method: MedPatch uses multi-stage fusion, missingness-aware modules, and joint fusion with confidence-guided patching.
Result: Achieves state-of-the-art performance on in-hospital mortality prediction and clinical condition classification.
Conclusion: Confidence-guided multi-stage fusion effectively handles multimodal data heterogeneity, setting new benchmarks for clinical prediction.
Abstract: Clinical decision-making relies on the integration of information across various data modalities, such as clinical time-series, medical images and textual reports. Compared to other domains, real-world medical data is heterogeneous in nature, limited in size, and sparse due to missing modalities. This significantly limits model performance in clinical prediction tasks. Inspired by clinical workflows, we introduce MedPatch, a multi-stage multimodal fusion architecture, which seamlessly integrates multiple modalities via confidence-guided patching. MedPatch comprises three main components: (i) a multi-stage fusion strategy that leverages joint and late fusion simultaneously, (ii) a missingness-aware module that handles sparse samples with missing modalities, (iii) a joint fusion module that clusters latent token patches based on calibrated unimodal token-level confidence. We evaluated MedPatch using real-world data consisting of clinical time-series data, chest X-ray images, radiology reports, and discharge notes extracted from the MIMIC-IV, MIMIC-CXR, and MIMIC-Notes datasets on two benchmark tasks, namely in-hospital mortality prediction and clinical condition classification. Compared to existing baselines, MedPatch achieves state-of-the-art performance. Our work highlights the effectiveness of confidence-guided multi-stage fusion in addressing the heterogeneity of multimodal data, and establishes new state-of-the-art benchmark results for clinical prediction tasks.
[485] Hybrid(Transformer+CNN)-based Polyp Segmentation
Madan Baduwal
Main category: eess.IV
TL;DR: A hybrid Transformer + CNN model improves polyp segmentation by addressing challenges like ill-defined boundaries and endoscopic artifacts, outperforming existing methods.
Details
Motivation: Polyp segmentation is challenging due to variations in size, shape, and imaging conditions, requiring a robust solution.
Method: A hybrid (Transformer + CNN) model with boundary-aware attention mechanisms for accurate segmentation and artifact resilience.
Result: Significant improvements in segmentation accuracy (Recall: 0.9555, Accuracy: 0.9849) and artifact resilience.
Conclusion: The hybrid model outperforms state-of-the-art methods, offering better polyp segmentation in challenging conditions.
Abstract: Colonoscopy is still the main method for detecting and segmenting colonic polyps, and recent deep learning networks such as U-Net, ResUNet, Swin-UNet, and PraNet have achieved outstanding performance in polyp segmentation. Yet the problem remains extremely challenging due to high variation in polyp size, shape, endoscopy type, lighting, and imaging protocol, and to ill-defined boundaries (fluid, folds), making accurate segmentation difficult. To address these critical challenges, we introduce a hybrid (Transformer + CNN) model crafted to enhance robustness against evolving polyp characteristics. Our hybrid architecture demonstrates superior performance over existing solutions, particularly in addressing two critical challenges: (1) accurate segmentation of polyps with ill-defined margins through boundary-aware attention mechanisms, and (2) robust feature extraction in the presence of common endoscopic artifacts, including specular highlights, motion blur, and fluid occlusions. Quantitative evaluations reveal significant improvements in segmentation accuracy (Recall improved by 1.76% to 0.9555; accuracy improved by 0.07% to 0.9849) and artifact resilience compared to state-of-the-art polyp segmentation methods.
[486] impuTMAE: Multi-modal Transformer with Masked Pre-training for Missing Modalities Imputation in Cancer Survival Prediction
Maria Boyko, Aleksandra Beliaeva, Dmitriy Kornilov, Alexander Bernstein, Maxim Sharaev
Main category: eess.IV
TL;DR: impuTMAE is a transformer-based method for multimodal medical data, handling missing modalities and improving glioma survival prediction.
Details
Motivation: Medical data is complex and often incomplete, requiring effective handling for multimodal models to improve prognosis and disease understanding.
Method: impuTMAE uses a transformer-based approach with multimodal pre-training, imputing missing data by reconstructing masked patches and integrating genetic, imaging, and clinical data.
Result: The model achieves state-of-the-art performance in glioma survival prediction, outperforming prior multimodal methods.
Conclusion: impuTMAE effectively addresses missing data and enhances multimodal learning, offering improved prognostic modeling.
Abstract: The use of diverse modalities, such as omics, medical images, and clinical data can not only improve the performance of prognostic models but also deepen an understanding of disease mechanisms and facilitate the development of novel treatment approaches. However, medical data are complex, often incomplete, and contain missing modalities, making their effective handling crucial for training multimodal models. We introduce impuTMAE, a novel transformer-based end-to-end approach with an efficient multimodal pre-training strategy. It learns inter- and intra-modal interactions while simultaneously imputing missing modalities by reconstructing masked patches. Our model is pre-trained on heterogeneous, incomplete data and fine-tuned for glioma survival prediction using TCGA-GBM/LGG and BraTS datasets, integrating five modalities: genetic (DNAm, RNA-seq), imaging (MRI, WSI), and clinical data. By addressing missing data during pre-training and enabling efficient resource utilization, impuTMAE surpasses prior multimodal approaches, achieving state-of-the-art performance in glioma patient survival prediction. Our code is available at https://github.com/maryjis/mtcp
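The masked-reconstruction pre-training objective described in the abstract follows the masked-autoencoder recipe: hide a subset of multimodal token patches, reconstruct them from the visible ones, and penalize errors only at masked positions. A minimal sketch with placeholder encoder/decoder modules (the paper's transformer and tokenization are not reproduced here):

```python
import torch
import torch.nn as nn

def masked_reconstruction_loss(tokens, encoder, decoder, mask_ratio=0.5):
    """MAE-style objective sketch: `tokens` is (B, N, D), a sequence of
    patch embeddings pooled across modalities; missing-modality patches
    would simply always be masked."""
    B, N, D = tokens.shape
    mask = torch.rand(B, N) < mask_ratio              # True = hidden from encoder
    visible = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    recon = decoder(encoder(visible))                 # (B, N, D) reconstruction
    return ((recon - tokens)[mask] ** 2).mean()       # score masked patches only

enc, dec = nn.Linear(16, 16), nn.Linear(16, 16)       # stand-in modules
print(masked_reconstruction_loss(torch.randn(4, 10, 16), enc, dec))
```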
[487] FIVA: Federated Inverse Variance Averaging for Universal CT Segmentation with Uncertainty Estimation
Asim Ukaye, Numan Saeed, Karthik Nandakumar
Main category: eess.IV
TL;DR: A federated learning approach for universal segmentation in diverse abdominal CT datasets, using model and predictive uncertainty for improved aggregation and inference.
Details
Motivation: Addressing challenges of heterogeneous CT segmentation datasets and patient privacy by leveraging federated learning and uncertainty measures.
Method: Utilizes stochastic mini-batch gradient descent to estimate model weight distributions, aggregates parameters with a Bayesian-inspired inverse-variance scheme, and quantifies prediction uncertainty.
Result: Improves federated aggregation quality and uncertainty-weighted inference compared to baselines.
Conclusion: The approach effectively enhances segmentation performance and provides confidence measures for clinical use.
Abstract: Different CT segmentation datasets are typically obtained from different scanners under different capture settings and often provide segmentation labels for a limited and often disjoint set of organs. Using these heterogeneous data effectively while preserving patient privacy can be challenging. This work presents a novel federated learning approach to achieve universal segmentation across diverse abdominal CT datasets by utilizing model uncertainty for aggregation and predictive uncertainty for inference. Our approach leverages the inherent noise in stochastic mini-batch gradient descent to estimate a distribution over the model weights to provide an on-the-go uncertainty over the model parameters at the client level. The parameters are then aggregated at the server using the additional uncertainty information using a Bayesian-inspired inverse-variance aggregation scheme. Furthermore, the proposed method quantifies prediction uncertainty by propagating the uncertainty from the model weights, providing confidence measures essential for clinical decision-making. In line with recent work, predictive uncertainty is utilized at the inference stage to improve predictive performance. Experimental evaluations demonstrate the effectiveness of this approach in improving both the quality of federated aggregation and uncertainty-weighted inference compared to previously established baselines. The code for this work is made available at: https://github.com/asimukaye/fiva
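The Bayesian-inspired inverse-variance aggregation can be sketched directly: each client reports a per-parameter mean and variance (estimated from SGD noise), and the server weights client means by their precision. A minimal sketch (shapes and values are illustrative):

```python
import torch

def inverse_variance_aggregate(means, variances, eps=1e-8):
    """Combine per-client weight estimates: weight each client's mean by
    its precision (1/variance). `means`/`variances` are lists of same-shape
    tensors, one pair per client."""
    precisions = [1.0 / (v + eps) for v in variances]
    total = torch.stack(precisions).sum(0)
    agg = torch.stack([m * p for m, p in zip(means, precisions)]).sum(0) / total
    agg_var = 1.0 / total             # fused (posterior-style) variance
    return agg, agg_var

m = [torch.tensor([0.9, 1.1]), torch.tensor([1.3, 0.7])]
v = [torch.tensor([0.01, 0.04]), torch.tensor([0.04, 0.01])]
print(inverse_variance_aggregate(m, v))  # pulled toward the lower-variance client
```

Confident (low-variance) clients dominate each parameter's aggregate, which is the intended behavior under heterogeneous client data.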
[488] Zero-shot self-supervised learning of single breath-hold magnetic resonance cholangiopancreatography (MRCP) reconstruction
Jinho Kim, Marcel Dominik Nickel, Florian Knoll
Main category: eess.IV
TL;DR: Zero-shot self-supervised learning reduces breath-hold times in MRCP, improving image quality and offering practical clinical workflow solutions.
Details
Motivation: To reduce breath-hold durations in MRCP while maintaining high image quality.
Method: Applied zero-shot self-supervised reconstruction, using a pretrained network to enable shallow training, and compared it against parallel imaging and compressed sensing.
Result: Zero-shot improved image quality, matching respiratory-triggered MRCP, with shallow training reducing computation time significantly.
Conclusion: Zero-shot learning is feasible for MRCP, reducing breath-hold times and offering clinical practicality.
Abstract: Purpose: To investigate the feasibility of applying zero-shot self-supervised learning reconstruction to reduce breath-hold times in magnetic resonance cholangiopancreatography (MRCP). Methods: Breath-hold MRCP was acquired from 11 healthy volunteers on a 3T scanner using an incoherent k-space sampling pattern leading to a breath-hold duration of 14s. We evaluated zero-shot reconstruction of breath-hold MRCP against parallel imaging of respiratory-triggered MRCP acquired in 338s on average and compressed sensing reconstruction of breath-hold MRCP. To address the long computation times of zero-shot trainings, we used a training approach that leverages a pretrained network to reduce backpropagation depth during training. Results: Zero-shot learning reconstruction significantly improved visual image quality compared to compressed sensing reconstruction, particularly in terms of signal-to-noise ratio and ductal delineation, and reached a level of quality comparable to that of successful respiratory-triggered acquisitions with regular breathing patterns. Shallow training provided nearly equivalent reconstruction performance with a training time of 11 minutes in comparison to 271 minutes for a conventional zero-shot training. Conclusion: Zero-shot learning delivers high-fidelity MRCP reconstructions with reduced breath-hold times, and shallow training offers a practical solution for translation to time-constrained clinical workflows.
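Zero-shot self-supervised reconstruction is typically built on data splitting: the acquired k-space of a single scan is partitioned into an input subset fed to the network and a held-out subset used only in the loss, so no fully sampled reference is needed. The sketch below illustrates that loss construction for a single-coil toy case; it is a simplified stand-in, not the authors' pipeline, and the pretrained-network shallow-training trick is omitted.

```python
import torch
import torch.nn as nn

def zero_shot_step(net, kspace, samp_mask, split=0.6):
    """One self-supervised training step on a single scan
    (single-coil toy case, in the spirit of data-splitting SSL).

    kspace:    (H, W) complex measured k-space (zeros where unsampled)
    samp_mask: (H, W) bool, True at acquired locations
    """
    keep = torch.rand(samp_mask.shape) < split
    in_mask = samp_mask & keep            # the network sees these samples
    loss_mask = samp_mask & ~keep         # the loss is computed on these

    zero_filled = torch.fft.ifft2(kspace * in_mask.to(kspace.dtype))
    x = torch.view_as_real(zero_filled).permute(2, 0, 1)[None]  # (1, 2, H, W)
    out = net(x)[0].permute(1, 2, 0).contiguous()
    recon = torch.view_as_complex(out)

    pred_k = torch.fft.fft2(recon)        # consistency with held-out k-space
    return (torch.abs(pred_k - kspace)[loss_mask] ** 2).mean()

# toy denoiser mapping 2-channel (real/imag) images to 2-channel images
net = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 2, 3, padding=1))
kspace = torch.fft.fft2(torch.randn(64, 64, dtype=torch.cfloat))
samp_mask = torch.rand(64, 64) < 0.4
loss = zero_shot_step(net, kspace * samp_mask.to(kspace.dtype), samp_mask)
loss.backward()
```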
[489] From Explainable to Explained AI: Ideas for Falsifying and Quantifying Explanations
Yoni Schirris, Eric Marcus, Jonas Teuwen, Hugo Horlings, Efstratios Gavves
Main category: eess.IV
TL;DR: A human-machine-VLM interaction system is proposed to explain deep learning models in computational pathology, enabling qualitative and quantitative evaluation of explanations.
Details
Motivation: To support clinical integration of medical image analysis systems by revealing whether a deep learning model relies on spurious features or on novel biological insights.
Method: A system combining an AI-integrated slide viewer for sliding-window experiments with vision-language models to quantify the predictiveness of explanations.
Result: The system allows qualitative testing of explanations and quantifiably distinguishes competing explanations.
Conclusion: Provides a practical path from explainable AI to explained AI in digital pathology.
Abstract: Explaining deep learning models is essential for clinical integration of medical image analysis systems. A good explanation highlights whether a model depends on spurious features that undermine generalization and harm a subset of patients or, conversely, whether it may present novel biological insights. Although techniques like GradCAM can identify influential features, they are measurement tools that do not themselves form an explanation. We propose a human-machine-VLM interaction system tailored to explaining classifiers in computational pathology, including multi-instance learning for whole-slide images. Our proof of concept comprises (1) an AI-integrated slide viewer to run sliding-window experiments to test claims of an explanation, and (2) quantification of an explanation’s predictiveness using general-purpose vision-language models. The results demonstrate that the system allows us to qualitatively test the claims of an explanation and to quantifiably distinguish competing explanations. This offers a practical path from explainable AI to explained AI in digital pathology and beyond. Code and prompts are available at https://github.com/nki-ai/x2x.
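The sliding-window idea behind falsifying an explanation can be sketched generically: occlude the region an explanation points at and measure how the classifier's output shifts. The snippet below is a hypothetical illustration of that loop; the authors' slide viewer and VLM-based predictiveness scoring are not reproduced.

```python
import torch
import torch.nn as nn

def occlusion_effect(model, image, region_mask, fill=0.0):
    """Measure how much a claimed region drives the prediction.

    image:       (C, H, W) input tile
    region_mask: (H, W) bool, True where the explanation claims the
                 evidence lies; a large probability drop under occlusion
                 supports the claim, while no drop speaks against it.
    """
    model.eval()
    with torch.no_grad():
        base = model(image[None]).softmax(-1)
        occluded = image.clone()
        occluded[:, region_mask] = fill
        perturbed = model(occluded[None]).softmax(-1)
    return (base - perturbed).squeeze(0)   # per-class probability shift

# toy classifier and an empty (nothing-occluded) mask as a sanity check
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
shift = occlusion_effect(model, torch.randn(3, 224, 224),
                         torch.zeros(224, 224, dtype=torch.bool))
```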
[490] AMRG: Extend Vision Language Models for Automatic Mammography Report Generation
Nak-Jun Sung, Donghyun Lee, Bo Hwa Choi, Chae Jung Park
Main category: eess.IV
TL;DR: AMRG is the first end-to-end framework for generating mammography reports using large vision-language models, achieving strong performance in language and clinical metrics.
Details
Motivation: Addressing the underexplored task of mammography report generation, which involves challenges like multiview image reasoning and unstructured radiologic language.
Method: Utilizes MedGemma-4B-it, a domain-specialized VLM, with parameter-efficient fine-tuning via LoRA. Trained on the DMID dataset.
Result: Achieves high scores in ROUGE-L (0.5691), METEOR (0.6152), CIDEr (0.5818), and BI-RADS accuracy (0.5582), with improved diagnostic consistency.
Conclusion: AMRG provides a scalable foundation for radiology report generation and advances multimodal medical AI research.
Abstract: Mammography report generation is a critical yet underexplored task in medical AI, characterized by challenges such as multiview image reasoning, high-resolution visual cues, and unstructured radiologic language. In this work, we introduce AMRG (Automatic Mammography Report Generation), the first end-to-end framework for generating narrative mammography reports using large vision-language models (VLMs). Building upon MedGemma-4B-it, a domain-specialized, instruction-tuned VLM, we employ a parameter-efficient fine-tuning (PEFT) strategy via Low-Rank Adaptation (LoRA), enabling lightweight adaptation with minimal computational overhead. We train and evaluate AMRG on DMID, a publicly available dataset of paired high-resolution mammograms and diagnostic reports. This work establishes the first reproducible benchmark for mammography report generation, addressing a longstanding gap in multimodal clinical AI. We systematically explore LoRA hyperparameter configurations and conduct comparative experiments across multiple VLM backbones, including both domain-specific and general-purpose models under a unified tuning protocol. Our framework demonstrates strong performance across both language generation and clinical metrics, achieving a ROUGE-L score of 0.5691, METEOR of 0.6152, CIDEr of 0.5818, and BI-RADS accuracy of 0.5582. Qualitative analysis further highlights improved diagnostic consistency and reduced hallucinations. AMRG offers a scalable and adaptable foundation for radiology report generation and paves the way for future research in multimodal medical AI.
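Parameter-efficient fine-tuning with LoRA typically takes the following shape with the Hugging Face peft library. The rank, alpha, dropout, and target module names below are placeholders, not the configuration reported in the paper, and the loading class and checkpoint name are illustrative stand-ins.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stand-in base model; substitute the actual VLM checkpoint and loader.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

lora_cfg = LoraConfig(
    r=16,                                  # low-rank dimension (hypothetical)
    lora_alpha=32,                         # scaling factor (hypothetical)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # only adapter weights are trainable
```

Because only the low-rank adapters receive gradients, sweeping LoRA hyperparameters across backbones, as the paper does, stays cheap relative to full fine-tuning.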
[491] A Generative Imputation Method for Multimodal Alzheimer’s Disease Diagnosis
Reihaneh Hassanzadeh, Anees Abrol, Hamid Reza Hassanzadeh, Vince D. Calhoun
Main category: eess.IV
TL;DR: A generative adversarial network (GAN) method improves Alzheimer’s disease classification by 9% by reconstructing missing neuroimaging modalities, outperforming traditional approaches.
Details
Motivation: Multimodal data analysis enhances brain disorder diagnoses, but incomplete data poses challenges. Traditional methods like subsampling or zero-filling reduce accuracy or introduce biases.
Method: Proposed a GAN-based method to reconstruct missing modalities (T1-weighted MRI and functional network connectivity) while preserving disease patterns.
Result: 9% improvement in Alzheimer’s disease classification accuracy compared to traditional methods.
Conclusion: GAN-based imputation is effective for handling incomplete multimodal neuroimaging data, improving diagnostic accuracy.
Abstract: Multimodal data analysis can lead to more accurate diagnoses of brain disorders due to the complementary information that each modality adds. However, a major challenge of using multimodal datasets in the neuroimaging field is incomplete data, where some of the modalities are missing for certain subjects. Hence, effective strategies are needed for completing the data. Traditional methods, such as subsampling or zero-filling, may reduce the accuracy of predictions or introduce unintended biases. In contrast, advanced methods such as generative models have emerged as promising solutions without these limitations. In this study, we proposed a generative adversarial network method designed to reconstruct missing modalities from existing ones while preserving the disease patterns. We used T1-weighted structural magnetic resonance imaging and functional network connectivity as two modalities. Our findings showed a 9% improvement in the classification accuracy for Alzheimer’s disease versus cognitive normal groups when using our generative imputation method compared to the traditional approaches.
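The imputation setup can be sketched as a conditional GAN: a generator maps the observed modality to the missing one, and a discriminator judges real versus synthesized pairs, usually alongside a reconstruction term that helps preserve disease-related structure. The sketch below uses hypothetical feature-vector shapes and is not the paper's network design.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))  # A -> B
D = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))      # (A, B) -> real?
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

mod_a = torch.randn(16, 128)   # observed modality (e.g., sMRI features)
mod_b = torch.randn(16, 128)   # modality to impute (e.g., FNC features)

# discriminator step: real pairs vs. imputed pairs
fake_b = G(mod_a).detach()
d_loss = bce(D(torch.cat([mod_a, mod_b], 1)), torch.ones(16, 1)) + \
         bce(D(torch.cat([mod_a, fake_b], 1)), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# generator step: fool D while staying close to the target modality
fake_b = G(mod_a)
g_loss = bce(D(torch.cat([mod_a, fake_b], 1)), torch.ones(16, 1)) + \
         0.1 * nn.functional.l1_loss(fake_b, mod_b)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The L1 term is one common way to encourage pattern preservation; the weighting and any disease-specific losses here are assumptions.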
[492] Dynamic Survival Prediction using Longitudinal Images based on Transformer
Bingfan Liu, Haolun Shi, Jiguo Cao
Main category: eess.IV
TL;DR: SurLonFormer, a Transformer-based model, integrates longitudinal medical images and structured data for survival prediction, outperforming existing methods in performance and interpretability.
Details
Motivation: Current survival analysis methods poorly utilize censored data, ignore correlations in longitudinal images, and lack interpretability.
Method: SurLonFormer combines a Vision Encoder (spatial features), Sequence Encoder (temporal aggregation), and Survival Encoder (Cox model) for survival prediction.
Result: SurLonFormer excels in predictive performance and identifies disease-related biomarkers, validated through simulations and Alzheimer’s disease analysis.
Conclusion: SurLonFormer effectively addresses limitations of current methods, offering improved accuracy and interpretability in survival analysis.
Abstract: Survival analysis utilizing multiple longitudinal medical images plays a pivotal role in the early detection and prognosis of diseases by providing insight beyond single-image evaluations. However, current methodologies often inadequately utilize censored data, overlook correlations among longitudinal images measured over multiple time points, and lack interpretability. We introduce SurLonFormer, a novel Transformer-based neural network that integrates longitudinal medical imaging with structured data for survival prediction. Our architecture comprises three key components: a Vision Encoder for extracting spatial features, a Sequence Encoder for aggregating temporal information, and a Survival Encoder based on the Cox proportional hazards model. This framework effectively incorporates censored data, addresses scalability issues, and enhances interpretability through occlusion sensitivity analysis and dynamic survival prediction. Extensive simulations and a real-world application in Alzheimer’s disease analysis demonstrate that SurLonFormer achieves superior predictive performance and successfully identifies disease-related imaging biomarkers.
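The Survival Encoder's Cox component reduces to the negative log partial likelihood, which handles censoring naturally: censored subjects appear only in the risk sets of others. A minimal PyTorch sketch (without tie handling) follows; it is a generic Cox loss, not the paper's exact head.

```python
import torch

def cox_ph_loss(risk, time, event):
    """Negative log partial likelihood of the Cox model (no tie handling).

    risk:  (N,) predicted log-risk scores from the network
    time:  (N,) observed follow-up times
    event: (N,) 1.0 if the event occurred, 0.0 if censored
    """
    order = torch.argsort(time, descending=True)   # suffixes form risk sets
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)   # log-sum over each risk set
    return -((risk - log_cumsum) * event).sum() / event.sum().clamp(min=1)

loss = cox_ph_loss(torch.randn(32), torch.rand(32),
                   torch.randint(0, 2, (32,)).float())
```

Sorting by descending time makes each cumulative log-sum-exp run over exactly the subjects still at risk, which is what the partial likelihood requires.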
[493] MIND: A Noise-Adaptive Denoising Framework for Medical Images Integrating Multi-Scale Transformer
Tao Tang, Chengxu Yang
Main category: eess.IV
TL;DR: A medical image adaptive denoising model (MI-ND) combining multi-scale convolutional and Transformer architecture improves image quality and diagnostic accuracy.
Details
Motivation: Medical images often suffer from non-uniform noise, impacting diagnostic accuracy. The paper aims to enhance image quality and downstream diagnostic performance.
Method: Proposes MI-ND with a noise level estimator (NLE) and a noise-adaptive attention module (NAAB) for channel-spatial attention regulation and cross-modal feature fusion.
Result: Outperforms comparative methods in PSNR, SSIM, LPIPS, and improves F1 score and ROC-AUC in diagnostic tasks.
Conclusion: MI-ND offers strong practical value for medical image enhancement and AI-assisted diagnosis, excelling in structural recovery and diagnostic sensitivity.
Abstract: The central role of medical images in disease diagnosis means their quality directly affects the accuracy of clinical judgment. However, due to factors such as low-dose scanning, equipment limitations, and imaging artifacts, medical images are often degraded by non-uniform noise, which seriously hampers structure recognition and lesion detection. This paper proposes a medical image adaptive denoising model (MI-ND) that integrates a multi-scale convolutional and Transformer architecture, introduces a noise level estimator (NLE) and a noise-adaptive attention module (NAAB), and realizes noise-aware channel-spatial attention regulation and cross-modal feature fusion. Systematic testing on multimodal public datasets shows that the method significantly outperforms competing methods on image quality metrics such as PSNR, SSIM, and LPIPS, and improves the F1 score and ROC-AUC in downstream diagnostic tasks, demonstrating strong practical value and potential for wider adoption. The model offers clear benefits in structural recovery, diagnostic sensitivity, and cross-modal robustness, providing an effective solution for medical image enhancement and AI-assisted diagnosis and treatment.
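The noise-adaptive idea, a noise level estimator whose output modulates attention, can be illustrated with a small gating module. The sketch below is a schematic stand-in for the NLE/NAAB pairing under assumed shapes, not the published architecture.

```python
import torch
import torch.nn as nn

class NoiseAdaptiveGate(nn.Module):
    """Estimate a per-image noise level and use it to re-weight channel
    attention, so denoising strength adapts to the input's noisiness."""
    def __init__(self, channels=32):
        super().__init__()
        self.nle = nn.Sequential(                    # noise level estimator
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1), nn.Softplus())   # sigma >= 0
        self.attn = nn.Sequential(                   # channel attention
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, feats):                        # feats: (B, C, H, W)
        sigma = self.nle(feats)                      # (B, 1) noise estimate
        gate = self.attn(feats) * sigma              # stronger gating when noisier
        return feats * gate[..., None, None], sigma

x = torch.randn(4, 32, 64, 64)
out, sigma = NoiseAdaptiveGate()(x)
```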
[494] T-CACE: A Time-Conditioned Autoregressive Contrast Enhancement Multi-Task Framework for Contrast-Free Liver MRI Synthesis, Segmentation, and Diagnosis
Xiaojiao Xiao, Jianfeng Zhao, Qinmin Vivian Hu, Guanghui Wang
Main category: eess.IV
TL;DR: T-CACE synthesizes multi-phase contrast-enhanced MRI from non-contrast MRI, improving safety and diagnostic efficiency for liver cancer.
Details
Motivation: Addressing risks from contrast agents, time-consuming manual assessments, and limited annotated datasets in traditional MRI.
Method: Proposes T-CACE with conditional token encoding, dynamic time-aware attention mask, and temporal classification consistency for smooth, plausible transitions.
Result: Outperforms state-of-the-art methods in synthesis, segmentation, and lesion classification on two liver MRI datasets.
Conclusion: T-CACE offers a safer, efficient, and reliable alternative to traditional contrast-enhanced imaging for liver lesion assessment.
Abstract: Magnetic resonance imaging (MRI) is a leading modality for the diagnosis of liver cancer, significantly improving lesion classification and patient outcomes. However, traditional MRI faces challenges including risks from contrast agent (CA) administration, time-consuming manual assessment, and limited annotated datasets. To address these limitations, we propose a Time-Conditioned Autoregressive Contrast Enhancement (T-CACE) framework for synthesizing multi-phase contrast-enhanced MRI (CEMRI) directly from non-contrast MRI (NCMRI). T-CACE introduces three core innovations: (1) a conditional token encoding (CTE) mechanism that unifies anatomical priors and temporal phase information into latent representations; (2) a dynamic time-aware attention mask (DTAM) that adaptively modulates inter-phase information flow using a Gaussian-decayed attention mechanism, ensuring smooth and physiologically plausible transitions across phases; and (3) a temporal classification consistency (TCC) constraint that aligns the lesion classification output with the evolution of the physiological signal, further enhancing diagnostic reliability. Extensive experiments on two independent liver MRI datasets demonstrate that T-CACE outperforms state-of-the-art methods in image synthesis, segmentation, and lesion classification. This framework offers a clinically relevant and efficient alternative to traditional contrast-enhanced imaging, improving safety, diagnostic efficiency, and reliability in the assessment of liver lesions. The implementation of T-CACE is publicly available at: https://github.com/xiaojiao929/T-CACE.
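The Gaussian-decayed attention described for DTAM amounts to biasing attention logits by the squared temporal distance between phase tokens. A minimal sketch of such a bias term is below; the actual parameterization in the paper may differ.

```python
import torch

def gaussian_time_bias(times, sigma=1.0):
    """Additive attention bias that decays with temporal distance.

    times: (T,) acquisition time of each phase token.
    Returns a (T, T) log-space bias: 0 for identical times, increasingly
    negative as |t_i - t_j| grows, so post-softmax attention follows a
    Gaussian decay over time.
    """
    dist = times[:, None] - times[None, :]
    return -(dist ** 2) / (2 * sigma ** 2)   # add to logits before softmax

times = torch.tensor([0.0, 1.0, 2.5, 4.0])   # hypothetical phase timestamps
scores = torch.randn(4, 4) + gaussian_time_bias(times)
attn = scores.softmax(-1)                    # nearby phases attend more strongly
```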
[495] Lung-DDPM: Semantic Layout-guided Diffusion Models for Thoracic CT Image Synthesis
Yifan Jiang, Yannick Lemaréchal, Sophie Plante, Josée Bafaro, Jessica Abi-Rjeile, Philippe Joubert, Philippe Després, Venkata Manem
Main category: eess.IV
TL;DR: Lung-DDPM, a semantic layout-guided DDPM, generates high-fidelity 3D synthetic CT images to address data scarcity in lung cancer screening, improving downstream tasks like nodule segmentation.
Details
Motivation: AI-assisted medical imaging faces challenges due to costly annotations and privacy concerns, limiting large-scale dataset construction.
Method: Uses a semantic layout-guided DDPM to generate anatomically reasonable synthetic CT images from incomplete layouts.
Result: Outperforms SOTA models in image quality (FID: 0.0047, MMD: 0.0070, MSE: 0.0024) and enhances nodule segmentation (Dice: 0.3914, sensitivity: 0.4393).
Conclusion: Lung-DDPM shows promise for broader medical imaging applications, with code and models publicly available.
Abstract: With the rapid development of artificial intelligence (AI), AI-assisted medical imaging analysis demonstrates remarkable performance in early lung cancer screening. However, the costly annotation process and privacy concerns limit the construction of large-scale medical datasets, hampering the further application of AI in healthcare. To address the data scarcity in lung cancer screening, we propose Lung-DDPM, a thoracic CT image synthesis approach that effectively generates high-fidelity 3D synthetic CT images, which prove helpful in downstream lung nodule segmentation tasks. Our method is based on semantic layout-guided denoising diffusion probabilistic models (DDPM), enabling anatomically reasonable, seamless, and consistent sample generation even from incomplete semantic layouts. Our results suggest that the proposed method outperforms other state-of-the-art (SOTA) generative models in image quality evaluation and downstream lung nodule segmentation tasks. Specifically, Lung-DDPM achieved superior performance on our large validation cohort, with a Fréchet inception distance (FID) of 0.0047, maximum mean discrepancy (MMD) of 0.0070, and mean squared error (MSE) of 0.0024. These results were 7.4×, 3.1×, and 29.5× better than the second-best competitors, respectively. Furthermore, the lung nodule segmentation model, trained on a dataset combining real and Lung-DDPM-generated synthetic samples, attained a Dice Coefficient (Dice) of 0.3914 and sensitivity of 0.4393. This represents 8.8% and 18.6% improvements in Dice and sensitivity compared to the model trained solely on real samples. The experimental results highlight Lung-DDPM’s potential for a broader range of medical imaging applications, such as general tumor segmentation, cancer survival estimation, and risk prediction. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM/.
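Semantic layout guidance in a DDPM is most simply realized by concatenating the layout with the noisy volume at every denoising step and training the network to predict the added noise. The sketch below shows that conditioning pattern for a toy 3D denoiser; it is a schematic under assumed shapes, not Lung-DDPM itself.

```python
import torch
import torch.nn as nn

class LayoutConditionedDenoiser(nn.Module):
    """Toy DDPM noise predictor conditioned on a semantic layout via
    channel concatenation (1 image channel + L layout channels)."""
    def __init__(self, layout_channels=4, width=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1 + layout_channels, width, 3, padding=1), nn.SiLU(),
            nn.Conv3d(width, 1, 3, padding=1))

    def forward(self, noisy_ct, layout, t):
        # A real model would also embed the timestep t; omitted here.
        return self.net(torch.cat([noisy_ct, layout], dim=1))

x0 = torch.randn(2, 1, 16, 32, 32)       # toy CT volumes
layout = torch.randn(2, 4, 16, 32, 32)   # stand-in semantic layout channels
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(x0)
alpha_bar = 0.5                          # placeholder noise-schedule value
x_t = alpha_bar**0.5 * x0 + (1 - alpha_bar)**0.5 * noise
loss = ((LayoutConditionedDenoiser()(x_t, layout, t) - noise) ** 2).mean()
```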