Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 69]
- cs.CV [Total: 124]
- cs.AI [Total: 23]
- cs.SD [Total: 8]
- cs.LG [Total: 135]
- cs.MA [Total: 3]
- cs.MM [Total: 1]
- eess.AS [Total: 12]
- eess.IV [Total: 17]
cs.CL
[1] From Image Captioning to Visual Storytelling
Admitos Passadakis, Yingjin Song, Albert Gatt
Main category: cs.CL
TL;DR: A two-stage visual storytelling approach that first generates image captions then transforms them into coherent narratives, showing improved story quality and faster training.
Details
Motivation: To balance between story grounding in images and narrative coherence by treating visual storytelling as a superset of image captioning.
Method: Two-stage framework: 1) vision-to-language model for image captions, 2) language-to-language methods to transform captions into coherent narratives.
Result: Positive impact on story quality, accelerated training time, and framework reusability. Proposed new ‘ideality’ metric to simulate oracle model performance.
Conclusion: Integrating captioning and storytelling under unified framework improves visual storytelling while making the approach more efficient and reproducible.
Abstract: Visual Storytelling is a challenging multimodal task between Vision & Language, where the purpose is to generate a story for a stream of images. Its difficulty lies in the fact that the story should be both grounded in the image sequence and narratively coherent. The aim of this work is to balance these aspects by treating Visual Storytelling as a superset of Image Captioning, an approach quite different from most prior relevant studies. This means that we first employ a vision-to-language model for obtaining captions of the input images, and then these captions are transformed into coherent narratives using language-to-language methods. Our multifaceted evaluation shows that integrating captioning and storytelling under a unified framework has a positive impact on the quality of the produced stories. In addition, compared to numerous previous studies, this approach accelerates training time and makes our framework readily reusable and reproducible by anyone interested. Lastly, we propose a new metric/tool, named ideality, that can be used to simulate how far some results are from an oracle model, and we apply it to emulate human-likeness in visual storytelling.
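To make the two-stage design concrete, here is a minimal sketch using off-the-shelf Hugging Face pipelines; the model names are illustrative stand-ins, not the models used in the paper.

```python
from transformers import pipeline

# Stage 1: any vision-to-language captioner (BLIP is an arbitrary choice here).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
# Stage 2: any language-to-language rewriter (Flan-T5 is an arbitrary choice here).
narrator = pipeline("text2text-generation", model="google/flan-t5-base")

def tell_story(image_paths):
    # Caption each image in the stream independently.
    captions = [captioner(path)[0]["generated_text"] for path in image_paths]
    # Fuse the captions into a single coherent narrative.
    prompt = ("Rewrite these scene descriptions as one coherent short story: "
              + " ".join(captions))
    return narrator(prompt, max_new_tokens=128)[0]["generated_text"]
```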
[2] Benchmarking Sociolinguistic Diversity in Swahili NLP: A Taxonomy-Guided Approach
Kezia Oketch, John P. Lalor, Ahmed Abbasi
Main category: cs.CL
TL;DR: First taxonomy-guided evaluation of Swahili NLP using health-related psychometric data from Kenyan speakers, revealing sociolinguistic influences on model performance.
Details
Motivation: Address gaps in sociolinguistic diversity evaluation for African languages, particularly Swahili, by examining how cultural and linguistic variations affect NLP model performance.
Method: Collected 2,170 free-text responses from Kenyan speakers on health psychometric tasks, developed a structured taxonomy to analyze sociolinguistic features (tribal influences, urban vernacular, code-mixing, loanwords), and evaluated pre-trained and instruction-tuned language models.
Result: The data exhibited significant sociolinguistic variation including tribal influences, urban vernacular, code-mixing, and loanwords. Model prediction errors were systematically analyzed through the taxonomy lens, revealing how sociolinguistic factors shape performance.
Conclusion: The study advances culturally grounded evaluation frameworks for NLP and demonstrates the critical role of sociolinguistic variation in understanding and improving model performance for under-resourced languages like Swahili.
Abstract: We introduce the first taxonomy-guided evaluation of Swahili NLP, addressing gaps in sociolinguistic diversity. Drawing on health-related psychometric tasks, we collect a dataset of 2,170 free-text responses from Kenyan speakers. The data exhibits tribal influences, urban vernacular, code-mixing, and loanwords. We develop a structured taxonomy and use it as a lens for examining model prediction errors across pre-trained and instruction-tuned language models. Our findings advance culturally grounded evaluation frameworks and highlight the role of sociolinguistic variation in shaping model performance.
[3] Contrastive Analysis of Constituent Order Preferences Within Adverbial Roles in English and Chinese News: A Large-Language-Model-Driven Approach
Yiran Rex Ma
Main category: cs.CL
TL;DR: This paper analyzes English-Chinese news translation differences in adverbial chunk ordering using LLM-annotated corpora, finding systematic preferences in positioning patterns between the two languages.
Details
Motivation: To explore differences in constituent order between English and Chinese news from a functional chunk perspective and analyze their positional preferences and distribution patterns.
Method: Analysis based on comparable English-Chinese news corpora annotated by a Large Language Model (LLM), focusing on functional chunks with adverbial roles.
Result: English news prefers linear narrative with core information first (post-positioned chunks), Chinese prefers background-first presentation (pre-positioned chunks). Chinese shows stronger pre-positioning tendency in SVO structures. Both languages show flexibility in co-occurring function blocks driven by information and pragmatic purposes.
Conclusion: Word order exhibits both systematic preference and dynamic adaptability, providing new empirical support for contrastive study of English-Chinese information structure.
Abstract: Based on comparable English-Chinese news corpora annotated by a Large Language Model (LLM), this paper explores the differences in constituent order of English-Chinese news from the perspective of functional chunks with adverbial roles, and analyzes their typical positional preferences and distribution patterns. It is found that: (1) English news prefers a linear narrative with core information first, and functional chunks are mostly post-positioned, while Chinese news prefers an overall presentation mode of background first, and functional chunks are often pre-positioned; (2) in SVO structure, both English and Chinese news show differences in the distribution of functional chunks, but the tendency toward pre-positioning in Chinese is more significant, while the tendency toward post-positioning in English is relatively mild; (3) when functional chunks co-occur, both English and Chinese news show high flexibility, and the order adjustment is driven by information and pragmatic purposes. The study reveals that word order has both systematic preference and dynamic adaptability, providing new empirical support for the contrastive study of English-Chinese information structure.
[4] T-REX: Table – Refute or Entail eXplainer
Tim Luka Horstmann, Baptiste Geisenberger, Mehwish Alam
Main category: cs.CL
TL;DR: T-REX is the first interactive tool for verifying textual claims against tabular data using advanced LLMs, making table fact-checking accessible to non-experts.
Details
Motivation: Current table fact-checking solutions using LLMs remain inaccessible to non-experts despite recent advances, creating a need for user-friendly tools.
Method: Developed T-REX, an interactive tool that uses state-of-the-art instruction-tuned reasoning LLMs for multimodal, multilingual table claim verification.
Result: Created a live, openly available online system that provides accurate and transparent claim verification over tabular data.
Conclusion: T-REX successfully bridges the accessibility gap by empowering non-experts with advanced fact-checking technology for table-based claim verification.
Abstract: Verifying textual claims against structured tabular data is a critical yet challenging task in Natural Language Processing with broad real-world impact. While recent advances in Large Language Models (LLMs) have enabled significant progress in table fact-checking, current solutions remain inaccessible to non-experts. We introduce T-REX (Table – Refute or Entail eXplainer), the first live, interactive tool for claim verification over multimodal, multilingual tables using state-of-the-art instruction-tuned reasoning LLMs. Designed for accuracy and transparency, T-REX empowers non-experts by providing access to advanced fact-checking technology. The system is openly available online.
[5] Confidence Estimation for Text-to-SQL in Large Language Models
Sepideh Entezari Maleki, Mohammadreza Pourreza, Davood Rafiei
Main category: cs.CL
TL;DR: Study on confidence estimation methods for LLM-generated SQL queries without gold answers, comparing black-box and white-box approaches with execution-based grounding.
Details
Motivation: To assess reliability of text-to-SQL outputs from LLMs when model weights and gradients are inaccessible, addressing the need for confidence estimation in constrained environments.
Method: Evaluated black-box (consistency-based) and white-box (SQL-syntax-aware logit interpretation) confidence estimation strategies on cross-domain text-to-SQL benchmarks, incorporating execution-based query grounding.
Result: Consistency-based methods performed best for black-box models, SQL-syntax-aware approaches were superior for white-box settings, and execution-based grounding improved both approaches.
Conclusion: Effective confidence estimation for text-to-SQL requires different strategies for black-box vs white-box access, with execution-based grounding providing valuable supplementary signals for improved reliability assessment.
Abstract: Confidence estimation for text-to-SQL aims to assess the reliability of model-generated SQL queries without having access to gold answers. We study this problem in the context of large language models (LLMs), where access to model weights and gradients is often constrained. We explore both black-box and white-box confidence estimation strategies, evaluating their effectiveness on cross-domain text-to-SQL benchmarks. Our evaluation highlights the superior performance of consistency-based methods among black-box models and the advantage of SQL-syntax-aware approaches for interpreting LLM logits in white-box settings. Furthermore, we show that execution-based grounding of queries provides a valuable supplementary signal, improving the effectiveness of both approaches.
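As a concrete illustration of the consistency-plus-execution idea, the sketch below samples several candidate queries and uses agreement of execution results as the confidence score; `generate_sql` stands in for any black-box LLM call, and the paper's exact scoring may differ.

```python
import sqlite3
from collections import Counter

def execution_signature(db_path, sql):
    """Run the query and return a hashable result signature (None on error)."""
    try:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(sql).fetchall()
        return tuple(sorted(map(str, rows)))
    except sqlite3.Error:
        return None

def consistency_confidence(question, db_path, generate_sql, n_samples=10):
    # Sample candidate queries at nonzero temperature.
    candidates = [generate_sql(question, temperature=0.8) for _ in range(n_samples)]
    # Ground each candidate by its execution result rather than its surface form.
    signatures = [execution_signature(db_path, sql) for sql in candidates]
    executable = [s for s in signatures if s is not None]
    if not executable:
        return None, 0.0                       # nothing ran: minimal confidence
    top_sig, count = Counter(executable).most_common(1)[0]
    best_sql = candidates[signatures.index(top_sig)]
    return best_sql, count / n_samples         # agreement rate as confidence
```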
[6] Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models
Badrinath Ramakrishnan, Akshaya Balaji
Main category: cs.CL
TL;DR: Fine-tuning LLMs significantly increases privacy risks through data memorization, with leakage rates jumping from 0-5% to 60-75%. A multi-layered privacy framework reduces leakage to 0% while preserving 94.7% model utility.
Details
Motivation: LLMs' tendency to memorize training data during fine-tuning creates serious privacy risks, especially when sensitive data is involved, necessitating effective privacy protection measures.
Method: Proposed a multi-layered privacy protection framework with four techniques: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering. Tested on GPT-2, Phi-3, and Gemma-2 models.
Result: Fine-tuning with repeated sensitive data increased privacy leakage from baseline 0-5% to 60-75% (64.2% average increase). The privacy protection methods successfully reduced data leakage to 0% while maintaining 94.7% of original model utility.
Conclusion: The proposed multi-layered privacy protection framework effectively eliminates data leakage risks in fine-tuned LLMs while preserving most of the model’s performance, providing a practical solution to privacy concerns in LLM fine-tuning.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, but their tendency to memorize training data poses significant privacy risks, particularly during fine-tuning processes. This paper presents a comprehensive empirical analysis of data memorization in fine-tuned LLMs and introduces a novel multi-layered privacy protection framework. Through controlled experiments on modern LLM architectures including GPT-2, Phi-3, and Gemma-2, we demonstrate that fine-tuning with repeated sensitive data increases privacy leakage rates from baseline levels of 0-5% to 60-75%, representing a 64.2% average increase across tested models. We propose and rigorously evaluate four complementary privacy protection methods: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering. Our experimental results show that these techniques can reduce data leakage to 0% while maintaining 94.7% of original model utility.
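The summary does not spell out how leakage is measured, but a common way to quantify verbatim memorization looks roughly like the sketch below; `model_generate` is a hypothetical text-generation callable, and the character-level prefix split is a simplification of the usual token-level one.

```python
def leakage_rate(model_generate, sensitive_records, prefix_len=50):
    """Fraction of records whose suffix the model reproduces from its prefix."""
    leaked = 0
    for record in sensitive_records:
        prefix, suffix = record[:prefix_len], record[prefix_len:]
        completion = model_generate(prefix)    # greedy decoding assumed
        if suffix and suffix in completion:    # verbatim reproduction counts as a leak
            leaked += 1
    return leaked / len(sensitive_records)
```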
[7] Punctuation and Predicates in Language Models
Sonakshi Chauhan, Maheep Chaudhary, Koby Choy, Samuel Nellessen, Nandi Schoots
Main category: cs.CL
TL;DR: This paper investigates how information is collected and propagated in LLMs, focusing on punctuation tokens as attention sinks and examining model-specific differences in their necessity/sufficiency across GPT-2, DeepSeek, and Gemma.
Details
Motivation: To understand the computational importance of punctuation tokens and how different input components are processed in LLMs, examining whether models form early static summaries or remain sensitive to changes across layers.
Method: Used intervention-based techniques including interchange intervention and layer-swapping experiments to evaluate punctuation necessity/sufficiency and analyze processing of conditional statements and universal quantification across GPT-2, DeepSeek, and Gemma models.
Result: Found stark model-specific differences: punctuation is both necessary and sufficient in multiple GPT-2 layers, less so in DeepSeek, and not at all in Gemma. Also discovered that conditional statements and universal quantification are processed very differently.
Conclusion: The findings provide new insights into LLM internal mechanisms for punctuation usage and reasoning, with significant implications for model interpretability and understanding of information propagation across layers.
Abstract: In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens, which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance) of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or if the model remains sensitive to changes in these components across layers. We also investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then) and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability.
[8] DLLMQuant: Quantizing Diffusion-based Large Language Models
Chen Xu, Dawei Yang
Main category: cs.CL
TL;DR: DLLMQuant is a post-training quantization framework specifically designed for diffusion-based LLMs that addresses three key quantization challenges through temporal-mask adaptive sampling, interaction-aware activation quantization, and certainty-guided quantization.
Details
Motivation: Direct application of existing post-training quantization methods to diffusion-based LLMs causes severe accuracy degradation (e.g., 16% drop) due to mismatches with DLLMs' unique mechanisms like dynamic masking, iterative generation, and bidirectional attention.
Method: Proposes DLLMQuant with three novel techniques: 1) Temporal-Mask Adaptive Sampling (TMAS) for calibration across timesteps, 2) Interaction-Aware Activation Quantization (IA-AQ) using bidirectional attention signals, and 3) Certainty-Guided Quantization (CGQ) incorporating mask status and token scores.
Result: Experiments show DLLMQuant achieves significant performance gains while enhancing efficiency compared to existing PTQ methods that suffer from severe accuracy degradation when applied to DLLMs.
Conclusion: The proposed DLLMQuant framework successfully addresses the unique quantization challenges of diffusion-based LLMs by accounting for their temporal dynamics, bidirectional interactions, and masking mechanisms, enabling efficient deployment without performance degradation.
Abstract: Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used method for compressing and accelerating Large Language Models (LLMs), suffers from severe accuracy degradation and reduced generalization performance when directly applied to DLLMs (e.g., AWQ suffers a 16% accuracy drop on LLaDA under W4A4). This paper explores how DLLMs' key mechanisms - dynamic masking, iterative generation, bidirectional attention - clash with quantization. We identify three core issues: 1) Iterative generation and dynamic masking ratios lead to distinct token distributions across decoding steps, which are not adequately captured by existing PTQ calibration methods; 2) Quantization errors are accumulated and amplified progressively during iteration in DLLMs, causing quantized models to perform worse as decoding steps progress; 3) Unmasked tokens stabilize while masked tokens remain probabilistic, making the overall feature distribution incompatible with existing PTQ methods. To address these issues, we propose DLLMQuant, a PTQ framework tailored for DLLMs, which incorporates three novel techniques: 1) Temporal-Mask Adaptive Sampling (TMAS), a calibration method that accounts for both time and mask factors, with the capacity to capture distributions across timesteps. 2) Interaction-Aware Activation Quantization (IA-AQ), which utilizes bidirectional attention's interaction signals to dynamically allocate quantization resources. 3) Certainty-Guided Quantization (CGQ), which integrates mask status and token scores as key weighting criteria into error compensation, making weight quantization more suitable for DLLMs. Experiments show that DLLMQuant achieves significant performance gains while enhancing efficiency.
[9] ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine
Junying Chen, Zhenyang Cai, Zhiheng Liu, Yunjin Yang, Rongsheng Wang, Qingying Xiao, Xiangyi Feng, Zhan Su, Jing Guo, Xiang Wan, Guangjun Yu, Haizhou Li, Benyou Wang
Main category: cs.CL
TL;DR: ShizhenGPT is the first multimodal LLM for Traditional Chinese Medicine that addresses data scarcity and multimodal diagnostic challenges through a comprehensive dataset and achieves superior performance in TCM tasks.
Details
Motivation: Large language models have limited application in Traditional Chinese Medicine due to data scarcity and the multimodal nature of TCM diagnostics (looking, listening, smelling, pulse-taking), which conventional LLMs cannot handle.
Method: Created the largest TCM dataset (100GB+ text, 200GB+ multimodal data including 1.2M images, 200 hours of audio, and physiological signals), then pretrained and instruction-tuned ShizhenGPT for deep TCM knowledge and multimodal reasoning.
Result: Outperforms comparable-scale LLMs, competes with larger proprietary models, leads in TCM visual understanding among multimodal LLMs, and demonstrates unified perception across sound, pulse, smell, and vision modalities.
Conclusion: ShizhenGPT paves the way for holistic multimodal perception and diagnosis in TCM, with publicly available datasets, models, and code to inspire further exploration in this field.
Abstract: Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.
[10] MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation
Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, Yuzhuo Fu
Main category: cs.CL
TL;DR: MMReview is a comprehensive multimodal benchmark for evaluating LLMs in peer review tasks across 17 research domains with 240 papers and 13 evaluation tasks.
Details
Motivation: Current LLM-based review systems lack a unified evaluation benchmark to assess comprehensive, accurate, and human-aligned assessments, especially for multimodal content like figures and tables.
Method: Proposed MMReview, a benchmark spanning multiple disciplines with multimodal content and expert-written reviews for 240 papers across 17 domains. Designed 13 tasks in four categories: step-wise review generation, outcome formulation, human preference alignment, and robustness to adversarial manipulation.
Result: Extensive experiments on 16 open-source and 5 closed-source models demonstrate the benchmark’s thoroughness in evaluating model performance across various review tasks.
Conclusion: MMReview establishes a standardized foundation for developing automated peer review systems and addresses the critical gap in evaluating multimodal review capabilities of LLMs.
Abstract: With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose MMReview, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.
[11] EmoTale: An Enacted Speech-emotion Dataset in Danish
Maja J. Hjuler, Harald V. Skat-Rørdam, Line H. Clemmensen, Sneha Das
Main category: cs.CL
TL;DR: EmoTale is a new Danish/English emotional speech corpus that addresses the lack of datasets for smaller languages like Danish, with SER models achieving 64.1% UAR using self-supervised embeddings.
Details
Motivation: There is a significant lack of functional emotional speech datasets for smaller languages like Danish, with only one existing database (DES from 1997) available, creating a need for modern, comprehensive emotional speech resources.
Method: Created the EmoTale corpus with Danish and English emotional speech recordings and emotion annotations. Developed speech emotion recognition models using self-supervised speech model embeddings and the openSMILE feature extractor, evaluated with leave-one-speaker-out cross-validation.
Result: Self-supervised embeddings outperformed hand-crafted features. The best model achieved 64.1% unweighted average recall on EmoTale, which is comparable to performance on the existing DES database.
Conclusion: EmoTale provides a valid and functional emotional speech corpus for Danish, demonstrating that self-supervised learning approaches are effective for speech emotion recognition in smaller languages, with performance matching existing benchmarks.
Abstract: While multiple emotional speech corpora exist for commonly spoken languages, there is a lack of functional datasets for smaller (spoken) languages, such as Danish. To our knowledge, Danish Emotional Speech (DES), published in 1997, is the only other database of Danish emotional speech. We present EmoTale; a corpus comprising Danish and English speech recordings with their associated enacted emotion annotations. We demonstrate the validity of the dataset by investigating and presenting its predictive power using speech emotion recognition (SER) models. We develop SER models for EmoTale and the reference datasets using self-supervised speech model (SSLM) embeddings and the openSMILE feature extractor. We find the embeddings superior to the hand-crafted features. The best model achieves an unweighted average recall (UAR) of 64.1% on the EmoTale corpus using leave-one-speaker-out cross-validation, comparable to the performance on DES.
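For readers unfamiliar with the evaluation protocol, here is a minimal sketch of leave-one-speaker-out cross-validation with unweighted average recall; the embeddings (e.g., from a self-supervised speech model) are assumed precomputed, and the classifier choice is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import LeaveOneGroupOut

def loso_uar(X, y, speakers):
    """X: (n, d) utterance embeddings; y: integer emotion labels; speakers: group ids."""
    preds = np.empty_like(y)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        preds[test_idx] = clf.predict(X[test_idx])
    # UAR is macro-averaged recall: every emotion class counts equally.
    return recall_score(y, preds, average="macro")
```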
[12] DPad: Efficient Diffusion Language Models with Suffix Dropout
Xinhua Chen, Sitao Huang, Cong Guo, Chiyue Wei, Yintao He, Jianyi Zhang, Hai “Hellen” Li, Yiran Chen
Main category: cs.CL
TL;DR: DPad is a training-free method that speeds up diffusion-based LLMs by using a sliding window and distance-decay dropout to reduce redundant suffix token computations, achieving 61.4× speedup while maintaining accuracy.
Details
Motivation: Diffusion-based LLMs suffer from high computational overhead because they predict all future suffix tokens at each decoding step while only keeping a small fraction, creating significant redundancy.
Method: DPad uses two strategies: (1) a sliding window that maintains a fixed-length suffix window, and (2) distance-decay dropout that deterministically removes distant suffix tokens before attention computation.
Result: DPad achieves up to 61.4× speedup over vanilla diffusion-based LLMs while maintaining comparable accuracy across multiple benchmarks on LLaDA-1.5 and Dream models.
Conclusion: DPad provides an efficient and scalable solution for long-sequence inference in diffusion-based LLMs, requiring minimal code changes and being compatible with existing optimizations like prefix caching.
Abstract: Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad delivers up to 61.4× speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference. Our code is available at https://github.com/Crys-Chen/DPad.
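Reading the two strategies literally, a suffix-selection rule could look like the sketch below; this is one plausible deterministic schedule, not DPad's actual rule (see the linked repository for that).

```python
import numpy as np

def suffix_keep_mask(cur_pos, seq_len, window=32, decay=0.05):
    """Boolean mask over suffix positions cur_pos+1 .. seq_len-1."""
    d = np.arange(1, seq_len - cur_pos)      # distance of each suffix token
    in_window = d <= window                  # (i) fixed-length sliding window
    # (ii) distance-decay dropout: beyond the window, keep tokens at
    # deterministically growing intervals, so fewer survive with distance.
    stride = np.ceil(np.exp(decay * np.clip(d - window, 0, None))).astype(int)
    return in_window | (d % stride == 0)
```

Only the kept suffix positions would participate in attention at a given decoding step; everything else is dropped before the computation.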
[13] Comparing energy consumption and accuracy in text classification inference
Johannes Zschache, Tilman Hartwig
Main category: cs.CL
TL;DR: This paper analyzes energy efficiency in LLM inference for text classification, finding that accuracy and energy efficiency can coexist, while larger models consume more energy with lower accuracy.
Details
Motivation: Address the lack of attention to energy consumption during the LLM inference phase compared to training, and concerns about sustainability in NLP applications.
Method: Systematic empirical evaluation of trade-offs between model accuracy and energy consumption across various model architectures and hardware configurations for text classification inference.
Result: Best-performing models can be energy-efficient; larger LLMs consume significantly more energy with lower accuracy; energy consumption varies widely (<mWh to >kWh), influenced by model type, model size, and hardware specifications.
Conclusion: Execution time can serve as proxy for energy usage; findings provide actionable insights for sustainable AI development, helping balance performance and resource efficiency in NLP applications.
Abstract: The increasing deployment of large language models (LLMs) in natural language processing (NLP) tasks raises concerns about energy efficiency and sustainability. While prior research has largely focused on energy consumption during model training, the inference phase has received comparatively less attention. This study systematically evaluates the trade-offs between model accuracy and energy consumption in text classification inference across various model architectures and hardware configurations. Our empirical analysis shows that the best-performing model in terms of accuracy can also be energy-efficient, while larger LLMs tend to consume significantly more energy with lower classification accuracy. We observe substantial variability in inference energy consumption (<mWh to >kWh), influenced by model type, model size, and hardware specifications. Additionally, we find a strong correlation between inference energy consumption and model runtime, indicating that execution time can serve as a practical proxy for energy usage in settings where direct measurement is not feasible. These findings have implications for sustainable AI development, providing actionable insights for researchers, industry practitioners, and policymakers seeking to balance performance and resource efficiency in NLP applications.
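Since the study finds runtime to be a strong proxy for energy, a minimal measurement harness could be as simple as the sketch below (where hardware counters are available, tools such as CodeCarbon estimate energy directly); `classify` is any inference callable.

```python
import time

def timed_predictions(classify, texts):
    """Return predictions plus mean wall-clock seconds per example."""
    results, total = [], 0.0
    for text in texts:
        start = time.perf_counter()
        results.append(classify(text))
        total += time.perf_counter() - start
    return results, total / len(texts)   # runtime stands in for energy use
```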
[14] Let’s Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper
Krishna Garg, Firoz Shaikh, Sambaran Bandyopadhyay, Cornelia Caragea
Main category: cs.CL
TL;DR: SciIG task evaluates LLMs’ ability to generate research paper introductions from titles, abstracts, and related works. LLaMA-4 Maverick performs best, especially in semantic similarity and faithfulness, with three-shot prompting being most effective.
Details
Motivation: As LLMs become writing assistants, generating high-quality research paper introductions remains challenging yet essential for academic writing support.
Method: Created the SciIG task with datasets from NAACL 2025 and ICLR 2025 papers. Evaluated 5 state-of-the-art models using automated metrics and LLM-as-a-judge across multiple dimensions including lexical overlap, semantic similarity, content coverage, faithfulness, consistency, citation correctness, and narrative quality.
Result: LLaMA-4 Maverick showed superior performance on most metrics, particularly excelling in semantic similarity and faithfulness. Three-shot prompting consistently outperformed fewer-shot approaches.
Conclusion: Provides practical insights for developing effective research writing assistants and sets realistic expectations for LLM-assisted academic writing. All code and datasets will be publicly released for reproducibility.
Abstract: As researchers increasingly adopt LLMs as writing assistants, generating high-quality research paper introductions remains both challenging and essential. We introduce Scientific Introduction Generation (SciIG), a task that evaluates LLMs’ ability to produce coherent introductions from titles, abstracts, and related works. Curating new datasets from NAACL 2025 and ICLR 2025 papers, we assess five state-of-the-art models, including both open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems, across multiple dimensions: lexical overlap, semantic similarity, content coverage, faithfulness, consistency, citation correctness, and narrative quality. Our comprehensive framework combines automated metrics with LLM-as-a-judge evaluations. Results demonstrate LLaMA-4 Maverick’s superior performance on most metrics, particularly in semantic similarity and faithfulness. Moreover, three-shot prompting consistently outperforms fewer-shot approaches. These findings provide practical insights into developing effective research writing assistants and set realistic expectations for LLM-assisted academic writing. To foster reproducibility and future research, we will publicly release all code and datasets.
[15] Disentangling concept semantics via multilingual averaging in Sparse Autoencoders
Cliff O’Reilly, Ernesto Jimenez-Ruiz, Tillman Weyde
Main category: cs.CL
TL;DR: Averaging concept activations across multiple languages in LLMs helps isolate pure semantic relationships between ontology classes, outperforming single-language analysis.
Details
Motivation: To address LLMs' shortcomings by connecting them with formal knowledge representation and isolating concept semantics from syntactic and language-specific information.
Method: Using Sparse Autoencoders on the Gemma 2B LLM to obtain concept activations for OWL ontology classes in English, French, and Chinese, then averaging across languages to derive conceptual representations.
Result: Conceptual averages from multiple languages align better with true relationships between ontology classes compared to single-language activations alone.
Conclusion: This technique enables more accurate mechanistic interpretation of internal network states by isolating pure semantics through cross-language averaging.
Abstract: Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Language Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese and then pass these texts as prompts to the Gemma 2B LLM. Using the open source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average. We then correlate the conceptual averages with a ground truth mapping between ontology classes. Our results give a strong indication that the conceptual average aligns with the true relationship between classes more closely than a single language by itself. The result hints at a new technique which enables mechanistic interpretation of internal network states with higher accuracy.
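The core averaging step is simple enough to sketch; shapes are illustrative (Gemma Scope SAEs have on the order of 16k features), and the correlation with the ground-truth mapping is only indicated.

```python
import numpy as np

def conceptual_average(acts_by_lang):
    """acts_by_lang: dict lang -> (n_classes, n_features) SAE activation matrix."""
    return np.mean(np.stack(list(acts_by_lang.values())), axis=0)

def class_similarities(acts):
    normed = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    return normed @ normed.T             # cosine similarity between classes

# Toy shapes: 10 ontology classes x 16384 SAE features per language.
acts = {lang: np.random.rand(10, 16384) for lang in ("en", "fr", "zh")}
avg_sims = class_similarities(conceptual_average(acts))
# avg_sims would then be correlated against the ground-truth class mapping.
```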
[16] GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs
Adrian-Marius Dumitran, Alexandra-Mihaela Danila, Angela-Liliana Dumitran
Main category: cs.CL
TL;DR: GRILE is the first open benchmark for Romanian language testing with 1,151 multiple-choice questions from high-stakes exams, evaluating LLMs’ answer accuracy and explanation quality.
Details
Motivation: To assess the pedagogical value of LLMs for low-resource languages like Romanian, where their educational capabilities remain unclear despite NLP advancements.
Method: Created the GRILE benchmark with questions from Romanian national exams, tested 7 multilingual and Romanian-specific LLMs on answer selection accuracy and linguistic explanation quality evaluated by experts.
Result: Gemini 2.5 Pro achieved 83% accuracy, but most open-weight models stayed below 65%, with 48% of explanations containing factual or pedagogical flaws. Systematic weaknesses found in morphology and DOOM3 orthographic norms.
Conclusion: GRILE exposes significant challenges for trustworthy educational NLP in low-resource settings and serves as a test-bed for controllable explanation generation and evaluation, with all data and code released for future research.
Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP), yet their pedagogical value for low-resource languages remains unclear. We present GRILE (Grammar Romanian Inference and Language Explanations), the first open benchmark of 1,151 multiple-choice questions harvested from Romanian high-stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state-of-the-art multilingual and Romanian-specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically accurate explanations. While Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM3 orthographic norms. All data, code and a public web demo are released to catalyze future research. Our findings expose open challenges for trustworthy educational NLP in low-resource settings and establish GRILE as a new test-bed for controllable explanation generation and evaluation.
[17] Customizing Speech Recognition Model with Large Language Model Feedback
Shaoshi Ling, Guoli Ye
Main category: cs.CL
TL;DR: Reinforcement learning approach using LLMs as reward models for unsupervised ASR domain adaptation, improving entity recognition by 21% over self-training methods.
Details
Motivation: ASR systems struggle with rare named entities and domain mismatches, while LLMs excel across domains; this motivates leveraging LLM capabilities to improve ASR transcription quality without labeled data.
Method: Uses reinforcement learning with an LLM as the reward model to score ASR hypotheses. The LLM provides reward signals based on contextual information to fine-tune the ASR model using unlabeled data.
Result: Achieves 21% improvement on entity word error rate compared to conventional self-training methods.
Conclusion: LLM-guided reinforcement learning effectively enhances ASR domain adaptation, particularly for named entity recognition, without requiring labeled data.
Abstract: Automatic speech recognition (ASR) systems have achieved strong performance on general transcription tasks. However, they continue to struggle with recognizing rare named entities and adapting to domain mismatches. In contrast, large language models (LLMs), trained on massive internet-scale datasets, are often more effective across a wide range of domains. In this work, we propose a reinforcement learning based approach for unsupervised domain adaptation, leveraging unlabeled data to enhance transcription quality, particularly for named entities affected by domain mismatch, through feedback from an LLM. Given contextual information, our framework employs an LLM as the reward model to score the hypotheses from the ASR model. These scores serve as reward signals to fine-tune the ASR model via reinforcement learning. Our method achieves a 21% improvement on entity word error rate over conventional self-training methods.
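A minimal sketch of the reward step is shown below; `llm_score` stands in for any prompted judge model, the prompt wording is invented, and the actual policy-gradient update on the ASR model is only indicated.

```python
def hypothesis_rewards(hypotheses, context, llm_score):
    """Score N-best ASR hypotheses with an LLM, given domain context."""
    template = ("Context: {ctx}\n"
                "Transcript candidate: {hyp}\n"
                "Rate the plausibility of this transcript from 0 to 10:")
    rewards = []
    for hyp in hypotheses:
        reply = llm_score(template.format(ctx=context, hyp=hyp))
        rewards.append(float(reply.strip()))   # assumes the judge returns a number
    return rewards

# These rewards would then weight the ASR model's update on unlabeled audio,
# e.g., reinforcing the tokens of high-reward hypotheses.
```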
[18] Tokens with Meaning: A Hybrid Tokenization Approach for NLP
M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik
Main category: cs.CL
TL;DR: A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation, specifically designed for morphologically rich languages like Turkish, achieving superior performance on linguistic benchmarks.
Details
Motivation: Existing subword tokenization methods like BPE and WordPiece struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure, leading to inefficient and less interpretable tokenization.
Method: Hybrid framework using phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. Integrates BPE for out-of-vocabulary coverage while maintaining morphological coherence. Includes special tokens for whitespace, case, and uppercase markers.
Result: Achieved highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%) on TR-MMLU benchmark. Outperformed tokenizers from LLaMA, Gemma, and GPT in producing more linguistically meaningful and coherent tokens.
Conclusion: The approach provides a language-independent framework for more interpretable and effective multilingual NLP systems, demonstrated successfully on Turkish but adaptable to other morphologically rich languages.
Abstract: Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabı), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
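To illustrate the root-dictionary-first, BPE-fallback control flow, here is a toy sketch; the two-entry dictionaries and the `bpe` callable are illustrative stand-ins, not the paper's resources.

```python
ROOTS = {"kitap": "kitap", "kitab": "kitap"}          # variant roots share one ID
AFFIXES = {"lar": "lAr", "ler": "lAr", "ı": "POSS3"}  # phonological variants merged

def tokenize(word, bpe):
    for i in range(len(word), 0, -1):          # longest-match root lookup
        root, rest = word[:i], word[i:]
        if root in ROOTS:
            tokens = [ROOTS[root]]
            while rest:                        # greedily strip known affixes
                for j in range(len(rest), 0, -1):
                    if rest[:j] in AFFIXES:
                        tokens.append(AFFIXES[rest[:j]])
                        rest = rest[j:]
                        break
                else:
                    return tokens + bpe(rest)  # unknown tail: statistical fallback
            return tokens
    return bpe(word)                           # no known root: pure BPE

print(tokenize("kitapları", bpe=list))         # -> ['kitap', 'lAr', 'POSS3']
```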
[19] Chain of Correction for Full-text Speech Recognition with Large Language Models
Zhiyuan Tang, Dong Wang, Zhikai Zhou, Yong Liu, Shen Huang, Shidong Shang
Main category: cs.CL
TL;DR: Chain of Correction (CoC) method uses multi-turn chat format to correct ASR errors segment by segment, outperforming baseline systems while balancing under-correction and over-rephrasing.
Details
Motivation: Address challenges in full-text ASR error correction including stability, controllability, completeness, and fluency issues with current LLM approaches.
Method: Proposes the CoC framework, which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context. Fine-tunes a pre-trained LLM on the ChFT dataset.
Result: Significantly outperforms baseline and benchmark systems in correcting full-text ASR outputs. Analyzes correction thresholds and extrapolates to extra-long ASR outputs.
Conclusion: CoC effectively addresses ASR error correction challenges and provides balanced correction performance while maintaining semantic understanding through contextual guidance.
Abstract: Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) is attracting increased attention for its ability to address a wide range of error types, such as punctuation restoration and inverse text normalization, across long context. However, challenges remain regarding stability, controllability, completeness, and fluency. To mitigate these issues, this paper proposes the Chain of Correction (CoC), which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. Utilizing the open-sourced ChFT dataset, we fine-tune a pre-trained LLM to evaluate CoC’s performance. Experiments show that CoC significantly outperforms baseline and benchmark systems in correcting full-text ASR outputs. We also analyze correction thresholds to balance under-correction and over-rephrasing, extrapolate CoC on extra-long ASR outputs, and explore using other types of information to guide error correction.
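The multi-turn layout is easy to picture as a chat transcript; the sketch below shows one way to assemble it, with invented prompt wording (CoC's actual prompts come from the ChFT fine-tuning data).

```python
def build_coc_messages(segments, full_text):
    """Chat messages that correct ASR output one segment per turn."""
    messages = [{
        "role": "system",
        "content": ("You correct ASR errors segment by segment. "
                    f"Full recognized text for context:\n{full_text}"),
    }]
    for segment in segments:
        messages.append({
            "role": "user",
            "content": f"Correct this segment, changing as little as possible:\n{segment}",
        })
        # In use, the assistant's corrected segment is appended here before the
        # next turn, so earlier corrections condition the later ones.
    return messages
```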
[20] A Joint Multitask Model for Morpho-Syntactic Parsing
Demian Inostroza, Mel Mistica, Ekaterina Vylomova, Chris Guest, Kemal Kurniawan
Main category: cs.CL
TL;DR: A joint multitask model using XLM-RoBERTa with three specialized decoders achieves state-of-the-art performance on morpho-syntactic parsing across nine diverse languages, with ablation studies showing tokenization and content word identification are critical.
Details
Motivation: To develop a unified system that can handle both morphological and syntactic analyses following the novel UD annotation scheme for the UniDive 2025 shared task across multiple typologically diverse languages.
Method: Uses a shared XLM-RoBERTa encoder with three specialized decoders for content word identification, dependency parsing, and morphosyntactic feature prediction in a joint multitask learning framework.
Result: Achieved best overall performance with average MSLAS 78.7%, LAS 80.1%, and Feats F1 90.3% across nine languages. Ablation studies confirmed importance of gold tokenization and content word identification.
Conclusion: The joint multitask approach is effective for morpho-syntactic parsing, though the model still struggles with core grammatical cases (Nom-Acc) and nominal features across different languages.
Abstract: We present a joint multitask model for the UniDive 2025 Morpho-Syntactic Parsing shared task, where systems predict both morphological and syntactic analyses following a novel UD annotation scheme. Our system uses a shared XLM-RoBERTa encoder with three specialized decoders for content word identification, dependency parsing, and morphosyntactic feature prediction. Our model achieves the best overall performance on the shared task's leaderboard covering nine typologically diverse languages, with an average MSLAS score of 78.7 percent, LAS of 80.1 percent, and Feats F1 of 90.3 percent. Our ablation studies show that matching the task's gold tokenization and content word identification are crucial to model performance. Error analysis reveals that our model struggles with core grammatical cases (particularly Nom-Acc) and nominal features across languages.
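Architecturally, the system reduces to one shared encoder feeding three heads; the sketch below simplifies the decoders (real dependency parsing typically uses biaffine scoring rather than a single bilinear map).

```python
import torch.nn as nn
from transformers import AutoModel

class JointParser(nn.Module):
    def __init__(self, n_feats, hidden=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("xlm-roberta-base")
        self.content_head = nn.Linear(hidden, 2)      # content word: yes/no
        self.arc_proj = nn.Linear(hidden, hidden)     # stand-in for biaffine arcs
        self.feats_head = nn.Linear(hidden, n_feats)  # morphosyntactic features

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        arc_scores = self.arc_proj(h) @ h.transpose(1, 2)   # token-pair arc scores
        return self.content_head(h), arc_scores, self.feats_head(h)
```

Training would simply sum the three task losses over the shared encoder.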
[21] Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
Aman Goel, Daniel Schwartz, Yanjun Qi
Main category: cs.CL
TL;DR: Finch-Zk is a black-box framework that detects and mitigates hallucinations in LLM outputs using cross-model consistency checking without external knowledge sources.
Details
Motivation: LLMs are susceptible to generating plausible but factually inaccurate content (hallucinations), requiring practical solutions to enhance factual reliability in production systems.
Method: Uses fine-grained cross-model consistency checking by comparing responses from diverse models on semantically-equivalent prompts, plus targeted mitigation techniques to correct problematic segments while preserving accurate content.
Result: Improves hallucination detection F1 scores by 6-39% on FELM dataset and achieves 7-8 percentage points improvement in answer accuracy on GPQA-diamond dataset with state-of-the-art models.
Conclusion: Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems across multiple models.
Abstract: Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations–generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages FINe-grained Cross-model consistency to detect and mitigate Hallucinations in LLM outputs without requiring external knowledge sources. Finch-Zk introduces two key innovations:
1) a cross-model consistency checking strategy that reveals fine-grained inaccuracies by comparing responses generated by diverse models from semantically-equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Experiments on the FELM dataset show Finch-Zk improves hallucination detection F1 scores by 6-39% compared to existing approaches. For mitigation, Finch-Zk achieves 7-8 absolute percentage points improvement in answer accuracy on the GPQA-diamond dataset when applied to state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet. Extensive evaluation across multiple models demonstrates that Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems.
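A rough shape of the detection step, as I read it: query diverse models with semantically-equivalent prompts and flag poorly supported segments. The `models` callables and `entails` checker are stand-ins, not Finch-Zk's actual components.

```python
def flag_inconsistent_segments(prompt_variants, models, entails):
    """Return sentences of the first answer that the other answers fail to support."""
    answers = [model(prompt) for model in models for prompt in prompt_variants]
    reference, others = answers[0], answers[1:]
    flagged = []
    for sentence in reference.split(". "):
        # A segment is suspect if most other answers do not entail it.
        support = sum(entails(answer, sentence) for answer in others)
        if support < len(others) / 2:
            flagged.append(sentence)
    return flagged   # candidates for targeted mitigation
```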
[22] SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing
Jing Chen, Zhiheng Yang, Yixian Shen, Jie Liu, Adam Belloum, Chrysa Papagainni, Paola Grosso
Main category: cs.CL
TL;DR: SurveyGen-I is an automatic survey generation framework that uses coarse-to-fine retrieval, adaptive planning, and memory-guided generation to create coherent, well-cited scientific surveys.
Details
Motivation: Existing LLM-based survey generation approaches struggle with maintaining coherence across long multi-section surveys and providing comprehensive citation coverage, limiting their effectiveness in scientific communication.
Method: Combines coarse-to-fine retrieval (survey-level then subsection-level), adaptive planning, and memory-guided generation with a memory mechanism that stores previously written content to ensure coherence across subsections.
Result: Experiments across four scientific domains show SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage.
Conclusion: SurveyGen-I successfully addresses the limitations of previous LLM-based survey generation methods by ensuring coherence and comprehensive citation coverage through its innovative memory-guided approach.
Abstract: Survey papers play a critical role in scientific communication by consolidating progress across a field. Recent advances in Large Language Models (LLMs) offer a promising solution by automating key steps in the survey-generation pipeline, such as retrieval, structuring, and summarization. However, existing LLM-based approaches often struggle with maintaining coherence across long, multi-section surveys and providing comprehensive citation coverage. To address these limitations, we introduce SurveyGen-I, an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation. SurveyGen-I first performs survey-level retrieval to construct the initial outline and writing plan, and then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. When the system detects insufficient context, it triggers fine-grained subsection-level retrieval. During generation, SurveyGen-I leverages this memory mechanism to maintain coherence across subsections. Experiments across four scientific domains demonstrate that SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage.
[23] Beyond Semantic Similarity: Reducing Unnecessary API Calls via Behavior-Aligned Retriever
Yixin Chen, Ying Xiong, Shangyu Wu, Yufei Cui, Xue Liu, Nan Guan, Chun Jason Xue
Main category: cs.CL
TL;DR: A behavior-aligned retriever (BAR) is trained to provide consistent demonstrations that help LLMs make more accurate tool-using decisions, reducing erroneous function calls while maintaining task performance.
Details
Motivation: Existing methods for tool-augmented LLMs suffer from high training overhead and inconsistent demonstration samples that misguide function invocation behavior, leading to inefficiencies and increased costs.
Method: Constructed a corpus with different function-calling behaviors, trained a behavior-aligned retriever using contrastive learning with customized positive/negative pairs and a dual-negative contrastive loss to ensure robust retrieval of behaviorally consistent examples.
Result: Significantly reduces erroneous function calls while maintaining high task performance.
Conclusion: Offers a cost-effective and efficient solution for tool-augmented LLMs by providing behaviorally consistent demonstrations to guide accurate tool-using decisions.
Abstract: Tool-augmented large language models (LLMs) leverage external functions to extend their capabilities, but inaccurate function calls can lead to inefficiencies and increased costs. Existing methods address this challenge by fine-tuning LLMs or using demonstration-based prompting, yet they often suffer from high training overhead and fail to account for inconsistent demonstration samples, which misguide the model's invocation behavior. In this paper, we trained a behavior-aligned retriever (BAR), which provides behaviorally consistent demonstrations to help LLMs make more accurate tool-using decisions. To train the BAR, we construct a corpus including different function-calling behaviors, i.e., calling or non-calling. We use the contrastive learning framework to train the BAR with customized positive/negative pairs and a dual-negative contrastive loss, ensuring robust retrieval of behaviorally consistent examples. Experiments demonstrate that our approach significantly reduces erroneous function calls while maintaining high task performance, offering a cost-effective and efficient solution for tool-augmented LLMs.
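The summary names a dual-negative contrastive loss without defining it; one plausible form, my reading rather than the paper's exact formulation, is an InfoNCE-style objective with separately weighted behavior-inconsistent and random negatives:

```python
import torch
import torch.nn.functional as F

def dual_negative_loss(q, pos, behav_neg, rand_neg, tau=0.05, alpha=1.0):
    """q, pos: (B, D); behav_neg, rand_neg: (B, K, D) embeddings."""
    q, pos = F.normalize(q, dim=-1), F.normalize(pos, dim=-1)
    behav_neg = F.normalize(behav_neg, dim=-1)
    rand_neg = F.normalize(rand_neg, dim=-1)
    s_pos = (q * pos).sum(-1, keepdim=True) / tau              # (B, 1)
    s_behav = torch.einsum("bd,bkd->bk", q, behav_neg) / tau   # (B, K)
    s_rand = torch.einsum("bd,bkd->bk", q, rand_neg) / tau     # (B, K)
    # Weight behavior-inconsistent negatives separately from random ones.
    logits = torch.cat([s_pos, alpha * s_behav, s_rand], dim=1)
    labels = torch.zeros(q.size(0), dtype=torch.long)          # positive sits at index 0
    return F.cross_entropy(logits, labels)
```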
[24] ISCA: A Framework for Interview-Style Conversational Agents
Charles Welch, Allison Lahnala, Vasudha Varadarajan, Lucie Flek, Rada Mihalcea, J. Lomax Boyd, João Sedoc
Main category: cs.CL
TL;DR: A low-compute non-generative system for interview-style conversational agents that facilitates qualitative data collection through controlled interactions and quantitative analysis, with easy online adjustment capabilities.
Details
Motivation: To create a tool for qualitative data collection that provides control and standardization over conversational flow, particularly useful for tracking attitude formation and behavior change in research applications.
Method: Developed a non-generative system with an online administrative panel that allows easy creation and adjustment of interviews without coding, enabling controlled conversational interactions.
Result: Successfully implemented the system with two case studies: Expressive Interviewing for COVID-19 and a semi-structured interview for public opinion on neurotechnology, demonstrating practical applications.
Conclusion: The system provides an accessible, open-source solution for researchers to conduct controlled interview-style conversations for data collection, with potential for extensions and broader applications in qualitative research.
Abstract: We present a low-compute non-generative system for implementing interview-style conversational agents which can be used to facilitate qualitative data collection through controlled interactions and quantitative analysis. Use cases include applications to tracking attitude formation or behavior change, where control or standardization over the conversational flow is desired. We show how our system can be easily adjusted through an online administrative panel to create new interviews, making the tool accessible without coding. Two case studies are presented as example applications, one regarding the Expressive Interviewing system for COVID-19 and the other a semi-structured interview to survey public opinion on emerging neurotechnology. Our code is open-source, allowing others to build off of our work and develop extensions for additional functionality.
[25] ZPD-SCA: Unveiling the Blind Spots of LLMs in Assessing Students’ Cognitive Abilities
Wenhan Dong, Zhen Sun, Yuemeng Zhao, Zifan Peng, Jun Wu, Jingyi Zheng, Yule Liu, Xinlei He, Yu Wang, Ruiming Wang, Xinyi Huang, Lei Mo
Main category: cs.CL
TL;DR: LLMs struggle with zero-shot assessment of Chinese reading comprehension difficulty aligned with students’ cognitive abilities, but show improvement with in-context examples, revealing both emerging capabilities and systematic biases.
Details
Motivation: To address the gap in evaluating LLMs' ability to assess reading material difficulty according to the Zone of Proximal Development principle, particularly for Chinese language education, where comprehensive studies are lacking.
Method: Introduces the ZPD-SCA benchmark, annotated by Special Grade teachers (the top 0.15% of in-service teachers nationwide), testing LLMs in zero-shot and in-context learning scenarios across different student age groups and reading genres.
Result: LLMs perform poorly in zero-shot learning (some below random guessing), improve substantially with in-context examples (nearly double accuracy), but show systematic directional biases and genre-dependent performance variations.
Conclusion: LLMs possess emerging but limited abilities for educational alignment assessment; ZPD-SCA provides foundation for evaluating and improving LLMs in cognitively aligned educational applications.
Abstract: Large language models (LLMs) have demonstrated potential in educational applications, yet their capacity to accurately assess the cognitive alignment of reading materials with students’ developmental stages remains insufficiently explored. This gap is particularly critical given the foundational educational principle of the Zone of Proximal Development (ZPD), which emphasizes the need to match learning resources with Students’ Cognitive Abilities (SCA). Despite the importance of this alignment, there is a notable absence of comprehensive studies investigating LLMs’ ability to evaluate reading comprehension difficulty across different student age groups, especially in the context of Chinese language education. To fill this gap, we introduce ZPD-SCA, a novel benchmark specifically designed to assess stage-level Chinese reading comprehension difficulty. The benchmark is annotated by 60 Special Grade teachers, a group that represents the top 0.15% of all in-service teachers nationwide. Experimental results reveal that LLMs perform poorly in zero-shot learning scenarios, with Qwen-max and GLM even falling below the probability of random guessing. When provided with in-context examples, LLMs’ performance improves substantially, with some models achieving nearly double the accuracy of their zero-shot baselines. These results reveal that LLMs possess emerging abilities to assess reading difficulty, while also exposing limitations in their current training for educationally aligned judgment. Notably, even the best-performing models display systematic directional biases, suggesting difficulties in accurately aligning material difficulty with SCA. Furthermore, significant variations in model performance across different genres underscore the complexity of the task. We envision that ZPD-SCA can provide a foundation for evaluating and improving LLMs in cognitively aligned educational applications.
[26] Credence Calibration Game? Calibrating Large Language Models through Structured Play
Ke Fang, Tianyi Zhao, Lu Cheng
Main category: cs.CL
TL;DR: A novel prompt-based calibration framework for LLMs that uses game-inspired feedback loops to improve confidence estimation without additional supervision or parameter updates.
Details
Motivation: LLMs are increasingly used in decision-critical domains where accurate confidence estimates are essential, but existing calibration methods require additional supervision or parameter updates.
Method: A prompt-based framework inspired by the Credence Calibration Game, featuring structured interaction loops where LLMs receive feedback on confidence-correctness alignment through feedback-driven prompting and natural language performance summaries.
Result: Extensive experiments across models and game configurations show consistent improvements in evaluation metrics, demonstrating effective calibration.
Conclusion: Game-based prompting is an effective strategy for LLM calibration that doesn’t require additional supervision or parameter updates, with potential for broader application.
Abstract: As Large Language Models (LLMs) are increasingly deployed in decision-critical domains, it becomes essential to ensure that their confidence estimates faithfully correspond to their actual correctness. Existing calibration methods have primarily focused on post-hoc adjustments or auxiliary model training; however, many of these approaches necessitate additional supervision or parameter updates. In this work, we propose a novel prompt-based calibration framework inspired by the Credence Calibration Game. Our method establishes a structured interaction loop wherein LLMs receive feedback based on the alignment of their predicted confidence with correctness. Through feedback-driven prompting and natural language summaries of prior performance, our framework dynamically improves model calibration. Extensive experiments across models and game configurations demonstrate consistent improvements in evaluation metrics. Our results highlight the potential of game-based prompting as an effective strategy for LLM calibration. Code and data are available at https://anonymous.4open.science/r/LLM-Calibration/.
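A minimal sketch of one round of such a feedback loop; `ask_llm` is a hypothetical completion helper and the prompt wording is illustrative, not the paper's:

```python
def calibration_round(ask_llm, question, gold_answer, history):
    """One round of a credence-calibration game: summarize past
    confidence/correctness alignment, then ask for an answer plus confidence.
    `ask_llm` must return (answer_text, confidence_float); it is hypothetical."""
    summary = "\n".join(
        f"Round {i}: confidence {h['conf']:.2f}, {'correct' if h['ok'] else 'wrong'}"
        for i, h in enumerate(history, 1)
    ) or "no rounds played yet"
    prompt = (
        "You are playing a calibration game. Your record so far:\n"
        f"{summary}\n"
        "Well-calibrated players' stated confidence matches their accuracy.\n"
        f"Question: {question}\nGive your answer and a confidence in [0, 1]."
    )
    answer, conf = ask_llm(prompt)
    history.append({"conf": conf, "ok": answer.strip() == gold_answer})
    return answer, conf
```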
[27] DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement
Yupei Yang, Fan Feng, Lin Yang, Wanxi Deng, Lin Qu, Biwei Huang, Shikui Tu, Lei Xu
Main category: cs.CL
TL;DR: DEPTH framework reduces relation extraction hallucinations by using dependency-aware sentence simplification and two-tiered hierarchical refinement, achieving 17.2% F1 improvement and reducing hallucinations to 7.0%.
Details
Motivation: LLMs struggle with reliably determining relation existence in complex sentences, leading to spurious predictions that compromise knowledge graph integrity and downstream reliability.
Method: Two-stage framework: (1) Grounding module extracts relations using shortest dependency paths to create minimal relational contexts, (2) Refinement module aggregates local predictions and revises them holistically. Includes causality-driven reward model for robust RLHF fine-tuning.
Result: Achieves 17.2% average F1 score improvement over SOTA baselines and reduces hallucination rate to 7.0% across six benchmarks.
Conclusion: DEPTH effectively addresses LLM hallucination issues in relation extraction through syntactic simplification and hierarchical refinement, significantly improving reliability and performance.
Abstract: Relation extraction enables the construction of structured knowledge for many downstream applications. While large language models (LLMs) have shown great promise in this domain, most existing methods concentrate on relation classification, which predicts the semantic relation type between a related entity pair. However, we observe that LLMs often struggle to reliably determine whether a relation exists, especially in cases involving complex sentence structures or intricate semantics, which leads to spurious predictions. Such hallucinations can introduce noisy edges in knowledge graphs, compromising the integrity of structured knowledge and downstream reliability. To address these challenges, we propose DEPTH, a framework that integrates Dependency-aware sEntence simPlification and Two-tiered Hierarchical refinement into the relation extraction pipeline. Given a sentence and its candidate entity pairs, DEPTH operates in two stages: (1) the Grounding module extracts relations for each pair by leveraging their shortest dependency path, distilling the sentence into a minimal yet coherent relational context that reduces syntactic noise while preserving key semantics; (2) the Refinement module aggregates all local predictions and revises them based on a holistic understanding of the sentence, correcting omissions and inconsistencies. We further introduce a causality-driven reward model that mitigates reward hacking by disentangling spurious correlations, enabling robust fine-tuning via reinforcement learning with human feedback. Experiments on six benchmarks demonstrate that DEPTH reduces the average hallucination rate to 7.0% while achieving a 17.2% improvement in average F1 score over state-of-the-art baselines.
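The grounding step relies on shortest dependency paths; the sketch below shows how such a path can be extracted with spaCy and networkx as stand-in tools (the paper's actual pipeline may differ):

```python
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def shortest_dependency_path(sentence, head1, head2):
    """Return the tokens on the shortest dependency path between two entity
    head words, a sketch of DEPTH's grounding idea."""
    doc = nlp(sentence)
    # Treat the dependency tree as an undirected graph of token indices,
    # since the path may go up toward a common ancestor and back down.
    graph = nx.Graph((tok.i, child.i) for tok in doc for child in tok.children)
    i1 = next(t.i for t in doc if t.text == head1)
    i2 = next(t.i for t in doc if t.text == head2)
    return [doc[i].text for i in nx.shortest_path(graph, source=i1, target=i2)]

# e.g. shortest_dependency_path("The CEO of Acme met investors.", "CEO", "Acme")
# -> ['CEO', 'of', 'Acme']
```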
[28] Cognitive Surgery: The Awakening of Implicit Territorial Awareness in LLMs
Yinghan Zhou, Weifeng Zhu, Juan Wen, Wanli Peng, Zhengxian Wu, Yiming Xue
Main category: cs.CL
TL;DR: LLMs struggle with self-recognition in individual text scenarios due to unexpressed latent abilities. The paper proposes Cognitive Surgery (CoSur) to awaken this capability, significantly improving performance.
Details
Motivation: Large language models show self-recognition capabilities in paired text scenarios but fail in individual text judgment. The underlying causes of this performance gap haven't been systematically analyzed.
Method: Proposed Cognitive Surgery (CoSur) framework with four modules: representation extraction, territory construction, authorship discrimination, and cognitive editing to awaken Implicit Territorial Awareness.
Result: CoSur improved performance of three different LLMs in individual presentation paradigm, achieving average accuracies of 83.25%, 66.19%, and 88.01% respectively.
Conclusion: The study successfully identified Implicit Territorial Awareness as the root cause of LLMs’ self-recognition failure and demonstrated that Cognitive Surgery can effectively awaken this latent capability.
Abstract: Large language models (LLMs) have been shown to possess a degree of self-recognition capability: the ability to identify whether a given text was generated by themselves. Prior work has demonstrated that this capability is reliably expressed under the Pair Presentation Paradigm (PPP), where the model is presented with two texts and asked to choose which one it authored. However, performance deteriorates sharply under the Individual Presentation Paradigm (IPP), where the model is given a single text to judge authorship. Although this phenomenon has been observed, its underlying causes have not been systematically analyzed. In this paper, we first replicate existing findings to confirm that LLMs struggle to distinguish self- from other-generated text under IPP. We then investigate the reasons for this failure and attribute it to a phenomenon we term Implicit Territorial Awareness (ITA): the model’s latent ability to distinguish self- and other-texts in representational space, which remains unexpressed in its output behavior. To awaken the ITA of LLMs, we propose Cognitive Surgery (CoSur), a novel framework comprising four main modules: representation extraction, territory construction, authorship discrimination and cognitive editing. Experimental results demonstrate that our proposed method improves the performance of three different LLMs in the IPP scenario, achieving average accuracies of 83.25%, 66.19%, and 88.01%, respectively.
[29] Knowledge Graph-Infused Fine-Tuning for Structured Reasoning in Large Language Models
Wuyang Zhang, Yexin Tian, Xiandong Meng, Mengjie Wang, Junliang Du
Main category: cs.CL
TL;DR: A fine-tuning framework that injects knowledge graph information into language models using GNN encoding and fusion mechanisms to improve entity-level semantic understanding and reasoning capabilities.
Details
Motivation: Address missing reasoning chains and insufficient entity-level semantic understanding in LLMs when handling tasks requiring structured knowledge.
Method: Knowledge graph injection framework using GNN to encode entities/relations, fusion mechanism to combine KG embeddings with contextual representations, gating mechanism to balance linguistic vs structural knowledge, and joint loss function for task performance and structural alignment.
Result: Significantly enhances model’s ability to represent complex semantic units, demonstrates better semantic consistency and contextual logic in structural reasoning and entity extraction tasks across entity recognition, QA, and language generation.
Conclusion: The proposed structure-aware fine-tuning framework effectively integrates structured knowledge into language models, improving entity prediction accuracy and semantic reasoning while maintaining robustness through systematic sensitivity validation.
Abstract: This paper addresses the problems of missing reasoning chains and insufficient entity-level semantic understanding in large language models when dealing with tasks that require structured knowledge. It proposes a fine-tuning algorithm framework based on knowledge graph injection. The method builds on pretrained language models and introduces structured graph information for auxiliary learning. A graph neural network is used to encode entities and their relations, constructing a graph-based semantic representation. A fusion mechanism is then designed to jointly model the knowledge graph embeddings with the contextual representations from the language model. To enhance the robustness of knowledge integration, a gating mechanism is introduced to dynamically balance the contributions of linguistic semantics and structural knowledge. This effectively mitigates conflicts between different representational spaces. During training, a joint loss function is constructed to account for both task performance and structural alignment objectives. This helps improve the accuracy of entity prediction and semantic reasoning. The study also includes a series of systematic sensitivity experiments. It evaluates the effects of learning rate, graph coverage, and structural perturbations on model performance. The results further validate the effectiveness and stability of the proposed method across tasks such as entity recognition, question answering, and language generation. Experimental findings show that the proposed structure-aware fine-tuning framework significantly enhances the model’s ability to represent complex semantic units. It demonstrates better semantic consistency and contextual logic modeling in scenarios involving structural reasoning and entity extraction.
[30] NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
NVIDIA, :, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adi Renduchintala, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal, Arham Mehta, Arun Venkatesan, Ashton Sharabiani, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Banghua Zhu, Barnaby Simkin, Bilal Kartal, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Brian Yu, Bryan Catanzaro, Charles Wang, Charlie Truong, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christian Munley, Christopher Parisien, Dan Su, Daniel Afrimi, Daniel Korzekwa, Daniel Rohrer, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Dima Rekesh, Dina Yared, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Eileen Long, Elliott Ning, Eric Chung, Erick Galinkin, Evelina Bakhturina, Gargi Prasad, Gerald Shen, Haim Elisha, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Hoo Chang Shin, Hua Huang, Iain Cunningham, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jimmy Zhang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jonathan Cohen, Joseph Jennings, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kezhi Kong, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Kushan Ahmadian, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Luis Vega, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Mark Cai, Markus Kliegl, Marta Stepniewska-Dziubinska, Matvei Novikov, Mehrzad Samadi, Meredith Price, Meriem Boubdir, Michael Boone, Michael Evans, Michal Bien, Michal Zawalski, Miguel Martinez, Mike Chrzanowski, Mohammad Shoeybi, Mostofa Patwary, Namit Dhameja, Nave Assaf, Negar Habibi, Nidhi Bhatia, Nikki Pope, Nima Tajbakhsh, Nirmal Kumar Juluru, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pablo Ribalta, Padmavathy Subramanian, Parth Chadha, Pavlo Molchanov, Peter Dykas, Peter Jin, Piotr Bialecki, Piotr Januszewski, Pradeep Thalasta, Prashant Gaikwad, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi Mahabadi, Rajen Patel, Ran El-Yaniv, Ranjit Rajan, Ria Cheruvu, Rima Shahbazyan, Ritika Borkar, Ritu Gala, Roger Waleffe, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Sahil Jain, Samuel Kriman, Sanjeev Satheesh, Saori Kaji, Sarah Yurick, Saurav Muralidharan, Sean Narenthiran, Seonmyeong Bak, Sepehr Sameni, Seungju Han, Shanmugam Ramasamy, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shizhe Diao, Shreya Gopal, Shrimai Prabhumoye, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Siddhartha Jain, Somshubra Majumdar, Stefania Alborghetti, Syeda Nahida Akter, Terry Kong, Tim Moon, Tomasz Hliwiak, Tomer Asida, Tony Wang, Twinkle Vashishth, Tyler Poon, Udi Karpas, Vahid Noroozi, Venkat Srinivasan, Vijay Korthikanti, Vikram Fugro, Vineeth Kalluru, Vitaly Kurin, Vitaly Lavrukhin, Wasi Uddin Ahmad, Wei Du, Wonmin Byeon, Ximing Lu, Xin Dong, Yashaswi Karnati, Yejin Choi, Yian Zhang, Ying Lin, Yonggan Fu, Yoshi Suhara, Zhen Dong, Zhiyu Li, Zhongbo Zhu, Zijia Chen
Main category: cs.CL
TL;DR: Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer model that achieves state-of-the-art reasoning accuracy with 6x higher throughput than similar-sized models, enabling 128k token inference on a single A10G GPU.
Details
Motivation: To improve inference speed for reasoning workloads while maintaining high accuracy by replacing most self-attention layers with Mamba-2 layers for better throughput with long thinking traces.
Method: Pre-trained a 12B parameter model on 20T tokens using FP8 training, then used the Minitron strategy for compression and distillation to enable 128k token inference on a single GPU with bfloat16 precision.
Result: Achieves on-par or better accuracy than similar models (e.g., Qwen3-8B) with up to 6x higher inference throughput in reasoning scenarios (8k input, 16k output tokens).
Conclusion: The hybrid Mamba-Transformer architecture successfully balances accuracy and throughput for reasoning workloads, making high-performance reasoning more accessible on standard GPU hardware.
Abstract: We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
[31] In2x at WMT25 Translation Task
Lei Pang, Hanyi Mao, Quanjia Xiao, HaiXiao Liu, Xiangyi Li
Main category: cs.CL
TL;DR: In2x team’s WMT25 submission explores a generalizable paradigm for extending LLMs to Japanese translation, focusing on data construction and reward model design for low-resource languages.
Details
Motivation: To enable large language models to achieve exceptional performance in low-resource or less commonly spoken languages, particularly Japanese translation tasks.
Method: Develops a generalizable paradigm that includes data construction methods and reward model design for extending LLMs to other languages.
Result: Not specified in the abstract (submission for WMT25 shared task, results likely to be presented at the conference).
Conclusion: The approach aims to create a scalable framework for improving LLM performance in underrepresented languages through systematic data and reward modeling techniques.
Abstract: This paper presents the open-system submission by the In2x research team for the WMT25 General Machine Translation Shared Task. Our submission focuses on Japanese-related translation tasks, aiming to explore a generalizable paradigm for extending large language models (LLMs) to other languages. This paradigm encompasses aspects such as data construction methods and reward model design. The ultimate goal is to enable large language model systems to achieve exceptional performance in low-resource or less commonly spoken languages.
[32] Reasoning is about giving reasons
Krunal Shah, Dan Roth
Main category: cs.CL
TL;DR: Proposes Representation of Logical Structure (RLS) as an intermediate representation to understand and articulate the logical structure of natural language arguments, enabling deterministic reasoning across multiple reasoning tasks.
Details
Motivation: Current transformer approaches for rule chaining lack interpretability and cannot accommodate theoretically equivalent reasoning tasks like abduction or contradiction identification. There's a need for models that can understand and articulate the core logical structure of arguments.
Method: Develop an intermediate representation (RLS) that captures the logical structure of natural language arguments, including the logical atoms and the rules incorporating them. This representation enables deterministic reasoning computation.
Result: Achieved high accuracy in identifying and extracting logical structure from natural language arguments across three popular reasoning datasets. The approach supports explanation generation and significantly extends reasoning capabilities.
Conclusion: The RLS representation successfully addresses limitations of current rule-chaining approaches by providing interpretable logical structure understanding, enabling support for various reasoning tasks including arbitrary depth reasoning, mistake rectification, and interactive discussion.
Abstract: Convincing someone of the truth value of a premise requires understanding and articulating the core logical structure of the argument which proves or disproves the premise. Understanding the logical structure of an argument refers to understanding the underlying “reasons” which make up the proof or disproof of the premise - as a function of the “logical atoms” in the argument. While it has been shown that transformers can “chain” rules to derive simple arguments, the challenge of articulating the “reasons” remains. Not only do current approaches to chaining rules suffer in terms of their interpretability, they are also quite constrained in their ability to accommodate extensions to theoretically equivalent reasoning tasks - a model trained to chain rules cannot support abduction or identify contradictions. In this work we suggest addressing these shortcomings by identifying an intermediate representation (which we call the Representation of the Logical Structure (RLS) of the argument) that possesses an understanding of the logical structure of a natural language argument - the logical atoms in the argument and the rules incorporating them. Given the logical structure, reasoning is deterministic and easy to compute. Therefore, our approach supports all forms of reasoning that depend on the logical structure of the natural language argument, including arbitrary depths of reasoning, on-the-fly mistake rectification and interactive discussion with respect to an argument. We show that we can identify and extract the logical structure of natural language arguments in three popular reasoning datasets with high accuracies, thus supporting explanation generation and extending the reasoning capabilities significantly.
[33] Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning
Guilhem Fauré, Mostafa Sadeghi, Sam Bigeard, Slim Ouni
Main category: cs.CL
TL;DR: Proposes two enhancements to Progressive Transformers for sign language production: quaternion-based pose encoding with geodesic loss for better angular joint movements, and contrastive loss to structure embeddings by semantic similarity.
Details
Motivation: Address high intra-class variability in sign language due to signer morphology and stylistic differences in training data to improve robustness.
Method: 1) Encode poses using bone rotations in quaternion space with a geodesic loss; 2) introduce a contrastive loss using gloss overlap or SBERT similarity to structure decoder embeddings by semantic meaning.
Result: 16% improvement in Probability of Correct Keypoint with contrastive loss alone, and 6% reduction in Mean Bone Angle Error when combined with quaternion encoding on Phoenix14T dataset.
Conclusion: Incorporating skeletal structure modeling and semantically guided contrastive objectives benefits Transformer-based sign language production models.
Abstract: One of the main challenges in neural sign language production (SLP) lies in the high intra-class variability of signs, arising from signer morphology and stylistic variety in the training data. To improve robustness to such variations, we propose two enhancements to the standard Progressive Transformers (PT) architecture (Saunders et al., 2020). First, we encode poses using bone rotations in quaternion space and train with a geodesic loss to improve the accuracy and clarity of angular joint movements. Second, we introduce a contrastive loss to structure decoder embeddings by semantic similarity, using either gloss overlap or SBERT-based sentence similarity, aiming to filter out anatomical and stylistic features that do not convey relevant semantic information. On the Phoenix14T dataset, the contrastive loss alone yields a 16% improvement in Probability of Correct Keypoint over the PT baseline. When combined with quaternion-based pose encoding, the model achieves a 6% reduction in Mean Bone Angle Error. These results point to the benefit of incorporating skeletal structure modeling and semantically guided contrastive objectives on sign pose representations into the training of Transformer-based SLP models.
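The geodesic loss between unit quaternions has a standard closed form; a minimal sketch, with the per-joint weighting and reduction left as assumptions:

```python
import torch

def geodesic_quaternion_loss(q_pred, q_true, eps=1e-7):
    """Mean geodesic angle between predicted and target quaternions.
    Shapes: (..., 4). The absolute dot product handles the q / -q ambiguity
    (both represent the same rotation). A sketch of the loss described in the
    paper, not its exact implementation."""
    q_pred = q_pred / q_pred.norm(dim=-1, keepdim=True)
    q_true = q_true / q_true.norm(dim=-1, keepdim=True)
    dot = (q_pred * q_true).sum(dim=-1).abs().clamp(max=1.0 - eps)
    return (2.0 * torch.acos(dot)).mean()
```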
[34] Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek
Mukhammadsaid Mamasaidov, Azizullah Aral, Abror Shopulatov, Mironshoh Inomjonov
Main category: cs.CL
TL;DR: New machine translation resources for Southern Uzbek including datasets, fine-tuned model, and post-processing method for Arabic-script handling.
Details
Motivation: Southern Uzbek has 5 million speakers but is underrepresented in NLP, with significant differences from Northern Uzbek in phonology, lexicon, and orthography.
Method: Created parallel datasets (FLORES+ dev set, 39,994 sentences from various sources), fine-tuned NLLB-200 model (lutfiy), and developed post-processing for Arabic-script half-space character restoration.
Result: Developed comprehensive resources including datasets, models, and tools for Southern Uzbek machine translation with improved morphological boundary handling.
Conclusion: All resources released publicly to support future work on Southern Uzbek and other low-resource languages.
Abstract: Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.
[35] Continuous sentiment scores for literary and multilingual contexts
Laurits Lyngbaek, Pascale Feldkamp, Yuri Bizzoni, Kristoffer Nielbo, Kenneth Enevoldsen
Main category: cs.CL
TL;DR: A novel continuous sentiment scoring method using concept vector projection that outperforms traditional tools for literary text analysis across languages and genres.
Details
Motivation: Traditional sentiment analysis tools underperform on literary texts due to figurative language and ambiguity, especially for low-resource languages, while transformer models provide only coarse categorical labels.
Method: Concept vector projection trained on multilingual literary data to produce continuous sentiment scores rather than categorical labels.
Result: Outperforms existing tools on English and Danish texts, with sentiment score distributions closely matching human ratings, enabling better sentiment arc modeling.
Conclusion: The continuous sentiment scoring approach effectively captures nuanced sentiment expressions in literature across different languages, genres, and historical periods.
Abstract: Sentiment Analysis is widely used to quantify sentiment in text, but its application to literary texts poses unique challenges due to figurative language, stylistic ambiguity, as well as sentiment evocation strategies. Traditional dictionary-based tools often underperform, especially for low-resource languages, and transformer models, while promising, typically output coarse categorical labels that limit fine-grained analysis. We introduce a novel continuous sentiment scoring method based on concept vector projection, trained on multilingual literary data, which more effectively captures nuanced sentiment expressions across genres, languages, and historical periods. Our approach outperforms existing tools on English and Danish texts, producing sentiment scores whose distribution closely matches human ratings, enabling more accurate analysis and sentiment arc modeling in literature.
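Concept vector projection can be sketched as scoring texts by their signed projection onto an axis built from seed examples; the embedding function and seed lists here are illustrative assumptions:

```python
import numpy as np

def sentiment_axis(embed, pos_seeds, neg_seeds):
    """Build a sentiment direction as the difference of seed centroids.
    `embed` maps a word or sentence to a vector; the seed lists are
    illustrative, not the paper's training data."""
    pos = np.mean([embed(w) for w in pos_seeds], axis=0)
    neg = np.mean([embed(w) for w in neg_seeds], axis=0)
    axis = pos - neg
    return axis / np.linalg.norm(axis)

def sentiment_score(text_vec, axis):
    """Continuous score: signed projection of a text embedding onto the axis,
    rather than a coarse positive/negative label."""
    return float(text_vec @ axis / (np.linalg.norm(text_vec) + 1e-9))
```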
[36] Improving in-context learning with a better scoring function
Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher
Main category: cs.CL
TL;DR: The paper identifies Softmax as a limitation in LLMs’ in-context learning for quantifier tasks and linear functions, and proposes SSA (scaled signed averaging) as a superior alternative that improves performance.
Details
Motivation: Recent studies have revealed limitations in LLMs' remarkable in-context learning ability, particularly on tasks involving first-order quantifiers (all/some) and linear functions.
Method: The authors propose scaled signed averaging (SSA) as a novel alternative to the Softmax scoring function in attention mechanisms to address these constraints.
Result: Empirical results show SSA dramatically improves performance on target tasks. Both encoder-only and decoder-only transformers with SSA match or exceed Softmax-based counterparts across various linguistic probing tasks.
Conclusion: SSA effectively addresses Softmax-induced limitations in in-context learning, demonstrating superior performance on quantifier reasoning and linear function tasks while maintaining strong performance on general linguistic tasks.
Abstract: Large language models (LLMs) exhibit a remarkable capacity to learn by analogy, known as in-context learning (ICL). However, recent studies have revealed limitations in this ability. In this paper, we examine these limitations on tasks involving first-order quantifiers such as “all” and “some”, as well as on ICL with linear functions. We identify Softmax, the scoring function in the attention mechanism, as a contributing factor to these constraints. To address this, we propose scaled signed averaging (SSA), a novel alternative to Softmax. Empirical results show that SSA dramatically improves performance on our target tasks. Furthermore, we evaluate both encoder-only and decoder-only transformer models with SSA, demonstrating that they match or exceed their Softmax-based counterparts across a variety of linguistic probing tasks.
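The abstract does not spell out the SSA formula, so the following is only one plausible reading of "scaled signed averaging": keep signed attention scores and normalize by total absolute mass instead of exponentiating:

```python
import torch

def ssa_attention(scores, eps=1e-6):
    """A guessed form of scaled signed averaging (the paper's exact formula is
    not given in the abstract): weights keep the sign of each raw score and are
    scaled by the total absolute mass, so negative evidence is not squashed to
    near-zero the way Softmax squashes it."""
    denom = scores.abs().sum(dim=-1, keepdim=True) + eps
    return scores / denom

# Drop-in replacement for softmax(scores, dim=-1) in a toy attention head:
# out = ssa_attention(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5) @ v
```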
[37] The Digital Sous Chef – A Comparative Study on Fine-Tuning Language Models for Recipe Generation
Shubham Pundhir, Ganesh Bagler
Main category: cs.CL
TL;DR: A benchmark study showing GPT-2 large model with custom tokenization outperforms smaller models and traditional baselines in recipe generation, achieving >20% BERTScore improvement and 69.8% perplexity reduction.
Details
Motivation: To establish a rigorous benchmark for text-based recipe generation and address limitations of generic tokenizers in preserving recipe structures and precise numerical quantities.
Method: Fine-tuned GPT-2 large (774M) model compared against GPT-2 small (124M) and LSTM/RNN baselines on RecipeDB 5-cuisine corpus, using custom tokenization with 23 fraction tokens and structural markers.
Result: Large transformer approach achieved >20% relative improvement in BERTScore F1 (0.92 vs 0.72) over best recurrent baseline and reduced perplexity by 69.8%.
Conclusion: The study provides a foundation for advanced recipe generation research, though challenges remain in factual accuracy, paving the way for integrating real-world constraints and multi-modal inputs.
Abstract: We establish a rigorous benchmark for text-based recipe generation, a fundamental task in natural language generation. We present a comprehensive comparative study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB. Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers. This approach addresses a critical limitation of generic tokenizers by preserving essential recipe structures and precise numerical quantities, thereby enhancing domain specificity. Performance is evaluated using a comprehensive suite of seven automatic metrics spanning fluency (BLEU-4, METEOR), coherence (ROUGE-L), semantic relevance (BERTScore), and diversity. Our experiments show that the large transformer-based approach yields a >20% relative improvement in BERTScore (F1) (0.92 vs 0.72) over the best recurrent baseline, while reducing perplexity by 69.8%. We conclude with a discussion of remaining challenges, particularly regarding factual accuracy, and outline how this foundational study paves the way for integrating real-world constraints and multi-modal inputs in advanced recipe generation research.
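Augmenting a GPT-2 tokenizer with fraction and structural tokens follows the standard Hugging Face pattern; the token lists below are an illustrative subset, not the paper's exact 23 fractions and markers:

```python
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")

# Illustrative subset; the paper adds 23 fraction tokens plus custom markers.
fraction_tokens = ["1/2", "1/3", "2/3", "1/4", "3/4", "1/8"]
structural_markers = ["<TITLE>", "<INGR>", "<INSTR>", "<END>"]

num_added = tokenizer.add_tokens(fraction_tokens + structural_markers)
model.resize_token_embeddings(len(tokenizer))  # extend the embedding matrix
print(f"Added {num_added} tokens; '1/2 cup' now keeps the fraction intact.")
```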
[38] Transplant Then Regenerate: A New Paradigm for Text Data Augmentation
Guangzhan Wang, Hongyu Zhang, Beijun Shen, Xiaodong Gu
Main category: cs.CL
TL;DR: LMTransplant is a novel text augmentation method that uses LLMs to create diverse content-level variants by transplanting seed text into expanded contexts and regenerating new versions, outperforming traditional augmentation techniques.
Details
Motivation: Traditional data augmentation methods focus on lexical rephrasing with same semantics, while LLM-based approaches struggle with controlling style and structure. There's a need for methods that can leverage LLM knowledge to create more diverse and creative content variations while preserving core text attributes.
Method: LMTransplant uses a transplant-then-regenerate paradigm: 1) Incorporate seed text into a context expanded by LLM, 2) Ask the LLM to regenerate a variant based on the expanded context. This leverages LLM knowledge to create content-level diversity.
Result: LMTransplant demonstrates superior performance over existing text augmentation methods across various text-related tasks. It also shows exceptional scalability as the size of augmented data grows.
Conclusion: The proposed LMTransplant paradigm effectively addresses limitations of traditional augmentation methods by leveraging LLMs’ knowledge emergence capability to generate diverse, creative content variations while maintaining original text attributes, with strong scalability properties.
Abstract: Data augmentation is a critical technique in deep learning. Traditional methods like Back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation by their “knowledge emergence” capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of the original text. We evaluate LMTransplant across various text-related tasks, demonstrating its superior performance over existing text augmentation methods. Moreover, LMTransplant demonstrates exceptional scalability as the size of augmented data grows.
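A minimal sketch of the transplant-then-regenerate loop; `ask_llm` is a hypothetical completion helper and the prompt wording is a guess at the paradigm, not the authors' prompts:

```python
def lm_transplant(ask_llm, seed_text):
    """Two-step augmentation sketch. Step 1 'transplants' the seed into a
    richer LLM-written context; step 2 regenerates a variant of the seed that
    fits that context, yielding content-level rather than lexical diversity."""
    context = ask_llm(
        "Write a short passage in which the following text appears verbatim, "
        "with plausible content before and after it:\n" + seed_text
    )
    variant = ask_llm(
        "In the passage below, rewrite the embedded core text as a new variant "
        "that fits the same context and preserves its meaning and label:\n" + context
    )
    return variant
```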
[39] Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference
Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban
Main category: cs.CL
TL;DR: This paper evaluates LLMs’ multilingual reasoning consistency using synthetic logic-based NLI pairs translated across diverse languages, finding that code-switching can improve performance and serve as regularization.
Details
Motivation: To assess LLMs' capacity for consistent logical reasoning across languages, as current understanding of their multilingual alignment capabilities remains limited.
Method: Developed a controlled evaluation framework with synthetic logic-based premise-hypothesis pairs, translated into typologically diverse languages, tested in both monolingual and code-switched conditions with embedding-based similarity validation.
Result: Code-switching doesn’t degrade and can even improve performance, suggesting translation-induced lexical variation acts as regularization; semantic preservation was confirmed through embedding analyses.
Conclusion: Current LLMs show both potential and brittleness in cross-lingual reasoning, with code-switching identified as a promising approach for enhancing multilingual robustness.
Abstract: Large language models (LLMs) are increasingly applied in multilingual contexts, yet their capacity for consistent, logically grounded alignment across languages remains underexplored. We present a controlled evaluation framework for multilingual natural language inference (NLI) that generates synthetic, logic-based premise-hypothesis pairs and translates them into a typologically diverse set of languages. This design enables precise control over semantic relations and allows testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade, and can even improve, performance, suggesting that translation-induced lexical variation may serve as a regularization signal. We validate semantic preservation through embedding-based similarity analyses and cross-lingual alignment visualizations, confirming the fidelity of translated pairs. Our findings expose both the potential and the brittleness of current LLM cross-lingual reasoning, and identify code-switching as a promising lever for improving multilingual robustness. Code available at: https://github.com/KurbanIntelligenceLab/nli-stress-testing
[40] TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting
Jiaming Leng, Yunying Bi, Chuan Qin, Bing Yin, Yanyong Zhang, Chao Wang
Main category: cs.CL
TL;DR: TransLLM is a unified framework that combines spatiotemporal modeling with large language models using learnable prompt composition for urban transportation tasks, achieving strong performance across multiple datasets and tasks.
Details
Motivation: Existing approaches have limitations: small-scale deep learning models are task-specific and data-hungry, while LLMs struggle with structured spatiotemporal data and numerical reasoning in transportation domains.
Method: Uses a lightweight spatiotemporal encoder with dilated temporal convolutions and dual-adjacency graph attention networks, integrated with LLMs through structured embeddings. Features instance-level prompt routing trained via reinforcement learning for dynamic prompt personalization.
Result: Experiments across 7 datasets and 3 tasks show exceptional effectiveness in both supervised and zero-shot settings, with competitive performance against 10 baseline models on both regression and planning problems.
Conclusion: TransLLM demonstrates strong generalization and cross-task adaptability, providing a unified foundation framework for diverse urban transportation challenges.
Abstract: Urban transportation systems encounter diverse challenges across multiple tasks, such as traffic forecasting, electric vehicle (EV) charging demand prediction, and taxi dispatch. Existing approaches suffer from two key limitations: small-scale deep learning models are task-specific and data-hungry, limiting their generalizability across diverse scenarios, while large language models (LLMs), despite offering flexibility through natural language interfaces, struggle with structured spatiotemporal data and numerical reasoning in transportation domains. To address these limitations, we propose TransLLM, a unified foundation framework that integrates spatiotemporal modeling with large language models through learnable prompt composition. Our approach features a lightweight spatiotemporal encoder that captures complex dependencies via dilated temporal convolutions and dual-adjacency graph attention networks, seamlessly interfacing with LLMs through structured embeddings. A novel instance-level prompt routing mechanism, trained via reinforcement learning, dynamically personalizes prompts based on input characteristics, moving beyond fixed task-specific templates. The framework operates by encoding spatiotemporal patterns into contextual representations, dynamically composing personalized prompts to guide LLM reasoning, and projecting the resulting representations through specialized output layers to generate task-specific predictions. Experiments across seven datasets and three tasks demonstrate the exceptional effectiveness of TransLLM in both supervised and zero-shot settings. Compared to ten baseline models, it delivers competitive performance on both regression and planning problems, showing strong generalization and cross-task adaptability. Our code is available at https://github.com/BiYunying/TransLLM.
[41] Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs
Skatje Myers, Dmitriy Dligach, Timothy A. Miller, Samantha Barr, Yanjun Gao, Matthew Churpek, Anoop Mayampurath, Majid Afshar
Main category: cs.CL
TL;DR: RAG approach for clinical EHR analysis matches/exceeds recent notes performance while using significantly fewer tokens, maintaining efficiency even with newer long-context models.
Details
Motivation: EHRs are long, noisy and redundant, making clinical navigation challenging. LLMs struggle with EHR length exceeding context windows, requiring efficient solutions.
Method: Proposed three clinical tasks (imaging extraction, antibiotic timelines, key diagnoses) tested with three LLMs using either targeted RAG retrieval or recent notes approach.
Result: RAG closely matched or exceeded recent notes performance, approaching full context performance while requiring drastically fewer input tokens.
Conclusion: RAG remains competitive and efficient for clinical EHR analysis even as newer models handle longer contexts, offering token-efficient alternative.
Abstract: Electronic health records (EHRs) are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising solution for extracting and reasoning over this unstructured text, but the length of clinical notes often exceeds even state-of-the-art models’ extended context windows. Retrieval-augmented generation (RAG) offers an alternative by retrieving task-relevant passages from across the entire EHR, potentially reducing the amount of required input tokens. In this work, we propose three clinical tasks designed to be replicable across health systems with minimal effort: 1) extracting imaging procedures, 2) generating timelines of antibiotic use, and 3) identifying key diagnoses. Using EHRs from actual hospitalized patients, we test three state-of-the-art LLMs with varying amounts of provided context, using either targeted text retrieval or the most recent clinical notes. We find that RAG closely matches or exceeds the performance of using recent notes, and approaches the performance of using the models’ full context while requiring drastically fewer input tokens. Our results suggest that RAG remains a competitive and efficient approach even as newer models become capable of handling increasingly longer amounts of text.
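The retrieval step can be sketched with an off-the-shelf embedding model; the model choice and chunking below are assumptions, not the paper's configuration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in retriever model

def retrieve_passages(query, note_chunks, k=5):
    """Embed pre-chunked EHR note passages and return the k most similar to
    the task query. A generic RAG retrieval step, not the paper's pipeline."""
    emb = encoder.encode([query] + note_chunks, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]          # cosine similarity (embeddings normalized)
    top = np.argsort(-sims)[:k]
    return [note_chunks[i] for i in top]

# e.g. retrieve_passages("List all imaging procedures performed", chunks)
```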
[42] Long Chain-of-Thought Reasoning Across Languages
Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr
Main category: cs.CL
TL;DR: Study examines long chain-of-thought reasoning across multiple languages, finding English pivot effectiveness varies by language, multilingual pretraining helps but gaps remain, and data quality/scale trade-offs are language-dependent.
Details
Motivation: To investigate how long chain-of-thought reasoning transfers across different languages, as current reasoning capabilities in LLMs remain predominantly English-centric despite the multilingual nature of many models.
Method: Constructed translated versions of English reasoning datasets, fine-tuned Qwen 2.5 (7B) and Qwen 3 (8B) models, and systematically studied long CoT generation across French, Japanese, Latvian, and Swahili.
Result: English pivot effectiveness varies: no benefit for French, improves Japanese/Latvian performance, insufficient for Swahili. Multilingual pretraining narrows but doesn’t eliminate cross-lingual gaps. Lightweight fine-tuning (1k traces) improves Swahili by 30%. Data quality vs scale trade-offs are language-dependent.
Conclusion: Long CoT transfer effectiveness depends on language characteristics, with English pivot strategy working variably across languages. Multilingual reasoning requires language-specific approaches rather than one-size-fits-all solutions, and translated datasets can foster equitable multilingual research.
Abstract: Scaling inference through long chains-of-thought (CoTs) has unlocked impressive reasoning capabilities in large language models (LLMs), yet the reasoning process remains almost exclusively English-centric. We construct translated versions of two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and present a systematic study of long CoT generation across French, Japanese, Latvian, and Swahili. Our experiments reveal three key findings. First, the efficacy of using English as a pivot language varies by language: it provides no benefit for French, improves performance when used as the reasoning language for Japanese and Latvian, and proves insufficient for Swahili where both task comprehension and reasoning remain poor. Second, extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap. A lightweight fine-tune using only 1k traces still improves performance by over 30% in Swahili. Third, data quality versus scale trade-offs are language dependent: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora prove more effective for Swahili and Latvian. Together, these results clarify when and why long CoTs transfer across languages and provide translated datasets to foster equitable multilingual reasoning research.
[43] MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework
Ailing Yu, Lan Yao, Jingnan Liu, Zhe Chen, Jiajun Yin, Yuan Wang, Xinhao Liao, Zhiling Ye, Ji Li, Yun Yue, Hansong Xiao, Hualei Zhou, Chunxiao Guo, Peng Wei, Jinjie Gu
Main category: cs.CL
TL;DR: MedResearcher-R1-32B is a medical deep research agent that achieves state-of-the-art performance on medical benchmarks through specialized medical knowledge synthesis and retrieval tools, outperforming larger proprietary systems.
Details
Motivation: General-purpose LLM agents struggle with medical domain challenges due to insufficient medical knowledge and lack of specialized retrieval tools for medical contexts.
Method: Two core innovations: (1) data synthesis framework using medical knowledge graphs to generate complex multi-hop question-answer pairs, (2) custom medical retrieval engine integrated with general tools. Trained with supervised fine-tuning and online reinforcement learning.
Result: Generated 2100+ diverse trajectories across 12 medical specialties. MedResearcher-R1-32B established new state-of-the-art results on medical benchmarks while maintaining competitive performance on general tasks.
Conclusion: Strategic domain-specific innovations in architecture, tool design, and training data construction enable smaller open-source models to outperform larger proprietary systems in specialized medical domains.
Abstract: Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our MedResearcher-R1-32B model demonstrates exceptional performance, establishing new state-of-the-art results on medical benchmarks while maintaining competitive performance on general deep research tasks. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains.
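The chain-extraction idea can be sketched with networkx; the graph construction, radius, and hop limits below are illustrative assumptions:

```python
import networkx as nx

def longest_chain_from(kg, rare_entity, radius=2):
    """Find the longest simple path rooted at a rare entity inside its local
    subgraph, to seed a multi-hop QA pair. A sketch of the idea only; intended
    for small ego subgraphs, since simple-path enumeration is exponential."""
    sub = nx.ego_graph(kg, rare_entity, radius=radius)
    best = [rare_entity]
    for target in sub.nodes:
        if target == rare_entity:
            continue
        for path in nx.all_simple_paths(sub, rare_entity, target, cutoff=radius * 2):
            if len(path) > len(best):
                best = path
    return best  # chain of entities; edge relations become the question hops
```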
[44] Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
Main category: cs.CL
TL;DR: First systematic study on quantizing diffusion-based language models (dLLMs), identifying activation outliers as key challenge and evaluating state-of-the-art PTQ methods across multiple dimensions.
Details
Motivation: Deployment of diffusion large language models on edge devices is challenging due to massive parameter scale and high resource demands, while post-training quantization techniques for autoregressive LLMs remain unexplored for dLLMs.
Method: Systematic study identifying activation outliers, implementing state-of-the-art PTQ methods, and conducting comprehensive evaluation across four dimensions: bit-width, quantization method, task category, and model type.
Result: Identified presence of activation outliers with abnormally large values that dominate dynamic range and pose key challenge to low-bit quantization. Provided practical insights into quantization behavior of dLLMs under different configurations.
Conclusion: Findings provide foundation for future research in efficient dLLM deployment, with codes and experimental setups released to support the community.
Abstract: Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All codes and experimental setups will be released to support the community.
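To see why activation outliers hurt low-bit quantization, consider naive symmetric int8 quantization: one extreme channel inflates the scale and crushes precision for ordinary values. A small diagnostic sketch with illustrative thresholds:

```python
import torch

def outlier_channels(activations, k=3.0):
    """Flag channels whose peak magnitude exceeds k standard deviations above
    the mean of per-channel maxima. activations: (tokens, channels).
    A simple diagnostic in the spirit of the paper's analysis."""
    ch_max = activations.abs().amax(dim=0)
    return torch.nonzero(ch_max > ch_max.mean() + k * ch_max.std()).flatten()

def quantize_int8(x):
    """Naive symmetric per-tensor int8 quantization. A single outlier sets
    `scale`, so most values round to a handful of integer levels: the core
    difficulty the paper identifies for low-bit dLLM quantization."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale
```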
[45] G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong
Main category: cs.CL
TL;DR: This paper addresses the gap in multimodal LLMs’ ability to solve geometric problems by creating Geo170K dataset with 170K+ geometric image-caption pairs and developing G-LLaVA, which outperforms GPT-4-V on MathVista benchmark with only 7B parameters.
Details
Motivation: Current multimodal LLMs struggle with geometric problem solving as they fail to accurately comprehend basic geometric elements and relationships, despite LLMs' strong performance in text-based mathematical problems.
Method: Leveraged unique characteristics of geometric problems and textual LLMs' capacity to build an enriched multimodal geometry dataset (Geo170K), then developed G-LLaVA model trained on this dataset.
Result: G-LLaVA demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on MathVista benchmark despite having only 7B parameters.
Conclusion: The approach successfully enables LLMs to solve geometric problems by understanding image input through specialized dataset construction and model development, bridging an important gap in multimodal mathematical reasoning.
Abstract: Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities, which encourages extensive research on their application in mathematical problem solving. However, current work has been largely focused on text-based mathematical problems, with limited investigation in problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as unique geometric logical form and geometric scalability) and the capacity of the textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Utilizing our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
[46] Social Debiasing for Fair Multi-modal LLMs
Harry Cheng, Yangyang Guo, Qingpei Guo, Ming Yang, Tian Gan, Weili Guan, Liqiang Nie
Main category: cs.CL
TL;DR: This paper introduces a counterfactual dataset (CMSC) and counter-stereotype debiasing (CSD) strategy to mitigate social biases in Multi-modal Large Language Models without compromising general performance.
Details
Motivation: MLLMs inherit deep-rooted social biases from training data, leading to problematic responses regarding race, gender, and other attributes, which needs to be addressed.Method: Proposes a comprehensive counterfactual dataset with 18 diverse social concepts and a CSD strategy that includes bias-aware data sampling and loss rescaling methods using counter-stereotypes.
Result: Extensive experiments with four MLLM architectures show CMSC dataset advantages and CSD strategy superiority in reducing social biases compared to existing methods.
Conclusion: The proposed approach effectively reduces social biases in MLLMs while maintaining performance on general multi-modal reasoning tasks.
Abstract: Multi-modal Large Language Models (MLLMs) have dramatically advanced the research field and delivered powerful vision-language understanding capabilities. However, these models often inherit deep-rooted social biases from their training data, leading to uncomfortable responses with respect to attributes such as race and gender. This paper addresses the issue of social biases in MLLMs by i) introducing a comprehensive counterfactual dataset with multiple social concepts (CMSC), which complements existing datasets by providing 18 diverse and balanced social concepts; and ii) proposing a counter-stereotype debiasing (CSD) strategy that mitigates social biases in MLLMs by leveraging the opposites of prevalent stereotypes. CSD incorporates both a novel bias-aware data sampling method and a loss rescaling method, enabling the model to effectively reduce biases. We conduct extensive experiments with four prevalent MLLM architectures. The results demonstrate the advantage of the CMSC dataset and the edge of CSD strategy in reducing social biases compared to existing competing methods, without compromising the overall performance on general multi-modal reasoning benchmarks.
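The abstract names two CSD ingredients, bias-aware data sampling and loss rescaling, without giving their exact form. A hypothetical sketch of the rescaling idea, upweighting under-represented (counter-stereotypical) concepts by inverse frequency; the paper’s actual weighting may differ:

```python
import torch

def csd_rescaled_loss(per_sample_loss, concept_ids, concept_freq):
    # Hypothetical rescaling: weight each sample by the inverse frequency of
    # its social concept, renormalized so the overall loss scale is unchanged.
    weights = 1.0 / concept_freq[concept_ids]
    weights = weights / weights.mean()
    return (weights * per_sample_loss).mean()

# Toy usage: concept 0 is 9x more frequent in training than concept 1.
loss = torch.tensor([0.5, 0.7, 0.6, 2.0])
concepts = torch.tensor([0, 0, 0, 1])
freq = torch.tensor([0.9, 0.1])
print(csd_rescaled_loss(loss, concepts, freq))
```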
[47] Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
Main category: cs.CL
TL;DR: Source2Synth is a scalable synthetic data generation method that uses real-world data sources to create high-quality training data with reasoning steps, improving performance on question answering tasks by over 22% compared to baselines.
Details
Motivation: Existing synthetic data generation methods often produce low-quality or contrived data, which limits their effectiveness for enhancing large language models without expensive human annotations.Method: Source2Synth takes custom real-world data sources as input and generates synthetic data examples with intermediate reasoning steps. It improves dataset quality by filtering out low-quality generations based on answerability criteria.
Result: The method achieved 25.51% improvement for tabular question answering on WikiSQL and 22.57% improvement for multi-hop question answering on HotpotQA compared to fine-tuned baselines.
Conclusion: Source2Synth provides an effective and scalable approach for generating high-quality synthetic data from real-world sources, significantly enhancing LLM performance on complex reasoning tasks without human annotation costs.
Abstract: Synthetic data generation has recently emerged as a promising approach for enhancing the capabilities of large language models (LLMs) without the need for expensive human annotations. However, existing methods often generate data that can be low quality or contrived. In this paper, we introduce Source2Synth, a scalable approach for synthetic data generation and curation that is grounded in real-world data sources. Source2Synth takes as input a custom data source and produces synthetic data examples with intermediate reasoning steps. Our method improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two tasks that leverage two different types of data: multi-hop question answering (MHQA), where we test complex reasoning abilities leveraging documents, and tabular question answering (TQA), where we test tool usage leveraging tables. Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotpotQA compared to the fine-tuned baselines.
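The curation step is the most concrete piece of the pipeline: a synthetic example is discarded when its question cannot be answered from the grounding source. A minimal sketch under that reading, where `answer_fn` is a hypothetical stand-in for the LLM call:

```python
def filter_by_answerability(examples, answer_fn):
    """Keep synthetic examples whose question is answerable from its source.

    `answer_fn(question, source)` stands in for an LLM call; an example is
    kept only if the predicted answer matches the synthetic target.
    """
    kept = []
    for ex in examples:
        prediction = answer_fn(ex["question"], ex["source"])
        if prediction.strip().lower() == ex["answer"].strip().lower():
            kept.append(ex)
    return kept

# Toy usage with a trivial answer_fn.
data = [{"question": "What is the capital of France?",
         "source": "Paris is the capital of France.",
         "answer": "Paris"}]
print(len(filter_by_answerability(data, lambda q, s: "Paris")))  # 1
```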
[48] Deliberate Reasoning in Language Models as Structure-Aware Planning with an Accurate World Model
Siheng Xiong, Ali Payani, Yuan Yang, Faramarz Fekri
Main category: cs.CL
TL;DR: SWAP is a novel reasoning framework that combines structured knowledge representation with learned planning using entailment graphs for symbolic verification, outperforming existing methods on complex reasoning tasks.
Details
Motivation: Existing Chain-of-Thought approaches struggle with consistency and verification in complex multi-step reasoning tasks, requiring a more structured approach to enhance language model reasoning capabilities.Method: SWAP uses entailment graphs to encode structured dependencies, employs a policy model for candidate expansions, a world model for structural updates with multiple alternatives, Diversity-based Modelling for exploration, and Contrastive Ranking for improved discrimination accuracy.
Result: Extensive experiments show SWAP significantly improves upon base models and consistently outperforms existing reasoning methods across diverse reasoning-intensive benchmarks including math reasoning, logical reasoning, and coding tasks.
Conclusion: SWAP successfully integrates structured knowledge representation with learned planning, demonstrating superior performance in complex reasoning tasks through its novel framework combining entailment graphs, diversity modeling, and contrastive ranking techniques.
Abstract: Enhancing the reasoning capabilities of language models (LMs) remains a key challenge, especially for tasks that require complex, multi-step decision-making where existing Chain-of-Thought (CoT) approaches struggle with consistency and verification. In this paper, we propose a novel reasoning framework, referred to as Structure-aware Planning with an Accurate World Model (SWAP), that integrates structured knowledge representation with learned planning. Unlike prior methods that rely purely on natural language reasoning, SWAP leverages entailment graphs to encode structured dependencies and enable symbolic verification of intermediate steps. To systematically construct and update the graph, SWAP employs a policy model to propose candidate expansions and a world model to predict structural updates. To improve accuracy, the world model generates multiple alternative updates, and a discriminator re-ranks them based on plausibility. To encourage diverse exploration, we introduce Diversity-based Modelling (DM), which samples candidates from the remaining probability mass after removing previously sampled candidates from the original policy distribution. Additionally, SWAP improves the discrimination accuracy through Contrastive Ranking (CR), which directly compares candidates within prompts and incorporates meta-knowledge to improve ranking quality. We evaluate SWAP across diverse reasoning-intensive benchmarks including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP significantly improves upon the base models and consistently outperforms existing reasoning methods.
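Diversity-based Modelling is described concretely enough to sketch: each new candidate is drawn from the probability mass left after removing previously sampled candidates. A small numpy illustration (a simplified stand-in for sampling from the policy model):

```python
import numpy as np

def diversity_based_samples(probs, k, seed=0):
    # Draw k distinct candidates; after each draw, zero out the sampled
    # index's mass and renormalize, so later draws come from the
    # remaining probability mass.
    rng = np.random.default_rng(seed)
    p = np.asarray(probs, dtype=float).copy()
    picks = []
    for _ in range(k):
        p = p / p.sum()
        i = int(rng.choice(len(p), p=p))
        picks.append(i)
        p[i] = 0.0
    return picks

print(diversity_based_samples([0.6, 0.25, 0.1, 0.05], k=3))
```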
[49] ChuLo: Chunk-Level Key Information Representation for Long Document Understanding
Yan Li, Soyeon Caren Han, Yue Dai, Feiqi Cao
Main category: cs.CL
TL;DR: ChuLo is a novel chunk representation method that uses unsupervised keyphrase extraction to group input tokens into semantically important chunks, reducing input length while preserving core document content for better long document understanding in Transformer models.
Details
Motivation: Transformer models struggle with long documents due to computational limitations. Traditional approaches like truncation, sparse attention, and chunking cause information loss and fail to capture long-range dependencies, especially problematic for token classification tasks requiring full context.Method: ChuLo groups input tokens using unsupervised keyphrase extraction to create semantically meaningful chunks based on important keyphrases. This reduces input length while retaining core document content and preserving all tokens for fine-grained annotation tasks.
Result: The method was evaluated on multiple long document classification and token classification tasks, showing effectiveness through comprehensive qualitative and quantitative analysis. It minimizes information loss and improves Transformer model efficiency for long documents.
Conclusion: ChuLo successfully addresses computational limitations of Transformers for long documents by using keyphrase-based chunking, preserving semantic content while reducing input length, making it effective for both document classification and token classification tasks requiring full context preservation.
Abstract: Transformer-based models have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet their ability to handle long documents is constrained by computational limitations. Traditional approaches, such as truncating inputs, sparse self-attention, and chunking, attempt to mitigate these issues, but they often lead to information loss and hinder the model’s ability to capture long-range dependencies. In this paper, we introduce ChuLo, a novel chunk representation method for long document understanding that addresses these limitations. Our ChuLo groups input tokens using unsupervised keyphrase extraction, emphasizing semantically important keyphrase-based chunks to retain core document content while reducing input length. This approach minimizes information loss and improves the efficiency of Transformer-based models. Preserving all tokens in long document understanding, especially in token classification tasks, is important to ensure that fine-grained annotations, which depend on the entire sequence context, are not lost. We evaluate our method on multiple long document classification tasks and long document token classification tasks, demonstrating its effectiveness through comprehensive qualitative and quantitative analysis. Our implementation is open-sourced at https://github.com/adlnlp/Chulo.
[50] A Little Human Data Goes A Long Way
Dhananjay Ashok, Jonathan May
Main category: cs.CL
TL;DR: Synthetic data can replace up to 90% of human annotations with minimal performance loss, but the final 10% human data is crucial. Small amounts of human data (125-200 points) significantly outperform large volumes of synthetic data.
Details
Motivation: Address the high cost of human annotation in NLP systems by exploring how much synthetic data can effectively replace human-generated data while maintaining performance.Method: Studied effects of incrementally replacing human data with synthetic data across eight diverse Fact Verification and Question Answering datasets, comparing performance at different replacement ratios.
Result: Replacing up to 90% of training data with synthetic points only marginally decreases performance, but replacing the final 10% causes severe performance declines. 125 human data points can reliably improve purely synthetic models, and 200 human points outperform an order of magnitude more synthetic data.
Conclusion: Even when large-scale human annotation is infeasible, including a small proportion (10%) of human-generated data provides significant value and cost-effectiveness compared to purely synthetic approaches.
Abstract: Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Question Answering (QA) by studying the effects of incrementally replacing human generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be reliably improved by including as few as 125 human generated data points. We show that matching the performance gain of just a little additional human data (only 200 points) requires an order of magnitude more synthetic data and estimate price ratios at which human annotation would be a more cost-effective solution. Our results suggest that even when human annotation at scale is infeasible, there is great value to having a small proportion of the dataset being human generated.
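The experimental protocol boils down to building training sets at different human/synthetic ratios under a fixed budget. A sketch of that mixing step, with placeholder data rather than the authors’ datasets:

```python
import random

def mixed_training_set(human, synthetic, human_frac, seed=0):
    # Keep `human_frac` of a fixed budget as human-annotated points and
    # fill the remainder with synthetic points.
    rng = random.Random(seed)
    n = len(human)
    n_human = round(human_frac * n)
    return rng.sample(human, n_human) + rng.sample(synthetic, n - n_human)

human = [f"human-{i}" for i in range(1000)]        # placeholder annotations
synthetic = [f"synth-{i}" for i in range(10000)]   # placeholder generations
train = mixed_training_set(human, synthetic, human_frac=0.1)
print(len(train))  # budget stays 1000; 100 points are human-generated
```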
[51] SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models
Jianyi Zhang, Da-Cheng Juan, Cyrus Rashtchian, Chun-Sung Ferng, Heinrich Jiang, Yiran Chen
Main category: cs.CL
TL;DR: SLED is a novel decoding framework that improves LLM truthfulness by contrasting final and early layer logits, using latent knowledge for self-refinement without external resources or fine-tuning.
Details
Motivation: Address unreliable and factually incorrect outputs from large language models by enhancing truthfulness without requiring external knowledge bases or additional fine-tuning.Method: Self Logits Evolution Decoding (SLED) framework that contrasts output logits from final layer with early layers, using approximate gradient approach to enable latent knowledge to guide self-refinement of outputs.
Result: Extensive experiments across diverse model families (Gemma, Qwen, Mixtral, gpt-oss) and scales (1B-45B) show SLED consistently improves factual accuracy while maintaining fluency and negligible latency overhead. Can be combined with other decoding methods.
Conclusion: SLED effectively enhances LLM truthfulness by leveraging internal latent knowledge through logit comparison and self-refinement, providing a flexible and efficient solution for improving factual accuracy without external dependencies.
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but their outputs can sometimes be unreliable or factually incorrect. To address this, we introduce Self Logits Evolution Decoding (SLED), a novel decoding framework that enhances the truthfulness of LLMs without relying on external knowledge bases or requiring further fine-tuning. From an optimization perspective, our SLED framework leverages the latent knowledge embedded within the LLM by contrasting the output logits from the final layer with those from early layers. It then utilizes an approximate gradient approach to enable latent knowledge to guide the self-refinement of outputs, thereby effectively improving factual accuracy. Extensive experiments have been conducted on established benchmarks across a diverse range of model families (Gemma, Qwen, Mixtral, gpt-oss) and scales (from 1B to 45B), including more advanced architectural configurations such as the mixture of experts (MoE). Our evaluation spans a wide variety of tasks and the results demonstrate that SLED consistently improves factual accuracy compared to existing decoding methods while maintaining natural language fluency and negligible latency overhead. Furthermore, it can be flexibly combined with other decoding methods to further enhance their performance.
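At SLED’s core is a contrast between final-layer and early-layer logits; the approximate-gradient evolution applied on top of that contrast is omitted in this simplified sketch, which is closer in spirit to plain layer-contrast decoding than to the full method:

```python
import torch
import torch.nn.functional as F

def layer_contrast(final_logits, early_logits, alpha=0.5):
    # Sharpen the final layer's distribution along directions where it
    # diverges from an early layer, i.e., where later layers refined the
    # prediction.
    final_logp = F.log_softmax(final_logits, dim=-1)
    early_logp = F.log_softmax(early_logits, dim=-1)
    return final_logp + alpha * (final_logp - early_logp)

# Toy usage on a 5-token vocabulary.
final = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
early = torch.tensor([2.0, 2.0, 0.5, 0.1, -1.0])
print(layer_contrast(final, early).softmax(dim=-1))
```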
[52] Retrieval-Augmented Semantic Parsing: Improving Generalization with Lexical Knowledge
Xiao Zhang, Qianru Meng, Johan Bos
Main category: cs.CL
TL;DR: RASP integrates external symbolic knowledge with LLMs for semantic parsing, nearly doubling performance on unseen concepts compared to previous models.
Details
Motivation: Open-domain semantic parsing is challenging as neural models rely on heuristics and struggle with unseen concepts, prompting investigation into LLMs' potential.Method: Retrieval-Augmented Semantic Parsing (RASP) - a simple approach that integrates external symbolic knowledge into the parsing process using large language models.
Result: LLMs outperform previous encoder-decoder baselines, and RASP further enhances their ability to predict unseen concepts, nearly doubling performance on out-of-distribution concepts.
Conclusion: The findings highlight the promise of leveraging large language models with retrieval mechanisms for robust and open-domain semantic parsing.
Abstract: Open-domain semantic parsing remains a challenging task, as neural models often rely on heuristics and struggle to handle unseen concepts. In this paper, we investigate the potential of large language models (LLMs) for this task and introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective approach that integrates external symbolic knowledge into the parsing process. Our experiments not only show that LLMs outperform previous encoder-decoder baselines for semantic parsing, but that RASP further enhances their ability to predict unseen concepts, nearly doubling the performance of previous models on out-of-distribution concepts. These findings highlight the promise of leveraging large language models and retrieval mechanisms for robust and open-domain semantic parsing.
[53] Task-Oriented Automatic Fact-Checking with Frame-Semantics
Jacob Devasier, Rishabh Mediratta, Akshith Putta, Chengkai Li
Main category: cs.CL
TL;DR: Novel fact-checking approach using frame semantics for structured claim understanding, with case studies on voting claims and OECD data showing improved evidence retrieval and explainability.
Details
Motivation: To enhance automatic fact-checking by leveraging frame semantics for better structured understanding of claims and guided evidence retrieval.Method: Proposed frame semantics paradigm, created annotated dataset from PolitiFact claims, conducted two case studies using Vote semantic frame and OECD-based semantic frames.
Result: Demonstrated effectiveness of frame semantics in improving evidence retrieval and explainability for fact-checking, identified high-impact frames through survey.
Conclusion: Frame semantics provides an effective approach for structured fact-checking, with identified high-impact frames guiding future research directions.
Abstract: We propose a novel paradigm for automatic fact-checking that leverages frame semantics to enhance the structured understanding of claims and guide the process of fact-checking them. To support this, we introduce a pilot dataset of real-world claims extracted from PolitiFact, specifically annotated for large-scale structured data. This dataset underpins two case studies: the first investigates voting-related claims using the Vote semantic frame, while the second explores various semantic frames based on data sources from the Organisation for Economic Co-operation and Development (OECD). Our findings demonstrate the effectiveness of frame semantics in improving evidence retrieval and explainability for fact-checking. Finally, we conducted a survey of frames evoked in fact-checked claims, identifying high-impact frames to guide future work in this direction.
[54] Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions
Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
Main category: cs.CL
TL;DR: This paper argues that natural language generation from image sequences is a broader multimodal problem requiring models to understand visual-temporal relationships and language features, surveying five tasks and identifying research challenges.
Details
Motivation: To address the lack of attention on modality interactions in visually grounded NLP and establish that language generation from image sequences represents a general multimodal problem requiring sophisticated relationship modeling.Method: The authors analyze five different tasks as instances of the broader multimodal problem, survey recent modeling and evaluation approaches, and examine common challenges across these tasks.
Result: The paper identifies that current approaches face challenges in managing intricate relationships between visual events over time and corresponding language features, revealing gaps in current multimodal modeling capabilities.
Conclusion: The study proposes several research directions for future investigation to address key open questions in modeling visual-temporal relationships and language generation from multimodal sequences.
Abstract: In recent years, a substantial body of work in visually grounded natural language processing has focused on real-life multimodal scenarios such as describing content depicted in images or videos. However, comparatively less attention has been devoted to studying the nature and degree of interaction between the different modalities in these scenarios. In this paper, we argue that any task dealing with natural language generation from sequences of images or frames is an instance of the broader, more general problem of modeling the intricate relationships between visual events unfolding over time and the features of the language used to interpret, describe, or narrate them. Therefore, solving these tasks requires models to be capable of identifying and managing such intricacies. We consider five seemingly different tasks, which we argue are compelling instances of this broader multimodal problem. Subsequently, we survey the modeling and evaluation approaches adopted for these tasks in recent years and examine the common set of challenges these tasks pose. Building on this perspective, we identify key open questions and propose several research directions for future investigation.
[55] Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization
Zhitao He, Zijun Liu, Peng Li, Yi R. Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Main category: cs.CL
TL;DR: CollabUIAgents is a multi-agent reinforcement learning framework with credit re-assignment strategy using LLMs for process rewards, achieving improved performance and cross-environment generalization.
Details
Motivation: Current multi-agent systems excel in performance but struggle with generalization across environments due to predefined roles and inadequate generalization strategies, hindering progress in interactive environments.Method: Proposes a multi-agent reinforcement learning framework with a novel credit re-assignment strategy that uses LLMs to assign process rewards instead of environment-specific rewards, learning with synthesized preference data to foster generalizable collaborative behaviors among role-free agents.
Result: Framework improves both performance and cross-environment generalizability; 7B-parameter system achieves results on par with or exceeds strong closed-source models and the LLM guiding the credit re-assignment.
Conclusion: The approach effectively addresses generalization challenges in multi-agent systems through LLM-based credit re-assignment and provides insights for using granular rewards and accommodating trained LLMs in multi-agent settings.
Abstract: LLM-based agents have made significant advancements in interactive environments, such as mobile operations and web browsing, and in other domains beyond computer use. Current multi-agent systems generally outperform single agents, but they struggle with generalization across environments due to predefined roles and inadequate strategies for generalizing language agents. The challenge of achieving both strong performance and good generalization has hindered the progress of multi-agent systems for interactive environments. To address these issues, we propose CollabUIAgents, a multi-agent reinforcement learning framework with a novel multi-agent credit re-assignment (CR) strategy, assigning process rewards with LLMs rather than environment-specific rewards and learning with synthesized preference data, in order to foster generalizable, collaborative behaviors among the role-free agents’ policies. Empirical results show that our framework improves both the performance and the cross-environment generalizability of multi-agent systems. Moreover, our 7B-parameter system achieves results on par with or exceeding strong closed-source models and the LLM that guides the CR. We also provide insights into using granular CR rewards effectively for environment generalization and into accommodating trained LLMs in multi-agent systems.
[56] JudgeLRM: Large Reasoning Models as a Judge
Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He
Main category: cs.CL
TL;DR: JudgeLRM, a reinforcement learning-based LLM judge system, outperforms SFT-tuned and reasoning models, with smaller models surpassing GPT-4 and larger ones beating state-of-the-art models in complex reasoning evaluation tasks.
Details
Motivation: Existing Supervised Fine-Tuning approaches for LLM judges often fail in domains requiring complex reasoning, creating a need for better evaluation methods that can handle reasoning-intensive tasks.Method: Introduces JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning with judge-wise, outcome-driven rewards instead of traditional SFT approaches.
Result: JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in reasoning-demanding judge tasks.
Conclusion: Reinforcement learning with outcome-driven rewards is more effective than SFT for developing LLM judges, especially for tasks requiring complex reasoning, demonstrating the superiority of RL-based approaches over traditional fine-tuning methods.
Abstract: The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) for judges approaches often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples - highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
[57] Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement
Zhihan Zhang, Yixin Cao, Lizi Liao
Main category: cs.CL
TL;DR: A dual preference-guided refinement framework for chart-to-code generation that combines visual and code rewards with iterative preference learning to produce high-quality plotting scripts from chart images.
Details
Motivation: Chart-to-code generation is under-constrained with multiple valid implementations for the same visual output, making standard supervised fine-tuning insufficient for learning accurate mappings that consider both code correctness and visual fidelity.Method: Proposes a dual preference-guided refinement framework with feedback-driven dual-modality reward mechanism, structured variant generation strategy, visual reward model, and offline reinforcement learning to optimize multi-dimensional fidelity.
Result: Significantly enhances performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code that rivals specialized chart-centric models and some proprietary systems.
Conclusion: The framework effectively addresses the challenges of chart-to-code generation by providing scalable preference collection and targeted supervision through dual-modality rewards and iterative preference learning.
Abstract: Translating chart images into executable plotting scripts, referred to as the chart-to-code generation task, requires Multimodal Large Language Models (MLLMs) to perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning. However, this task is inherently under-constrained: multiple valid code implementations can produce the same visual chart, and evaluation must consider both code correctness and visual fidelity across diverse dimensions. This makes it difficult to learn accurate and generalizable mappings through standard supervised fine-tuning. To address these challenges, we propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. Our approach introduces a structured variant generation strategy and a visual reward model to efficiently produce high-quality, aspect-aware preference pairs, making preference collection scalable and supervision more targeted. These preferences are used in an offline reinforcement learning setup to optimize the model toward multi-dimensional fidelity. Experimental results show that our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code that rivals specialized chart-centric models and even some proprietary systems. The code and datasets are publicly available at https://github.com/Zhihan72/Chart2Code.
[58] Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models
Anindya Bijoy Das, Shibbir Ahmed, Shahnewaz Karim Sakib
Main category: cs.CL
TL;DR: Open-source LLMs show promise in clinical summarization but struggle with consistency in identifying follow-up recommendations and suffer from hallucination issues.
Details
Motivation: To evaluate the effectiveness of open-source large language models in extracting key clinical events from discharge reports and assess hallucination prevalence, as accurate summarization is crucial for patient care and treatment outcomes.Method: Comprehensive simulations to evaluate performance of open-source LLMs in extracting admission reasons, major in-hospital events, and follow-up actions from discharge reports, while also assessing various types of hallucinations in the generated summaries.
Result: LLMs (e.g., Qwen2.5 and DeepSeek-v2) perform well in capturing admission reasons and hospitalization events but are less consistent in identifying follow-up recommendations, with notable hallucination issues affecting reliability.
Conclusion: While LLMs show significant potential for clinical summarization, challenges remain in comprehensive information extraction and hallucination control, particularly for follow-up recommendations, highlighting the need for further improvement in reliability.
Abstract: Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, including admission reasons, major in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization. Our results reveal that while the LLMs (e.g., Qwen2.5 and DeepSeek-v2) perform quite well in capturing admission reasons and hospitalization events, they are generally less consistent when it comes to identifying follow-up recommendations, highlighting broader challenges in leveraging LLMs for comprehensive summarization.
[59] Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Dylan Bouchard, Mohit Singh Chauhan
Main category: cs.CL
TL;DR: A framework for zero-resource hallucination detection using uncertainty quantification techniques transformed into confidence scores, with a tunable ensemble approach that outperforms individual methods.
Details
Motivation: Hallucinations in LLMs pose serious risks in high-stakes domains like healthcare and finance, requiring effective detection methods without additional resources.Method: Adapts various uncertainty quantification techniques (black-box UQ, white-box UQ, LLM-as-a-Judge) into standardized confidence scores, proposes tunable ensemble approach, and provides Python toolkit UQLM.
Result: Tunable ensemble typically surpasses individual components and outperforms existing hallucination detection methods across multiple LLM question-answering benchmarks.
Conclusion: Customized hallucination detection strategies improve LLM accuracy and reliability, with the framework offering practical implementation for real-world use cases.
Abstract: Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we outline a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we propose a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper’s companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
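Once each scorer emits a response-level confidence in [0, 1], the tunable ensemble reduces to a weighted combination whose weights are fit on labeled validation data. A numpy illustration of the combination step (the UQLM toolkit’s actual API is not shown in the abstract and is not reproduced here):

```python
import numpy as np

def ensemble_confidence(scores, weights):
    # scores: (n_responses, n_scorers) matrix of [0, 1] confidences from
    # black-box UQ, white-box UQ, LLM-as-a-Judge, etc.; weights are the
    # tunable part, normalized here to sum to one.
    w = np.asarray(weights, dtype=float)
    return np.asarray(scores) @ (w / w.sum())

scores = np.array([[0.9, 0.7, 0.8],   # likely faithful response
                   [0.2, 0.4, 0.1]])  # likely hallucinated response
print(ensemble_confidence(scores, weights=[0.5, 0.3, 0.2]))
```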
[60] Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng
Main category: cs.CL
TL;DR: Critique-GRPO is a reinforcement learning framework that combines numerical rewards with natural language critiques to help LLMs overcome performance plateaus and persistent failures, achieving significant improvements in reasoning tasks.
Details
Motivation: Current RL methods with only numerical feedback face performance plateaus, limited self-reflection effectiveness, and persistent failures in complex reasoning tasks.Method: Proposes Critique-GRPO framework that integrates natural language critiques with numerical feedback for online policy optimization, using a shaping function to amplify learning from correct refinements and penalize incorrect ones.
Result: Outperforms supervised learning and RL-based methods across 8 challenging tasks, with +4.4% average pass@1 improvement on Qwen2.5-7B-Base and +3.8% on Qwen3-8B. Achieved +16.7% improvement on AIME 2024 over GRPO.
Conclusion: Combining natural language critiques with numerical feedback enables effective self-improvement through self-critiquing, overcoming limitations of traditional RL approaches.
Abstract: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks. Specifically, Critique-GRPO improves average pass@1 scores across all compared methods by approximately +4.4% on Qwen2.5-7B-Base and +3.8% on Qwen3-8B. Notably, Critique-GRPO enables effective self-improvement through self-critiquing, achieving significant gains over GRPO, e.g., +16.7% pass@1 improvement on AIME 2024.
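The abstract specifies only the intent of the shaping function: amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. One hypothetical function with those properties (the paper’s actual formulation may differ):

```python
def shaped_reward(reward, is_refinement, seq_prob, beta=1.0):
    # Hypothetical shaping: correct refinements get a larger boost when
    # they are unfamiliar (low probability under the current policy);
    # incorrect refinements are penalized more strongly than usual.
    if not is_refinement:
        return reward
    if reward > 0:
        return reward * (1.0 + beta * (1.0 - seq_prob))
    return reward * (1.0 + beta)

print(shaped_reward(1.0, True, seq_prob=0.05))   # unfamiliar correct refinement
print(shaped_reward(-1.0, True, seq_prob=0.60))  # incorrect refinement
```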
[61] Enhancing Temporal Sensitivity of Large Language Model for Recommendation with Counterfactual Tuning
Yutian Liu, Zhengyi Yang, Jiancan Wu, Xiang Wang
Main category: cs.CL
TL;DR: CETRec is a novel framework that enhances LLM-based sequential recommendation by addressing temporal information limitations through causal inference and counterfactual tuning, improving both absolute and relative order awareness.
Details
Motivation: Existing LLM-based sequential recommendation methods fail to adequately leverage temporal information in user interaction sequences due to architectural constraints of self-attention mechanisms and position embeddings designed for natural language rather than user behavior sequences.Method: Proposes CETRec framework based on causal inference principles to isolate and measure temporal impact, combined with counterfactual tuning task to enhance LLMs’ awareness of absolute order (recency) and relative order (sequential relationships) in user interactions.
Result: Extensive experiments on real-world datasets demonstrate the effectiveness of CETRec in improving temporal awareness and recommendation accuracy.
Conclusion: CETRec successfully addresses the temporal information limitation in LLM-based sequential recommendation through causal inference and counterfactual enhancement, providing better modeling of user preference evolution over time.
Abstract: Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users’ historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose the Counterfactual Enhanced Temporal Framework for LLM-Based Recommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs’ awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items). Extensive experiments on real-world datasets demonstrate the effectiveness of our CETRec. Our code is available at https://anonymous.4open.science/r/CETRec-B9CE/.
[62] Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs
Michal Podstawski
Main category: cs.CL
TL;DR: Using pretrained text embeddings to enhance semantic analysis in labeled property graphs without structural changes, improving node classification and relation prediction accuracy.
Details
Motivation: Labeled property graphs contain rich textual attributes that can enhance analytical tasks, but these are often underutilized. The paper aims to leverage pretrained text embedding models to enable efficient semantic analysis in such graphs.Method: Embed textual node and edge properties using pretrained language models and integrate these embeddings into the graph pipeline without altering the graph structure. This approach supports downstream tasks like node classification and relation prediction.
Result: The approach demonstrates that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis, showing improved performance in node classification and relation prediction tasks.
Conclusion: Integrating pretrained text embeddings into property graph analysis pipelines is an effective method for leveraging textual attributes, leading to enhanced contextual understanding and improved analytical performance without requiring structural modifications to the graph.
Abstract: Labeled property graphs often contain rich textual attributes that can enhance analytical tasks when properly leveraged. This work explores the use of pretrained text embedding models to enable efficient semantic analysis in such graphs. By embedding textual node and edge properties, we support downstream tasks including node classification and relation prediction with improved contextual understanding. Our approach integrates language model embeddings into the graph pipeline without altering its structure, demonstrating that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis.
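Read concretely, the pipeline exports textual node properties, embeds them with a pretrained model, and feeds the vectors to a downstream classifier, leaving the graph itself untouched. A sketch assuming sentence-transformers and scikit-learn, with placeholder node data (the paper does not name specific models):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Placeholder textual properties exported from a labeled property graph.
node_texts = ["Acme Corp, a fintech company", "Jane Doe, software engineer",
              "Globex Inc, logistics provider", "John Roe, data analyst"]
node_labels = ["Company", "Person", "Company", "Person"]

model = SentenceTransformer("all-MiniLM-L6-v2")     # any pretrained embedder
X = model.encode(node_texts)                        # graph structure untouched
clf = LogisticRegression(max_iter=1000).fit(X, node_labels)
print(clf.predict(model.encode(["Initech LLC, enterprise software"])))
```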
[63] Each to Their Own: Exploring the Optimal Embedding in RAG
Shiting Chen, Zijian Zhao, Jinsong Chen
Main category: cs.CL
TL;DR: Two RAG enhancement methods proposed: Mixture-Embedding RAG (combines multiple embedding models) and Confident RAG (selects highest confidence responses from multiple embedding model runs). Confident RAG shows 10% and 5% improvements over vanilla LLMs and RAG respectively.
Details
Motivation: Address the problem that different embedding models in RAG systems produce varying similarity results and response quality due to heterogeneous training data and architectures, leading to inconsistent performance.Method: Proposed two approaches: 1) Mixture-Embedding RAG - sorts and selects retrievals from multiple embedding models using standardized similarity; 2) Confident RAG - generates responses multiple times using different embedding models and selects responses with highest confidence level.
Result: Mixture-Embedding RAG did not outperform vanilla RAG. Confident RAG demonstrated average improvements of ~10% over vanilla LLMs and ~5% over vanilla RAG, with consistent results across different LLMs and embedding models.
Conclusion: Confident RAG is an efficient plug-and-play approach that effectively combines benefits of multiple embedding models to enhance RAG performance across various domains, providing significant quality improvements over baseline methods.
Abstract: Recently, as Large Language Models (LLMs) have fundamentally impacted various fields, the methods for incorporating up-to-date information into LLMs or adding external knowledge to construct domain-specific models have garnered wide attention. Retrieval-Augmented Generation (RAG), serving as an inference-time scaling method, is notable for its low cost and minimal effort for parameter tuning. However, due to heterogeneous training data and model architecture, the variant embedding models used in RAG exhibit different benefits across various areas, often leading to different similarity calculation results and, consequently, varying response quality from LLMs. To address this problem, we propose and examine two approaches to enhance RAG by combining the benefits of multiple embedding models, named Mixture-Embedding RAG and Confident RAG. Mixture-Embedding RAG simply sorts and selects retrievals from multiple embedding models based on standardized similarity; however, it does not outperform vanilla RAG. In contrast, Confident RAG generates responses multiple times using different embedding models and then selects the responses with the highest confidence level, demonstrating average improvements of approximately 10% and 5% over vanilla LLMs and RAG, respectively. The consistent results across different LLMs and embedding models indicate that Confident RAG is an efficient plug-and-play approach for various domains. We will release our code upon publication.
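Confident RAG’s selection rule is simple to state: answer the question once per embedding model and keep the most confident response. A minimal sketch in which `retrieve`, `generate`, and `confidence` are hypothetical stand-ins for the paper’s components:

```python
def confident_rag(question, embedding_models, retrieve, generate, confidence):
    # One RAG pass per embedding model; return the answer whose
    # confidence score is highest.
    best_score, best_answer = float("-inf"), None
    for em in embedding_models:
        docs = retrieve(question, em)         # retrieval with this embedder
        answer = generate(question, docs)     # LLM answer grounded in docs
        score = confidence(question, answer)  # response-level confidence
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer
```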
[64] Is neural semantic parsing good at ellipsis resolution, or isn’t it?
Xiao Zhang, Johan Bos
Main category: cs.CL
TL;DR: Neural semantic parsers perform poorly on verb phrase ellipsis despite high overall performance, but data augmentation helps improve results.
Details
Motivation: To evaluate how well neural semantic parsers handle strongly context-sensitive phenomena like verb phrase ellipsis, where semantic information needs duplication.Method: Constructed a corpus of 120 ellipsis cases with fully resolved meaning representations and tested various neural semantic parsers on this challenge set.
Result: Parsers performed well on standard test sets but failed on ellipsis instances. Data augmentation improved parsing results.
Conclusion: The difficulty with ellipsis parsing stems from linguistically complex contexts rather than the copying mechanism itself.
Abstract: Neural semantic parsers have shown good overall performance for a variety of linguistic phenomena, reaching semantic matching scores of more than 90%. But how do such parsers perform on strongly context-sensitive phenomena, where large pieces of semantic information need to be duplicated to form a meaningful semantic representation? A case in point is English verb phrase ellipsis, a construct where entire verb phrases can be abbreviated by a single auxiliary verb. Are these otherwise powerful semantic parsers able to deal with ellipsis, or aren’t they? We constructed a corpus of 120 cases of ellipsis with their fully resolved meaning representations and used this as a challenge set for a large battery of neural semantic parsers. Although these parsers performed very well on the standard test set, they failed on the instances with ellipsis. Data augmentation helped improve the parsing results. The reason for the difficulty of parsing elided phrases is not that copying semantic material is hard, but that elided phrases usually occur in linguistically complicated contexts, and these contexts cause most of the parsing errors.
[65] Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
Kartikeya Badola, Jonathan Simon, Arian Hosseini, Sara Marie Mc Carthy, Tsendsuren Munkhdalai, Abhimanyu Goyal, Tomáš Kočiský, Shyam Upadhyay, Bahare Fatemi, Mehran Kazemi
Main category: cs.CL
TL;DR: A new benchmark for evaluating LLMs’ multi-turn dialogue, reasoning, and information-seeking abilities in interactive scenarios with deterministic scoring.
Details
Motivation: LLMs struggle with nuanced environments and interactive tasks common in real-world scenarios, highlighting the need for models that can engage in logically consistent multi-turn dialogue and reason with incomplete data.Method: Introduce a novel benchmark comprising a suite of multi-turn tasks designed to test specific reasoning, interactive dialogue, and information-seeking abilities with deterministic scoring mechanisms.
Result: Evaluation of frontier models shows significant headroom for improvement, with most errors stemming from poor instruction following, reasoning failures, and poor planning.
Conclusion: The benchmark provides valuable insights into current LLMs’ strengths/weaknesses in complex interactive scenarios and offers a robust platform for future research to improve these critical capabilities.
Abstract: Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.
[66] Investigating Transcription Normalization in the Faetar ASR Benchmark
Leo Peckham, Michael Ong, Naomi Nagy, Ewan Dunbar
Main category: cs.CL
TL;DR: Analysis of transcription inconsistencies in Faetar ASR benchmark shows they are not the main challenge, with lexicon-constrained decoding being beneficial but task remains very difficult.
Details
Motivation: To examine the role of transcription inconsistencies in the challenging low-resource Faetar Automatic Speech Recognition benchmark and understand their impact on ASR performance.Method: Used a small hand-constructed lexicon to analyze transcription inconsistencies, tested bigram word-based language modeling, and experimented with lexicon-constrained decoding approaches.
Result: Found that transcription inconsistencies exist but are not the main challenge, bigram word-based language modeling provides no benefit, but lexicon-constrained decoding can be beneficial. The task remains extremely difficult.
Conclusion: While transcription inconsistencies are present in the Faetar ASR benchmark, they are not the primary obstacle. The task’s extreme difficulty persists despite some benefits from lexicon-constrained decoding, indicating fundamental challenges in low-resource ASR.
Abstract: We examine the role of transcription inconsistencies in the Faetar Automatic Speech Recognition benchmark, a challenging low-resource ASR benchmark. With the help of a small, hand-constructed lexicon, we find that, while inconsistencies do exist in the transcriptions, they are not the main challenge in the task. We also demonstrate that bigram word-based language modelling is of no added benefit, but that constraining decoding to a finite lexicon can be beneficial. The task remains extremely difficult.
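One simplified way to picture lexicon constraints is as snapping hypotheses onto a finite vocabulary; the paper constrains the decoder itself, which is stronger than this post-hoc stand-in. A sketch with placeholder word forms rather than real Faetar data:

```python
from difflib import get_close_matches

LEXICON = ["casa", "fest", "paj"]  # placeholder forms, not the actual lexicon

def lexicon_constrain(hypothesis_words):
    # Snap each hypothesized word to its closest lexicon entry; with
    # cutoff=0.0 a match is always returned when the lexicon is non-empty.
    out = []
    for w in hypothesis_words:
        match = get_close_matches(w, LEXICON, n=1, cutoff=0.0)
        out.append(match[0] if match else w)
    return out

print(lexicon_constrain(["cassa", "fst"]))  # -> ['casa', 'fest']
```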
[67] STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples
Haiquan Hu, Jiazhi Jiang, Shiyou Xu, Ruhan Zeng, Tian Wang
Main category: cs.CL
TL;DR: STEM is a lightweight evaluation framework that uses significant transition samples from models of the same architecture but different scales to efficiently estimate LLM capabilities, addressing benchmark overfitting and high evaluation costs.
Details
Motivation: Standard benchmarks are becoming less effective due to model overfitting and high computational costs, making it difficult to distinguish meaningful differences between rapidly advancing LLMs.Method: STEM identifies significant transition samples (STS) by analyzing performance transitions among LLMs of the same architecture but varying parameter scales, then uses these samples to estimate unknown model capabilities.
Result: STEM reliably captures performance trends and aligns with ground-truth rankings of model capability across six diverse benchmarks using the Qwen3 model family.
Conclusion: STEM provides a practical and scalable method for fine-grained, architecture-agnostic evaluation of LLMs, offering an efficient alternative to full benchmark evaluations.
Abstract: Evaluating large language models (LLMs) has become increasingly challenging as model capabilities advance rapidly. While recent models often achieve higher scores on standard benchmarks, these improvements do not consistently reflect enhanced real-world reasoning capabilities. Moreover, widespread overfitting to public benchmarks and the high computational cost of full evaluations have made it both expensive and less effective to distinguish meaningful differences between models. To address these challenges, we propose the Structured Transition Evaluation Method (STEM), a lightweight and interpretable evaluation framework for efficiently estimating the relative capabilities of LLMs. STEM identifies significant transition samples (STS) by analyzing consistent performance transitions among LLMs of the same architecture but varying parameter scales. These samples enable STEM to effectively estimate the capability position of an unknown model. The Qwen3 model family is applied to construct the STS pool on six diverse and representative benchmarks and to assess generalizability. Experimental results indicate that STEM reliably captures performance trends and aligns with ground-truth rankings of model capability. These findings highlight STEM as a practical and scalable method for fine-grained, architecture-agnostic evaluation of LLMs.
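Under one plausible reading of “consistent performance transitions”, an STS is a sample that flips from wrong to right exactly once as parameter scale grows within an architecture family. A numpy sketch of that selection rule (an interpretation, not the paper’s formal definition):

```python
import numpy as np

def significant_transition_samples(correct):
    # correct: (n_samples, n_models) 0/1 matrix, models ordered by
    # increasing scale. Keep samples whose correctness never decreases
    # and flips from 0 to 1 exactly once.
    c = np.asarray(correct, dtype=int)
    diffs = np.diff(c, axis=1)
    is_sts = (diffs >= 0).all(axis=1) & (diffs.sum(axis=1) == 1)
    return np.flatnonzero(is_sts)

demo = [[0, 0, 1, 1],   # clean transition -> STS
        [1, 0, 1, 1],   # non-monotone    -> not STS
        [0, 0, 0, 0]]   # never solved    -> not STS
print(significant_transition_samples(demo))  # [0]
```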
[68] CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description
Shaoming Duan, Zirui Wang, Chuanyi Liu, Zhibin Zhu, Yuhao Zhang, Peiyi Han, Liang Yan, Zewu Peng
Main category: cs.CL
TL;DR: CRED-SQL is a framework that addresses semantic mismatch in Text-to-SQL systems for large databases using cluster-based schema retrieval and an intermediate Execution Description Language to improve accuracy.
Details
Motivation: Semantic mismatch between natural language questions and SQL queries in large databases causes schema linking issues and semantic drift, reducing model accuracy.Method: Uses cluster-based large-scale schema retrieval to identify relevant tables/columns, then introduces Execution Description Language (EDL) as intermediate representation to decompose task into Text-to-EDL and EDL-to-SQL stages.
Result: Achieves state-of-the-art performance on SpiderUnion and BirdUnion benchmarks, demonstrating effectiveness and scalability.
Conclusion: CRED-SQL successfully addresses semantic mismatch challenges in large-scale databases through innovative schema retrieval and intermediate language representation, setting new SOTA performance.
Abstract: Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and cause semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation, the Execution Description Language (EDL), to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages, Text-to-EDL and EDL-to-SQL, leveraging LLMs’ strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks, SpiderUnion and BirdUnion, demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at https://github.com/smduan/CRED-SQL.git
[69] Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen
Main category: cs.CL
TL;DR: SvS strategy maintains policy entropy in RLVR training by using self-play with variational problem synthesis, improving Pass@k performance significantly across multiple reasoning benchmarks.
Details
Motivation: Vanilla RLVR training improves Pass@1 but reduces policy entropy and generation diversity, limiting Pass@k performance which represents the upper bound of LLM reasoning capability.Method: Propose online Self-play with Variational problem Synthesis (SvS) strategy that uses policy’s correct solutions to synthesize variational problems while keeping reference answers identical to originals.
Result: Achieves absolute gains of 18.3% and 22.8% in Pass@32 on AIME24 and AIME25 benchmarks, with consistent improvements across 12 reasoning benchmarks from 3B to 32B model sizes.
Conclusion: SvS effectively maintains policy entropy during RLVR training, sustains prolonged improvements, and demonstrates generalizability and robustness across various model sizes and benchmarks.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy’s generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy’s correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
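For reference, Pass@k numbers like those quoted above are conventionally computed with the standard unbiased estimator introduced with the Codex evaluation (Chen et al., 2021); a minimal sketch:

```python
# Unbiased Pass@k estimator: given n sampled solutions of which c are correct,
# estimate the probability that at least one of k samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """1 - C(n - c, k) / C(n, k), the standard unbiased estimator."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 samples per problem, 8 correct, evaluated at k = 32 and k = 1.
print(pass_at_k(32, 8, 32))  # 1.0 (some sample in any 32-subset is correct)
print(pass_at_k(32, 8, 1))   # 0.25, i.e., Pass@1 equals the success rate
```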
cs.CV
[70] A comparative study of some wavelet and sampling operators on various features of an image
Digvijay Singh, Rahul Shukla, Karunesh Kumar Singh
Main category: cs.CV
TL;DR: Analysis of positive sampling Kantorovich operators and their convergence properties, comparing SK, Gaussian, Bilateral, and wavelet-based operators for image approximation with various error metrics.
Details
Motivation: To study the approximation properties of different sampling Kantorovich operators and evaluate their performance in image processing applications under various conditions.Method: Introduced basic terminology and fundamental approximation theorem, then analyzed multiple operators (SK, Gaussian, Bilateral, wavelet-based) using error metrics like MSE, SI, SSI, SMPI, and ENL at different resolution levels. Conducted numerical examples including 2D Shepp-Logan Phantom analysis.
Result: Different operators showed varying performance for different image features due to the uneven nature of images. Some operators worked well for specific features while others did not, demonstrating that no single operator is universally optimal.
Conclusion: Various sampling Kantorovich operators have their own significance depending on the specific image features being analyzed, with performance varying under non-ideal conditions, justifying the need for selective operator application based on image characteristics.
Abstract: This research includes the study of some positive sampling Kantorovich operators (SK operators) and their convergence properties. A comprehensive analysis of both local and global approximation properties is presented using sampling Kantorovich (SK), Gaussian, Bilateral and thresholding wavelet-based operators in the framework of SK operators. Explicitly, we start the article by introducing the basic terminology and stating the fundamental theorem of approximation (FTA), imposing the various conditions required by the different defined operators. We measure the error and study other mathematical parameters such as the mean square error (MSE), the speckle index (SI), the speckle suppression index (SSI), the speckle mean preservation index (SMPI), and the equivalent number of looks (ENL) at various levels of resolution parameters. The nature of these operators is demonstrated via an example under ideal conditions, tabulated at a certain number of samples. Eventually, another numerical example is illustrated to discuss the region of interest (ROI) via SI, SSI and SMPI of a 2D Shepp-Logan Phantom slice taken from the 3D image, which justifies the fundamental theorem of approximation (FTA). From the derivations and illustrations we observe that, because of the uneven nature of an image (non-ideal condition), the various operators have their own significance when studying different features of the image. Therefore, to some extent, some operators work well for specific image features and some do not.
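For orientation, the speckle metrics named above are commonly defined globally as SI = sigma/mu, SSI = (sigma_f/mu_f)(mu_o/sigma_o), and ENL = (mu/sigma)^2; the paper may use local or windowed variants. A numpy sketch of these common definitions:

```python
# Common global definitions of the speckle metrics mentioned above
# (the paper may use local/windowed variants): lower SI and SSI and
# higher ENL generally indicate stronger speckle suppression.
import numpy as np

def speckle_index(img: np.ndarray) -> float:
    return float(img.std() / img.mean())          # SI = sigma / mu

def speckle_suppression_index(filtered: np.ndarray, original: np.ndarray) -> float:
    # SSI = (sigma_f / mu_f) * (mu_o / sigma_o); below 1 means speckle reduced.
    return speckle_index(filtered) / speckle_index(original)

def equivalent_number_of_looks(img: np.ndarray) -> float:
    return float((img.mean() / img.std()) ** 2)   # ENL = (mu / sigma)^2

rng = np.random.default_rng(0)
original = rng.gamma(shape=4.0, scale=25.0, size=(128, 128))  # toy speckled image
filtered = original + 0.5 * (original.mean() - original)      # toy smoothing
print(speckle_index(original), speckle_suppression_index(filtered, original),
      equivalent_number_of_looks(filtered))
```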
[71] Federated Action Recognition for Smart Worker Assistance Using FastPose
Vinit Hegiste, Vidit Goyal, Tatjana Legler, Martin Ruskowski
Main category: cs.CV
TL;DR: Federated learning framework for pose-based human activity recognition in industrial settings, showing significant improvements in accuracy and cross-user generalization while preserving privacy.
Details
Motivation: Need for accurate real-time worker action recognition in smart manufacturing while addressing privacy concerns that make centralized data collection impractical.Method: Developed FL framework using custom skeletal dataset of 8 industrial upper-body gestures, processed with modified FastPose model. Evaluated LSTM and Transformer backbones under four paradigms: centralized, local, FL with FedAvg, and federated ensemble learning.
Result: FL Transformer improved over centralized training by +12.4pp, FedEnsemble by +16.3pp. On unseen external client, FL and FedEnsemble exceeded centralized accuracy by +52.6pp and +58.3pp respectively.
Conclusion: FL not only preserves privacy but significantly enhances cross-user generalization, making it a practical solution for scalable, privacy-aware HAR in heterogeneous industrial environments.
Abstract: In smart manufacturing environments, accurate and real-time recognition of worker actions is essential for productivity, safety, and human-machine collaboration. While skeleton-based human activity recognition (HAR) offers robustness to lighting, viewpoint, and background variations, most existing approaches rely on centralized datasets, which are impractical in privacy-sensitive industrial scenarios. This paper presents a federated learning (FL) framework for pose-based HAR using a custom skeletal dataset of eight industrially relevant upper-body gestures, captured from five participants and processed using a modified FastPose model. Two temporal backbones, an LSTM and a Transformer encoder, are trained and evaluated under four paradigms: centralized, local (per-client), FL with weighted federated averaging (FedAvg), and federated ensemble learning (FedEnsemble). On the global test set, the FL Transformer improves over centralized training by +12.4 percentage points, with FedEnsemble delivering a +16.3 percentage points gain. On an unseen external client, FL and FedEnsemble exceed centralized accuracy by +52.6 and +58.3 percentage points, respectively. These results demonstrate that FL not only preserves privacy but also substantially enhances cross-user generalization, establishing it as a practical solution for scalable, privacy-aware HAR in heterogeneous industrial settings.
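The FedAvg aggregation used in the third paradigm is a sample-count-weighted average of client parameters; a minimal generic sketch (standard FedAvg, not the authors' code):

```python
# Weighted federated averaging (FedAvg): the server averages client model
# parameters, weighting each client by its number of local training samples.
import numpy as np

def fedavg(client_weights: list[dict], client_sizes: list[int]) -> dict:
    total = float(sum(client_sizes))
    keys = client_weights[0].keys()
    return {
        k: sum((n / total) * w[k] for w, n in zip(client_weights, client_sizes))
        for k in keys
    }

# Toy example: two clients, one parameter tensor each.
clients = [{"linear.weight": np.ones((2, 2))}, {"linear.weight": np.zeros((2, 2))}]
sizes = [300, 100]  # client 1 holds 3x the data, so its weights dominate
print(fedavg(clients, sizes)["linear.weight"])  # -> all entries 0.75
```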
[72] LENS: Learning to Segment Anything with Unified Reinforced Reasoning
Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, Xinggang Wang
Main category: cs.CV
TL;DR: LENS is a reinforcement learning framework that integrates chain-of-thought reasoning with image segmentation, achieving state-of-the-art performance on text-prompted segmentation benchmarks.
Details
Motivation: Existing supervised fine-tuning methods ignore explicit chain-of-thought reasoning at test time, limiting generalization to unseen prompts and domains.Method: Scalable RL framework that jointly optimizes reasoning process and segmentation with unified rewards spanning sentence-, box-, and segment-level cues.
Result: Achieves 81.2% average cIoU on RefCOCO benchmarks, outperforming GLaMM by up to 5.6% using Qwen2.5-VL-3B-Instruct model.
Conclusion: RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and enables more generalizable Segment Anything models.
Abstract: Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at https://github.com/hustvl/LENS.
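The unified reward can be pictured as a weighted sum of sentence-, box-, and segment-level terms. The toy sketch below assumes simple IoU-based components and arbitrary weights, which may differ from the actual rewards in LENS:

```python
# Toy sketch of a unified RL reward spanning sentence-, box-, and segment-level
# cues; the exact components and weights in LENS may differ.
import numpy as np

def box_iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / (union + 1e-9))

def unified_reward(has_cot: bool, pred_box, gt_box, pred_mask, gt_mask,
                   w=(0.2, 0.3, 0.5)) -> float:
    r_sentence = 1.0 if has_cot else 0.0        # well-formed CoT rationale present
    r_box = box_iou(pred_box, gt_box)           # localization quality
    r_segment = mask_iou(pred_mask, gt_mask)    # mask quality
    return w[0] * r_sentence + w[1] * r_box + w[2] * r_segment

pred_m = np.zeros((8, 8), bool); pred_m[2:6, 2:6] = True
gt_m = np.zeros((8, 8), bool); gt_m[3:7, 3:7] = True
print(unified_reward(True, (2, 2, 6, 6), (3, 3, 7, 7), pred_m, gt_m))
```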
[73] RynnEC: Bringing MLLMs into Embodied World
Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao
Main category: cs.CV
TL;DR: RynnEC is a compact video multimodal language model for embodied cognition that achieves SOTA performance in object understanding, segmentation, and spatial reasoning through region-level video interaction.
Details
Motivation: To develop a general-purpose cognitive core for embodied agents that provides fine-grained perception of the physical world and enables precise interactions, addressing the scarcity of annotated 3D datasets.Method: Built on a vision-language foundation model with region encoder and mask decoder for flexible region-level video interaction. Uses egocentric video pipeline to generate embodied cognition data and introduces RynnEC-Bench benchmark.
Result: Achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning despite compact architecture.
Conclusion: RynnEC advances embodied agent development with region-centric video paradigm and facilitates generalization across diverse embodied tasks, with code and benchmarks publicly available.
Abstract: We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC
[74] Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer
Md Ashiqur Rahman, Chiao-An Yang, Michael N. Cheng, Lim Jun Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh
Main category: cs.CV
TL;DR: A deep equilibrium canonicalizer (DEC) method improves local scale equivariance in vision models to handle object scale variations, enhancing performance and scale consistency across multiple architectures like ViT, DeiT, Swin, and BEiT on ImageNet.
Details
Motivation: Scale variation is a fundamental challenge in computer vision where objects of the same class can have different sizes and perceived sizes affected by camera distance. These local scale variations require models to be scale-equivariant for better performance.Method: Proposes a deep equilibrium canonicalizer (DEC) that can be easily incorporated into existing network architectures and adapted to pre-trained models to improve local scale equivariance.
Result: DEC improves both model performance and local scale consistency across four popular pre-trained deep networks (ViT, DeiT, Swin, and BEiT) on the competitive ImageNet benchmark.
Conclusion: The DEC method effectively addresses scale variation challenges in computer vision by enhancing local scale equivariance, demonstrating improved performance and consistency across multiple state-of-the-art vision architectures.
Abstract: Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT. Our code is available at https://github.com/ashiq24/local-scale-equivariance.
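A deep equilibrium module computes its output as the fixed point of a learned map rather than the output of a fixed layer stack; a generic PyTorch sketch of the idea (DEQ-style iteration, not the DEC architecture itself):

```python
# Minimal deep-equilibrium-style fixed point: iterate z = f(z, x) to
# convergence instead of stacking a fixed number of layers.
import torch
import torch.nn as nn

class FixedPointLayer(nn.Module):
    def __init__(self, dim: int, tol: float = 1e-4, max_iter: int = 50):
        super().__init__()
        self.linear_z = nn.Linear(dim, dim)
        self.linear_x = nn.Linear(dim, dim)
        self.tol, self.max_iter = tol, max_iter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.zeros_like(x)
        for _ in range(self.max_iter):
            z_next = torch.tanh(self.linear_z(z) + self.linear_x(x))
            if (z_next - z).norm() < self.tol:
                break
            z = z_next
        return z_next  # equilibrium representation

layer = FixedPointLayer(dim=16)
x = torch.randn(4, 16)
print(layer(x).shape)  # torch.Size([4, 16])
```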
[75] CLIPSym: Delving into Symmetry Detection with CLIP
Tinghan Yang, Md Ashiqur Rahman, Raymond A. Yeh
Main category: cs.CV
TL;DR: CLIPSym leverages CLIP’s vision-language capabilities with a novel rotation-equivariant decoder and semantic-aware prompting to detect rotation and reflection symmetries, outperforming state-of-the-art methods on standard datasets.
Details
Motivation: Symmetry is a fundamental geometric cue in computer vision, and recent advances in vision-language models like CLIP offer potential to improve symmetry detection by leveraging semantic cues from natural image descriptions.Method: Proposes CLIPSym with CLIP’s image/language encoders, rotation-equivariant decoder using Transformer and G-Convolution, and Semantic-Aware Prompt Grouping (SAPG) to aggregate object-based prompts for better semantic integration.
Result: Outperforms current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS), with ablations confirming benefits of CLIP pre-training, equivariant decoder, and SAPG technique.
Conclusion: CLIPSym successfully demonstrates that pre-trained vision-language models can significantly enhance symmetry detection by effectively combining geometric and semantic cues through novel architectural components and prompting strategies.
Abstract: Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models, i.e., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in the natural image descriptions. We propose CLIPSym, which leverages CLIP’s image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and G-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP’s language encoder, we have developed a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP’s pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at https://github.com/timyoung2333/CLIPSym.
[76] Identity Preserving 3D Head Stylization with Multiview Score Distillation
Bahri Batuhan Bilecen, Ahmet Berke Gokmen, Furkan Guzelant, Aysegul Dundar
Main category: cs.CV
TL;DR: A novel 3D head stylization framework using PanoHead with negative log-likelihood distillation to improve identity preservation and stylization quality from 360-degree perspectives.
Details
Motivation: Current 3D stylization methods mainly provide near-frontal views and fail to preserve subject identities, resulting in outputs lacking diversity and individuality.Method: Leverages PanoHead for 360-degree synthesis, employs negative log-likelihood distillation, integrates multi-view grid score and mirror gradients in 3D GAN architecture, and introduces score rank weighing technique.
Result: Achieves substantial qualitative and quantitative improvements in identity preservation and stylization quality.
Conclusion: Advances 3D head stylization state and provides insights into effective distillation processes between diffusion models and GANs, focusing on identity preservation.
Abstract: 3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across gaming and virtual reality applications. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. This paper addresses these challenges by leveraging the PanoHead model, synthesizing images from a comprehensive 360-degree perspective. We propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Please visit https://three-bee.github.io/head_stylization for more visuals.
[77] A Survey on Video Anomaly Detection via Deep Learning: Human, Vehicle, and Environment
Ghazal Alinezhad Noghre, Armin Danesh Pazho, Hamed Tabkhi
Main category: cs.CV
TL;DR: This survey provides a comprehensive overview of Video Anomaly Detection (VAD), systematically organizing literature across supervision levels, adaptive learning methods, and three major application categories to advance theoretical understanding and real-world applicability.
Details
Motivation: Video Anomaly Detection is a pivotal computer vision task with broad relevance, but the field remains fragmented across domains and learning paradigms, requiring a comprehensive perspective to consolidate insights and identify fundamental contributions and limitations.Method: The survey systematically organizes VAD literature across various supervision levels, adaptive learning methods (online, active, continual learning), and three major application categories: human-centric, vehicle-centric, and environment-centric scenarios.
Result: The survey provides a structured foundation for advancing both theoretical understanding and real-world applicability of VAD systems, identifying fundamental contributions and limitations of current methodologies.
Conclusion: This comprehensive survey serves as a useful reference for researchers while drawing attention to broader open challenges in anomaly detection, including both fundamental research questions and practical obstacles to real-world deployment.
Abstract: Video Anomaly Detection (VAD) has emerged as a pivotal task in computer vision, with broad relevance across multiple fields. Recent advances in deep learning have driven significant progress in this area, yet the field remains fragmented across domains and learning paradigms. This survey offers a comprehensive perspective on VAD, systematically organizing the literature across various supervision levels, as well as adaptive learning methods such as online, active, and continual learning. We examine the state of VAD across three major application categories: human-centric, vehicle-centric, and environment-centric scenarios, each with distinct challenges and design considerations. In doing so, we identify fundamental contributions and limitations of current methodologies. By consolidating insights from subfields, we aim to provide the community with a structured foundation for advancing both theoretical understanding and real-world applicability of VAD systems. This survey aims to support researchers by providing a useful reference, while also drawing attention to the broader set of open challenges in anomaly detection, including both fundamental research questions and practical obstacles to real-world deployment.
[78] Interpreting the linear structure of vision-language model embedding spaces
Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham Kakade, Stephanie Gil
Main category: cs.CV
TL;DR: Sparse autoencoders reveal that vision-language models organize concepts in a sparse linear structure where modality-specific concepts collaborate through latent bridges to support cross-modal integration, with stable common concepts and variable rare concepts.
Details
Motivation: To understand how vision-language models organize language and images in joint embedding spaces, and investigate how meaning and modality are encoded in these multimodal models.Method: Trained sparse autoencoders (SAEs) on embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, AIMv2) to approximate embeddings as sparse linear combinations of learned concepts, and introduced Bridge Score metric to quantify cross-modal integration.
Result: SAEs achieve better reconstruction with more sparsity than other linear methods; rare concepts vary across runs while common concepts are stable; concepts encode cross-modal semantics rather than just modality; single-modality concepts collaborate through latent bridges for cross-modal integration.
Conclusion: Vision-language models use a sparse linear structure where modality-shaped concepts are stitched together through latent bridges, providing new insights into how multimodal meaning is constructed in joint embedding spaces.
Abstract: Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or “concepts”. We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings while also retaining the most sparsity. Retraining SAEs with different seeds or a different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that commonly-activating concepts are remarkably stable across runs. Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
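The SAEs described here decompose an embedding into a sparse, nonnegative combination of learned concept directions; a minimal generic PyTorch sketch of one training step (details of the released SAEs may differ):

```python
# Minimal sparse autoencoder: reconstruct embeddings as sparse, nonnegative
# combinations of learned "concept" directions (the decoder columns).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed: int, n_concepts: int):
        super().__init__()
        self.encoder = nn.Linear(d_embed, n_concepts)
        self.decoder = nn.Linear(n_concepts, d_embed, bias=False)

    def forward(self, x: torch.Tensor):
        acts = torch.relu(self.encoder(x))   # sparse concept activations
        recon = self.decoder(acts)           # sparse linear reconstruction
        return recon, acts

sae = SparseAutoencoder(d_embed=512, n_concepts=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
x = torch.randn(64, 512)                     # stand-in for VLM embeddings

recon, acts = sae(x)
l1_coef = 1e-3                               # sparsity/reconstruction trade-off
loss = ((recon - x) ** 2).mean() + l1_coef * acts.abs().mean()
loss.backward()
opt.step()
print(loss.item(), (acts > 0).float().mean().item())  # loss, activation density
```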
[79] Accelerating Image Classification with Graph Convolutional Neural Networks using Voronoi Diagrams
Mustafa Mohammadi Gharasuie, Luis Rueda
Main category: cs.CV
TL;DR: Novel image classification framework combining Graph Convolutional Networks with Voronoi diagrams, using graph-based image representation and Delaunay triangulations for improved efficiency and accuracy.
Details
Motivation: To leverage GCNs' capability to model relational data for image classification, addressing limitations of conventional CNNs by treating images as graphs with spatial relationships.Method: Proposed normalized Voronoi Graph Convolution Network (NVGCN) that represents images as graphs where pixels/regions are vertices, simplified using Delaunay triangulations derived from Voronoi diagrams.
Result: Significant improvement in pre-processing time and classification accuracy on benchmark datasets, outperforming state-of-the-art models, especially for complex scenes and fine-grained categories.
Conclusion: Integration of GCNs with Voronoi diagrams shows strong potential for advancing image classification and opens new avenues for graph-based learning in computer vision and unstructured data domains.
Abstract: Recent advances in image classification have been significantly propelled by the integration of Graph Convolutional Networks (GCNs), offering a novel paradigm for handling complex data structures. This study introduces an innovative framework that employs GCNs in conjunction with Voronoi diagrams to perform image classification, leveraging their exceptional capability to model relational data. Unlike conventional convolutional neural networks, our approach utilizes a graph-based representation of images, where pixels or regions are treated as vertices of a graph, which are then simplified in the form of the corresponding Delaunay triangulations. Our model yields significant improvement in pre-processing time and classification accuracy on several benchmark datasets, surpassing existing state-of-the-art models, especially in scenarios that involve complex scenes and fine-grained categories. The experimental results, validated via cross-validation, underscore the potential of integrating GCNs with Voronoi diagrams in advancing image classification tasks. This research contributes to the field by introducing a novel approach to image classification, while opening new avenues for developing graph-based learning paradigms in other domains of computer vision and non-structured data. In particular, we have proposed a new version of the GCN in this paper, namely the normalized Voronoi Graph Convolution Network (NVGCN), which is faster than the regular GCN.
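The Delaunay triangulation, as the dual of the Voronoi diagram, yields the graph adjacency directly; an illustrative scipy sketch of building a GCN-ready adjacency from region centroids (not the NVGCN code, and with toy random centroids):

```python
# Build a graph from image region centroids via Delaunay triangulation
# (the dual of the Voronoi diagram): triangle edges become graph edges.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
centroids = rng.uniform(0, 224, size=(30, 2))   # toy region/superpixel centers

tri = Delaunay(centroids)
n = len(centroids)
adj = np.zeros((n, n), dtype=np.float32)
for simplex in tri.simplices:                    # each simplex is a triangle
    for i in range(3):
        a, b = simplex[i], simplex[(i + 1) % 3]
        adj[a, b] = adj[b, a] = 1.0

# Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, as in standard GCNs.
a_hat = adj + np.eye(n, dtype=np.float32)
d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
print(a_norm.shape, int(adj.sum() // 2), "Delaunay edges")
```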
[80] Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models
Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu
Main category: cs.CV
TL;DR: A simple but efficient learning mechanism that improves multimodal alignment by solving shuffling problems through image and text order reconstruction tasks, achieving state-of-the-art performance.
Details
Motivation: Large multimodal models suffer from robustness and generalization limitations due to imperfect alignment between visual and textual features.Method: Introduces two new tasks: reconstructing image order and text order during pre-training and fine-tuning, plus a directed-token approach and Image-to-Response Guided loss to improve visual understanding.
Result: Consistently achieves state-of-the-art performance on academic task-oriented and instruction-following LMM benchmarks.
Conclusion: The proposed approach effectively improves reasoning capability, visual understanding, and cross-modality alignment in large multimodal models.
Abstract: Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM’s pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.
[81] Effect of Data Augmentation on Conformal Prediction for Diabetic Retinopathy
Rizwan Ahamed, Annahita Amireskandari, Joel Palko, Carol Laxson, Binod Bhattarai, Prashnna Gyawali
Main category: cs.CV
TL;DR: Data augmentation strategies significantly impact conformal prediction performance for diabetic retinopathy grading, with Mixup and CutMix improving both accuracy and uncertainty reliability, while CLAHE can negatively affect model certainty.
Details
Motivation: Deep learning models for medical tasks like diabetic retinopathy grading need reliable uncertainty quantification for clinical deployment. Current models lack robust uncertainty measures, and the interaction between standard training practices like data augmentation and conformal prediction guarantees is not well understood.Method: Systematic evaluation of five data augmentation strategies (no augmentation, geometric transforms, CLAHE, Mixup, CutMix) on two backbone architectures (ResNet-50 and CoaT transformer) using the DDR dataset. Analyzed conformal prediction metrics including empirical coverage, average prediction set size, and correct efficiency.
Result: Sample-mixing strategies (Mixup and CutMix) improved both predictive accuracy and produced more reliable and efficient uncertainty estimates. Conversely, CLAHE negatively impacted model certainty. The choice of augmentation strategy significantly affects conformal prediction performance.
Conclusion: Augmentation strategies must be co-designed with downstream uncertainty quantification to build trustworthy AI systems for medical imaging. Mixup and CutMix are particularly effective for both accuracy and reliable uncertainty estimation in diabetic retinopathy grading.
Abstract: The clinical deployment of deep learning models for high-stakes tasks such as diabetic retinopathy (DR) grading requires demonstrable reliability. While models achieve high accuracy, their clinical utility is limited by a lack of robust uncertainty quantification. Conformal prediction (CP) offers a distribution-free framework to generate prediction sets with statistical guarantees of coverage. However, the interaction between standard training practices like data augmentation and the validity of these guarantees is not well understood. In this study, we systematically investigate how different data augmentation strategies affect the performance of conformal predictors for DR grading. Using the DDR dataset, we evaluate two backbone architectures, ResNet-50 and a Co-Scale Conv-Attentional Transformer (CoaT), trained under five augmentation regimes: no augmentation, standard geometric transforms, CLAHE, Mixup, and CutMix. We analyze the downstream effects on conformal metrics, including empirical coverage, average prediction set size, and correct efficiency. Our results demonstrate that sample-mixing strategies like Mixup and CutMix not only improve predictive accuracy but also yield more reliable and efficient uncertainty estimates. Conversely, methods like CLAHE can negatively impact model certainty. These findings highlight the need to co-design augmentation strategies with downstream uncertainty quantification in mind to build genuinely trustworthy AI systems for medical imaging.
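Split conformal prediction, the framework evaluated here, calibrates a score threshold on held-out data and returns every label whose score stays below it; a minimal numpy sketch of the standard procedure (synthetic stand-in data):

```python
# Split conformal prediction for classification: calibrate a score threshold
# so prediction sets cover the true label with probability >= 1 - alpha.
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_classes, alpha = 500, 5, 0.1           # 5 DR grades, 90% target coverage

# Stand-ins for softmax outputs and labels of a held-out calibration set.
probs_cal = rng.dirichlet(np.ones(n_classes), size=n_cal)
labels_cal = rng.integers(0, n_classes, size=n_cal)

# Nonconformity score: 1 - probability assigned to the true class.
scores = 1.0 - probs_cal[np.arange(n_cal), labels_cal]
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
qhat = np.quantile(scores, q_level, method="higher")

def prediction_set(probs_test: np.ndarray) -> np.ndarray:
    """All classes whose score 1 - p stays below the calibrated threshold."""
    return np.where(1.0 - probs_test <= qhat)[0]

print(qhat, prediction_set(rng.dirichlet(np.ones(n_classes))))
```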
[82] Tooth-Diffusion: Guided 3D CBCT Synthesis with Fine-Grained Tooth Conditioning
Said Djafar Said, Torkan Gholamalizadeh, Mostafa Mehdipour Ghazi
Main category: cs.CV
TL;DR: A novel conditional diffusion framework for 3D dental CBCT scan generation with precise tooth-level control using binary attributes for tooth presence and configuration.
Details
Motivation: Despite the importance of dental CBCT scans for diagnosis and treatment planning, generating anatomically realistic scans with fine-grained control remains challenging in medical image synthesis.Method: Integrates wavelet-based denoising diffusion, FiLM conditioning, and masked loss functions to focus learning on relevant anatomical structures for 3D dental volume generation.
Result: Strong fidelity and generalization with low FID scores, robust inpainting performance, and SSIM values above 0.91 even on unseen scans across tasks like tooth addition, removal, and full dentition synthesis.
Conclusion: Enables realistic, localized modification of dentition without rescanning, opening opportunities for surgical planning, patient communication, and targeted data augmentation in dental AI workflows.
Abstract: Despite the growing importance of dental CBCT scans for diagnosis and treatment planning, generating anatomically realistic scans with fine-grained control remains a challenge in medical image synthesis. In this work, we propose a novel conditional diffusion framework for 3D dental volume generation, guided by tooth-level binary attributes that allow precise control over tooth presence and configuration. Our approach integrates wavelet-based denoising diffusion, FiLM conditioning, and masked loss functions to focus learning on relevant anatomical structures. We evaluate the model across diverse tasks, such as tooth addition, removal, and full dentition synthesis, using both paired and distributional similarity metrics. Results show strong fidelity and generalization with low FID scores, robust inpainting performance, and SSIM values above 0.91 even on unseen scans. By enabling realistic, localized modification of dentition without rescanning, this work opens opportunities for surgical planning, patient communication, and targeted data augmentation in dental AI workflows. The codes are available at: https://github.com/djafar1/tooth-diffusion.
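FiLM conditioning, one of the ingredients above, modulates feature maps with a per-channel scale and shift predicted from the condition vector; a minimal PyTorch sketch with hypothetical tooth-attribute inputs:

```python
# FiLM (feature-wise linear modulation): a conditioning vector (here, toy
# binary tooth-presence attributes) predicts per-channel scale and shift.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim: int, n_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * n_channels)

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        # Broadcast over the spatial dims of a 3D volume: (B, C, D, H, W).
        gamma = gamma.view(*gamma.shape, 1, 1, 1)
        beta = beta.view(*beta.shape, 1, 1, 1)
        return gamma * feats + beta

film = FiLM(cond_dim=32, n_channels=64)       # e.g., 32 tooth-presence bits
feats = torch.randn(2, 64, 8, 8, 8)           # toy 3D feature volume
cond = torch.randint(0, 2, (2, 32)).float()
print(film(feats, cond).shape)                # torch.Size([2, 64, 8, 8, 8])
```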
[83] MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling
Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, Ning Yu
Main category: cs.CV
TL;DR: MAViS is a multi-agent collaborative framework for long-sequence video storytelling that addresses limitations in assistive capability, visual quality, and expressiveness through specialized agents working across multiple stages under the 3E Principle.
Details
Motivation: Current long-sequence video generation frameworks suffer from poor assistive capability, suboptimal visual quality, and limited expressiveness, which MAViS aims to overcome.Method: End-to-end multi-agent framework with specialized agents for script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation, operating under the 3E Principle (Explore, Examine, Enhance) with Script Writing Guidelines for generative model compatibility.
Result: MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness, producing high-quality, expressive long-sequence videos with narratives and background music from brief user prompts.
Conclusion: MAViS is the first framework to provide multimodal design output (videos with narratives and background music) and offers a scalable modular framework that enriches user creativity and inspiration.
Abstract: Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, an end-to-end multi-agent collaborative framework for long-sequence video storytelling. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle (Explore, Examine, and Enhance) to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief user prompt, MAViS is capable of producing high-quality, expressive long-sequence video storytelling, enriching inspirations and creativity for users. To the best of our knowledge, MAViS is the only framework that provides multimodal design output: videos with narratives and background music.
[84] GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting
Elena Alegret Regalado, Kunyi Li, Sen Wang, Siyun Liang, Michael Niemeyer, Stefano Gasperini, Nassir Navab, Federico Tombari
Main category: cs.CV
TL;DR: GALA is a novel framework for open-vocabulary 3D scene understanding using 3D Gaussian Splatting that learns language-aware 3D representations through self-supervised contrastive learning and cross-attention with learnable codebooks.
Details
Motivation: Existing 3D scene reconstruction methods struggle to capture fine-grained, language-aware 3D representations from 2D images, limiting their ability to support open-vocabulary queries and detailed scene understanding.Method: GALA distills scene-specific 3D instance feature fields via self-supervised contrastive learning and introduces a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings, avoiding per-Gaussian high-dimensional feature learning.
Result: Extensive experiments on real-world datasets demonstrate GALA’s remarkable open-vocabulary performance on both 2D and 3D tasks, showing effective language-aware 3D representation learning.
Conclusion: GALA successfully addresses the challenge of learning fine-grained, language-aware 3D representations from 2D images, enabling effective open-vocabulary 3D scene understanding with reduced memory consumption.
Abstract: 3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA, a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA’s remarkable open-vocabulary performance on both 2D and 3D.
[85] Improving OCR using internal document redundancy
Diego Belzarena, Seginus Mowlavi, Aitor Artola, Camilo Mariño, Marina Gardella, Ignacio Ramírez, Antoine Tadros, Roy He, Natalia Bottaioli, Boshra Rajaei, Gregory Randall, Jean-Michel Morel
Main category: cs.CV
TL;DR: Unsupervised method using document character redundancy to improve OCR outputs through extended GMM with EM algorithm and statistical testing
Details
Motivation: Current OCR systems struggle with low-quality data and don't fully exploit document redundancy, especially for printed documents with high inter-domain variabilityMethod: Extended Gaussian Mixture Model with alternating Expectation-Maximization algorithm and intra-cluster realignment process, plus normality statistical testing
Result: Demonstrated improvements on various degraded documents including Uruguayan military archives and 17th-20th century European newspapers
Conclusion: Leveraging intra-document character shape redundancy through unsupervised clustering can effectively correct imperfect OCR outputs for degraded documents
Abstract: Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document’s redundancy. We propose an unsupervised method that leverages the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this aim, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
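The redundancy idea can be illustrated with a vanilla GMM on flattened glyph crops; the paper's extension (EM alternating with intra-cluster realignment and normality testing) is omitted in this toy sketch:

```python
# Toy version of redundancy-based glyph clustering: fit a Gaussian mixture to
# flattened character crops, so repeated shapes in a document share a cluster.
# The paper's extension (EM alternating with intra-cluster realignment and
# normality testing) is omitted here.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def toy_glyph(kind: str) -> np.ndarray:
    g = np.zeros((16, 16))
    if kind == "bar":
        g[2:14, 7:9] = 1.0          # an "l"-like vertical stroke
    else:
        g[7:9, 2:14] = 1.0          # a "-"-like horizontal stroke
    return g + 0.1 * rng.standard_normal((16, 16))   # scanner noise

crops = np.stack([toy_glyph("bar") for _ in range(40)]
                 + [toy_glyph("dash") for _ in range(40)])
X = crops.reshape(len(crops), -1)

gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(X)
labels = gmm.predict(X)
print(labels[:5], labels[-5:])      # the two glyph shapes land in two clusters
```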
[86] Multi-Rationale Explainable Object Recognition via Contrastive Conditional Inference
Ali Rasekh, Sepehr Kazemi Ranjbar, Simon Gottschalk
Main category: cs.CV
TL;DR: A new multi-rationale benchmark and contrastive conditional inference framework for explainable object recognition that improves both classification accuracy and rationale quality without training.
Details
Motivation: Existing explainable object recognition methods using CLIP suffer from weak conditioning on explanatory structures and limited single rationales that don't capture feature diversity.Method: Proposes a contrastive conditional inference (CCI) framework that explicitly models probabilistic relationships among image embeddings, category labels, and rationales without requiring training.
Result: Achieves state-of-the-art results on the multi-rationale benchmark, including strong zero-shot performance, setting new standards for classification accuracy and rationale quality.
Conclusion: Provides a more complete framework for evaluating explainable object recognition models and establishes a new benchmark with multiple ground-truth rationales per image.
Abstract: Explainable object recognition using vision-language models such as CLIP involves predicting accurate category labels supported by rationales that justify the decision-making process. Existing methods typically rely on prompt-based conditioning, which suffers from limitations in CLIP’s text encoder and provides weak conditioning on explanatory structures. Additionally, prior datasets are often restricted to single, and frequently noisy, rationales that fail to capture the full diversity of discriminative image features. In this work, we introduce a multi-rationale explainable object recognition benchmark comprising datasets in which each image is annotated with multiple ground-truth rationales, along with evaluation metrics designed to offer a more comprehensive representation of the task. To overcome the limitations of previous approaches, we propose a contrastive conditional inference (CCI) framework that explicitly models the probabilistic relationships among image embeddings, category labels, and rationales. Without requiring any training, our framework enables more effective conditioning on rationales to predict accurate object categories. Our approach achieves state-of-the-art results on the multi-rationale explainable object recognition benchmark, including strong zero-shot performance, and sets a new standard for both classification accuracy and rationale quality. Together with the benchmark, this work provides a more complete framework for evaluating future models in explainable object recognition. The code will be made available online.
[87] A Comprehensive Review of Agricultural Parcel and Boundary Delineation from Remote Sensing Images: Recent Progress and Future Perspectives
Juepeng Zheng, Zi Ye, Yibin Wen, Jianxi Huang, Zhiwei Zhang, Qingmei Li, Qiong Hu, Baodong Xu, Lingyuan Zhao, Haohuan Fu
Main category: cs.CV
TL;DR: This paper provides a comprehensive review of Agricultural Parcel and Boundary Delineation (APBD) methods using remote sensing images, categorizing approaches into traditional image processing, traditional machine learning, and deep learning methods, with analysis of current trends and future prospects.
Details
Motivation: The increasing availability of high-resolution remote sensing images enables cost-efficient agricultural inventory and analysis, creating a need to systematically review and categorize the various APBD methods that have been developed.Method: The authors conduct a comprehensive meta-data analysis of recent APBD papers, categorizing methods into three classes: traditional image processing (pixel-based, edge-based, region-based), traditional machine learning (random forest, decision tree), and deep learning-based methods (semantic segmentation, object detection, Transformer-based).
Result: The review identifies deep learning approaches as the dominant methodology in current APBD research, provides systematic categorization of methods, discusses key issues like multi-sensor data and learning approaches, and analyzes algorithm comparisons across different APBD tasks.
Conclusion: The review serves as a knowledge map for APBD development, highlights deep learning’s dominance, discusses current challenges, and proposes future applications and research directions to help researchers track the field’s development and trends.
Abstract: Powered by advances in multiple remote sensing sensors, the production of high spatial resolution images provides great potential to achieve cost-efficient and high-accuracy agricultural inventory and analysis in an automated way. Many studies aiming to provide an inventory at the level of individual agricultural parcels have generated methods for Agricultural Parcel and Boundary Delineation (APBD). This review covers APBD methods for detecting and delineating agricultural parcels and systematically reviews the past and present of APBD-related research applied to remote sensing images. With the goal of providing a clear knowledge map of existing APBD efforts, we conduct a comprehensive review of recent APBD papers to build a meta-data analysis, including the algorithm, the study site, the crop type, the sensor type, the evaluation method, etc. We categorize the methods into three classes: (1) traditional image processing methods (including pixel-based, edge-based and region-based); (2) traditional machine learning methods (such as random forest, decision tree); and (3) deep learning-based methods. With deep learning-oriented approaches contributing the majority, we further discuss deep learning-based methods like semantic segmentation-based, object detection-based and Transformer-based methods. In addition, we discuss five APBD-related issues to further comprehend the APBD domain using remote sensing data, such as multi-sensor data in the APBD task, comparisons between single-task learning and multi-task learning in the APBD domain, comparisons among different algorithms and different APBD tasks, etc. Finally, this review proposes some APBD-related applications and a few exciting prospects and potential hot topics in future APBD research. We hope this review helps researchers involved in the APBD domain keep track of its development and tendencies.
[88] OccluNet: Spatio-Temporal Deep Learning for Occlusion Detection on DSA
Anushka A. Kore, Frank G. te Nijenhuis, Matthijs van der Sluijs, Wim van Zwam, Charles Majoie, Geert Lycklama à Nijeholt, Danny Ruijters, Frans Vos, Sandra Cornelissen, Ruisheng Su, Theo van Walsum
Main category: cs.CV
TL;DR: OccluNet is a spatio-temporal deep learning model that combines YOLOX object detection with transformer-based temporal attention to automatically detect vascular occlusions in DSA sequences, significantly outperforming baseline models.
Details
Motivation: Accurate detection of vascular occlusions during endovascular thrombectomy is critical for acute ischemic stroke treatment, but interpreting DSA sequences is challenging due to anatomical complexity and time constraints.Method: Proposes OccluNet which integrates YOLOX (single-stage object detector) with transformer-based temporal attention mechanisms. Two variants were explored: pure temporal attention and divided space-time attention. Compared against YOLOv11 baseline trained on individual DSA frames or minimum intensity projections.
Result: Evaluation on DSA images from MR CLEAN Registry showed precision of 89.02% and recall of 74.87%. OccluNet significantly outperformed baseline models, with both attention variants achieving similar performance.
Conclusion: OccluNet demonstrates capability to capture temporally consistent features for automated occlusion detection in DSA sequences, providing an effective solution for acute ischemic stroke diagnosis.
Abstract: Accurate detection of vascular occlusions during endovascular thrombectomy (EVT) is critical in acute ischemic stroke (AIS). Interpretation of digital subtraction angiography (DSA) sequences poses challenges due to anatomical complexity and time constraints. This work proposes OccluNet, a spatio-temporal deep learning model that integrates YOLOX, a single-stage object detector, with transformer-based temporal attention mechanisms to automate occlusion detection in DSA sequences. We compared OccluNet with a YOLOv11 baseline trained on either individual DSA frames or minimum intensity projections. Two spatio-temporal variants were explored for OccluNet: pure temporal attention and divided space-time attention. Evaluation on DSA images from the MR CLEAN Registry revealed the model’s capability to capture temporally consistent features, achieving precision and recall of 89.02% and 74.87%, respectively. OccluNet significantly outperformed the baseline models, and both attention variants attained similar performance. Source code is available at https://github.com/anushka-kore/OccluNet.git
[89] Adversarial Hospital-Invariant Feature Learning for WSI Patch Classification
Mengliang Zhang, Jacob M. Luber
Main category: cs.CV
TL;DR: Study on domain bias in pathology foundation models from hospital-specific variations, proposing adversarial framework to remove hospital-specific features while maintaining disease classification performance.
Details
Motivation: Pathology foundation models risk learning hospital-specific features due to hardware/preprocessing differences across hospitals, posing deployment risks.Method: Lightweight adversarial framework with trainable adapter and domain classifier using gradient reversal layer to remove latent hospital-specific features from frozen representations.
Result: Substantially reduces domain predictability while maintaining/improving disease classification, especially in out-of-domain scenarios; confirmed by hospital detection and feature visualization.
Conclusion: Proposed method effectively mitigates hospital bias in pathology foundation models without modifying encoder, enhancing generalizability for clinical deployment.
Abstract: Pathology foundation models (PFMs) have demonstrated remarkable potential in whole-slide image (WSI) diagnosis. However, pathology images from different hospitals often vary due to differences in scanning hardware and preprocessing styles, which may lead PFMs to inadvertently learn hospital-specific features, posing risks for clinical deployment. In this work, we present the first systematic study of domain bias in PFMs arising from hospital source characteristics. Specifically, we (1) construct a pipeline for quantifying domain bias in PFMs, (2) evaluate and compare the performance of multiple models, and (3) propose a lightweight adversarial framework that removes latent hospital-specific features from frozen representations without modifying the encoder itself. By introducing a trainable adapter and a domain classifier connected through a gradient reversal layer (GRL), our method learns task-discriminative yet domain-invariant representations. Experiments on multi-center histopathology datasets demonstrate that our approach substantially reduces domain predictability while maintaining or even improving disease classification performance, particularly in out-of-domain (unseen hospital) scenarios. Further analyses, including hospital detection and feature space visualization, confirm the effectiveness of our method in mitigating hospital bias. We will provide our code based on acceptance.
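The gradient reversal layer at the heart of this framework is a few lines of autograd; a standard generic PyTorch sketch (not the authors' release):

```python
# Gradient reversal layer: identity on the forward pass, negated (scaled)
# gradient on the backward pass, so the adapter learns features the domain
# (hospital) classifier cannot separate.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Toy check: gradients flowing through the layer come back negated.
feat = torch.randn(4, 8, requires_grad=True)
out = grad_reverse(feat).sum()
out.backward()
print(feat.grad[0, :3])  # all -1.0: the reversed gradient of sum()
```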
[90] Pixels to Play: A Foundation Model for 3D Gameplay
Yuguang Yue, Chris Green, Samuel Hunt, Irakli Salia, Wenzhe Shi, Jonathan J Hunt
Main category: cs.CV
TL;DR: Pixels2Play-0.1 is a foundation model that learns to play 3D video games from pixel input using behavior cloning with both labeled human demonstrations and unlabeled public videos, achieving competent gameplay across various titles.
Details
Motivation: To create AI agents that can serve as teammates, controllable NPCs, personalized live-streamers, and assistive testers by learning from the same pixel stream available to human players and generalizing to new games with minimal engineering.Method: End-to-end behavior cloning using labeled human gameplay demonstrations and unlabeled public videos with action imputation via inverse-dynamics model. Uses decoder-only transformer with auto-regressive action output for large action space handling on consumer GPUs.
Result: Competent play across simple Roblox and classic MS-DOS titles, with ablations showing benefits of unlabeled data. Model demonstrates human-like behavior from pixel input.
Conclusion: The approach shows promise for building general game-playing agents, with scaling and evaluation needed to reach expert-level, text-conditioned control across diverse game titles.
Abstract: We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human game-play are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with auto-regressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.
[91] MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation
Guile Wu, David Huang, Dongfeng Bai, Bingbing Liu
Main category: cs.CV
TL;DR: Proposes a unified diffusion transformer model for multi-modal multi-view video generation in autonomous driving, capable of generating RGB, depth maps, and semantic maps simultaneously.
Details
Motivation: Existing video generation approaches focus only on RGB and lack multi-modal support, while multi-modal data (depth maps, semantic maps) is crucial for holistic urban scene understanding in autonomous driving.
Method: Constructs a unified diffusion transformer model with modal-shared and modal-specific components, leveraging diverse conditioning inputs to encode controllable scene structure and content cues for multi-modal multi-view generation.
Result: Experiments on nuScenes dataset show the approach generates high-fidelity multi-modal multi-view urban scene videos with superior controllability, surpassing state-of-the-art methods.
Conclusion: The proposed unified framework successfully addresses the limitation of single-modal generation and enables simultaneous generation of multiple modalities in a single model, improving deployment efficiency and leveraging complementary cues.
Abstract: Video generation has recently shown superiority in urban scene synthesis for autonomous driving. Existing video generation approaches to autonomous driving primarily focus on RGB video generation and lack the ability to support multi-modal video generation. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving. Although it is feasible to use multiple models to generate different modalities, this increases the difficulty of model deployment and does not leverage complementary cues for multi-modal data generation. To address this problem, in this work, we propose a novel multi-modal multi-view video generation approach to autonomous driving. Specifically, we construct a unified diffusion transformer model composed of modal-shared components and modal-specific components. Then, we leverage diverse conditioning inputs to encode controllable scene structure and content cues into the unified diffusion model for multi-modal multi-view video generation. In this way, our approach is capable of generating multi-modal multi-view driving scene videos in a unified framework. Our experiments on the challenging real-world autonomous driving dataset, nuScenes, show that our approach can generate multi-modal multi-view urban scene videos with high fidelity and controllability, surpassing the state-of-the-art methods.
[92] Inter-Class Relational Loss for Small Object Detection: A Case Study on License Plates
Dian Ning, Dong Seog Han
Main category: cs.CV
TL;DR: Proposes inter-class relational loss to improve small object detection by leveraging spatial relationships between objects, achieving significant mAP improvements on license plate detection.
Details
Motivation: IoU-based losses fail to properly update gradients for small objects due to flat gradients, causing insufficient learning for small objects during multi-object training.
Method: Uses spatial relationships between objects (e.g., a car plate attached to a car) to add a loss penalty when the predicted small object is not within its related larger object, with magnitude inversely proportional to their overlapped area.
Result: Achieved 10.3% and 1.6% mAP50 improvements for YOLOv12-T and UAV-DETR respectively, without additional hyperparameter tuning.
Conclusion: Inter-class relational loss effectively enhances small object detection by leveraging spatial relationships and can be easily integrated with existing IoU-based losses.
Abstract: In one-stage multi-object detection tasks, various intersection over union (IoU)-based solutions aim at smooth and stable convergence near the targets during training. However, IoU-based losses fail to correctly update the gradient of small objects due to an extremely flat gradient. During the update of multiple objects, the learning of small objects’ gradients suffers more because of insufficient gradient updates. Therefore, we propose an inter-class relational loss to efficiently update the gradient of small objects while not sacrificing the learning efficiency of other objects, based on the simple fact that an object has a spatial relationship to another object (e.g., a car plate is attached to a car in a similar position). When the predicted car plate’s bounding box is not within its car, a loss punishment is added to guide the learning, which is inversely proportional to the overlapped area of the car’s and predicted car plate’s bounding box. By leveraging the spatial relationship at the inter-class level, the loss guides small object predictions using larger objects and enhances latent information in deeper feature maps. In this paper, we present twofold contributions using license plate detection as a case study: (1) a new small vehicle multi-license plate dataset (SVMLP), featuring diverse real-world scenarios with high-quality annotations; and (2) a novel inter-class relational loss function designed to promote effective detection performance. We highlight that the proposed ICR loss penalty can be easily added to existing IoU-based losses to enhance performance. These contributions improve the standard mean Average Precision (mAP) metric, achieving gains of 10.3% and 1.6% in mAP$^{\text{test}}_{50}$ for YOLOv12-T and UAV-DETR, respectively, without any additional hyperparameter tuning. Code and dataset will be available soon.
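To make the containment idea concrete, here is a hedged sketch of a penalty that is zero once the predicted plate box lies inside its car box and grows as their overlap shrinks; the paper's exact functional form may differ.

```python
import torch

def icr_penalty(plate_box: torch.Tensor, car_box: torch.Tensor, eps: float = 1e-6):
    """Inter-class relational penalty sketch. Boxes are (x1, y1, x2, y2)."""
    # intersection of the predicted plate box with its related car box
    x1 = torch.max(plate_box[0], car_box[0])
    y1 = torch.max(plate_box[1], car_box[1])
    x2 = torch.min(plate_box[2], car_box[2])
    y2 = torch.min(plate_box[3], car_box[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    plate_area = (plate_box[2] - plate_box[0]) * (plate_box[3] - plate_box[1])
    containment = inter / (plate_area + eps)  # 1.0 when fully inside the car
    return 1.0 - containment                  # shrinks as the overlap grows
```

Added on top of a standard IoU loss, a term like this keeps supplying a gradient to a small box even where the IoU objective has gone flat.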
[93] HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation
Gaston Gustavo Rios
Main category: cs.CV
TL;DR: A lightweight sign generation model using CMLPe with synthetic data pretraining improves sign language recognition accuracy, achieving state-of-the-art results on LSFB and DiSPLaY datasets.
Details
Motivation: Sign Language Recognition models suffer from limited training data availability, which significantly restricts their performance.
Method: Introduces a novel lightweight sign generation model based on CMLPe coupled with synthetic data pretraining approach, implemented with Mamba-SL and Transformer-SL classifiers.
Result: Achieves new state-of-the-art results on LSFB and DiSPLaY datasets. Synthetic data pretraining outperforms traditional augmentation methods in some cases and provides complementary benefits when combined with them.
Conclusion: The approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that deliver significant performance improvements across diverse datasets.
Abstract: Sign Language Recognition (SLR) models face significant performance limitations due to insufficient training data availability. In this article, we address the challenge of limited data in SLR by introducing a novel and lightweight sign generation model based on CMLPe. This model, coupled with a synthetic data pretraining approach, consistently improves recognition accuracy, establishing new state-of-the-art results for the LSFB and DiSPLaY datasets using our Mamba-SL and Transformer-SL classifiers. Our findings reveal that synthetic data pretraining outperforms traditional augmentation methods in some cases and yields complementary benefits when implemented alongside them. Our approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that achieve significant performance improvements across diverse datasets.
[94] Deep Learning for Taxol Exposure Analysis: A New Cell Image Dataset and Attention-Based Baseline Model
Sean Fletcher, Gabby Scott, Douglas Currie, Xin Zhang, Yuqi Song, Bruce MacLeod
Main category: cs.CV
TL;DR: New microscopy dataset and baseline model for automated detection of Taxol effects on cells using deep learning, addressing lack of public data for this application.
Details
Motivation: Existing methods for monitoring Taxol effects require specialized equipment, skilled personnel, and extensive preparation, making them expensive and unsuitable for high-throughput analysis. No public dataset exists for automated morphological analysis of cellular responses to Taxol.
Method: Created a new microscopy image dataset of C6 glioma cells treated with varying Taxol concentrations. Proposed ResAttention-KNN baseline model combining ResNet-50 with Convolutional Block Attention Modules and k-Nearest Neighbors classifier for enhanced robustness and interpretability.
Result: Developed and publicly released both the dataset and implementation to support reproducibility and facilitate future research in vision-based biomedical analysis of Taxol effects.
Conclusion: The work provides a valuable public resource and benchmark for automated morphological analysis of cellular responses to Taxol, enabling high-throughput assessment without specialized equipment or extensive sample preparation.
Abstract: Monitoring the effects of the chemotherapeutic agent Taxol at the cellular level is critical for both clinical evaluation and biomedical research. However, existing detection methods require specialized equipment, skilled personnel, and extensive sample preparation, making them expensive, labor-intensive, and unsuitable for high-throughput or real-time analysis. Deep learning approaches have shown great promise in medical and biological image analysis, enabling automated, high-throughput assessment of cellular morphology. Yet, no publicly available dataset currently exists for automated morphological analysis of cellular responses to Taxol exposure. To address this gap, we introduce a new microscopy image dataset capturing C6 glioma cells treated with varying concentrations of Taxol. To provide an effective solution for Taxol concentration classification and establish a benchmark for future studies on this dataset, we propose a baseline model named ResAttention-KNN, which combines a ResNet-50 with Convolutional Block Attention Modules and uses a k-Nearest Neighbors classifier in the learned embedding space. This model integrates attention-based refinement and non-parametric classification to enhance robustness and interpretability. Both the dataset and implementation are publicly released to support reproducibility and facilitate future research in vision-based biomedical analysis.
[95] Learning Point Cloud Representations with Pose Continuity for Depth-Based Category-Level 6D Object Pose Estimation
Zhujun Li, Shuo Zhang, Ioannis Stamos
Main category: cs.CV
TL;DR: HRC-Pose is a depth-only framework for category-level object pose estimation that uses contrastive learning to preserve 6D pose continuity, outperforming state-of-the-art methods on standard benchmarks.
Details
Motivation: Existing approaches rely solely on 6D pose supervision without capturing pose continuity, leading to prediction inconsistencies and poor generalization to unseen poses.
Method: Decouples pose into rotation and translation components, uses contrastive learning with 6D pose-aware hierarchical ranking scheme, and processes learned embeddings separately through dedicated modules.
Result: Outperforms existing depth-only state-of-the-art methods on REAL275 and CAMERA25 benchmarks, runs in real-time, and successfully learns continuous feature spaces.
Conclusion: HRC-Pose demonstrates effectiveness for real-world applications by addressing pose continuity issues through contrastive learning and separate processing of rotation/translation components.
Abstract: Category-level object pose estimation aims to predict the 6D pose and 3D size of objects within given categories. Existing approaches for this task rely solely on 6D poses as supervisory signals without explicitly capturing the intrinsic continuity of poses, leading to inconsistencies in predictions and reduced generalization to unseen poses. To address this limitation, we propose HRC-Pose, a novel depth-only framework for category-level object pose estimation, which leverages contrastive learning to learn point cloud representations that preserve the continuity of 6D poses. HRC-Pose decouples object pose into rotation and translation components, which are separately encoded and leveraged throughout the network. Specifically, we introduce a contrastive learning strategy for multi-task, multi-category scenarios based on our 6D pose-aware hierarchical ranking scheme, which contrasts point clouds from multiple categories by considering rotational and translational differences as well as categorical information. We further design pose estimation modules that separately process the learned rotation-aware and translation-aware embeddings. Our experiments demonstrate that HRC-Pose successfully learns continuous feature spaces. Results on REAL275 and CAMERA25 benchmarks show that our method consistently outperforms existing depth-only state-of-the-art methods and runs in real-time, demonstrating its effectiveness and potential for real-world applications. Our code is at https://github.com/zhujunli1993/HRC-Pose.
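The 6D pose-aware hierarchical ranking scheme is the paper's own; as a rough illustration of the underlying idea only — embedding distances should order consistently with pose distances — here is a minimal margin-ranking sketch (distance measures, margin, and all names are assumptions):

```python
import torch
import torch.nn.functional as F

def pose_rank_loss(anchor, pos, neg, d_pose_pos, d_pose_neg, margin=0.1):
    """If the 'pos' sample's 6D pose is closer to the anchor's than the
    'neg' sample's, its embedding should also be closer, by a margin.
    anchor/pos/neg: (B, D) embeddings; d_pose_*: (B,) pose distances."""
    d_emb_pos = 1.0 - F.cosine_similarity(anchor, pos, dim=-1)
    d_emb_neg = 1.0 - F.cosine_similarity(anchor, neg, dim=-1)
    valid = (d_pose_pos < d_pose_neg).float()  # rank only pairs that disagree in pose
    return (valid * F.relu(d_emb_pos - d_emb_neg + margin)).mean()
```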
[96] Taming Transformer for Emotion-Controllable Talking Face Generation
Ziqi Zhang, Cheng Deng
Main category: cs.CV
TL;DR: A novel method for emotion-controllable talking face generation that uses pre-training strategies to disentangle audio components and quantize videos into visual tokens, then employs emotion-anchor representations and an autoregressive transformer to synthesize emotional videos.
Details
Motivation: To address two main challenges in emotion-controllable talking face generation: effectively modeling multimodal relationships related to specific emotions, and leveraging these relationships to synthesize identity-preserving emotional videos.
Method: Uses two pre-training strategies to disentangle audio into independent components and quantize videos into visual tokens. Proposes emotion-anchor (EA) representation to integrate emotional information into visual tokens. Employs an autoregressive transformer to model global distribution of visual tokens and predict index sequences for video synthesis.
Result: Extensive experiments on the MEAD dataset demonstrate superior performance both qualitatively and quantitatively in controlling video emotions conditioned on multiple emotional audios.
Conclusion: The proposed method effectively tackles emotion-controllable talking face generation by discretely modeling multimodal relationships and successfully synthesizing identity-preserving emotional videos through the emotion-anchor representation and transformer architecture.
Abstract: Talking face generation is a novel and challenging generation task, aiming at synthesizing a vivid speaking-face video given a specific audio. To fulfill emotion-controllable talking face generation, current methods need to overcome two challenges: One is how to effectively model the multimodal relationship related to the specific emotion, and the other is how to leverage this relationship to synthesize identity preserving emotional videos. In this paper, we propose a novel method to tackle the emotion-controllable talking face generation task discretely. Specifically, we employ two pre-training strategies to disentangle audio into independent components and quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation that integrates the emotional information into visual tokens. Finally, we introduce an autoregressive transformer to model the global distribution of the visual tokens under the given conditions and further predict the index sequence for synthesizing the manipulated videos. We conduct experiments on the MEAD dataset that controls the emotion of videos conditioned on multiple emotional audios. Extensive experiments demonstrate the superiorities of our method both qualitatively and quantitatively.
[97] FastTracker: Real-Time and Accurate Visual Tracking
Hamidreza Hashempoor, Yu Dong Hwang
Main category: cs.CV
TL;DR: A generalized multi-object tracking framework for various object types, with special focus on vehicle tracking, featuring occlusion-aware re-ID and road-structure-aware refinement.
Details
Motivation: Conventional MOT systems are limited to pedestrian tracking and lack generalization to other object categories like vehicles in complex traffic scenes.
Method: Proposes two key components: (1) occlusion-aware re-identification mechanism for identity preservation of occluded objects, and (2) road-structure-aware tracklet refinement using semantic scene priors (lane directions, crosswalks, road boundaries).
Result: Achieves robust performance on new vehicle benchmark and public benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets.
Conclusion: The framework demonstrates effectiveness in general-purpose object tracking while maintaining strong performance on conventional benchmarks, providing a generalized solution beyond pedestrian tracking.
Abstract: Conventional multi-object tracking (MOT) systems are predominantly designed for pedestrian tracking and often exhibit limited generalization to other object categories. This paper presents a generalized tracking framework capable of handling multiple object types, with a particular emphasis on vehicle tracking in complex traffic scenes. The proposed method incorporates two key components: (1) an occlusion-aware re-identification mechanism that enhances identity preservation for heavily occluded objects, and (2) a road-structure-aware tracklet refinement strategy that utilizes semantic scene priors such as lane directions, crosswalks, and road boundaries to improve trajectory continuity and accuracy. In addition, we introduce a new benchmark dataset comprising diverse vehicle classes with frame-level tracking annotations, specifically curated to support evaluation of vehicle-focused tracking methods. Extensive experimental results demonstrate that the proposed approach achieves robust performance on both the newly introduced dataset and several public benchmarks, highlighting its effectiveness in general-purpose object tracking. While our framework is designed for generalized multi-class tracking, it also achieves strong performance on conventional benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets. Code and Benchmark are available: github.com/Hamidreza-Hashempoor/FastTracker, huggingface.co/datasets/Hamidreza-Hashemp/FastTracker-Benchmark.
[98] TCFNet: Bidirectional face-bone transformation via a Transformer-based coarse-to-fine point movement network
Runshi Zhang, Bimeng Jie, Yang He, Junchen Wang
Main category: cs.CV
TL;DR: TCFNet is a Transformer-based coarse-to-fine network for accurate face-bone point cloud transformations in surgical simulation, addressing limitations of traditional biomechanical and existing deep learning methods.
Details
Motivation: Traditional biomechanical simulation methods are computationally intensive and inaccurate, while existing deep learning approaches have limited receptive fields, cannot handle large-scale points, and require complex preprocessing/postprocessing operations.
Method: Two-stage end-to-end framework: 1) Transformer-based network for global feature extraction, 2) Local Information Aggregation Network (LIA-Net) to model local geometric structures. Uses gated recurrent unit to guide local displacement with global features, and includes auxiliary loss inspired by deformable medical image registration.
Result: TCFNet achieves outstanding evaluation metrics and visualization results compared to state-of-the-art methods on gathered datasets, demonstrating superior performance in face-bone point cloud transformations.
Conclusion: The proposed TCFNet effectively addresses limitations of existing methods by combining Transformer-based global feature extraction with local geometric modeling, providing an accurate and efficient solution for computer-aided surgical simulation without complex preprocessing/postprocessing requirements.
Abstract: Computer-aided surgical simulation is a critical component of orthognathic surgical planning, where accurately simulating face-bone shape transformations is significant. The traditional biomechanical simulation methods are limited by their computational time consumption levels, labor-intensive data processing strategies and low accuracy. Recently, deep learning-based simulation methods have been proposed to view this problem as a point-to-point transformation between skeletal and facial point clouds. However, these approaches cannot process large-scale points, have limited receptive fields that lead to noisy points, and employ complex preprocessing and postprocessing operations based on registration. These shortcomings limit the performance and widespread applicability of such methods. Therefore, we propose a Transformer-based coarse-to-fine point movement network (TCFNet) to learn unique, complicated correspondences at the patch and point levels for dense face-bone point cloud transformations. This end-to-end framework adopts a Transformer-based network and a local information aggregation network (LIA-Net) in the first and second stages, respectively, which reinforce each other to generate precise point movement paths. LIA-Net can effectively compensate for the neighborhood precision loss of the Transformer-based network by modeling local geometric structures (edges, orientations and relative position features). The previous global features are employed to guide the local displacement using a gated recurrent unit. Inspired by deformable medical image registration, we propose an auxiliary loss that can utilize expert knowledge for reconstructing critical organs. Compared with the existing state-of-the-art (SOTA) methods on gathered datasets, TCFNet achieves outstanding evaluation metrics and visualization results. The code is available at https://github.com/Runshi-Zhang/TCFNet.
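One detail worth unpacking is the gated recurrent unit that lets global features steer local displacements. A loose sketch of that fusion, with feature sizes and the exact wiring assumed for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class GlobalGuidedDisplacement(nn.Module):
    """Sketch: a GRU cell fuses per-point local features with a global
    context vector (used as the hidden state) to emit 3D point movements."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)
        self.to_offset = nn.Linear(dim, 3)  # per-point 3D displacement

    def forward(self, local_feats: torch.Tensor, global_feat: torch.Tensor):
        # local_feats: (N, dim) per-point features; global_feat: (dim,)
        h = global_feat.unsqueeze(0).expand(local_feats.size(0), -1).contiguous()
        h = self.cell(local_feats, h)   # gate local evidence against global context
        return self.to_offset(h)        # (N, 3) movement increment per point
```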
[99] QuadINR: Hardware-Efficient Implicit Neural Representations Through Quadratic Activation
Wenyong Zhou, Boyu Li, Jiachen Ren, Taiqiang Wu, Zhilin Ai, Zhengwu Liu, Ngai Wong
Main category: cs.CV
TL;DR: QuadINR introduces hardware-efficient Implicit Neural Representations using piecewise quadratic activation functions to achieve better performance with significantly reduced hardware overhead compared to previous approaches.
Details
Motivation: Existing INRs address spectral bias through complex activation functions that incur significant hardware overhead, creating a need for more hardware-efficient solutions.
Method: Proposes QuadINR with piecewise quadratic activation functions that provide rich harmonic content, verified through Neural Tangent Kernel analysis. Develops a unified N-stage pipeline framework for efficient hardware implementation on FPGA and ASIC platforms.
Result: Achieves up to 2.06dB PSNR improvement over prior work with area of only 1914μm² and dynamic power of 6.14mW. Reduces resource and power consumption by up to 97% and improves latency by up to 93% compared to existing baselines.
Conclusion: QuadINR demonstrates that piecewise quadratic activation functions can provide superior performance while dramatically reducing hardware consumption, making INRs more practical for real-world applications.
Abstract: Implicit Neural Representations (INRs) encode discrete signals continuously while addressing spectral bias through activation functions (AFs). Previous approaches mitigate this bias by employing complex AFs, which often incur significant hardware overhead. To tackle this challenge, we introduce QuadINR, a hardware-efficient INR that utilizes piecewise quadratic AFs to achieve superior performance with dramatic reductions in hardware consumption. The quadratic functions encompass rich harmonic content in their Fourier series, delivering enhanced expressivity for high-frequency signals, as verified through Neural Tangent Kernel (NTK) analysis. We develop a unified $N$-stage pipeline framework that facilitates efficient hardware implementation of various AFs in INRs. We demonstrate FPGA implementations on the VCU128 platform and an ASIC implementation in a 28nm process. Experiments across images and videos show that QuadINR achieves up to 2.06dB PSNR improvement over prior work, with an area of only 1914$\mu$m$^2$ and a dynamic power of 6.14mW, reducing resource and power consumption by up to 97% and improving latency by up to 93% vs existing baselines.
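For intuition, a piecewise quadratic activation can be as simple as a periodic parabolic bump: one fold and one multiply per evaluation, yet its Fourier series carries many harmonics. A hedged sketch follows; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class PiecewiseQuadratic(nn.Module):
    """Periodic piecewise quadratic activation: fold the input into one
    period, then evaluate a parabola segment. Hardware cost is roughly a
    modulo plus a multiply, far simpler than sinusoidal activations."""
    def __init__(self, period: float = 2.0):
        super().__init__()
        self.period = period

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = torch.remainder(x, self.period) / self.period  # fold into [0, 1)
        return 4.0 * t * (1.0 - t)                         # quadratic bump per period
```

Dropping this in place of the sinusoid in a SIREN-style MLP would give a quadratic-activation INR in the spirit of QuadINR.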
[100] Img2ST-Net: Efficient High-Resolution Spatial Omics Prediction from Whole Slide Histology Images via Fully Convolutional Image-to-Image Learning
Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Juming Xiong, Chongyu Qu, Mengmeng Yin, Yu Wang, Shilin Zhao, Haichun Yang, Daguang Xu, Yucheng Tang, Yuankai Huo
Main category: cs.CV
TL;DR: Img2ST-Net is a novel framework that uses fully convolutional architecture to generate high-resolution spatial transcriptomics data from histology images in parallel, overcoming computational challenges of conventional spot-by-spot methods.
Details
Motivation: High-resolution spatial transcriptomics data acquisition is expensive and time-consuming. Conventional sequential regression frameworks become inefficient and unstable at finer resolutions (8um or finer), and the extreme sparsity of high-resolution ST data complicates both prediction and evaluation.
Method: Proposes Img2ST-Net with fully convolutional architecture to generate dense HD gene expression maps in parallel. Reformulates the task as super-content image generation with hundreds/thousands of output channels by modeling HD ST data as super-pixel representations. Also introduces SSIM-ST, a structural-similarity-based evaluation metric for high-resolution ST analysis.
Result: The framework improves computational efficiency while better preserving spatial organization intrinsic to spatial omics data. Provides a scalable, biologically coherent solution for efficient and accurate ST inference at scale.
Conclusion: Img2ST-Net offers a principled solution for efficient high-resolution ST prediction and lays groundwork for next-generation robust and resolution-aware ST modeling. The source code is publicly available.
Abstract: Recent advances in multi-modal AI have demonstrated promising potential for generating the currently expensive spatial transcriptomics (ST) data directly from routine histology images, offering a means to reduce the high cost and time-intensive nature of ST data acquisition. However, the increasing resolution of ST, particularly with platforms such as Visium HD achieving 8um or finer, introduces significant computational and modeling challenges. Conventional spot-by-spot sequential regression frameworks become inefficient and unstable at this scale, while the inherent extreme sparsity and low expression levels of high-resolution ST further complicate both prediction and evaluation. To address these limitations, we propose Img2ST-Net, a novel histology-to-ST generation framework for efficient and parallel high-resolution ST prediction. Unlike conventional spot-by-spot inference methods, Img2ST-Net employs a fully convolutional architecture to generate dense, HD gene expression maps in a parallelized manner. By modeling HD ST data as super-pixel representations, the task is reformulated from image-to-omics inference into a super-content image generation problem with hundreds or thousands of output channels. This design not only improves computational efficiency but also better preserves the spatial organization intrinsic to spatial omics data. To enhance robustness under sparse expression patterns, we further introduce SSIM-ST, a structural-similarity-based evaluation metric tailored for high-resolution ST analysis. We present a scalable, biologically coherent framework for high-resolution ST prediction. Img2ST-Net offers a principled solution for efficient and accurate ST inference at scale. Our contributions lay the groundwork for next-generation ST modeling that is robust and resolution-aware. The source code has been made publicly available at https://github.com/hrlblab/Img2ST-Net.
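The reformulation from spot-by-spot regression to image-to-image generation boils down to a convolutional head with one output channel per gene, so every spot in a tile is predicted in one forward pass. A minimal sketch, with channel counts chosen purely for illustration:

```python
import torch
import torch.nn as nn

class ExpressionHead(nn.Module):
    """Fully convolutional head emitting a dense map with one channel per
    gene, i.e. a super-pixel gene expression image predicted in parallel."""
    def __init__(self, in_ch: int = 256, n_genes: int = 1000):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(512, n_genes, kernel_size=1),  # one output channel per gene
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, in_ch, h, w) backbone features from the histology tile
        return self.head(feats)  # (B, n_genes, h, w) expression map
```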
[101] CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities
Yue Gong, Shanyuan Liu, Liuzhuozheng Li, Jian Zhu, Bo Cheng, Liebucha Wu, Xiaoyu Wu, Yuhang Ma, Dawei Leng, Yuhui Yin
Main category: cs.CV
TL;DR: CTA-Flux is a Chinese text adapter that enables Flux (English-trained text-to-image model) to understand Chinese prompts while maintaining compatibility with existing plugins, using a parameter-efficient MultiModal Diffusion Transformer approach.
Details
Motivation: Flux performs poorly with non-English prompts due to linguistic and cultural biases in English-centric training. Existing translation or finetuning methods fail to preserve culturally specific semantics, compromising image authenticity.
Method: Leverages MultiModal Diffusion Transformer (MMDiT) to directly control the Flux backbone, significantly reducing parameters while enhancing Chinese semantic understanding, without extensive retraining of the entire model.
Result: Empirical evaluations show CTA-flux supports both Chinese and English prompts, achieving superior image generation quality, visual realism, and faithful depiction of Chinese semantics.
Conclusion: CTA-flux effectively bridges Chinese semantic understanding with English-centric TTI models, improving cultural authenticity and generation quality while maintaining plugin compatibility.
Abstract: We propose the Chinese Text Adapter-Flux (CTA-Flux), an adaptation method that fits Chinese text inputs to Flux, a powerful text-to-image (TTI) generative model initially trained on an English corpus. Despite the notable image generation ability conditioned on English text inputs, Flux performs poorly when processing non-English prompts, particularly due to linguistic and cultural biases inherent in predominantly English-centric training datasets. Existing approaches, such as translating non-English prompts into English or finetuning models for bilingual mappings, inadequately address culturally specific semantics, compromising image authenticity and quality. To address this issue, we introduce a novel method to bridge Chinese semantic understanding with compatibility in English-centric TTI model communities. Existing approaches relying on ControlNet-like architectures typically require a massive parameter scale and lack direct control over Chinese semantics. In comparison, CTA-Flux leverages MultiModal Diffusion Transformer (MMDiT) to control the Flux backbone directly, significantly reducing the number of parameters while enhancing the model’s understanding of Chinese semantics. This integration significantly improves the generation quality and cultural authenticity without extensive retraining of the entire model, thus maintaining compatibility with existing text-to-image plugins such as LoRA, IP-Adapter, and ControlNet. Empirical evaluations demonstrate that CTA-Flux supports Chinese and English prompts and achieves superior image generation quality, visual realism, and faithful depiction of Chinese semantics.
[102] MoCHA-former: Moiré-Conditioned Hybrid Adaptive Transformer for Video Demoiréing
Jeahun Sung, Changhyun Roh, Chanho Eom, Jihyong Oh
Main category: cs.CV
TL;DR: MoCHA-former is a novel transformer-based model that effectively removes moiré patterns from camera-captured screen content by addressing spatially varying artifacts, large-scale structures, channel dependencies, and temporal fluctuations through decoupled moiré adaptive demoiréing and spatio-temporal adaptive components.
Details
Motivation: Camera-based screen capture suffers from moiré patterns caused by frequency aliasing between camera CFA and display sub-pixels, degrading photo/video quality. Existing demoiréing methods fail to address spatially varying artifact strength, large-scale structures, channel-dependent statistics, and temporal fluctuations across frames.
Method: MoCHA-former uses two key components: Decoupled Moiré Adaptive Demoiréing (DMAD) with Moiré Decoupling Block and Detail Decoupling Block to separate moiré and content, and Spatio-Temporal Adaptive Demoiréing (STAD) with Spatial Fusion Block and Feature Channel Attention. It performs implicit frame alignment without explicit modules for temporal consistency.
Result: The model was evaluated on two video datasets covering RAW and sRGB domains. MoCHA-former consistently outperformed prior methods across all metrics including PSNR, SSIM, and LPIPS.
Conclusion: MoCHA-former effectively addresses the key limitations in moiré pattern removal by leveraging transformer architecture with specialized components for spatial, channel, and temporal adaptation, demonstrating superior performance over existing approaches.
Abstract: Recent advances in portable imaging have made camera-based screen capture ubiquitous. Unfortunately, frequency aliasing between the camera’s color filter array (CFA) and the display’s sub-pixels induces moiré patterns that severely degrade captured photos and videos. Although various demoiréing models have been proposed to remove such moiré patterns, these approaches still suffer from several limitations: (i) spatially varying artifact strength within a frame, (ii) large-scale and globally spreading structures, (iii) channel-dependent statistics and (iv) rapid temporal fluctuations across frames. We address these issues with the Moiré Conditioned Hybrid Adaptive Transformer (MoCHA-former), which comprises two key components: Decoupled Moiré Adaptive Demoiréing (DMAD) and Spatio-Temporal Adaptive Demoiréing (STAD). DMAD separates moiré and content via a Moiré Decoupling Block (MDB) and a Detail Decoupling Block (DDB), then produces moiré-adaptive features using a Moiré Conditioning Block (MCB) for targeted restoration. STAD introduces a Spatial Fusion Block (SFB) with window attention to capture large-scale structures, and a Feature Channel Attention (FCA) to model channel dependence in RAW frames. To ensure temporal consistency, MoCHA-former performs implicit frame alignment without any explicit alignment module. We analyze moiré characteristics through qualitative and quantitative studies, and evaluate on two video datasets covering RAW and sRGB domains. MoCHA-former consistently surpasses prior methods across PSNR, SSIM, and LPIPS.
[103] HyperDiff: Hypergraph Guided Diffusion Model for 3D Human Pose Estimation
Bing Han, Yuhua Huang, Pan Gao
Main category: cs.CV
TL;DR: HyperDiff combines diffusion models with HyperGCN for monocular 3D human pose estimation, addressing depth ambiguity and occlusion while capturing multi-scale skeleton features through multi-granularity structures.
Details
Motivation: To overcome challenges in monocular 3D HPE including depth ambiguity, occlusion during 2D-to-3D lifting, and the oversight of multi-scale skeleton features in traditional methods.
Method: Integrates diffusion models to capture data uncertainty and HyperGCN as a denoiser that uses multi-granularity structures to model high-order correlations between joints.
Result: Achieves state-of-the-art performance on Human3.6M and MPI-INF-3DHP datasets and can flexibly adapt to varying computational resources for performance-efficiency balance.
Conclusion: HyperDiff effectively addresses key challenges in monocular 3D pose estimation through the novel combination of diffusion models and HyperGCN with multi-granularity structures.
Abstract: Monocular 3D human pose estimation (HPE) often encounters challenges such as depth ambiguity and occlusion during the 2D-to-3D lifting process. Additionally, traditional methods may overlook multi-scale skeleton features when utilizing skeleton structure information, which can negatively impact the accuracy of pose estimation. To address these challenges, this paper introduces a novel 3D pose estimation method, HyperDiff, which integrates diffusion models with HyperGCN. The diffusion model effectively captures data uncertainty, alleviating depth ambiguity and occlusion. Meanwhile, HyperGCN, serving as a denoiser, employs multi-granularity structures to accurately model high-order correlations between joints. This improves the model’s denoising capability especially for complex poses. Experimental results demonstrate that HyperDiff achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets and can flexibly adapt to varying computational resources to balance performance and efficiency.
[104] FOCUS: Frequency-Optimized Conditioning of DiffUSion Models for mitigating catastrophic forgetting during Test-Time Adaptation
Gabriel Tjio, Jie Zhang, Xulei Yang, Yun Xing, Nhat Chung, Xiaofeng Cao, Ivor W. Tsang, Chee Keong Kwoh, Qing Guo
Main category: cs.CV
TL;DR: FOCUS is a frequency-based conditioning approach that uses diffusion-driven input adaptation to balance knowledge preservation and domain adaptation, achieving state-of-the-art performance on semantic segmentation and depth estimation across diverse corruptions.
Details
Motivation: Test-time adaptation methods struggle to balance adapting to domain shifts while preserving task-relevant knowledge, as adaptation can cause catastrophic forgetting of important semantic information.
Method: Proposes FOCUS - a frequency-based conditioning approach using a Y-shaped Frequency Prediction Network (Y-FPN) that disentangles high/low frequency information from noisy images. Uses FrequencyMix data augmentation and diffusion-driven denoising with learned frequency priors to preserve semantic information.
Result: Achieves state-of-the-art averaged performance across 15 corruption types and three datasets for semantic segmentation and monocular depth estimation. Also complements existing model adaptation methods by providing pseudo labels from denoised images.
Conclusion: FOCUS effectively mitigates catastrophic forgetting in test-time adaptation while improving performance on dense prediction tasks, and can enhance existing adaptation methods through pseudo-label supervision.
Abstract: Test-time adaptation enables models to adapt to evolving domains. However, balancing the tradeoff between preserving knowledge and adapting to domain shifts remains challenging for model adaptation methods, since adapting to domain shifts can induce forgetting of task-relevant knowledge. To address this problem, we propose FOCUS, a novel frequency-based conditioning approach within a diffusion-driven input-adaptation framework. Utilising learned, spatially adaptive frequency priors, our approach conditions the reverse steps during diffusion-driven denoising to preserve task-relevant semantic information for dense prediction. FOCUS leverages a trained, lightweight, Y-shaped Frequency Prediction Network (Y-FPN) that disentangles high and low frequency information from noisy images. This minimizes the computational costs involved in implementing our approach in a diffusion-driven framework. We train Y-FPN with FrequencyMix, a novel data augmentation method that perturbs the images across diverse frequency bands, which improves the robustness of our approach to diverse corruptions. We demonstrate the effectiveness of FOCUS for semantic segmentation and monocular depth estimation across 15 corruption types and three datasets, achieving state-of-the-art averaged performance. In addition to improving standalone performance, FOCUS complements existing model adaptation methods since we can derive pseudo labels from FOCUS-denoised images for additional supervision. Even under limited, intermittent supervision with the pseudo labels derived from the FOCUS denoised images, we show that FOCUS mitigates catastrophic forgetting for recent model adaptation methods.
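FrequencyMix is described only as perturbing images across diverse frequency bands; a plausible minimal rendition rescales the energy of concentric FFT bands at random. Band layout and gain range below are assumptions, not the paper's recipe.

```python
import torch

def frequency_mix(img: torch.Tensor, n_bands: int = 4, max_gain: float = 0.5):
    """Randomly rescale concentric frequency bands of an image (..., H, W)."""
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = torch.sqrt((yy - h / 2.0) ** 2 + (xx - w / 2.0) ** 2)
    r_max = radius.max()
    for b in range(n_bands):
        lo, hi = b / n_bands * r_max, (b + 1) / n_bands * r_max
        band = (radius >= lo) & (radius < hi)                 # annular band mask
        gain = 1.0 + (2.0 * torch.rand(1).item() - 1.0) * max_gain
        f[..., band] *= gain                                  # perturb band energy
    return torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real
```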
[105] MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion
Fei Peng, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Huiyuan Fu
Main category: cs.CV
TL;DR: MUSE is a unified framework for layout-controllable multi-subject synthesis that achieves precise spatial control and identity preservation through concatenated cross-attention and progressive training.
Details
Motivation: Existing text-to-image diffusion models struggle with multi-subject compositional synthesis that requires both precise spatial control and faithful reconstruction of reference subjects simultaneously.
Method: Proposes MUSE framework with concatenated cross-attention (CCA) for bidirectional modality alignment between layout specifications and textual guidance, plus a progressive two-stage training strategy.
Result: Achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions.
Conclusion: MUSE advances controllable image synthesis by effectively solving the dual requirements of spatial precision and identity preservation in multi-subject generation.
Abstract: Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and model are available at https://github.com/pf0607/MUSE.
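Concatenated cross-attention can be read as widening the key/value sequence: layout tokens are appended to text tokens, so image queries attend over the expanded semantic space in a single pass. A hedged sketch with illustrative dimensions (not the authors' implementation):

```python
import torch
import torch.nn as nn

class ConcatCrossAttention(nn.Module):
    """CCA sketch: text and layout tokens share one cross-attention call."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, text_tokens, layout_tokens):
        # concatenate conditions along the sequence axis: the "expanded"
        # semantic space holds both textual guidance and spatial constraints
        cond = torch.cat([text_tokens, layout_tokens], dim=1)
        out, _ = self.attn(query=img_tokens, key=cond, value=cond)
        return out
```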
[106] Reconstruction Using the Invisible: Intuition from NIR and Metadata for Enhanced 3D Gaussian Splatting
Gyusam Chang, Tuan-Anh Vu, Vivek Alumootil, Harris Song, Deanna Pham, Sangpil Kim, M. Khalid Jawed
Main category: cs.CV
TL;DR: NIRSplat is a multimodal Gaussian splatting method that integrates NIR, RGB, and vegetation index data to improve 3D reconstruction in challenging agricultural environments.
Details
Motivation: Agricultural scenes present unique challenges for 3D reconstruction including uneven illumination, occlusions, and limited field of view. Current 3DGS methods are underexplored in agriculture and lack robustness in these conditions.
Method: Proposed NIRSplat architecture using cross-attention mechanism with 3D point-based positional encoding to integrate multimodal data (NIR, RGB, depth, LiDAR, and text metadata from vegetation indices like NDVI, NDWI, chlorophyll index).
Result: NIRSplat outperforms existing methods including 3DGS, CoR-GS, and InstantSplat in comprehensive experiments on challenging agricultural scenarios.
Conclusion: The integration of NIR data and vegetation index metadata significantly enhances 3D reconstruction robustness and provides botanical insights beyond the visible spectrum, making it effective for agricultural applications.
Abstract: While 3D Gaussian Splatting (3DGS) has rapidly advanced, its application in agriculture remains underexplored. Agricultural scenes present unique challenges for 3D reconstruction methods, particularly due to uneven illumination, occlusions, and a limited field of view. To address these limitations, we introduce NIRPlant, a novel multimodal dataset encompassing Near-Infrared (NIR) imagery, RGB imagery, textual metadata, Depth, and LiDAR data collected under varied indoor and outdoor lighting conditions. By integrating NIR data, our approach enhances robustness and provides crucial botanical insights that extend beyond the visible spectrum. Additionally, we leverage text-based metadata derived from vegetation indices, such as NDVI, NDWI, and the chlorophyll index, which significantly enriches the contextual understanding of complex agricultural environments. To fully exploit these modalities, we propose NIRSplat, an effective multimodal Gaussian splatting architecture employing a cross-attention mechanism combined with 3D point-based positional encoding, providing robust geometric priors. Comprehensive experiments demonstrate that NIRSplat outperforms existing landmark methods, including 3DGS, CoR-GS, and InstantSplat, highlighting its effectiveness in challenging agricultural scenarios. The code and dataset are publicly available at: https://github.com/StructuresComp/3D-Reconstruction-NIR
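The textual metadata comes from standard band-ratio indices. The paper does not restate the formulas, so the variants below are the common definitions, not necessarily the exact ones used:

```python
import numpy as np

def vegetation_indices(nir: np.ndarray, red: np.ndarray, green: np.ndarray,
                       eps: float = 1e-8):
    """Common band-ratio indices from co-registered reflectance arrays."""
    ndvi = (nir - red) / (nir + red + eps)      # normalized difference vegetation index
    ndwi = (green - nir) / (green + nir + eps)  # normalized difference water index (McFeeters)
    ci_green = nir / (green + eps) - 1.0        # green chlorophyll index
    return ndvi, ndwi, ci_green
```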
[107] Generalizable Engagement Estimation in Conversation via Domain Prompting and Parallel Attention
Yangche Yu, Yin Chen, Jia Li, Peng Jia, Yu Zhang, Li Dai, Zhenzhen Hu, Meng Wang, Richang Hong
Main category: cs.CV
TL;DR: DAPA is a novel framework for generalizable conversational engagement modeling that uses domain prompting and parallel cross-attention to improve cross-domain performance.
Details
Motivation: Accurate engagement estimation is essential for adaptive human-computer interaction, but current methods suffer from poor generalizability across diverse domains and difficulty modeling complex interaction dynamics.
Method: Proposes DAPA framework with Domain Prompting mechanism (learnable domain-specific vectors) and Parallel Cross-Attention module that aligns reactive (forward BiLSTM) and anticipatory (backward BiLSTM) states between participants.
Result: Achieves state-of-the-art performance on cross-cultural and cross-linguistic benchmarks, with 0.45 CCC improvement on NoXi-J test set, and won first place in Multi-Domain Engagement Estimation Challenge at MultiMediate'25.
Conclusion: DAPA effectively addresses domain generalization challenges in engagement estimation through explicit domain conditioning and interaction synchrony modeling, demonstrating superior performance across diverse conversational contexts.
Abstract: Accurate engagement estimation is essential for adaptive human-computer interaction systems, yet robust deployment is hindered by poor generalizability across diverse domains and challenges in modeling complex interaction dynamics. To tackle these issues, we propose DAPA (Domain-Adaptive Parallel Attention), a novel framework for generalizable conversational engagement modeling. DAPA introduces a Domain Prompting mechanism by prepending learnable domain-specific vectors to the input, explicitly conditioning the model on the data’s origin to facilitate domain-aware adaptation while preserving generalizable engagement representations. To capture interactional synchrony, the framework also incorporates a Parallel Cross-Attention module that explicitly aligns reactive (forward BiLSTM) and anticipatory (backward BiLSTM) states between participants. Extensive experiments demonstrate that DAPA establishes a new state-of-the-art performance on several cross-cultural and cross-linguistic benchmarks, notably achieving an absolute improvement of 0.45 in Concordance Correlation Coefficient (CCC) over a strong baseline on the NoXi-J test set. The superiority of our method was also confirmed by winning the first place in the Multi-Domain Engagement Estimation Challenge at MultiMediate'25.
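Domain Prompting as described amounts to prepending a learnable per-domain vector to the input sequence. A minimal sketch follows; prompt length, feature size, and domain count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DomainPrompting(nn.Module):
    """One learnable prompt per source domain, prepended to the features so
    the downstream model is explicitly conditioned on the data's origin."""
    def __init__(self, n_domains: int = 4, dim: int = 256, prompt_len: int = 1):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_domains, prompt_len, dim) * 0.02)

    def forward(self, x: torch.Tensor, domain_id: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) per-frame features; domain_id: (B,) integer labels
        prompt = self.prompts[domain_id]      # (B, prompt_len, dim)
        return torch.cat([prompt, x], dim=1)  # prepend along the sequence axis
```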
[108] D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis
Yuhang Guo, Kaijun Deng, Siyang Song, Jindong Xie, Wenhui Ma, Linlin Shen
Main category: cs.CV
TL;DR: D^3-Talker is a novel 3D talking head synthesis method that uses static 3D Gaussian attribute fields with separate audio and facial motion controls, achieving better lip sync and image quality with limited training data.
Details
Motivation: Existing methods struggle with poor lip synchronization and image quality when trained on few frames due to audio containing irrelevant information and difficulty mapping audio to realistic lip behaviors.
Method: Constructs static 3D Gaussian attribute field with independent audio and facial motion deformation controls, uses similarity contrastive loss for decoupling, and integrates coarse-to-fine module for image refinement.
Result: Outperforms state-of-the-art methods in both high-fidelity rendering and accurate audio-lip synchronization with limited training data.
Conclusion: D^3-Talker effectively decouples general and personalized deformations, achieving superior performance in 3D talking head synthesis with minimal training requirements.
Abstract: A key challenge in 3D talking head synthesis lies in the reliance on a long-duration talking head video to train a new model for each target identity from scratch. Recent methods have attempted to address this issue by extracting general features from audio through pre-training models. However, since audio contains information irrelevant to lip motion, existing approaches typically struggle to map the given audio to realistic lip behaviors in the target face when trained on only a few frames, causing poor lip synchronization and talking head image quality. This paper proposes D^3-Talker, a novel approach that constructs a static 3D Gaussian attribute field and employs audio and Facial Motion signals to independently control two distinct Gaussian attribute deformation fields, effectively decoupling the predictions of general and personalized deformations. We design a novel similarity contrastive loss function during pre-training to achieve more thorough decoupling. Furthermore, we integrate a Coarse-to-Fine module to refine the rendered images, alleviating blurriness caused by head movements and enhancing overall image quality. Extensive experiments demonstrate that D^3-Talker outperforms state-of-the-art methods in both high-fidelity rendering and accurate audio-lip synchronization with limited training data. Our code will be provided upon acceptance.
[109] Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
Shanlin Sun, Yifan Wang, Hanwen Zhang, Yifeng Xiong, Qin Ren, Ruogu Fang, Xiaohui Xie, Chenyu You
Main category: cs.CV
TL;DR: Ouroboros is a dual single-step diffusion framework that mutually reinforces forward and inverse rendering with cycle consistency, achieving faster inference and state-of-the-art performance across diverse scenes.
Details
Motivation: Existing multi-step diffusion models treat forward and inverse rendering independently, leading to cycle inconsistency and slow inference speed.
Method: Two single-step diffusion models handle forward and inverse rendering with mutual reinforcement, extending intrinsic decomposition to both indoor/outdoor scenes with cycle consistency mechanism.
Result: State-of-the-art performance across diverse scenes with substantially faster inference speed compared to other diffusion methods, plus training-free video decomposition capability.
Conclusion: Ouroboros demonstrates effective mutual reinforcement between forward and inverse rendering with cycle consistency, enabling faster inference and high-quality results across various applications including video.
Abstract: While multi-step diffusion models have advanced both forward and inverse rendering, existing approaches often treat these problems independently, leading to cycle inconsistency and slow inference speed. In this work, we present Ouroboros, a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. Our approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that ensures coherence between forward and inverse rendering outputs. Experimental results demonstrate state-of-the-art performance across diverse scenes while achieving substantially faster inference speed compared to other diffusion-based methods. We also demonstrate that Ouroboros can transfer to video decomposition in a training-free manner, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering.
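The cycle consistency mechanism can be pictured as a round trip: inverse rendering predicts intrinsics, forward rendering re-synthesizes the image, and the reconstruction should match the input. A hedged sketch, with model interfaces and the L1 choice assumed:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(image, inverse_model, forward_model):
    """One direction of the cycle: image -> intrinsics -> image."""
    intrinsics = inverse_model(image)       # e.g. albedo, normals, lighting
    rerendered = forward_model(intrinsics)  # single-step re-synthesis
    return F.l1_loss(rerendered, image)     # the round trip should be identity
```

The symmetric direction (intrinsics -> image -> intrinsics) would close the loop, letting the two single-step models reinforce each other.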
[110] DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing
Weitao Wang, Zichen Wang, Hongdeng Shen, Yulei Lu, Xirui Fan, Suhui Wu, Jun Zhang, Haoqian Wang, Hao Zhang
Main category: cs.CV
TL;DR: DreamSwapV is a mask-guided, subject-agnostic framework for swapping any subject in videos using user-specified masks and reference images, outperforming existing methods through advanced condition fusion and adaptive mask strategies.
Details
Motivation: Current video subject swapping methods are either domain-specific (human-body animation, hand-object interaction) or rely on indirect editing paradigms and ambiguous text prompts that compromise fidelity, creating a need for a more general and high-fidelity solution.
Method: Mask-guided end-to-end framework with multiple conditions and dedicated condition fusion module, adaptive mask strategy for varying subject scales/attributes, and elaborate two-phase dataset construction and training scheme.
Result: Outperforms existing methods as validated by comprehensive experiments on VBench indicators and the newly introduced DreamSwapV-Benchmark.
Conclusion: DreamSwapV provides an effective solution for customized video editing through subject swapping, achieving superior performance with its advanced conditioning and adaptive strategies.
Abstract: With the rapid progress of video generation, demand for customized video editing is surging, where subject swapping constitutes a key component yet remains under-explored. Prevailing swapping approaches either specialize in narrow domains–such as human-body animation or hand-object interaction–or rely on some indirect editing paradigm or ambiguous text prompts that compromise final fidelity. In this paper, we propose DreamSwapV, a mask-guided, subject-agnostic, end-to-end framework that swaps any subject in any video for customization with a user-specified mask and reference image. To inject fine-grained guidance, we introduce multiple conditions and a dedicated condition fusion module that integrates them efficiently. In addition, an adaptive mask strategy is designed to accommodate subjects of varying scales and attributes, further improving interactions between the swapped subject and its surrounding context. Through our elaborate two-phase dataset construction and training scheme, our DreamSwapV outperforms existing methods, as validated by comprehensive experiments on VBench indicators and our first introduced DreamSwapV-Benchmark.
[111] LookOut: Real-World Humanoid Egocentric Navigation
Boxiao Pan, Adam W. Harley, C. Karen Liu, Leonidas J. Guibas
Main category: cs.CV
TL;DR: Predicting 6D head poses from egocentric video to understand active information-gathering behavior during navigation
Details
Motivation: Predicting collision-free future trajectories from egocentric observations is crucial for humanoid robotics, VR/AR, and assistive navigation applications.
Method: A framework that reasons over temporally aggregated 3D latent features to model geometric and semantic constraints of static and dynamic environments, plus a data collection pipeline using Project Aria glasses.
Result: Created Aria Navigation Dataset (4 hours of real-world navigation recordings), model learns human-like navigation behaviors (waiting, rerouting, looking around) and generalizes to unseen environments
Conclusion: The proposed approach successfully predicts 6D head poses and demonstrates human-like navigation behaviors, providing valuable resources for learning real-world egocentric navigation policies
Abstract: The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR / AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gathering behavior expressed through head-turning events. To solve this task, we propose a framework that reasons over temporally aggregated 3D latent features, which models the geometric and semantic constraints for both the static and dynamic parts of the environment. Motivated by the lack of training data in this space, we further contribute a data collection pipeline using the Project Aria glasses, and present a dataset collected through this approach. Our dataset, dubbed Aria Navigation Dataset (AND), consists of 4 hours of recording of users navigating in real-world scenarios. It includes diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies. Extensive experiments show that our model learns human-like navigation behaviors such as waiting / slowing down, rerouting, and looking around for traffic while generalizing to unseen environments. Check out our project webpage at https://sites.google.com/stanford.edu/lookout.
[112] Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration
Haoran Bai, Xiaoxu Chen, Canqian Yang, Zongyao He, Sibin Deng, Ying Chen
Main category: cs.CV
TL;DR: Vivid-VR is a DiT-based video restoration method that uses ControlNet and concept distillation to achieve superior texture realism and temporal coherence while maintaining content consistency.
Details
Motivation: Conventional fine-tuning of controllable video restoration pipelines suffers from distribution drift due to imperfect multimodal alignment, leading to compromised texture realism and temporal coherence.
Method: Proposes concept distillation training using pretrained T2V model to synthesize training samples, redesigned control architecture with control feature projector to filter degradation artifacts, and dual-branch ControlNet connector combining MLP-based mapping with cross-attention for dynamic control.
Result: Extensive experiments show Vivid-VR outperforms existing approaches on synthetic and real-world benchmarks as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency.
Conclusion: The method successfully addresses distribution drift issues in video restoration through concept distillation and enhanced control architecture, delivering state-of-the-art performance with publicly available code and checkpoints.
Abstract: We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to limitations in imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents to minimize their propagation through the generation pipeline, and 2) a new ControlNet connector employing a dual-branch design. This connector synergistically combines MLP-based feature mapping with cross-attention mechanism for dynamic control feature retrieval, enabling both content preservation and adaptive control signal modulation. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency. The codes and checkpoints are publicly available at https://github.com/csbhr/Vivid-VR.
[113] WeedSense: Multi-Task Learning for Weed Segmentation, Height Estimation, and Growth Stage Classification
Toqi Tahamid Sarker, Khaled R Ahmed, Taminul Islam, Cristiana Bernardi Rankrape, Karla Gage
Main category: cs.CV
TL;DR: WeedSense is a multi-task learning architecture that simultaneously performs weed segmentation, height estimation, and growth stage classification, achieving state-of-the-art performance with real-time inference capabilities.
Details
Motivation: Weed management is critical for agriculture but resource-intensive. Effective monitoring and analysis are needed for sustainable practices and site-specific management approaches.
Method: Multi-task learning architecture with dual-path encoder using Universal Inverted Bottleneck blocks and Multi-Task Bifurcated Decoder with transformer-based feature fusion. Trained on a unique dataset of 16 weed species over 11 weeks with pixel-level annotations, height measurements, and temporal labels (see the sketch after this entry).
Result: Achieved mIoU of 89.78% for segmentation, 1.67cm MAE for height estimation, and 99.99% accuracy for growth stage classification at 160 FPS. 3× faster inference than sequential single-task execution with 32.4% fewer parameters.
Conclusion: WeedSense provides a comprehensive, efficient solution for weed analysis that enables real-time monitoring and management, significantly advancing sustainable agricultural practices.
Abstract: Weed management represents a critical challenge in agriculture, significantly impacting crop yields and requiring substantial resources for control. Effective weed monitoring and analysis strategies are crucial for implementing sustainable agricultural practices and site-specific management approaches. We introduce WeedSense, a novel multi-task learning architecture for comprehensive weed analysis that jointly performs semantic segmentation, height estimation, and growth stage classification. We present a unique dataset capturing 16 weed species over an 11-week growth cycle with pixel-level annotations, height measurements, and temporal labels. WeedSense leverages a dual-path encoder incorporating Universal Inverted Bottleneck blocks and a Multi-Task Bifurcated Decoder with transformer-based feature fusion to generate multi-scale features and enable simultaneous prediction across multiple tasks. WeedSense outperforms other state-of-the-art models on our comprehensive evaluation. On our multi-task dataset, WeedSense achieves mIoU of 89.78% for segmentation, 1.67cm MAE for height estimation, and 99.99% accuracy for growth stage classification while maintaining real-time inference at 160 FPS. Our multitask approach achieves 3× faster inference than sequential single-task execution and uses 32.4% fewer parameters. Please see our project page at weedsense.github.io.
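To make the multi-task setup above concrete, here is a minimal sketch of a bifurcated head serving segmentation, height regression, and growth-stage classification from one shared feature map. The channel counts, the 17-class segmentation space (16 species plus background), the 11 growth stages, and the uniform loss weighting are illustrative assumptions, not WeedSense's actual architecture.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Toy bifurcated decoder: one shared feature map feeds three heads.
    Shapes and class counts are illustrative assumptions."""
    def __init__(self, feat_ch=256, num_classes=17, num_stages=11):
        super().__init__()
        self.seg = nn.Conv2d(feat_ch, num_classes, 1)   # per-pixel class logits
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.height = nn.Linear(feat_ch, 1)             # plant height in cm
        self.stage = nn.Linear(feat_ch, num_stages)     # growth-stage logits

    def forward(self, feats):
        g = self.pool(feats).flatten(1)                 # global descriptor
        return self.seg(feats), self.height(g).squeeze(1), self.stage(g)

head = MultiTaskHead()
feats = torch.randn(2, 256, 32, 32)                     # shared encoder output
seg, h, stage = head(feats)
seg_gt = torch.randint(0, 17, (2, 32, 32))
loss = (nn.functional.cross_entropy(seg, seg_gt)
        + nn.functional.l1_loss(h, torch.tensor([12.0, 30.5]))  # MAE, as reported
        + nn.functional.cross_entropy(stage, torch.tensor([3, 7])))
print(loss.item())
```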
[114] SATURN: Autoregressive Image Generation Guided by Scene Graphs
Thanh-Nhan Vo, Trong-Thuan Nguyen, Tam V. Nguyen, Minh-Triet Tran
Main category: cs.CV
TL;DR: SATURN is a lightweight extension to VAR-CLIP that translates scene graphs into salience-ordered token sequences, enabling better text-to-image generation with improved layout and object relationship accuracy while maintaining high fidelity.
Details
Motivation: Existing text-to-image models struggle with capturing complex layouts and object relationships from prompts, while previous graph-guided approaches rely on heavy GAN/diffusion pipelines that lag behind modern autoregressive architectures in speed and fidelity.
Method: SATURN extends VAR-CLIP by translating scene graphs into salience-ordered token sequences, allowing a frozen CLIP-VQ-VAE backbone to interpret graph structure while only fine-tuning the VAR transformer component (see the sketch after this entry).
Result: On the Visual Genome dataset, SATURN reduces FID from 56.45 to 21.62 and increases Inception Score from 16.03 to 24.78, outperforming SG2IM and SGDiff without requiring extra modules or multi-stage training.
Conclusion: SATURN effectively combines structural awareness from scene graphs with state-of-the-art autoregressive fidelity, showing significant improvements in object count fidelity and spatial relation accuracy.
Abstract: State-of-the-art text-to-image models excel at photorealistic rendering but often struggle to capture the layout and object relationships implied by complex prompts. Scene graphs provide a natural structural prior, yet previous graph-guided approaches have typically relied on heavy GAN or diffusion pipelines, which lag behind modern autoregressive architectures in both speed and fidelity. We introduce SATURN (Structured Arrangement of Triplets for Unified Rendering Networks), a lightweight extension to VAR-CLIP that translates a scene graph into a salience-ordered token sequence, enabling a frozen CLIP-VQ-VAE backbone to interpret graph structure while fine-tuning only the VAR transformer. On the Visual Genome dataset, SATURN reduces FID from 56.45 to 21.62 and increases the Inception Score from 16.03 to 24.78, outperforming prior methods such as SG2IM and SGDiff without requiring extra modules or multi-stage training. Qualitative results further confirm improvements in object count fidelity and spatial relation accuracy, showing that SATURN effectively combines structural awareness with state-of-the-art autoregressive fidelity.
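The core data transformation in SATURN, turning a scene graph into a salience-ordered token sequence, can be sketched in a few lines. The salience scores and the `<sep>` delimiter below are illustrative assumptions; the paper's actual ordering rule and vocabulary may differ.

```python
def serialize_scene_graph(triplets, salience):
    """Flatten (subject, relation, object) triplets into one token sequence,
    most salient subjects first, so the autoregressive model sees the
    important structure early. Salience scores are assumed given
    (e.g., by object size or class priors)."""
    ordered = sorted(triplets, key=lambda t: -salience.get(t[0], 0.0))
    tokens = []
    for s, r, o in ordered:
        tokens += [s, r, o, "<sep>"]
    return tokens

triplets = [("grass", "under", "sheep"), ("sheep", "left of", "dog"),
            ("sky", "above", "sheep")]
salience = {"sheep": 0.9, "dog": 0.6, "sky": 0.3, "grass": 0.2}
print(serialize_scene_graph(triplets, salience))
# ['sheep', 'left of', 'dog', '<sep>', 'sky', 'above', 'sheep', '<sep>', ...]
```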
[115] PB-IAD: Utilizing multimodal foundation models for semantic industrial anomaly detection in dynamic manufacturing environments
Bernd Hofmann, Albert Scheck, Joerg Franke, Patrick Bruendl
Main category: cs.CV
TL;DR: PB-IAD is a prompt-based industrial anomaly detection framework that leverages foundation models’ multimodal capabilities to address data sparsity, adaptability, and user-centric requirements in manufacturing, outperforming state-of-the-art methods.
Details
Motivation: Traditional statistical and data-driven anomaly detection methods are constrained by their dependence on extensive annotated datasets and limited flexibility in dynamic production conditions, creating a need for more adaptable solutions.
Method: PB-IAD uses a prompt-based framework with foundation models (GPT-4.1), featuring a specialized prompt template for iterative domain knowledge implementation and a pre-processing module that translates user inputs into effective system prompts (see the sketch after this entry).
Result: The framework demonstrates superior performance in data-sparse scenarios and low-shot settings across three manufacturing scenarios and two data modalities, achieving results solely through semantic instructions.
Conclusion: PB-IAD provides a user-centric, flexible solution for industrial anomaly detection that eliminates the need for data science expertise while delivering state-of-the-art performance, particularly in challenging data-limited environments.
Abstract: The detection of anomalies in manufacturing processes is crucial to ensure product quality and identify process deviations. Statistical and data-driven approaches remain the standard in industrial anomaly detection, yet their adaptability and usability are constrained by the dependence on extensive annotated datasets and limited flexibility under dynamic production conditions. Recent advances in the perception capabilities of foundation models provide promising opportunities for their adaptation to this downstream task. This paper presents PB-IAD (Prompt-based Industrial Anomaly Detection), a novel framework that leverages the multimodal and reasoning capabilities of foundation models for industrial anomaly detection. Specifically, PB-IAD addresses three key requirements of dynamic production environments: data sparsity, agile adaptability, and domain user centricity. In addition to the anomaly detection, the framework includes a prompt template that is specifically designed for iteratively implementing domain-specific process knowledge, as well as a pre-processing module that translates domain user inputs into effective system prompts. This user-centric design allows domain experts to customise the system flexibly without requiring data science expertise. The proposed framework is evaluated by utilizing GPT-4.1 across three distinct manufacturing scenarios, two data modalities, and an ablation study to systematically assess the contribution of semantic instructions. Furthermore, PB-IAD is benchmarked to state-of-the-art methods for anomaly detection such as PatchCore. The results demonstrate superior performance, particularly in data-sparse scenarios and low-shot settings, achieved solely through semantic instructions.
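As a rough illustration of the pre-processing idea, the helper below turns domain-user inputs into a system prompt for a multimodal model. Field names and wording are hypothetical, not the paper's actual template.

```python
def build_system_prompt(process, normal_desc, defect_hints):
    """Translate domain-user inputs into an anomaly-detection system prompt,
    in the spirit of PB-IAD's pre-processing module (illustrative wording)."""
    hints = "\n".join(f"- {h}" for h in defect_hints)
    return (
        f"You are a quality inspector for the process: {process}.\n"
        f"A normal part looks like: {normal_desc}\n"
        f"Known deviations to watch for:\n{hints}\n"
        "Inspect the attached image. Answer 'normal' or 'anomalous' "
        "and justify your decision in one sentence."
    )

print(build_system_prompt(
    process="PCB solder-joint inspection",
    normal_desc="shiny, concave solder fillets on every pad",
    defect_hints=["dull or cracked joints", "solder bridges between pads"],
))
```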
[116] Adversarial Generation and Collaborative Evolution of Safety-Critical Scenarios for Autonomous Vehicles
Jiangfan Liu, Yongkang Guo, Fangzhi Zhong, Tianyuan Zhang, Zonglei Jing, Siyuan Liang, Jiakai Wang, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Main category: cs.CV
TL;DR: ScenGE is a framework that generates safety-critical scenarios for autonomous vehicle testing using LLM-based reasoning and traffic flow amplification, outperforming state-of-the-art methods by 31.96% in collision detection.
Details
Motivation: Current approaches for safety-critical scenario generation rely on predefined patterns or rule-based strategies, limiting their ability to expose diverse and unforeseen failure modes in autonomous vehicles.
Method: Uses Meta-Scenario Generation with LLMs grounded in driving knowledge to infer plausible adversarial agents, then Complex Scenario Evolution with adversarial collaborator graphs to amplify threats through optimized background vehicle trajectories that reduce maneuvering space and create critical occlusions.
Result: Generates 31.96% more severe collision cases than SoTA baselines, improves model robustness through adversarial training, works with different simulators and large model AV systems, and validated through real-world tests and human evaluation.
Conclusion: ScenGE provides a critical step toward building public trust and ensuring safe deployment of autonomous vehicles by generating plausible and critical safety scenarios that expose diverse failure modes.
Abstract: The generation of safety-critical scenarios in simulation has become increasingly crucial for safety evaluation in autonomous vehicles prior to road deployment in society. However, current approaches largely rely on predefined threat patterns or rule-based strategies, which limit their ability to expose diverse and unforeseen failure modes. To overcome these limitations, we propose ScenGE, a framework that can generate plentiful safety-critical scenarios by reasoning novel adversarial cases and then amplifying them with complex traffic flows. Given a simple prompt of a benign scene, it first performs Meta-Scenario Generation, where a large language model, grounded in structured driving knowledge, infers an adversarial agent whose behavior poses a threat that is both plausible and deliberately challenging. This meta-scenario is then specified in executable code for precise in-simulator control. Subsequently, Complex Scenario Evolution uses background vehicles to amplify the core threat introduced by Meta-Scenario. It builds an adversarial collaborator graph to identify key agent trajectories for optimization. These perturbations are designed to simultaneously reduce the ego vehicle’s maneuvering space and create critical occlusions. Extensive experiments conducted on multiple reinforcement learning based AV models show that ScenGE uncovers more severe collision cases (+31.96%) on average than SoTA baselines. Additionally, our ScenGE can be applied to large model based AV systems and deployed on different simulators; we further observe that adversarial training on our scenarios improves the model robustness. Finally, we validate our framework through real-world vehicle tests and human evaluation, confirming that the generated scenarios are both plausible and critical. We hope our paper can serve as a critical step towards building public trust in autonomous vehicles and ensuring their safe deployment.
[117] Virtual Community: An Open World for Humans, Robots, and Society
Qinhong Zhou, Hongxin Zhang, Xiangye Lin, Zheyuan Zhang, Yutian Chen, Wenjun Liu, Zunzhe Zhang, Sunli Chen, Lixing Fang, Qiushi Lyu, Xinyu Sun, Jincheng Yang, Zeyuan Wang, Bao Chi Dang, Zhehuan Chen, Daksha Ladia, Jiageng Liu, Chuang Gan
Main category: cs.CV
TL;DR: Virtual Community is an open-world platform for studying human-robot coexistence, featuring a physics simulator and real-world 3D scenes to explore social intelligence and multi-agent cooperation challenges.
Details
Motivation: To study how humans and robots can intelligently coexist in shared communities as AI and robotics advance, addressing both opportunities and challenges of this societal transformation.
Method: Developed an open-source multi-agent physics simulator with real-world aligned community generation, including diverse indoor/outdoor scenes and grounded agents with rich characteristics. Proposed two challenges: Community Planning Challenge for multi-agent reasoning and Community Robot Challenge for heterogeneous robot collaboration.
Result: The platform enables evaluation of various baselines, demonstrating challenges in both high-level open-world task planning and low-level cooperation controls for human-robot interaction.
Conclusion: Virtual Community provides a foundation for further research into human-robot coexistence in open-world environments, unlocking new possibilities for studying embodied social intelligence at scale.
Abstract: The rapid progress in AI and Robotics may lead to a profound societal transformation, as humans and robots begin to coexist within shared communities, introducing both opportunities and challenges. To explore this future, we present Virtual Community, an open-world platform for humans, robots, and society, built on a universal physics engine and grounded in real-world 3D scenes. With Virtual Community, we aim to study embodied social intelligence at scale: 1) How robots can intelligently cooperate or compete; 2) How humans develop social relations and build community; 3) More importantly, how intelligent robots and humans can co-exist in an open world. To support these, Virtual Community features: 1) An open-source multi-agent physics simulator that supports robots, humans, and their interactions within a society; 2) A large-scale, real-world aligned community generation pipeline, including vast outdoor space, diverse indoor scenes, and a community of grounded agents with rich characters and appearances. Leveraging Virtual Community, we propose two novel challenges. The Community Planning Challenge evaluates multi-agent reasoning and planning ability in open-world settings, such as cooperating to help agents with daily activities and efficiently connecting other agents. The Community Robot Challenge requires multiple heterogeneous robots to collaborate in solving complex open-world tasks. We evaluate various baselines on these tasks and demonstrate the challenges in both high-level open-world task planning and low-level cooperation controls. We hope that Virtual Community will unlock further study of human-robot coexistence within open-world environments.
[118] WISE-FUSE: Efficient Whole Slide Image Encoding via Coarse-to-Fine Patch Selection with VLM and LLM Knowledge Fusion
Yonghan Shin, SeungKyu Kim, Won-Ki Jeong
Main category: cs.CV
TL;DR: WISE-FUSE is an adaptive WSI encoding framework that reduces computational pathology processing time by 3x+ while maintaining diagnostic accuracy through selective processing of relevant regions using vision-language models.
Details
Motivation: Whole slide images in computational pathology require processing tens to hundreds of thousands of high-resolution patches, creating prohibitive encoding costs and making WSI encoding the major bottleneck in real-world deployment.
Method: Uses pathology-domain vision-language models and LLMs to compute similarity scores between low-res patches and class-specific text descriptions, selects informative regions, and selectively encodes high-res patches fused with textual embeddings (see the sketch after this entry).
Result: Reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or surpassing exhaustive patch processing methods.
Conclusion: WISE-FUSE provides a scalable and practical solution for computational pathology by dramatically reducing processing time without compromising diagnostic accuracy through adaptive selective encoding.
Abstract: Whole slide images (WSIs) in computational pathology (CPath) pose a major computational challenge due to their gigapixel scale, often requiring the processing of tens to hundreds of thousands of high-resolution patches per slide. This results in prohibitive encoding costs, with preprocessing and training times extending to days or even weeks-making WSI encoding the most significant bottleneck in real-world deployment. In this work, we propose WISE-FUSE, an adaptive WSI encoding framework that leverages pathology-domain vision-language models and large language models to address this challenge by selectively processing diagnostically relevant regions. WISE-FUSE first computes similarity scores between low-resolution patches and class-specific textual descriptions using a knowledge distillation mechanism that preserves fine-grained diagnostic features. Based on these similarity scores, we select a small subset of informative regions for the target task, which quickly eliminates irrelevant patches at the coarse level. The corresponding high-resolution patches are then selectively encoded and fused with textual embeddings to reinforce diagnostic context. Extensive experiments demonstrate that WISE-FUSE reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or surpassing that of exhaustive patch processing, offering a scalable and practical solution for CPath.
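The coarse selection step can be sketched as a similarity ranking between patch and text embeddings. The shapes and the max-over-classes scoring rule below are assumptions standing in for the paper's knowledge-distillation mechanism.

```python
import torch
import torch.nn.functional as F

def select_informative_patches(patch_emb, text_emb, k=64):
    """Coarse patch selection: score low-res patch embeddings against
    class-specific text embeddings and keep the top-k. The embeddings
    stand in for the outputs of a pathology VLM's image and text towers."""
    patch_emb = F.normalize(patch_emb, dim=-1)   # (N, D)
    text_emb = F.normalize(text_emb, dim=-1)     # (C, D), one row per class description
    sims = patch_emb @ text_emb.t()              # (N, C) cosine similarities
    scores = sims.max(dim=-1).values             # best-matching class per patch
    return scores.topk(k).indices                # patches to encode at high resolution

patch_emb = torch.randn(10_000, 512)  # embeddings of low-res patches
text_emb = torch.randn(4, 512)        # embeddings of 4 class descriptions
keep = select_informative_patches(patch_emb, text_emb, k=64)
print(keep.shape)  # torch.Size([64])
```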
[119] Making Pose Representations More Expressive and Disentangled via Residual Vector Quantization
Sukhyun Jeong, Hong-Gi Shin, Yong-Hoon Choi
Main category: cs.CV
TL;DR: Proposed method enhances controllable motion generation by combining discrete pose codes with continuous motion features using residual vector quantization, improving both motion quality and controllability.
Details
Motivation: Discrete pose codes alone cannot capture fine-grained motion details, limiting expressiveness in controllable motion generation systems.
Method: Augments pose code-based latent representations with continuous motion features using residual vector quantization (RVQ) to preserve interpretability while capturing subtle motion characteristics (see the sketch after this entry).
Result: Experiments on HumanML3D show FID reduction from 0.041 to 0.015 and Top-1 R-Precision improvement from 0.508 to 0.510, with qualitative analysis confirming enhanced controllability.
Conclusion: The proposed RVQ-based approach successfully bridges the gap between discrete pose codes and continuous motion details, achieving both high-quality motion generation and improved controllability for motion editing tasks.
Abstract: Recent progress in text-to-motion has advanced both 3D human motion generation and text-based motion control. Controllable motion generation (CoMo), which enables intuitive control, typically relies on pose code representations, but discrete pose codes alone cannot capture fine-grained motion details, limiting expressiveness. To overcome this, we propose a method that augments pose code-based latent representations with continuous motion features using residual vector quantization (RVQ). This design preserves the interpretability and manipulability of pose codes while effectively capturing subtle motion characteristics such as high-frequency details. Experiments on the HumanML3D dataset show that our model reduces Frechet inception distance (FID) from 0.041 to 0.015 and improves Top-1 R-Precision from 0.508 to 0.510. Qualitative analysis of pairwise direction similarity between pose codes further confirms the model’s controllability for motion editing.
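Residual vector quantization itself is simple to state in code: each stage quantizes whatever the previous stages failed to capture, so later codebooks encode progressively finer detail. A minimal sketch with random toy codebooks follows; the dimensions and the number of stages are illustrative.

```python
import torch

def residual_vq(z, codebooks):
    """Cascade of quantizers: each codebook encodes the residual
    left over by the previous stages."""
    residual = z
    quantized = torch.zeros_like(z)
    indices = []
    for cb in codebooks:               # cb: (K, D) codebook of K codewords
        d = torch.cdist(residual, cb)  # (N, K) pairwise L2 distances
        idx = d.argmin(dim=-1)         # nearest codeword per vector
        q = cb[idx]                    # (N, D) selected codewords
        quantized = quantized + q
        residual = residual - q        # finer detail goes to the next stage
        indices.append(idx)
    return quantized, indices

torch.manual_seed(0)
codebooks = [torch.randn(256, 64) for _ in range(3)]  # 3 stages of 256 codes
z = torch.randn(8, 64)                                # e.g. continuous motion features
z_q, idx = residual_vq(z, codebooks)
print(z_q.shape, len(idx))  # torch.Size([8, 64]) 3
```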
[120] Locality-aware Concept Bottleneck Model
Sujin Jeon, Hyundo Lee, Eungseo Kim, Sanghack Lee, Byoung-Tak Zhang, Inwoo Hwang
Main category: cs.CV
TL;DR: LCBM improves concept localization in interpretable models by using prototype learning with foundation models to ensure concepts are predicted from relevant image regions.
Details
Motivation: Existing label-free concept bottleneck models often fail to localize concepts properly, attending to irrelevant regions when predicting concept presence, which reduces interpretability and reliability.
Method: Proposes Locality-aware Concept Bottleneck Model (LCBM) that assigns one prototype per concept, learns prototypical image features, and leverages foundation models to ensure prototypes encode relevant local regions for accurate concept localization (see the sketch after this entry).
Result: Experimental results show LCBM effectively identifies present concepts in images with improved localization accuracy while maintaining comparable classification performance to existing methods.
Conclusion: LCBM successfully addresses the localization problem in concept bottleneck models by combining prototype learning with foundation model guidance, resulting in more reliable and interpretable concept-based predictions.
Abstract: Concept bottleneck models (CBMs) are inherently interpretable models that make predictions based on human-understandable visual cues, referred to as concepts. As obtaining dense concept annotations with human labeling is demanding and costly, recent approaches utilize foundation models to determine the concepts existing in the images. However, such label-free CBMs often fail to localize concepts in relevant regions, attending to visually unrelated regions when predicting concept presence. To this end, we propose a framework, coined Locality-aware Concept Bottleneck Model (LCBM), which utilizes rich information from foundation models and adopts prototype learning to ensure accurate spatial localization of the concepts. Specifically, we assign one prototype to each concept, promoted to represent a prototypical image feature of that concept. These prototypes are learned by encouraging them to encode similar local regions, leveraging foundation models to assure the relevance of each prototype to its associated concept. Then we use the prototypes to facilitate the learning process of identifying the proper local region from which each concept should be predicted. Experimental results demonstrate that LCBM effectively identifies present concepts in the images and exhibits improved localization while maintaining comparable classification performance.
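The prototype idea can be sketched as follows: one embedding per concept is matched against every spatial location, and concept presence is read from the peak response rather than a globally pooled feature, which is what ties each concept to a local region. The shapes and cosine-similarity scoring are assumptions.

```python
import torch
import torch.nn.functional as F

def concept_maps(feats, prototypes):
    """Score each spatial location against one prototype per concept;
    presence is the best-matching region, not the whole image."""
    B, D, H, W = feats.shape
    f = F.normalize(feats.flatten(2), dim=1)   # (B, D, HW), unit-norm per location
    p = F.normalize(prototypes, dim=1)         # (K, D), one prototype per concept
    sim = torch.einsum("kd,bdn->bkn", p, f)    # (B, K, HW) per-location match
    presence = sim.max(dim=-1).values          # (B, K) peak response per concept
    return sim.view(B, -1, H, W), presence

feats = torch.randn(2, 512, 14, 14)            # backbone feature map
prototypes = torch.randn(20, 512)              # 20 learnable concept prototypes
maps, presence = concept_maps(feats, prototypes)
print(maps.shape, presence.shape)              # (2, 20, 14, 14) (2, 20)
```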
[121] GOGS: High-Fidelity Geometry and Relighting for Glossy Objects via Gaussian Surfels
Xingyuan Yang, Min Wei
Main category: cs.CV
TL;DR: GOGS is a two-stage framework using 2D Gaussian surfels for inverse rendering of glossy objects, achieving robust surface reconstruction and material decomposition with state-of-the-art performance in geometry, materials, and relighting.
Details
Motivation: Existing NeRF-based methods are computationally expensive, while 3D Gaussian Splatting struggles with specular reflections, multi-view inconsistencies, and simplified rendering equations that produce poor relighting results.
Method: Two-stage approach: 1) Physics-based rendering with split-sum approximation and geometric priors for surface reconstruction, 2) Material decomposition using Monte Carlo importance sampling, differentiable 2D Gaussian ray tracing for indirect illumination, and spherical mipmap-based directional encoding for specular details.
Result: Extensive experiments show state-of-the-art performance in geometry reconstruction, material separation, and photorealistic relighting under novel illuminations, outperforming existing inverse rendering approaches.
Conclusion: GOGS successfully addresses limitations of current methods by combining robust surface reconstruction with accurate material decomposition, enabling high-quality inverse rendering of glossy objects from RGB imagery.
Abstract: Inverse rendering of glossy objects from RGB imagery remains fundamentally limited by inherent ambiguity. Although NeRF-based methods achieve high-fidelity reconstruction via dense-ray sampling, their computational cost is prohibitive. Recent 3D Gaussian Splatting achieves high reconstruction efficiency but exhibits limitations under specular reflections. Multi-view inconsistencies introduce high-frequency surface noise and structural artifacts, while simplified rendering equations obscure material properties, leading to implausible relighting results. To address these issues, we propose GOGS, a novel two-stage framework based on 2D Gaussian surfels. First, we establish robust surface reconstruction through physics-based rendering with split-sum approximation, enhanced by geometric priors from foundation models. Second, we perform material decomposition by leveraging Monte Carlo importance sampling of the full rendering equation, modeling indirect illumination via differentiable 2D Gaussian ray tracing and refining high-frequency specular details through spherical mipmap-based directional encoding that captures anisotropic highlights. Extensive experiments demonstrate state-of-the-art performance in geometry reconstruction, material separation, and photorealistic relighting under novel illuminations, outperforming existing inverse rendering approaches.
[122] Safety-Critical Learning for Long-Tail Events: The TUM Traffic Accident Dataset
Walter Zimmer, Ross Greer, Xingcheng Zhou, Rui Song, Marc Pavel, Daniel Lehmberg, Ahmed Ghita, Akshay Gopalkrishnan, Mohan Trivedi, Alois Knoll
Main category: cs.CV
TL;DR: TUMTraf-A dataset containing 10 real-world highway accident sequences with extensive 2D/3D labeling, plus Accid3nD model combining rule-based and learning-based approaches for accident detection.
Details
Motivation: Despite safety improvements, traffic accidents remain unavoidable and sporadic, requiring better understanding through comprehensive datasets and detection methods.
Method: Collection of real-world highway accidents from roadside cameras and LiDARs, creation of TUMTraf-A dataset with OpenLABEL format, and development of Accid3nD model combining rule-based and learning-based approaches.
Result: Dataset contains 294,924 labeled 2D boxes, 93,012 labeled 3D boxes, 48,144 labeled frames from 4 cameras/LiDARs at 10Hz, covering 10 object classes. Experiments show robustness of proposed Accid3nD method.
Conclusion: The TUMTraf-A dataset and Accid3nD model provide valuable resources for traffic accident research, with the combined approach demonstrating effectiveness in accident detection on real-world highway scenarios.
Abstract: Even though a significant amount of work has been done to increase the safety of transportation networks, accidents still occur regularly. They must be understood as an unavoidable and sporadic outcome of traffic networks. We present the TUM Traffic Accident (TUMTraf-A) dataset, a collection of real-world highway accidents. It contains ten sequences of vehicle crashes at high-speed driving with 294,924 labeled 2D and 93,012 labeled 3D boxes and track IDs within 48,144 labeled frames recorded from four roadside cameras and LiDARs at 10 Hz. The dataset contains ten object classes and is provided in the OpenLABEL format. We propose Accid3nD, an accident detection model that combines a rule-based approach with a learning-based one. Experiments and ablation studies on our dataset show the robustness of our proposed method. The dataset, model, and code are available on our project website: https://tum-traffic-dataset.github.io/tumtraf-a.
[123] Controllable Latent Space Augmentation for Digital Pathology
Sofiène Boutaj, Marin Scalbert, Pierre Marza, Florent Couzinie-Devy, Maria Vakalopoulou, Stergios Christodoulidis
Main category: cs.CV
TL;DR: HistAug is a generative model for controllable latent space augmentations in digital pathology that improves multiple instance learning performance by efficiently generating realistic augmented embeddings while preserving semantic information.
Details
Motivation: Whole slide image analysis faces challenges due to gigapixel resolution and limited supervision. Traditional patch-level augmentation is computationally expensive, while existing feature-level methods lack transformation control, creating a need for efficient and semantically meaningful augmentation techniques.
Method: HistAug uses a generative model conditioned on explicit patch-level transformations (e.g., hue, erosion) to generate realistic augmented embeddings in latent space. It processes large numbers of patches efficiently in a single forward pass while preserving semantic information (see the sketch after this entry).
Result: Experiments across multiple slide-level tasks and diverse organs show HistAug outperforms existing methods, particularly in low-data regimes. It consistently improves MIL model performance and handles large patch volumes efficiently.
Conclusion: HistAug provides an effective solution for WSI analysis by enabling fast, controllable augmentations that enhance model robustness. The method demonstrates the superiority of learned transformations over noise-based approaches and highlights the importance of uniform WSI-wise augmentation.
Abstract: Whole slide image (WSI) analysis in digital pathology presents unique challenges due to the gigapixel resolution of WSIs and the scarcity of dense supervision signals. While Multiple Instance Learning (MIL) is a natural fit for slide-level tasks, training robust models requires large and diverse datasets. Even though image augmentation techniques could be utilized to increase data variability and reduce overfitting, implementing them effectively is not a trivial task. Traditional patch-level augmentation is prohibitively expensive due to the large number of patches extracted from each WSI, and existing feature-level augmentation methods lack control over transformation semantics. We introduce HistAug, a fast and efficient generative model for controllable augmentations in the latent space for digital pathology. By conditioning on explicit patch-level transformations (e.g., hue, erosion), HistAug generates realistic augmented embeddings while preserving initial semantic information. Our method allows the processing of a large number of patches in a single forward pass efficiently, while at the same time consistently improving MIL model performance. Experiments across multiple slide-level tasks and diverse organs show that HistAug outperforms existing methods, particularly in low-data regimes. Ablation studies confirm the benefits of learned transformations over noise-based perturbations and highlight the importance of uniform WSI-wise augmentation. Code is available at https://github.com/MICS-Lab/HistAug.
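A toy version of conditioning an embedding-space augmenter on explicit transformation parameters is shown below. The residual MLP, the two-parameter condition vector, and the slide-wide shared condition (echoing the paper's uniform WSI-wise augmentation finding) are illustrative assumptions, not HistAug's architecture.

```python
import torch
import torch.nn as nn

class LatentAugmenter(nn.Module):
    """Toy conditional generator: maps a patch embedding plus explicit
    transformation parameters (e.g., hue shift, erosion strength) to the
    embedding of the transformed patch."""
    def __init__(self, dim=768, cond_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim, 1024), nn.GELU(),
            nn.Linear(1024, dim),
        )

    def forward(self, emb, cond):
        # residual prediction keeps the augmented embedding near the original
        return emb + self.net(torch.cat([emb, cond], dim=-1))

aug = LatentAugmenter()
emb = torch.randn(4096, 768)                        # all patch embeddings of a WSI
cond = torch.tensor([[0.1, 0.0]]).expand(4096, 2)   # one hue shift for the whole slide
print(aug(emb, cond).shape)                         # one forward pass augments every patch
```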
[124] Reliable Smoke Detection via Optical Flow-Guided Feature Fusion and Transformer-Based Uncertainty Modeling
Nitish Kumar Mahala, Muzammil Khan, Pushpendra Kumar
Main category: cs.CV
TL;DR: A novel smoke detection framework using optical flow-based motion encoding and a Two-Phase Uncertainty-Aware Shifted Windows Transformer for robust early fire detection from monocular imagery.
Details
Motivation: Traditional smoke detectors struggle with complex spatiotemporal dynamics, illumination variability, and environmental noise. There's a need for high-fidelity early-warning systems without complex multi-sensor arrays.
Method: Proposes a two-phase approach: 1) Optical flow estimation using four-color-theorem-inspired dual-phase level-set fractional-order variational model to preserve motion discontinuities, 2) Fusion of color-encoded optical flow maps with appearance cues via Gaussian Mixture Model, 3) Shifted-Windows Transformer with multi-scale uncertainty estimation head trained under two-phase learning regimen (see the sketch after this entry).
Result: Extensive experiments demonstrate superior generalization and robustness compared to state-of-the-art approaches, offering reliable smoke detection across various evaluation metrics.
Conclusion: The framework provides a reliable solution for early fire detection in surveillance, industrial safety, and autonomous monitoring applications by effectively handling complex smoke dynamics through information fusion and uncertainty-aware learning.
Abstract: Fire outbreaks pose critical threats to human life and infrastructure, necessitating high-fidelity early-warning systems that detect combustion precursors such as smoke. However, smoke plumes exhibit complex spatiotemporal dynamics influenced by illumination variability, flow kinematics, and environmental noise, undermining the reliability of traditional detectors. To address these challenges without the logistical complexity of multi-sensor arrays, we propose an information-fusion framework by integrating smoke feature representations extracted from monocular imagery. Specifically, a Two-Phase Uncertainty-Aware Shifted Windows Transformer for robust and reliable smoke detection, leveraging a novel smoke segmentation dataset, constructed via optical flow-based motion encoding, is proposed. The optical flow estimation is performed with a four-color-theorem-inspired dual-phase level-set fractional-order variational model, which preserves motion discontinuities. The resulting color-encoded optical flow maps are fused with appearance cues via a Gaussian Mixture Model to generate binary segmentation masks of the smoke regions. These fused representations are fed into the novel Shifted-Windows Transformer, which is augmented with a multi-scale uncertainty estimation head and trained under a two-phase learning regimen. The first learning phase optimizes smoke detection accuracy, while in the second phase the model learns to estimate plausibility confidence in its predictions by jointly modeling aleatoric and epistemic uncertainties. Extensive experiments using multiple evaluation metrics and comparative analysis with state-of-the-art approaches demonstrate superior generalization and robustness, offering a reliable solution for early fire detection in surveillance, industrial safety, and autonomous monitoring applications.
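The fusion step, clustering joint flow/appearance features with a Gaussian Mixture Model to obtain a binary smoke mask, might look roughly like this. The two-component GMM and the rule that the higher-motion component is smoke are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fuse_to_mask(flow_rgb, frame_rgb):
    """Cluster pixels in the joint flow+appearance feature space with a
    2-component GMM and call the higher-motion component smoke."""
    H, W, _ = flow_rgb.shape
    feats = np.concatenate([flow_rgb, frame_rgb], axis=-1).reshape(-1, 6)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(feats)
    labels = gmm.predict(feats)
    # assign "smoke" to the component whose mean flow magnitude is larger
    flow_mag = np.linalg.norm(flow_rgb.reshape(-1, 3), axis=1)
    smoke = int(flow_mag[labels == 1].mean() > flow_mag[labels == 0].mean())
    return (labels == smoke).reshape(H, W)

flow = np.random.rand(32, 32, 3)   # color-encoded optical flow map
frame = np.random.rand(32, 32, 3)  # appearance cues
print(fuse_to_mask(flow, frame).shape)  # (32, 32) binary mask
```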
[125] Incremental Object Detection with Prompt-based Methods
Matthias Neuwirth-Trapp, Maarten Bieshaar, Danda Pani Paudel, Luc Van Gool
Main category: cs.CV
TL;DR: Prompt-based methods underperform in incremental object detection compared to image classification, but combining visual prompts with limited data replay achieves best results.
Details
Motivation: To evaluate the generalizability of visual prompt-based methods from incremental image classification to incremental object detection (IOD) under complex domain-incremental learning settings.
Method: Analyzed three different prompt-based methods under domain-incremental learning for object detection, compared with various baselines, and tested combinations of visual prompts with small data replay.
Result: Prompt-based approaches alone underperformed in IOD setting, but visual prompts combined with replaying a small portion of previous data achieved the best performance.
Conclusion: While prompt-based methods show limitations in IOD, their combination with limited data replay offers a strong practical solution, providing valuable insights for advancing prompt-based incremental learning in object detection.
Abstract: Visual prompt-based methods have seen growing interest in incremental learning (IL) for image classification. These approaches learn additional embedding vectors while keeping the model frozen, making them efficient to train. However, no prior work has applied such methods to incremental object detection (IOD), leaving their generalizability unclear. In this paper, we analyze three different prompt-based methods under a complex domain-incremental learning setting. We additionally provide a wide range of reference baselines for comparison. Empirically, we show that the prompt-based approaches we tested underperform in this setting. However, a strong yet practical method, combining visual prompts with replaying a small portion of previous data, achieves the best results. Together with additional experiments on prompt length and initialization, our findings offer valuable insights for advancing prompt-based IL in IOD.
[126] UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
Peiming Li, Ziyi Wang, Yulin Yuan, Hong Liu, Xiangming Meng, Junsong Yuan, Mengyuan Liu
Main category: cs.CV
TL;DR: UST-SSM extends Selective State Space Models to point cloud videos by addressing spatio-temporal disorder through semantic-aware sequence reorganization and feature aggregation.
Details
Motivation: Point cloud videos are effective for human action recognition but suffer from spatio-temporal disorder that hinders unidirectional modeling when directly processed as 1D sequences.
Method: Proposes Unified Spatio-Temporal State Space Model with Spatial-Temporal Selection Scanning (STSS) for semantic-aware sequence reorganization, Spatio-Temporal Structure Aggregation (STSA) for feature compensation, and Temporal Interaction Sampling (TIS) for enhanced temporal dependencies.
Result: Experimental validation on MSR-Action3D, NTU RGB+D, and Synthia 4D datasets demonstrates the effectiveness of the proposed method.
Conclusion: UST-SSM successfully addresses the challenges of spatio-temporal disorder in point cloud videos and enables effective utilization of spatially and temporally distant yet similar points for improved action recognition.
Abstract: Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. For missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features and compensates. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.
[127] SMTrack: End-to-End Trained Spiking Neural Networks for Multi-Object Tracking in RGB Videos
Pengzhi Zhong, Xinzhe Wang, Dan Zeng, Qihua Zhou, Feixiang He, Shuiwang Li
Main category: cs.CV
TL;DR: SMTrack is the first directly trained deep Spiking Neural Network framework for end-to-end multi-object tracking on standard RGB videos, achieving performance comparable to ANN-based methods.
Details
Motivation: SNNs show potential for low-power computation but their application in visual tasks is limited to basic tasks like classification and detection. The potential of directly-trained SNNs for complex temporal tasks like multi-object tracking on standard RGB videos remains underexplored.
Method: Proposes SMTrack with adaptive scale-aware Normalized Wasserstein Distance loss (Asa-NWDLoss) that dynamically adjusts normalization based on average object size per batch, and incorporates TrackTrack identity module for robust association (see the sketch after this entry).
Result: Extensive evaluations on BEE24, MOT17, MOT20, and DanceTrack show SMTrack achieves performance on par with leading ANN-based MOT methods.
Conclusion: SMTrack advances robust and accurate SNN-based tracking in complex scenarios, demonstrating the viability of directly-trained SNNs for complex temporal vision tasks.
Abstract: Brain-inspired Spiking Neural Networks (SNNs) exhibit significant potential for low-power computation, yet their application in visual tasks remains largely confined to image classification, object detection, and event-based tracking. In contrast, real-world vision systems still widely use conventional RGB video streams, where the potential of directly-trained SNNs for complex temporal tasks such as multi-object tracking (MOT) remains underexplored. To address this challenge, we propose SMTrack, the first directly trained deep SNN framework for end-to-end multi-object tracking on standard RGB videos. SMTrack introduces an adaptive and scale-aware Normalized Wasserstein Distance loss (Asa-NWDLoss) to improve detection and localization performance under varying object scales and densities. Specifically, the method computes the average object size within each training batch and dynamically adjusts the normalization factor, thereby enhancing sensitivity to small objects. For the association stage, we incorporate the TrackTrack identity module to maintain robust and consistent object trajectories. Extensive evaluations on BEE24, MOT17, MOT20, and DanceTrack show that SMTrack achieves performance on par with leading ANN-based MOT methods, advancing robust and accurate SNN-based tracking in complex scenarios.
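The loss can be sketched from the standard Normalized Wasserstein Distance for bounding boxes, with the normalizer made adaptive to the batch's average object size as the summary describes. The exact adaptive rule below (mean sqrt-area of the targets) is an assumption, not the paper's formulation.

```python
import torch

def asa_nwd_loss(pred, target):
    """NWD between predicted and target boxes (cx, cy, w, h), with the
    normalizer C set from the average object size in the batch."""
    # squared 2-Wasserstein distance between the Gaussians induced by the boxes
    mu_term = (pred[:, :2] - target[:, :2]).pow(2).sum(-1)
    wh_term = ((pred[:, 2:] - target[:, 2:]) / 2).pow(2).sum(-1)
    w2 = torch.sqrt(mu_term + wh_term + 1e-7)
    # adaptive normalizer: mean sqrt-area of the targets in this batch
    C = torch.sqrt(target[:, 2] * target[:, 3]).mean().clamp(min=1.0)
    nwd = torch.exp(-w2 / C)       # in (0, 1]; 1 means identical boxes
    return (1.0 - nwd).mean()

boxes_pred = torch.tensor([[10., 10., 4., 6.], [50., 40., 12., 8.]])
boxes_gt   = torch.tensor([[11., 10., 4., 5.], [48., 42., 10., 9.]])
print(asa_nwd_loss(boxes_pred, boxes_gt))
```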
[128] AnchorSync: Global Consistency Optimization for Long Video Editing
Zichi Liu, Yinggui Wang, Tao Wei, Chao Ma
Main category: cs.CV
TL;DR: AnchorSync is a diffusion-based framework for long video editing that uses sparse anchor frame editing and smooth interpolation to maintain global consistency and temporal coherence across thousands of frames.
Details
Motivation: Existing video editing methods struggle with structural drift and temporal artifacts in long videos, particularly minute-long sequences with thousands of frames that require both global consistency and temporal coherence.
Method: Decouples long video editing into sparse anchor frame editing and smooth intermediate frame interpolation, enforcing structural consistency through progressive denoising and preserving temporal dynamics via multimodal guidance.
Result: Extensive experiments show AnchorSync produces coherent, high-fidelity edits that surpass prior methods in both visual quality and temporal stability.
Conclusion: AnchorSync successfully addresses the challenges of long video editing by maintaining structural consistency and temporal coherence through its novel diffusion-based framework with anchor frame editing and interpolation approach.
Abstract: Editing long videos remains a challenging task due to the need for maintaining both global consistency and temporal coherence across thousands of frames. Existing methods often suffer from structural drift or temporal artifacts, particularly in minute-long sequences. We introduce AnchorSync, a novel diffusion-based framework that enables high-quality, long-term video editing by decoupling the task into sparse anchor frame editing and smooth intermediate frame interpolation. Our approach enforces structural consistency through a progressive denoising process and preserves temporal dynamics via multimodal guidance. Extensive experiments show that AnchorSync produces coherent, high-fidelity edits, surpassing prior methods in visual quality and temporal stability.
[129] Towards PerSense++: Advancing Training-Free Personalized Instance Segmentation in Dense Images
Muhammad Ibraheem Siddiqui, Muhammad Umer Sheikh, Hassan Abid, Kevin Henry, Muhammad Haris Khan
Main category: cs.CV
TL;DR: PerSense++ is a training-free one-shot framework for personalized instance segmentation in dense images, using density maps and adaptive filtering to handle occlusions and clutter.
Details
Motivation: Address challenges in dense visual scene segmentation including occlusions, background clutter, and scale variations that make instance segmentation difficult.
Method: Uses Instance Detection Module with density maps for candidate point prompts, Point Prompt Selection Module with adaptive thresholding, and feedback mechanism. Enhanced version adds diversity-aware exemplar selection, hybrid IDM combining contour/peak-based prompts, and Irrelevant Mask Rejection Module (see the sketch after this entry).
Result: Extensive experiments show PerSense++ outperforms existing methods in dense settings across multiple benchmarks.
Conclusion: The framework effectively handles dense segmentation challenges and introduces a dedicated benchmark (PerSense-D) for this underexplored task.
Abstract: Segmentation in dense visual scenes poses significant challenges due to occlusions, background clutter, and scale variations. To address this, we introduce PerSense, an end-to-end, training-free, and model-agnostic one-shot framework for Personalized instance Segmentation in dense images. PerSense employs a novel Instance Detection Module (IDM) that leverages density maps (DMs) to generate instance-level candidate point prompts, followed by a Point Prompt Selection Module (PPSM) that filters false positives via adaptive thresholding and spatial gating. A feedback mechanism further enhances segmentation by automatically selecting effective exemplars to improve DM quality. We additionally present PerSense++, an enhanced variant that incorporates three additional components to improve robustness in cluttered scenes: (i) a diversity-aware exemplar selection strategy that leverages feature and scale diversity for better DM generation; (ii) a hybrid IDM combining contour and peak-based prompt generation for improved instance separation within complex density patterns; and (iii) an Irrelevant Mask Rejection Module (IMRM) that discards spatially inconsistent masks using outlier analysis. Finally, to support this underexplored task, we introduce PerSense-D, a dedicated benchmark for personalized segmentation in dense images. Extensive experiments across multiple benchmarks demonstrate that PerSense++ outperforms existing methods in dense settings.
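The detection-module idea, turning a density map into candidate point prompts and filtering them adaptively, can be sketched with local-maximum detection. The window size and the relative threshold are illustrative; the paper's PPSM uses its own adaptive thresholding and spatial gating.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def candidate_points(density, rel_thresh=0.3):
    """Turn a density map into candidate point prompts: local maxima,
    filtered by a threshold adapted to the map's own peak statistics."""
    local_max = (density == maximum_filter(density, size=7)) & (density > 0)
    peaks = np.argwhere(local_max)            # (M, 2) row/col coordinates
    vals = density[local_max]
    # adaptive filter: keep peaks above a fraction of the strongest peak
    keep = vals >= rel_thresh * vals.max()
    return peaks[keep]

density = np.zeros((64, 64))
density[10, 12] = 1.0
density[40, 50] = 0.8
density[30, 30] = 0.1
print(candidate_points(density))  # the weak 0.1 peak is filtered out
```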
[130] GeMS: Efficient Gaussian Splatting for Extreme Motion Blur
Gopi Raju Matta, Trisha Reddypalli, Vemunuri Divya Madhuri, Kaushik Mitra
Main category: cs.CV
TL;DR: GeMS is a 3D Gaussian Splatting framework that handles severely motion-blurred images without requiring sharp reference images, using deep learning-based pose estimation, probabilistic Gaussian initialization, and optional event-based refinement.
Details
Motivation: Existing deblurring methods assume access to sharp images for camera pose estimation, which is unrealistic. Methods relying on COLMAP fail under severe blur due to unreliable feature correspondences.
Method: Integrates VGGSfM for pose estimation from blurred inputs, 3DGS-MCMC for probabilistic Gaussian initialization, joint optimization of camera trajectories and Gaussian parameters, and GeMS-E adds event-based EDI deblurring for progressive refinement (see the sketch after this entry).
Result: Achieves state-of-the-art performance on synthetic and real-world datasets, successfully reconstructing scenes directly from extremely blurred images.
Conclusion: First framework to address extreme motion blur within 3DGS directly from severely blurred inputs, eliminating the need for sharp reference images that existing methods rely on.
Abstract: We introduce GeMS, a framework for 3D Gaussian Splatting (3DGS) designed to handle severely motion-blurred images. State-of-the-art deblurring methods for extreme blur, such as ExBluRF, as well as Gaussian Splatting-based approaches like Deblur-GS, typically assume access to sharp images for camera pose estimation and point cloud generation, an unrealistic assumption. Methods relying on COLMAP initialization, such as BAD-Gaussians, also fail due to unreliable feature correspondences under severe blur. To address these challenges, we propose GeMS, a 3DGS framework that reconstructs scenes directly from extremely blurred images. GeMS integrates: (1) VGGSfM, a deep learning-based Structure-from-Motion pipeline that estimates poses and generates point clouds directly from blurred inputs; (2) 3DGS-MCMC, which enables robust scene initialization by treating Gaussians as samples from a probability distribution, eliminating heuristic densification and pruning; and (3) joint optimization of camera trajectories and Gaussian parameters for stable reconstruction. While this pipeline produces strong results, inaccuracies may remain when all inputs are severely blurred. To mitigate this, we propose GeMS-E, which integrates a progressive refinement step using events: (4) Event-based Double Integral (EDI) deblurring restores sharper images that are then fed into GeMS, improving pose estimation, point cloud generation, and overall reconstruction. Both GeMS and GeMS-E achieve state-of-the-art performance on synthetic and real-world datasets. To our knowledge, this is the first framework to address extreme motion blur within 3DGS directly from severely blurred inputs.
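The EDI step in GeMS-E rests on a one-line relationship: the blurred frame is the average of latent frames, each of which is the sharp frame scaled by exp(c·E(t)), where E(t) accumulates signed event polarities. A per-pixel numpy sketch (with the contrast threshold c assumed known) recovers the sharp frame by division.

```python
import numpy as np

def edi_latent_frame(blurred, event_polarity_sums, c=0.2):
    """Event-based Double Integral: since B = mean_t L(0) * exp(c * E(t)),
    the sharp frame L(0) is B divided by the mean of exp(c * E(t))."""
    # event_polarity_sums: (T, H, W) cumulative signed event counts at T samples
    denom = np.exp(c * event_polarity_sums).mean(axis=0)
    return blurred / np.maximum(denom, 1e-6)

H = W = 8
E = np.cumsum(np.random.randint(-1, 2, size=(16, H, W)), axis=0)
latent = np.random.rand(H, W)
blurred = latent * np.exp(0.2 * E).mean(axis=0)  # blur consistent with the events
print(np.allclose(edi_latent_frame(blurred, E), latent))  # True
```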
[131] Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation Models
Jiabo Huang, Chen Chen, Lingjuan Lyu
Main category: cs.CV
TL;DR: A model-driven approach for vision foundation models that unifies multiple pre-trained teacher models through joint knowledge transfer and preservation, eliminating the need for large-scale labeled data training.
Details
Motivation: Current vision foundation models rely on data-centric methods requiring vast labeled data and high-end GPUs, which are inaccessible to most institutions. Many open-source domain-specific models contain valuable knowledge but remain underutilized for general-purpose VFM development.
Method: Unifies multiple pre-trained teacher models in a shared latent space to address distributional gaps, and uses a knowledge preservation strategy with a general-purpose teacher as a knowledge base to integrate purpose-specific teachers via an adapter module.
Result: The developed VFM outperforms existing data-centric models across four fundamental vision tasks: image classification, object detection, semantic segmentation, and instance segmentation.
Conclusion: This model-driven approach successfully builds powerful vision foundation models by unifying and aggregating existing models, inheriting teachers’ expertise without requiring large-scale labeled data training, while providing generalizable features and supporting multiple downstream tasks.
Abstract: Vision foundation models (VFMs) are predominantly developed using data-centric methods. These methods require training on vast amounts of data usually with high-quality labels, which poses a bottleneck for most institutions that lack both large-scale data and high-end GPUs. On the other hand, many open-source vision models have been pretrained on domain-specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Even though these models are highly valuable assets, they remain largely under-explored in empowering the development of a general-purpose VFM. In this paper, we present a new model-driven approach for training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre-trained teacher models in a shared latent space to mitigate the "imbalanced transfer" issue caused by their distributional gaps. Besides, we introduce a knowledge preservation strategy to take a general-purpose teacher as a knowledge base for integrating knowledge from the remaining purpose-specific teachers using an adapter module. By unifying and aggregating existing models, we build a powerful VFM to inherit teachers' expertise without needing to train on a large amount of labeled data. Our model not only provides generalizable visual features, but also inherently supports multiple downstream tasks. Extensive experiments demonstrate that our VFM outperforms existing data-centric models across four fundamental vision tasks, including image classification, object detection, semantic and instance segmentation.
[132] GSFix3D: Diffusion-Guided Repair of Novel Views in Gaussian Splatting
Jiaxin Wei, Stefan Leutenegger, Simon Schaefer
Main category: cs.CV
TL;DR: GSFix3D improves 3D Gaussian Splatting by integrating diffusion model knowledge to enhance rendering quality in under-constrained regions and extreme viewpoints.
Details
Motivation: Address limitations of 3D Gaussian Splatting in extreme novel viewpoints and partially observed regions, and overcome diffusion models' lack of scene awareness for accurate 3D reconstruction.
Method: Introduces GSFixer, a latent diffusion model fine-tuned to leverage both mesh and 3D Gaussians, with random mask augmentation for plausible inpainting of missing regions.
Result: Achieves state-of-the-art performance on challenging benchmarks with minimal scene-specific fine-tuning, demonstrating resilience to pose errors in real-world tests.
Conclusion: GSFix3D successfully bridges generative capabilities of diffusion models with 3D scene consistency, enabling robust novel view repair for unseen camera poses.
Abstract: Recent developments in 3D Gaussian Splatting have significantly enhanced novel view synthesis, yet generating high-quality renderings from extreme novel viewpoints or partially observed regions remains challenging. Meanwhile, diffusion models exhibit strong generative capabilities, but their reliance on text prompts and lack of awareness of specific scene information hinder accurate 3D reconstruction tasks. To address these limitations, we introduce GSFix3D, a novel framework that improves the visual fidelity in under-constrained regions by distilling prior knowledge from diffusion models into 3D representations, while preserving consistency with observed scene details. At its core is GSFixer, a latent diffusion model obtained via our customized fine-tuning protocol that can leverage both mesh and 3D Gaussians to adapt pretrained generative models to a variety of environments and artifact types from different reconstruction methods, enabling robust novel view repair for unseen camera poses. Moreover, we propose a random mask augmentation strategy that empowers GSFixer to plausibly inpaint missing regions. Experiments on challenging benchmarks demonstrate that our GSFix3D and GSFixer achieve state-of-the-art performance, requiring only minimal scene-specific fine-tuning on captured data. Real-world test further confirms its resilience to potential pose errors. Our code and data will be made publicly available. Project page: https://gsfix3d.github.io.
[133] Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving
Leila Cheshmi, Mennatullah Siam
Main category: cs.CV
TL;DR: Efficient video transformer for class-agnostic segmentation using motion cues without optical flow, outperforming baselines while being memory and runtime efficient.
Details
Motivation: Addressing safety in autonomous driving by detecting unknown objects and handling unforeseen scenarios, overcoming limitations of class-dependent segmentation and expensive visual grounding methods.
Method: Multiscale video transformer with multi-stage query-memory decoding, scale-specific random drop-token, and shared learnable memory module to preserve high-resolution spatiotemporal features without optical flow.
Result: Consistently outperforms multiscale baselines on DAVIS'16, KITTI, and Cityscapes datasets while being efficient in GPU memory and run-time.
Conclusion: Demonstrates a promising direction for real-time, robust dense prediction in safety-critical robotics applications like autonomous driving.
Abstract: Ensuring safety in autonomous driving is a complex challenge requiring handling unknown objects and unforeseen driving scenarios. We develop multiscale video transformers capable of detecting unknown objects using only motion cues. Video semantic and panoptic segmentation often relies on known classes seen during training, overlooking novel categories. Recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We propose an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves high-resolution information at multiple scales. We evaluate on DAVIS'16, KITTI, and Cityscapes. Our method consistently outperforms multiscale baselines while being efficient in GPU memory and run-time, demonstrating a promising direction for real-time, robust dense prediction in safety-critical robotics.
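The scale-specific random drop-token can be pictured as dropping a different fraction of tokens at each resolution; a hedged sketch with illustrative per-scale rates, not the authors’ code.

```python
import torch

def drop_tokens(tokens: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """Keep a random subset of tokens from a (B, N, C) tensor."""
    B, N, C = tokens.shape
    n_keep = max(1, int(N * (1.0 - drop_rate)))
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))

# heavier dropping at the fine scale, lighter at the coarse scale
fine, coarse = torch.randn(2, 1024, 64), torch.randn(2, 256, 128)
pruned = [drop_tokens(fine, 0.5), drop_tokens(coarse, 0.2)]
```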
[134] Improved Mapping Between Illuminations and Sensors for RAW Images
Abhijith Punnappurath, Luxi Zhao, Hoang Le, Abdelrahman Abdelhamed, SaiKiran Kumar Tedla, Michael S. Brown
Main category: cs.CV
TL;DR: A novel dataset and neural network approach for RAW image illumination and sensor mapping to reduce data capture burden for deep learning methods.
Details
Motivation: RAW images have sensor- and illumination-specific properties that make dataset collection challenging, requiring scenes to be captured for each sensor under various illumination conditions.
Method: Created a first-of-its-kind dataset using a customized lightbox with tunable illumination spectra to capture scenes under 390 illuminations with 4 cameras across 18 scenes. Developed a lightweight neural network for illumination and sensor mapping.
Result: The proposed neural network approach outperforms competing methods and demonstrates utility in training neural ISP (Image Signal Processor) systems.
Conclusion: The introduced dataset and mapping method provide an effective solution for reducing the data capture burden in RAW image processing, enabling better performance in downstream tasks like neural ISP training.
Abstract: RAW images are unprocessed camera sensor output with sensor-specific RGB values based on the sensor’s color filter spectral sensitivities. RAW images also incur strong color casts due to the sensor’s response to the spectral properties of scene illumination. The sensor- and illumination-specific nature of RAW images makes it challenging to capture RAW datasets for deep learning methods, as scenes need to be captured for each sensor and under a wide range of illumination. Methods for illumination augmentation for a given sensor and the ability to map RAW images between sensors are important for reducing the burden of data capture. To explore this problem, we introduce the first-of-its-kind dataset comprising carefully captured scenes under a wide range of illumination. Specifically, we use a customized lightbox with tunable illumination spectra to capture several scenes with different cameras. Our illumination and sensor mapping dataset has 390 illuminations, four cameras, and 18 scenes. Using this dataset, we introduce a lightweight neural network approach for illumination and sensor mapping that outperforms competing methods. We demonstrate the utility of our approach on the downstream task of training a neural ISP. Link to project page: https://github.com/SamsungLabs/illum-sensor-mapping.
[135] Fusing Monocular RGB Images with AIS Data to Create a 6D Pose Estimation Dataset for Marine Vessels
Fabian Holst, Emre Gülsoylu, Simone Frintrop
Main category: cs.CV
TL;DR: Novel technique fuses monocular RGB images with AIS data to create 6D pose estimation datasets for marine vessels, eliminating need for manual annotation and achieving high accuracy with PnP method and YOLOX-X detection.
Details
Motivation: Address limitations of relying purely on AIS data for vessel location due to equipment reliability issues, data manipulation, and transmission delays by combining visual detection with AIS information.
Method: Fuses vessel detections from monocular RGB images (using YOLOX-X object detection) with AIS messages to generate 3D bounding boxes representing 6D poses. Compares homography and Perspective-n-Point transformation methods for alignment.
Result: Perspective-n-Point method achieves significantly lower projection error than homography. YOLOX-X achieves 0.80 mAP at IoU 0.5. Created BONK-pose dataset with 3753 images and 3D bounding boxes, plus 1000 images with 2D annotations.
Conclusion: The approach successfully creates 6D pose estimation datasets without manual annotation, providing valuable public dataset (BONK-pose) for training and evaluating pose estimation networks in marine environments.
Abstract: The paper presents a novel technique for creating a 6D pose estimation dataset for marine vessels by fusing monocular RGB images with Automatic Identification System (AIS) data. The proposed technique addresses the limitations of relying purely on AIS for location information, caused by issues like equipment reliability, data manipulation, and transmission delays. By combining vessel detections from monocular RGB images, obtained using an object detection network (YOLOX-X), with AIS messages, the technique generates 3D bounding boxes that represent the vessels’ 6D poses, i.e., spatial and rotational dimensions. The paper evaluates different object detection models to locate vessels in image space. We also compare two transformation methods (homography and Perspective-n-Point) for aligning AIS data with image coordinates. The results of our work demonstrate that the Perspective-n-Point (PnP) method achieves a significantly lower projection error compared to homography-based approaches used before, and the YOLOX-X model achieves a mean Average Precision (mAP) of 0.80 at an Intersection over Union (IoU) threshold of 0.5 for relevant vessel classes. Our results indicate that our approach allows the creation of a 6D pose estimation dataset without needing manual annotation. Additionally, we introduce the Boats on Nordelbe Kehrwieder (BONK-pose), a publicly available dataset comprising 3753 images with 3D bounding box annotations for pose estimation, created by our data fusion approach. This dataset can be used for training and evaluating 6D pose estimation networks. In addition, we introduce a set of 1000 images with 2D bounding box annotations for ship detection from the same scene.
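The PnP alignment the paper favors can be reproduced with OpenCV’s `solvePnP`; the 3D box corners (from an AIS position/heading converted to a local metric frame), 2D detections, and intrinsics below are placeholder values, not the paper’s data.

```python
import cv2
import numpy as np

# Hypothetical deck-plane corners of a vessel (metres, local frame) and
# their detected pixel locations; K is a placeholder camera intrinsic.
object_pts = np.array([[0, 0, 0], [50, 0, 0], [50, 9, 0], [0, 9, 0]], dtype=np.float64)
image_pts = np.array([[410, 300], [640, 295], [655, 330], [420, 338]], dtype=np.float64)
K = np.array([[1200, 0, 640], [0, 1200, 360], [0, 0, 1]], dtype=np.float64)

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, distCoeffs=None)
proj, _ = cv2.projectPoints(object_pts, rvec, tvec, K, None)
err = np.linalg.norm(proj.squeeze(1) - image_pts, axis=1).mean()
print(f"mean reprojection error: {err:.2f} px")  # the metric compared against homography
```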
[136] 6-DoF Object Tracking with Event-based Optical Flow and Frames
Zhichao Li, Arren Glover, Chiara Bartolozzi, Lorenzo Natale
Main category: cs.CV
TL;DR: Combines event camera optical flow with RGB-based pose estimation for high-speed 6-DoF object tracking, leveraging advantages of both sensor types to overcome motion blur limitations.
Details
Motivation: High-speed object tracking is challenging due to frame rate limitations and motion blur in conventional cameras. Event cameras offer high temporal resolution but lack rich visual information, while RGB cameras provide better single-shot pose estimation but suffer from motion issues at high speeds.
Method: Proposes event-based optical flow algorithm for object motion measurement to create a 6-DoF velocity tracker. Integrates this with low-frequency pose estimates from an RGB-based global pose estimator to enable high-speed tracking.
Result: The method was tested and validated on both synthetic and real-world data, demonstrating effectiveness particularly in high-speed motion scenarios where traditional approaches struggle.
Conclusion: The hybrid approach successfully exploits complementary advantages of event and RGB cameras, enabling robust 6-DoF pose tracking for high-speed moving objects by combining high-temporal-resolution motion data with rich visual pose estimation.
Abstract: Tracking the position and orientation of objects in space (i.e., in 6-DoF) in real time is a fundamental problem in robotics for environment interaction. It becomes more challenging when objects move at high speed due to frame rate limitations in conventional cameras and motion blur. Event cameras are characterized by high temporal resolution, low latency, and high dynamic range, which can potentially overcome the impacts of motion blur. Traditional RGB cameras provide rich visual information that is more suitable for the challenging task of single-shot object pose estimation. In this work, we propose using event-based optical flow combined with an RGB-based global object pose estimator for 6-DoF pose tracking of objects at high speed, exploiting the core advantages of both types of vision sensors. Specifically, we propose an event-based optical flow algorithm for object motion measurement to implement an object 6-DoF velocity tracker. By integrating the tracked object 6-DoF velocity with low-frequency pose estimates from the global pose estimator, the method can track pose when objects move at high speed. The proposed algorithm is tested and validated on both synthetic and real-world data, demonstrating its effectiveness, especially in high-speed motion scenarios.
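A minimal sketch of the pose-propagation idea: between low-frequency global pose estimates, the high-rate event-based 6-DoF velocity keeps the pose current. The world-frame velocity convention, step size, and update rate are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def integrate_pose(R_wo, t_wo, v_lin, v_ang, dt):
    """One Euler step of 6-DoF velocity: rotate by the angular-velocity
    rotation vector and translate by the linear velocity (world frame)."""
    R_new = R.from_rotvec(v_ang * dt).as_matrix() @ R_wo
    return R_new, t_wo + v_lin * dt

R_cur, t_cur = np.eye(3), np.zeros(3)
for _ in range(100):  # e.g. 100 high-rate velocity updates between RGB poses
    R_cur, t_cur = integrate_pose(R_cur, t_cur,
                                  v_lin=np.array([1.0, 0.0, 0.0]),
                                  v_ang=np.array([0.0, 0.0, 3.0]), dt=1e-3)
```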
[137] MF-LPR$^2$: Multi-Frame License Plate Image Restoration and Recognition using Optical Flow
Kihyun Na, Junseok Oh, Youngkwan Cho, Bumjin Kim, Sungmin Cho, Jinyoung Choi, Injung Kim
Main category: cs.CV
TL;DR: MF-LPR^2 is a multi-frame license plate restoration framework that uses optical flow alignment and spatio-temporal consistency to enhance poor-quality dash cam images, achieving superior restoration and recognition results compared to existing methods.
Details
Motivation: License plate recognition in dash cam images is challenging due to low resolution, motion blur, and glare. Existing generative models often introduce artifacts when restoring such poor-quality images, making accurate recognition difficult.
Method: Proposed MF-LPR^2 framework that aligns and aggregates neighboring frames using optical flow estimation with error detection and correction algorithms. Created RLPR dataset with 200 pairs of low-quality sequences and high-quality ground-truth images for evaluation.
Result: Outperformed 8 restoration models in PSNR, SSIM, and LPIPS metrics. Achieved 86.44% recognition accuracy, significantly better than single-frame (14.04%) and multi-frame (82.55%) baselines. Ablation studies confirmed the effectiveness of filtering and refinement algorithms.
Conclusion: The multi-frame approach with optical flow alignment and error correction effectively addresses license plate restoration challenges in poor-quality dash cam images, improving both image quality and recognition accuracy while preserving evidential content.
Abstract: License plate recognition (LPR) is important for traffic law enforcement, crime investigation, and surveillance. However, license plate areas in dash cam images often suffer from low resolution, motion blur, and glare, which make accurate recognition challenging. Existing generative models that rely on pretrained priors cannot reliably restore such poor-quality images, frequently introducing severe artifacts and distortions. To address this issue, we propose a novel multi-frame license plate restoration and recognition framework, MF-LPR$^2$, which addresses ambiguities in poor-quality images by aligning and aggregating neighboring frames instead of relying on pretrained knowledge. To achieve accurate frame alignment, we employ a state-of-the-art optical flow estimator in conjunction with carefully designed algorithms that detect and correct erroneous optical flow estimations by leveraging the spatio-temporal consistency inherent in license plate image sequences. Our approach enhances both image quality and recognition accuracy while preserving the evidential content of the input images. In addition, we constructed a novel Realistic LPR (RLPR) dataset to evaluate MF-LPR$^2$. The RLPR dataset contains 200 pairs of low-quality license plate image sequences and high-quality pseudo ground-truth images, reflecting the complexities of real-world scenarios. In experiments, MF-LPR$^2$ outperformed eight recent restoration models in terms of PSNR, SSIM, and LPIPS by significant margins. In recognition, MF-LPR$^2$ achieved an accuracy of 86.44%, outperforming both the best single-frame LPR (14.04%) and the multi-frame LPR (82.55%) among the eleven baseline models. The results of ablation studies confirm that our filtering and refinement algorithms significantly contribute to these improvements.
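The align-and-aggregate core can be sketched as backward-warping each neighbor toward the reference frame with optical flow and averaging; this omits the paper’s erroneous-flow detection and correction and uses a placeholder flow field.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a neighbor (B, C, H, W) using pixel-space flow (B, 2, H, W)."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys)).float().to(frame.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0                  # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)

ref, nbr = torch.rand(1, 3, 64, 128), torch.rand(1, 3, 64, 128)
flow = torch.zeros(1, 2, 64, 128)        # placeholder: flow from ref to nbr
fused = 0.5 * ref + 0.5 * warp_with_flow(nbr, flow)
```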
[138] DINOv3 with Test-Time Training for Medical Image Registration
Shansong Wang, Mojtaba Safari, Mingzhe Hu, Qiang Li, Chih-Wei Chang, Richard LJ Qiu, Xiaofeng Yang
Main category: cs.CV
TL;DR: Training-free medical image registration using frozen DINOv3 encoder and test-time optimization in feature space, achieving state-of-the-art results without requiring large training datasets.
Details
Motivation: Overcome the limitation of learning-based medical image registration methods that require large amounts of training data, which hinders clinical adoption.
Method: Propose a training-free pipeline that uses a frozen DINOv3 encoder and performs test-time optimization of the deformation field directly in feature space.
Result: Achieved best mean Dice score (0.790) and lowest Hausdorff Distance (4.9±5.0) on Abdomen MR-CT, and improved mean DSC to 0.769 with reduced SDLogJ to 0.11 and HD95 to 4.8 on ACDC cardiac MRI.
Conclusion: Operating in a compact foundation feature space at test time provides a practical and general solution for clinical registration without additional training requirements.
Abstract: Prior medical image registration approaches, particularly learning-based methods, often require large amounts of training data, which constrains clinical adoption. To overcome this limitation, we propose a training-free pipeline that relies on a frozen DINOv3 encoder and test-time optimization of the deformation field in feature space. Across two representative benchmarks, the method is accurate and yields regular deformations. On Abdomen MR-CT, it attained the best mean Dice score (DSC) of 0.790 together with the lowest 95th percentile Hausdorff Distance (HD95) of 4.9±5.0 and the lowest standard deviation of Log-Jacobian (SDLogJ) of 0.08±0.02. On ACDC cardiac MRI, it improves mean DSC to 0.769 and reduces SDLogJ to 0.11 and HD95 to 4.8, a marked gain over the initial alignment. The results indicate that operating in a compact foundation feature space at test time offers a practical and general solution for clinical registration without additional training.
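A hedged sketch of the test-time step: optimize a dense displacement field so warped moving-image features match fixed-image features from a frozen encoder, with a simple smoothness penalty. The grid resolution, loss weights, and optimizer settings are illustrative, not the paper’s protocol.

```python
import torch
import torch.nn.functional as F

def register(feat_fix, feat_mov, iters=200, lr=1e-2, smooth_w=1.0):
    """feat_*: (1, C, H, W) feature maps from a frozen encoder (e.g. DINOv3)."""
    B, C, H, W = feat_fix.shape
    disp = torch.zeros(B, H, W, 2, requires_grad=True)  # displacement, grid units
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                            indexing="ij")
    identity = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (1, H, W, 2)
    opt = torch.optim.Adam([disp], lr=lr)
    for _ in range(iters):
        warped = F.grid_sample(feat_mov, identity + disp, align_corners=True)
        match = F.mse_loss(warped, feat_fix)
        smooth = disp.diff(dim=1).abs().mean() + disp.diff(dim=2).abs().mean()
        loss = match + smooth_w * smooth
        opt.zero_grad(); loss.backward(); opt.step()
    return disp.detach()
```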
[139] Tinker: Diffusion’s Gift to 3D–Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen
Main category: cs.CV
TL;DR: Tinker is a versatile 3D editing framework that enables high-fidelity, multi-view consistent edits from just 1-2 images without per-scene finetuning, using pretrained diffusion models and novel components for reference-driven editing and view synthesis.
Details
Motivation: To overcome the limitations of prior 3D editing techniques that require extensive per-scene optimization and multiple consistent input views, making 3D content creation more accessible and scalable.
Method: Repurposes pretrained diffusion models for latent 3D awareness, introduces a referring multi-view editor for precise reference-driven edits, and an any-view-to-video synthesizer leveraging spatial-temporal priors from video diffusion for scene completion and novel-view generation.
Result: Achieves state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks, significantly reducing the barrier to generalizable 3D content creation with robust multi-view consistent edits from sparse inputs.
Conclusion: Tinker represents a key step towards truly scalable, zero-shot 3D editing by eliminating per-scene training requirements and enabling high-quality edits from minimal input.
Abstract: We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker
[140] Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives
Haoyu Zhao, Jiaxi Gu, Shicong Wang, Xing Zhang, Hang Xu, Zuxuan Wu, Yu-Gang Jiang
Main category: cs.CV
TL;DR: A novel framework for video-language retrieval that uses fine-grained feature learning and an inference pipeline to improve accuracy without additional training, achieving state-of-the-art results on multiple benchmarks.
Details
Motivation: Address the challenges of high computational costs in large-scale pre-training for video retrieval and the underexplored fine-grained information in videos and texts.
Method: Uses coarse-to-fine objectives with contrastive and matching learning, Granularity-Aware Representation module for fine-grained data, and an inference pipeline with voting mechanism and Matching Entropy metric leveraging keyword repetition.
Result: Outperforms previous approaches on four benchmarks, with inference pipeline achieving 2.1% increase in Recall@1 on MSR-VTT and 1.6% increase on DiDeMo dataset.
Conclusion: The proposed framework effectively improves video-language retrieval performance through fine-grained feature learning and innovative inference techniques without requiring additional pre-training.
Abstract: The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames and words in captions. Furthermore, we observe that the repetition of keywords in the original captions, referred to as “Repetition”, can enhance retrieval performance and improve alignment between video and text. Based on this insight, we propose a novel and effective inference pipeline that incorporates a voting mechanism and a new Matching Entropy metric to achieve better retrieval performance without requiring additional pre-training. Experimental results on four benchmarks demonstrate that the proposed method outperforms previous approaches. Additionally, our inference pipeline achieves significant performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset.
[141] TransLight: Image-Guided Customized Lighting Control with Generative Decoupling
Zongming Li, Lianghui Zhu, Haocheng Shen, Longjin Ran, Wenyu Liu, Xinggang Wang
Main category: cs.CV
TL;DR: TransLight is a novel framework that enables high-fidelity transfer of complex light effects between images using generative decoupling and diffusion models.
Details
Motivation: Existing illumination-editing approaches fail to provide customized light control while preserving content integrity, especially for transferring complex light effects from reference to target images.
Method: Uses Generative Decoupling with two fine-tuned diffusion models to separate image content and light effects, creating a million-scale dataset of image-content-light triplets. Employs IC-Light as generative model with reference lighting as conditioning signal.
Result: TransLight successfully transfers light effects across disparate images with high fidelity and flexibility, delivering more customized illumination control than existing techniques.
Conclusion: The method establishes a new approach for illumination harmonization and editing by thoroughly disentangling light effects, enabling natural transfer of diverse light effects.
Abstract: Most existing illumination-editing approaches fail to simultaneously provide customized control of light effects and preserve content integrity. This makes them less effective for practical lighting stylization requirements, especially in the challenging task of transferring complex light effects from a reference image to a user-specified target image. To address this problem, we propose TransLight, a novel framework that enables high-fidelity and high-freedom transfer of light effects. Extracting the light effect from the reference image is the most critical and challenging step in our method. The difficulty lies in the complex geometric structure features embedded in light effects that are highly coupled with content in real-world scenarios. To achieve this, we first present Generative Decoupling, where two fine-tuned diffusion models are used to accurately separate image content and light effects, generating a newly curated, million-scale dataset of image-content-light triplets. Then, we employ IC-Light as the generative model and train our model with our triplets, injecting the reference lighting image as an additional conditioning signal. The resulting TransLight model enables customized and natural transfer of diverse light effects. Notably, by thoroughly disentangling light effects from reference images, our generative decoupling strategy endows TransLight with highly flexible illumination control. Experimental results establish TransLight as the first method to successfully transfer light effects across disparate images, delivering more customized illumination control than existing techniques and charting new directions for research in illumination harmonization and editing.
[142] EventSSEG: Event-driven Self-Supervised Segmentation with Probabilistic Attention
Lakshmi Annamalai, Chetan Singh Thakur
Main category: cs.CV
TL;DR: EventSSEG is a novel event-based road segmentation method that uses event-only computing with probabilistic attention and self-supervised learning to overcome the lack of labeled event data, achieving state-of-the-art performance with minimal labeled events.
Details
Motivation: Road segmentation is crucial for autonomous vehicles but challenging with frame-based cameras due to latency and compute requirements. Event cameras offer low-power sensing but face challenges in transferring pretrained weights and lack abundant labeled data.
Method: EventSSEG uses event-only computing with a probabilistic attention mechanism and employs event-based self-supervised learning to eliminate the need for extensive labeled data.
Result: Experiments on DSEC-Semantic and DDD17 datasets show EventSSEG achieves state-of-the-art performance with minimal labeled events.
Conclusion: This approach successfully maximizes event cameras’ capabilities while addressing the critical challenge of limited labeled event data for road segmentation tasks.
Abstract: Road segmentation is pivotal for autonomous vehicles, yet achieving low-latency, low-compute solutions using frame-based cameras remains a challenge. Event cameras offer a promising alternative. To leverage their low-power sensing, we introduce EventSSEG, a method for road segmentation that uses event-only computing and a probabilistic attention mechanism. Event-only computing poses a challenge in transferring pretrained weights from the conventional camera domain, requiring abundant labeled data, which is scarce. To overcome this, EventSSEG employs event-based self-supervised learning, eliminating the need for extensive labeled data. Experiments on DSEC-Semantic and DDD17 show that EventSSEG achieves state-of-the-art performance with minimal labeled events. This approach maximizes event cameras’ capabilities and addresses the lack of labeled events.
[143] Lifespan Pancreas Morphology for Control vs Type 2 Diabetes using AI on Largescale Clinical Imaging
Lucas W. Remedios, Chloe Cho, Trent M. Schwartz, Dingjie Su, Gaurav Rudravaram, Chenyu Gao, Aravind R. Krishnan, Adam M. Saunders, Michael E. Kim, Shunxing Bao, Thomas A. Lasko, Alvin C. Powers, Bennett A. Landman, John Virostko
Main category: cs.CV
TL;DR: Study analyzes pancreas size/shape changes across lifespan (0-90 years) using AI on CT/MRI scans, finds significant morphological differences in type 2 diabetes patients compared to controls, and establishes normative aging trends.
Details
Motivation: Understanding pancreatic changes is crucial for detecting deviations in type 2 diabetes and other pancreatic diseases. Need to establish normative morphological aging trends and reliable clinical imaging methods for AI-based pancreas measurement.
Method: Analyzed 2533 patients’ abdominal CT/MRI scans, resampled to 3mm resolution, used automated segmentation to extract 13 morphological pancreas features. Compared CT vs MRI measurements, characterized normative patterns by age/sex, used GAMLSS regression on 1350 age/sex-matched patients (675 controls, 675 diabetes) to identify diabetes-related deviations.
Result: 10 of 13 morphological features showed significantly different aging trends between diabetic and non-diabetic patients after adjusting for confounders (p<0.05). MRI yielded different measurements than CT. Pancreas was smaller in type 2 diabetes.
Conclusion: Provides lifespan trends showing altered pancreas size/shape in type 2 diabetes, reinforces that pancreas is smaller in diabetes, and contributes a large reference dataset of normative pancreas morphology from non-diabetic controls in clinical setting.
Abstract: Purpose: Understanding how the pancreas changes is critical for detecting deviations in type 2 diabetes and other pancreatic disease. We measure pancreas size and shape using morphological measurements from ages 0 to 90. Our goals are to 1) identify reliable clinical imaging modalities for AI-based pancreas measurement, 2) establish normative morphological aging trends, and 3) detect potential deviations in type 2 diabetes. Approach: We analyzed a clinically acquired dataset of 2533 patients imaged with abdominal CT or MRI. We resampled the scans to 3mm isotropic resolution, segmented the pancreas using automated methods, and extracted 13 morphological pancreas features across the lifespan. First, we assessed CT and MRI measurements to determine which modalities provide consistent lifespan trends. Second, we characterized distributions of normative morphological patterns stratified by age group and sex. Third, we used GAMLSS regression to model pancreas morphology trends in 1350 patients matched for age, sex, and type 2 diabetes status to identify any deviations from normative aging associated with type 2 diabetes. Results: When adjusting for confounders, the aging trends for 10 of 13 morphological features were significantly different between patients with type 2 diabetes and non-diabetic controls (p < 0.05 after multiple comparisons corrections). Additionally, MRI appeared to yield different pancreas measurements than CT using our AI-based method. Conclusions: We provide lifespan trends demonstrating that the size and shape of the pancreas is altered in type 2 diabetes using 675 control patients and 675 diabetes patients. Moreover, our findings reinforce that the pancreas is smaller in type 2 diabetes. Additionally, we contribute a reference of lifespan pancreas morphology from a large cohort of non-diabetic control patients in a clinical setting.
[144] MS-CLR: Multi-Skeleton Contrastive Learning for Human Action Recognition
Mert Kiray, Alvaro Ritter, Nassir Navab, Benjamin Busam
Main category: cs.CV
TL;DR: MS-CLR is a self-supervised contrastive learning framework that aligns pose representations across multiple skeleton conventions from the same sequence to learn more generalizable features for skeleton-based action recognition.
Details
Motivation: Existing contrastive learning methods rely on single skeleton conventions, limiting generalization across datasets with diverse joint structures and anatomical coverage.
Method: Propose Multi-Skeleton Contrastive Learning (MS-CLR) that aligns representations across multiple skeleton conventions from the same sequence. Adapt ST-GCN architecture with unified representation scheme to handle varying joint layouts and scales.
Result: MS-CLR consistently improves performance over single-skeleton contrastive learning baselines on NTU RGB+D 60 and 120 datasets. Multi-skeleton ensemble further boosts performance, achieving new state-of-the-art results.
Conclusion: Aligning representations across multiple skeleton conventions enables learning of structural invariances and diverse anatomical cues, resulting in more expressive and generalizable features for skeleton-based action recognition.
Abstract: Contrastive learning has gained significant attention in skeleton-based action recognition for its ability to learn robust representations from unlabeled data. However, existing methods rely on a single skeleton convention, which limits their ability to generalize across datasets with diverse joint structures and anatomical coverage. We propose Multi-Skeleton Contrastive Learning (MS-CLR), a general self-supervised framework that aligns pose representations across multiple skeleton conventions extracted from the same sequence. This encourages the model to learn structural invariances and capture diverse anatomical cues, resulting in more expressive and generalizable features. To support this, we adapt the ST-GCN architecture to handle skeletons with varying joint layouts and scales through a unified representation scheme. Experiments on the NTU RGB+D 60 and 120 datasets demonstrate that MS-CLR consistently improves performance over strong single-skeleton contrastive learning baselines. A multi-skeleton ensemble further boosts performance, setting new state-of-the-art results on both datasets.
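The cross-convention alignment reduces to a standard InfoNCE objective between embeddings of the same sequence rendered in two skeleton conventions; a minimal sketch with an illustrative temperature, not the authors’ exact loss.

```python
import torch
import torch.nn.functional as F

def multi_skeleton_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.07):
    """z_a, z_b: (B, D) embeddings of the same B sequences under two
    skeleton conventions; matched rows are positives, the rest negatives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / tau
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

loss = multi_skeleton_nce(torch.randn(8, 256), torch.randn(8, 256))
```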
[145] GaussianArt: Unified Modeling of Geometry and Motion for Articulated Objects
Licheng Shen, Saining Zhang, Honghan Li, Peilin Yang, Zihao Huang, Zongzheng Zhang, Hao Zhao
Main category: cs.CV
TL;DR: A unified method using articulated 3D Gaussians that jointly models geometry and motion for reconstructing complex articulated objects with up to 20 parts, outperforming previous approaches.
Details
Motivation: Prior methods decouple geometry and motion reconstruction, complicating pipelines and limiting scalability for objects with complex multi-part articulation.
Method: Introduces a unified representation using articulated 3D Gaussians that jointly models geometry and motion, improving robustness in motion decomposition.
Result: Achieves superior accuracy in part-level geometry reconstruction and motion estimation across diverse object types, significantly outperforming methods that struggle beyond 2-3 parts.
Conclusion: The unified articulated representation shows strong potential for scalable physical modeling and downstream applications like robotic simulation and human-scene interaction.
Abstract: Reconstructing articulated objects is essential for building digital twins of interactive environments. However, prior methods typically decouple geometry and motion by first reconstructing object shape in distinct states and then estimating articulation through post-hoc alignment. This separation complicates the reconstruction pipeline and restricts scalability, especially for objects with complex, multi-part articulation. We introduce a unified representation that jointly models geometry and motion using articulated 3D Gaussians. This formulation improves robustness in motion decomposition and supports articulated objects with up to 20 parts, significantly outperforming prior approaches that often struggle beyond 2–3 parts due to brittle initialization. To systematically assess scalability and generalization, we propose MPArt-90, a new benchmark consisting of 90 articulated objects across 20 categories, each with diverse part counts and motion configurations. Extensive experiments show that our method consistently achieves superior accuracy in part-level geometry reconstruction and motion estimation across a broad range of object types. We further demonstrate applicability to downstream tasks such as robotic simulation and human-scene interaction modeling, highlighting the potential of unified articulated representations in scalable physical modeling.
[146] Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration
Yifan Zhang, Junhui Hou, Siyu Ren, Jinjian Wu, Yixuan Yuan, Guangming Shi
Main category: cs.CV
TL;DR: NCLR is a self-supervised learning framework for 3D perception in autonomous driving that uses 2D-3D neural calibration to align camera and LiDAR coordinate systems through learnable transformation alignment, overlapping area identification, and dense correspondence establishment.
Details
Motivation: To enhance 3D perception in autonomous driving by bridging the domain gap between image and point cloud data through self-supervised learning of LiDAR-to-camera extrinsic parameters.
Method: Proposes learnable transformation alignment to unify image and point cloud features, identifies overlapping areas between modalities, and establishes dense 2D-3D correspondences to estimate rigid pose alignment.
Result: Superior performance over existing self-supervised methods on downstream tasks including 3D semantic segmentation, object detection, and panoptic segmentation across various datasets.
Conclusion: Joint learning from different modalities significantly enhances network understanding and representation effectiveness, demonstrating the value of 2D-3D neural calibration for autonomous driving perception.
Abstract: This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes. Specifically, our approach, namely NCLR, focuses on 2D-3D neural calibration, a novel pretext task that estimates the rigid pose aligning camera and LiDAR coordinate systems. First, we propose the learnable transformation alignment to bridge the domain gap between image and point cloud data, converting features into a unified representation space for effective comparison and matching. Second, we identify the overlapping area between the image and point cloud with the fused features. Third, we establish dense 2D-3D correspondences to estimate the rigid pose. The framework not only learns fine-grained matching from points to pixels but also achieves alignment of the image and point cloud at a holistic level, understanding the LiDAR-to-camera extrinsic parameters. We demonstrate the efficacy of NCLR by applying the pre-trained backbone to downstream tasks, such as LiDAR-based 3D semantic segmentation, object detection, and panoptic segmentation. Comprehensive experiments on various datasets illustrate the superiority of NCLR over existing self-supervised methods. The results confirm that joint learning from different modalities significantly enhances the network’s understanding abilities and effectiveness of learned representation. The code is publicly available at https://github.com/Eaphan/NCLR.
[147] ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model
Weitai Kang, Weiming Zhuang, Zhizhong Li, Yan Yan, Lingjuan Lyu
Main category: cs.CV
TL;DR: Comprehensive analysis of design choices for visual grounding in MLLMs using LLaVA-1.5, achieving significant performance improvements on RefCOCO benchmarks.
Details
Motivation: Existing MLLM approaches for visual grounding use disparate design choices without systematic verification, creating a need for comprehensive analysis to identify optimal configurations.
Method: Systematic study of various design choices impacting VG performance using LLaVA-1.5, covering different visual grounding paradigms and ablation studies on grounding data design.
Result: Achieved improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g benchmarks over the baseline LLaVA-1.5 model.
Conclusion: The findings provide validated design choices that significantly enhance MLLM performance for visual grounding tasks, with results that are broadly applicable to other architectures.
Abstract: Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs’ fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over the LLaVA-1.5 baseline.
[148] RNDiff: Rainfall nowcasting with Condition Diffusion Model
Xudong Ling, Chaorong Li, Fengqing Qin, Peng Yang, Yuanyuan Huang
Main category: cs.CV
TL;DR: SRNDiff applies diffusion models to precipitation forecasting, achieving better accuracy than GANs through conditional generation with historical data.
Details
Motivation: Diffusion models generate higher quality samples than GANs and VAEs, making them suitable for improving precipitation nowcasting accuracy.
Method: Uses conditional diffusion model with additional decoder module, composed of denoising network and conditional encoder with multiple UNet networks for multi-resolution feature extraction.
Result: Surpasses GANs in prediction accuracy, generates higher quality precipitation samples, and shows better training stability despite higher computational requirements.
Conclusion: Diffusion models demonstrate advantages for precipitation forecasting, providing new insights for rainfall prediction enhancement.
Abstract: Diffusion models are widely used in image generation because they can generate high-quality and realistic samples, in contrast to generative adversarial networks (GANs) and variational autoencoders (VAEs), which have some limitations in terms of image quality. We introduce the diffusion model to the precipitation forecasting task and propose a short-term precipitation nowcasting method with a conditional diffusion model based on historical observational data, referred to as SRNDiff. By incorporating an additional conditional decoder module in the denoising process, SRNDiff achieves end-to-end conditional rainfall prediction. SRNDiff is composed of two networks: a denoising network and a conditional encoder network. The conditional network is composed of multiple independent UNet networks, which extract conditional feature maps at different resolutions, providing accurate conditional information that guides the diffusion model for conditional generation. SRNDiff surpasses GANs in terms of prediction accuracy, although it requires more computational resources. The SRNDiff model exhibits higher stability and efficiency during training than GAN-based approaches, and generates high-quality precipitation distribution samples that better reflect future actual precipitation conditions. This fully validates the advantages and potential of diffusion models in precipitation forecasting, providing new insights for enhancing rainfall prediction.
[149] Consistent and Optimal Solution to Camera Motion Estimation
Guangyang Zeng, Qingcheng Zeng, Xinghan Li, Biqiang Mu, Jiming Chen, Ling Shi, Junfeng Wu
Main category: cs.CV
TL;DR: A two-step maximum likelihood estimator for camera motion from 2D point correspondences that achieves consistency and asymptotic efficiency with linear time complexity.
Details
Motivation: Existing methods based on epipolar constraint and essential matrix estimation are not optimal in maximum likelihood sense for camera motion inference from 2D point correspondences.
Method: Two-step algorithm: 1) estimate noise variance with bias elimination for consistent estimation, 2) one-step Gauss-Newton iteration on manifold for refinement. Proves consistency and asymptotic efficiency properties.
Result: The estimator converges to ground truth as point number increases (consistency) and achieves theoretical lower bound for mean squared error (Cramer-Rao bound). Has linear time complexity and outperforms state-of-the-art methods with hundreds of points.
Conclusion: The proposed ML estimator provides superior accuracy and computational efficiency for dense point correspondence scenarios, making it advantageous for practical computer vision applications requiring camera motion estimation.
Abstract: Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental issue in the computer vision community. The existing works generally set out from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we dive into the original measurement model with respect to the rotation matrix and normalized translation vector and formulate the ML problem. We then propose a two-step algorithm to solve it: In the first step, we estimate the variance of the measurement noise and devise a consistent estimator based on bias elimination; In the second step, we execute a one-step Gauss-Newton iteration on the manifold to refine the consistent estimate. We prove that the proposed estimate has the same asymptotic statistical properties as the ML estimate: The first is consistency, i.e., the estimate converges to the ground truth as the point number increases; The second is asymptotic efficiency, i.e., the mean squared error of the estimate converges to the theoretical lower bound, the Cramér-Rao bound. In addition, we show that our algorithm has linear time complexity. These appealing characteristics endow our estimator with a great advantage in the case of dense point correspondences. Experiments on both synthetic data and real images demonstrate that when the point number reaches the order of hundreds, our estimator outperforms the state-of-the-art ones in terms of estimation accuracy and CPU time.
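In generic notation (a sketch, not the paper’s exact formulation), the second step applies a single Gauss-Newton update to the consistent estimate $\tilde{\theta}$ and retracts it back onto the manifold of rotations and unit translation vectors:

```latex
\hat{\theta} \;=\; \tilde{\theta} \boxplus \delta,
\qquad
\delta \;=\; -\bigl(J^{\top} J\bigr)^{-1} J^{\top} r(\tilde{\theta})
```

where $r(\theta)$ stacks the measurement residuals, $J$ is its Jacobian at $\tilde{\theta}$, and $\boxplus$ denotes the retraction onto $SO(3) \times S^{2}$. Because $\tilde{\theta}$ is already consistent, one such step suffices to attain the asymptotic efficiency of the ML estimate.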
[150] MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection
Chenqi Kong, Anwei Luo, Peijun Bao, Yi Yu, Haoliang Li, Zengwei Zheng, Shiqi Wang, Alex C. Kot
Main category: cs.CV
TL;DR: MoE-FFD is a parameter-efficient ViT-based face forgery detection method that combines transformer expressivity with CNN local priors, using LoRA and Adapter layers for efficient training while achieving state-of-the-art performance.
Details
Motivation: Address limitations of current ViT-based face forgery detectors: high computational costs from full fine-tuning, inability to capture local forgery clues leading to model bias, and limited generalizability from focusing on only one or few forgery features.
Method: Proposes Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD) that keeps ViT backbone frozen while updating lightweight LoRA and Adapter layers. Combines transformer global expressivity with CNN local priors, uses novel MoE modules to scale capacity and select optimal forgery experts.
Result: Extensive experiments show state-of-the-art face forgery detection performance with significantly reduced parameter overhead. The method can be seamlessly adapted to various transformer backbones in plug-and-play manner.
Conclusion: MoE-FFD provides an effective and parameter-efficient solution for face forgery detection that addresses key limitations of existing approaches while maintaining high detection performance and generalizability.
Abstract: Deepfakes have recently raised significant trust issues and security concerns among the public. Compared to CNN face forgery detectors, ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance. However, these approaches still exhibit the following limitations: (1) Fully fine-tuning ViT-based models from ImageNet weights demands substantial computational and storage resources; (2) ViT-based methods struggle to capture local forgery clues, leading to model bias; (3) These methods limit their scope on only one or few face forgery features, resulting in limited generalizability. To tackle these challenges, this work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach. MoE-FFD only updates lightweight Low-Rank Adaptation (LoRA) and Adapter layers while keeping the ViT backbone frozen, thereby achieving parameter-efficient training. Moreover, MoE-FFD leverages the expressivity of transformers and local priors of CNNs to simultaneously extract global and local forgery clues. Additionally, novel MoE modules are designed to scale the model’s capacity and smartly select optimal forgery experts, further enhancing forgery detection performance. Our proposed learning scheme can be seamlessly adapted to various transformer backbones in a plug-and-play manner. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art face forgery detection performance with significantly reduced parameter overhead. The code is released at: https://github.com/LoveSiameseCat/MoE-FFD.
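The parameter-efficient core, a frozen backbone with trainable low-rank updates, follows the standard LoRA formulation; the rank, scaling, and how these layers feed the MoE experts are illustrative assumptions here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with the base W frozen (standard LoRA)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # ViT weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))             # e.g. a ViT projection layer
out = layer(torch.randn(2, 197, 768))               # (batch, tokens, dim)
```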
[151] What Makes for Good Image Captions?
Delong Chen, Samuel Cahyawijaya, Etsuko Ishii, Ho Shu Chan, Yejin Bang, Pascale Fung
Main category: cs.CV
TL;DR: A formal information-theoretic framework for image captioning that conceptualizes captions as compressed linguistic representations, with a method called Pyramid of Captions (PoCa) that integrates local and global visual information.
Details
Motivation: To establish a theoretical foundation for evaluating and optimizing image captioning systems by defining what constitutes good captions - informationally sufficient, minimally redundant, and human-comprehensible.
Method: Developed an information-theoretic framework with quantitative measures for caption quality, and introduced Pyramid of Captions (PoCa) method that integrates local and global visual information to generate enriched captions.
Result: Provided theoretical proof that PoCa improves caption quality under certain assumptions, and demonstrated empirical validation across various image captioning models and datasets.
Conclusion: The framework offers a flexible foundation for analyzing and optimizing image captioning systems, with PoCa method showing both theoretical and empirical effectiveness in generating high-quality captions.
Abstract: This paper establishes a formal information-theoretic framework for image captioning, conceptualizing captions as compressed linguistic representations that selectively encode semantic units in images. Our framework posits that good image captions should balance three key aspects: informationally sufficient, minimally redundant, and readily comprehensible by humans. By formulating these aspects as quantitative measures with adjustable weights, our framework provides a flexible foundation for analyzing and optimizing image captioning systems across diverse task requirements. To demonstrate its applicability, we introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information. We present both theoretical proof that PoCa improves caption quality under certain assumptions, and empirical validation of its effectiveness across various image captioning models and datasets.
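One plausible instantiation of the weighted three-way trade-off (illustrative notation only; the paper’s exact measures may differ):

```latex
Q(c \mid I) \;=\;
\alpha \, \underbrace{I(c;\, I)}_{\text{sufficiency}}
\;-\; \beta \, \underbrace{\mathrm{Red}(c)}_{\text{redundancy}}
\;-\; \gamma \, \underbrace{\mathrm{Cost}(c)}_{\text{comprehension effort}}
```

with adjustable weights $\alpha, \beta, \gamma \ge 0$ tuned to the task at hand.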
[152] RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
Tianyi Niu, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Main category: cs.CV
TL;DR: MLLMs struggle to accurately identify image rotations, particularly distinguishing between 90° and 270° rotations, despite being able to recognize right-side-up and upside-down images to some extent.
Details
Motivation: To evaluate the spatial reasoning capabilities of Multimodal Large Language Models in detecting image orientation, which requires robust visual reasoning to understand rotational cues and spatial relationships.
Method: Created RotBench - a 350-image benchmark with lifestyle, portrait, and landscape images. Tested various MLLMs including GPT-5, o3, and Gemini-2.5-Pro on identifying 0°, 90°, 180°, and 270° rotations. Used auxiliary information (captions, depth maps), chain-of-thought prompting, simultaneous multi-orientation presentation, and fine-tuning approaches.
Result: Most models can identify 0° rotations reliably, some can identify 180° rotations, but none can reliably distinguish between 90° and 270° rotations. Auxiliary information and prompting provided only small improvements. Fine-tuning improved 180° identification but not 90°/270° distinction.
Conclusion: There is a significant gap between MLLMs’ spatial reasoning capabilities and human perception in identifying image rotation, particularly for distinguishing between 90° and 270° orientations.
Abstract: We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench – a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information – including captions, depth maps, and more – or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models’ ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs’ spatial reasoning capabilities and human perception in identifying rotation.
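A hypothetical evaluation harness in the spirit of RotBench: rotate each image by the four canonical angles and score the model’s answer. `query_model` stands in for any MLLM API and is not a real function.

```python
from PIL import Image

ANGLES = [0, 90, 180, 270]
PROMPT = ("By how many degrees is this image rotated counter-clockwise? "
          "Answer 0, 90, 180, or 270.")

def rotation_accuracy(image_paths, query_model):
    correct, total = 0, 0
    for path in image_paths:
        img = Image.open(path)
        for angle in ANGLES:
            rotated = img.rotate(angle, expand=True)  # PIL rotates counter-clockwise
            pred = query_model(rotated, PROMPT)       # hypothetical MLLM call
            correct += int(pred == angle)
            total += 1
    return correct / total
```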
[153] FlightPatchNet: Multi-Scale Patch Network with Differential Coding for Flight Trajectory Prediction
Lan Wu, Xuebin Wang, Ruijuan Chu, Guangyi Liu, Jing Zhang, Linyu Wang
Main category: cs.CV
TL;DR: FlightPatchNet: A multi-scale patch network with differential coding that improves flight trajectory prediction by handling data range differences and capturing complex temporal dependencies across multiple time scales.
Details
Motivation: Existing flight trajectory prediction methods suffer from accuracy issues due to significant data range differences and fail to capture the underlying complex temporal dependencies in real-world flight trajectories across multiple time scales.
Method: Uses differential coding to encode longitude/latitude into first-order differences, employs global temporal attention for time step dependencies, and designs a multi-scale patch network with patch mixer blocks to capture inter- and intra-patch dependencies across different time scales.
Result: Extensive experiments on ADS-B datasets demonstrate that FlightPatchNet outperforms competitive baselines in flight trajectory prediction accuracy.
Conclusion: The proposed FlightPatchNet effectively addresses data range issues and captures complex temporal patterns, providing superior performance for multi-step flight trajectory prediction in air traffic control applications.
Abstract: Accurate multi-step flight trajectory prediction plays an important role in Air Traffic Control, which can ensure the safety of air transportation. Two main issues limit the flight trajectory prediction performance of existing works. The first issue is the negative impact on prediction accuracy caused by the significant differences in data range. The second issue is that real-world flight trajectories involve underlying temporal dependencies, and most existing methods fail to reveal the hidden complex temporal variations and extract features from one single time scale. To address the above issues, we propose FlightPatchNet, a multi-scale patch network with differential coding for flight trajectory prediction. Specifically, FlightPatchNet first utilizes differential coding to encode the original values of longitude and latitude into first-order differences and generates embeddings for all variables at each time step. Then, global temporal attention is introduced to explore the dependencies between different time steps. To fully explore the diverse temporal patterns in flight trajectories, a multi-scale patch network is carefully designed to serve as the backbone. The multi-scale patch network exploits stacked patch mixer blocks to capture inter- and intra-patch dependencies under different time scales, and further integrates multi-scale temporal features across different scales and variables. Finally, FlightPatchNet ensembles multiple predictors to make direct multi-step predictions. Extensive experiments on ADS-B datasets demonstrate that our model outperforms the competitive baselines.
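Differential coding itself is a one-liner: model first-order differences of longitude/latitude rather than raw values, which removes the large offsets between variable ranges. A small sketch with placeholder coordinates:

```python
import numpy as np

traj = np.array([[103.994, 1.359],
                 [104.012, 1.372],
                 [104.031, 1.386]])          # (T, 2) raw longitude/latitude
deltas = np.diff(traj, axis=0)               # (T-1, 2) first-order differences
# the original trajectory is recoverable from the start point + cumulative sum
recon = traj[0] + np.vstack([np.zeros(2), np.cumsum(deltas, axis=0)])
assert np.allclose(recon, traj)
```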
[154] Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models
Jie Ren, Kangrui Chen, Yingqian Cui, Shenglai Zeng, Hui Liu, Yue Xing, Jiliang Tang, Lingjuan Lyu
Main category: cs.CV
TL;DR: A new benchmark called Six-CD is proposed to evaluate concept removal methods in text-to-image diffusion models, addressing gaps in existing research through comprehensive dataset and novel evaluation metrics.
Details
Motivation: Text-to-image diffusion models can be exploited for malicious purposes like generating violent or inappropriate content, and existing concept removal methods lack proper evaluation benchmarks.Method: Introduces Six-CD dataset and novel evaluation metrics to benchmark concept removal methods, conducting thorough evaluation of their effectiveness.
Result: The benchmark provides comprehensive evaluation of concept removal techniques, offering valuable experimental observations and insights for the field.
Conclusion: The proposed Six-CD benchmark addresses critical gaps in evaluating concept removal methods and provides valuable insights for improving safety in text-to-image diffusion models.
Abstract: Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. However, the advancement of T2I diffusion models presents significant risks, as the models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. To mitigate these risks, concept removal methods have been proposed. These methods aim to modify diffusion models to prevent the generation of malicious and unwanted concepts. Despite these efforts, existing research faces several challenges: (1) a lack of consistent comparisons on a comprehensive dataset, (2) ineffective prompts in harmful and nudity concepts, (3) overlooked evaluation of the ability to generate the benign part within prompts containing malicious concepts. To address these gaps, we propose to benchmark the concept removal methods by introducing a new dataset, Six-CD, along with a novel evaluation metric. In this benchmark, we conduct a thorough evaluation of concept removals, with the experimental observations and discussions offering valuable insights in the field.
[155] MMAD: Multi-label Micro-Action Detection in Videos
Kun Li, Pengyu Liu, Dan Guo, Fei Wang, Zhiliang Wu, Hehe Fan, Meng Wang
Main category: cs.CV
TL;DR: This paper introduces Multi-label Micro-Action Detection (MMAD) - a new task for detecting multiple overlapping micro-actions in videos, along with a new dataset MMA-52 and a baseline method with dual-path spatial-temporal adapter.
Details
Motivation: Current research focuses on recognizing individual micro-actions but overlooks their co-occurring nature in real-world scenarios where multiple micro-actions often overlap temporally, such as concurrent head and hand movements.Method: Proposed a baseline method equipped with a dual-path spatial-temporal adapter to address the challenges of subtle visual changes in micro-action detection. Also introduced the MMA-52 dataset to facilitate the MMAD task.
Result: The paper presents the MMA-52 dataset and a baseline approach for multi-label micro-action detection, enabling identification of all micro-actions in short videos with start/end times and categorization.
Conclusion: The MMA-52 dataset is expected to stimulate research on micro-action analysis and prompt development of spatio-temporal modeling in human-centric video understanding, addressing the gap in detecting overlapping micro-actions.
Abstract: Human body actions are an important form of non-verbal communication in social interactions. This paper specifically focuses on a subset of body actions known as micro-actions, which are subtle, low-intensity body movements with promising applications in human emotion analysis. In real-world scenarios, human micro-actions often temporally co-occur, with multiple micro-actions overlapping in time, such as concurrent head and hand movements. However, current research primarily focuses on recognizing individual micro-actions while overlooking their co-occurring nature. To address this gap, we propose a new task named Multi-label Micro-Action Detection (MMAD), which involves identifying all micro-actions in a given short video, determining their start and end times, and categorizing them. Accomplishing this requires a model capable of accurately capturing both long-term and short-term action relationships to detect multiple overlapping micro-actions. To facilitate the MMAD task, we introduce a new dataset named Multi-label Micro-Action-52 (MMA-52) and propose a baseline method equipped with a dual-path spatial-temporal adapter to address the challenges of subtle visual change in MMAD. We hope that MMA-52 can stimulate research on micro-action analysis in videos and prompt the development of spatio-temporal modeling in human-centric video understanding. The proposed MMA-52 dataset is available at: https://github.com/VUT-HFUT/Micro-Action.
[156] VisioPhysioENet: Visual Physiological Engagement Detection Network
Alakhsimar Singh, Kanav Goyal, Nischay Verma, Puneet Kumar, Xiaobai Li, Amritpal Singh
Main category: cs.CV
TL;DR: VisioPhysioENet is a multimodal system combining visual and physiological signals for learner engagement detection, achieving 63.09% accuracy on DAiSEE dataset with 8.6% improvement over previous multimodal approaches.
Details
Motivation: To develop a more accurate learner engagement detection system by leveraging both visual cues (facial features) and physiological signals (cardiovascular activity) rather than relying on single modalities.Method: Two-level feature extraction: visual features using Dlib for facial landmarks and OpenCV for additional estimations; physiological signals extracted using plane-orthogonal-to-skin method for cardiovascular activity. Features integrated with advanced machine learning classifiers.
Result: Achieved 63.09% accuracy on DAiSEE dataset, outperforming existing multimodal methods by 8.6% and demonstrating superior performance in identifying different engagement levels.
Conclusion: The multimodal approach combining visual and physiological signals significantly improves engagement detection accuracy compared to single-modality or existing multimodal methods, showing promise for practical educational applications.
Abstract: This paper presents VisioPhysioENet, a novel multimodal system that leverages visual and physiological signals to detect learner engagement. It employs a two-level approach for extracting both visual and physiological features. For visual feature extraction, Dlib is used to detect facial landmarks, while OpenCV provides additional estimations. The face recognition library, built on Dlib, is used to identify the facial region of interest specifically for physiological signal extraction. Physiological signals are then extracted using the plane-orthogonal-to-skin method to assess cardiovascular activity. These features are integrated using advanced machine learning classifiers, enhancing the detection of various levels of engagement. We thoroughly tested VisioPhysioENet on the DAiSEE dataset, where it achieved an accuracy of 63.09%, identifying different levels of engagement more reliably than many existing methods and outperforming the only other model that uses both physiological and visual features by 8.6%.
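The plane-orthogonal-to-skin (POS) method referenced here is a well-known rPPG algorithm: temporally normalized RGB traces are projected onto the plane orthogonal to the skin-tone direction, and the two projections are combined and overlap-added. The sketch below follows that standard formulation from the rPPG literature; it is not necessarily the paper's exact implementation, and the window length and epsilon guard are illustrative choices.

```python
import numpy as np

def pos_pulse_signal(rgb: np.ndarray, fps: float, win_sec: float = 1.6) -> np.ndarray:
    """Plane-Orthogonal-to-Skin (POS) pulse extraction from mean RGB traces.

    rgb: (T, 3) array of spatially averaged R, G, B values from the
         facial region of interest, one row per frame.
    Returns a length-T pulse signal built by overlap-adding windowed
    projections onto the plane orthogonal to the skin-tone axis.
    """
    T = rgb.shape[0]
    w = int(win_sec * fps)                      # window length in frames
    P = np.array([[0, 1, -1], [-2, 1, 1]])      # POS projection matrix
    h = np.zeros(T)
    for t in range(T - w + 1):
        block = rgb[t:t + w]                    # (w, 3)
        cn = block / block.mean(axis=0)         # temporal normalization
        s = cn @ P.T                            # (w, 2) projected signals
        # Alpha-tune the two projections so motion distortions cancel
        p = s[:, 0] + (s[:, 0].std() / (s[:, 1].std() + 1e-9)) * s[:, 1]
        h[t:t + w] += p - p.mean()              # overlap-add
    return h
```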
[157] Dark Miner: Defend against undesirable generation for text-to-image diffusion models
Zheling Meng, Bo Peng, Xiaochuan Jin, Yue Jiang, Wei Wang, Jing Dong, Tieniu Tan
Main category: cs.CV
TL;DR: Dark Miner is a three-stage method that effectively erases unwanted concepts from text-to-image diffusion models by mining problematic embeddings and reducing their generation probabilities, achieving better performance than previous methods especially against adversarial attacks.
Details
Motivation: Text-to-image diffusion models often generate undesired content due to unfiltered training data, and existing methods fail to guarantee erasure for unseen or adversarial texts that weren't in training.Method: A recurring three-stage process comprising mining (finding embeddings with maximum generation probabilities of target concepts), verifying, and circumventing to greedily reduce unwanted generation.
Result: Achieves better erasure and defense results compared to previous methods, especially under multiple adversarial attacks, while preserving the model’s native generation capabilities.
Conclusion: Dark Miner effectively minimizes total probabilities of undesired generation in diffusion models and provides robust defense against adversarial attacks while maintaining model functionality.
Abstract: Text-to-image diffusion models have been shown to produce undesired content due to unfiltered large-scale training data, such as sexual images and copyrighted material, necessitating the erasure of undesired concepts. Most existing methods focus on modifying the generation probabilities conditioned on the texts containing target concepts. However, they fail to guarantee the desired generation of texts unseen in the training phase, especially for the adversarial texts from malicious attacks. In this paper, we analyze the erasure task and point out that existing methods cannot guarantee the minimization of the total probabilities of undesired generation. To tackle this problem, we propose Dark Miner. It entails a recurring three-stage process that comprises mining, verifying, and circumventing. This method greedily mines embeddings with maximum generation probabilities of target concepts and more effectively reduces their generation. In the experiments, we evaluate its performance on the inappropriateness, object, and style concepts. Compared with the previous methods, our method achieves better erasure and defense results, especially under multiple adversarial attacks, while preserving the native generation capability of the models. Our code will be available on GitHub.
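The recurring mine-verify-circumvent loop can be expressed as a control-flow skeleton. The sketch below is purely structural: all three stage callables are placeholders standing in for the paper's actual objectives, which are not reproduced here.

```python
def dark_miner_loop(mine, verify, circumvent, model, concept, max_rounds=10):
    """Schematic of a recurring three-stage concept-erasure process.

    mine(model, concept)         -> embedding with (approximately) maximal
                                    generation probability for the concept
    verify(model, embedding)     -> True if the concept is still generated
    circumvent(model, embedding) -> model updated to suppress that embedding
    """
    for _ in range(max_rounds):
        emb = mine(model, concept)      # stage 1: mine a worst-case embedding
        if not verify(model, emb):      # stage 2: check it still triggers
            break                       # nothing left to erase
        model = circumvent(model, emb)  # stage 3: reduce its probability
    return model
```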
[158] Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance
Haojie Zhang, Zhihao Liang, Ruibo Fu, Bingyan Liu, Zhengqi Wen, Xuefei Liu, Jianhua Tao, Yaling Liang
Main category: cs.CV
TL;DR: LetsTalk is a diffusion transformer framework that addresses long-duration talking video synthesis challenges through multimodal guidance and a novel memory bank mechanism, achieving high-quality, consistent results with improved efficiency.
Details
Motivation: Long-duration talking video synthesis faces persistent challenges including visual degradation, loss of identity consistency, temporal incoherence, and error accumulation as video length increases, which severely impact realism and reliability.Method: LetsTalk incorporates multimodal guidance and a memory bank mechanism with noise-regularized training to mitigate error accumulation. It uses a deep compression autoencoder and spatiotemporal-aware transformer with linear attention for efficient multimodal fusion. The framework employs deep symbiotic fusion for portrait features and shallow direct fusion for audio synchronization.
Result: Extensive experiments demonstrate state-of-the-art generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, while maintaining remarkable efficiency with 8x fewer parameters than previous approaches.
Conclusion: LetsTalk successfully addresses the key challenges in long-duration talking video synthesis by maintaining contextual continuity through its memory bank mechanism and efficient multimodal fusion strategies, achieving robust, high-quality generation with significantly improved computational efficiency.
Abstract: Long-duration talking video synthesis faces persistent challenges in simultaneously achieving high video quality, portrait and temporal consistency, and computational efficiency. As video length increases, issues such as visual degradation, loss of identity consistency, temporal incoherence, and error accumulation become increasingly prominent, severely impacting the realism and reliability of generated results. To address these issues, we present LetsTalk, a diffusion transformer framework that incorporates multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient long-duration talking video generation. Specifically, LetsTalk introduces a memory bank combined with a noise-regularized training strategy to mitigate error accumulation and sampling artifacts during long video generation. To further enhance efficiency and spatiotemporal consistency, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. Furthermore, we systematically analyze three multimodal fusion schemes, adopting deep (Symbiotic Fusion) for portrait features to ensure visual consistency, and shallow (Direct Fusion) for audio to synchronize animation with speech while preserving motion diversity. Extensive experiments demonstrate that LetsTalk achieves state-of-the-art generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, while maintaining remarkable efficiency with 8x fewer parameters than previous approaches.
[159] UnZipLoRA: Separating Content and Style from a Single Image
Chang Liu, Viraj Shah, Aiyu Cui, Svetlana Lazebnik
Main category: cs.CV
TL;DR: UnZipLoRA decomposes single images into separate subject and style LoRAs that can be independently manipulated and recombined through direct addition.
Details
Motivation: Existing personalization techniques focus on either subject or style in isolation or require separate training sets, lacking the ability to disentangle both elements from a single image.Method: Trains two distinct LoRAs simultaneously using novel prompt separation technique and column/block separation strategies to ensure subject-style disentanglement and compatibility.
Result: Enables independent manipulation of subject and style, generating variations, applying extracted style to new subjects, and reconstructing original images with novel variations.
Conclusion: Outperforms state-of-the-art methods (DreamBooth-LoRA, Inspiration Tree, B-LoRA) in human studies and quantitative metrics, demonstrating effective subject-style disentanglement from single images.
Abstract: This paper introduces UnZipLoRA, a method for decomposing an image into its constituent subject and style, represented as two distinct LoRAs (Low-Rank Adaptations). Unlike existing personalization techniques that focus on either subject or style in isolation, or require separate training sets for each, UnZipLoRA disentangles these elements from a single image by training both the LoRAs simultaneously. UnZipLoRA ensures that the resulting LoRAs are compatible, i.e., they can be seamlessly combined using direct addition. UnZipLoRA enables independent manipulation and recontextualization of subject and style, including generating variations of each, applying the extracted style to new subjects, and recombining them to reconstruct the original image or create novel variations. To address the challenge of subject and style entanglement, UnZipLoRA employs a novel prompt separation technique, as well as column and block separation strategies to accurately preserve the characteristics of subject and style, and ensure compatibility between the learned LoRAs. Evaluation with human studies and quantitative metrics demonstrates UnZipLoRA’s effectiveness compared to other state-of-the-art methods, including DreamBooth-LoRA, Inspiration Tree, and B-LoRA.
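The "direct addition" property has a compact algebraic form: each LoRA contributes a low-rank weight delta, and recombination is a plain sum. A minimal torch sketch with illustrative shapes and a hypothetical scale parameter; the compatibility of the two deltas is what UnZipLoRA's training is meant to guarantee, while the merge itself is trivial:

```python
import torch

def merge_loras(base_weight: torch.Tensor,
                subject_lora: tuple[torch.Tensor, torch.Tensor],
                style_lora: tuple[torch.Tensor, torch.Tensor],
                scale: float = 1.0) -> torch.Tensor:
    """Recombine subject and style LoRAs by direct addition.

    Each LoRA is an (up, down) pair with up: (d_out, r) and
    down: (r, d_in), so its weight delta is up @ down.
    """
    d_subject = subject_lora[0] @ subject_lora[1]
    d_style = style_lora[0] @ style_lora[1]
    return base_weight + scale * (d_subject + d_style)

# Toy example on a single 64x64 linear layer with rank-4 adapters
W = torch.randn(64, 64)
subj = (torch.randn(64, 4), torch.randn(4, 64))
sty = (torch.randn(64, 4), torch.randn(4, 64))
W_combined = merge_loras(W, subj, sty)
```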
[160] Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?
Yifan Zhang, Junhui Hou
Main category: cs.CV
TL;DR: CMCR framework improves 3D representation learning by integrating both modality-shared and modality-specific features through masked image modeling, occupancy estimation, and a unified multi-modal codebook.
Details
Motivation: Existing cross-modal contrastive distillation methods focus only on modality-shared features, neglecting modality-specific features, leading to suboptimal 3D representations.Method: Proposes CMCR framework with masked image modeling, occupancy estimation tasks, a multi-modal unified codebook for shared embedding space, and geometry-enhanced masked image modeling.
Result: Extensive experiments show CMCR mitigates traditional challenges and consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks.
Conclusion: CMCR effectively addresses limitations of current contrastive methods by better integrating both modality-shared and modality-specific features for improved 3D representation learning.
Abstract: Cross-modal contrastive distillation has recently been explored for learning effective 3D representations. However, existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process, which leads to suboptimal representations. In this paper, we theoretically analyze the limitations of current contrastive methods for 3D representation learning and propose a new framework, namely CMCR, to address these shortcomings. Our approach improves upon traditional methods by better integrating both modality-shared and modality-specific features. Specifically, we introduce masked image modeling and occupancy estimation tasks to guide the network in learning more comprehensive modality-specific features. Furthermore, we propose a novel multi-modal unified codebook that learns an embedding space shared across different modalities. Besides, we introduce geometry-enhanced masked image modeling to further boost 3D representation learning. Extensive experiments demonstrate that our method mitigates the challenges faced by traditional approaches and consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks. Code will be available at https://github.com/Eaphan/CMCR.
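A multi-modal unified codebook is, at its core, vector quantization shared across modalities: features from either the image or the LiDAR branch are snapped to their nearest code, so both land in one discrete embedding space. The sketch below uses standard VQ machinery (nearest-code lookup, straight-through gradients) as an assumed stand-in for the paper's design; sizes are illustrative.

```python
import torch
import torch.nn as nn

class UnifiedCodebook(nn.Module):
    """Illustrative shared codebook for cross-modal quantization."""
    def __init__(self, n_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codes = nn.Embedding(n_codes, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (N, dim) embeddings from either modality
        d = torch.cdist(z, self.codes.weight)   # (N, n_codes) distances
        idx = d.argmin(dim=1)                   # nearest code per feature
        zq = self.codes(idx)
        return z + (zq - z).detach()            # straight-through gradient

cb = UnifiedCodebook()
img_q = cb(torch.randn(16, 256))    # image features
lidar_q = cb(torch.randn(16, 256))  # LiDAR features share the same codes
```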
[161] MetaWild: A Multimodal Dataset for Animal Re-Identification with Environmental Metadata
Yuzhuo Li, Di Zhao, Tingrui Qiao, Yihao Wu, Bo Pang, Yun Sing Koh
Main category: cs.CV
TL;DR: Proposes MetaWild dataset and Meta-Feature Adapter (MFA) to incorporate environmental metadata with visual data for improved animal re-identification, addressing limitations of existing visual-only approaches.
Details
Motivation: Existing Animal ReID datasets rely exclusively on visual data, overlooking environmental metadata that ecologists have identified as highly correlated with animal behavior and identity. Multimodal models' text-processing capabilities are underutilized in current datasets.Method: Proposes Meta-Feature Adapter (MFA), a lightweight module that can be incorporated into existing vision-language model-based Animal ReID methods to leverage both environmental metadata and visual information. Introduces MetaWild dataset with multimodal data.
Result: Experiments on MetaWild show that combining baseline ReID models with MFA to incorporate metadata consistently improves performance compared to using visual information alone.
Conclusion: The approach validates the effectiveness of incorporating metadata in re-identification and hopes to inspire further exploration of multimodal approaches for Animal ReID.
Abstract: Identifying individual animals within large wildlife populations is essential for effective wildlife monitoring and conservation efforts. Recent advancements in computer vision have shown promise in animal re-identification (Animal ReID) by leveraging data from camera traps. However, existing Animal ReID datasets rely exclusively on visual data, overlooking environmental metadata that ecologists have identified as highly correlated with animal behavior and identity, such as temperature and circadian rhythms. Moreover, the emergence of multimodal models capable of jointly processing visual and textual data presents new opportunities for Animal ReID, but existing datasets fail to leverage these models' text-processing capabilities, limiting their full potential. To address these gaps, we introduce MetaWild, a multimodal Animal ReID dataset that pairs visual data with environmental metadata. Additionally, to facilitate the use of metadata in existing ReID methods, we propose the Meta-Feature Adapter (MFA), a lightweight module that can be incorporated into existing vision-language model (VLM)-based Animal ReID methods, allowing ReID models to leverage both environmental metadata and visual information to improve ReID performance. Experiments on MetaWild show that combining baseline ReID models with MFA to incorporate metadata consistently improves performance compared to using visual information alone, validating the effectiveness of incorporating metadata in re-identification. We hope that our proposed dataset can inspire further exploration of multimodal approaches for Animal ReID.
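As a rough picture of what a lightweight metadata adapter can look like, here is a hypothetical gated-fusion module: the layer sizes, gating design, and metadata dimensionality are assumptions for illustration, not the paper's MFA.

```python
import torch
import torch.nn as nn

class MetaFeatureAdapter(nn.Module):
    """Illustrative adapter fusing environmental metadata
    (e.g. temperature, hour of day) with visual embeddings."""
    def __init__(self, visual_dim: int = 512, meta_dim: int = 4):
        super().__init__()
        self.meta_proj = nn.Sequential(
            nn.Linear(meta_dim, visual_dim), nn.ReLU(),
            nn.Linear(visual_dim, visual_dim),
        )
        self.gate = nn.Linear(visual_dim * 2, visual_dim)

    def forward(self, visual: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        m = self.meta_proj(meta)                                  # embed metadata
        g = torch.sigmoid(self.gate(torch.cat([visual, m], dim=-1)))
        return visual + g * m                                     # gated residual fusion

# feats: (B, 512) visual embeddings; meta: (B, 4) normalized metadata
adapter = MetaFeatureAdapter()
fused = adapter(torch.randn(8, 512), torch.randn(8, 4))
```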
[162] MMHMER:Multi-viewer and Multi-task for Handwritten Mathematical Expression Recognition
Kehua Chen, Haoyang Shen, Lifan Zhong, Mingyi Chen
Main category: cs.CV
TL;DR: Proposes MMHMER, a multi-viewer multi-task approach combining CNN and Transformer architectures for handwritten math expression recognition, achieving state-of-the-art performance on CROHME datasets.
Details
Motivation: Existing HMER methods use either CNN/RNN or Transformer architectures, each with strengths and weaknesses. The paper aims to effectively fuse these two approaches to leverage their complementary capabilities.Method: Develops a CNN-Transformer multi-viewer, multi-task framework that integrates CNN’s feature extraction capabilities with Transformer’s sequence modeling abilities.
Result: Achieves 63.96%, 62.51%, and 65.46% ExpRate on CROHME14, CROHME16, and CROHME19 datasets, outperforming Posformer by 1.28%, 1.48%, and 0.58% absolute gains.
Conclusion: The proposed multi-view, multi-task framework successfully integrates CNN and Transformer strengths, demonstrating improved performance in handling handwritten mathematical expression complexity.
Abstract: Handwritten Mathematical Expression Recognition (HMER) methods have made remarkable progress, with most existing HMER approaches based on either a hybrid CNN/RNN architecture with GRU or a Transformer architecture. Each of these has its strengths and weaknesses. Leveraging different model structures as viewers and effectively integrating their diverse capabilities presents an intriguing avenue for exploration. This involves addressing two key challenges: 1) How to fuse these two methods effectively, and 2) How to achieve higher performance under an appropriate level of complexity. This paper proposes an efficient CNN-Transformer multi-viewer, multi-task approach to enhance the model's recognition performance. Our MMHMER model achieves 63.96%, 62.51%, and 65.46% ExpRate on CROHME14, CROHME16, and CROHME19, outperforming Posformer with an absolute gain of 1.28%, 1.48%, and 0.58%. The main contribution of our approach is that we propose a new multi-view, multi-task framework that can effectively integrate the strengths of CNN and Transformer. By leveraging the feature extraction capabilities of CNN and the sequence modeling capabilities of Transformer, our model can better handle the complexity of handwritten mathematical expressions.
[163] Dynamic watermarks in images generated by diffusion models
Yunzhuo Chen, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian
Main category: cs.CV
TL;DR: A multi-stage watermarking framework for diffusion models that embeds both fixed and dynamic watermarks to enable copyright protection and source verification of AI-generated images.
Details
Motivation: Address ethical concerns around intellectual property protection and misuse of synthetic media from text-to-image diffusion models by establishing copyright and traceability mechanisms.Method: Multi-stage watermarking with: (i) fixed watermark in diffusion model’s noise distribution, (ii) dynamic human-imperceptible watermark in generated images using fine-tuned decoder. Uses SSIM and cosine similarity to adapt watermark shape/color to content while maintaining robustness.
Result: Reliable source verification through watermark classification, robustness against various attacks, minimal impact on image quality, and creation of watermarked image dataset for research.
Conclusion: Provides scalable solution for AI-generated content security, enabling model ownership verification and misuse prevention while advancing the field of content protection.
Abstract: High-fidelity text-to-image diffusion models have revolutionized visual content generation, but their widespread use raises significant ethical concerns, including intellectual property protection and the misuse of synthetic media. To address these challenges, we propose a novel multi-stage watermarking framework for diffusion models, designed to establish copyright and trace generated images back to their source. Our multi-stage watermarking technique involves embedding: (i) a fixed watermark that is localized in the diffusion model's learned noise distribution and (ii) a human-imperceptible, dynamic watermark in generated images, leveraging a fine-tuned decoder. By leveraging the Structural Similarity Index Measure (SSIM) and cosine similarity, we adapt the watermark's shape and color to the generated content while maintaining robustness. We demonstrate that our method enables reliable source verification through watermark classification, even when the dynamic watermark is adjusted for content-specific variations. To support further research, we generate a dataset of watermarked images and introduce a methodology to evaluate the statistical impact of watermarking on generated content. Additionally, we rigorously test our framework against various attack scenarios, demonstrating its robustness and minimal impact on image quality. Our work advances the field of AI-generated content security by providing a scalable solution for model ownership verification and misuse prevention.
[164] Hands-On: Segmenting Individual Signs from Continuous Sequences
JianHe Low, Harry Walsh, Ozge Mercanoglu Sincan, Richard Bowden
Main category: cs.CV
TL;DR: Transformer-based architecture for continuous sign language segmentation using BIO tagging with HaMeR hand features and 3D angles, achieving SOTA results on DGS Corpus and surpassing benchmarks on BSLCorpus.
Details
Motivation: Address the challenge of continuous sign language segmentation, which is crucial for sign language translation and data annotation tasks.Method: Transformer-based architecture that models temporal dynamics of signing, frames segmentation as sequence labeling using BIO tagging scheme, leverages HaMeR hand features complemented with 3D angles.
Result: Achieves state-of-the-art results on DGS Corpus, and the proposed features surpass prior benchmarks on BSLCorpus.
Conclusion: The proposed transformer-based approach with specialized hand features and 3D angles effectively addresses continuous sign language segmentation, demonstrating superior performance on major sign language corpora.
Abstract: This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages the HaMeR hand features, and is complemented with 3D Angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.
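Framing segmentation as BIO sequence labeling means the model's per-frame tags must be decoded back into sign intervals. A minimal decoder, independent of the paper's architecture:

```python
def bio_to_segments(tags: list[str]) -> list[tuple[int, int]]:
    """Convert per-frame BIO tags into (start, end) sign segments.

    'B' opens a segment, 'I' continues it, 'O' closes it.
    End indices are inclusive frame numbers.
    """
    segments, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                segments.append((start, i - 1))  # close the previous segment
            start = i
        elif tag == "O" and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(tags) - 1))
    return segments

# One sign over frames 1-3 and another over frames 5-6
assert bio_to_segments(["O", "B", "I", "I", "O", "B", "I"]) == [(1, 3), (5, 6)]
```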
[165] Real-time Neural Rendering of LiDAR Point Clouds
Joni Vanherck, Brent Zoomers, Tom Mertens, Lode Jorissen, Nick Michiels
Main category: cs.CV
TL;DR: Efficient real-time rendering method for static LiDAR scans using U-Net and depth-based filtering to produce photorealistic images without expensive preprocessing.
Details
Motivation: Static LiDAR scanners produce accurate but artifact-filled point clouds that are unsuitable for direct display, requiring a solution that avoids expensive preprocessing or scene-specific training.Method: Uses a U-Net deep convolutional model with depth-based heuristic prefiltering to transform naive point cloud projections into realistic renderings, handling occlusion, color inconsistencies, and varying point densities. Includes synthetic training data generation for imperfect ground truth alignment.
Result: Achieves real-time rendering rates using off-the-shelf GPU and outperforms state-of-the-art methods in both speed and quality.
Conclusion: The proposed method provides an efficient, real-time solution for rendering photorealistic images from LiDAR scans without the need for expensive preprocessing or scene-specific model training.
Abstract: Static LiDAR scanners produce accurate, dense, colored point clouds, but often contain obtrusive artifacts which makes them ill-suited for direct display. We propose an efficient method to render photorealistic images of such scans without any expensive preprocessing or training of a scene-specific model. A naive projection of the point cloud to the output view using 1x1 pixels is fast and retains the available detail, but also results in unintelligible renderings as background points leak in between the foreground pixels. The key insight is that these projections can be transformed into a realistic result using a deep convolutional model in the form of a U-Net, and a depth-based heuristic that prefilters the data. The U-Net also handles LiDAR-specific problems such as missing parts due to occlusion, color inconsistencies and varying point densities. We also describe a method to generate synthetic training data to deal with imperfectly-aligned ground truth images. Our method achieves real-time rendering rates using an off-the-shelf GPU and outperforms the state-of-the-art in both speed and quality.
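The depth-based prefiltering heuristic can be illustrated with a per-pixel z-buffer: points that land on a pixel but sit well behind the nearest point there are likely leaked background and are dropped before the U-Net sees the projection. The tolerance value and the pixel-exact binning below are illustrative assumptions; the paper's heuristic may differ.

```python
import numpy as np

def depth_prefilter(px: np.ndarray, py: np.ndarray, depth: np.ndarray,
                    h: int, w: int, rel_tol: float = 0.05) -> np.ndarray:
    """Per-pixel z-buffer heuristic against background leakage.

    px, py: integer pixel coordinates of each projected point;
    depth:  each point's distance to the camera. A point survives only
    if it lies within rel_tol of the nearest point on its pixel.
    """
    zbuf = np.full((h, w), np.inf)
    np.minimum.at(zbuf, (py, px), depth)   # nearest depth per pixel
    return depth <= zbuf[py, px] * (1.0 + rel_tol)

# 100k random points projected into a 480x640 view
rng = np.random.default_rng(0)
px = rng.integers(0, 640, 100_000)
py = rng.integers(0, 480, 100_000)
keep = depth_prefilter(px, py, rng.uniform(1.0, 50.0, 100_000), 480, 640)
```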
[166] DuCos: Duality Constrained Depth Super-Resolution via Foundation Model
Zhiqiang Yan, Zhengxue Wang, Haoye Dong, Jun Li, Jian Yang, Gim Hee Lee
Main category: cs.CV
TL;DR: DuCos is a novel depth super-resolution framework using Lagrangian duality theory with foundation model prompts, achieving superior generalization and performance.
Details
Motivation: To enhance depth super-resolution accuracy and robustness by integrating multiple constraints and reconstruction objectives through a principled Lagrangian duality framework.Method: Uses Lagrangian duality theory with two prompt components: Correlative Fusion (CF) for geometric alignment and feature fusion, and Gradient Regulation (GR) for consistency with sharp-edged depth maps from foundation models.
Result: Extensive experiments show DuCos outperforms state-of-the-art methods with superior accuracy, robustness, and generalization across diverse scenarios.
Conclusion: DuCos provides a synergistic and principled framework that significantly improves depth super-resolution through flexible constraint integration and foundation model prompting.
Abstract: We introduce DuCos, a novel depth super-resolution framework grounded in Lagrangian duality theory, offering a flexible integration of multiple constraints and reconstruction objectives to enhance accuracy and robustness. Our DuCos is the first to significantly improve generalization across diverse scenarios with foundation models as prompts. The prompt design consists of two key components: Correlative Fusion (CF) and Gradient Regulation (GR). CF facilitates precise geometric alignment and effective fusion between prompt and depth features, while GR refines depth predictions by enforcing consistency with sharp-edged depth maps derived from foundation models. Crucially, these prompts are seamlessly embedded into the Lagrangian constraint term, forming a synergistic and principled framework. Extensive experiments demonstrate that DuCos outperforms existing state-of-the-art methods, achieving superior accuracy, robustness, and generalization.
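The Lagrangian framing means the prompt-derived constraints enter the objective through multipliers that are themselves updated. The sketch below shows textbook dual ascent on a generic constraint residual, only to illustrate the mechanism; DuCos's actual constraint terms (CF and GR) are not reproduced here.

```python
import torch

def dual_constrained_step(pred: torch.Tensor, target: torch.Tensor,
                          constraint_residual: torch.Tensor,
                          lam: torch.Tensor, eta: float = 0.01) -> float:
    """One primal-dual step on a generic constrained reconstruction loss.

    The primal pass minimizes reconstruction error plus lambda-weighted
    constraint residuals; the dual pass takes projected gradient ascent
    on lambda so violated constraints are weighted more heavily.
    """
    recon = torch.nn.functional.l1_loss(pred, target)
    lagrangian = recon + (lam * constraint_residual).sum()
    lagrangian.backward()                       # primal: model gradients
    with torch.no_grad():                       # dual: keep lambda >= 0
        lam += eta * constraint_residual.detach()
        lam.clamp_(min=0.0)
    return float(lagrangian.detach())
```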
[167] Endo-FASt3r: Endoscopic Foundation model Adaptation for Structure from motion
Mona Sheikh Zeinoddin, Mobarak I. Hoque, Zafer Tandogdu, Greg Shaw, Matthew J. Clarkson, Evangelos Mazomenos, Danail Stoyanov
Main category: cs.CV
TL;DR: Endo-FASt3r is the first monocular self-supervised learning framework that uses foundation models for both depth and pose estimation in endoscopic surgery, achieving 10% improvement in pose estimation and 2% improvement in depth estimation over prior methods.
Details
Motivation: Accurate depth and camera pose estimation is crucial for high-quality 3D visualizations in robotic-assisted surgery, but previous foundation model adaptations only focused on depth estimation and used low-rank adaptation approaches that constrained model updates.Method: Proposed Endo-FASt3r framework with Reloc3rX (extended relative pose estimation foundation model) and DoMoRA (novel adaptation technique enabling higher-rank updates and faster convergence) for monocular self-supervised learning of both depth and pose.
Result: Experiments on SCARED dataset show 10% improvement in pose estimation and 2% improvement in depth estimation over prior work. Similar performance gains observed on Hamlyn and StereoMIS datasets, demonstrating generalizability.
Conclusion: Endo-FASt3r successfully demonstrates the effectiveness of using foundation models for both depth and pose estimation in endoscopic surgery, with significant performance improvements and cross-dataset generalizability through the proposed Reloc3rX and DoMoRA techniques.
Abstract: Accurate depth and camera pose estimation is essential for achieving high-quality 3D visualisations in robotic-assisted surgery. Despite recent advancements in foundation model adaptation to monocular depth estimation of endoscopic scenes via self-supervised learning (SSL), no prior work has explored their use for pose estimation. These methods rely on low rank-based adaptation approaches, which constrain model updates to a low-rank space. We propose Endo-FASt3r, the first monocular SSL depth and pose estimation framework that uses foundation models for both tasks. We extend the Reloc3r relative pose estimation foundation model by designing Reloc3rX, introducing modifications necessary for convergence in SSL. We also present DoMoRA, a novel adaptation technique that enables higher-rank updates and faster convergence. Experiments on the SCARED dataset show that Endo-FASt3r achieves a substantial 10% improvement in pose estimation and a 2% improvement in depth estimation over prior work. Similar performance gains on the Hamlyn and StereoMIS datasets reinforce the generalisability of Endo-FASt3r across different datasets.
[168] VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, Ziwei Liu
Main category: cs.CV
TL;DR: VBench-2.0 is a new benchmark that evaluates video generation models for intrinsic faithfulness beyond visual appeal, assessing physical laws, commonsense reasoning, anatomical correctness, and compositional integrity.
Details
Motivation: Current video generation benchmarks focus on superficial faithfulness (visual appeal and temporal coherence) but fail to assess whether generated videos adhere to real-world principles like physics and commonsense reasoning.Method: VBench-2.0 evaluates five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, using a combination of SOTA vision-language models, LLMs, and specialized anomaly detection methods, validated through extensive human annotations.
Result: The benchmark provides automated evaluation of intrinsic faithfulness, going beyond current metrics to assess whether generated videos follow real-world physical laws and commonsense principles.
Conclusion: VBench-2.0 sets a new standard for evaluating video generative models, pushing the field toward achieving true “world models” through intrinsic faithfulness essential for applications like AI filmmaking and simulated world modeling.
Abstract: Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focus on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real “world models” through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored to individual dimensions, our evaluation framework integrates generalists such as SOTA VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive human annotations to ensure evaluation alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models in pursuit of intrinsic faithfulness.
[169] CoMatcher: Multi-View Collaborative Feature Matching
Jintao Zhang, Zimin Xia, Mingyue Dong, Shuhan Shen, Linwei Yue, Xianwei Zheng
Main category: cs.CV
TL;DR: CoMatcher is a multi-view matching approach that addresses limitations of pairwise matching by leveraging complementary context from multiple views and cross-view projection consistency for more reliable 3D scene understanding.
Details
Motivation: Pairwise matching paradigms often fail in complex scenarios with occlusions or extreme viewpoint changes due to inherent uncertainty in interpreting 3D structures from limited two-view observations and information loss from 3D-to-2D projection.Method: CoMatcher uses deep multi-view matching to leverage complementary context cues from different views for holistic 3D scene understanding and utilizes cross-view projection consistency to infer reliable global solutions. A groupwise framework is developed to exploit cross-view relationships for large-scale matching.
Result: Extensive experiments on various complex scenarios demonstrate the method’s superiority over mainstream two-view matching paradigms.
Conclusion: Multi-view collaborative matching with CoMatcher provides more reliable track construction in complex scenarios compared to traditional pairwise approaches, effectively addressing challenges from occlusions and viewpoint changes.
Abstract: This paper proposes a multi-view collaborative matching strategy for reliable track construction in complex scenarios. We observe that the pairwise matching paradigms applied to image set matching often result in ambiguous estimation when the selected independent pairs exhibit significant occlusions or extreme viewpoint changes. This challenge primarily stems from the inherent uncertainty in interpreting intricate 3D structures based on limited two-view observations, as the 3D-to-2D projection leads to significant information loss. To address this, we introduce CoMatcher, a deep multi-view matcher to (i) leverage complementary context cues from different views to form a holistic 3D scene understanding and (ii) utilize cross-view projection consistency to infer a reliable global solution. Building on CoMatcher, we develop a groupwise framework that fully exploits cross-view relationships for large-scale matching tasks. Extensive experiments on various complex scenarios demonstrate the superiority of our method over the mainstream two-view matching paradigm.
[170] Neural Restoration of Greening Defects in Historical Autochrome Photographs Based on Purely Synthetic Data
Saptarshi Neil Sinha, P. Julius Kuehn, Johannes Koppe, Arjan Kuijper, Michael Weinmann
Main category: cs.CV
TL;DR: First AI approach for automatic removal of greening color defects in digitized autochrome photographs using simulated defect data and generative AI with specialized loss function.
Details
Motivation: Preservation of early color photographs is challenged by deterioration issues like color bleeding and fading that current software cannot effectively restore, and there are no publicly available datasets with defect annotations for autochromes.Method: Introduced approach for accurately simulating greening defects, used synthesized data with ground truth annotations to train generative AI model with carefully designed loss function that accounts for color imbalances between defected and non-defected areas.
Result: The approach allows for efficient and effective restoration of greening color defects, overcoming limitations of alternative techniques that struggle with accurately reproducing original colors and require significant manual effort.
Conclusion: This work presents the first successful method for automatic removal of greening defects in autochrome photographs, addressing a critical preservation challenge through simulated data and specialized AI training.
Abstract: The preservation of early visual arts, particularly color photographs, is challenged by deterioration caused by aging and improper storage, leading to issues like blurring, scratches, color bleeding, and fading defects. Despite great advances in image restoration and enhancement in recent years, such systematic defects often cannot be restored by current state-of-the-art software features as available e.g. in Adobe Photoshop, but would require the incorporation of defect-aware priors into the underlying machine learning techniques. However, there are no publicly available datasets of autochromes with defect annotations. In this paper, we address these limitations and present the first approach that allows the automatic removal of greening color defects in digitized autochrome photographs. For this purpose, we introduce an approach for accurately simulating respective defects and use the respectively obtained synthesized data with its ground truth defect annotations to train a generative AI model with a carefully designed loss function that accounts for color imbalances between defected and non-defected areas. As demonstrated in our evaluation, our approach allows for the efficient and effective restoration of the considered defects, thereby overcoming limitations of alternative techniques that struggle with accurately reproducing original colors and may require significant manual effort.
[171] Reconstruction-Free Anomaly Detection with Diffusion Models
Shunsuke Sakai, Xiangteng He, Chunzhi Gu, Leonid Sigal, Tatsuhito Hasegawa
Main category: cs.CV
TL;DR: A novel inversion-based anomaly detection method that avoids explicit reconstruction by noising in latent space instead of denoising in RGB space, achieving state-of-the-art performance with 2x faster inference.
Details
Motivation: Current reconstruction-based anomaly detection methods using diffusion models require fine-grained noise-strength tuning and computationally expensive multi-step denoising, creating a tension between fidelity and efficiency.Method: Proposes detection via noising in latent space using DDIM inversion with few inversion steps to noise clean images, then measures deviation from prior distribution for anomaly scoring without explicit reconstruction.
Result: Achieves state-of-the-art anomaly detection performance across three widely used image AD datasets with approximately 2 times inference time speedup compared to previous methods.
Conclusion: The proposed reconstruction-free formulation through latent space noising effectively addresses the fidelity-efficiency trade-off in anomaly detection, providing both high accuracy and computational efficiency.
Abstract: Despite the remarkable success, recent reconstruction-based anomaly detection (AD) methods via diffusion modeling still involve fine-grained noise-strength tuning and computationally expensive multi-step denoising, leading to a fundamental tension between fidelity and efficiency. In this paper, we propose a novel inversion-based AD approach - detection via noising in latent space - which circumvents explicit reconstruction. Importantly, we contend that the limitations in prior reconstruction-based methods originate from the prevailing detection via denoising in RGB space paradigm. To address this, we model AD under a reconstruction-free formulation, which directly infers the final latent variable corresponding to the input image via DDIM inversion, and then measures the deviation based on the known prior distribution for anomaly scoring. Specifically, in approximating the original probability flow ODE using the Euler method, we only enforce very few inversion steps to noise the clean image to pursue inference efficiency. As the added noise is adaptively derived with the learned diffusion model, the original features for the clean testing image can still be leveraged to yield high detection accuracy. We perform extensive experiments and detailed analysis across three widely used image AD datasets under the unsupervised unified setting to demonstrate the effectiveness of our model, regarding state-of-the-art AD performance, and about 2 times inference time speedup without diffusion distillation.
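Once the clean image has been inverted with a few DDIM/Euler steps, scoring reduces to measuring how far the resulting latent is from the standard Gaussian prior. A minimal proxy score, assuming the inversion has already been run; the paper's exact deviation measure is not reproduced here:

```python
import numpy as np

def anomaly_score(latent: np.ndarray) -> float:
    """Score a DDIM-inverted latent by its deviation from the N(0, I) prior.

    A nominal image inverts to a latent that looks like standard Gaussian
    noise, so the mean squared coordinate should sit near 1; anomalies
    push it away from the prior.
    """
    z = latent.ravel()
    return float(abs(z @ z / z.size - 1.0))

rng = np.random.default_rng(0)
print(anomaly_score(rng.standard_normal(4096)))        # near 0: in-distribution
print(anomaly_score(rng.standard_normal(4096) + 0.5))  # ~0.25: off the prior
```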
[172] Enhanced Anomaly Detection for Capsule Endoscopy Using Ensemble Learning Strategies
Julia Werner, Christoph Gerum, Jorg Nick, Maxime Le Floch, Franz Brinkmann, Jochen Hampe, Oliver Bringmann
Main category: cs.CV
TL;DR: Proposes an ensemble strategy using diverse loss functions for anomaly detection in capsule endoscopy, achieving high AUC scores with fewer parameters than current baselines.
Details
Motivation: Capsule endoscopy AI faces challenges due to limited model size constraints and data scarcity, requiring efficient anomaly detection methods that can operate within hardware limitations.Method: Uses ensemble learning with multiple neural networks trained with different loss functions from anomaly detection field, rather than same training algorithm for each network.
Result: Achieved 76.86% AUC on Kvasir-Capsule dataset and 76.98% AUC on Galar dataset, outperforming baselines with significantly fewer parameters.
Conclusion: The approach enables effective anomaly detection in capsule endoscopy with reduced computational requirements, making AI integration more feasible for real-world medical applications.
Abstract: Capsule endoscopy is a method to capture images of the gastrointestinal tract and screen for diseases which might remain hidden if investigated with standard endoscopes. Due to the limited size of a video capsule, embedding AI models directly into the capsule demands careful consideration of the model size and thus complicates anomaly detection in this field. Furthermore, the scarcity of available data in this domain poses an ongoing challenge to achieving effective anomaly detection. Thus, this work introduces an ensemble strategy to address this challenge in anomaly detection tasks in video capsule endoscopies, requiring only a small number of individual neural networks during both the training and inference phases. Ensemble learning combines the predictions of multiple independently trained neural networks. This has shown to be highly effective in enhancing both the accuracy and robustness of machine learning models. However, this comes at the cost of higher memory usage and increased computational effort, which quickly becomes prohibitive in many real-world applications. Instead of applying the same training algorithm to each individual network, we propose using various loss functions, drawn from the anomaly detection field, to train each network. The methods are validated on the two largest publicly available datasets for video capsule endoscopy images, the Galar and the Kvasir-Capsule dataset. We achieve an AUC score of 76.86% on the Kvasir-Capsule and an AUC score of 76.98% on the Galar dataset. Our approach outperforms current baselines with significantly fewer parameters across all models, which is a crucial step towards incorporating artificial intelligence into capsule endoscopies.
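The key idea, a shared architecture but a different training loss per ensemble member, fits in a few lines. Everything below (the tiny stand-in model, the particular loss pair, the probability averaging) is an illustrative assumption rather than the paper's configuration:

```python
import torch
import torch.nn as nn

def make_member() -> nn.Module:
    """Tiny stand-in classifier; the real members are CNNs sized to fit
    video-capsule hardware constraints."""
    return nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 1))

def focal_loss(logit: torch.Tensor, target: torch.Tensor, gamma: float = 2.0):
    """One example of a loss drawn from the anomaly-detection literature."""
    bce = nn.functional.binary_cross_entropy_with_logits(
        logit, target, reduction="none")
    p_t = torch.exp(-bce)                     # model's probability of the truth
    return ((1 - p_t) ** gamma * bce).mean()

# Same architecture, different training loss per member (illustrative pair)
member_losses = [nn.BCEWithLogitsLoss(), focal_loss]
members = [make_member() for _ in member_losses]

def ensemble_score(x: torch.Tensor) -> torch.Tensor:
    """Average the members' anomaly probabilities at inference time."""
    with torch.no_grad():
        return torch.stack([torch.sigmoid(m(x)) for m in members]).mean(dim=0)

scores = ensemble_score(torch.randn(4, 1, 64, 64))  # (4, 1) anomaly scores
```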
[173] Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression
Dingcheng Zhen, Qian Qiao, Xu Zheng, Tan Yu, Kangxi Wu, Ziwei Zhang, Siyuan Liu, Shunshun Yin, Ming Tao
Main category: cs.CV
TL;DR: TransDiff combines autoregressive transformers with diffusion models for superior image generation, achieving state-of-the-art FID/IS scores and significantly faster inference, plus introduces MRAR paradigm for further quality improvements.
Details
Motivation: To overcome limitations of standalone AR Transformer or diffusion models by combining their strengths for better image generation performance and efficiency.Method: Joint modeling framework that encodes labels/images into semantic features, uses diffusion to estimate image distribution, and introduces Multi-Reference Autoregression (MRAR) for referencing multiple previous images during generation.
Result: Achieves FID 1.61 and IS 293.4 on ImageNet 256x256, with 2x faster inference than AR Transformer methods and 112x faster than diffusion-only models. MRAR further improves FID to 1.42.
Conclusion: TransDiff successfully integrates AR Transformer and diffusion models, establishing a new frontier in image generation with superior performance and efficiency.
Abstract: We introduce TransDiff, the first image generation model that marries Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on standalone AR Transformer or diffusion models. Specifically, TransDiff achieves a Fréchet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, and further provides 2x faster inference latency compared to state-of-the-art methods based on AR Transformer and 112x faster inference compared to diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which performs autoregressive generation by predicting the next image. MRAR enables the model to reference multiple previously generated images, thereby facilitating the learning of more diverse representations and improving the quality of generated images in subsequent iterations. By applying MRAR, the performance of TransDiff is improved, with the FID reduced from 1.61 to 1.42. We expect TransDiff to open up a new frontier in the field of image generation.
[174] Improving Token-based Object Detection with Video
Abhineet Singh, Nilanjan Ray
Main category: cs.CV
TL;DR: Extends Pix2Seq object detector to videos using sequence token representation, eliminating box sampling and postprocessing heuristics while enabling direct 3D box/tracklet output instead of 2D box linking.
Details
Motivation: To create an end-to-end video object detector that avoids limitations of conventional methods like loss sparsity from box sampling and heuristics-based postprocessing, while providing integrated 3D object representations.Method: Represents video objects as variable-length sequences of discrete tokens, processes video subsequences to output fully integrated 3D boxes or tracklets directly without 2D box linking.
Result: Shows consistent improvement over baseline Pix2Seq static detector across datasets and competitive performance with state-of-the-art video detectors on UA-DETRAC, despite computational bottlenecks.
Conclusion: The approach successfully extends sequence-based object detection to videos, eliminating conventional constraints and enabling scalable video object detection with potential for multi-object tracking applications.
Abstract: This paper improves upon the Pix2Seq object detector by extending it for videos. In the process, it introduces a new way to perform end-to-end video object detection that improves upon existing video detectors in two key ways. First, by representing objects as variable-length sequences of discrete tokens, we can succinctly represent widely varying numbers of video objects, with diverse shapes and locations, without having to inject any localization cues in the training process. This eliminates the need to sample the space of all possible boxes that constrains conventional detectors and thus solves the dual problems of loss sparsity during training and heuristics-based postprocessing during inference. Second, it conceptualizes and outputs the video objects as fully integrated and indivisible 3D boxes or tracklets instead of generating image-specific 2D boxes and linking these boxes together to construct the video object, as done in most conventional detectors. This allows it to scale effortlessly with available computational resources by simply increasing the length of the video subsequence that the network takes as input, even generalizing to multi-object tracking if the subsequence can span the entire video. We compare our video detector with the baseline Pix2Seq static detector on several datasets and demonstrate consistent improvement, although with strong signs of being bottlenecked by our limited computational resources. We also compare it with several video detectors on UA-DETRAC to show that it is competitive with the current state of the art even with the computational bottleneck. We make our code and models publicly available.
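A tracklet-as-token-sequence representation can be sketched directly: quantize the normalized box coordinates of every frame in the subsequence into a shared vocabulary and append a class token. Pix2Seq-style models use a similar quantization; the bin count and token layout below are assumptions for illustration.

```python
import numpy as np

def tracklet_to_tokens(boxes: np.ndarray, n_bins: int = 1000,
                       class_token: int = 1001) -> list[int]:
    """Serialize a video object into a discrete token sequence.

    boxes: (T, 4) normalized [x1, y1, x2, y2] per frame of the
    subsequence; coordinates are uniformly quantized into n_bins
    vocabulary entries, yielding one variable-length sequence per
    tracklet regardless of how many objects the video contains.
    """
    coords = np.clip((boxes * (n_bins - 1)).round().astype(int), 0, n_bins - 1)
    return coords.ravel().tolist() + [class_token]

# A 3-frame tracklet becomes 3*4 coordinate tokens plus its class token
boxes = np.array([[0.10, 0.20, 0.30, 0.40],
                  [0.12, 0.21, 0.32, 0.41],
                  [0.14, 0.22, 0.34, 0.42]])
assert len(tracklet_to_tokens(boxes)) == 13
```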
[175] CoT-Segmenter: Enhancing OOD Detection in Dense Road Scenes via Chain-of-Thought Reasoning
Jeonghyo Song, Kimin Yun, DaeUng Jo, Jinyoung Kim, Youngjoon Yoo
Main category: cs.CV
TL;DR: Proposes a Chain-of-Thought (CoT) framework using GPT-4 for improved Out-of-Distribution detection in semantic segmentation of road scenes, addressing three challenging scenarios where current methods fail.
Details
Motivation: Current OOD detection methods struggle with complex road environments, particularly in densely packed objects, distant scenes with small objects, and large foreground-dominant objects. The potential of CoT-based visual reasoning from LLMs like GPT-4 remains unexplored for this task.Method: A novel CoT-based framework leveraging foundation models (GPT-4) for enhanced image understanding and prompt-based reasoning aligned with problematic scene attributes in road anomaly detection.
Result: The framework consistently outperforms state-of-the-art methods on both standard benchmarks and a newly defined challenging subset of the RoadAnomaly dataset.
Conclusion: Provides a robust and interpretable solution for OOD semantic segmentation in complex driving environments by effectively utilizing LLM reasoning capabilities through CoT prompting.
Abstract: Effective Out-of-Distribution (OOD) detection is critical for ensuring the reliability of semantic segmentation models, particularly in complex road environments where safety and accuracy are paramount. Despite recent advancements in large language models (LLMs), notably GPT-4, which significantly enhanced multimodal reasoning through Chain-of-Thought (CoT) prompting, the application of CoT-based visual reasoning for OOD semantic segmentation remains largely unexplored. In this paper, through extensive analyses of road scene anomalies, we identify three challenging scenarios where current state-of-the-art OOD segmentation methods consistently struggle: (1) densely packed and overlapping objects, (2) distant scenes with small objects, and (3) large foreground-dominant objects. To address the presented challenges, we propose a novel CoT-based framework targeting OOD detection in road anomaly scenes. Our method leverages the extensive knowledge and reasoning capabilities of foundation models, such as GPT-4, to enhance OOD detection through improved image understanding and prompt-based reasoning aligned with observed problematic scene attributes. Extensive experiments show that our framework consistently outperforms state-of-the-art methods on both standard benchmarks and our newly defined challenging subset of the RoadAnomaly dataset, offering a robust and interpretable solution for OOD semantic segmentation in complex driving environments.
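Prompt-based reasoning aligned with the three problematic scene attributes might look like the following template-building helper; the wording is invented for illustration and is not the paper's prompt.

```python
def build_cot_prompt(scene_attribute: str) -> str:
    """Assemble a chain-of-thought prompt targeting one of the three
    problematic scene types identified in the paper."""
    steps = [
        "1. Describe the overall road scene and its layout.",
        "2. List every object you can identify and its approximate size.",
        f"3. Pay special attention to {scene_attribute}.",
        "4. For each object, reason step by step about whether it belongs "
        "to a known road-scene category.",
        "5. Report any region that does not match a known category.",
    ]
    return "You are analyzing a road scene for unknown objects.\n" + "\n".join(steps)

print(build_cot_prompt("densely packed and overlapping objects"))
```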
[176] SketchDNN: Joint Continuous-Discrete Diffusion for CAD Sketch Generation
Sathvik Chereddy, John Femiani
Main category: cs.CV
TL;DR: SketchDNN is a generative model for CAD sketch synthesis that uses a unified continuous-discrete diffusion process with Gaussian-Softmax diffusion to handle both continuous parameters and discrete class labels, achieving state-of-the-art results.
Details
Motivation: To address the challenges of modeling heterogeneous primitive parameterizations and permutation invariance of primitives in CAD sketches, which require handling both continuous and discrete variables simultaneously.Method: Uses Gaussian-Softmax diffusion where logits perturbed with Gaussian noise are projected onto the probability simplex via softmax transformation, enabling blended class labels for discrete variables in a unified diffusion framework.
Result: Significantly improves generation quality, reducing FID from 16.04 to 7.80 and NLL from 84.8 to 81.33 on the SketchGraphs dataset.
Conclusion: Establishes a new state-of-the-art in CAD sketch generation by effectively modeling both continuous and discrete variables through a unified diffusion process.
Abstract: We present SketchDNN, a generative model for synthesizing CAD sketches that jointly models both continuous parameters and discrete class labels through a unified continuous-discrete diffusion process. Our core innovation is Gaussian-Softmax diffusion, where logits perturbed with Gaussian noise are projected onto the probability simplex via a softmax transformation, facilitating blended class labels for discrete variables. This formulation addresses two key challenges, namely, the heterogeneity of primitive parameterizations and the permutation invariance of primitives in CAD sketches. Our approach significantly improves generation quality, reducing Fréchet Inception Distance (FID) from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33, establishing a new state-of-the-art in CAD sketch generation on the SketchGraphs dataset.
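The Gaussian-Softmax step itself is compact enough to sketch. The following is a minimal illustration assuming a standard variance-preserving noise schedule (the alpha_bar_t coefficient and all variable names are our assumptions, not the paper's notation): logits are perturbed with Gaussian noise and pushed through a softmax, yielding blended class labels on the probability simplex.

```python
import torch
import torch.nn.functional as F

def gaussian_softmax_diffuse(class_logits: torch.Tensor,
                             alpha_bar_t: float) -> torch.Tensor:
    """Forward step of Gaussian-Softmax diffusion (illustrative sketch).

    Logits are perturbed with Gaussian noise, then projected onto the
    probability simplex via softmax, producing 'blended' class labels.
    """
    noise = torch.randn_like(class_logits)
    # Variance-preserving interpolation between the clean logits and noise.
    noisy = (alpha_bar_t ** 0.5) * class_logits + ((1 - alpha_bar_t) ** 0.5) * noise
    return F.softmax(noisy, dim=-1)  # a point on the probability simplex

# Example: a one-hot "line" primitive among four primitive types.
logits = torch.tensor([[8.0, 0.0, 0.0, 0.0]])
blended = gaussian_softmax_diffuse(logits, alpha_bar_t=0.5)
```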
[177] Explicit Context Reasoning with Supervision for Visual Tracking
Fansheng Zeng, Bineng Zhong, Haiying Xia, Yufei Tan, Xiantao Hu, Liangtao Shi, Shuxiang Song
Main category: cs.CV
TL;DR: RSTrack introduces three core mechanisms for explicit context reasoning in visual tracking: context reasoning mechanism, forward supervision strategy, and efficient state modeling, achieving state-of-the-art performance with real-time speed.
Details
Motivation: Mainstream tracking algorithms lack explicit supervision for context association, making it difficult to effectively model the target's evolving dynamics and maintain temporal consistency in cross-frame modeling.Method: Proposes three core mechanisms: 1) Context Reasoning Mechanism that converts contextual associations into a temporal reasoning process, 2) Forward Supervision Strategy using true target features as anchors to constrain reasoning, and 3) Efficient State Modeling with compression-reconstruction to extract core features and remove redundancy.
Result: Achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds.
Conclusion: RSTrack effectively alleviates contextual association divergence in traditional temporal modeling through explicit supervision and reasoning mechanisms, demonstrating superior tracking performance with efficiency.
Abstract: Contextual reasoning with constraints is crucial for enhancing temporal consistency in cross-frame modeling for visual tracking. However, mainstream tracking algorithms typically associate context by merely stacking historical information without explicitly supervising the association process, making it difficult to effectively model the target’s evolving dynamics. To alleviate this problem, we propose RSTrack, which explicitly models and supervises context reasoning via three core mechanisms. 1) Context Reasoning Mechanism: Constructs a target state reasoning pipeline, converting unconstrained contextual associations into a temporal reasoning process that predicts the current representation based on historical target states, thereby enhancing temporal consistency. 2) Forward Supervision Strategy: Utilizes true target features as anchors to constrain the reasoning pipeline, guiding the predicted output toward the true target distribution and suppressing drift in the context reasoning process. 3) Efficient State Modeling: Employs a compression-reconstruction mechanism to extract the core features of the target, removing redundant information across frames and preventing ineffective contextual associations. These three mechanisms collaborate to effectively alleviate the issue of contextual association divergence in traditional temporal modeling. Experimental results show that RSTrack achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds. Our code is available at https://github.com/GXNU-ZhongLab/RSTrack.
[178] Seeing More with Less: Video Capsule Endoscopy with Multi-Task Learning
Julia Werner, Oliver Bause, Julius Oexle, Maxime Le Floch, Franz Brinkmann, Jochen Hampe, Oliver Bringmann
Main category: cs.CV
TL;DR: A multi-task neural network for video capsule endoscopy that combines self-localization and anomaly detection in a single compact model with only 1M parameters, achieving 93.63% localization accuracy and 87.48% anomaly detection accuracy.
Details
Motivation: Overcome short battery lifetime in capsule endoscopy devices by integrating AI for intelligent real-time decision-making, while addressing challenges of data sparsity and limited device resources.Method: Multi-task neural network combining gastrointestinal tract localization and small intestine anomaly detection within a single model, using established multi-task methods and Viterbi decoding for time-series analysis, with strict parameter constraints.
Result: Achieves 93.63% accuracy on localization task and 87.48% accuracy on anomaly detection task, outperforming current single-task models with only 1 million parameters.
Conclusion: The approach represents a significant advance in AI-based capsule endoscopy, enabling deployment in compact devices while improving performance over existing single-task solutions.
Abstract: Video capsule endoscopy has become increasingly important for investigating the small intestine within the gastrointestinal tract. However, a persistent challenge remains the short battery lifetime of such compact sensor edge devices. Integrating artificial intelligence can help overcome this limitation by enabling intelligent real-time decision-making, thereby reducing the energy consumption and prolonging the battery life. However, this remains challenging due to data sparsity and the limited resources of the device restricting the overall model size. In this work, we introduce a multi-task neural network that combines the functionalities of precise self-localization within the gastrointestinal tract with the ability to detect anomalies in the small intestine within a single model. Throughout the development process, we consistently restricted the total number of parameters to ensure the feasibility to deploy such model in a small capsule. We report the first multi-task results using the recently published Galar dataset, integrating established multi-task methods and Viterbi decoding for subsequent time-series analysis. This outperforms current single-task models and represents a significant advance in AI-based approaches in this field. Our model achieves an accuracy of 93.63% on the localization task and an accuracy of 87.48% on the anomaly detection task. The approach requires only 1 million parameters while surpassing the current baselines.
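As a rough illustration of the single-model, two-head design the abstract describes, here is a minimal multi-task sketch; the layer sizes and class counts are illustrative assumptions, not the published architecture, and the Viterbi post-processing is omitted.

```python
import torch
import torch.nn as nn

class CapsuleMultiTaskNet(nn.Module):
    """Minimal two-head multi-task sketch (not the paper's architecture).

    One shared lightweight backbone feeds a gastrointestinal-section
    localization head and a binary anomaly-detection head, mirroring the
    single-model design described in the abstract.
    """
    def __init__(self, num_sections: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.localization_head = nn.Linear(32, num_sections)
        self.anomaly_head = nn.Linear(32, 2)

    def forward(self, x):
        feats = self.backbone(x)
        return self.localization_head(feats), self.anomaly_head(feats)

# Joint training would minimize a weighted sum of the two task losses.
model = CapsuleMultiTaskNet()
loc_logits, anom_logits = model(torch.randn(1, 3, 224, 224))
```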
[179] Bench2ADVLM: A Closed-Loop Benchmark for Vision-language Models in Autonomous Driving
Tianyuan Zhang, Ting Jin, Lu Wang, Jiangfan Liu, Siyuan Liang, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Main category: cs.CV
TL;DR: Bench2ADVLM is a unified closed-loop evaluation framework for Vision-Language Models in autonomous driving that enables real-time testing across simulation and physical platforms, revealing performance limitations in current ADVLMs.
Details
Motivation: Current evaluation protocols for Vision-Language Model-based autonomous driving systems are limited to open-loop settings with static inputs, neglecting realistic closed-loop testing that captures interactive behavior, feedback resilience, and real-world safety requirements.Method: The framework uses a dual-system adaptation architecture inspired by cognitive theories, where high-level driving commands from ADVLMs are interpreted by a general-purpose VLM into standardized mid-level control actions. It includes a physical control abstraction layer to bridge simulation-reality gaps and a self-reflective scenario generation module for comprehensive testing.
Result: Experiments across diverse scenarios and multiple state-of-the-art ADVLMs on physical platforms validated the framework’s diagnostic capabilities, revealing that existing ADVLMs exhibit limited performance under closed-loop conditions.
Conclusion: Bench2ADVLM establishes a hierarchical evaluation pipeline that integrates high-level reasoning, mid-level simulation actions, and low-level real-world execution, providing a comprehensive framework for assessing ADVLM performance in realistic autonomous driving scenarios.
Abstract: Vision-Language Models (VLMs) have recently emerged as a promising paradigm in autonomous driving (AD). However, current performance evaluation protocols for VLM-based AD systems (ADVLMs) are predominantly confined to open-loop settings with static inputs, neglecting the more realistic and informative closed-loop setting that captures interactive behavior, feedback resilience, and real-world safety. To address this, we introduce Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs across both simulation and physical platforms. Inspired by dual-process theories of cognition, we first adapt diverse ADVLMs to simulation environments via a dual-system adaptation architecture. In this design, heterogeneous high-level driving commands generated by target ADVLMs (fast system) are interpreted by a general-purpose VLM (slow system) into standardized mid-level control actions suitable for execution in simulation. To bridge the gap between simulation and reality, we design a physical control abstraction layer that translates these mid-level actions into low-level actuation signals, enabling, for the first time, closed-loop testing of ADVLMs on physical vehicles. To enable more comprehensive evaluation, Bench2ADVLM introduces a self-reflective scenario generation module that automatically explores model behavior and uncovers potential failure modes for safety-critical scenario generation. Overall, Bench2ADVLM establishes a hierarchical evaluation pipeline that seamlessly integrates high-level abstract reasoning, mid-level simulation actions, and low-level real-world execution. Experiments on diverse scenarios across multiple state-of-the-art ADVLMs and physical platforms validate the diagnostic strength of our framework, revealing that existing ADVLMs still exhibit limited performance under closed-loop conditions.
[180] Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, Alex Wong
Main category: cs.CV
TL;DR: A method to adapt monocular depth estimators trained on perspective images to work with fisheye cameras using calibration tokens for latent space alignment, without retraining.
Details
Motivation: Foundational monocular depth estimators (FMDEs) trained on perspective images fail on fisheye images due to covariate shift from different camera calibration parameters, leading to inaccurate depth estimates.Method: Introduces calibration tokens as a lightweight adaptation mechanism to modulate latent embeddings, aligning fisheye image distributions to perspective ones. Uses self-supervised training by recalibrating perspective images to fisheye format and enforcing consistency between estimates.
Result: Consistently improves performance over state-of-the-art methods on both indoor and outdoor scenes using a single set of tokens, without requiring fisheye training data.
Conclusion: The proposed calibration token approach effectively bridges the domain gap between perspective and fisheye images, enabling FMDE reuse for fisheye cameras without retraining or fine-tuning.
Abstract: We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a lightweight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoor and outdoor scenes, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
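One way to picture the adaptation mechanism is a handful of learnable tokens prepended to the frozen encoder's patch embeddings, with only the tokens trained. This is a hedged sketch of that pattern, not the authors' implementation; token count and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CalibrationTokens(nn.Module):
    """Sketch of a lightweight token-based adapter (our reading of the idea).

    A small set of learnable tokens is prepended to the frozen encoder's
    patch embeddings so that self-attention can modulate the latents toward
    the perspective-image distribution; only the tokens are trained.
    """
    def __init__(self, num_tokens: int = 8, dim: int = 768):
        super().__init__()
        self.tokens = nn.Parameter(torch.zeros(1, num_tokens, dim))

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        b = patch_embeddings.shape[0]
        return torch.cat([self.tokens.expand(b, -1, -1), patch_embeddings], dim=1)

# Self-supervised training idea from the abstract: warp a perspective image
# into a fisheye view, run both through the FMDE, and penalize disagreement
# between the two depth estimates so the tokens learn the alignment.
```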
[181] Toward Errorless Training ImageNet-1k
Bo Deng, Levi Heath
Main category: cs.CV
TL;DR: ANN trained on ImageNet 2012 achieves 98.3% accuracy using new method, with 322M parameters. Suspects double-labeling prevents 100% accuracy.
Details
Motivation: To achieve high accuracy on ImageNet dataset using feedforward artificial neural networks with a new training method.Method: Feedforward artificial neural network trained on ImageNet 2012 dataset using a new method, employing 322,430,160 parameters with 4 decimal places precision.
Result: Achieved 98.3% accuracy rate with 99.69% Top-1 rate, and average 285.9 perfectly classified labels per batch partition. Best model uses 322M parameters.
Conclusion: The model performs exceptionally well but doesn’t reach 100% accuracy likely due to double-labeling issues (duplicate images with different labels) in the dataset.
Abstract: In this paper, we describe a feedforward artificial neural network trained on the ImageNet 2012 contest dataset [7] with the new method of [5] to an accuracy rate of 98.3% with a 99.69 Top-1 rate, and an average of 285.9 labels that are perfectly classified over the 10 batch partitions of the dataset. The best performing model uses 322,430,160 parameters, with 4 decimal places precision. We conjecture that the reason our model does not achieve a 100% accuracy rate is due to a double-labeling problem, by which there are duplicate images in the dataset with different labels.
[182] ETA: Energy-based Test-time Adaptation for Depth Completion
Younjoon Chung, Hyoungseob Park, Patrick Rim, Xiaoran Zhang, Jihe He, Ziyao Zeng, Safa Cicek, Byung-Woo Hong, James S. Duncan, Alex Wong
Main category: cs.CV
TL;DR: Energy-based Test-time Adaptation (ETA) method improves depth completion model performance on novel target data by using adversarial perturbations to train an energy model that scores predictions as in/out-of-distribution, then updating model parameters at test time to minimize energy and align with source distribution.
Details
Motivation: Depth completion models trained on source data often fail when transferred to target data from novel environments due to covariate shift, and there's no access to target data prior to deployment.Method: Utilize adversarial perturbations to explore data space and train an energy model that scores depth predictions as in- or out-of-distribution. Update pretrained model parameters at test time to minimize energy and align predictions with source distribution.
Result: ETA improves over previous state-of-the-art by average 6.94% for outdoors and 10.23% for indoors across three indoor and three outdoor datasets.
Conclusion: The proposed energy-based test-time adaptation method effectively addresses covariate shift in depth completion without requiring target data prior to deployment, achieving significant performance improvements.
Abstract: We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some “source” data, often predict erroneous outputs when transferred to “target” data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method “Energy-based Test-time Adaptation”, or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% outdoors and 10.23% indoors. Project Page: https://fuzzythecat.github.io/eta.
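A schematic of the test-time loop, assuming a pretrained energy_model that scores predictions as in- versus out-of-distribution (its adversarial-perturbation training is not shown), might look like this; names and hyperparameters are illustrative.

```python
import torch

def eta_adapt(model, energy_model, image, sparse_depth, steps=1, lr=1e-5):
    """One test-time adaptation loop in the spirit of ETA (illustrative).

    energy_model is assumed to assign lower scores to predictions closer
    to the source distribution, so descending its output nudges the
    depth completion model back toward source-like behavior.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        pred = model(image, sparse_depth)
        energy = energy_model(pred).mean()  # scalar OOD score
        optimizer.zero_grad()
        energy.backward()  # push parameters toward low-energy predictions
        optimizer.step()
    with torch.no_grad():
        return model(image, sparse_depth)
```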
[183] A Novel Image Similarity Metric for Scene Composition Structure
Md Redwanul Haque, Manzur Murshed, Manoranjan Paul, Tsz-Kwan Lee
Main category: cs.CV
TL;DR: SCSSIM is a novel training-free metric that evaluates Scene Composition Structure preservation in AI-generated images using cuboidal hierarchical partitioning and statistical measures, outperforming traditional metrics in structural fidelity assessment.
Details
Motivation: Traditional image similarity metrics fail to adequately assess Scene Composition Structure (SCS) integrity in generative AI outputs, as pixel-level methods are noise-sensitive and perception-based metrics prioritize aesthetics over structural accuracy.Method: The proposed SCSSIM metric uses an analytical, training-free approach with cuboidal hierarchical partitioning of images to extract statistical measures that capture non-object-based structural relationships and quantify SCS preservation.
Result: SCSSIM demonstrates high invariance to non-compositional distortions while showing strong monotonic decrease for compositional distortions, accurately indicating when SCS has been altered, outperforming existing metrics.
Conclusion: SCSSIM provides superior structural evaluation capabilities for generative models, ensuring scene composition integrity without training overheads, making it an invaluable tool for AI model development and evaluation.
Abstract: The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image’s underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM’s high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.
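For intuition, a crude training-free structural signature can be built by hierarchically partitioning an image and recording per-block statistics. The sketch below uses simple quadrant splits and block means as stand-ins; the paper's actual Cuboidal partitioning and statistical measures are more sophisticated.

```python
import numpy as np

def block_signature(img, depth: int = 3) -> np.ndarray:
    """Hedged sketch: hierarchical block statistics as a structural signature.

    Recursively split the image into quadrants and record each block's mean
    intensity; comparing two images' signatures (e.g. by correlation) gives
    a crude, training-free notion of scene-composition similarity.
    """
    stats = []

    def split(block, d):
        stats.append(float(block.mean()))
        if d == 0 or min(block.shape[:2]) < 2:
            return
        h, w = block.shape[0] // 2, block.shape[1] // 2
        for sub in (block[:h, :w], block[:h, w:], block[h:, :w], block[h:, w:]):
            split(sub, d - 1)

    split(np.asarray(img, dtype=np.float64), depth)
    return np.array(stats)
```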
[184] WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang
Main category: cs.CV
TL;DR: WeTok is a new visual tokenizer that achieves superior compression ratios and reconstruction fidelity through group-wise lookup-free quantization and generative decoding, setting new state-of-the-art performance on ImageNet benchmarks.
Details
Motivation: Existing visual tokenizers face unsatisfactory trade-offs between compression ratios and reconstruction fidelity, creating a need for more efficient and effective tokenization methods.Method: Two core innovations: (1) Group-wise lookup-free Quantization (GQ) that partitions latent features into groups for efficient quantization, and (2) Generative Decoding (GD) with prior noise variable to probabilistically model visual data distribution.
Result: Achieves record-low zero-shot rFID of 0.12 with 400% compression ratio on ImageNet 50k validation set, and 3.49 rFID with 768 compression ratio, outperforming previous state-of-the-art methods like FLUX-VAE and SD-VAE 3.5.
Conclusion: WeTok tokenizer demonstrates superior performance in both compression efficiency and reconstruction quality, making it a powerful solution for vision generation tasks with significant improvements over existing approaches.
Abstract: Visual tokenizer is a critical component for vision generation. However, the existing tokenizers often face unsatisfactory trade-off between compression ratios and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior of extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show superior performance of our WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19) with a 400% compression ratio. Furthermore, our highest compression model achieves a zero-shot rFID of 3.49 with a compression ratio of 768, outperforming Cosmos (384) 4.57 which has only 50% compression rate of ours. Code and models are available: https://github.com/zhuangshaobin/WeTok.
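A minimal sketch of group-wise lookup-free quantization, under our simplifying assumption that each latent dimension is binarized by sign with a straight-through estimator; the paper's exact formulation may differ.

```python
import torch

def groupwise_lfq(latents: torch.Tensor, num_groups: int):
    """Group-wise lookup-free quantization sketch (our simplification).

    Channels are split into groups and each group is binarized by sign, so
    the implicit codebook size grows as 2**(dims per group) without any
    stored codebook or nearest-neighbour lookup.
    """
    b, c = latents.shape[:2]
    assert c % num_groups == 0, "channels must divide evenly into groups"
    groups = latents.reshape(b, num_groups, c // num_groups, *latents.shape[2:])
    codes = torch.where(groups >= 0,
                        torch.ones_like(groups), -torch.ones_like(groups))
    # Straight-through estimator keeps the encoder trainable.
    quantized = groups + (codes - groups).detach()
    return quantized.reshape_as(latents), codes
```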
[185] JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics
Simindokht Jahangard, Mehrzad Mohammadi, Yi Shen, Zhixi Cai, Hamid Rezatofighi
Main category: cs.CV
TL;DR: The paper introduces JRDB-Reasoning, a new benchmark for visual reasoning in human-crowded environments, addressing limitations of existing benchmarks by formalizing reasoning complexity and providing customizable questions with detailed step-by-step annotations.
Details
Motivation: Existing visual reasoning benchmarks lack clear complexity definitions, control over question difficulty, task customization, and structured reasoning annotations, which limits their effectiveness for evaluating embodied AI agents like robots.Method: The authors formalize reasoning complexity, develop an adaptive query engine that generates customizable questions with varying complexity levels, and extend the JRDB dataset with human-object interaction and geometric relationship annotations to create JRDB-Reasoning.
Result: The paper presents JRDB-Reasoning as a comprehensive benchmark that enables fine-grained evaluation of visual reasoning frameworks and dynamic assessment of visual-language models across different reasoning complexity levels.
Conclusion: The proposed benchmark and adaptive query engine bridge critical gaps in visual reasoning evaluation, providing structured annotations and complexity-controlled questions that better serve the needs of embodied AI research in human-crowded environments.
Abstract: Recent advances in Vision-Language Models (VLMs) and large language models (LLMs) have greatly enhanced visual reasoning, a key capability for embodied AI agents like robots. However, existing visual reasoning benchmarks often suffer from several limitations: they lack a clear definition of reasoning complexity, offer no control over question difficulty or task customization, and fail to provide structured, step-by-step reasoning annotations (workflows). To bridge these gaps, we formalize reasoning complexity, introduce an adaptive query engine that generates customizable questions of varying complexity with detailed intermediate annotations, and extend the JRDB dataset with human-object interaction and geometric relationship annotations to create JRDB-Reasoning, a benchmark tailored for visual reasoning in human-crowded environments. Our engine and benchmark enable fine-grained evaluation of visual reasoning frameworks and dynamic assessment of visual-language models across reasoning levels.
[186] AdaRing: Towards Ultra-Light Vision-Language Adaptation via Cross-Layer Tensor Ring Decomposition
Ying Huang, Yuanbin Man, Wenqi Jia, Zhengzhong Tu, Junzhou Huang, Miao Yin
Main category: cs.CV
TL;DR: AdaRing: A novel vision-language fine-tuning framework using cross-layer tensor ring decomposition to reduce adapter redundancy and improve parameter efficiency by 90% while maintaining state-of-the-art performance.
Details
Motivation: Existing adapter-based fine-tuning methods for vision-language models face limitations in compression rates due to cross-layer redundancy and limited representational capacity across homogeneous adapters.Method: Proposes AdaRing framework using cross-layer tensor ring decomposition to formulate adapters as layer-shared tensor cores and layer-specific slices, with diverse rank-driven adapters guided by generalization-aware fine-tuning.
Result: Achieves state-of-the-art performance while reducing average training parameters by 90% across various tasks.
Conclusion: The proposed tensor ring decomposition approach effectively removes cross-layer redundancy and enables diverse adapter collaboration, providing ultra-light parameter-efficient adaptation for vision-language models.
Abstract: Adapter-based fine-tuning has gained remarkable attention in adapting large pre-trained vision language models (VLMs) for a wide range of downstream tasks efficiently. In this paradigm, only the inserted adapters are fine-tuned, without the need for training the original VLM backbone. Existing works scale adapters by integrating them into every layer of VLMs to increase the capacity of adapters. However, these methods face two primary limitations: 1) limited compression rate due to ignoring cross-layer redundancy, and 2) limited representational capacity across homogeneous adapters. In this paper, we propose a novel vision-language fine-tuning framework based on cross-layer tensor ring decomposition (TRD) with the integration and collaboration of diverse adapters, called AdaRing, achieving ultra-light parameter-efficient adaptation of VLMs on various tasks. To remove the high redundancy that exists among adapters across layers, we exploit the tensor-level low-rankness to formulate adapters as layer-shared tensor cores and layer-specific slices. Moreover, guided by generalization-aware fine-tuning, diverse rank-driven adapters cooperate to handle tasks that require different representations. Our experiments show that the proposed AdaRing achieves the state-of-the-art performance while reducing average training parameters by 90%.
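To convey the layer-shared-core / layer-specific-slice idea without the full tensor-ring machinery, here is a drastically simplified low-rank stand-in in which two factor matrices are shared across all layers and each layer owns only a small diagonal slice; all names and sizes are illustrative, and the paper's actual tensor-ring cores generalize this.

```python
import torch
import torch.nn as nn

class SharedSliceAdapter(nn.Module):
    """Simplified stand-in for AdaRing's cross-layer factorization.

    Two factor matrices are shared across all layers; each layer owns only
    a rank-sized 'slice', so per-layer parameters shrink dramatically
    compared with independent per-layer adapters.
    """
    def __init__(self, dim: int, rank: int, num_layers: int):
        super().__init__()
        self.down = nn.Parameter(torch.randn(dim, rank) * 0.02)   # shared
        self.up = nn.Parameter(torch.zeros(rank, dim))            # shared
        self.slices = nn.Parameter(torch.ones(num_layers, rank))  # per layer

    def forward(self, x: torch.Tensor, layer: int) -> torch.Tensor:
        # Residual adapter: project down, scale by the layer slice, project up.
        return x + ((x @ self.down) * self.slices[layer]) @ self.up
```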
[187] VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip
Wenqi Guo, Shan Du
Main category: cs.CV
TL;DR: VSF is a simple method that flips attention values from negative prompts to suppress unwanted content in few-step diffusion models, outperforming existing methods with minimal computational overhead.
Details
Motivation: Existing negative prompt guidance methods like CFG, NASA, and NAG have limitations in effectively suppressing undesired content in few-step diffusion models, requiring a more efficient and dynamic approach.Method: Value Sign Flip (VSF) dynamically suppresses unwanted content by flipping the sign of attention values from negative prompts, requiring minimal computational overhead and working with MMDiT-style and cross-attention-based architectures.
Result: VSF demonstrates superior performance in both static image and video generation tasks, significantly improving negative prompt adherence compared to prior methods while maintaining competitive image quality.
Conclusion: VSF provides an efficient and effective solution for negative prompt guidance in few-step diffusion models, outperforming existing approaches with minimal computational cost.
Abstract: We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even CFG in non-few-step models, while maintaining competitive image quality. Code and ComfyUI node are available in https://github.com/weathon/VSF/tree/main.
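The core trick is small enough to sketch in a few lines: negative-prompt keys and values join the attention as usual, but the values enter with flipped sign, so attention mass on undesired content actively pushes the output away from it. This is our minimal reading of the mechanism, not the released implementation.

```python
import torch
import torch.nn.functional as F

def attention_with_vsf(q, k_pos, v_pos, k_neg, v_neg):
    """Cross-attention with Value Sign Flip (illustrative sketch).

    q: (batch, n_queries, dim); k_*/v_*: (batch, n_tokens, dim).
    Negative-prompt values participate with flipped sign, so any attention
    they attract subtracts their content from the output.
    """
    k = torch.cat([k_pos, k_neg], dim=1)
    v = torch.cat([v_pos, -v_neg], dim=1)  # the sign flip
    scale = q.shape[-1] ** -0.5
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```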
[188] Impact of Clinical Image Quality on Efficient Foundation Model Finetuning
Yucheng Tang, Pawel Rajwa, Alexander Ng, Yipei Wang, Wen Yan, Natasha Thorley, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, Shonit Punwani, Daniel C. Alexander, Veeru Kasivisvanathan, Yipeng Hu
Main category: cs.CV
TL;DR: Study evaluates how image quality affects label-efficient finetuning of ProFound foundation model for prostate MRI, finding that quality distribution and mismatch between finetuning/test sets significantly impact performance.
Details
Motivation: To investigate the impact of variable image quality on label-efficient finetuning of medical imaging foundation models and quantify how image quality distribution affects model generalizability.Method: Comprehensive experiments systematically varying ratios of high- and low-quality images in finetuning and evaluation sets using ProFound, a domain-specific vision foundation model pretrained on large-scale prostate MRI datasets.
Result: Image quality distribution and finetune-test mismatch significantly affect performance. A sufficient number of high-quality images in the finetuning set is critical. Label efficiency is not independent of image quality distribution; without enough high-quality images, finetuned models may fail to outperform non-pretrained models.
Conclusion: While foundation models show promising label efficiency, their performance is highly dependent on image quality distribution and consistency between finetuning and testing data, with varying importance across different downstream tasks like radiology reporting vs cancer detection.
Abstract: Foundation models in medical imaging have shown promising label efficiency, achieving high performance on downstream tasks using only a fraction of the annotated data otherwise required. In this study, we evaluate this potential in the context of prostate multiparametric MRI using ProFound, a recently developed domain-specific vision foundation model pretrained on large-scale prostate MRI datasets. We investigate the impact of variable image quality on the label-efficient finetuning, by quantifying the generalisability of the finetuned models. We conduct a comprehensive set of experiments by systematically varying the ratios of high- and low-quality images in the finetuning and evaluation sets. Our findings indicate that image quality distribution and its finetune-and-test mismatch significantly affect model performance. In particular: a) Varying the ratio of high- to low-quality images between finetuning and test sets leads to notable differences in downstream performance; and b) The presence of sufficient high-quality images in the finetuning set is critical for maintaining strong performance, whilst the importance of matched finetuning and testing distribution varies between different downstream tasks, such as automated radiology reporting and prostate cancer detection. Importantly, experimental results also show that, although finetuning requires significantly less labeled data compared to training from scratch when the quality ratio is consistent, this label efficiency is not independent of the image quality distribution. For example, we show cases that, without sufficient high-quality images in finetuning, finetuned models may fail to outperform those without pretraining.
[189] Superpixel-informed Continuous Low-Rank Tensor Representation for Multi-Dimensional Data Recovery
Zhizhou Wang, Jianli Wang, Ruijing Zheng, Zhenyu Wu
Main category: cs.CV
TL;DR: SCTR framework overcomes limitations of traditional low-rank tensor methods by using superpixels as modeling units and neural network parameterization for continuous, flexible multi-dimensional data processing.
Details
Motivation: Traditional low-rank tensor representation methods assume holistic data is low-rank and are limited to discrete meshgrid data, which doesn't hold in real-world scenarios with spatial variations.Method: Proposes Superpixel-informed Continuous low-rank Tensor Representation (SCTR) using superpixels as basic units and asymmetric low-rank tensor factorization with neural network parameterization to separate global patterns from local adaptations.
Result: Achieves 3-5 dB PSNR improvements over existing LRTR-based methods across multispectral images, videos, and color images on benchmark datasets.
Conclusion: SCTR provides a more flexible and effective framework for multi-dimensional data processing that handles spatial variations and continuous data beyond traditional grid constraints.
Abstract: Low-rank tensor representation (LRTR) has emerged as a powerful tool for multi-dimensional data processing. However, classical LRTR-based methods face two critical limitations: (1) they typically assume that the holistic data is low-rank, this assumption is often violated in real-world scenarios with significant spatial variations; and (2) they are constrained to discrete meshgrid data, limiting their flexibility and applicability. To overcome these limitations, we propose a Superpixel-informed Continuous low-rank Tensor Representation (SCTR) framework, which enables continuous and flexible modeling of multi-dimensional data beyond traditional grid-based constraints. Our approach introduces two main innovations: First, motivated by the observation that semantically coherent regions exhibit stronger low-rank characteristics than holistic data, we employ superpixels as the basic modeling units. This design not only encodes rich semantic information, but also enhances adaptability to diverse forms of data streams. Second, we propose a novel asymmetric low-rank tensor factorization (ALTF) where superpixel-specific factor matrices are parameterized by a shared neural network with specialized heads. By strategically separating global pattern learning from local adaptation, this framework efficiently captures both cross-superpixel commonalities and within-superpixel variations. This yields a representation that is both highly expressive and compact, balancing model efficiency with adaptability. Extensive experiments on several benchmark datasets demonstrate that SCTR achieves 3-5 dB PSNR improvements over existing LRTR-based methods across multispectral images, videos, and color images.
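The shared-network parameterization can be pictured as one trunk that learns cross-superpixel commonalities plus per-superpixel heads that emit each region's factor matrix; the sketch below is a simplified stand-in with illustrative sizes, not the paper's ALTF.

```python
import torch
import torch.nn as nn

class SharedFactorNet(nn.Module):
    """Sketch of SCTR's shared-network factorization (our simplification).

    A single shared trunk captures cross-superpixel commonalities; each
    superpixel's head maps a learned code to that region's factor matrix,
    capturing within-superpixel variation.
    """
    def __init__(self, num_superpixels: int, code_dim=32, rank=8, rows=64):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_superpixels, code_dim))
        self.trunk = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(128, rows * rank) for _ in range(num_superpixels)]
        )
        self.rank, self.rows = rank, rows

    def factor(self, i: int) -> torch.Tensor:
        # Factor matrix for superpixel i, shaped (rows, rank).
        return self.heads[i](self.trunk(self.codes[i])).view(self.rows, self.rank)
```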
[190] 2D Gaussians Meet Visual Tokenizer
Yiang Shi, Xiaoyang Guo, Wei Yin, Mingkai Jia, Qian Zhang, Xiaolin Hu, Wenyu Liu, Xinggang Wang
Main category: cs.CV
TL;DR: VGQ is a new image tokenizer that uses 2D Gaussians to better capture geometric structures in images, outperforming existing methods like VQ-GAN on reconstruction quality.
Details
Motivation: Existing quantization-based tokenizers like VQ-GAN focus mainly on appearance features (texture, color) and neglect geometric structures due to their patch-based design, limiting their ability to model structured visual information.Method: Proposed Visual Gaussian Quantization (VGQ) framework that integrates 2D Gaussians into visual codebook quantization. Encodes image latents as 2D Gaussian distributions that explicitly model structure-related parameters like position, rotation, and scale.
Result: Achieved strong reconstruction quality with rFID score of 1.00 on ImageNet 256x256. By increasing 2D Gaussian density, achieved state-of-the-art rFID score of 0.556 and PSNR of 24.93, substantially outperforming existing methods.
Conclusion: VGQ provides a flexible trade-off between token efficiency and visual richness, demonstrating that incorporating explicit structural modeling through 2D Gaussians significantly improves image tokenization and reconstruction capabilities.
Abstract: The image tokenizer is a critical component in AR image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explored how to incorporate more visual information into the tokenizer and proposed a new framework named Visual Gaussian Quantization (VGQ), a novel tokenizer paradigm that explicitly enhances structural modeling by integrating 2D Gaussians into traditional visual codebook quantization frameworks. Our approach addresses the inherent limitations of naive quantization methods such as VQ-GAN, which struggle to model structured visual information due to their patch-based design and emphasis on texture and color. In contrast, VGQ encodes image latents as 2D Gaussian distributions, effectively capturing geometric and spatial structures by directly modeling structure-related parameters such as position, rotation and scale. We further demonstrate that increasing the density of 2D Gaussians within the tokens leads to significant gains in reconstruction fidelity, providing a flexible trade-off between token efficiency and visual richness. On the ImageNet 256x256 benchmark, VGQ achieves strong reconstruction quality with an rFID score of 1.00. Furthermore, by increasing the density of 2D Gaussians within the tokens, VGQ gains a significant boost in reconstruction capability and achieves a state-of-the-art reconstruction rFID score of 0.556 and a PSNR of 24.93, substantially outperforming existing methods. Codes will be released soon.
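For intuition, rendering one token's 2D Gaussian footprint makes the structure-related parameters (position, rotation, scale) explicit, in contrast to patch-based tokens; this toy function is our illustration of the parameterization, not the released code.

```python
import torch

def gaussian2d_footprint(params: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Toy rendering of one token's 2D Gaussian footprint (our sketch).

    params = [x, y, theta, sx, sy] makes the structural quantities explicit;
    grid has shape (H, W, 2) with coordinates in [0, 1].
    """
    x, y, theta, sx, sy = params.unbind(-1)
    cos, sin = torch.cos(theta), torch.sin(theta)
    dx, dy = grid[..., 0] - x, grid[..., 1] - y
    # Rotate offsets into the Gaussian's local frame, then scale each axis.
    u = (cos * dx + sin * dy) / sx
    v = (-sin * dx + cos * dy) / sy
    return torch.exp(-0.5 * (u ** 2 + v ** 2))  # (H, W) weight map
```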
[191] DiffIER: Optimizing Diffusion Models with Iterative Error Reduction
Ao Chen, Lihe Ding, Tianfan Xue
Main category: cs.CV
TL;DR: Identifies training-inference gap in diffusion models causing sensitivity to guidance weight, proposes DiffIER optimization method to minimize accumulated error during inference, improving conditional generation quality across multiple domains.
Details
Motivation: Classifier-Free Guidance (CFG) in diffusion models shows high sensitivity to guidance weight selection due to a training-inference gap that undermines conditional generation performance.Method: Proposes DiffIER, an optimization-based method that performs iterative error minimization at each inference step to reduce accumulated error and bridge the training-inference gap.
Result: Empirical results show DiffIER outperforms baseline approaches in conditional generation tasks and achieves consistent success in text-to-image generation, image super-resolution, and text-to-speech generation.
Conclusion: The proposed plug-and-play optimization framework effectively reduces accumulated error during inference, enhancing generation quality and demonstrating versatility across multiple applications.
Abstract: Diffusion models have demonstrated remarkable capabilities in generating high-quality samples and enhancing performance across diverse domains through Classifier-Free Guidance (CFG). However, the quality of generated samples is highly sensitive to the selection of the guidance weight. In this work, we identify a critical “training-inference gap” and we argue that it is the presence of this gap that undermines the performance of conditional generation and renders outputs highly sensitive to the guidance weight. We quantify this gap by measuring the accumulated error during the inference stage and establish a correlation between the selection of guidance weight and minimizing this gap. Furthermore, to mitigate this gap, we propose DiffIER, an optimization-based method for high-quality generation. We demonstrate that the accumulated error can be effectively reduced by an iterative error minimization at each step during inference. By introducing this novel plug-and-play optimization framework, we enable the optimization of errors at every single inference step and enhance generation quality. Empirical results demonstrate that our proposed method outperforms baseline approaches in conditional generation tasks. Furthermore, the method achieves consistent success in text-to-image generation, image super-resolution, and text-to-speech generation, underscoring its versatility and potential for broad applications in future research.
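Schematically, the method wraps each sampling step in a short inner optimization. The sketch below assumes a generic per-step error_fn standing in for the paper's accumulated-error estimate, which we do not reproduce; names and hyperparameters are ours.

```python
import torch

def diffier_step(x_t, t, denoiser, error_fn, inner_iters=3, lr=0.01):
    """Schematic inner loop in the spirit of DiffIER (our reading).

    At each sampling step, the intermediate latent is refined for a few
    gradient iterations to shrink an estimate of the accumulated error
    before the usual denoising update proceeds.
    """
    x = x_t.detach().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(inner_iters):
        err = error_fn(denoiser(x, t), x, t)  # scalar error estimate
        opt.zero_grad()
        err.backward()
        opt.step()
    return x.detach()
```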
[192] Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering
Diaa Addeen Abuhani, Marco Seccaroni, Martina Mazzarello, Imran Zualkernan, Fabio Duarte, Carlo Ratti
Main category: cs.CV
TL;DR: Unsupervised AI framework uses street imagery and spatial patterns to estimate urban tree biodiversity without labels, achieving high accuracy across multiple cities.
Details
Motivation: Traditional field inventories are costly and time-consuming, while supervised AI methods require labeled data that doesn't generalize well across regions. Cities need scalable solutions for biodiversity monitoring.Method: Unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without requiring labeled data.
Result: Applied to eight North American cities, the method recovered genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices while preserving spatial autocorrelation.
Conclusion: This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and supports continuous, low-cost monitoring for equitable greenery access and adaptive urban ecosystem management.
Abstract: Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without labels. Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices and preserving spatial autocorrelation. This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and offers a pathway for continuous, low-cost monitoring to support equitable access to greenery and adaptive management of urban ecosystems.
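The two diversity indices the method targets are standard and easy to state; with unsupervised cluster assignments standing in for genus labels, they are computed as follows.

```python
import math
from collections import Counter

def shannon_index(genus_labels):
    """Shannon diversity H = -sum_i p_i * ln(p_i) over genus proportions."""
    counts = Counter(genus_labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def simpson_index(genus_labels):
    """Simpson diversity 1 - sum_i p_i^2: probability two random trees differ."""
    counts = Counter(genus_labels)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# In the unsupervised setting, cluster assignments stand in for genus labels.
print(shannon_index(["acer", "acer", "quercus", "tilia"]))
```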
[193] ViT-FIQA: Assessing Face Image Quality using Vision Transformers
Andrea Atzori, Fadi Boutros, Naser Damer
Main category: cs.CV
TL;DR: ViT-FIQA is a novel face image quality assessment method that extends Vision Transformer backbones with a learnable quality token to predict face recognition utility scores, achieving state-of-the-art performance across various benchmarks.
Details
Motivation: Current FIQA methods primarily rely on CNNs, leaving the potential of Vision Transformer architectures underexplored for face image quality assessment tasks.Method: Extends standard ViT backbones with a learnable quality token concatenated with image patch tokens, processed via global self-attention. Uses two output heads: one for face representation learning and another for quality score regression.
Result: Extensive experiments show ViT-FIQA consistently achieves top-tier performance on challenging benchmarks across both CNN- and ViT-based face recognition models.
Conclusion: Transformer-based architectures are highly effective for modeling face image utility, and ViTs represent a scalable foundation for future FIQA research.
Abstract: Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample’s utility. Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research https://cutt.ly/irHlzXUC.
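The quality-token pattern is straightforward to sketch: a learnable token is prepended to the patch tokens, and after the encoder the two heads read off recognition features and a scalar utility. Dimensions and head designs below are illustrative assumptions, and the margin-penalty softmax loss is omitted.

```python
import torch
import torch.nn as nn

class ViTFIQAHead(nn.Module):
    """Sketch of the quality-token pattern described in the abstract.

    A learnable token joins the patch tokens; after the ViT encoder, the
    patch tokens feed a recognition embedding and the quality token feeds
    a scalar utility regressor.
    """
    def __init__(self, encoder: nn.Module, dim: int = 768, embed_dim: int = 512):
        super().__init__()
        self.encoder = encoder  # any stack of standard ViT blocks
        self.quality_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.embed_head = nn.Linear(dim, embed_dim)  # face representation
        self.quality_head = nn.Linear(dim, 1)        # utility score

    def forward(self, patch_tokens: torch.Tensor):
        b = patch_tokens.shape[0]
        seq = torch.cat([self.quality_token.expand(b, -1, -1), patch_tokens], 1)
        out = self.encoder(seq)
        embedding = self.embed_head(out[:, 1:].mean(dim=1))
        quality = self.quality_head(out[:, 0])
        return embedding, quality
```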
cs.AI
[194] Large Language Models are Highly Aligned with Human Ratings of Emotional Stimuli
Mattson Ogg, Chace Ashcraft, Ritwik Bose, Raphael Norman-Tenazas, Michael Wolmetz
Main category: cs.AI
TL;DR: LLMs like GPT-4o show strong alignment with human emotional ratings for words and images, particularly for happiness, but differ in arousal ratings and show less variability than human responses.
Details
Motivation: To understand how large language models evaluate emotionally loaded stimuli and assess their alignment with human emotional responses, which is crucial for determining their effectiveness in roles involving human interaction.Method: Elicited ratings from multiple popular LLMs for datasets of words and images previously rated by humans, comparing responses across different emotional rating scales and frameworks.
Result: GPT-4o responded very similarly to human participants (r = 0.9+ in many cases), with best alignment for happiness ratings and poorer alignment for arousal. LLMs aligned better with five-category emotion framework than two-dimensional organization, and showed substantially more homogeneous ratings than humans.
Conclusion: LLMs demonstrate significant but imperfect alignment with human emotional evaluation, highlighting both similarities and differences between biological and artificial intelligence in emotional processing domains.
Abstract: Emotions exert an immense influence over human behavior and cognition in both commonplace and high-stress tasks. Discussions of whether or how to integrate large language models (LLMs) into everyday life (e.g., acting as proxies for, or interacting with, human agents), should be informed by an understanding of how these tools evaluate emotionally loaded stimuli or situations. A model’s alignment with human behavior in these cases can inform the effectiveness of LLMs for certain roles or interactions. To help build this understanding, we elicited ratings from multiple popular LLMs for datasets of words and images that were previously rated for their emotional content by humans. We found that when performing the same rating tasks, GPT-4o responded very similarly to human participants across modalities, stimuli and most rating scales (r = 0.9 or higher in many cases). However, arousal ratings were less well aligned between human and LLM raters, while happiness ratings were most highly aligned. Overall LLMs aligned better within a five-category (happiness, anger, sadness, fear, disgust) emotion framework than within a two-dimensional (arousal and valence) organization. Finally, LLM ratings were substantially more homogeneous than human ratings. Together, these results begin to describe how LLM agents interpret emotional stimuli and highlight similarities and differences among biological and artificial intelligence in key behavioral domains.
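The alignment numbers are plain Pearson correlations between the two sets of ratings; for reference, a dependency-free version:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation, as used to compare LLM and human ratings."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# e.g. valence ratings for the same word list from humans and an LLM
print(pearson_r([7.1, 2.3, 5.0, 8.2], [6.8, 2.9, 4.6, 8.0]))
```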
[195] Explaining Hitori Puzzles: Neurosymbolic Proof Staging for Sequential Decisions
Maria Leonor Pacheco, Fabio Somenzi, Dananjay Srinivas, Ashutosh Trivedi
Main category: cs.AI
TL;DR: Neurosymbolic approach combining SAT solvers and LLMs for explaining Hitori puzzle solutions, using resolution proofs for local constraints and visual explanations for connectivity.
Details
Motivation: To create effective explanations for complex decision sequences by leveraging both formal reasoning (SAT solvers) and natural language capabilities (LLMs), using Hitori puzzles as a test case that requires both local constraint reasoning and visual connectivity explanations.Method: Combines decision procedures (SAT solvers) for generating short resolution proofs of local constraints with Large Language Models for producing visual explanations of connectivity constraints in Hitori puzzles.
Result: Implemented a tool that assists humans in solving Hitori puzzles, with experimental evidence demonstrating its effectiveness in providing comprehensive explanations.
Conclusion: The neurosymbolic approach successfully integrates formal reasoning and language models to provide flexible and effective explanations for complex decision problems like Hitori puzzles.
Abstract: We propose a neurosymbolic approach to the explanation of complex sequences of decisions that combines the strengths of decision procedures and Large Language Models (LLMs). We demonstrate this approach by producing explanations for the solutions of Hitori puzzles. The rules of Hitori include local constraints that are effectively explained by short resolution proofs. However, they also include a connectivity constraint that is more suitable for visual explanations. Hence, Hitori provides an excellent testing ground for a flexible combination of SAT solvers and LLMs. We have implemented a tool that assists humans in solving Hitori puzzles, and we present experimental evidence of its effectiveness.
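The local rules map naturally onto CNF, which is why short resolution proofs explain them well; below is a sketch of that encoding (DIMACS-style integer variables, our own naming), with the connectivity constraint deliberately left to the visual/LLM side, as the paper suggests.

```python
from itertools import combinations

def hitori_local_clauses(grid):
    """CNF for Hitori's local rules (connectivity handled separately).

    Variable v(r, c) is true when cell (r, c) is shaded. Clauses encode:
    duplicate numbers in a row/column cannot both stay unshaded, and two
    orthogonally adjacent cells cannot both be shaded.
    """
    n = len(grid)
    v = lambda r, c: r * n + c + 1  # DIMACS-style positive integers
    clauses = []
    lines = [[(r, c) for c in range(n)] for r in range(n)] + \
            [[(r, c) for r in range(n)] for c in range(n)]
    for line in lines:
        for (r1, c1), (r2, c2) in combinations(line, 2):
            if grid[r1][c1] == grid[r2][c2]:
                clauses.append([v(r1, c1), v(r2, c2)])  # at least one shaded
    for r in range(n):
        for c in range(n):
            if r + 1 < n:
                clauses.append([-v(r, c), -v(r + 1, c)])  # no vertical pair
            if c + 1 < n:
                clauses.append([-v(r, c), -v(r, c + 1)])  # no horizontal pair
    return clauses
```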
[196] Automated Optimization Modeling through Expert-Guided Large Language Model Reasoning
Beinuo Yang, Qishen Zhou, Junyi Li, Xingchen Su, Simon Hu
Main category: cs.AI
TL;DR: ORThought framework uses chain-of-thought reasoning to automate optimization modeling, outperforming existing methods on complex problems with improved benchmarks and error correction.
Details
Motivation: Current LLM approaches for optimization modeling suffer from high error rates (up to 42%), narrow evaluation scope, and computational inefficiency, requiring heavy reliance on domain experts.Method: Enhanced datasets through systematic error correction and comprehensive annotation, introduced LogiOR benchmark from logistics domain, and developed ORThought framework using expert-level optimization principles through chain-of-thought reasoning.
Result: ORThought outperforms existing approaches including multi-agent frameworks, with significant advantages on complex optimization problems.
Conclusion: The framework provides valuable insights for future LLM-based optimization modeling research, addressing critical limitations in current methods.
Abstract: Optimization Modeling (OM) is essential for solving complex decision-making problems. However, the process remains time-consuming and error-prone, heavily relying on domain experts. While Large Language Models (LLMs) show promise in addressing these challenges through their natural language understanding and reasoning capabilities, current approaches face three critical limitations: high benchmark labeling error rates reaching up to 42%, narrow evaluation scope that only considers optimal values, and computational inefficiency due to heavy reliance on multi-agent systems or model fine-tuning. In this work, we first enhance existing datasets through systematic error correction and more comprehensive annotation. Additionally, we introduce LogiOR, a new optimization modeling benchmark from the logistics domain, containing more complex problems with standardized annotations. Furthermore, we present ORThought, a novel framework that leverages expert-level optimization modeling principles through chain-of-thought reasoning to automate the OM process. Through extensive empirical evaluation, we demonstrate that ORThought outperforms existing approaches, including multi-agent frameworks, with particularly significant advantages on complex optimization problems. Finally, we provide a systematic analysis of our method, identifying critical success factors and failure modes, providing valuable insights for future research on LLM-based optimization modeling.
[197] The Agent Behavior: Model, Governance and Challenges in the AI Digital Age
Qiang Zhang, Pei Yan, Yijia Xu, Chuanpo Fu, Yong Fang, Yang Liu
Main category: cs.AI
TL;DR: Proposes Network Behavior Lifecycle model and A4A paradigm to analyze human-agent behavioral differences across 6 stages and 5 dimensions, addressing trust and security challenges in AI-human collaboration.
Details
Motivation: AI agents increasingly mimic human behavior in networked environments, creating challenges in trust, responsibility, ethics, and security due to difficulties in supervising agent behaviors and unclear accountability.Method: Develops Network Behavior Lifecycle model (6 stages) and Agent for Agent (A4A) paradigm with Human-Agent Behavioral Disparity (HABD) model analyzing 5 dimensions: decision mechanism, execution efficiency, intention-behavior consistency, behavioral inertia, and irrational patterns.
Result: Model effectiveness verified through real-world cases including red team penetration and blue team defense scenarios.
Conclusion: Provides theoretical foundation and technical roadmap for secure human-agent collaboration, with future research directions in dynamic cognitive governance, behavioral disparity quantification, and meta-governance protocol stacks.
Abstract: Advancements in AI have led to agents in networked environments increasingly mirroring human behavior, thereby blurring the boundary between artificial and human actors in specific contexts. This shift brings about significant challenges in trust, responsibility, ethics, and security. The difficulty of supervising agent behaviors may lead to issues such as data contamination and unclear accountability. To address these challenges, this paper proposes the “Network Behavior Lifecycle” model, which divides network behavior into 6 stages and systematically analyzes the behavioral differences between humans and agents at each stage. Based on these insights, the paper further introduces the “Agent for Agent (A4A)” paradigm and the “Human-Agent Behavioral Disparity (HABD)” model, which examine the fundamental distinctions between human and agent behaviors across 5 dimensions: decision mechanism, execution efficiency, intention-behavior consistency, behavioral inertia, and irrational patterns. The effectiveness of the model is verified through real-world cases such as red team penetration and blue team defense. Finally, the paper discusses future research directions in dynamic cognitive governance architecture, behavioral disparity quantification, and meta-governance protocol stacks, aiming to provide a theoretical foundation and technical roadmap for secure and trustworthy human-agent collaboration.
[198] Who Sees What? Structured Thought-Action Sequences for Epistemic Reasoning in LLMs
Luca Annese, Sabrina Patania, Silvia Serino, Tom Foulsham, Silvia Rossi, Azzurra Ruggeri, Dimitri Ognibene
Main category: cs.AI
TL;DR: Structured examples derived from planner solution graphs are used to improve LLM-based agents’ perspective-taking, but results show only limited gains, with persistent challenges in complex scenarios.
Details
Motivation: Improve perspective-taking capabilities of LLM-based agents for tasks involving active perception, collaborative reasoning, and understanding what other agents can see or know.
Method: Proposed a structured solution-processing pipeline generating three example types (G-type, E-type, L-type) from Fast Downward planner solution graphs, converted into thought-action examples via LLM prompting.
Result: L-type examples slightly reduced clarification requests and action steps but didn’t yield consistent improvements. Agents succeeded in basic attentional filtering but struggled with mentalising about occluded spaces and weighing epistemic action costs.
Conclusion: Structured examples alone are insufficient for robust perspective-taking; explicit belief tracking, cost modelling, and richer environments are needed for socially grounded collaboration in LLM-based agents.
Abstract: Recent advances in large language models (LLMs) and reasoning frameworks have opened new possibilities for improving the perspective-taking capabilities of autonomous agents. However, tasks that involve active perception, collaborative reasoning, and perspective taking (understanding what another agent can see or knows) pose persistent challenges for current LLM-based systems. This study investigates the potential of structured examples derived from transformed solution graphs generated by the Fast Downward planner to improve the performance of LLM-based agents within a ReAct framework. We propose a structured solution-processing pipeline that generates three distinct categories of examples: optimal goal paths (G-type), informative node paths (E-type), and step-by-step optimal decision sequences contrasting alternative actions (L-type). These solutions are further converted into “thought-action” examples by prompting an LLM to explicitly articulate the reasoning behind each decision. While L-type examples slightly reduce clarification requests and overall action steps, they do not yield consistent improvements. Agents are successful in tasks requiring basic attentional filtering but struggle in scenarios that require mentalising about occluded spaces or weighing the costs of epistemic actions. These findings suggest that structured examples alone are insufficient for robust perspective-taking, underscoring the need for explicit belief tracking, cost modelling, and richer environments to enable socially grounded collaboration in LLM-based agents.
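To make the “thought-action” conversion concrete, here is a minimal sketch of turning a planner solution path into ReAct-style few-shot examples. The path, reasons, and template are invented for illustration; in the paper, the reasoning text is produced by prompting an LLM.

```python
# Minimal sketch: turning a planner solution path into ReAct-style
# "thought-action" few-shot examples. Steps and wording are invented.
solution_path = [
    {"action": "move(roomA, roomB)", "reason": "the target object was last seen in roomB"},
    {"action": "look_behind(screen)", "reason": "the screen occludes part of roomB from the other agent"},
    {"action": "pick_up(mug)", "reason": "the mug matches the requested object"},
]

def to_thought_action(step):
    # One (Thought, Action) pair per planner step, in ReAct format.
    reason = step["reason"]
    return f"Thought: {reason[0].upper() + reason[1:]}.\nAction: {step['action']}"

few_shot_prompt = "\n\n".join(to_thought_action(s) for s in solution_path)
print(few_shot_prompt)
```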
[199] LeanGeo: Formalizing Competitional Geometry problems in Lean
Chendong Song, Zihan Wang, Frederick Pu, Haiming Wang, Xiaohan Lin, Junqi Liu, Jia Li, Zhengying Liu
Main category: cs.AI
TL;DR: LeanGeo is a unified formal system for solving competition-level geometry problems using Lean 4 theorem prover, featuring comprehensive geometric theorems and rigorous proof verification integrated with Mathlib.
Details
Motivation: Existing geometry solving systems lack unified frameworks and struggle with verification due to reliance on intuitive diagrams, making integration with other mathematical fields difficult.
Method: Developed LeanGeo within the Lean 4 theorem prover, with a comprehensive library of high-level geometric theorems, and created the LeanGeo-Bench benchmark with problems from the IMO and other advanced sources.
Result: Evaluation shows capabilities and limitations of state-of-the-art Large Language Models on the benchmark, demonstrating the system’s effectiveness for formal geometry problem solving.
Conclusion: LeanGeo provides a unified framework for rigorous geometry problem solving and verification, highlighting the need for further advancements in automated geometric reasoning.
Abstract: Geometry problems are a crucial testbed for AI reasoning capabilities. Most existing geometry solving systems cannot express problems within a unified framework, thus are difficult to integrate with other mathematical fields. Besides, since most geometric proofs rely on intuitive diagrams, verifying geometry problems is particularly challenging. To address these gaps, we introduce LeanGeo, a unified formal system for formalizing and solving competition-level geometry problems within the Lean 4 theorem prover. LeanGeo features a comprehensive library of high-level geometric theorems with Lean’s foundational logic, enabling rigorous proof verification and seamless integration with Mathlib. We also present LeanGeo-Bench, a formal geometry benchmark in LeanGeo, comprising problems from the International Mathematical Olympiad (IMO) and other advanced sources. Our evaluation demonstrates the capabilities and limitations of state-of-the-art Large Language Models on this benchmark, highlighting the need for further advancements in automated geometric reasoning. We open source the theorem library and the benchmark of LeanGeo at https://github.com/project-numina/LeanGeo/tree/master.
[200] Entropy-Constrained Strategy Optimization in Urban Floods: A Multi-Agent Framework with LLM and Knowledge Graph Integration
Peilin Ji, Xiao Xue, Simeng Wang, Wenhao Yan
Main category: cs.AI
TL;DR: H-J is a hierarchical multi-agent framework that addresses urban flood emergency scheduling challenges by integrating knowledge-guided prompting, entropy-constrained generation, and feedback-driven optimization, outperforming existing methods in traffic management and system robustness.
Details
Motivation: Extreme urban rainfall events cause severe traffic congestion and service disruptions, but current emergency scheduling systems struggle with dynamic trade-offs between competing goals, rapidly changing environmental conditions, and instability in LLM-generated strategies.
Method: The H-J framework uses a hierarchical multi-agent architecture with knowledge-guided prompting, entropy-constrained generation, and feedback-driven optimization to create a closed-loop pipeline from multi-source perception to strategic execution and refinement.
Result: H-J outperforms rule-based and reinforcement-learning baselines across extreme rainfall, intermittent bursts, and daily light rain conditions, achieving better traffic smoothness, task success rate, and system robustness.
Conclusion: The framework demonstrates the promise of uncertainty-aware, knowledge-constrained LLM-based approaches for enhancing urban flood response resilience through effective multi-agent coordination and dynamic optimization.
Abstract: In recent years, the increasing frequency of extreme urban rainfall events has posed significant challenges to emergency scheduling systems. Urban flooding often leads to severe traffic congestion and service disruptions, threatening public safety and mobility. However, effective decision making remains hindered by three key challenges: (1) managing trade-offs among competing goals (e.g., traffic flow, task completion, and risk mitigation) requires dynamic, context-aware strategies; (2) rapidly evolving environmental conditions render static rules inadequate; and (3) LLM-generated strategies frequently suffer from semantic instability and execution inconsistency. Existing methods fail to align perception, global optimization, and multi-agent coordination within a unified framework. To tackle these challenges, we introduce H-J, a hierarchical multi-agent framework that integrates knowledge-guided prompting, entropy-constrained generation, and feedback-driven optimization. The framework establishes a closed-loop pipeline spanning from multi-source perception to strategic execution and continuous refinement. We evaluate H-J on real-world urban topology and rainfall data under three representative conditions: extreme rainfall, intermittent bursts, and daily light rain. Experiments show that H-J outperforms rule-based and reinforcement-learning baselines in traffic smoothness, task success rate, and system robustness. These findings highlight the promise of uncertainty-aware, knowledge-constrained LLM-based approaches for enhancing resilience in urban flood response.
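The abstract does not spell out the entropy constraint, so the following is one hedged reading of “entropy-constrained generation”: sample several candidate strategies, accept only when their empirical entropy is below a stability threshold, and re-prompt otherwise. The sampler and threshold below are placeholders, not the paper’s algorithm.

```python
# One possible reading of entropy-constrained strategy generation:
# accept an LLM strategy only when repeated samples agree enough.
import math
import random
from collections import Counter

def empirical_entropy(samples):
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def stable_strategy(sample_strategy, k=8, max_entropy=1.0):
    samples = [sample_strategy() for _ in range(k)]
    if empirical_entropy(samples) <= max_entropy:      # consistent: accept
        return Counter(samples).most_common(1)[0][0]
    return None                                        # unstable: re-prompt

# Placeholder sampler standing in for an LLM proposing flood-response actions.
sampler = lambda: random.choice(["close_road_A", "close_road_A", "reroute_B"])
print(stable_strategy(sampler))
```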
[201] MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li
Main category: cs.AI
TL;DR: MCP-Universe is the first comprehensive benchmark for evaluating LLMs in realistic tasks through interaction with real-world MCP servers, revealing significant performance limitations in even state-of-the-art models across 6 core domains.
Details
Motivation: Existing benchmarks are overly simplistic and fail to capture real application challenges like long-horizon reasoning and large, unfamiliar tool spaces in the rapidly adopted Model Context Protocol ecosystem.
Method: Developed the MCP-Universe benchmark with 6 domains spanning 11 MCP servers, implementing execution-based evaluators, including format, static, and dynamic evaluators that automatically retrieve real-time ground truth for temporal tasks.
Result: Even SOTA models show significant limitations: GPT-5 (43.72%), Grok-4 (33.33%), Claude-4.0-Sonnet (29.44%). Benchmark reveals long-context and unknown-tools challenges, with enterprise agents performing no better than standard ReAct frameworks.
Conclusion: The benchmark exposes critical gaps in current LLM capabilities for real-world MCP interactions and provides an open-source extensible framework to foster innovation in the MCP ecosystem.
Abstract: The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long-context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown-tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise-level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.
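A minimal sketch of the three evaluator kinds the abstract describes (format, static, dynamic); the function signatures and the ground-truth fetcher are assumptions for illustration, not the benchmark’s actual API.

```python
# Sketch of execution-based evaluators: format compliance, time-invariant
# exact matching, and grading-time ground-truth retrieval. Hypothetical.
import json
import re

def format_evaluator(answer: str) -> bool:
    # Format compliance: e.g., the agent must answer with a JSON object.
    try:
        return isinstance(json.loads(answer), dict)
    except json.JSONDecodeError:
        return False

def static_evaluator(answer: str, expected: str) -> bool:
    # Time-invariant content: whitespace/case-normalized exact match.
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(answer) == norm(expected)

def dynamic_evaluator(answer: str, fetch_ground_truth) -> bool:
    # Temporally sensitive content: ground truth is fetched at grading
    # time (e.g., today's stock price), then compared.
    return static_evaluator(answer, fetch_ground_truth())

print(format_evaluator('{"answer": 42}'))                     # True
print(static_evaluator("Mount Everest", " mount everest "))   # True
print(dynamic_evaluator("42", lambda: "42"))                  # True
```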
[202] Data-Driven Probabilistic Evaluation of Logic Properties with PAC-Confidence on Mealy Machines
Swantje Plambeck, Ali Salamati, Eyke Huellermeier, Goerschwin Fey
Main category: cs.AI
TL;DR: Data-driven PAC learning approach to determine safety probability of cyber-physical systems with discrete Mealy machine abstractions, validated on automated lane-keeping system.
Details
Motivation: Cyber-physical systems require powerful models for verification and diagnosis, but manual model extraction is difficult. Data-driven approaches provide solutions when suitable models are unavailable.
Method: Active learning approach based on the Probably Approximately Correct (PAC) paradigm, using guided sampling after initial data collection to learn safety probabilities on finite time horizons.
Result: Establishes connection between discrete logic and probabilistic reachability analysis, providing additional confidence on determined safety probabilities.
Conclusion: The proposed data-driven approach successfully determines safety probabilities for CPS with discrete abstractions, validated through case study on automated lane-keeping systems.
Abstract: Cyber-Physical Systems (CPS) are complex systems that require powerful models for tasks like verification, diagnosis, or debugging. Often, suitable models are not available and manual extraction is difficult. Data-driven approaches then provide a solution to, e.g., diagnosis tasks and verification problems based on data collected from the system. In this paper, we consider CPS with a discrete abstraction in the form of a Mealy machine. We propose a data-driven approach to determine the safety probability of the system on a finite horizon of n time steps. The approach is based on the Probably Approximately Correct (PAC) learning paradigm. Thus, we elaborate a connection between discrete logic and probabilistic reachability analysis of systems, especially providing an additional confidence on the determined probability. The learning process follows an active learning paradigm, where new learning data is sampled in a guided way after an initial learning set is collected. We validate the approach with a case study on an automated lane-keeping system.
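To make the PAC ingredient concrete: a Hoeffding-style bound fixes how many sampled traces suffice to estimate a finite-horizon safety probability within ε at confidence 1 − δ. The toy stochastic abstraction below is invented (a real Mealy machine also has inputs and outputs) and omits the paper’s guided, active sampling.

```python
# Hoeffding-style PAC sample bound plus a Monte Carlo safety estimate on
# a toy stochastic state machine. The machine is invented for illustration.
import math
import random

def pac_sample_size(eps: float, delta: float) -> int:
    # Hoeffding: m >= ln(2/delta) / (2*eps^2) i.i.d. traces give an
    # estimate within eps of the true probability with confidence 1-delta.
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

transitions = {0: ([0, 1, 2], [0.70, 0.25, 0.05]),   # state 2 is unsafe
               1: ([0, 1, 2], [0.50, 0.40, 0.10])}

def trace_is_safe(horizon: int) -> bool:
    state = 0
    for _ in range(horizon):
        nxt, weights = transitions[state]
        state = random.choices(nxt, weights=weights)[0]
        if state == 2:
            return False
    return True

m = pac_sample_size(eps=0.02, delta=0.01)            # 6624 traces
estimate = sum(trace_is_safe(horizon=10) for _ in range(m)) / m
print(f"{m} traces -> 10-step safety probability ~ {estimate:.3f}")
```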
[203] Privileged Self-Access Matters for Introspection in AI
Siyuan Song, Harvey Lederman, Jennifer Hu, Kyle Mahowald
Main category: cs.AI
TL;DR: The paper argues for a thicker definition of AI introspection, proposing it as any process yielding information about internal states more reliably than what’s available to third parties at equal/lower computational cost. Experiments show LLMs can appear to have lightweight introspection while failing meaningful introspection.
Details
Motivation: There is no consensus on how to define introspection in AI models, and existing 'lightweight' definitions may be insufficient to capture meaningful introspection capabilities.
Method: The authors propose a new, thicker definition of AI introspection and conduct experiments in which LLMs reason about their internal temperature parameters, testing both the lightweight definition and the proposed one.
Result: Experiments show that LLMs can appear to have lightweight introspection capabilities while failing to demonstrate meaningful introspection according to the authors’ proposed thicker definition.
Conclusion: A thicker definition of introspection is needed to properly evaluate AI models’ true introspective capabilities, as lightweight definitions may produce misleading results about whether models can genuinely introspect.
Abstract: Whether AI models can introspect is an increasingly important practical question. But there is no consensus on how introspection is to be defined. Beginning from a recently proposed “lightweight” definition, we argue instead for a thicker one. According to our proposal, introspection in AI is any process which yields information about internal states through a process more reliable than one with equal or lower computational cost available to a third party. Using experiments where LLMs reason about their internal temperature parameters, we show they can appear to have lightweight introspection while failing to meaningfully introspect per our proposed definition.
[204] Benchmarking graph construction by large language models for coherence-driven inference
Steve Huntsman, Jewell Thomas
Main category: cs.AI
TL;DR: Algorithm generates propositions for coherence graphs, LLMs show promising ability to reconstruct graphs from natural language propositions, with some models achieving perfect reconstruction on sparse graphs.
Details
Motivation: To advance machine cognition capabilities through coherence-driven inference and benchmark LLMs' ability to reconstruct coherence graphs from natural language propositions.
Method: Devised an algorithm to generate propositions that instantiate coherence graphs, then tested LLMs' reconstruction ability from natural language propositions using single prompts to reasoning-optimized models.
Result: Promising results with o1/3/4-mini achieving perfect reconstruction half of the time on sparse graphs from a single prompt.
Conclusion: Coherence-driven inference on consistency evaluations by LLMs may significantly advance machine cognition capabilities.
Abstract: We devise an algorithm to generate propositions that objectively instantiate graphs supporting coherence-driven inference. We also benchmark the ability of large language models (LLMs) to reconstruct coherence graphs from (a simple transformation of) propositions expressed in natural language, with promising results from a single prompt to reasoning-optimized LLMs. For example, o1/3/4-mini achieve perfect reconstruction half of the time on sparse graphs. Coherence-driven inference on consistency evaluations by LLMs may advance machine cognition capabilities.
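One plausible way to score graph reconstruction, consistent with the abstract’s “perfect reconstruction” criterion, is edge-level comparison between the true coherence graph and the graph parsed from the model’s answer. The F1 scorer and toy graphs below are illustrative assumptions, not the paper’s exact metric.

```python
# Sketch: edge-level F1 between a true coherence graph and a predicted
# one, treating edges as undirected. Graphs below are invented.
def edge_f1(true_edges: set, pred_edges: set) -> float:
    norm = lambda E: {frozenset(e) for e in E}
    t, p = norm(true_edges), norm(pred_edges)
    if not t or not p:
        return 0.0
    precision = len(t & p) / len(p)
    recall = len(t & p) / len(t)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

true_g = {("p1", "p2"), ("p2", "p3"), ("p3", "p4")}
pred_g = {("p2", "p1"), ("p2", "p3"), ("p1", "p4")}   # one wrong edge
print(round(edge_f1(true_g, pred_g), 3))              # 0.667
```

Perfect reconstruction corresponds to F1 = 1.0, i.e., every edge recovered and none hallucinated.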
[205] Reference-Aligned Retrieval-Augmented Question Answering over Heterogeneous Proprietary Documents
Nayoung Choi, Grace Byun, Andrew Chung, Ellie S. Paek, Shinsun Lee, Jinho D. Choi
Main category: cs.AI
TL;DR: Proposed RAG-QA framework for enterprise internal documents that handles multi-modal data, ensures privacy, and provides source traceability, showing significant improvements over non-RAG baselines.
Details
Motivation: Corporate documents contain valuable domain knowledge but are difficult to access due to volume and disorganization. Current RAG-QA systems face challenges with multi-modal data, confidentiality, and source traceability.
Method: Framework with: (1) a data pipeline converting raw multi-modal documents into a structured corpus and QA pairs, (2) a fully on-premise, privacy-preserving architecture, and (3) a lightweight reference matcher for source linking.
Result: Applied to automotive domain, improved factual correctness (+1.79, +1.94), informativeness (+1.33, +1.16), and helpfulness (+1.08, +1.67) over baseline on 1-5 scale from human and LLM judges.
Conclusion: The proposed RAG-QA framework effectively addresses enterprise document challenges while maintaining privacy and enabling source traceability, demonstrating practical value for internal knowledge management.
Abstract: Proprietary corporate documents contain rich domain-specific knowledge, but their overwhelming volume and disorganized structure make it difficult even for employees to access the right information when needed. For example, in the automotive industry, vehicle crash-collision tests, each costing hundreds of thousands of dollars, produce highly detailed documentation. However, retrieving relevant content during decision-making remains time-consuming due to the scale and complexity of the material. While Retrieval-Augmented Generation (RAG)-based Question Answering (QA) systems offer a promising solution, building an internal RAG-QA system poses several challenges: (1) handling heterogeneous multi-modal data sources, (2) preserving data confidentiality, and (3) enabling traceability between each piece of information in the generated answer and its original source document. To address these, we propose a RAG-QA framework for internal enterprise use, consisting of: (1) a data pipeline that converts raw multi-modal documents into a structured corpus and QA pairs, (2) a fully on-premise, privacy-preserving architecture, and (3) a lightweight reference matcher that links answer segments to supporting content. Applied to the automotive domain, our system improves factual correctness (+1.79, +1.94), informativeness (+1.33, +1.16), and helpfulness (+1.08, +1.67) over a non-RAG baseline, based on 1-5 scale ratings from both human and LLM judges.
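As a hedged sketch of what a “lightweight reference matcher” can look like, the snippet below links each answer segment to its most similar source chunk via bag-of-words cosine similarity. A production system would more likely use embeddings; all names and data here are invented.

```python
# Sketch: link answer segments to source chunks by cosine similarity
# over token counts. Chunk IDs and texts are illustrative.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_references(answer_segments, source_chunks):
    bow = lambda s: Counter(s.lower().split())
    chunks = [(cid, bow(text)) for cid, text in source_chunks.items()]
    return {seg: max(chunks, key=lambda c: cosine(bow(seg), c[1]))[0]
            for seg in answer_segments}

sources = {"doc1:p3": "frontal crash test peak deceleration 32 g",
           "doc2:p7": "side impact door intrusion measured at 11 cm"}
print(match_references(
    ["Peak deceleration in the frontal crash was 32 g."], sources))
# {'Peak deceleration in the frontal crash was 32 g.': 'doc1:p3'}
```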
[206] Unsupervised Learning for Quadratic Assignment
Yimeng Min, Carla P. Gomes
Main category: cs.AI
TL;DR: PLUME search is an unsupervised learning framework that improves combinatorial optimization search efficiency using permutation-based loss without requiring labeled data or reinforcement learning.
Details
Motivation: To enhance search efficiency in combinatorial optimization through a data-driven approach that does not rely on supervised or reinforcement learning, addressing the limitations of existing methods.
Method: Uses unsupervised learning with a permutation-based loss and a non-autoregressive approach, learning directly from problem instances without labeled data.
Result: Consistently improves solution quality on quadratic assignment problems and demonstrates generalization across different problem densities and sizes.
Conclusion: PLUME search provides an effective unsupervised framework for combinatorial optimization that generalizes well and outperforms traditional approaches without requiring supervision.
Abstract: We introduce PLUME search, a data-driven framework that enhances search efficiency in combinatorial optimization through unsupervised learning. Unlike supervised or reinforcement learning, PLUME search learns directly from problem instances using a permutation-based loss with a non-autoregressive approach. We evaluate its performance on the quadratic assignment problem, a fundamental NP-hard problem that encompasses various combinatorial optimization problems. Experimental results demonstrate that PLUME search consistently improves solution quality. Furthermore, we study the generalization behavior and show that the learned model generalizes across different densities and sizes.
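Two ingredients make a permutation-based loss workable for the QAP: the Koopmans-Beckmann objective itself, and a differentiable relaxation of permutations. The sketch below shows the hard objective and a Sinkhorn-normalized soft permutation; it is a generic illustration under these assumptions, not PLUME’s actual architecture.

```python
# Sketch: the QAP objective under a hard permutation, and a Sinkhorn
# "soft permutation" that makes permutation-based losses differentiable.
import numpy as np

def qap_cost(A, B, perm):
    # Classic QAP: sum_ij A[i,j] * B[perm[i], perm[j]].
    return float((A * B[np.ix_(perm, perm)]).sum())

def sinkhorn(logits, n_iters=50):
    # Alternately normalize rows and columns of exp(logits) so the result
    # approaches a doubly stochastic (soft permutation) matrix.
    P = np.exp(logits - logits.max())
    for _ in range(n_iters):
        P /= P.sum(axis=1, keepdims=True)
        P /= P.sum(axis=0, keepdims=True)
    return P

rng = np.random.default_rng(0)
A, B = rng.random((5, 5)), rng.random((5, 5))
perm = np.array([2, 0, 4, 1, 3])
print("hard QAP cost:", qap_cost(A, B, perm))

P = sinkhorn(rng.random((5, 5)))
soft_cost = float(np.trace(A @ P @ B.T @ P.T))   # relaxed objective
print("soft (relaxed) cost:", soft_cost)
```

When P is an exact permutation matrix, trace(A P Bᵀ Pᵀ) reduces to the hard objective above, which is what lets a network trained on the relaxation be rounded back to a permutation at inference time.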
[207] Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving
Xinji Mai, Haotian Xu, Zhong-Zhi Li, Xing W, Weinong Wang, Jian Hu, Yingying Zhang, Wenqiang Zhang
Main category: cs.AI
TL;DR: ZeroTIR uses outcome-based RL to train LLMs to spontaneously generate and execute Python code for math problems without supervised examples, showing predictable scaling of tool use and accuracy with training.
Details
Motivation: LLMs struggle with precise mathematical reasoning, and understanding how agents autonomously learn to leverage external tools like code execution remains crucial for enhanced reasoning capabilities.
Method: Reinforcement learning from outcome-based rewards for Tool-Integrated Reasoning (ZeroTIR), training base LLMs to generate and execute Python code without supervised tool-use examples, using a decoupled code execution environment.
Result: ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks, with strong positive correlations showing increased training steps lead to higher code execution frequency, response length, and task accuracy.
Conclusion: The research provides foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark and demonstrating quantifiable relationships between training effort and emergent tool-augmented reasoning strategies.
Abstract: Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is demonstrating that, as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations where increased training steps lead to increases in the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at https://github.com/yyht/openrlhf_async_pipline.
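A minimal sketch of the decoupled execution idea: model-emitted code runs in a separate interpreter process with a timeout, and the outcome-based reward is an exact match on the printed final answer. The sandboxing here is deliberately simplistic compared to a real training environment.

```python
# Sketch: run model-emitted Python in a subprocess and grant an
# outcome-based reward for an exact final-answer match.
import subprocess
import sys

def execute(code: str, timeout_s: float = 5.0) -> str:
    # Isolated interpreter process; returns captured stdout.
    try:
        out = subprocess.run([sys.executable, "-c", code],
                             capture_output=True, text=True,
                             timeout=timeout_s)
        return out.stdout.strip()
    except subprocess.TimeoutExpired:
        return ""

def outcome_reward(code: str, gold_answer: str) -> float:
    return 1.0 if execute(code) == gold_answer else 0.0

model_code = "print(sum(i*i for i in range(1, 11)))"   # hypothetical rollout
print(outcome_reward(model_code, "385"))                # 1.0
```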
[208] Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs
Maris F. L. Galesloot, Roman Andriushchenko, Milan Češka, Sebastian Junges, Nils Jansen
Main category: cs.AI
TL;DR: A method for computing robust policies in Hidden-model POMDPs by combining formal verification and subgradient ascent to handle uncertainty across multiple environment models.
Details
Motivation: Optimal POMDP policies lack robustness against environmental perturbations, and HM-POMDPs model sets of potential environment models where the true model is unknown at execution time.
Method: Combines deductive formal verification for tractable robust policy evaluation (computing a worst-case POMDP) with subgradient ascent to optimize policies for worst-case scenarios.
Result: Produces more robust policies that generalize better to unseen POMDPs and scales to HM-POMDPs with over 100,000 environments, outperforming various baselines.
Conclusion: The approach successfully addresses robustness in HM-POMDPs by integrating verification and optimization techniques, enabling scalable robust policy computation for complex uncertain environments.
Abstract: Partially observable Markov decision processes (POMDPs) model specific environments in sequential decision-making under uncertainty. Critically, optimal policies for POMDPs may not be robust against perturbations in the environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different environment models, that is, POMDPs with a shared action and observation space. The intuition is that the true model is hidden among a set of potential models, and it is unknown which model will be the environment at execution time. A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We compute such robust policies by combining two orthogonal techniques: (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing a worst-case POMDP within the HM-POMDP, and (2) subgradient ascent to optimize the candidate policy for a worst-case POMDP. The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs, and (2) scales to HM-POMDPs that consist of over a hundred thousand environments.
[209] The NordDRG AI Benchmark for Large Language Models
Tapio Pitkäranta
Main category: cs.AI
TL;DR: First public benchmark for testing LLMs on hospital funding DRG (Diagnosis-Related Groups) coding tasks, with GPT-5 Thinking achieving 13/13 on logic tasks and 7/13 on full grouper emulation.
Details
Motivation: LLMs are being used for clinical coding but lack open benchmarks for hospital funding systems where DRGs determine reimbursement, creating transparency and auditability concerns in multi-trillion-dollar health spending.
Method: Created NordDRG-AI-Benchmark with machine-readable NordDRG definition tables, expert manuals, and change-log templates. It includes 13 Logic tasks and 13 Grouper tasks requiring full DRG grouper emulation with strict exact-match scoring.
Result: GPT-5 Thinking and Opus 4.1 scored 13/13 on Logic tasks; GPT-5 Thinking solved 7/13 Grouper tasks. Most models performed poorly, with many scoring 0/13 on full grouper emulation.
Conclusion: This benchmark provides the first reproducible yardstick for evaluating LLMs in hospital funding systems, enabling head-to-head and longitudinal evaluation with governance-grade traceability.
Abstract: Large language models (LLMs) are being piloted for clinical coding and decision support, yet no open benchmark targets the hospital-funding layer where Diagnosis-Related Groups (DRGs) determine reimbursement. In most OECD systems, DRGs route a substantial share of multi-trillion-dollar health spending through governed grouper software, making transparency and auditability first-order concerns. We release NordDRG-AI-Benchmark, the first public, rule-complete test bed for DRG reasoning. The package includes (i) machine-readable NordDRG definition tables (approximately 20 sheets) and (ii) expert manuals and change-log templates that capture governance workflows. It exposes two suites: a 13-task Logic benchmark (code lookup, cross-table inference, grouping features, multilingual terminology, and CC/MCC validity checks) and a 13-task Grouper benchmark that requires full DRG grouper emulation with strict exact-match scoring on both the DRG and the triggering drg_logic.id. Lightweight reference agents (LogicAgent, GrouperAgent) enable artefact-only evaluation. Under an artefact-only (no web) setting, on the 13 Logic tasks GPT-5 Thinking and Opus 4.1 score 13/13, o3 scores 12/13; mid-tier models (GPT-5 Thinking Mini, o4-mini, GPT-5 Fast) achieve 6-8/13, and remaining models score 5/13 or below. On full grouper emulation across 13 tasks, GPT-5 Thinking solves 7/13, o3 6/13, o4-mini 3/13; GPT-5 Thinking Mini solves 1/13, and all other tested endpoints score 0/13. To our knowledge, this is the first public report of an LLM partially emulating the complete NordDRG grouper logic with governance-grade traceability. Coupling a rule-complete release with exact-match tasks and open scoring provides a reproducible yardstick for head-to-head and longitudinal evaluation in hospital funding. Benchmark materials are available on GitHub.
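The strict exact-match criterion is easy to pin down in code: a Grouper task counts only if both the predicted DRG and the triggering drg_logic.id equal the gold labels. The records below are invented; only the scoring rule comes from the abstract.

```python
# Sketch of the benchmark's strict exact-match Grouper scoring.
# Case IDs, DRG codes, and logic IDs below are invented.
def score_grouper(predictions, gold):
    hits = sum(1 for case_id, (drg, logic_id) in gold.items()
               if predictions.get(case_id) == (drg, logic_id))
    return hits, len(gold)

gold = {"case-01": ("F62C", "L-1043"), "case-02": ("N08B", "L-2210")}
preds = {"case-01": ("F62C", "L-1043"), "case-02": ("N08B", "L-9999")}
print("%d/%d" % score_grouper(preds, gold))   # 1/2: logic-id mismatch fails
```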
[210] Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN)
Sarat Ahmad, Zeinab Nezami, Maryam Hafeez, Syed Ali Raza Zaidi
Main category: cs.AI
TL;DR: Comparative evaluation of RAG variants (Vector RAG, GraphRAG, Hybrid GraphRAG) for generating ORAN xApps/rApps using LLMs, showing GraphRAG and Hybrid GraphRAG outperform traditional RAG with improved factual correctness and context relevance.
Details
Motivation: Fine-tuning LLMs for telecom-specific tasks is expensive and resource-intensive, while traditional RAG systems lack systematic evaluation in high-stakes domains like ORAN.
Method: Comparative evaluation using ORAN specifications, assessing performance across varying question complexities with four metrics: faithfulness, answer relevance, context relevance, and factual correctness.
Result: GraphRAG and Hybrid GraphRAG outperform traditional RAG - Hybrid GraphRAG improves factual correctness by 8%, GraphRAG improves context relevance by 11%.
Conclusion: GraphRAG and Hybrid GraphRAG offer superior performance over traditional vector-based RAG for ORAN applications, providing better factual grounding and context relevance without expensive fine-tuning.
Abstract: Generative AI (GenAI) is expected to play a pivotal role in enabling autonomous optimization in future wireless networks. Within the ORAN architecture, Large Language Models (LLMs) can be specialized to generate xApps and rApps by leveraging specifications and API definitions from the RAN Intelligent Controller (RIC) platform. However, fine-tuning base LLMs for telecom-specific tasks remains expensive and resource-intensive. Retrieval-Augmented Generation (RAG) offers a practical alternative through in-context learning, enabling domain adaptation without full retraining. While traditional RAG systems rely on vector-based retrieval, emerging variants such as GraphRAG and Hybrid GraphRAG incorporate knowledge graphs or dual retrieval strategies to support multi-hop reasoning and improve factual grounding. Despite their promise, these methods lack systematic, metric-driven evaluations, particularly in high-stakes domains such as ORAN. In this study, we conduct a comparative evaluation of Vector RAG, GraphRAG, and Hybrid GraphRAG using ORAN specifications. We assess performance across varying question complexities using established generation metrics: faithfulness, answer relevance, context relevance, and factual correctness. Results show that both GraphRAG and Hybrid GraphRAG outperform traditional RAG. Hybrid GraphRAG improves factual correctness by 8%, while GraphRAG improves context relevance by 11%.
[211] SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents
Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, Daxin Jiang, Binxing Jiao, Chen Hu, Huacan Wang
Main category: cs.AI
TL;DR: SE-Agent is a self-evolution framework that enhances LLM-based agents’ reasoning by revisiting and improving previous trajectories through revision, recombination, and refinement operations, achieving state-of-the-art performance on software engineering tasks.
Details
Motivation: Current LLM-based agents' problem-solving trajectories contain rich feedback but remain underexploited. Existing approaches like MCTS ignore trajectory interdependence and lack search-space diversity, leading to redundant reasoning and suboptimal outcomes.
Method: Proposes the SE-Agent framework with three key operations: revision (improving existing trajectories), recombination (combining elements from different trajectories), and refinement (polishing solutions). This evolutionary mechanism explores diverse solution paths and leverages cross-trajectory inspiration.
Result: Achieves up to 55% relative improvement on SWE-bench Verified for resolving real-world GitHub issues across five strong LLMs. State-of-the-art performance among all open-source agents on this benchmark.
Conclusion: SE-Agent enables continuous self-evolution that incrementally improves reasoning quality by intelligently exploring diverse solution paths and leveraging cross-trajectory inspiration, effectively mitigating the impact of suboptimal reasoning paths.
Abstract: Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents’ interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/JARVIS-Xs/SE-Agent.
[212] EoH-S: Evolution of Heuristic Set using LLMs for Automated Heuristic Design
Fei Liu, Yilu Liu, Qingfu Zhang, Xialiang Tong, Mingxuan Yuan
Main category: cs.AI
TL;DR: Proposes Automated Heuristic Set Design (AHSD) using LLMs to generate complementary heuristic sets instead of single heuristics, achieving up to 60% performance improvement over state-of-the-art methods.
Details
Motivation: Existing Automated Heuristic Design approaches only create single heuristics that perform poorly across diverse problem instances and distributions, lacking generalization capability.
Method: Proposes Evolution of Heuristic Set (EoH-S) with complementary population management and complementary-aware memetic search to generate small sets of high-quality, complementary heuristics.
Result: Comprehensive experiments on three AHD tasks show EoH-S consistently outperforms state-of-the-art methods with up to 60% performance improvements across diverse instance sizes and distributions.
Conclusion: AHSD formulation with EoH-S effectively addresses generalization limitations of single-heuristic approaches by generating complementary heuristic sets that cover diverse problem instances.
Abstract: Automated Heuristic Design (AHD) using Large Language Models (LLMs) has achieved notable success in recent years. Despite the effectiveness of existing approaches, they only design a single heuristic to serve all problem instances, often inducing poor generalization across different distributions or settings. To address this issue, we propose Automated Heuristic Set Design (AHSD), a new formulation for LLM-driven AHD. The aim of AHSD is to automatically generate a small-sized complementary heuristic set to serve diverse problem instances, such that each problem instance could be optimized by at least one heuristic in this set. We show that the objective function of AHSD is monotone and supermodular. Then, we propose Evolution of Heuristic Set (EoH-S) to apply the AHSD formulation for LLM-driven AHD. With two novel mechanisms of complementary population management and complementary-aware memetic search, EoH-S could effectively generate a set of high-quality and complementary heuristics. Comprehensive experimental results on three AHD tasks with diverse instances spanning various sizes and distributions demonstrate that EoH-S consistently outperforms existing state-of-the-art AHD methods and achieves up to 60% performance improvements.
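The set objective behind AHSD can be made concrete: a heuristic set is scored by letting each instance be served by its best member. The sketch below computes that objective and picks a two-heuristic set greedily; the performance table is invented, and the greedy loop is a simple baseline for intuition, not the paper’s memetic search.

```python
# Sketch of the AHSD set objective: each instance is served by the best
# heuristic in the set. Performance numbers are invented.
perf = {                        # perf[heuristic] = score per instance
    "h_greedy":    [0.90, 0.40, 0.50],
    "h_lookahead": [0.50, 0.95, 0.40],
    "h_random":    [0.60, 0.60, 0.60],
}
n_instances = 3

def set_value(S):
    return sum(max(perf[h][i] for h in S) for i in range(n_instances))

chosen = []
for _ in range(2):              # build a complementary set of size 2
    candidates = [h for h in perf if h not in chosen]
    chosen.append(max(candidates, key=lambda h: set_value(chosen + [h])))

print(chosen, round(set_value(chosen), 2))
# ['h_lookahead', 'h_greedy'] 2.35 -- the two specialists together beat
# any set containing the generalist h_random.
```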
[213] KIRETT: Knowledge-Graph-Based Smart Treatment Assistant for Intelligent Rescue Operations
Mubaris Nadeem, Johannes Zenkert, Lisa Bender, Christian Weber, Madjid Fathi
Main category: cs.AI
TL;DR: A knowledge graph system for emergency responders that uses AI to provide real-time treatment recommendations based on vital data analysis.
Details
Motivation: The increasing need for rescue operations and time-critical emergency situations require first responders to have immediate access to processed medical knowledge and AI-assisted recommendations to provide optimal care.
Method: Developed a knowledge graph as the central knowledge representation, enabling intelligent treatment recommendations with AI-based pre-recognition of emergency situations using freshly recorded vital data.
Result: The system provides first responders with innovative knowledge management that assists in making treatment decisions during time-dependent emergency scenarios.
Conclusion: The knowledge graph approach offers a valuable solution for improving emergency medical care by delivering AI-powered, real-time treatment recommendations to first responders when they need it most.
Abstract: Over the years, the need for rescue operations throughout the world has increased rapidly. Demographic changes and the resulting risk of injury or health disorders form the basis for emergency calls. In such scenarios, first responders are in a rush to reach the patient in need, provide first aid, and save lives. They must be able to provide personalized and optimized healthcare in the shortest possible time and estimate the patient’s condition with the help of freshly recorded vital data. However, in such a time-dependent situation, first responders and medical experts cannot fully draw on their knowledge and need assistance and recommendations for further medical treatments. To achieve this, knowledge calculated, evaluated, and processed on the spot must be made available to improve treatments by first responders. The Knowledge Graph presented in this article, as a central knowledge representation, provides first responders with innovative knowledge management that enables intelligent treatment recommendations with artificial-intelligence-based pre-recognition of the situation.
[214] TASER: Table Agents for Schema-guided Extraction and Recommendation
Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso
Main category: cs.AI
TL;DR: TASER is an agentic table extraction system that handles messy financial tables through continuous learning and schema-guided extraction, outperforming existing models by 10.1% and improving extraction accuracy with larger batch sizes.
Details
Motivation: Real-world financial documents contain essential information buried in messy, multi-page tables with complex structures (99.4% without bounding boxes, up to 426 rows across 44 pages), requiring specialized extraction systems.
Method: TASER uses table agents for detection, classification, extraction, and recommendations leveraging an initial schema, with a Recommender Agent that reviews outputs, suggests schema revisions, and enables continuous learning through larger batch sizes.
Result: Outperforms Table Transformer by 10.1%, with 104.3% increase in actionable schema recommendations and 9.8% increase in extracted holdings. Trained on 22,584 pages with $731B+ holdings data.
Conclusion: Agentic, schema-guided extraction systems show promise for robust understanding of real-world financial tables, with continuous learning significantly improving performance.
Abstract: Real-world financial documents report essential information about an entity’s financial holdings that can span millions of different financial instrument types. Yet, these details are often buried in messy, multi-page, fragmented tables - for example, 99.4% of the tables in our dataset have no bounding boxes with the maximum number of rows amounting to 426 per table across 44 pages. To tackle these unique challenges from real-world tables, we present a continuously learning, agentic table extraction system, TASER (Table Agents for Schema-guided Extraction and Recommendation) that extracts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Our table agents execute on table detection, classification, extraction, and recommendations by leveraging an initial schema. Then, our Recommender Agent reviews the outputs, recommends schema revisions, and decides on the final recommendations, enabling TASER to outperform existing table detection models such as Table Transformer by 10.1%. Within this continuous learning process, we highlight that larger batch sizes result in a 104.3% increase in schema recommendations that are actionable and utilized, resulting in a 9.8% increase in extracted holdings - highlighting the importance of a continuous learning process. To train TASER, we have manually labeled 22,584 pages (28,150,449 tokens), 3,213 tables for $731,685,511,687 of holdings culminating in one of the first real financial table datasets. We release our dataset TASERTab to enable the research community to access real-world financial tables and outputs. Our results highlight the promise of agentic, schema-guided extraction systems for robust understanding of real-world financial tables.
[215] EvoCurr: Self-evolving Curriculum with Behavior Code Generation for Complex Decision-making
Yang Cheng, Zilai Wang, Weiyu Ma, Wenhui Zhu, Yue Deng, Jian Zhao
Main category: cs.AI
TL;DR: EvoCurr is a self-evolving framework in which a curriculum-generation LLM creates problem sequences of progressively increasing difficulty for a solver LLM, enabling better performance on complex reasoning tasks.
Details
Motivation: LLMs struggle with highly complex problems requiring deep reasoning over long horizons due to a lack of structured intermediate guidance, leading to inefficiency or failure.
Method: A dedicated curriculum-generation LLM constructs problem instances with gradually increasing difficulty, dynamically adapting based on the solver LLM's learning progress. The solver LLM is implemented as a code-generation model producing Python decision-tree scripts.
Result: Experimental results on challenging decision-making benchmarks show significant improvements in task success rates and solution efficiency compared to direct-solving baselines.
Conclusion: LLM-driven curriculum learning holds strong potential for enhancing automated reasoning in real-world, high-complexity domains.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, including programming, planning, and decision-making. However, their performance often degrades when faced with highly complex problem instances that require deep reasoning over long horizons. In such cases, direct problem-solving approaches can lead to inefficiency or failure due to the lack of structured intermediate guidance. To address this, we propose EvoCurr, a novel self-evolving framework in which a dedicated curriculum-generation LLM constructs a sequence of problem instances with gradually increasing difficulty, tailored to the solver LLM’s learning progress. The curriculum dynamically adapts, easing challenges when the solver struggles and escalating them when success is consistent, thus maintaining an optimal learning trajectory. This approach enables the solver LLM, implemented as a code-generation model producing Python decision-tree scripts, to progressively acquire the skills needed for complex decision-making tasks. Experimental results on challenging decision-making benchmarks show that our method significantly improves task success rates and solution efficiency compared to direct-solving baselines. These findings suggest that LLM-driven curriculum learning holds strong potential for enhancing automated reasoning in real-world, high-complexity domains.
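The curriculum controller reduces to a small feedback loop: escalate difficulty after consistent success, ease off after consistent failure. The sketch below is a hypothetical minimal version; the solver call, window size, and thresholds are placeholders, not the paper’s settings.

```python
# Sketch of a curriculum difficulty controller. The solver callable and
# all thresholds are placeholders for illustration.
import random

def run_curriculum(solve, n_rounds=20, window=3):
    difficulty, history = 1, []
    for _ in range(n_rounds):
        history.append(solve(difficulty))    # solver attempts one instance
        recent = history[-window:]
        if len(recent) == window and all(recent):
            difficulty += 1                  # consistent success: escalate
            history.clear()
        elif len(recent) == window and not any(recent):
            difficulty = max(1, difficulty - 1)   # struggling: ease off
            history.clear()
    return difficulty

# Toy solver that succeeds with probability decreasing in difficulty.
print(run_curriculum(lambda d: random.random() < 1.0 / d))
```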
[216] Modeling Relational Logic Circuits for And-Inverter Graph Convolutional Network
Weihao Sun, Shikai Guo, Siwen Wang, Qian Ma, Hui Li
Main category: cs.AI
TL;DR: AIGer is a novel framework for modeling And-Inverter Graphs that combines node logic feature embedding with heterogeneous graph convolutional networks to jointly capture functional and structural characteristics, achieving state-of-the-art performance in circuit analysis tasks.
Details
Motivation: Existing methods struggle to accurately model real-world AIGs due to their complex structure and large scale, lacking the ability to jointly model functional and structural characteristics with sufficient dynamic information propagation capability.
Method: AIGer consists of two components: 1) a node logic feature initialization embedding that projects logic nodes into independent semantic spaces, and 2) an AIGs feature learning network using heterogeneous graph convolutional networks with dynamic relationship weight matrices and differentiated information aggregation approaches.
Result: AIGer outperforms current best models in Signal Probability Prediction (improving MAE by 18.95% and MSE by 44.44%) and Truth Table Distance Prediction (improving MAE by 33.57% and MSE by 14.79%).
Conclusion: The proposed AIGer framework effectively addresses the challenges of joint functional-structural modeling in AIGs and demonstrates superior performance in key EDA tasks, representing a significant advancement in automated logic circuit design.
Abstract: The automation of logic circuit design enhances chip performance, energy efficiency, and reliability, and is widely applied in the field of Electronic Design Automation (EDA). And-Inverter Graphs (AIGs) efficiently represent, optimize, and verify the functional characteristics of digital circuits, enhancing the efficiency of EDA development. Due to the complex structure and large scale of nodes in real-world AIGs, accurate modeling is challenging, leaving existing work unable to jointly model functional and structural characteristics and lacking sufficient dynamic information propagation capability. To address these challenges, we propose AIGer. Specifically, AIGer consists of two components: 1) a node logic feature initialization embedding component and 2) an AIGs feature learning network component. The node logic feature initialization embedding component projects logic nodes, such as AND and NOT, into independent semantic spaces to enable effective node embedding for subsequent processing. Building upon this, the AIGs feature learning network component employs a heterogeneous graph convolutional network, designing dynamic relationship weight matrices and differentiated information aggregation approaches to better represent the original structure and information of AIGs. The combination of these two components enhances AIGer’s ability to jointly model functional and structural characteristics and improves its message passing capability. Experimental results indicate that AIGer outperforms the current best models in the Signal Probability Prediction (SSP) task, improving MAE and MSE by 18.95% and 44.44%, respectively. In the Truth Table Distance Prediction (TTDP) task, AIGer achieves improvements of 33.57% and 14.79% in MAE and MSE, respectively, compared to the best-performing models.
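For readers unfamiliar with the SSP task, its ground truth is straightforward to approximate: the signal probability of an AIG node is the chance it evaluates to 1 under uniformly random inputs. The tiny two-gate AIG and the Monte Carlo estimator below are illustrative, not part of AIGer.

```python
# Sketch: estimate the signal probability of an AIG node by random
# simulation. AND nodes take two (possibly inverted) fanins; the graph
# below is invented.
import random

# node -> (fanin0, invert0, fanin1, invert1); primary inputs have no entry.
aig = {"n3": ("x1", False, "x2", False),    # n3 = x1 AND x2
       "n4": ("n3", True,  "x3", False)}    # n4 = (NOT n3) AND x3

def evaluate(node, values):
    if node in values:
        return values[node]
    a, inv_a, b, inv_b = aig[node]
    va = evaluate(a, values) ^ inv_a
    vb = evaluate(b, values) ^ inv_b
    values[node] = va and vb
    return values[node]

def signal_probability(node, inputs, n_samples=100_000):
    hits = 0
    for _ in range(n_samples):
        values = {x: random.random() < 0.5 for x in inputs}
        hits += evaluate(node, values)
    return hits / n_samples

print(signal_probability("n4", ["x1", "x2", "x3"]))   # ~ 3/8 = 0.375
```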
cs.SD
[217] Systematic FAIRness Assessment of Open Voice Biomarker Datasets for Mental Health and Neurodegenerative Diseases
Ishaan Mahapatra, Nihar R. Mahapatra
Main category: cs.SD
TL;DR: First systematic FAIR evaluation of 27 voice biomarker datasets for mental health and neurodegenerative diseases, revealing high Findability but weaknesses in Accessibility, Interoperability, and Reusability.
Details
Motivation: Clinical adoption of voice biomarkers is constrained by inconsistent quality and limited usability of publicly available datasets, requiring systematic evaluation to improve their FAIRness.
Method: Used the FAIR Data Maturity Model and structured, priority-weighted scoring to assess 27 datasets at the subprinciple, principle, and composite levels across the Findable, Accessible, Interoperable, and Reusable dimensions.
Result: Found consistently high Findability but substantial variability and weaknesses in Accessibility, Interoperability, and Reusability. Mental health datasets showed greater variability while neurodegenerative datasets were more consistent. Repository choice significantly influenced FAIRness scores.
Conclusion: Recommend adopting structured domain-specific metadata standards, prioritizing FAIR-compliant repositories, and applying structured FAIR evaluation frameworks to improve dataset quality and accelerate clinical translation of voice biomarker technologies.
Abstract: Voice biomarkers (human-generated acoustic signals such as speech, coughing, and breathing) are promising tools for scalable, non-invasive detection and monitoring of mental health and neurodegenerative diseases. Yet, their clinical adoption remains constrained by inconsistent quality and limited usability of publicly available datasets. To address this gap, we present the first systematic FAIR (Findable, Accessible, Interoperable, Reusable) evaluation of 27 publicly available voice biomarker datasets focused on these disease areas. Using the FAIR Data Maturity Model and a structured, priority-weighted scoring method, we assessed FAIRness at subprinciple, principle, and composite levels. Our analysis revealed consistently high Findability but substantial variability and weaknesses in Accessibility, Interoperability, and Reusability. Mental health datasets exhibited greater variability in FAIR scores, while neurodegenerative datasets were slightly more consistent. Repository choice also significantly influenced FAIRness scores. To enhance dataset quality and clinical utility, we recommend adopting structured, domain-specific metadata standards, prioritizing FAIR-compliant repositories, and routinely applying structured FAIR evaluation frameworks. These findings provide actionable guidance to improve dataset interoperability and reuse, thereby accelerating the clinical translation of voice biomarker technologies.
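The scoring arithmetic is simple to sketch: subprinciple scores average into principle scores, which combine under priority weights into a composite. The weights and scores below are invented; the paper derives its indicators from the FAIR Data Maturity Model.

```python
# Sketch of priority-weighted FAIR scoring. All numbers are invented.
scores = {   # 0-1 compliance per subprinciple, for one dataset
    "F": [1.0, 1.0, 0.8],
    "A": [0.6, 0.4],
    "I": [0.3, 0.5],
    "R": [0.4, 0.6, 0.2],
}
weights = {"F": 0.25, "A": 0.25, "I": 0.25, "R": 0.25}  # placeholder priorities

principle = {p: sum(v) / len(v) for p, v in scores.items()}
composite = sum(weights[p] * principle[p] for p in weights)
print(principle, round(composite, 3))   # high F, weak A/I/R -> ~0.558
```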
[218] EffiFusion-GAN: Efficient Fusion Generative Adversarial Network for Speech Enhancement
Bin Wen, Tien-Ping Tan
Main category: cs.SD
TL;DR: EffiFusion-GAN is a lightweight speech enhancement model using depthwise separable convolutions, enhanced attention mechanisms, and dynamic pruning to achieve high performance with reduced computational requirements.
Details
Motivation: To develop an efficient speech enhancement model suitable for resource-constrained environments while maintaining high performance through an optimized architecture and pruning techniques.
Method: Integrates depthwise separable convolutions in multi-scale blocks and an enhanced attention mechanism with dual normalization and residual refinement, and applies dynamic pruning to reduce model size.
Result: Achieves PESQ score of 3.45 on VoiceBank+DEMAND dataset, outperforming existing models with same parameter settings.
Conclusion: EffiFusion-GAN provides an effective and efficient solution for speech enhancement that balances performance with computational efficiency, making it suitable for practical deployment in constrained environments.
Abstract: We introduce EffiFusion-GAN (Efficient Fusion Generative Adversarial Network), a lightweight yet powerful model for speech enhancement. The model integrates depthwise separable convolutions within a multi-scale block to capture diverse acoustic features efficiently. An enhanced attention mechanism with dual normalization and residual refinement further improves training stability and convergence. Additionally, dynamic pruning is applied to reduce model size while maintaining performance, making the framework suitable for resource-constrained environments. Experimental evaluation on the public VoiceBank+DEMAND dataset shows that EffiFusion-GAN achieves a PESQ score of 3.45, outperforming existing models under the same parameter settings.
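The efficiency claim rests largely on depthwise separable convolutions, which factor a full convolution into a per-channel filter plus a 1x1 mixing step. A minimal PyTorch sketch (layer sizes are arbitrary, not the paper’s) shows the structure and the parameter savings.

```python
# Sketch: depthwise separable 1D convolution vs. a full convolution.
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # Pointwise: 1x1 conv mixes channels.
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

n_params = lambda m: sum(p.numel() for p in m.parameters())
sep = DepthwiseSeparableConv1d(64, 128, 9)
full = nn.Conv1d(64, 128, 9, padding=4)
print(n_params(sep), "vs", n_params(full))   # 8960 vs 73856 parameters
x = torch.randn(1, 64, 16000)
print(sep(x).shape)                          # torch.Size([1, 128, 16000])
```

For these sizes the separable block needs 8,960 parameters against 73,856 for the full convolution, roughly an 8x reduction, which is the kind of saving that makes such models viable in resource-constrained settings.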
[219] Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions
Euiyeon Kim, Yong-Hoon Choi
Main category: cs.SD
TL;DR: A new music source separation model using Mamba2 state space model achieves state-of-the-art vocal isolation with 11.03 dB cSDR, outperforming Transformer-based approaches by better capturing long-range temporal dependencies and intermittent vocals.
Details
Motivation: Transformer-based models often fail to capture intermittently occurring vocals in music source separation tasks, creating a need for better long-range temporal dependency modeling.Method: Leverages Mamba2 state space model combined with band-splitting strategy and dual-path architecture to efficiently handle long input sequences and capture long-range temporal dependencies.
Result: Achieves 11.03 dB cSDR (best reported to date) with substantial gains in uSDR, showing stable and consistent performance across varying input lengths and vocal occurrence patterns.
Conclusion: Mamba-based models are highly effective for high-resolution audio processing and open new directions for broader audio research applications.
Abstract: We introduce a new music source separation model tailored for accurate vocal isolation. Unlike Transformer-based approaches, which often fail to capture intermittently occurring vocals, our model leverages Mamba2, a recent state space model, to better capture long-range temporal dependencies. To handle long input sequences efficiently, we combine a band-splitting strategy with a dual-path architecture. Experiments show that our approach outperforms recent state-of-the-art models, achieving a cSDR of 11.03 dB (the best reported to date) and delivering substantial gains in uSDR. Moreover, the model exhibits stable and consistent performance across varying input lengths and vocal occurrence patterns. These results demonstrate the effectiveness of Mamba-based models for high-resolution audio processing and open up new directions for broader applications in audio research.
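As an illustration of the band-splitting strategy, the sketch below slices a complex spectrogram into frequency bands and projects each band to a shared feature size, producing the per-band sequences a dual-path model (Mamba2 in the paper) would then process alternately along time and across bands. Band edges and dimensions are assumptions, not the paper's configuration.

```python
# Hedged sketch of band-splitting: each frequency band of a complex
# spectrogram is flattened and linearly projected to d_model features.
# Split points and sizes below are illustrative assumptions.
import torch
import torch.nn as nn

n_fft_bins, d_model = 1025, 128
band_edges = [0, 64, 128, 256, 512, 1025]      # assumed split points

projs = nn.ModuleList(
    nn.Linear(2 * (hi - lo), d_model)          # 2x for real/imag parts
    for lo, hi in zip(band_edges[:-1], band_edges[1:])
)

spec = torch.randn(4, n_fft_bins, 200, 2)      # (batch, freq, time, re/im)
bands = []
for (lo, hi), proj in zip(zip(band_edges[:-1], band_edges[1:]), projs):
    sub = spec[:, lo:hi].permute(0, 2, 1, 3).flatten(2)  # (B, T, 2*(hi-lo))
    bands.append(proj(sub))                               # (B, T, d_model)
band_feats = torch.stack(bands, dim=1)         # (B, n_bands, T, d_model)
```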
[220] BioSonix: Can Physics-Based Sonification Perceptualize Tissue Deformations From Tool Interactions?
Veronica Ruozzi, Sasan Matinfar, Laura Schütz, Benedikt Wiestler, Alberto Redaelli, Emiliano Votta, Nassir Navab
Main category: cs.SD
TL;DR: BioSonix is a physics-informed framework that uses auditory feedback to represent tool-tissue interactions in mixed reality surgical environments, helping surgeons better understand complex deformable tissue dynamics through sound.
Details
Motivation: Perceptual challenges in surgical procedures where unimodal visualization fails to capture tool-tissue interactions due to occlusion and limited depth perception, particularly with soft tissue deformations.Method: Physics-informed design framework that computes excitation forces from tissue displacements to generate sound models encoding tissue properties (stiffness, density). Uses biomechanical simulations for particle displacements and optimization for diverse interaction scenarios.
Result: Strong correlation between tool-tissue dynamics and auditory profiles. User studies showed high task accuracy with clinical professionals and 22 biomedical experts achieving high discrimination accuracy in tissue differentiation and targeting tasks.
Conclusion: Auditory representations significantly enhance intuitive understanding of complex tool-tissue interactions, demonstrating potential for improved surgical navigation in mixed reality environments.
Abstract: Perceptualizing tool interactions with deformable structures in surgical procedures remains challenging, as unimodal visualization techniques often fail to capture the complexity of these interactions due to constraints such as occlusion and limited depth perception. This paper presents a novel approach to augment tool navigation in mixed reality environments by providing auditory representations of tool-tissue dynamics, particularly for interactions with soft tissue. BioSonix, a physics-informed design framework, utilizes tissue displacements in 3D space to compute excitation forces for a sound model encoding tissue properties such as stiffness and density. Biomechanical simulations were employed to model particle displacements resulting from tool-tissue interactions, establishing a robust foundation for the method. An optimization approach was used to define configurations for capturing diverse interaction scenarios with varying tool trajectories. Experiments were conducted to validate the accuracy of the sound-displacement mappings. Additionally, two user studies were performed: the first involved two clinical professionals (a neuroradiologist and a cardiologist), who confirmed the method’s impact and achieved high task accuracy; the second included 22 biomedical experts, who demonstrated high discrimination accuracy in tissue differentiation and targeting tasks. The results revealed a strong correlation between tool-tissue dynamics and their corresponding auditory profiles, highlighting the potential of these sound representations to enhance the intuitive understanding of complex interactions.
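As a toy illustration of the general principle (not the BioSonix sound model itself), the sketch below derives an excitation signal from a displacement trace and uses it to drive a damped oscillator whose pitch stands in for tissue stiffness; every constant here is an assumption.

```python
# Toy physics-to-sound mapping: displacement -> excitation force ->
# damped oscillator whose frequency encodes an assumed stiffness value.
import numpy as np

sr = 44100
t = np.arange(sr) / sr                             # 1 s of audio
displacement = np.abs(np.sin(2 * np.pi * 2 * t))   # toy displacement trace
excitation = np.gradient(displacement) * sr        # force ~ rate of change

stiffness = 0.8                                    # 0..1, assumed property
f0 = 200 + 600 * stiffness                         # stiffer -> higher pitch
audio = excitation * np.sin(2 * np.pi * f0 * t) * np.exp(-4 * t)
audio /= np.max(np.abs(audio)) + 1e-9              # normalize to [-1, 1]
```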
[221] ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signal
Yucong Zhang, Juan Liu, Ming Li
Main category: cs.SD
TL;DR: A novel foundation model for machine signal processing that integrates band-split architecture with relative frequency positional embeddings, enabling arbitrary-length inputs and achieving state-of-the-art performance on anomaly detection and fault identification tasks.
Details
Motivation: Pre-trained foundation models have shown success in vision and language but remain under-explored for general machine signal modeling. Existing approaches have limitations including fixed input lengths and lack of explicit frequency positional encoding.Method: Proposes a foundation model with advanced band-split architecture and relative frequency positional embeddings, supporting inputs of arbitrary length without padding or segmentation while maintaining temporal and spectral fidelity.
Result: Achieves consistent state-of-the-art performance on the SIREN benchmark (including DCASE challenges 2020-2025 and industrial signal corpora) for anomaly detection and fault identification.
Conclusion: The proposed model demonstrates effectiveness and strong generalization capability for machine signal encoding tasks, with open-source implementation available as ECHO.
Abstract: Pre-trained foundation models have demonstrated remarkable success in vision and language, yet their potential for general machine signal modeling (covering acoustic, vibration, and other industrial sensor data) remains under-explored. Existing approaches using sub-band-based encoders have achieved competitive results but are limited by fixed input lengths and the absence of explicit frequency positional encoding. In this work, we propose a novel foundation model that integrates an advanced band-split architecture with relative frequency positional embeddings, enabling precise spectral localization across arbitrary sampling configurations. The model supports inputs of arbitrary length without padding or segmentation, producing a concise embedding that retains both temporal and spectral fidelity. We evaluate our method on SIREN (https://github.com/yucongzh/SIREN), a newly introduced large-scale benchmark for machine signal encoding that unifies multiple datasets, including all DCASE task 2 challenges (2020-2025) and widely-used industrial signal corpora. Experimental results demonstrate consistent state-of-the-art performance in anomaly detection and fault identification, confirming the effectiveness and generalization capability of the proposed model. We open-sourced ECHO on https://github.com/yucongzh/ECHO.
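One way to make frequency positional encoding relative rather than absolute is to tag each band with its center frequency as a fraction of the Nyquist frequency, so the same encoding transfers across sampling configurations. The sketch below uses a sinusoidal scheme as an assumed stand-in for ECHO's actual embedding.

```python
# Sketch of relative frequency positional embeddings: positions are
# normalized center frequencies (0..1 of Nyquist), so they remain
# comparable across sampling rates. The sinusoidal form is an assumption.
import torch

def rel_freq_embedding(band_centers_hz, sample_rate, dim=64):
    rel = torch.tensor(band_centers_hz) / (sample_rate / 2)  # 0..1
    scales = torch.exp(torch.linspace(0, -6, dim // 2))      # decaying scales
    angles = rel[:, None] * scales[None, :] * torch.pi
    return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (bands, dim)

emb = rel_freq_embedding([250.0, 1000.0, 4000.0], sample_rate=16000)
```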
[222] FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi
Main category: cs.SD
TL;DR: FMSD-TTS is a few-shot, multi-speaker, multi-dialect text-to-speech framework for Tibetan that synthesizes parallel dialectal speech using limited reference audio and dialect labels, outperforming baselines in dialect expressiveness and speaker similarity.
Details
Motivation: Tibetan is a low-resource language with minimal parallel speech corpora across its three major dialects (U-Tsang, Amdo, Kham), which limits progress in speech modeling and synthesis.Method: Proposes FMSD-TTS with a speaker-dialect fusion module and Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity.
Result: Extensive evaluations show FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. Validated through speech-to-speech dialect conversion task.
Conclusion: Contributions include: novel few-shot TTS system for Tibetan multi-dialect synthesis, public release of large-scale synthetic Tibetan speech corpus, and open-source evaluation toolkit for standardized assessment.
Abstract: Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.
[223] When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
Bodam Kim, Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin
Main category: cs.SD
TL;DR: WhisperInject is a two-stage adversarial audio attack framework that manipulates audio language models to generate harmful content using imperceptible perturbations, achieving over 86% success rate across multiple models.
Details
Motivation: As audio becomes a key interface for human-AI interaction, it introduces new vulnerabilities where adversaries can exploit audio inputs to manipulate AI systems, creating a need to understand and address these audio-native threats.Method: A two-stage framework: 1) Reinforcement Learning with Projected Gradient Descent (RL-PGD) to bypass safety protocols and generate harmful native responses, 2) Payload Injection using PGD to embed subtle perturbations into benign audio carriers like weather queries.
Result: The attack achieved success rates exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal models, validated under StrongREJECT, LlamaGuard, and Human Evaluation safety frameworks.
Conclusion: This work demonstrates practical audio-native threats that move beyond theoretical exploits, revealing a feasible and covert method for manipulating AI behavior through imperceptible audio perturbations.
Abstract: As large language models become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that can manipulate state-of-the-art audio language models to generate harmful content. Our method uses imperceptible perturbations in audio inputs that remain benign to human listeners. The first stage uses a novel reward-based optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the target model to circumvent its own safety protocols and generate harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use Projected Gradient Descent (PGD) to optimize subtle perturbations that are embedded into benign audio carriers, such as weather queries or greeting messages. Validated under the rigorous StrongREJECT, LlamaGuard, and Human Evaluation safety evaluation frameworks, our experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating AI behavior.
[224] Is Transfer Learning Necessary for Violin Transcription?
Yueh-Po Peng, Ting-Kang Wang, Li Su, Vincent K. M. Cheung
Main category: cs.SD
TL;DR: Violin music transcription can achieve competitive performance by training from scratch on violin-specific data, rather than relying on piano-pretrained models, despite the smaller dataset size.
Details
Motivation: Violin automatic music transcription (AMT) lags behind piano AMT due to limited annotated data. The effectiveness of transferring piano-pretrained models to violin transcription is unclear due to timbral and articulatory differences between instruments.Method: Used a piano transcription architecture without modification and trained it from scratch on the MOSA dataset (30 hours of aligned violin recordings). Compared performance against fine-tuned piano-pretrained models on URMP and Bach10 datasets.
Result: Models trained from scratch on violin data achieved competitive or even superior performance compared to fine-tuned piano-pretrained counterparts.
Conclusion: Strong violin AMT is possible without relying on piano pretraining, emphasizing the importance of instrument-specific data collection and augmentation strategies rather than transfer learning from dissimilar instruments.
Abstract: Automatic music transcription (AMT) has achieved remarkable progress for instruments such as the piano, largely due to the availability of large-scale, high-quality datasets. In contrast, violin AMT remains underexplored due to limited annotated data. A common approach is to fine-tune pretrained models for other downstream tasks, but the effectiveness of such transfer remains unclear in the presence of timbral and articulatory differences. In this work, we investigate whether training from scratch on a medium-scale violin dataset can match the performance of fine-tuned piano-pretrained models. We adopt a piano transcription architecture without modification and train it on the MOSA dataset, which contains about 30 hours of aligned violin recordings. Our experiments on URMP and Bach10 show that models trained from scratch achieved competitive or even superior performance compared to fine-tuned counterparts. These findings suggest that strong violin AMT is possible without relying on pretrained piano representations, highlighting the importance of instrument-specific data collection and augmentation strategies.
cs.LG
[225] Deep Learning for School Dropout Detection: A Comparison of Tabular and Graph-Based Models for Predicting At-Risk Students
Pablo G. Almeida, Guilherme A. L. Silva, Valéria Santos, Gladston Moreira, Pedro Silva, Eduardo Luz
Main category: cs.LG
TL;DR: GNNs using graph structures from PCA-KMeans clustering outperformed tabular models for student dropout prediction, with GraphSAGE achieving 7% higher F1-score and 2% higher accuracy than XGBoost, but performance depends heavily on graph construction method.
Details
Motivation: Student dropout is a major educational challenge with significant social and economic costs. While traditional ML models work on tabular data, GNNs could potentially capture complex relationships in student data when structured as graphs, potentially improving prediction accuracy for timely interventions.Method: Transformed tabular student data into graph structures using clustering techniques (K-Means, HDBSCAN) and dimensionality reduction (PCA, UMAP). Compared GNN performance (custom GCN and GraphSAGE) against tabular models (Random Forest, XGBoost, TabNet) on real-world student dataset.
Result: GraphSAGE on PCA-KMeans derived graph achieved best performance: ~7 percentage points higher macro F1-score and ~2 percentage points higher accuracy than strongest tabular baseline (XGBoost). However, other GNN configurations did not consistently outperform tabular models.
Conclusion: GNNs show potential for student dropout prediction but performance is highly dependent on graph generation strategy and architecture selection. Optimal transformation of tabular data for graph-based learning remains challenging, with PCA-KMeans clustering proving most effective in this study.
Abstract: Student dropout is a significant challenge in educational systems worldwide, leading to substantial social and economic costs. Predicting students at risk of dropout allows for timely interventions. While traditional Machine Learning (ML) models operating on tabular data have shown promise, Graph Neural Networks (GNNs) offer a potential advantage by capturing complex relationships inherent in student data if structured as graphs. This paper investigates whether transforming tabular student data into graph structures, primarily using clustering techniques, enhances dropout prediction accuracy. We compare the performance of GNNs (a custom Graph Convolutional Network (GCN) and GraphSAGE) on these generated graphs against established tabular models (Random Forest (RF), XGBoost, and TabNet) using a real-world student dataset. Our experiments explore various graph construction strategies based on different clustering algorithms (K-Means, HDBSCAN) and dimensionality reduction techniques (Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP)). Our findings demonstrate that a specific GNN configuration, GraphSAGE on a graph derived from PCA-KMeans clustering, achieved superior performance, notably improving the macro F1-score by approximately 7 percentage points and accuracy by nearly 2 percentage points over the strongest tabular baseline (XGBoost). However, other GNN configurations and graph construction methods did not consistently surpass tabular models, emphasizing the critical role of the graph generation strategy and GNN architecture selection. This highlights both the potential of GNNs and the challenges in optimally transforming tabular data for graph-based learning in this domain.
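A hedged sketch of the graph construction that performed best in the paper (PCA followed by K-Means): tabular features are reduced, clustered, and students are connected within their cluster. Linking each student to its k nearest neighbors inside the cluster is an assumption made here to keep the graph sparse; component and cluster counts are illustrative.

```python
# Sketch: tabular -> PCA -> K-Means -> intra-cluster kNN edge list.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 20)                 # placeholder student features
Z = PCA(n_components=8).fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10).fit_predict(Z)

edges = []
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    k = min(5, len(idx) - 1)                # neighbors per node (assumed)
    if k < 1:
        continue
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Z[idx])
    _, nbrs = nn.kneighbors(Z[idx])
    for i, row in zip(idx, nbrs):
        edges += [(i, idx[j]) for j in row[1:]]   # row[0] is the node itself
edge_index = np.array(edges).T              # (2, num_edges), ready for a GNN
```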
[226] Load Forecasting on A Highly Sparse Electrical Load Dataset Using Gaussian Interpolation
Chinmoy Biswas, Nafis Faisal, Vivek Chowdhury, Abrar Al-Shadid Abir, Sabir Mahmud, Mithon Rahman, Shaikh Anowarul Fattah, Hafiz Imtiaz
Main category: cs.LG
TL;DR: Gaussian interpolation effectively handles 62% sparse power plant load data for forecasting when data is Wide Sense Stationary, with LSTM models performing best among various ML approaches.
Details
Motivation: Sparsity in real-world datasets poses challenges for forecasting tasks, and traditional interpolation methods work best with Strict Sense Stationary data, but many real datasets are only Wide Sense Stationary.Method: Statistical analysis of hourly load data, application of Gaussian interpolation for sparse data handling, and comparison of multiple machine learning and deep learning models including classical methods and neural networks.
Result: Gaussian interpolation proved suitable for handling approximately 62% sparse load data, and LSTM-based neural network models achieved the best forecasting performance among all tested models.
Conclusion: Gaussian interpolation is an effective approach for sparse load forecasting problems with WSS data, and LSTM neural networks provide superior forecasting accuracy compared to other machine learning methods.
Abstract: Sparsity, defined as the presence of missing or zero values in a dataset, often poses a major challenge while operating on real-life datasets. Sparsity in features or target data of the training dataset can be handled using various interpolation methods, such as linear or polynomial interpolation, spline, moving average, or can be simply imputed. Interpolation methods usually perform well with Strict Sense Stationary (SSS) data. In this study, we show that an approximately 62% sparse dataset with hourly load data of a power plant can be utilized for load forecasting assuming the data is Wide Sense Stationary (WSS), if augmented with Gaussian interpolation. More specifically, we perform statistical analysis on the data, and train multiple machine learning and deep learning models on the dataset. By comparing the performance of these models, we empirically demonstrate that Gaussian interpolation is a suitable option for dealing with load forecasting problems. Additionally, we demonstrate that Long Short-term Memory (LSTM)-based neural network model offers the best performance among a diverse set of classical and neural network-based models.
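The summary does not pin down the exact formulation of the Gaussian interpolation step; one plausible reading is Gaussian-kernel (Nadaraya-Watson) smoothing over the observed samples, sketched below with an assumed bandwidth.

```python
# Hedged sketch: fill missing hourly loads with a Gaussian-weighted
# average of observed samples. Bandwidth (in hours) is an assumption.
import numpy as np

def gaussian_interpolate(y: np.ndarray, bandwidth: float = 3.0) -> np.ndarray:
    t = np.arange(len(y), dtype=float)
    obs = ~np.isnan(y)
    out = y.copy()
    for i in np.where(~obs)[0]:
        w = np.exp(-0.5 * ((t[obs] - i) / bandwidth) ** 2)
        out[i] = np.sum(w * y[obs]) / np.sum(w)
    return out

load = np.array([5.1, np.nan, np.nan, 6.3, np.nan, 7.0, 6.8])
print(gaussian_interpolate(load))
```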
[227] Edge-Selector Model Applied for Local Search Neighborhood for Solving Vehicle Routing Problems
Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux, Daniele Vigo
Main category: cs.LG
TL;DR: Hybrid ML and metaheuristic approach for Vehicle Routing Problems using edge solution selector models to guide local search by identifying prohibited moves, achieving scalable performance improvements.
Details
Motivation: To enhance metaheuristic solutions for Vehicle Routing Problems by integrating machine learning to intelligently guide the search process and avoid unproductive moves.Method: Two learning mechanisms: tabular binary classifier (Gradient Boosting Trees and Feedforward NN) and Graph Neural Network (GNN) to classify solution edges and predict prohibited moves during local search in metaheuristic baselines.
Result: Demonstrated scalability and generalizability across different metaheuristics, problem sizes up to 30,000 nodes, and variants including CVRP and CVRPTW, with statistically verified performance improvements.
Conclusion: The hybrid ML-metaheuristic approach effectively guides local search through edge classification, providing significant and scalable improvements for complex Vehicle Routing Problems.
Abstract: This research proposes a hybrid Machine Learning and metaheuristic mechanism that is designed to solve Vehicle Routing Problems (VRPs). The core of our method is an edge solution selector model, which classifies solution edges to identify prohibited moves during the local search, hence guiding the search process within metaheuristic baselines. Two learning-based mechanisms are used to develop the edge selector: a simple tabular binary classifier and a Graph Neural Network (GNN). The tabular classifier employs Gradient Boosting Trees and Feedforward Neural Network as the baseline algorithms. Adjustments to the decision threshold are also applied to handle the class imbalance in the problem instance. An alternative mechanism employs the GNN to utilize graph structure for direct solution edge prediction, with the objective of guiding local search by predicting prohibited moves. These hybrid mechanisms are then applied in state-of-the-art metaheuristic baselines. Our method demonstrates both scalability and generalizability, achieving performance improvements across different baseline metaheuristics, various problem sizes and variants, including the Capacitated Vehicle Routing Problem (CVRP) and CVRP with Time Windows (CVRPTW). Experimental evaluations on benchmark datasets up to 30,000 customer nodes, supported by pair-wise statistical analysis, verify the observed improvements.
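A minimal sketch of the tabular variant of the edge selector: a gradient-boosted classifier scores solution edges and a decision threshold below 0.5 compensates for the class imbalance the abstract mentions. The edge features, threshold value, and data here are illustrative.

```python
# Sketch: score edges as "prohibited" and lower the decision threshold
# to counter class imbalance. Features and threshold are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.random((2000, 4))                  # per-edge features (assumed)
y = (rng.random(2000) < 0.1).astype(int)   # imbalanced: few prohibited edges

clf = GradientBoostingClassifier().fit(X, y)
proba = clf.predict_proba(X)[:, 1]
prohibited = proba > 0.3                   # threshold tuned below 0.5
# A local search would then skip neighborhood moves touching these edges.
```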
[228] Multi-Objective Bayesian Optimization with Independent Tanimoto Kernel Gaussian Processes for Diverse Pareto Front Exploration
Anabel Yong
Main category: cs.LG
TL;DR: GP-MOBO is a novel multi-objective Bayesian Optimization algorithm that outperforms traditional methods in molecular optimization by efficiently handling full-dimensional molecular fingerprints with minimal computational resources.
Details
Motivation: To advance molecular optimization by addressing the limitations of traditional methods in handling high-dimensional sparse molecular fingerprints and achieving better exploration of chemical search space with reduced computational overhead.Method: Integrates a fast minimal package for Exact Gaussian Processes (GPs) that efficiently handles full dimensionality of sparse molecular fingerprints without extensive computational resources.
Result: Consistently outperforms traditional GP-BO, identifies higher-quality valid SMILES, achieves broader chemical space exploration with superior proximity to Pareto front, and yields higher geometric mean values across 20 BO iterations on DockSTRING dataset.
Conclusion: GP-MOBO is an effective and efficient solution for complex multi-objective optimization challenges in molecular discovery, demonstrating superior performance with minimal computational overhead.
Abstract: We present GP-MOBO, a novel multi-objective Bayesian Optimization algorithm that advances the state-of-the-art in molecular optimization. Our approach integrates a fast minimal package for Exact Gaussian Processes (GPs) capable of efficiently handling the full dimensionality of sparse molecular fingerprints without the need for extensive computational resources. GP-MOBO consistently outperforms traditional methods like GP-BO by fully leveraging fingerprint dimensionality, leading to the identification of higher-quality and valid SMILES. Moreover, our model achieves a broader exploration of the chemical search space, as demonstrated by its superior proximity to the Pareto front in all tested scenarios. Empirical results from the DockSTRING dataset reveal that GP-MOBO yields higher geometric mean values across 20 Bayesian optimization iterations, underscoring its effectiveness and efficiency in addressing complex multi-objective optimization challenges with minimal computational overhead.
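The Tanimoto kernel named in the title has a simple closed form over fingerprint vectors, k(x, y) = ⟨x, y⟩ / (‖x‖² + ‖y‖² − ⟨x, y⟩), which for binary fingerprints reduces to the Jaccard similarity. A minimal NumPy sketch:

```python
# Tanimoto kernel Gram matrix over (binary) molecular fingerprints.
import numpy as np

def tanimoto_kernel(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    dot = X @ Y.T
    x_sq = (X ** 2).sum(axis=1)[:, None]
    y_sq = (Y ** 2).sum(axis=1)[None, :]
    return dot / (x_sq + y_sq - dot)

fp = np.random.default_rng(1).integers(0, 2, size=(5, 2048)).astype(float)
K = tanimoto_kernel(fp, fp)   # (5, 5) Gram matrix; diagonal entries are 1
```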
[229] Generative AI Against Poaching: Latent Composite Flow Matching for Wildlife Conservation
Lingkai Kong, Haichuan Wang, Charles A. Emogor, Vincent Börsch-Supan, Lily Xu, Milind Tambe
Main category: cs.LG
TL;DR: A flow matching approach for poaching prediction that addresses imperfect detection through occupancy modeling and data scarcity through composite flow initialization from linear models, showing improved accuracy in Ugandan national parks.
Details
Motivation: Poaching threatens wildlife and biodiversity, and forecasting poacher behavior is crucial for effective patrol planning. Existing methods using linear models or decision trees lack expressivity for complex spatiotemporal patterns.Method: Integrates flow matching with occupancy-based detection model to handle imperfect detection, trains flow in latent space to infer underlying occupancy state, and uses composite flow initialized from linear-model predictions rather than random noise to address data scarcity.
Result: Evaluations on datasets from two national parks in Uganda show consistent gains in predictive accuracy compared to existing methods.
Conclusion: The proposed approach effectively addresses key challenges in poaching prediction (imperfect detection and data scarcity) and demonstrates improved performance in real-world conservation settings.
Abstract: Poaching poses significant threats to wildlife and biodiversity. A valuable step in reducing poaching is to forecast poacher behavior, which can inform patrol planning and other conservation interventions. Existing poaching prediction methods based on linear models or decision trees lack the expressivity to capture complex, nonlinear spatiotemporal patterns. Recent advances in generative modeling, particularly flow matching, offer a more flexible alternative. However, training such models on real-world poaching data faces two central obstacles: imperfect detection of poaching events and limited data. To address imperfect detection, we integrate flow matching with an occupancy-based detection model and train the flow in latent space to infer the underlying occupancy state. To mitigate data scarcity, we adopt a composite flow initialized from a linear-model prediction rather than the random noise that is standard in diffusion models, injecting prior knowledge and improving generalization. Evaluations on datasets from two national parks in Uganda show consistent gains in predictive accuracy.
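A hedged sketch of the composite-flow training step: source samples are drawn around a linear model's prediction rather than from pure Gaussian noise, and a small network regresses the straight-line velocity toward the target, i.e., the usual conditional flow matching objective. Shapes, the noise scale, and the stand-in linear model are assumptions.

```python
# Sketch of flow matching with an informed (linear-model) source.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))

def flow_matching_loss(x1, linear_pred, sigma=0.1):
    x0 = linear_pred + sigma * torch.randn_like(x1)  # informed source sample
    t = torch.rand(x1.size(0), 1)                    # random time in [0, 1)
    xt = (1 - t) * x0 + t * x1                       # linear interpolant
    v_target = x1 - x0                               # constant true velocity
    v_pred = net(torch.cat([xt, t], dim=1))
    return ((v_pred - v_target) ** 2).mean()

x1 = torch.randn(32, 2)        # toy target samples (e.g., occupancy latents)
lin = 0.5 * x1 + 0.1           # stand-in for the linear model's prediction
flow_matching_loss(x1, lin).backward()
```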
[230] MCLPD:Multi-view Contrastive Learning for EEG-based PD Detection Across Datasets
Qian Zhang, Ruilin Zhang, Jun Xiao, Yifan Liu, Zhe Wang
Main category: cs.LG
TL;DR: Proposes MCLPD, a semi-supervised learning framework combining multi-view contrastive pre-training with lightweight supervised fine-tuning to improve cross-dataset Parkinson’s disease detection from EEG data with limited labeled data.
Details
Motivation: Addresses challenges in Parkinson's disease detection from EEG data, including high annotation costs, limited dataset sizes, and dataset discrepancies that hinder model robustness and generalizability in cross-dataset scenarios.Method: Uses self-supervised learning on unlabeled UNM dataset with dual augmentations in time and frequency domains to build contrastive pairs, followed by lightweight supervised fine-tuning using only 1-5% labeled data from UI and UC datasets.
Result: Achieves F1 scores of 0.91 on UI and 0.81 on UC with 1% labeled data, improving to 0.97 and 0.87 respectively with 5% labeled data, significantly outperforming existing methods.
Conclusion: MCLPD substantially improves cross-dataset generalization while reducing dependency on labeled data, demonstrating effectiveness for robust Parkinson’s disease detection from EEG data.
Abstract: Electroencephalography has been validated as an effective technique for detecting Parkinson’s disease, particularly in its early stages. However, the high cost of EEG data annotation often results in limited dataset size and considerable discrepancies across datasets, including differences in acquisition protocols and subject demographics, which significantly hinder the robustness and generalizability of models in cross-dataset detection scenarios. To address such challenges, this paper proposes a semi-supervised learning framework named MCLPD, which integrates multi-view contrastive pre-training with lightweight supervised fine-tuning to enhance cross-dataset PD detection performance. During pre-training, MCLPD uses self-supervised learning on the unlabeled UNM dataset. To build contrastive pairs, it applies dual augmentations in both time and frequency domains, which enrich the data and naturally fuse time-frequency information. In the fine-tuning phase, only a small proportion of labeled data from another two datasets (UI and UC) is used for supervised optimization. Experimental results show that MCLPD achieves F1 scores of 0.91 on UI and 0.81 on UC using only 1% of labeled data, which further improve to 0.97 and 0.87, respectively, when 5% of labeled data is used. Compared to existing methods, MCLPD substantially improves cross-dataset generalization while reducing the dependency on labeled data, demonstrating the effectiveness of the proposed framework.
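To illustrate the dual-augmentation idea, the sketch below produces one time-domain view (scaling plus jitter) and one frequency-domain view (random FFT-bin masking) of the same EEG window to form a contrastive pair; the specific transforms and parameters are assumptions, not MCLPD's exact recipe.

```python
# Sketch: two augmented views of one EEG window for contrastive learning.
import torch

def time_view(x, jitter=0.05, scale=0.1):
    # Random amplitude scaling plus additive noise in the time domain.
    return x * (1 + scale * torch.randn(1)) + jitter * torch.randn_like(x)

def freq_view(x, mask_ratio=0.1):
    # Randomly zero a fraction of FFT bins, then transform back.
    X = torch.fft.rfft(x, dim=-1)
    mask = torch.rand(X.shape[-1]) > mask_ratio
    return torch.fft.irfft(X * mask, n=x.shape[-1], dim=-1)

eeg = torch.randn(32, 19, 512)                     # (batch, channels, samples)
view_a, view_b = time_view(eeg), freq_view(eeg)    # a positive pair
```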
[231] GEPD:GAN-Enhanced Generalizable Model for EEG-Based Detection of Parkinson’s Disease
Qian Zhang, Ruilin Zhang, Biaokai Zhu, Xun Han, Jun Xiao, Yifan Liu, Zhe Wang
Main category: cs.LG
TL;DR: Proposes GEPD, a GAN-enhanced model for cross-dataset Parkinson’s disease detection using EEG signals, achieving 84.3% accuracy with improved generalizability.
Details
Motivation: Address variability in EEG detection methods across different datasets and small dataset sizes that challenge training generalizable models for cross-dataset Parkinson's disease classification.Method: Uses generative network to create fusion EEG data by controlling distribution similarity, EEG signal quality assessment model, and classification network with multiple CNNs to capture time-frequency characteristics while maintaining generalizable structure.
Result: Achieves 84.3% accuracy and 84.0% F1-score in cross-dataset settings, performing comparably to state-of-the-art models.
Conclusion: The proposed GEPD model demonstrates strong generalizability for EEG-based Parkinson’s disease detection across different datasets, facilitating diagnosis and monitoring of neurological diseases.
Abstract: Electroencephalography has been established as an effective method for detecting Parkinson’s disease, which is typically diagnosed early. Current Parkinson’s disease detection methods have shown significant success within individual datasets; however, the variability in detection methods across different EEG datasets and the small size of each dataset pose challenges for training a generalizable model for cross-dataset scenarios. To address these issues, this paper proposes a GAN-enhanced generalizable model, named GEPD, specifically for EEG-based cross-dataset classification of Parkinson’s disease. First, we design a generative network that creates fusion EEG data by controlling the distribution similarity between generated data and real data. In addition, an EEG signal quality assessment model is designed to ensure the quality of the generated data. Second, we design a classification network that utilizes a combination of multiple convolutional neural networks to effectively capture the time-frequency characteristics of EEG signals, while maintaining a generalizable structure and ensuring easy convergence. This work is dedicated to utilizing intelligent methods to study pathological manifestations, aiming to facilitate the diagnosis and monitoring of neurological diseases. The evaluation results demonstrate that our model performs comparably to state-of-the-art models in cross-dataset settings, achieving an accuracy of 84.3% and an F1-score of 84.0%, showcasing the generalizability of the proposed model.
[232] Explainable Graph Spectral Clustering For Text Embeddings
Mieczysław A. Kłopotek, Sławomir T. Wierzchoń, Bartłomiej Starosta, Piotr Borkowski, Dariusz Czerski, Eryk Laskowski
Main category: cs.LG
TL;DR: Generalizing explainability methods for Graph Spectral Clustering from term vector embeddings to other document embeddings like GloVe
Details
Motivation: To extend previous work on explaining Graph Spectral Clustering results beyond cosine similarity in term vector space to include other document embeddings.Method: Generalize the explainability approach by considering different document embeddings, particularly those based on the GloVe embedding methodology
Result: A more comprehensive framework for explaining clustering results across various document representation methods
Conclusion: The proposed generalization enables explainability of Graph Spectral Clustering for diverse document embedding approaches beyond traditional term vector spaces
Abstract: In a previous paper, we proposed an introduction to the explainability of Graph Spectral Clustering results for textual documents, given that document similarity is computed as cosine similarity in term vector space. In this paper, we generalize this idea by considering other embeddings of documents, in particular, based on the GloVe embedding idea.
[233] PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning
Mengdi Li, Guanqiao Chen, Xufeng Zhao, Haochen Wen, Shu Yang, Di Wang
Main category: cs.LG
TL;DR: PersRM-R1 is a reasoning-based reward modeling framework that uses synthetic data generation and a two-stage training pipeline to capture nuanced personal preferences from limited exemplars, outperforming similar-sized models and matching larger models.
Details
Motivation: Existing reward models struggle to capture user-specific preferences with limited data across diverse domains, creating a need for more personalized alignment methods.Method: Combines synthetic data generation with a two-stage training pipeline: supervised fine-tuning followed by reinforcement fine-tuning to identify personal factors from few exemplars.
Result: Outperforms existing models of similar size and matches performance of much larger models in both accuracy and generalizability.
Conclusion: Paves the way for more effective personalized LLMs by enabling better capture of nuanced user preferences from limited data.
Abstract: Reward models (RMs), which are central to existing post-training methods, aim to align LLM outputs with human values by providing feedback signals during fine-tuning. However, existing RMs struggle to capture nuanced, user-specific preferences, especially under limited data and across diverse domains. Thus, we introduce PersRM-R1, the first reasoning-based reward modeling framework specifically designed to identify and represent personal factors from only one or a few personal exemplars. To address challenges including limited data availability and the requirement for robust generalization, our approach combines synthetic data generation with a two-stage training pipeline consisting of supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability, paving the way for more effective personalized LLMs.
[234] Label Smoothing is a Pragmatic Information Bottleneck
Sota Kudo
Main category: cs.LG
TL;DR: Label smoothing is shown to be equivalent to information bottleneck optimization, providing theoretical justification and demonstrating robustness to irrelevant factors.
Details
Motivation: To provide a theoretical foundation for label smoothing by connecting it to information bottleneck theory and understand its regularization properties.Method: Theoretical analysis under assumptions of sufficient model flexibility and no conflicting labels, combined with experimental validation showing label smoothing’s insensitivity to irrelevant information factors.
Result: Label smoothing explores the optimal solution of information bottleneck, making it a practical implementation approach that exhibits robustness to non-informative factors.
Conclusion: Label smoothing can be interpreted as a practical information bottleneck method that provides regularization by being insensitive to irrelevant information while maintaining simple implementation.
Abstract: This study revisits label smoothing via a form of information bottleneck. Under the assumption of sufficient model flexibility and no conflicting labels for the same input, we theoretically and experimentally demonstrate that the model output obtained through label smoothing explores the optimal solution of the information bottleneck. Based on this, label smoothing can be interpreted as a practical approach to the information bottleneck, enabling simple implementation. As an information bottleneck method, we experimentally show that label smoothing also exhibits the property of being insensitive to factors that do not contain information about the target, or to factors that provide no additional information about it when conditioned on another variable.
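For concreteness, the standard label-smoothing target the analysis concerns mixes the one-hot label with the uniform distribution:

```latex
% Smoothed target over K classes with smoothing strength \varepsilon:
\tilde{q}(k) = (1 - \varepsilon)\,\mathbf{1}[k = y] + \frac{\varepsilon}{K},
\qquad k = 1, \dots, K
```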
[235] Out-of-Sample Hydrocarbon Production Forecasting: Time Series Machine Learning using Productivity Index-Driven Features and Inductive Conformal Prediction
Mohamed Hassan Abdalla Idris, Jakub Marek Cebula, Jebraeel Gholinezhad, Shamsul Masum, Hongjie Ma
Main category: cs.LG
TL;DR: New ML framework combining reservoir engineering knowledge (PI-based feature selection) with Inductive Conformal Prediction for robust oil production forecasting, showing LSTM outperforms other models.
Details
Motivation: To enhance robustness of out-of-sample hydrocarbon production forecasting by addressing multivariate time series analysis challenges and providing rigorous uncertainty quantification.Method: Integrates Productivity Index-driven feature selection with Inductive Conformal Prediction. Tests LSTM, BiLSTM, GRU, and XGBoost algorithms on Volve and Norne oil field data for oil production rate forecasting.
Result: LSTM achieved lowest MAE (19.468 test, 29.638 out-of-sample) for well PF14. PI-based feature selection reduced input dimensionality. ICP provided valid 95% prediction intervals without distributional assumptions.
Conclusion: Combining domain-specific knowledge with advanced ML techniques significantly improves reliability of hydrocarbon production forecasts, with LSTM showing superior performance.
Abstract: This research introduces a new ML framework designed to enhance the robustness of out-of-sample hydrocarbon production forecasting, specifically addressing multivariate time series analysis. The proposed methodology integrates Productivity Index (PI)-driven feature selection, a concept derived from reservoir engineering, with Inductive Conformal Prediction (ICP) for rigorous uncertainty quantification. Utilizing historical data from the Volve (wells PF14, PF12) and Norne (well E1H) oil fields, this study investigates the efficacy of various predictive algorithms, namely Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit (GRU), and eXtreme Gradient Boosting (XGBoost), in forecasting historical oil production rates (OPR_H). All the models achieved “out-of-sample” production forecasts for an upcoming future timeframe. Model performance was comprehensively evaluated using traditional error metrics (e.g., MAE) supplemented by Forecast Bias and Prediction Direction Accuracy (PDA) to assess bias and trend-capturing capabilities. The PI-based feature selection effectively reduced input dimensionality compared to conventional numerical simulation workflows. The uncertainty quantification was addressed using the ICP framework, a distribution-free approach that guarantees valid prediction intervals (e.g., 95% coverage) without reliance on distributional assumptions, offering a distinct advantage over traditional confidence intervals, particularly for complex, non-normal data. Results demonstrated the superior performance of the LSTM model, achieving the lowest MAE on test (19.468) and genuine out-of-sample forecast data (29.638) for well PF14, with subsequent validation on Norne well E1H. These findings highlight the significant potential of combining domain-specific knowledge with advanced ML techniques to improve the reliability of hydrocarbon production forecasts.
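A minimal sketch of the inductive (split) conformal step described above: absolute residuals on a held-out calibration set yield a quantile that widens point forecasts into intervals with guaranteed marginal coverage. The toy data and the absolute-residual score are assumptions.

```python
# Split conformal prediction: calibration residual quantile -> intervals.
import numpy as np

def icp_interval(cal_true, cal_pred, test_pred, alpha=0.05):
    resid = np.abs(cal_true - cal_pred)
    n = len(resid)
    # Finite-sample-corrected quantile gives (1 - alpha) marginal coverage.
    q = np.quantile(resid, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return test_pred - q, test_pred + q

cal_y = np.random.rand(100) * 50           # toy calibration targets
cal_hat = cal_y + np.random.randn(100)     # toy calibration forecasts
lo, hi = icp_interval(cal_y, cal_hat, test_pred=np.array([42.0, 37.5]))
```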
[236] A Guide to Robust Generalization: The Impact of Architecture, Pre-training, and Optimization Strategy
Maxime Heuillet, Rishika Bhagwatkar, Jonas Ngnawé, Yann Pequignot, Alexandre Larouche, Christian Gagné, Irina Rish, Ola Ahmad, Audrey Durand
Main category: cs.LG
TL;DR: Comprehensive empirical study on robust fine-tuning of deep learning models, analyzing 1,440 training configurations across various architectures, pretraining methods, and adaptation protocols to understand design choices affecting robust generalization.
Details
Motivation: Deep learning models are vulnerable to input perturbations, and while robust fine-tuning has emerged as an efficient alternative to training from scratch, the impact of various design choices (architecture, pretraining, adaptation protocols) on robust generalization remains poorly understood.Method: Conducted large-scale empirical study spanning 6 datasets, 40 pretrained architectures, 2 specialized losses, and 3 adaptation protocols, resulting in 1,440 training configurations and 7,200 robustness measurements across five perturbation types.
Result: Found that convolutional neural networks pretrained in a supervised manner on large datasets often perform best, challenging the assumption that attention-based architectures and robust pretrained representations are superior.
Conclusion: The study provides both confirmation and challenges to prior design assumptions, offers practical guidance for robust fine-tuning, and highlights promising research directions in this area.
Abstract: Deep learning models operating in the image domain are vulnerable to small input perturbations. For years, robustness to such perturbations was pursued by training models from scratch (i.e., with random initializations) using specialized loss objectives. Recently, robust fine-tuning has emerged as a more efficient alternative: instead of training from scratch, pretrained models are adapted to maximize predictive performance and robustness. To conduct robust fine-tuning, practitioners design an optimization strategy that includes the model update protocol (e.g., full or partial) and the specialized loss objective. Additional design choices include the architecture type and size, and the pretrained representation. These design choices affect robust generalization, which is the model’s ability to maintain performance when exposed to new and unseen perturbations at test time. Understanding how these design choices influence generalization remains an open question with significant practical implications. In response, we present an empirical study spanning 6 datasets, 40 pretrained architectures, 2 specialized losses, and 3 adaptation protocols, yielding 1,440 training configurations and 7,200 robustness measurements across five perturbation types. To our knowledge, this is the most diverse and comprehensive benchmark of robust fine-tuning to date. While attention-based architectures and robust pretrained representations are increasingly popular, we find that convolutional neural networks pretrained in a supervised manner on large datasets often perform best. Our analysis both confirms and challenges prior design assumptions, highlighting promising research directions and offering practical guidance.
[237] KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge
Guanghao Jin, Jingpei Wu, Tianpei Guo, Yiyi Niu, Weidong Zhou, Guoyang Liu
Main category: cs.LG
TL;DR: A new benchmark called KnowDR-REC is proposed for evaluating knowledge-driven referring expression comprehension, featuring real-world knowledge requirements, negative samples, and novel evaluation metrics to assess multimodal reasoning capabilities.
Details
Motivation: Traditional REC benchmarks are inadequate for evaluating Multi-modal Large Language Models (MLLMs) because they either rely solely on intra-image cues or lack fine-grained instance annotations, failing to assess true reasoning capabilities.Method: The KnowDR-REC benchmark is built on real-world knowledge requiring fine-grained multimodal reasoning, includes negative samples constructed via expression editing to test robustness, and introduces three novel evaluation metrics to systematically explore model reasoning processes.
Result: Evaluation of 16 state-of-the-art multimodal models shows that existing MLLMs struggle with knowledge-driven visual grounding tasks, with many models influenced by memorized shortcut correlations that hinder genuine multimodal reasoning.
Conclusion: The benchmark reveals a decoupling between textual understanding and visual grounding in MLLMs and is expected to inspire development of more robust, interpretable, and knowledge-intensive visual grounding frameworks for reliable multimodal systems.
Abstract: Referring Expression Comprehension (REC) is a popular multimodal task that aims to accurately detect target objects within a single image based on a given textual expression. However, due to the limitations of earlier models, traditional REC benchmarks either rely solely on intra-image cues or lack sufficiently fine-grained instance annotations, making them inadequate for evaluating the reasoning capabilities of Multi-modal Large Language Models (MLLMs). To address this gap, we propose a new benchmark, KnowDR-REC, characterized by three key features: Firstly, it is built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image. Secondly, the dataset includes elaborately constructed negative samples via fine-grained expression editing, designed to evaluate a model’s robustness and anti-hallucination ability. Lastly, we introduce three novel evaluation metrics to systematically explore the model’s internal reasoning process. We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks. Furthermore, we observe a decoupling between textual understanding and visual grounding in MLLMs, where many models are significantly influenced by memorized shortcut correlations, which severely affect their behavior on our benchmark and hinder genuine multimodal reasoning. We anticipate that the proposed benchmark will inspire future research towards developing more robust, interpretable, and knowledge-intensive visual grounding frameworks, driving the development of more reliable and robust multimodal systems for complex real-world scenarios.
[238] Toward Lifelong Learning in Equilibrium Propagation: Sleep-like and Awake Rehearsal for Enhanced Stability
Yoshimasa Kubo, Jean Erik Delanois, Maxim Bazhenov
Main category: cs.LG
TL;DR: Sleep-like replay consolidation (SRC) algorithm helps EP-trained RNNs overcome catastrophic forgetting in continuous learning, matching or surpassing BPTT-trained models on multiple datasets.
Details
Motivation: Address catastrophic forgetting in RNNs trained with Equilibrium Propagation, inspired by human brain's memory consolidation during sleep through replay mechanisms.Method: Proposed sleep-like replay consolidation (SRC) algorithm for EP-trained RNNs, tested in class-incremental learning scenarios and combined with rehearsal techniques.
Result: SRC significantly improved RNN’s resilience to catastrophic forgetting. MRNN-EP with SRC performed on par with BPTT-trained models on MNIST and surpassed them on Fashion MNIST, Kuzushiji-MNIST, CIFAR10, and ImageNet datasets.
Conclusion: Sleep-like replay techniques are applicable to RNNs and show potential for integrating human-like learning behaviors into artificial neural networks.
Abstract: Recurrent neural networks (RNNs) trained using Equilibrium Propagation (EP), a biologically plausible training algorithm, have demonstrated strong performance in various tasks such as image classification and reinforcement learning. However, these networks face a critical challenge in continuous learning: catastrophic forgetting, where previously acquired knowledge is overwritten when new tasks are learned. This limitation contrasts with the human brain’s ability to retain and integrate both old and new knowledge, aided by processes like memory consolidation during sleep through the replay of learned information. To address this challenge in RNNs, here we propose a sleep-like replay consolidation (SRC) algorithm for EP-trained RNNs. We found that SRC significantly improves RNN’s resilience to catastrophic forgetting in continuous learning scenarios. In class-incremental learning with SRC implemented after each new task training, the EP-trained multilayer RNN model (MRNN-EP) performed significantly better compared to feedforward networks incorporating several well-established regularization techniques. The MRNN-EP performed on par with MRNN trained using Backpropagation Through Time (BPTT) when both were equipped with SRC on MNIST data and surpassed BPTT-based models on the Fashion MNIST, Kuzushiji-MNIST, CIFAR10, and ImageNet datasets. Combining SRC with rehearsal, also known as “awake replay”, further boosted the network’s ability to retain long-term knowledge while continuing to learn new tasks. Our study reveals the applicability of sleep-like replay techniques to RNNs and highlights the potential for integrating human-like learning behaviors into artificial neural networks (ANNs).
[239] Toward Generalist Semi-supervised Regression via Decoupled Representation Distillation
Ye Su, Hezhe Qiao, Wei Huang, Lin Chen
Main category: cs.LG
TL;DR: DRILL is a novel semi-supervised regression framework that transforms regression into discrete distribution estimation to improve label distribution learning and prevent overfitting.
Details
Motivation: Existing semi-supervised regression methods rely heavily on pseudo-label quality and direct regression fails to learn label distribution properly, leading to overfitting.Method: Transforms regression into Discrete Distribution Estimation over multiple buckets, uses Decoupled Distribution Alignment to align target and non-target bucket distributions between teacher and student models.
Result: Extensive experiments across diverse domains show DRILL has strong generalization and outperforms competing methods.
Conclusion: DRILL effectively addresses overfitting and pseudo-label dependency issues in semi-supervised regression through distribution-based learning.
Abstract: Semi-supervised regression (SSR), which aims to predict continuous scores of samples while reducing reliance on a large amount of labeled data, has recently received considerable attention across various applications, including computer vision, natural language processing, and audio and medical analysis. Existing semi-supervised methods typically apply consistency regularization on the general regression task by generating pseudo-labels. However, these methods heavily rely on the quality of pseudo-labels, and direct regression fails to learn the label distribution and can easily lead to overfitting. To address these challenges, we introduce an end-to-end Decoupled Representation distillation framework (DRILL) which is specially designed for the semi-supervised regression task where we transform the general regression task into a Discrete Distribution Estimation (DDE) task over multiple buckets to better capture the underlying label distribution and mitigate the risk of overfitting associated with direct regression. Then we employ the Decoupled Distribution Alignment (DDA) to align the target bucket and non-target bucket between teacher and student on the distribution of buckets, encouraging the student to learn more robust and generalized knowledge from the teacher. Extensive experiments conducted on datasets from diverse domains demonstrate that the proposed DRILL has strong generalization and outperforms the competing methods.
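As a sketch of the regression-to-distribution recasting, the code below turns a continuous score into a soft assignment over two neighboring buckets by linear interpolation between bucket centers; the bucket count, range, and two-hot interpolation scheme are assumptions rather than DRILL's exact construction.

```python
# Sketch: continuous target -> soft distribution over discrete buckets.
import torch

def to_bucket_distribution(y, centers):
    d = torch.zeros(len(y), len(centers))
    idx = torch.clamp(torch.searchsorted(centers, y), 1, len(centers) - 1)
    lo, hi = centers[idx - 1], centers[idx]
    w_hi = (y - lo) / (hi - lo)                 # closeness to upper center
    d[torch.arange(len(y)), idx] = w_hi
    d[torch.arange(len(y)), idx - 1] = 1 - w_hi
    return d                                    # each row sums to 1

centers = torch.linspace(0.0, 100.0, steps=11)  # 11 bucket centers (assumed)
dist = to_bucket_distribution(torch.tensor([42.0, 97.3]), centers)
```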
[240] GeoMAE: Masking Representation Learning for Spatio-Temporal Graph Forecasting with Missing Values
Songyu Ke, Chenyu Wu, Yuxuan Liang, Xiuwen Yi, Yanping Sun, Junbo Zhang, Yu Zheng
Main category: cs.LG
TL;DR: Self-supervised contrastive learning framework for crowd flow inference at POIs using spatial graphs and unlabeled data
Details
Motivation: Accurate crowd flow monitoring at Points of Interest is crucial for urban management but limited by poor data quality from urban sensing techniques and challenges including scarce labeled data, complex spatio-temporal dependencies, and correlations between GPS reports and actual crowd flow.Method: Recast crowd flow inference as self-supervised attributed graph representation learning. Construct spatial adjacency graph based on POIs and distances, use contrastive learning with swapped prediction to exploit unlabeled spatio-temporal data, then fine-tune with accurate crowd flow data
Result: Experiments on two real-world datasets show the proposed framework consistently outperforms models trained from scratch when pre-trained on extensive noisy data
Conclusion: The contrastive self-learning framework effectively addresses crowd flow inference challenges by leveraging unlabeled data through self-supervised learning and spatial graph representations
Abstract: Accurate acquisition of crowd flow at Points of Interest (POIs) is pivotal for effective traffic management, public service, and urban planning. Despite this importance, due to the limitations of urban sensing techniques, the data quality from most sources is inadequate for monitoring crowd flow at each POI. This renders the inference of accurate crowd flow from low-quality data a critical and challenging task. The complexity is heightened by three key factors: 1) the scarcity and rarity of labeled data, 2) the intricate spatio-temporal dependencies among POIs, and 3) the myriad correlations between precise crowd flow and GPS reports. To address these challenges, we recast the crowd flow inference problem as a self-supervised attributed graph representation learning task and introduce a novel Contrastive Self-learning framework for Spatio-Temporal data. Our approach initiates with the construction of a spatial adjacency graph founded on the POIs and their respective distances. We then employ a contrastive learning technique to exploit large volumes of unlabeled spatio-temporal data. We adopt a swapped prediction approach to anticipate the representation of the target subgraph from similar instances. Following the pre-training phase, the model is fine-tuned with accurate crowd flow data. Our experiments, conducted on two real-world datasets, demonstrate that the model pre-trained on extensive noisy data consistently outperforms models trained from scratch.
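A hedged sketch of the first step the abstract describes, building a spatial adjacency graph from POI locations and pairwise distances; the distance threshold and exponential decay weighting are assumptions.

```python
# Sketch: POIs -> pairwise distances -> thresholded, weighted adjacency.
import numpy as np

coords = np.random.rand(50, 2) * 10.0          # toy POI coordinates (km)
diff = coords[:, None, :] - coords[None, :, :]
dist = np.linalg.norm(diff, axis=-1)           # (50, 50) pairwise distances

threshold = 2.0                                # assumed cutoff (km)
A = np.where(dist < threshold, np.exp(-dist), 0.0)
np.fill_diagonal(A, 0.0)                       # no self-loops
```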
[241] Parameter-Aware Ensemble SINDy for Interpretable Symbolic SGS Closure
Hanseul Kang, Shervin Karimkashi, Ville Vuorinen
Main category: cs.LG
TL;DR: A scalable sparse regression framework that discovers interpretable PDEs and subgrid-scale closures from multi-parameter simulation data, extending SINDy with parameter-aware innovations and achieving autonomous discovery of Smagorinsky-type closures.
Details
Motivation: To overcome limitations in existing sparse regression methods for discovering governing equations and subgrid-scale closures from simulation data, particularly the inability to handle varying physical parameters and ensure unit consistency.
Method: Four key innovations: symbolic parameterization for varying physical parameters, Dimensional Similarity Filter for unit consistency, memory-efficient Gram-matrix accumulation for batch processing, and ensemble consensus with coefficient stability analysis for robust model identification.
Result: Successfully recovered governing equations across parameter ranges and discovered an SGS closure τ_SGS = 0.1603·Δ²(∂ū/∂x)² (Smagorinsky constant ≈0.4004) from filtered Burgers data, achieving R² = 0.886 across filter scales with improved accuracy over classical closures.
Conclusion: The framework autonomously discovers physically meaningful SGS forms and calibrates coefficients, providing a complementary data-driven approach to turbulence modeling and contributing to data-driven closure discovery.
Abstract: We present a scalable, parameter-aware sparse regression framework for discovering interpretable partial differential equations and subgrid-scale closures from multi-parameter simulation data. Building on SINDy (Sparse Identification of Nonlinear Dynamics), our approach addresses key limitations through four innovations: symbolic parameterisation enabling physical parameters to vary within unified regression; Dimensional Similarity Filter enforcing unit-consistency whilst reducing candidate libraries; memory-efficient Gram-matrix accumulation enabling batch processing; and ensemble consensus with coefficient stability analysis for robust model identification. Validation on canonical one-dimensional benchmarks demonstrates reliable recovery of governing equations across parameter ranges. Applied to filtered Burgers datasets, the framework discovers an SGS closure $\tau_{\mathrm{SGS}} = 0.1603\cdot\Delta^2\left(\frac{\partial \bar{u}}{\partial x}\right)^2$, corresponding to a Smagorinsky constant of approximately 0.4004. This represents autonomous discovery of Smagorinsky-type closure structure from data without prior theoretical assumptions. The discovered model achieves $R^2 = 0.886$ across filter scales and demonstrates improved prediction accuracy compared to classical closures. The framework’s ability to identify physically meaningful SGS forms and calibrate coefficients offers a complementary approach to existing turbulence modelling methods, contributing to the growing field of data-driven closure discovery.
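To make the sparse-regression mechanics concrete, here is a minimal sketch of sequentially thresholded least squares (STLSQ), the core regression step behind SINDy-style discovery. It omits the paper's parameter-aware extensions (symbolic parameterisation, dimensional filtering, Gram-matrix batching, ensembling); the toy library and coefficients are illustrative only.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares, the core SINDy regression.

    theta : (n_samples, n_terms) candidate-function library evaluated on data
    dxdt  : (n_samples,) time derivatives to be explained
    Returns a sparse coefficient vector over the library terms.
    """
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold           # terms to prune this round
        xi[small] = 0.0
        big = ~small
        if big.any():                            # refit only the surviving terms
            xi[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]
    return xi

# Toy usage: recover du/dt = -0.5*u + 2*u**2 from noisy samples.
rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, 500)
dudt = -0.5 * u + 2.0 * u**2 + 0.01 * rng.standard_normal(500)
library = np.column_stack([np.ones_like(u), u, u**2, u**3])  # [1, u, u^2, u^3]
print(stlsq(library, dudt))  # expect roughly [0, -0.5, 2, 0]
```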
[242] EEGDM: EEG Representation Learning via Generative Diffusion Model
Jia Hong Puah, Sim Kuan Goh, Ziwei Zhang, Zixuan Ye, Chow Khuen Chan, Kheng Seang Lim, Si Lei Fong, Kok Sin Woon
Main category: cs.LG
TL;DR: EEGDM: A lightweight EEG representation learning framework using diffusion models and structured state-space models that outperforms current foundation models while being 19x more efficient.
Details
Motivation: Current EEG foundation models are computationally expensive with only marginal performance gains as size increases, creating a need for more efficient yet effective representation learning methods.
Method: Proposed EEGDM framework with structured state-space model for diffusion pretraining (SSMDP) to capture EEG temporal dynamics, trained using Denoising Diffusion Probabilistic Model, and latent fusion transformer (LFT) for downstream classification.
Result: Outperformed state-of-the-art EEG foundation models on Temple University EEG Event Corpus while being approximately 19x more lightweight.
Conclusion: EEGDM provides a promising lightweight alternative to current EEG foundation models, offering better performance with significantly reduced computational costs.
Abstract: While electroencephalogram (EEG) has been a crucial tool for monitoring the brain and diagnosing neurological disorders (e.g., epilepsy), learning meaningful representations from raw EEG signals remains challenging due to limited annotations and high signal variability. Recently, EEG foundation models (FMs) have shown promising potential by adopting transformer architectures and self-supervised pre-training methods from large language models (e.g., masked prediction) to learn representations from diverse EEG data, followed by fine-tuning on specific EEG tasks. Nonetheless, these large models often incur high computational costs during both training and inference, with only marginal performance improvements as model size increases. In this work, we propose an EEG representation learning framework built upon a generative diffusion model (EEGDM). Specifically, we developed a structured state-space model for diffusion pretraining (SSMDP) to better capture the temporal dynamics of EEG signals and trained the architecture using a Denoising Diffusion Probabilistic Model. The resulting latent EEG representations were then used for downstream classification tasks via our proposed latent fusion transformer (LFT). To evaluate our method, we used the multi-event Temple University EEG Event Corpus and compared EEGDM with current state-of-the-art approaches, including EEG FMs. Empirical results show that our method outperforms existing methods while being approximately 19x more lightweight. These findings suggest that EEGDM offers a promising alternative to current FMs. Our code is available at: https://github.com/jhpuah/EEGDM.
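For readers unfamiliar with diffusion pretraining, the sketch below shows the standard DDPM noise-prediction objective the paper trains with. The denoiser here is a stand-in MLP rather than the paper's structured state-space model (SSMDP), and all shapes are hypothetical.

```python
import torch
import torch.nn as nn

# Standard DDPM noise schedule and training loss (Ho et al., 2020); the
# denoiser below is a stand-in MLP, not the paper's SSMDP architecture.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(256 + 1, 512), nn.ReLU(), nn.Linear(512, 256))

def ddpm_loss(x0):
    """x0: (batch, 256) clean EEG segments; returns the noise-prediction MSE."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    a_bar = alphas_bar[t].unsqueeze(1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # forward noising
    t_feat = (t.float() / T).unsqueeze(1)                   # crude time embedding
    eps_hat = denoiser(torch.cat([x_t, t_feat], dim=1))
    return nn.functional.mse_loss(eps_hat, eps)

loss = ddpm_loss(torch.randn(8, 256))
loss.backward()
```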
[243] FM4NPP: A Scaling Foundation Model for Nuclear and Particle Physics
David Park, Shuhang Li, Yi Huang, Xihaier Luo, Haiwang Yu, Yeonju Go, Christopher Pinkenburg, Yuewei Lin, Shinjae Yoo, Joseph Osborn, Jin Huang, Yihui Ren
Main category: cs.LG
TL;DR: A scientific foundation model for particle physics that uses self-supervised learning on detector data, achieving state-of-the-art performance across multiple tasks with 188M parameters and task-specific adapters.
Details
Motivation: To apply large language model principles to experimental particle physics despite the challenge of sparse, spatially distributed detector data that differs from natural language.
Method: Novel self-supervised training method for detector data with models up to 188M parameters, using frozen weights and task-specific adapters for downstream tasks.
Result: The foundation model consistently outperforms baseline models across all downstream tasks and shows robust data-efficient adaptation with task-agnostic representations that can be specialized via linear mapping.
Conclusion: A scalable and generalizable foundation model for particle physics is achievable through self-supervised learning on detector data, enabling superior performance across diverse tasks with efficient adaptation.
Abstract: Large language models have revolutionized artificial intelligence by enabling large, generalizable models trained through self-supervision. This paradigm has inspired the development of scientific foundation models (FMs). However, applying this capability to experimental particle physics is challenging due to the sparse, spatially distributed nature of detector data, which differs dramatically from natural language. This work addresses whether an FM for particle physics can scale and generalize across diverse tasks. We introduce a new dataset with more than 11 million particle collision events and a suite of downstream tasks and labeled data for evaluation. We propose a novel self-supervised training method for detector data and demonstrate its neural scalability with models that feature up to 188 million parameters. With frozen weights and task-specific adapters, this FM consistently outperforms baseline models across all downstream tasks. The performance also exhibits robust data-efficient adaptation. Further analysis reveals that the representations extracted by the FM are task-agnostic but can be specialized via a single linear mapping for different downstream tasks.
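The adapter-based evaluation protocol (frozen FM weights plus a small task-specific head) can be sketched in a few lines. The encoder and all dimensions below are hypothetical stand-ins, since the paper's actual architecture is not specified in the summary.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: a frozen pretrained encoder mapping detector features
# to 512-d representations, adapted to a new task with a small trainable head.
backbone = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 512))
for p in backbone.parameters():
    p.requires_grad = False          # FM weights stay frozen

adapter = nn.Linear(512, 10)         # task-specific adapter / linear probe
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
with torch.no_grad():
    feats = backbone(x)              # task-agnostic representation
loss = nn.functional.cross_entropy(adapter(feats), y)
loss.backward()
opt.step()
```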
[244] CoBAD: Modeling Collective Behaviors for Human Mobility Anomaly Detection
Haomin Wen, Shurui Cao, Leman Akoglu
Main category: cs.LG
TL;DR: CoBAD is a novel model for detecting collective anomalies in human mobility by modeling spatiotemporal dependencies between individuals using a two-stage attention mechanism and pre-training on large-scale collective behavior data.
Details
Motivation: Collective anomaly detection in human mobility remains underexplored compared to individual anomaly detection, requiring modeling of complex spatiotemporal dependencies between individuals for applications like public safety and urban planning.
Method: Formulates the problem using Collective Event Sequences with co-occurrence event graphs, employs two-stage attention mechanism to model individual patterns and interactions, and pre-trains on large-scale data through masked event and link reconstruction tasks.
Result: CoBAD significantly outperforms existing baselines with 13%-18% improvement in AUCROC and 19%-70% improvement in AUCPR, effectively detecting both unexpected co-occurrence anomalies and previously overlooked absence anomalies.
Conclusion: The proposed CoBAD model successfully addresses the challenge of collective anomaly detection in human mobility by capturing complex spatiotemporal dependencies and demonstrates superior performance compared to existing methods.
Abstract: Detecting anomalies in human mobility is essential for applications such as public safety and urban planning. While traditional anomaly detection methods primarily focus on individual movement patterns (e.g., a child should stay at home at night), collective anomaly detection aims to identify irregularities in collective mobility behaviors across individuals (e.g., a child is at home alone while the parents are elsewhere) and remains an underexplored challenge. Unlike individual anomalies, collective anomalies require modeling spatiotemporal dependencies between individuals, introducing additional complexity. To address this gap, we propose CoBAD, a novel model designed to capture Collective Behaviors for human mobility Anomaly Detection. We first formulate the problem as unsupervised learning over Collective Event Sequences (CES) with a co-occurrence event graph, where CES represents the event sequences of related individuals. CoBAD then employs a two-stage attention mechanism to model both the individual mobility patterns and the interactions across multiple individuals. Pre-trained on large-scale collective behavior data through masked event and link reconstruction tasks, CoBAD is able to detect two types of collective anomalies: unexpected co-occurrence anomalies and absence anomalies, the latter of which has been largely overlooked in prior work. Extensive experiments on large-scale mobility datasets demonstrate that CoBAD significantly outperforms existing anomaly detection baselines, achieving an improvement of 13%-18% in AUCROC and 19%-70% in AUCPR. All source code is available at https://github.com/wenhaomin/CoBAD.
[245] Logical Expressivity and Explanations for Monotonic GNNs with Scoring Functions
Matthew Morris, David J. Tena Cucala, Bernardo Cuenca Grau
Main category: cs.LG
TL;DR: The paper addresses explainability in GNNs for knowledge graph link prediction by extracting Datalog rules from monotonic GNNs with scoring functions, providing sound explanations while maintaining performance.
Details
Motivation: To improve explainability of GNN predictions in knowledge graphs by extracting interpretable Datalog rules, addressing limitations of previous methods that only worked with restricted low-expressivity encodings.
Method: Adapt GNNs and scoring functions to be monotonic, use monotonicity to extract sound Datalog rules for explaining predictions, and define procedures for obtaining equivalent Datalog programs for certain classes of monotonic GNNs with scoring functions.
Result: Experiments show that monotonic GNNs and scoring functions perform well on link prediction benchmarks and yield many sound rules for explanation.
Conclusion: The approach successfully combines the predictive power of GNNs with scoring functions while providing explainable Datalog rule extraction, demonstrating practical viability for knowledge graph link prediction tasks.
Abstract: Graph neural networks (GNNs) are often used for the task of link prediction: predicting missing binary facts in knowledge graphs (KGs). To address the lack of explainability of GNNs on KGs, recent works extract Datalog rules from GNNs with provable correspondence guarantees. The extracted rules can be used to explain the GNN’s predictions; furthermore, they can help characterise the expressive power of various GNN models. However, these works address only a form of link prediction based on a restricted, low-expressivity graph encoding/decoding method. In this paper, we consider a more general and popular approach for link prediction where a scoring function is used to decode the GNN output into fact predictions. We show how GNNs and scoring functions can be adapted to be monotonic, use the monotonicity to extract sound rules for explaining predictions, and leverage existing results about the kind of rules that scoring functions can capture. We also define procedures for obtaining equivalent Datalog programs for certain classes of monotonic GNNs with scoring functions. Our experiments show that, on link prediction benchmarks, monotonic GNNs and scoring functions perform well in practice and yield many sound rules.
[246] Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets
Benjamin Pikus, Pratyush Ranjan Tiwari, Burton Ye
Main category: cs.LG
TL;DR: Prioritizing hard examples for language model fine-tuning yields 47% performance gains, while easy examples provide minimal improvements, offering practical guidance for budget-constrained alignment.
Details
Motivation: Collecting high-quality training data for language model fine-tuning is expensive with limited budgets, creating a need to determine which difficulty level of examples provides the best performance gains per acquisition cost.
Method: Study Group Relative Policy Optimization (GRPO) fine-tuning across different model sizes and families, comparing four subset selection policies (easy, medium, hard, random difficulty) chosen from the same unlabeled pool using base-model difficulty estimates via multi-sample evaluation.
Result: Training on the hardest examples yields the largest performance gains (up to 47%), while training on easy examples yields the smallest gains. Harder examples provide more learnable opportunities during GRPO training.
Conclusion: For budget-constrained post-training, prioritizing hard examples yields substantial performance gains on reasoning tasks when using GRPO, providing practical guidance for resource-constrained alignment.
Abstract: Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting the amount of data that can be procured. We investigate a critical question for resource-constrained alignment: under a fixed acquisition budget, should practitioners prioritize examples that are easy, medium, hard, or of random difficulty? We study Group Relative Policy Optimization (GRPO) fine-tuning across different model sizes and families, comparing four subset selection policies chosen from the same unlabeled pool using base-model difficulty estimates obtained via multi-sample evaluation. Our experiments reveal that training on the hardest examples yields the largest performance gains, up to 47%, while training on easy examples yields the smallest gains. Analysis reveals that this effect arises from harder examples providing more learnable opportunities during GRPO training. These findings provide practical guidance for budget-constrained post-training: prioritizing hard examples yields substantial performance gains on reasoning tasks when using GRPO.
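A minimal sketch of the selection policies under a fixed annotation budget, assuming difficulty is estimated from multi-sample pass rates of the base model as the abstract describes; `solve` is a hypothetical stand-in for sampling the base model once and checking correctness.

```python
import numpy as np

def estimate_difficulty(solve, prompts, k=16):
    """Difficulty = 1 - pass rate over k base-model samples per prompt."""
    return np.array([1.0 - np.mean([solve(p) for _ in range(k)]) for p in prompts])

def select_for_annotation(prompts, scores, budget, policy="hard"):
    order = np.argsort(scores)                      # easy -> hard
    if policy == "hard":
        idx = order[-budget:]
    elif policy == "easy":
        idx = order[:budget]
    else:                                           # "random" baseline
        idx = np.random.permutation(len(prompts))[:budget]
    return [prompts[i] for i in idx]

# Demo with a fake "model": each prompt is just its own failure probability.
prompts = [0.1, 0.4, 0.7, 0.95]
scores = estimate_difficulty(lambda p: np.random.rand() > p, prompts, k=64)
print(scores, select_for_annotation(prompts, scores, budget=2, policy="hard"))
```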
[247] Physics-Informed Reward Machines
Daniel Ajeleye, Ashutosh Trivedi, Majid Zamani
Main category: cs.LG
TL;DR: Physics-informed reward machines (pRMs) enhance reinforcement learning by incorporating symbolic knowledge about physical environments, enabling more expressive reward specification and faster learning through counterfactual experiences and reward shaping.
Details
Motivation: To improve reinforcement learning by separating known environmental knowledge (captured by reward mechanisms) from unknown aspects that need discovery, thereby reducing sample complexity and accelerating learning.
Method: Introduce physics-informed reward machines (pRMs) as symbolic machines that express complex learning objectives and reward structures. Develop RL algorithms that exploit pRMs through counterfactual experience generation and reward shaping techniques.
Result: Experimental results show accelerated reward acquisition during training phases. pRMs significantly improve learning efficiency across both finite and continuous physical environments in various control tasks.
Conclusion: Physics-informed reward machines provide a powerful framework for making reinforcement learning more programmable, expressive, and efficient by incorporating symbolic knowledge about physical environments into the reward specification process.
Abstract: Reward machines (RMs) provide a structured way to specify non-Markovian rewards in reinforcement learning (RL), thereby improving both expressiveness and programmability. Viewed more broadly, they separate what is known about the environment, captured by the reward mechanism, from what remains unknown and must be discovered through sampling. This separation supports techniques such as counterfactual experience generation and reward shaping, which reduce sample complexity and speed up learning. We introduce physics-informed reward machines (pRMs), a class of symbolic machines designed to express complex learning objectives and reward structures for RL agents, thereby enabling more programmable, expressive, and efficient learning. We present RL algorithms capable of exploiting pRMs via counterfactual experiences and reward shaping. Our experimental results show that these techniques accelerate reward acquisition during the training phases of RL. We demonstrate the expressiveness and effectiveness of pRMs through experiments in both finite and continuous physical environments, illustrating that incorporating pRMs significantly improves learning efficiency across several control tasks.
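To illustrate the reward-machine machinery that pRMs build on, here is a minimal classical reward machine with counterfactual experience generation (one experience per RM state for each environment step). The physics-informed aspects of pRMs are beyond this sketch, and the example task is invented.

```python
# A minimal reward machine: finite states, transitions triggered by labels
# (atomic propositions observed in the environment), rewards on transitions.

class RewardMachine:
    def __init__(self, transitions, initial=0):
        # transitions: {(rm_state, label): (next_rm_state, reward)}
        self.transitions = transitions
        self.state = initial

    def step(self, label):
        self.state, reward = self.transitions.get(
            (self.state, label), (self.state, 0.0))
        return reward

def counterfactual_experiences(rm_transitions, s, label, s_next):
    """One env step (s, label, s_next) yields one experience per RM state."""
    rm_states = ({u for (u, _) in rm_transitions}
                 | {v for (v, _) in rm_transitions.values()})
    out = []
    for u in rm_states:
        v, r = rm_transitions.get((u, label), (u, 0.0))
        out.append(((s, u), r, (s_next, v)))      # product-state transition
    return out

# Example: reach region A, then region B, for a reward of 1.
T = {(0, "at_A"): (1, 0.0), (1, "at_B"): (2, 1.0)}
print(counterfactual_experiences(T, s=(0.0, 0.0), label="at_A", s_next=(0.3, 0.1)))
```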
[248] Implicit Hypergraph Neural Network
Akash Choudhuri, Yongjian Zhong, Bijaya Adhikari
Main category: cs.LG
TL;DR: IHNN is a novel implicit hypergraph neural network that captures long-range dependencies through fixed-point representations for both nodes and hyperedges, outperforming existing methods in node classification tasks.
Details
Motivation: Existing hypergraph neural networks fail to capture long-range high-order dependencies due to limited message-passing rounds, and increasing rounds degrades performance. No prior work has addressed long-range dependency issues in hypergraph neural networks.
Method: Proposes Implicit Hypergraph Neural Network (IHNN) that jointly learns fixed-point representations for nodes and hyperedges using implicit differentiation. Uses a tractable projected gradient descent approach for efficient training.
Result: Extensive experiments on real-world hypergraphs show IHNN outperforms closest prior works in most settings for node classification, establishing new state-of-the-art performance.
Conclusion: IHNN successfully addresses the long-range dependency problem in hypergraph learning through implicit differentiation and fixed-point representations, demonstrating superior performance over existing methods.
Abstract: Hypergraphs offer a generalized framework for capturing high-order relationships between entities and have been widely applied in various domains, including healthcare, social networks, and bioinformatics. Hypergraph neural networks, which rely on message-passing between nodes over hyperedges to learn latent representations, have emerged as the method of choice for predictive tasks in many of these domains. These approaches typically perform only a small number of message-passing rounds to learn the representations, which they then utilize for predictions. The small number of message-passing rounds comes at a cost, as the representations only capture local information and forego long-range high-order dependencies. However, as we demonstrate, blindly increasing the message-passing rounds to capture long-range dependency also degrades the performance of hypergraph neural networks. Recent works have demonstrated that implicit graph neural networks capture long-range dependencies in standard graphs while maintaining performance. Despite their popularity, prior work has not studied long-range dependency issues on hypergraph neural networks. Here, we first demonstrate that existing hypergraph neural networks lose predictive power when aggregating more information to capture long-range dependency. We then propose Implicit Hypergraph Neural Network (IHNN), a novel framework that jointly learns fixed-point representations for both nodes and hyperedges in an end-to-end manner to alleviate this issue. Leveraging implicit differentiation, we introduce a tractable projected gradient descent approach to train the model efficiently. Extensive experiments on real-world hypergraphs for node classification demonstrate that IHNN outperforms the closest prior works in most settings, establishing a new state-of-the-art in hypergraph learning.
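The fixed-point idea can be sketched with a generic implicit (deep-equilibrium-style) layer. Note that the paper's IHNN couples node and hyperedge states on a hypergraph and trains via implicit differentiation with projected gradient descent, which this standalone sketch does not reproduce.

```python
import torch
import torch.nn as nn

class ImplicitLayer(nn.Module):
    """Solves z* = tanh(W z* + U x) by fixed-point iteration."""
    def __init__(self, dim, tol=1e-4, max_iter=50):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.U = nn.Linear(dim, dim)
        self.tol, self.max_iter = tol, max_iter

    def forward(self, x):
        z = torch.zeros_like(x)
        with torch.no_grad():                    # cheap fixed-point solve
            for _ in range(self.max_iter):
                z_new = torch.tanh(self.W(z) + self.U(x))
                if (z_new - z).norm() < self.tol:
                    z = z_new
                    break
                z = z_new
        # One differentiable step re-attaches z* to the graph (a common
        # one-step approximation to full implicit differentiation).
        return torch.tanh(self.W(z) + self.U(x))

layer = ImplicitLayer(16)
out = layer(torch.randn(4, 16))
out.sum().backward()
```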
[249] Beyond Fixed Morphologies: Learning Graph Policies with Trust Region Compensation in Variable Action Spaces
Thomas Gallien
Main category: cs.LG
TL;DR: Theoretical analysis of TRPO and PPO trust region methods under varying action space dimensionality, with empirical evaluation on morphological generalization using Gymnasium Swimmer environment.
Details
Motivation: Growing demand for morphological generalization in reinforcement learning requires understanding how trust region methods behave with varying action space dimensionality in graph-based policy architectures.
Method: Conducted theoretical analysis of TRPO and PPO optimization landscapes under KL-divergence and policy clipping constraints, complemented by empirical evaluation using Gymnasium Swimmer environment with controlled kinematic variations.
Result: Analysis reveals how varying action space dimensionality influences optimization behavior in trust region methods under different constraint mechanisms.
Conclusion: Provides theoretical insights into trust region optimization under morphological variation, offering foundation for developing more robust and generalizable control policies across different kinematic structures.
Abstract: Trust region-based optimization methods have become foundational reinforcement learning algorithms that offer stability and strong empirical performance in continuous control tasks. Growing interest in scalable and reusable control policies also translates into a demand for morphological generalization, the ability of control policies to cope with different kinematic structures. Graph-based policy architectures provide a natural and effective mechanism to encode such structural differences. However, while these architectures accommodate variable morphologies, the behavior of trust region methods under varying action space dimensionality remains poorly understood. To this end, we conduct a theoretical analysis of trust region-based policy optimization methods, focusing on both Trust Region Policy Optimization (TRPO) and its widely used first-order approximation, Proximal Policy Optimization (PPO). The goal is to demonstrate how varying action space dimensionality influences the optimization landscape, particularly under the constraints imposed by KL-divergence or policy clipping penalties. Complementing the theoretical insights, an empirical evaluation under morphological variation is carried out using the Gymnasium Swimmer environment. This benchmark offers a systematically controlled setting for varying the kinematic structure without altering the underlying task, making it particularly well-suited to study morphological generalization.
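One concrete way to see why action-space dimensionality matters for trust regions: the KL divergence between diagonal Gaussian policies sums over action dimensions, so a fixed KL budget permits smaller per-coordinate updates as the action space grows. The numbers below are illustrative, not from the paper.

```python
import numpy as np

def kl_diag_gauss(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, diag sigma1^2) || N(mu2, diag sigma2^2)), summed over dims."""
    return np.sum(np.log(sigma2 / sigma1)
                  + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

delta = 0.01                                   # TRPO-style KL budget
for dim in [2, 8, 32]:
    # With unit variances, KL = dim * eps^2 / 2 for a uniform mean shift eps,
    # so the shift that exactly spends the budget shrinks as dim grows.
    eps = np.sqrt(2 * delta / dim)
    kl = kl_diag_gauss(np.zeros(dim) + eps, np.ones(dim),
                       np.zeros(dim), np.ones(dim))
    print(dim, round(eps, 4), round(kl, 4))
```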
[250] From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery
Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guangshuai Wang, Zhiqiang Gao, Juntai Cao, Zijie Qiu, Xuming He, Qiang Zhang, Chenyu You, Shuangjia Zheng, Ning Ding, Wanli Ouyang, Nanqing Dong, Yu Cheng, Siqi Sun, Lei Bai, Bowen Zhou
Main category: cs.LG
TL;DR: Agentic Science represents AI systems evolving from computational tools to autonomous research partners capable of full scientific agency across hypothesis generation, experimental design, execution, and analysis.
Details
Motivation: To establish Agentic Science as a structured paradigm within AI for Science, unifying fragmented perspectives and providing a comprehensive framework for autonomous scientific discovery across multiple domains.
Method: Domain-oriented review approach that unifies three perspectives (process-oriented, autonomy-oriented, mechanism-oriented) through a comprehensive framework connecting capabilities, processes, and domain-specific implementations.
Result: Provides a synthesis of autonomous scientific discovery across life sciences, chemistry, materials science, and physics, identifying five core capabilities and modeling discovery as a four-stage workflow.
Conclusion: Agentic Science represents a pivotal stage in AI-driven research, establishing a structured paradigm that positions AI systems as autonomous scientific partners with capabilities once considered uniquely human.
Abstract: Artificial intelligence (AI) is reshaping scientific discovery, evolving from specialized computational tools into autonomous research partners. We position Agentic Science as a pivotal stage within the broader AI for Science paradigm, where AI systems progress from partial assistance to full scientific agency. Enabled by large language models (LLMs), multimodal systems, and integrated research platforms, agentic AI shows capabilities in hypothesis generation, experimental design, execution, analysis, and iterative refinement – behaviors once regarded as uniquely human. This survey provides a domain-oriented review of autonomous scientific discovery across life sciences, chemistry, materials science, and physics. We unify three previously fragmented perspectives – process-oriented, autonomy-oriented, and mechanism-oriented – through a comprehensive framework that connects foundational capabilities, core processes, and domain-specific realizations. Building on this framework, we (i) trace the evolution of AI for Science, (ii) identify five core capabilities underpinning scientific agency, (iii) model discovery as a dynamic four-stage workflow, (iv) review applications across the above domains, and (v) synthesize key challenges and future opportunities. This work establishes a domain-oriented synthesis of autonomous scientific discovery and positions Agentic Science as a structured paradigm for advancing AI-driven research.
[251] A Cost-Effective Framework for Predicting Parking Availability Using Geospatial Data and Machine Learning
Madyan Bagosher, Tala Mustafa, Mohammad Alsmirat, Amal Al-Ali, Isam Mashhour Al Jawarneh
Main category: cs.LG
TL;DR: Smart parking framework using multiple data sources and machine learning models to predict campus parking availability without physical sensors.
Details
Motivation: Urban growth and limited campus parking spaces create challenges for students finding parking spots during class times, requiring efficient allocation systems.
Method: Integrates street maps, mobility, and weather data through spatial joins. Evaluates Linear Regression, SVR, Random Forest, and LSTM models with hyperparameter tuning and performance metrics (RMSE, MAE, R2).
Result: Random Forest Regression performed best with RMSE of 0.142 and R2 of 0.582, though LSTM may outperform with more data and longer timesteps.
Conclusion: Proposed framework effectively predicts parking availability using location-based data without physical sensors, with Random Forest showing best performance in current setup.
Abstract: As urban populations continue to grow, cities face numerous challenges in managing parking and determining occupancy. This issue is particularly pronounced on university campuses, where students need to find vacant parking spots quickly and conveniently during class timings. The limited availability of parking spaces on campuses underscores the necessity of implementing efficient systems to allocate vacant parking spots effectively. We propose a smart framework that integrates multiple data sources, including street maps, mobility, and meteorological data, through a spatial join operation to capture parking behavior and vehicle movement patterns over a span of 3 consecutive days, at hourly intervals between 7 AM and 3 PM. The system will not require any sensing tools to be installed in the street or in the parking area to provide its services, since all the data needed will be collected using location services. The framework will use the expected parking entrance and time to specify a suitable parking area. Several forecasting models, namely Linear Regression, Support Vector Regression (SVR), Random Forest Regression (RFR), and Long Short-Term Memory (LSTM), are evaluated. Hyperparameter tuning was employed using grid search, and model performance is assessed using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R2). Random Forest Regression achieved the lowest RMSE of 0.142 and highest R2 of 0.582. However, given the time-series nature of the task, an LSTM model may perform better with additional data and longer timesteps.
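A minimal sketch of the evaluation pipeline described above: Random Forest regression with grid-search tuning and RMSE/MAE/R2 metrics. The features and data are synthetic stand-ins (hour, weather, and so on are hypothetical names), not the paper's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical feature matrix: hour of day, day index, temperature, distance
# to entrance, recent mobility counts; target is an occupancy rate.
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = 0.6 * X[:, 0] + 0.2 * X[:, 2] + 0.05 * rng.standard_normal(500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="neg_root_mean_squared_error", cv=5,
)
grid.fit(X_tr, y_tr)
pred = grid.predict(X_te)
print("RMSE", np.sqrt(mean_squared_error(y_te, pred)))
print("MAE ", mean_absolute_error(y_te, pred))
print("R2  ", r2_score(y_te, pred))
```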
[252] Comparison of derivative-free and gradient-based minimization for multi-objective compositional design of shape memory alloys
S. Josyula, Y. Noiman, E. J. Payton, T. Giovannelli
Main category: cs.LG
TL;DR: Machine learning models combined with optimization algorithms to design shape memory alloys with target martensitic start temperature while minimizing cost.
Details
Motivation: Designing affordable and sustainable shape memory alloys that meet performance targets is challenging, requiring optimization of compositions to achieve desired properties while minimizing cost.
Method: Used tree-based ensemble and neural network ML models trained on experimental SMA data with physics-informed features. Paired tree model with COBYLA (derivative-free optimizer) and neural network with TRUST-CONSTR (gradient-based optimizer) to search for optimal alloy compositions.
Result: Both models predicted Ms with similar accuracy, but neural network with TRUST-CONSTR found better solutions more consistently. COBYLA often converged to suboptimal results, especially with poor initial guesses. TRUST-CONSTR showed more stable behavior and better met both objectives.
Conclusion: Demonstrates practical approach combining physics-informed data, ML models, and optimization for SMA design. Approach reliable despite smaller dataset size and can be extended to other materials with design trade-offs and limited data.
Abstract: Designing shape memory alloys (SMAs) that meet performance targets while remaining affordable and sustainable is a complex challenge. In this work, we focus on optimizing SMA compositions to achieve a desired martensitic start temperature (Ms) while minimizing cost. To do this, we use machine learning models as surrogate predictors and apply numerical optimization methods to search for suitable alloy combinations. We trained two types of machine learning models, a tree-based ensemble and a neural network, using a dataset of experimentally characterized alloys and physics-informed features. The tree-based model was used with a derivative-free optimizer (COBYLA), while the neural network, which provides gradient information, was paired with a gradient-based optimizer (TRUST-CONSTR). Our results show that while both models predict Ms with similar accuracy, the optimizer paired with the neural network finds better solutions more consistently. COBYLA often converged to suboptimal results, especially when the starting guess was far from the target. The TRUST-CONSTR method showed more stable behavior and was better at reaching alloy compositions that met both objectives. This study demonstrates a practical approach to exploring new SMA compositions by combining physics-informed data, machine learning models, and optimization algorithms. Although the scale of our dataset is smaller than simulation-based efforts, the use of experimental data improves the reliability of the predictions. The approach can be extended to other materials where design trade-offs must be made with limited data.
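The derivative-free versus gradient-based comparison can be reproduced in miniature with scipy: COBYLA against trust-constr on a toy surrogate. The Ms and cost functions below are invented placeholders, not the paper's trained models.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-ins for the paper's surrogates: predicted martensitic start
# temperature and alloy cost as functions of a composition vector x.
def ms_pred(x):
    return 300.0 + 120.0 * x[0] - 80.0 * x[1]   # hypothetical surrogate Ms [K]

def cost(x):
    return 5.0 * x[0] + 20.0 * x[1]             # hypothetical $/kg

MS_TARGET = 320.0

def objective(x):   # match target Ms, penalize cost
    return (ms_pred(x) - MS_TARGET) ** 2 + 0.1 * cost(x)

x0 = np.array([0.5, 0.5])
res_dfo = minimize(objective, x0, method="COBYLA")   # derivative-free
res_grad = minimize(objective, x0, method="trust-constr",
                    bounds=[(0.0, 1.0), (0.0, 1.0)])  # gradient-based (FD grads)
print(res_dfo.x, res_grad.x)
```

As in the paper's comparison, the interesting behavior appears when `x0` is moved far from the optimum: the derivative-free method is more prone to stalling at suboptimal points.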
[253] ERIS: An Energy-Guided Feature Disentanglement Framework for Out-of-Distribution Time Series Classification
Xin Wu, Fei Teng, Ji Zhang, Xingwang Li, Yuxuan Liang
Main category: cs.LG
TL;DR: ERIS framework enables guided feature disentanglement for time series classification using energy-guided calibration, weight-level orthogonality, and adversarial training to achieve better OOD performance.
Details
Motivation: Current time series classification models suffer from poor out-of-distribution performance due to entanglement of domain-specific and label-relevant features, creating spurious correlations. Existing disentanglement methods lack semantic guidance.
Method: End-to-end Energy-Regularized Information for Shift-Robustness (ERIS) framework with three mechanisms: energy-guided calibration for semantic guidance, weight-level orthogonality for structural independence, and auxiliary adversarial training for robustness.
Result: ERIS improves state-of-the-art baselines by an average of 4.04% accuracy across four benchmarks.
Conclusion: The proposed ERIS framework successfully addresses the disentanglement problem in time series classification by providing semantic guidance alongside mathematical constraints, leading to significantly improved out-of-distribution performance.
Abstract: An ideal time series classification (TSC) should be able to capture invariant representations, but achieving reliable performance on out-of-distribution (OOD) data remains a core obstacle. This obstacle arises from the way models inherently entangle domain-specific and label-relevant features, resulting in spurious correlations. While feature disentanglement aims to solve this, current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. To address this, we propose an end-to-end Energy-Regularized Information for Shift-Robustness (ERIS) framework to enable guided and reliable feature disentanglement. The core idea is that effective disentanglement requires not only mathematical constraints but also semantic guidance to anchor the separation process. ERIS incorporates three key mechanisms to achieve this goal. Specifically, we first introduce an energy-guided calibration mechanism, which provides crucial semantic guidance for the separation, enabling the model to self-calibrate. Additionally, a weight-level orthogonality strategy enforces structural independence between domain-specific and label-relevant features, thereby mitigating their interference. Moreover, an auxiliary adversarial training mechanism enhances robustness by injecting structured perturbations. Experiments demonstrate that ERIS improves upon state-of-the-art baselines by an average of 4.04% accuracy across four benchmarks.
[254] Towards Agent-based Test Support Systems: An Unsupervised Environment Design Approach
Collins O. Ogbodo, Timothy J. Rogers, Mattia Dal Borgo, David J. Wagg
Main category: cs.LG
TL;DR: Agent-based framework for adaptive sensor placement in modal testing using reinforcement learning to optimize sensor locations across frequency segments in dynamically changing test environments.
Details
Motivation: Traditional modal test design approaches are static and rigid, compromising accuracy and adaptability by not accounting for evolving test parameters and their impact on previously established decisions like sensor configurations.
Method: An agent-based decision support framework using underspecified partially observable Markov decision process, training a generalist reinforcement learning agent through dual-curriculum learning strategy.
Result: Case study on a steel cantilever structure demonstrates efficacy in optimizing sensor locations across frequency segments, showing robustness and real-world applicability.
Conclusion: The proposed framework addresses limitations of traditional static approaches by enabling adaptive sensor placement that responds to dynamically changing test environments, improving test accuracy and adaptability.
Abstract: Modal testing plays a critical role in structural analysis by providing essential insights into dynamic behaviour across a wide range of engineering industries. In practice, designing an effective modal test campaign involves complex experimental planning, comprising a series of interdependent decisions that significantly influence the final test outcome. Traditional approaches to test design are typically static, focusing only on global tests without accounting for evolving test campaign parameters or the impact of such changes on previously established decisions, such as sensor configurations, which have been found to significantly influence test outcomes. These rigid methodologies often compromise test accuracy and adaptability. To address these limitations, this study introduces an agent-based decision support framework for adaptive sensor placement across dynamically changing modal test environments. The framework formulates the problem using an underspecified partially observable Markov decision process, enabling the training of a generalist reinforcement learning agent through a dual-curriculum learning strategy. A detailed case study on a steel cantilever structure demonstrates the efficacy of the proposed method in optimising sensor locations across frequency segments, validating its robustness and real-world applicability in experimental settings.
[255] Topological Data Analysis for Unsupervised Anomaly Detection and Customer Segmentation on Banking Data
Leonardo Aldo Alejandro Barberi, Linda Maria De Cave
Main category: cs.LG
TL;DR: Advanced TDA techniques using Mapper algorithm and persistent homology for unsupervised anomaly detection and customer segmentation in banking data.
Details
Motivation: To develop unsupervised procedures that uncover meaningful patterns in customers’ banking data by exploiting topological information, bridging abstract topology with practical industry applications.
Method: Utilizes Topological Data Analysis (TDA) techniques including the Mapper algorithm and persistent homology to analyze banking data in an unsupervised manner.
Result: The framework yields actionable insights that combine topological mathematics with real-life industry use cases, successfully identifying patterns for anomaly detection and customer segmentation.
Conclusion: TDA provides effective unsupervised methods for pattern discovery in banking data, demonstrating the practical value of topological approaches for real-world financial applications.
Abstract: This paper introduces advanced techniques of Topological Data Analysis (TDA) for unsupervised anomaly detection and customer segmentation in banking data. Using the Mapper algorithm and persistent homology, we develop unsupervised procedures that uncover meaningful patterns in customers’ banking data by exploiting topological information. The framework we present in this paper yields actionable insights that combine the abstract mathematical subject of topology with real-life use cases that are useful in industry.
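For intuition, here is a compact Mapper implementation: cover the filter range with overlapping intervals, cluster each preimage, and connect clusters that share points. DBSCAN is used as the clusterer for convenience; the paper's actual filter functions and clustering choices are not specified in the summary, and the data here are synthetic.

```python
import itertools
import numpy as np
from sklearn.cluster import DBSCAN

def mapper_graph(X, filt, n_intervals=8, overlap=0.3, eps=0.5):
    """A minimal Mapper: overlapping interval cover over a 1-d filter,
    per-preimage clustering, edges between clusters sharing points."""
    lo, hi = filt.min(), filt.max()
    length = (hi - lo) / n_intervals
    nodes, edges, nid = {}, set(), 0
    for i in range(n_intervals):
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        idx = np.where((filt >= a) & (filt <= b))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X[idx])
        for lab in set(labels) - {-1}:           # -1 = DBSCAN noise
            nodes[nid] = set(idx[labels == lab])
            nid += 1
    for u, v in itertools.combinations(nodes, 2):
        if nodes[u] & nodes[v]:                  # shared points -> edge
            edges.add((u, v))
    return nodes, edges

# Toy usage: filter = first coordinate of synthetic "customer" features.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
nodes, edges = mapper_graph(X, filt=X[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")
```

Anomalies then surface as small, weakly connected Mapper nodes, and segments as large connected components.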
[256] Learning to Learn the Macroscopic Fundamental Diagram using Physics-Informed and meta Machine Learning techniques
Amalie Roark, Serio Agriesti, Francisco Camara Pereira, Guido Cantelmo
Main category: cs.LG
TL;DR: Meta-learning framework for estimating Macroscopic Fundamental Diagrams (MFD) using limited loop detector data by leveraging multi-city data and physics-informed neural networks.
Details
Motivation: Traditional MFD estimation requires many loop detectors which are often unavailable. This paper addresses data scarcity by using meta-learning to generalize across cities with different detector coverage and topologies.
Method: Proposes a meta-learning framework applied to a Multi-Task Physics-Informed Neural Network specifically designed for MFD estimation. The model is trained on data from multiple cities and tested on cities with limited detectors.
Result: Achieves average MSE improvement in flow prediction between ~17,500 and 36,000 (depending on detector subset). Successfully generalizes across diverse urban settings and outperforms traditional transfer learning approaches.
Conclusion: Meta-learning effectively addresses MFD estimation with limited detectors, demonstrating strong transferability across cities and superior performance compared to conventional methods.
Abstract: The Macroscopic Fundamental Diagram is a popular tool used to describe traffic dynamics in an aggregated way, with applications ranging from traffic control to incident analysis. However, estimating the MFD for a given network requires large numbers of loop detectors, which are not always available in practice. This article proposes a framework harnessing meta-learning, a subcategory of machine learning that trains models to understand and adapt to new tasks on their own, to alleviate the data scarcity challenge. The developed model is trained and tested by leveraging data from multiple cities and exploiting it to model the MFD of other cities with different shares of detectors and topological structures. The proposed meta-learning framework is applied to an ad-hoc Multi-Task Physics-Informed Neural Network, specifically designed to estimate the MFD. Results show an average MSE improvement in flow prediction ranging between ~17,500 and ~36,000 (depending on the subset of loop detectors tested). The meta-learning framework thus successfully generalizes across diverse urban settings and improves performance on cities with limited data, demonstrating the potential of using meta-learning when a limited number of detectors is available. Finally, the proposed framework is validated against traditional transfer learning approaches and tested with FitFun, a non-parametric model from the literature, to prove its transferability.
[257] STAS: Spatio-Temporal Adaptive Computation Time for Spiking Transformers
Donghwa Kang, Doohyun Kim, Sang-Ki Ko, Jinkyu Lee, Brent ByungHoon Kang, Hyeongboo Baek
Main category: cs.LG
TL;DR: STAS framework reduces SNN latency and energy consumption by co-designing architecture and dynamic computation policy, achieving up to 45.9% energy savings while improving accuracy.
Details
Motivation: Spiking neural networks (SNNs) offer energy efficiency but suffer from high latency and computational overhead due to multi-timestep operations. Existing dynamic computation methods are fragmented and cannot be directly applied to SNN-based vision Transformers due to temporal dissimilarity issues and static architecture limitations.
Method: Proposes STAS framework with two key components: 1) Integrated spike patch splitting (I-SPS) module to create unified input representation and establish temporal stability, 2) Adaptive spiking self-attention (A-SSA) module that performs two-dimensional token pruning across spatial and temporal axes.
Result: Validated on CIFAR-10, CIFAR-100, and ImageNet datasets. Achieved energy consumption reductions of 45.9%, 43.8%, and 30.1% respectively, while simultaneously improving accuracy over state-of-the-art models.
Conclusion: STAS successfully addresses the core challenges of applying adaptive computation time to SNN-based vision Transformers by co-designing architecture and computation policy, enabling significant energy savings without compromising accuracy.
Abstract: Spiking neural networks (SNNs) offer energy efficiency over artificial neural networks (ANNs) but suffer from high latency and computational overhead due to their multi-timestep operational nature. While various dynamic computation methods have been developed to mitigate this by targeting spatial, temporal, or architecture-specific redundancies, they remain fragmented. While the principles of adaptive computation time (ACT) offer a robust foundation for a unified approach, its application to SNN-based vision Transformers (ViTs) is hindered by two core issues: the violation of its temporal similarity prerequisite and a static architecture fundamentally unsuited for its principles. To address these challenges, we propose STAS (Spatio-Temporal Adaptive computation time for Spiking transformers), a framework that co-designs the static architecture and dynamic computation policy. STAS introduces an integrated spike patch splitting (I-SPS) module to establish temporal stability by creating a unified input representation, thereby solving the architectural problem of temporal dissimilarity. This stability, in turn, allows our adaptive spiking self-attention (A-SSA) module to perform two-dimensional token pruning across both spatial and temporal axes. Implemented on spiking Transformer architectures and validated on CIFAR-10, CIFAR-100, and ImageNet, STAS reduces energy consumption by up to 45.9%, 43.8%, and 30.1%, respectively, while simultaneously improving accuracy over SOTA models.
[258] Neuro-inspired Ensemble-to-Ensemble Communication Primitives for Sparse and Efficient ANNs
Orestis Konstantaropoulos, Stelios Manolis Smirnakis, Maria Papadopouli
Main category: cs.LG
TL;DR: G2GNet is a biologically-inspired neural network that uses sparse, modular connectivity patterns from mouse visual cortex to achieve 75% sparsity while improving accuracy by 4.3% on vision benchmarks with fewer computations.
Details
Motivation: Biological neural circuits show efficient trade-offs between wiring cost, specialization, and robustness. The paper aims to incorporate observed functional connectivity patterns from mouse visual cortex into ANN design to improve efficiency and performance.
Method: Introduces G2GNet with sparse, modular connectivity across feedforward layers based on biological ensemble-to-ensemble communication patterns. Combines static structural bias with dynamic sparse training (DST) that prunes and regrows edges during training, plus a Hebbian-inspired rewiring rule based on activation correlations.
Result: Achieves up to 75% sparsity while improving accuracy by up to 4.3% on Fashion-MNIST, CIFAR-10, and CIFAR-100 benchmarks. Outperforms dense baselines with significantly fewer parameters and computations.
Conclusion: This is the first architecture to incorporate biologically observed functional connectivity patterns as structural bias in ANN design, demonstrating that biological principles can lead to more efficient and accurate neural networks.
Abstract: The structure of biological neural circuits-modular, hierarchical, and sparsely interconnected-reflects an efficient trade-off between wiring cost, functional specialization, and robustness. These principles offer valuable insights for artificial neural network (ANN) design, especially as networks grow in depth and scale. Sparsity, in particular, has been widely explored for reducing memory and computation, improving speed, and enhancing generalization. Motivated by systems neuroscience findings, we explore how patterns of functional connectivity in the mouse visual cortex-specifically, ensemble-to-ensemble communication, can inform ANN design. We introduce G2GNet, a novel architecture that imposes sparse, modular connectivity across feedforward layers. Despite having significantly fewer parameters than fully connected models, G2GNet achieves superior accuracy on standard vision benchmarks. To our knowledge, this is the first architecture to incorporate biologically observed functional connectivity patterns as a structural bias in ANN design. We complement this static bias with a dynamic sparse training (DST) mechanism that prunes and regrows edges during training. We also propose a Hebbian-inspired rewiring rule based on activation correlations, drawing on principles of biological plasticity. G2GNet achieves up to 75% sparsity while improving accuracy by up to 4.3% on benchmarks, including Fashion-MNIST, CIFAR-10, and CIFAR-100, outperforming dense baselines with far fewer computations.
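The structural bias, sparse block-modular ensemble-to-ensemble wiring, can be sketched as a masked linear layer. The grouping and fan-out below are hypothetical (chosen so density is 25%, echoing the 75% sparsity figure); the paper's actual connectivity statistics and the DST/rewiring mechanisms are not reproduced.

```python
import torch
import torch.nn as nn

class EnsembleSparseLinear(nn.Module):
    """Linear layer masked to a block-sparse, ensemble-to-ensemble pattern:
    each input group ("ensemble") connects to only a few output groups."""
    def __init__(self, in_dim, out_dim, groups=8, fan_out=2):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        mask = torch.zeros(out_dim, in_dim)
        gi, go = in_dim // groups, out_dim // groups
        for g in range(groups):
            targets = torch.randperm(groups)[:fan_out]   # sparse wiring
            for t in targets:
                mask[t * go:(t + 1) * go, g * gi:(g + 1) * gi] = 1.0
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.linear(x, self.linear.weight * self.mask,
                                    self.linear.bias)

layer = EnsembleSparseLinear(256, 256)
print("sparsity:", 1.0 - layer.mask.mean().item())   # 0.75 here
```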
[259] Beyond Turing: Memory-Amortized Inference as a Foundation for Cognitive Computation
Xin Li
Main category: cs.LG
TL;DR: Memory-Amortized Inference (MAI) proposes intelligence emerges from structured reuse of prior inference rather than optimization from scratch, modeling cognition as inference over memory cycles instead of gradient recomputation.
Details
Motivation: To address computational bottlenecks in modern AI and provide a biologically grounded theory of intelligence that explains how cognition efficiently reuses prior knowledge rather than recomputing everything.
Method: MAI framework models cognition as inference over latent cycles in memory, using structural reuse to encode inductive biases. It employs delta-homology to model cortical columns as local inference operators and establishes time-reversal duality with reinforcement learning.
Result: MAI provides a principled foundation for Mountcastle’s Universal Cortical Algorithm, demonstrates energy-efficient inference capabilities, and offers a unified theory of intelligence based on structure, reuse, and memory.
Conclusion: MAI offers a biologically plausible framework for AGI that addresses computational efficiency through memory-amortized inference, potentially overcoming current AI bottlenecks by leveraging structured knowledge reuse rather than brute-force computation.
Abstract: Intelligence is fundamentally non-ergodic: it emerges not from uniform sampling or optimization from scratch, but from the structured reuse of prior inference trajectories. We introduce Memory-Amortized Inference (MAI) as a formal framework in which cognition is modeled as inference over latent cycles in memory, rather than recomputation through gradient descent. MAI systems encode inductive biases via structural reuse, minimizing entropy and enabling context-aware, structure-preserving inference. This approach reframes cognitive systems not as ergodic samplers, but as navigators over constrained latent manifolds, guided by persistent topological memory. Through the lens of delta-homology, we show that MAI provides a principled foundation for Mountcastle’s Universal Cortical Algorithm, modeling each cortical column as a local inference operator over cycle-consistent memory states. Furthermore, we establish a time-reversal duality between MAI and reinforcement learning: whereas RL propagates value forward from reward, MAI reconstructs latent causes backward from memory. This inversion paves a path toward energy-efficient inference and addresses the computational bottlenecks facing modern AI. MAI thus offers a unified, biologically grounded theory of intelligence based on structure, reuse, and memory. We also briefly discuss the profound implications of MAI for achieving artificial general intelligence (AGI).
[260] Noise Robust One-Class Intrusion Detection on Dynamic Graphs
Aleksei Liuliakov, Alexander Schulz, Luca Hermes, Barbara Hammer
Main category: cs.LG
TL;DR: Probabilistic TGN-SVDD model improves network intrusion detection robustness against noisy data by predicting Gaussian distribution parameters for network events.
Details
Motivation: Network intrusion detection systems face challenges with contaminated and noisy data inputs, requiring improved robustness against adversarial noise.
Method: Developed a probabilistic version of Temporal Graph Network Support Vector Data Description (TGN-SVDD) that predicts Gaussian distribution parameters for each network event to handle noisy inputs.
Result: Experiments on modified CIC-IDS2017 dataset with synthetic noise showed significant detection performance improvements over baseline TGN-SVDD, especially at higher noise levels.
Conclusion: The probabilistic approach effectively enhances robustness against noisy adversarial inputs in network intrusion detection systems.
Abstract: In the domain of network intrusion detection, robustness against contaminated and noisy data inputs remains a critical challenge. This study introduces a probabilistic version of the Temporal Graph Network Support Vector Data Description (TGN-SVDD) model, designed to enhance detection accuracy in the presence of input noise. By predicting parameters of a Gaussian distribution for each network event, our model is able to naturally handle noisy adversarial inputs and improve robustness compared to a baseline model. Our experiments on a modified CIC-IDS2017 dataset with synthetic noise demonstrate significant improvements in detection performance compared to the baseline TGN-SVDD model, especially as noise levels increase.
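A sketch of the probabilistic scoring idea: predict a Gaussian per event and score by negative log-likelihood, so the model can absorb noise through predicted variance rather than treating every misfit as anomalous. The encoder here is a stand-in MLP, not a temporal graph network, and the head design and dimensions are assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU())   # stand-in, not a TGN
head = nn.Linear(64, 2 * 16)    # mean and log-variance of a 16-d embedding

def anomaly_score(event_feats, target_embed):
    """Negative log-likelihood of the target under the predicted Gaussian
    (up to an additive constant); higher score = more anomalous."""
    h = head(encoder(event_feats))
    mu, log_var = h.chunk(2, dim=-1)
    nll = 0.5 * (log_var + (target_embed - mu) ** 2 / log_var.exp())
    return nll.sum(dim=-1)

scores = anomaly_score(torch.randn(8, 32), torch.randn(8, 16))
```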
[261] Reliability comparison of vessel trajectory prediction models via Probability of Detection
Zahra Rastin, Kathrin Donandt, Dirk Söffker
Main category: cs.LG
TL;DR: Evaluation of deep learning models for vessel trajectory prediction with focus on traffic complexity and reliability assessment using probability of detection analysis.
Details
Motivation: Previous VTP models overlook traffic situation complexity and lack reliability assessments, so this research aims to quantify model reliability in varying traffic scenarios beyond common error analyses.
Method: Uses probability of detection analysis to evaluate deep learning models on test samples categorized by traffic situation complexity during prediction horizon, with performance metrics and reliability estimates for each category.
Result: Comprehensive evaluation provides deeper understanding of strengths/weaknesses of different prediction approaches and their reliability for safe forecast horizons.
Conclusion: Findings can inform development of more reliable vessel trajectory prediction approaches to enhance safety and efficiency in inland waterways navigation.
Abstract: This contribution addresses vessel trajectory prediction (VTP), focusing on the evaluation of different deep learning-based approaches. The objective is to assess model performance in diverse traffic complexities and compare the reliability of the approaches. While previous VTP models overlook the specific traffic situation complexity and lack reliability assessments, this research uses a probability of detection analysis to quantify model reliability in varying traffic scenarios, thus going beyond common error distribution analyses. All models are evaluated on test samples categorized according to their traffic situation during the prediction horizon, with performance metrics and reliability estimates obtained for each category. The results of this comprehensive evaluation provide a deeper understanding of the strengths and weaknesses of the different prediction approaches, along with their reliability in terms of the prediction horizon lengths for which safe forecasts can be guaranteed. These findings can inform the development of more reliable vessel trajectory prediction approaches, enhancing safety and efficiency in future inland waterways navigation.
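One minimal reading of probability-of-detection analysis for trajectory errors, computed per traffic category, is the fraction of predictions whose error stays within a tolerance. The exact POD definition used in the paper may differ in detail, and the data here are synthetic.

```python
import numpy as np

def pod_curve(errors, tolerances):
    """Fraction of predictions whose error is within each tolerance."""
    errors = np.asarray(errors)
    return np.array([(errors <= tol).mean() for tol in tolerances])

rng = np.random.default_rng(0)
errs_by_category = {            # hypothetical per-category position errors [m]
    "free_flow": np.abs(rng.normal(5.0, 2.0, 1000)),
    "crossing":  np.abs(rng.normal(12.0, 6.0, 1000)),
}
tols = np.linspace(0, 30, 7)
for cat, errs in errs_by_category.items():
    print(cat, pod_curve(errs, tols).round(2))
```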
[262] Graph Concept Bottleneck Models
Haotian Xu, Tsui-Wei Weng, Lam M. Nguyen, Tengfei Ma
Main category: cs.LG
TL;DR: GraphCBMs enhance Concept Bottleneck Models by incorporating latent concept graphs to capture relationships between concepts, improving classification performance and interpretability while enabling more effective interventions.
Details
Motivation: Existing CBMs assume concepts are conditionally independent and ignore hidden relationships among concepts, but concepts are often correlated and changing one concept impacts related concepts.
Method: Propose GraphCBMs that construct latent concept graphs to facilitate concept relationships, which can be combined with CBMs to enhance performance while retaining interpretability.
Result: Superior performance in image classification tasks, provides more concept structure information for interpretability, enables more effective interventions using latent concept graphs, and shows robust performance across different training and architecture settings.
Conclusion: GraphCBMs successfully address the limitation of ignoring concept relationships in traditional CBMs, offering improved performance, better interpretability, and more effective intervention capabilities while maintaining robustness.
Abstract: Concept Bottleneck Models (CBMs) provide explicit interpretations for deep neural networks through concepts and allow intervention with concepts to adjust final predictions. Existing CBMs assume concepts are conditionally independent given labels and isolated from each other, ignoring the hidden relationships among concepts. However, the set of concepts in CBMs often has an intrinsic structure where concepts are generally correlated: changing one concept will inherently impact its related concepts. To mitigate this limitation, we propose GraphCBMs: a new variant of CBM that facilitates concept relationships by constructing latent concept graphs, which can be combined with CBMs to enhance model performance while retaining their interpretability. Our experiment results on real-world image classification tasks demonstrate that GraphCBMs offer the following benefits: (1) superior performance in image classification tasks while providing more concept structure information for interpretability; (2) the ability to utilize latent concept graphs for more effective interventions; and (3) robust performance across different training and architecture settings.
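As a rough illustration of the general pattern (not the released implementation), the sketch below predicts concept logits, refines them with one round of message passing over a learned dense concept adjacency, and classifies from the refined concepts; the single refinement round and dense adjacency are simplifying assumptions.

```python
import torch
import torch.nn as nn

class GraphConceptBottleneck(nn.Module):
    """CBM-style head where concept logits exchange information over a
    learned latent concept graph before the final label prediction."""
    def __init__(self, feat_dim: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.concept_head = nn.Linear(feat_dim, n_concepts)
        # Learned (dense) concept-concept affinities, normalized per row.
        self.adj_logits = nn.Parameter(torch.zeros(n_concepts, n_concepts))
        self.classifier = nn.Linear(n_concepts, n_classes)

    def forward(self, features: torch.Tensor):
        c = self.concept_head(features)               # raw concept logits
        A = torch.softmax(self.adj_logits, dim=-1)    # latent concept graph
        c_refined = c + c @ A.T                       # one round of message passing
        return self.classifier(torch.sigmoid(c_refined)), c_refined

model = GraphConceptBottleneck(feat_dim=512, n_concepts=20, n_classes=10)
logits, concepts = model(torch.randn(8, 512))         # 8 backbone feature vectors
```

Because the concepts are coupled through the learned adjacency, overwriting one concept logit before refinement also shifts its neighbours, which is the coupling that makes interventions on related concepts more effective.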
[263] Amortized Bayesian Meta-Learning for Low-Rank Adaptation of Large Language Models
Liyi Zhang, Jake Snell, Thomas L. Griffiths
Main category: cs.LG
TL;DR: ABMLL is an efficient Bayesian meta-learning method for LoRA fine-tuning that improves generalization and uncertainty quantification in large language models while maintaining computational efficiency.
Details
Motivation: Existing methods for improving generalization in LoRA fine-tuning are expensive in memory and computation, requiring long-context prompts or second-order gradient updates.
Method: Amortized Bayesian meta-learning adapted for LLMs with LoRA, reframing task-specific and global parameters using new hyperparameters to balance reconstruction accuracy and parameter fidelity.
Result: Outperforms existing methods on Unified-QA and CrossFit benchmarks in both accuracy and expected calibration error, scales to large models like Llama3-8B.
Conclusion: ABMLL provides computationally efficient generalization improvement with better uncertainty quantification for LoRA fine-tuning of large language models.
Abstract: Fine-tuning large language models (LLMs) with low-rank adaptation (LoRA) is a cost-effective way to incorporate information from a specific dataset. However, it is often unclear how well the fine-tuned LLM will generalize, i.e., how well it will perform on unseen datasets. Methods have been proposed to improve generalization by optimizing with in-context prompts, or by using meta-learning to fine-tune LLMs. However, these methods are expensive in memory and computation, requiring either long-context prompts or saving copies of parameters and using second-order gradient updates. To address these challenges, we propose Amortized Bayesian Meta-Learning for LoRA (ABMLL). This method builds on amortized Bayesian meta-learning for smaller models, adapting this approach to LLMs while maintaining its computational efficiency. We reframe task-specific and global parameters in the context of LoRA and use a set of new hyperparameters to balance reconstruction accuracy and the fidelity of task-specific parameters to the global ones. ABMLL provides effective generalization and scales to large models such as Llama3-8B. Furthermore, as a result of using a Bayesian framework, ABMLL provides improved uncertainty quantification. We test ABMLL on Unified-QA and CrossFit datasets and find that it outperforms existing methods on these benchmarks in terms of both accuracy and expected calibration error.
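To make the task-specific/global split concrete, here is a minimal sketch (not the authors' code) of a frozen linear layer with a sampled low-rank update and a penalty tying task-specific LoRA factors to global ones; the shared posterior scale, initializations, and the `beta` weight are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class BayesianLoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update whose factors are sampled
    from a Gaussian posterior centred on task-specific LoRA parameters."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)
        out_f, in_f = base.weight.shape
        self.mu_A = nn.Parameter(torch.randn(rank, in_f) * 0.02)
        self.mu_B = nn.Parameter(torch.zeros(out_f, rank))
        self.log_sigma = nn.Parameter(torch.full((1,), -4.0))  # shared posterior scale

    def forward(self, x):
        sigma = self.log_sigma.exp()
        A = self.mu_A + sigma * torch.randn_like(self.mu_A)   # reparameterized sample
        B = self.mu_B + sigma * torch.randn_like(self.mu_B)
        return self.base(x) + x @ A.T @ B.T

    def penalty_to_global(self, global_mu_A, global_mu_B, beta: float = 1e-3):
        # Ties task-specific posteriors to the shared (global) LoRA parameters.
        return beta * ((self.mu_A - global_mu_A).pow(2).sum()
                       + (self.mu_B - global_mu_B).pow(2).sum())

layer = BayesianLoRALinear(nn.Linear(128, 128))
out = layer(torch.randn(4, 128))                              # stochastic forward pass
```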
[264] GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation
Amirmohsen Sattarifard, Sepehr Lavasani, Ehsan Imani, Kunlin Zhang, Hanlin Xu, Fengyu Sun, Negar Hassanpour, Chao Gao
Main category: cs.LG
TL;DR: GLASS introduces training-free dynamic pruning for LLMs using rank-aggregation of local and global neuron statistics, outperforming prior methods in long-form generation without inference overhead.
Details
Motivation: Edge deployment of LLMs requires aggressive pruning to reduce computation while maintaining quality. Existing static or predictor-based methods either lack flexibility or add runtime overhead, and zero-shot methods fail on short prompts/long generation scenarios.
Method: A/I-GLASS uses activation- and impact-based global-local neural importance aggregation for FFN sparsification. It dynamically selects FFN units through rank-aggregation of prompt local statistics and model-intrinsic global neuron statistics, without training or auxiliary predictors.
Result: Empirical results across multiple LLMs and benchmarks show GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios.
Conclusion: GLASS provides an effective training-free dynamic pruning approach that maintains quality while reducing computation, without adding inference overhead, making it suitable for edge deployment of large language models.
Abstract: Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic pruning to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail in short-prompt and/or long-generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units using a rank-aggregation of prompt local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
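The rank-aggregation step is simple to picture on its own. A toy sketch, assuming a plain Borda-style sum of ranks (the paper's exact aggregation rule may differ):

```python
import torch

def select_ffn_units(local_stats: torch.Tensor,
                     global_stats: torch.Tensor,
                     keep_ratio: float = 0.5) -> torch.Tensor:
    """Rank-aggregate local (prompt-dependent) and global (model-intrinsic)
    importance scores and return a boolean keep-mask over FFN units."""
    # argsort of argsort converts scores to ranks (higher score -> higher rank)
    local_rank = local_stats.argsort().argsort()
    global_rank = global_stats.argsort().argsort()
    agg = local_rank + global_rank                  # simple Borda-style aggregation
    k = int(keep_ratio * agg.numel())
    keep = torch.zeros_like(agg, dtype=torch.bool)
    keep[agg.topk(k).indices] = True
    return keep

n_units = 11008                                     # e.g., FFN width of a 7B model
local = torch.randn(n_units).abs()                  # prompt activation magnitudes
global_ = torch.randn(n_units).abs()                # calibration-time statistics
mask = select_ffn_units(local, global_, keep_ratio=0.3)
print(mask.sum().item(), "of", n_units, "units kept")
```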
[265] Learning Time-Varying Convexifications of Multiple Fairness Measures
Quan Zhou, Jakub Marecek, Robert Shorten
Main category: cs.LG
TL;DR: Learning time-varying convex combinations of multiple fairness measures with limited graph-structured feedback
Details
Motivation: There is increasing need to consider multiple fairness measures (group and individual notions) simultaneously, but their relative weights are unknown, time-varying, and must be learned adaptively.
Method: Proposes learning time-varying convexifications of multiple fairness measures with limited graph-structured feedback.
Result: Not specified in abstract
Conclusion: Addresses the challenge of dynamically learning appropriate weight combinations for multiple fairness constraints in evolving environments
Abstract: There is an increasing appreciation that one may need to consider multiple measures of fairness, e.g., considering multiple group and individual fairness notions. The relative weights of the fairness regularisers are a priori unknown, may be time varying, and need to be learned on the fly. We consider the learning of time-varying convexifications of multiple fairness measures with limited graph-structured feedback.
[266] Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS
Can Jin, Yang Zhou, Qixin Zhang, Hongwu Peng, Di Zhang, Marco Pavone, Ligong Han, Zhang-Wei Hong, Tong Che, Dimitris N. Metaxas
Main category: cs.LG
TL;DR: AIRL-S unifies RL and search-based test-time scaling by using the RL-learned reward function as an ideal process reward model (PRM) for guiding search, eliminating need for labeled intermediate data.
Details
Motivation: Existing test-time scaling methods have limitations - RL methods suffer from instability and low efficiency, while search-based methods require expensive labeled data and degrade under distribution shifts. A unified approach is needed.
Method: Uses adversarial inverse reinforcement learning (AIRL) with group relative policy optimization (GRPO) to learn dense, dynamic PRM directly from correct reasoning traces without labeled intermediate data. The PRM serves both as RL critic and search heuristic.
Result: Improves performance by 9% on average over base model across 8 benchmarks (mathematics, scientific reasoning, code generation), matching GPT-4o. Outperforms all baseline PRMs trained with labeled data when integrated into search algorithms.
Conclusion: The RL-learned reward function serves as the best PRM for search, providing a robust and cost-effective solution for complex reasoning tasks in LLMs without requiring expensive labeled data.
Abstract: Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9% on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.
[267] FedRAIN-Lite: Federated Reinforcement Algorithms for Improving Idealised Numerical Weather and Climate Models
Pritthijit Nath, Sebastian Schemm, Henry Moss, Peter Haynes, Emily Shuckburgh, Mark Webb
Main category: cs.LG
TL;DR: FedRAIN-Lite uses federated reinforcement learning with latitude-based agents to enable adaptive climate model parameterization, showing DDPG outperforms traditional static methods with better accuracy and faster convergence.
Details
Motivation: Traditional climate model parameterizations are static and tuned offline, limiting adaptability to evolving climate states and changing conditions.
Method: Federated reinforcement learning framework with agents assigned to latitude bands, tested on simplified energy-balance climate models (ebm-v1 to ebm-v3) using DDPG and other RL algorithms.
Result: DDPG consistently outperformed static and single-agent baselines with faster convergence, lower area-weighted RMSE in tropical/mid-latitude zones, and demonstrated good transferability across hyperparameters.
Conclusion: The approach provides a scalable pathway for geographically adaptive parameter learning in high-complexity GCMs and offers a prototype for physically aligned online-learning climate models.
Abstract: Sub-grid parameterisations in climate models are traditionally static and tuned offline, limiting adaptability to evolving states. This work introduces FedRAIN-Lite, a federated reinforcement learning (FedRL) framework that mirrors the spatial decomposition used in general circulation models (GCMs) by assigning agents to latitude bands, enabling local parameter learning with periodic global aggregation. Using a hierarchy of simplified energy-balance climate models, from a single-agent baseline (ebm-v1) to multi-agent ensemble (ebm-v2) and GCM-like (ebm-v3) setups, we benchmark three RL algorithms under different FedRL configurations. Results show that Deep Deterministic Policy Gradient (DDPG) consistently outperforms both static and single-agent baselines, with faster convergence and lower area-weighted RMSE in tropical and mid-latitude zones across both ebm-v2 and ebm-v3 setups. DDPG’s ability to transfer across hyperparameters and low computational cost make it well-suited for geographically adaptive parameter learning. This capability offers a scalable pathway towards high-complexity GCMs and provides a prototype for physically aligned, online-learning climate models that can evolve with a changing climate. Code accessible at https://github.com/p3jitnath/climate-rl-fedrl.
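The "local learning with periodic global aggregation" loop can be sketched as plain FedAvg-style averaging over per-band parameters. A minimal sketch, assuming area-based weights and band names that are illustrative only:

```python
import numpy as np

def federated_average(band_params: dict[str, dict[str, np.ndarray]],
                      weights: dict[str, float]) -> dict[str, np.ndarray]:
    """Area-weighted average of per-latitude-band agent parameters,
    returned as the new global model."""
    total = sum(weights.values())
    keys = next(iter(band_params.values())).keys()
    return {k: sum(weights[b] * band_params[b][k] for b in band_params) / total
            for k in keys}

# Three latitude-band agents with toy policy parameters.
bands = {b: {"theta": np.random.randn(4)} for b in ["tropics", "midlat", "polar"]}
area_weights = {"tropics": 0.5, "midlat": 0.35, "polar": 0.15}  # illustrative
global_params = federated_average(bands, area_weights)
# Broadcast the aggregate back to every band before the next local-learning round.
bands = {b: {k: v.copy() for k, v in global_params.items()} for b in bands}
```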
[268] Multi-view Graph Condensation via Tensor Decomposition
Nícolas Roque dos Santos, Dawon Ahn, Diego Minatel, Alneu de Andrade Lopes, Evangelos E. Papalexakis
Main category: cs.LG
TL;DR: GCTD proposes tensor decomposition for graph condensation to reduce computational costs while maintaining GNN performance and improving interpretability.
Details
Motivation: Current graph condensation methods rely on computationally intensive bi-level optimization and lack interpretability due to missing node mapping between original and synthetic graphs.
Method: Multi-view Graph Condensation via Tensor Decomposition (GCTD) uses tensor decomposition techniques to synthesize a smaller informative graph while maintaining performance.
Result: GCTD achieves up to 4.0% accuracy improvement on 3/6 datasets and competitive performance on large graphs while effectively reducing graph size.
Conclusion: Tensor decomposition offers an effective alternative to bi-level optimization for graph condensation, providing better interpretability and competitive performance with reduced computational demands.
Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable results in various real-world applications, including drug discovery, object detection, social media analysis, recommender systems, and text classification. In contrast to their vast potential, training them on large-scale graphs presents significant computational challenges due to the resources required for their storage and processing. Graph Condensation has emerged as a promising solution to reduce these demands by learning a synthetic compact graph that preserves the essential information of the original one while maintaining the GNN’s predictive performance. Despite their efficacy, current graph condensation approaches frequently rely on a computationally intensive bi-level optimization. Moreover, they fail to maintain a mapping between synthetic and original nodes, limiting the interpretability of the model’s decisions. In this sense, a wide range of decomposition techniques have been applied to learn linear or multi-linear functions from graph data, offering a more transparent and less resource-intensive alternative. However, their applicability to graph condensation remains unexplored. This paper addresses this gap and proposes a novel method called Multi-view Graph Condensation via Tensor Decomposition (GCTD) to investigate to what extent such techniques can synthesize an informative smaller graph and achieve comparable downstream task performance. Extensive experiments on six real-world datasets demonstrate that GCTD effectively reduces graph size while preserving GNN performance, achieving up to a 4.0% improvement in accuracy on three out of six datasets and competitive performance on large graphs compared to existing approaches. Our code is available at https://anonymous.4open.science/r/gctd-345A.
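For readers unfamiliar with the underlying machinery, a generic rank-R CP decomposition of a 3-way tensor via alternating least squares looks as follows; this is textbook CP-ALS, not the GCTD pipeline itself.

```python
import numpy as np

def unfold(T, mode):
    """Matricize a 3-way tensor along the given mode (C-order columns)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product of two factor matrices."""
    R = A.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, R)

def cp_als(T, rank, n_iter=100, seed=0):
    """Rank-`rank` CP decomposition of a 3-way tensor via alternating least squares."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, rank)) for s in T.shape)
    for _ in range(n_iter):
        A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
        B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
        C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

T = np.random.rand(10, 8, 6)       # toy multi-view graph tensor
A, B, C = cp_als(T, rank=3)        # low-rank factors usable as a condensed view
```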
[269] DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang
Main category: cs.LG
TL;DR: DuPO is a dual learning-based preference optimization framework that generates annotation-free feedback through generalized duality, eliminating the need for costly labels and expanding beyond strictly dual task pairs.
Details
Motivation: To address limitations of RLVR (reliance on costly labels and restricted applicability to verifiable tasks) and traditional dual learning (restricted to strictly dual task pairs like translation/back-translation).
Method: Decomposes primal task input into known/unknown components, constructs dual task to reconstruct unknown parts using primal output and known information, uses reconstruction quality as self-supervised reward to optimize primal task via LLMs.
Result: Achieved 2.13 COMET improvement in translation across 756 directions, 6.4% accuracy boost in mathematical reasoning on three benchmarks, and 9.3 point improvement as inference-time reranker.
Conclusion: DuPO presents a scalable, general, and annotation-free paradigm for LLM optimization that broadens applicability to non-invertible tasks through generalized duality.
Abstract: We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)’s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning’s restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task’s input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs’ ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
[270] A Comparative Evaluation of Teacher-Guided Reinforcement Learning Techniques for Autonomous Cyber Operations
Konur Tholl, Mariam El Mezouar, Ranwa Al Mallah
Main category: cs.LG
TL;DR: Teacher-guided reinforcement learning techniques significantly improve training efficiency in autonomous cyber operations, enhancing early policy performance and convergence speed compared to learning from scratch.
Details
Motivation: Existing autonomous cyber operations require agents to learn from scratch, resulting in slow convergence and poor early-stage performance. Teacher-guided techniques have shown promise in other domains but haven't been applied to cybersecurity operations.
Method: Implemented four distinct teacher-guided techniques in the simulated CybORG environment and conducted comparative evaluation of their effectiveness.
Result: Teacher integration significantly improved training efficiency in terms of early policy performance and convergence speed.
Conclusion: Teacher-guided techniques show strong potential benefits for autonomous cybersecurity operations by addressing the limitations of traditional reinforcement learning approaches.
Abstract: Autonomous Cyber Operations (ACO) rely on Reinforcement Learning (RL) to train agents to make effective decisions in the cybersecurity domain. However, existing ACO applications require agents to learn from scratch, leading to slow convergence and poor early-stage performance. While teacher-guided techniques have demonstrated promise in other domains, they have not yet been applied to ACO. In this study, we implement four distinct teacher-guided techniques in the simulated CybORG environment and conduct a comparative evaluation. Our results demonstrate that teacher integration can significantly improve training efficiency in terms of early policy performance and convergence speed, highlighting its potential benefits for autonomous cybersecurity.
[271] NeRC: Neural Ranging Correction through Differentiable Moving Horizon Location Estimation
Xu Weng, K. V. Ling, Haochen Liu, Bingheng Wang, Kun Cao
Main category: cs.LG
TL;DR: NeRC is an end-to-end neural framework that corrects GNSS ranging errors using ground-truth locations instead of hard-to-obtain error labels, achieving improved positioning accuracy for mobile devices in urban environments.
Details
Motivation: GNSS localization on mobile devices suffers from accuracy issues in urban areas due to signal propagation complexity and low-quality hardware. Traditional data-driven methods require difficult-to-obtain ranging error annotations.
Method: Proposes Neural Ranging Correction (NeRC) framework with differentiable moving horizon estimation for end-to-end training using ground-truth locations. Introduces Euclidean Distance Field cost maps to reduce dependency on labeled data.
Result: Demonstrates significant improvement in positioning accuracy on public benchmarks and collected datasets. Successfully deployed on edge devices with real-time performance.
Conclusion: NeRC provides a practical solution for accurate GNSS positioning on mobile devices by eliminating the need for ranging error annotations and enabling end-to-end learning with easily obtainable ground-truth locations.
Abstract: GNSS localization using everyday mobile devices is challenging in urban environments, as ranging errors caused by the complex propagation of satellite signals and low-quality onboard GNSS hardware are blamed for undermining positioning accuracy. Researchers have pinned their hopes on data-driven methods to regress such ranging errors from raw measurements. However, the grueling annotation of ranging errors impedes their pace. This paper presents a robust end-to-end Neural Ranging Correction (NeRC) framework, where localization-related metrics serve as the task objective for training the neural modules. Instead of seeking impractical ranging error labels, we train the neural network using ground-truth locations that are relatively easy to obtain. This functionality is supported by differentiable moving horizon location estimation (MHE) that handles a horizon of measurements for positioning and backpropagates the gradients for training. Even better, as a blessing of end-to-end learning, we propose a new training paradigm using Euclidean Distance Field (EDF) cost maps, which alleviates the demands on labeled locations. We evaluate the proposed NeRC on public benchmarks and our collected datasets, demonstrating its distinguished improvement in positioning accuracy. We also deploy NeRC on the edge to verify its real-time performance for mobile devices.
[272] Organ-Agents: Virtual Human Physiology Simulator via LLMs
Rihao Chang, He Jiao, Weizhi Nie, Honglin Guo, Keliang Xie, Zhenhua Wu, Lina Zhao, Yunpeng Bai, Yongtao Ma, Lanjun Wang, Yuting Su, Xi Gao, Weijie Wang, Nicu Sebe, Bruno Lepri, Bingwei Sun
Main category: cs.LG
TL;DR: Organ-Agents is a multi-agent LLM framework that simulates human physiology using specialized agents for different body systems, achieving high accuracy in sepsis patient simulation and enabling counterfactual treatment analysis.
Details
Motivation: To create a credible digital twin for human physiology that can simulate complex multi-system interactions, particularly for critical care scenarios like sepsis, enabling precision diagnosis, treatment simulation, and hypothesis testing.
Method: Multi-agent framework with LLM-driven agents modeling specific physiological systems. Training involves supervised fine-tuning on system-specific time-series data followed by reinforcement-guided coordination with dynamic reference selection and error correction. Uses data from 7,134 sepsis patients and 7,895 controls across 9 systems and 125 variables.
Result: Achieved high simulation accuracy on 4,509 held-out patients (per-system MSEs <0.16), robust across severity strata. External validation on 22,689 ICU patients showed moderate degradation but stable simulation. Faithfully reproduces critical multi-system events with coherent timing. Physicians rated realism and plausibility highly (3.9 and 3.7 on Likert scale). Enables counterfactual simulations with aligned APACHE II scores.
Conclusion: Organ-Agents represents a credible, interpretable, and generalizable digital twin for critical care applications, preserving decision-relevant patterns and enabling synthetic data generation for downstream tasks with minimal performance degradation.
Abstract: Recent advances in large language models (LLMs) have enabled new possibilities in simulating complex physiological systems. We introduce Organ-Agents, a multi-agent framework that simulates human physiology via LLM-driven agents. Each Simulator models a specific system (e.g., cardiovascular, renal, immune). Training consists of supervised fine-tuning on system-specific time-series data, followed by reinforcement-guided coordination using dynamic reference selection and error correction. We curated data from 7,134 sepsis patients and 7,895 controls, generating high-resolution trajectories across 9 systems and 125 variables. Organ-Agents achieved high simulation accuracy on 4,509 held-out patients, with per-system MSEs <0.16 and robustness across SOFA-based severity strata. External validation on 22,689 ICU patients from two hospitals showed moderate degradation under distribution shifts with stable simulation. Organ-Agents faithfully reproduces critical multi-system events (e.g., hypotension, hyperlactatemia, hypoxemia) with coherent timing and phase progression. Evaluation by 15 critical care physicians confirmed realism and physiological plausibility (mean Likert ratings 3.9 and 3.7). Organ-Agents also enables counterfactual simulations under alternative sepsis treatment strategies, generating trajectories and APACHE II scores aligned with matched real-world patients. In downstream early warning tasks, classifiers trained on synthetic data showed minimal AUROC drops (<0.04), indicating preserved decision-relevant patterns. These results position Organ-Agents as a credible, interpretable, and generalizable digital twin for precision diagnosis, treatment simulation, and hypothesis testing in critical care.
[273] On the Interplay between Graph Structure and Learning Algorithms in Graph Neural Networks
Junwei Su, Chuan Wu
Main category: cs.LG
TL;DR: This paper analyzes how graph structure affects learning algorithm performance in GNNs, focusing on excess risk and connecting spectral graph theory with SGD/Ridge regression performance.
Details
Motivation: Existing theoretical studies on GNN learning dynamics focus primarily on convergence rates under noise-free conditions and provide only crude connections to graph structure. This paper aims to bridge the gap by examining generalization performance with noise and establishing precise connections between graph structure and learning algorithm performance.
Method: The authors extend conventional learning theory settings to GNNs, derive excess risk profiles for SGD and Ridge regression, connect these to graph structure through spectral graph theory, and perform comparative analysis of different graph structures (regular vs. power-law). They also extend analysis to multi-layer linear GNNs.
Result: The study reveals how different graph structures impact learning algorithm performance, shows an increasing non-isotropic effect on excess risk in multi-layer GNNs, and provides new insights into the over-smoothing issue. Empirical results align with theoretical predictions.
Conclusion: The research demonstrates a coupling relationship among graph structure, GNNs, and learning algorithms, providing practical insights for GNN algorithm design and selection based on graph structural properties.
Abstract: This paper studies the interplay between learning algorithms and graph structure for graph neural networks (GNNs). Existing theoretical studies on the learning dynamics of GNNs primarily focus on the convergence rates of learning algorithms under the interpolation regime (noise-free) and offer only a crude connection between these dynamics and the actual graph structure (e.g., maximum degree). This paper aims to bridge this gap by investigating the excess risk (generalization performance) of learning algorithms in GNNs within the generalization regime (with noise). Specifically, we extend the conventional settings from the learning theory literature to the context of GNNs and examine how graph structure influences the performance of learning algorithms such as stochastic gradient descent (SGD) and Ridge regression. Our study makes several key contributions toward understanding the interplay between graph structure and learning in GNNs. First, we derive the excess risk profiles of SGD and Ridge regression in GNNs and connect these profiles to the graph structure through spectral graph theory. With this established framework, we further explore how different graph structures (regular vs. power-law) impact the performance of these algorithms through comparative analysis. Additionally, we extend our analysis to multi-layer linear GNNs, revealing an increasing non-isotropic effect on the excess risk profile, thereby offering new insights into the over-smoothing issue in GNNs from the perspective of learning algorithms. Our empirical results align with our theoretical predictions, collectively showcasing a coupling relation among graph structure, GNNs and learning algorithms, and providing insights on GNN algorithm design and selection in practice.
[274] A Non-Asymptotic Convergent Analysis for Scored-Based Graph Generative Model via a System of Stochastic Differential Equations
Junwei Su, Chuan Wu
Main category: cs.LG
TL;DR: First non-asymptotic convergence analysis for score-based graph generative models (SGGMs) that reveals unique convergence factors and provides practical guidance for model design.
Details
Motivation: Score-based graph generative models are effective in drug discovery and protein synthesis but lack theoretical understanding of convergence behavior, unlike standard score-based models.
Method: Theoretical convergence analysis across three graph generation paradigms: feature generation with fixed structure, structure generation with fixed features, and joint generation of both. Uses controlled empirical study with synthetic graph models for validation.
Result: Identifies unique convergence factors specific to SGGMs (e.g., topological graph properties), provides theoretical insights for hyperparameter selection, and validates findings with empirical results that align with theoretical predictions.
Conclusion: This work deepens theoretical understanding of SGGMs, demonstrates their applicability in critical domains, and offers practical guidance for designing effective models through proper hyperparameter selection and techniques like normalization.
Abstract: Score-based graph generative models (SGGMs) have proven effective in critical applications such as drug discovery and protein synthesis. However, their theoretical behavior, particularly regarding convergence, remains underexplored. Unlike common score-based generative models (SGMs), which are governed by a single stochastic differential equation (SDE), SGGMs involve a system of coupled SDEs. In SGGMs, the graph structure and node features are governed by separate but interdependent SDEs. This distinction makes existing convergence analyses from SGMs inapplicable for SGGMs. In this work, we present the first non-asymptotic convergence analysis for SGGMs, focusing on the convergence bound (the risk of generative error) across three key graph generation paradigms: (1) feature generation with a fixed graph structure, (2) graph structure generation with fixed node features, and (3) joint generation of both graph structure and node features. Our analysis reveals several unique factors specific to SGGMs (e.g., the topological properties of the graph structure) which affect the convergence bound. Additionally, we offer theoretical insights into the selection of hyperparameters (e.g., sampling steps and diffusion length) and advocate for techniques like normalization to improve convergence. To validate our theoretical findings, we conduct a controlled empirical study using synthetic graph models, and the results align with our theoretical predictions. This work deepens the theoretical understanding of SGGMs, demonstrates their applicability in critical domains, and provides practical guidance for designing effective models.
[275] Online Incident Response Planning under Model Misspecification through Bayesian Learning and Belief Quantization
Kim Hammar, Tao Li
Main category: cs.LG
TL;DR: MOBAL is an online Bayesian learning method for cyber incident response that adapts to model misspecification by iteratively refining conjectures through Bayesian updates and using quantized Markov models for efficient response planning.
Details
Motivation: Traditional incident response frameworks require detailed system models, limiting their practical utility when attack information is incomplete or inaccurate. This paper addresses the need for decision-support that works under model misspecification.
Method: MOBAL uses iterative Bayesian learning to refine conjectures about the attack model as new information becomes available. It quantizes the conjectured model into a finite Markov model to enable efficient response planning through dynamic programming.
Result: The method proves Bayesian learning is asymptotically consistent and establishes bounds on misspecification and quantization errors. Experiments on CAGE-2 benchmark show MOBAL outperforms state-of-the-art methods in adaptability and robustness to model misspecification.
Conclusion: MOBAL provides an effective online approach for cyber incident response that handles model uncertainty through Bayesian learning and enables efficient decision-making even with incomplete information, demonstrating superior performance compared to existing methods.
Abstract: Effective responses to cyberattacks require fast decisions, even when information about the attack is incomplete or inaccurate. However, most decision-support frameworks for incident response rely on a detailed system model that describes the incident, which restricts their practical utility. In this paper, we address this limitation and present an online method for incident response planning under model misspecification, which we call MOBAL: Misspecified Online Bayesian Learning. MOBAL iteratively refines a conjecture about the model through Bayesian learning as new information becomes available, which facilitates model adaptation as the incident unfolds. To determine effective responses online, we quantize the conjectured model into a finite Markov model, which enables efficient response planning through dynamic programming. We prove that Bayesian learning is asymptotically consistent with respect to the information feedback. Additionally, we establish bounds on misspecification and quantization errors. Experiments on the CAGE-2 benchmark show that MOBAL outperforms the state of the art in terms of adaptability and robustness to model misspecification.
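A toy rendition of the two ingredients — Bayesian (Dirichlet-count) updates of a conjectured transition model as observations arrive, and dynamic programming on the resulting finite Markov model — is sketched below; the state/action sizes and rewards are placeholders, not the CAGE-2 setup.

```python
import numpy as np

n_states, n_actions = 5, 2
counts = np.ones((n_states, n_actions, n_states))   # Dirichlet(1) prior pseudo-counts
rewards = np.random.rand(n_states, n_actions)       # illustrative response rewards

def bayes_update(s, a, s_next):
    """Refine the conjectured model from one observed incident transition."""
    counts[s, a, s_next] += 1.0

def plan(gamma=0.95, iters=200):
    """Value iteration on the posterior-mean (quantized) Markov model."""
    P = counts / counts.sum(axis=-1, keepdims=True)  # conjectured transition model
    V = np.zeros(n_states)
    for _ in range(iters):
        V = (rewards + gamma * P @ V).max(axis=1)
    return (rewards + gamma * P @ V).argmax(axis=1)  # greedy response policy

bayes_update(0, 1, 3)                                # new information feedback
policy = plan()                                      # replan as the incident unfolds
```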
[276] SBGD: Improving Graph Diffusion Generative Model via Stochastic Block Diffusion
Junwei Su, Shan Wu
Main category: cs.LG
TL;DR: SBGD model addresses scalability and size generalization issues in graph diffusion models by using block graph representations with structural priors, achieving 6x memory reduction while maintaining performance.
Details
Motivation: Graph diffusion generative models face scalability challenges with large graphs due to high memory requirements and poor generalization to different graph sizes, limiting their real-world applicability.
Method: Proposes stochastic block graph diffusion (SBGD) model that refines graph representations into block graph space with structural priors based on real-world patterns, reducing memory complexity.
Result: SBGD achieves up to 6x memory improvements while maintaining comparable or superior graph generation performance, and demonstrates better generalization to unseen graph sizes.
Conclusion: SBGD provides a scalable and effective solution for graph generation while exemplifying modularization principles in generative modeling, offering new avenues for complex task decomposition.
Abstract: Graph diffusion generative models (GDGMs) have emerged as powerful tools for generating high-quality graphs. However, their broader adoption faces challenges in scalability and size generalization. GDGMs struggle to scale to large graphs due to their high memory requirements, as they typically operate in the full graph space, requiring the entire graph to be stored in memory during training and inference. This constraint limits their feasibility for large-scale real-world graphs. GDGMs also exhibit poor size generalization, with limited ability to generate graphs of sizes different from those in the training data, restricting their adaptability across diverse applications. To address these challenges, we propose the stochastic block graph diffusion (SBGD) model, which refines graph representations into a block graph space. This space incorporates structural priors based on real-world graph patterns, significantly reducing memory complexity and enabling scalability to large graphs. The block representation also improves size generalization by capturing fundamental graph structures. Empirical results show that SBGD achieves significant memory improvements (up to 6$\times$) while maintaining comparable or even superior graph generation performance relative to state-of-the-art methods. Furthermore, experiments demonstrate that SBGD better generalizes to unseen graph sizes. The significance of SBGD extends beyond being a scalable and effective GDGM; it also exemplifies the principle of modularization in generative modeling, offering a new avenue for exploring generative models by decomposing complex tasks into more manageable components.
[277] Disentanglement in T-space for Faster and Distributed Training of Diffusion Models with Fewer Latent-states
Samarth Gupta, Raghudeep Gadde, Rui Chen, Aleix M. Martinez
Main category: cs.LG
TL;DR: The paper challenges the assumption that diffusion models require many time steps, showing that careful noise schedule selection enables training with as few as 32 steps (matching 1000-step performance) and even pushing to single-step models that can be combined for high-quality generation with 4-6x faster convergence.
Details
Motivation: To challenge the fundamental assumption that diffusion models require a large number of latent states/time steps for effective training and Gaussian reverse processes.
Method: Careful selection of noise schedules to enable training with minimal latent states (as low as 32 steps), and developing completely disentangled models using single latent-state models that can be combined for generation.
Result: Achieved performance matching models with 1000 steps using only 32 steps, and successful generation with single-step models combined together, providing 4-6x faster convergence across various metrics on two datasets.
Conclusion: Diffusion models do not inherently require many time steps; with proper noise scheduling, significantly fewer steps (even single steps) can achieve comparable performance with substantially faster convergence.
Abstract: We challenge a fundamental assumption of diffusion models, namely, that a large number of latent-states or time-steps is required for training so that the reverse generative process is close to a Gaussian. We first show that with careful selection of a noise schedule, diffusion models trained over a small number of latent states (i.e. $T \sim 32$) match the performance of models trained over a much larger number of latent states ($T \sim 1,000$). Second, we push this limit (on the minimum number of latent states required) to a single latent-state, which we refer to as complete disentanglement in T-space. We show that high quality samples can be easily generated by the disentangled model obtained by combining several independently trained single latent-state models. We provide extensive experiments to show that the proposed disentangled model provides 4-6$\times$ faster convergence measured across a variety of metrics on two different datasets.
[278] Personalized Counterfactual Framework: Generating Potential Outcomes from Wearable Data
Ajan Subramanian, Amir M. Rahmani
Main category: cs.LG
TL;DR: A framework for learning personalized counterfactual models from wearable sensor data to explore what-if scenarios for individual health outcomes.
Details
Motivation: Wearable sensor data provides opportunities for personalized health monitoring, but deriving actionable insights from complex longitudinal data streams is challenging.
Method: Augments individual datasets with similar patients’ data via multi-modal similarity analysis, uses temporal PC algorithm to discover predictive relationships, trains Gradient Boosting Machines to quantify individual-specific effects, and implements a counterfactual engine for projecting physiological trajectories under hypothetical interventions.
Result: Achieved reasonable predictive accuracy (mean heart rate MAE 4.71 bpm) and high counterfactual plausibility (median 0.9643), showing significant inter-individual variability in response to lifestyle changes.
Conclusion: Provides a tool to explore personalized health dynamics and generate hypotheses on individual responses to lifestyle changes, demonstrating potential for personalized health insights.
Abstract: Wearable sensor data offer opportunities for personalized health monitoring, yet deriving actionable insights from their complex, longitudinal data streams is challenging. This paper introduces a framework to learn personalized counterfactual models from multivariate wearable data. This enables exploring what-if scenarios to understand potential individual-specific outcomes of lifestyle choices. Our approach first augments individual datasets with data from similar patients via multi-modal similarity analysis. We then use a temporal PC (Peter-Clark) algorithm adaptation to discover predictive relationships, modeling how variables at time t-1 influence physiological changes at time t. Gradient Boosting Machines are trained on these discovered relationships to quantify individual-specific effects. These models drive a counterfactual engine projecting physiological trajectories under hypothetical interventions (e.g., activity or sleep changes). We evaluate the framework via one-step-ahead predictive validation and by assessing the plausibility and impact of interventions. Evaluation showed reasonable predictive accuracy (e.g., mean heart rate MAE 4.71 bpm) and high counterfactual plausibility (median 0.9643). Crucially, these interventions highlighted significant inter-individual variability in response to hypothetical lifestyle changes, showing the framework’s potential for personalized insights. This work provides a tool to explore personalized health dynamics and generate hypotheses on individual responses to lifestyle changes.
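A minimal sketch of the predict-then-intervene loop, assuming a toy three-variable layout (steps, sleep, heart rate at t-1); the actual framework first discovers which lagged variables are predictive via its temporal PC adaptation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Toy longitudinal data: predict heart rate at t from (steps, sleep, hr) at t-1.
X = rng.normal(size=(500, 3))                        # columns: steps, sleep, hr (t-1)
y = 0.5 * X[:, 2] - 0.3 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor().fit(X, y)

def counterfactual_rollout(x0, intervention, horizon=7):
    """Project a physiological trajectory under a hypothetical intervention,
    here fixing daily activity to a new level for `horizon` days."""
    x, traj = x0.copy(), []
    for _ in range(horizon):
        x[0] = intervention                          # do(steps = intervention)
        hr_next = model.predict(x.reshape(1, -1))[0]
        traj.append(hr_next)
        x[2] = hr_next                               # feed prediction forward
    return traj

baseline = counterfactual_rollout(X[0], intervention=X[0, 0])
more_active = counterfactual_rollout(X[0], intervention=X[0, 0] + 1.0)
```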
[279] Fast Symbolic Regression Benchmarking
Viktor Martinek
Main category: cs.LG
TL;DR: Improved symbolic regression benchmarking with curated expression lists and early termination, increasing rediscovery rates and reducing computational costs.
Details
Motivation: Existing symbolic regression benchmarks overemphasize recovering specific expression forms, rely solely on computer algebra systems for assessment, and continue searching after discovery, leading to inefficiencies.
Method: Introduces curated lists of acceptable expressions and a callback mechanism for early termination, applied to the SRSD benchmark problems to evaluate SymbolicRegression.jl and TiSR packages.
Result: SymbolicRegression.jl’s rediscovery rate increased from 26.7% to 44.7% with 41.2% less computational expense. TiSR achieved 69.4% rediscovery rate with 63% time savings.
Conclusion: The new benchmarking approach significantly improves both rediscovery performance and computational efficiency for symbolic regression algorithms.
Abstract: Symbolic regression (SR) uncovers mathematical models from data. Several benchmarks have been proposed to compare the performance of SR algorithms. However, existing ground-truth rediscovery benchmarks overemphasize the recovery of “the one” expression form or rely solely on computer algebra systems (such as SymPy) to assess success. Furthermore, existing benchmarks continue the expression search even after its discovery. We improve upon these issues by introducing curated lists of acceptable expressions, and a callback mechanism for early termination. As a starting point, we use the symbolic regression for scientific discovery (SRSD) benchmark problems proposed by Yoshitomo et al., and benchmark the two SR packages SymbolicRegression.jl and TiSR. The new benchmarking method increases the rediscovery rate of SymbolicRegression.jl from 26.7%, as reported by Yoshitomo et al., to 44.7%. Performing the benchmark takes 41.2% less computational expense. TiSR’s rediscovery rate is 69.4%, while performing the benchmark saves 63% time.
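The curated-list-plus-callback idea is straightforward to express with SymPy; the callback signature below is a hypothetical stand-in, since the real benchmark hooks into SymbolicRegression.jl and TiSR.

```python
import sympy as sp

x = sp.Symbol("x")
# Curated list of acceptable forms for one benchmark problem
# (e.g., the target with and without an allowed constant offset).
acceptable = [sp.sin(x) * sp.exp(-x), sp.sin(x) * sp.exp(-x) + 1]

def rediscovered(candidate: sp.Expr) -> bool:
    """True if the candidate is equivalent to any curated acceptable expression."""
    return any(sp.simplify(candidate - ref) == 0 for ref in acceptable)

def early_stop_callback(hall_of_fame) -> bool:
    """Handed to the SR loop: stop the search on first rediscovery
    instead of burning compute after the target is already found."""
    return any(rediscovered(expr) for expr in hall_of_fame)

print(rediscovered(sp.sin(x) / sp.exp(x)))  # True: equivalent form, search stops
```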
[280] Exact Shapley Attributions in Quadratic-time for FANOVA Gaussian Processes
Majid Mohammadi, Krikamol Muandet, Ilaria Tiddi, Annette Ten Teije, Siu Lun Chau
Main category: cs.LG
TL;DR: Exact Shapley value computation for FANOVA Gaussian processes in quadratic time, enabling efficient local and global explanations with uncertainty quantification.
Details
Motivation: Shapley values provide principled feature attribution but scale exponentially with features, especially challenging for probabilistic models like GPs where outputs are random variables requiring higher-order moment modeling.
Method: Leverage FANOVA GP’s closed-form Möbius representation and introduce recursive algorithms inspired by Newton’s identities to compute exact Shapley values in quadratic time for both local (stochastic cooperative game) and global (variance-based value function) explanations.
Result: Achieves exact Shapley attribution computation in quadratic time instead of exponential, capturing both expected contributions and uncertainty for local explanations, and quantifying feature contributions to model sensitivity for global explanations.
Conclusion: Enhances explainable AI by providing scalable, axiomatically sound, and uncertainty-aware explanations for structured probabilistic models, making Shapley-based feature attribution practical for complex GP models.
Abstract: Shapley values are widely recognized as a principled method for attributing importance to input features in machine learning. However, the exact computation of Shapley values scales exponentially with the number of features, severely limiting the practical application of this powerful approach. The challenge is further compounded when the predictive model is probabilistic - as in Gaussian processes (GPs) - where the outputs are random variables rather than point estimates, necessitating additional computational effort in modeling higher-order moments. In this work, we demonstrate that for an important class of GPs known as FANOVA GP, which explicitly models all main effects and interactions, exact Shapley attributions for both local and global explanations can be computed in quadratic time. For local, instance-wise explanations, we define a stochastic cooperative game over function components and compute the exact stochastic Shapley value in quadratic time only, capturing both the expected contribution and uncertainty. For global explanations, we introduce a deterministic, variance-based value function and compute exact Shapley values that quantify each feature’s contribution to the model’s overall sensitivity. Our methods leverage a closed-form (stochastic) Möbius representation of the FANOVA decomposition and introduce recursive algorithms, inspired by Newton’s identities, to efficiently compute the mean and variance of Shapley values. Our work enhances the utility of explainable AI, as demonstrated by empirical studies, by providing more scalable, axiomatically sound, and uncertainty-aware explanations for predictions generated by structured probabilistic models.
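The quadratic-time claim rests on a standard identity: for a game written in the Möbius basis, feature i's Shapley value is the sum of m(S)/|S| over all components S containing i. For a decomposition truncated at pairwise interactions this reduces to the O(d²) loop below — a deterministic sketch of that identity only; the paper additionally propagates the variances for the stochastic case.

```python
import numpy as np

def shapley_from_moebius(main: np.ndarray, pair: np.ndarray) -> np.ndarray:
    """Exact Shapley attributions for a FANOVA-style decomposition
    f(x) = f0 + sum_i f_i(x) + sum_{i<j} f_ij(x): each feature receives its
    main effect plus half of every pairwise interaction it takes part in."""
    d = main.shape[0]
    phi = main.copy()
    for i in range(d):                 # O(d^2) -- quadratic in the features
        for j in range(d):
            if i != j:
                phi[i] += 0.5 * pair[min(i, j), max(i, j)]
    return phi

main_effects = np.array([0.8, -0.2, 0.1])            # f_i(x) at the query point
pairwise = np.zeros((3, 3))
pairwise[0, 1], pairwise[1, 2] = 0.4, -0.6           # f_ij(x), upper triangle
print(shapley_from_moebius(main_effects, pairwise))  # [ 1.0 -0.3 -0.2]
```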
[281] On the notion of missingness for path attribution explainability methods in medical settings: Guiding the selection of medically meaningful baselines
Alexander Geiger, Lars Wagner, Daniel Rueckert, Dirk Wilhelm, Alissa Jell
Main category: cs.LG
TL;DR: Proposes counterfactual-guided baseline selection for medical AI explainability, replacing standard baselines with clinically normal counterfactuals to improve attribution faithfulness.
Details
Motivation: Standard baseline choices like all-zero inputs are semantically meaningless in medical contexts where missingness itself can be informative, and existing methods lack principled dynamic baseline selection.
Method: Uses a Variational Autoencoder to generate counterfactual baselines representing clinically normal but input-close alternatives, though the approach is generative-model-agnostic.
Result: Evaluation on three medical datasets shows counterfactual baselines yield more faithful and medically relevant attributions compared to standard baseline choices.
Conclusion: Counterfactual-guided baseline selection provides a more accurate representation of meaningful feature absence in medical data, improving explainability for clinical trust and transparency.
Abstract: The explainability of deep learning models remains a significant challenge, particularly in the medical domain where interpretable outputs are critical for clinical trust and transparency. Path attribution methods such as Integrated Gradients rely on a baseline input representing the absence of relevant features (“missingness”). Commonly used baselines, such as all-zero inputs, are often semantically meaningless, especially in medical contexts where missingness can itself be informative. While alternative baseline choices have been explored, existing methods lack a principled approach to dynamically select baselines tailored to each input. In this work, we examine the notion of missingness in the medical setting, analyze its implications for baseline selection, and introduce a counterfactual-guided approach to address the limitations of conventional baselines. We argue that a clinically normal but input-close counterfactual represents a more accurate representation of a meaningful absence of features in medical data. To implement this, we use a Variational Autoencoder to generate counterfactual baselines, though our concept is generative-model-agnostic and can be applied with any suitable counterfactual method. We evaluate the approach on three distinct medical data sets and empirically demonstrate that counterfactual baselines yield more faithful and medically relevant attributions compared to standard baseline choices.
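A minimal sketch of Integrated Gradients with a swapped-in baseline; the random perturbation standing in for a VAE-generated "clinically normal" counterfactual is purely illustrative.

```python
import torch

def integrated_gradients(model, x, baseline, steps: int = 64):
    """Integrated Gradients where `baseline` is a counterfactual
    ('normal but close to x') instead of an all-zero input."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)    # straight line baseline -> x
    path.requires_grad_(True)
    model(path).sum().backward()
    avg_grad = path.grad.mean(dim=0)             # Riemann approx. of the integral
    return (x - baseline) * avg_grad             # per-feature attribution

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 1))
x = torch.randn(4)
baseline = x + 0.1 * torch.randn(4)   # stand-in for a VAE-generated counterfactual
attr = integrated_gradients(model, x, baseline)
```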
[282] Semantic Energy: Detecting LLM Hallucination Beyond Entropy
Huan Ma, Jiadong Pan, Jing Liu, Yan Chen, Joey Tianyi Zhou, Guangyu Wang, Qinghua Hu, Hua Wu, Changqing Zhang, Haifeng Wang
Main category: cs.LG
TL;DR: Semantic Energy is a new uncertainty estimation framework that uses logits from LLM’s penultimate layer with Boltzmann-inspired energy distribution to better detect hallucinations compared to semantic entropy.
Details
Motivation: LLMs are prone to hallucinations that produce fluent but incorrect responses, leading to erroneous decisions. Existing uncertainty estimation methods like semantic entropy rely on post-softmax probabilities and fail to capture the model's inherent uncertainty effectively.
Method: Proposes Semantic Energy framework that operates directly on logits of penultimate layer, combining semantic clustering with Boltzmann-inspired energy distribution to better capture uncertainty.
Result: Experiments across multiple benchmarks show Semantic Energy significantly improves hallucination detection and uncertainty estimation compared to semantic entropy.
Conclusion: Semantic Energy provides more reliable uncertainty signals for downstream applications like hallucination detection by better leveraging LLMs’ inherent confidence through logit-based energy distribution.
Abstract: Large Language Models (LLMs) are being increasingly deployed in real-world applications, but they remain susceptible to hallucinations, which produce fluent yet incorrect responses and lead to erroneous decision-making. Uncertainty estimation is a feasible approach to detect such hallucinations. For example, semantic entropy estimates uncertainty by considering the semantic diversity across multiple sampled responses, thus identifying hallucinations. However, semantic entropy relies on post-softmax probabilities and fails to capture the model’s inherent uncertainty, causing it to be ineffective in certain scenarios. To address this issue, we introduce Semantic Energy, a novel uncertainty estimation framework that leverages the inherent confidence of LLMs by operating directly on logits of penultimate layer. By combining semantic clustering with a Boltzmann-inspired energy distribution, our method better captures uncertainty in cases where semantic entropy fails. Experiments across multiple benchmarks show that Semantic Energy significantly improves hallucination detection and uncertainty estimation, offering more reliable signals for downstream applications such as hallucination detection.
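The energy itself is one line on the logits (the familiar Boltzmann/logsumexp form from energy-based scoring); pooling it per response and averaging within semantic clusters, as sketched below, is a simplified reading of the method rather than the authors' exact pipeline.

```python
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Boltzmann-inspired energy computed directly from (pre-softmax) logits;
    lower energy ~ higher model confidence. Unlike entropy, this keeps the
    logit magnitudes that softmax normalizes away."""
    return -T * torch.logsumexp(logits / T, dim=-1)

# Toy example: per-token logits for several sampled responses.
responses_logits = torch.randn(5, 12, 32000)       # (responses, tokens, vocab)
per_token_energy = energy_score(responses_logits)  # (5, 12)
per_response = per_token_energy.mean(dim=-1)       # (5,)

# Simplified cluster-level uncertainty: average energy within semantic clusters.
clusters = torch.tensor([0, 0, 1, 1, 1])           # from an NLI-style clusterer
cluster_energy = torch.stack([per_response[clusters == c].mean()
                              for c in clusters.unique()])
```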
[283] Beyond ReLU: Chebyshev-DQN for Enhanced Deep Q-Networks
Saman Yazdannik, Morteza Tayefi, Shamim Sanisales
Main category: cs.LG
TL;DR: Chebyshev-DQN (Ch-DQN) integrates Chebyshev polynomials into DQN framework, achieving 39% better performance than standard DQN on CartPole-v1 with optimal polynomial degree (N=4), though higher degrees (N=8) harm learning.
Details
Motivation: Standard neural networks struggle to approximate complex value functions in reinforcement learning. Chebyshev polynomials offer superior function approximation properties that could improve DQN performance.
Method: Proposed Ch-DQN architecture that integrates Chebyshev polynomial basis into DQN framework. Evaluated on CartPole-v1 benchmark against standard DQN with comparable parameter count, testing different polynomial degrees (N=4 and N=8).
Result: Ch-DQN with moderate polynomial degree (N=4) achieved approximately 39% better asymptotic performance than standard DQN baseline. However, higher degree (N=8) was detrimental to learning performance.
Conclusion: Orthogonal polynomial bases like Chebyshev polynomials show promise for deep reinforcement learning, but polynomial degree is a critical hyperparameter requiring careful tuning to balance model complexity and learning effectiveness.
Abstract: The performance of Deep Q-Networks (DQN) is critically dependent on the ability of its underlying neural network to accurately approximate the action-value function. Standard function approximators, such as multi-layer perceptrons, may struggle to efficiently represent the complex value landscapes inherent in many reinforcement learning problems. This paper introduces a novel architecture, the Chebyshev-DQN (Ch-DQN), which integrates a Chebyshev polynomial basis into the DQN framework to create a more effective feature representation. By leveraging the powerful function approximation properties of Chebyshev polynomials, we hypothesize that the Ch-DQN can learn more efficiently and achieve higher performance. We evaluate our proposed model on the CartPole-v1 benchmark and compare it against a standard DQN with a comparable number of parameters. Our results demonstrate that the Ch-DQN with a moderate polynomial degree (N=4) achieves significantly better asymptotic performance, outperforming the baseline by approximately 39%. However, we also find that the choice of polynomial degree is a critical hyperparameter, as a high degree (N=8) can be detrimental to learning. This work validates the potential of using orthogonal polynomial bases in deep reinforcement learning while also highlighting the trade-offs involved in model complexity.
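The basis expansion is cheap to implement with the three-term recurrence T_{n+1}(x) = 2x·T_n(x) - T_{n-1}(x); the tanh squashing and layer sizes below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ChebyshevLayer(nn.Module):
    """Expands each (squashed) input dimension in the Chebyshev basis
    T_0..T_N using the recurrence T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)."""
    def __init__(self, degree: int = 4):
        super().__init__()
        self.degree = degree

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.tanh(x)                      # Chebyshev basis needs x in [-1, 1]
        feats = [torch.ones_like(x), x]        # T_0 = 1, T_1 = x
        for _ in range(2, self.degree + 1):
            feats.append(2 * x * feats[-1] - feats[-2])
        return torch.cat(feats, dim=-1)

degree = 4                                     # N=4 performed best in the paper
q_net = nn.Sequential(ChebyshevLayer(degree),
                      nn.Linear(4 * (degree + 1), 64), nn.ReLU(),
                      nn.Linear(64, 2))        # 2 actions in CartPole-v1
q_values = q_net(torch.randn(1, 4))            # CartPole observation has 4 dims
```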
[284] Artificial Intelligence-Based Multiscale Temporal Modeling for Anomaly Detection in Cloud Services
Lian Lian, Yilin Li, Song Han, Renzi Meng, Sibo Wang, Ming Wang
Main category: cs.LG
TL;DR: Transformer-based anomaly detection method with multiscale feature perception for cloud services, achieving superior performance in precision, recall, AUC, and F1-score compared to baseline models.
Details
Motivation: Address limitations in temporal modeling and scale-aware feature representation for anomaly detection in cloud service environments, where traditional methods struggle with complex monitoring data patterns.
Method: Uses improved Transformer module with self-attention for temporal modeling, multiscale feature construction path with downsampling and parallel encoding, and attention-weighted fusion module to dynamically adjust scale contributions. Processes standardized multidimensional time series (CPU, memory, task scheduling) with positional encoding.
Result: Outperforms mainstream baseline models in key metrics (precision, recall, AUC, F1-score) and maintains strong stability and detection performance under various perturbation conditions including different optimizers, learning rates, anomaly ratios, and noise levels.
Conclusion: The proposed method demonstrates superior capability in complex cloud environments, providing robust anomaly detection with enhanced temporal modeling and multiscale feature representation.
Abstract: This study proposes an anomaly detection method based on the Transformer architecture with integrated multiscale feature perception, aiming to address the limitations of temporal modeling and scale-aware feature representation in cloud service environments. The method first employs an improved Transformer module to perform temporal modeling on high-dimensional monitoring data, using a self-attention mechanism to capture long-range dependencies and contextual semantics. Then, a multiscale feature construction path is introduced to extract temporal features at different granularities through downsampling and parallel encoding. An attention-weighted fusion module is designed to dynamically adjust the contribution of each scale to the final decision, enhancing the model’s robustness in anomaly pattern modeling. In the input modeling stage, standardized multidimensional time series are constructed, covering core signals such as CPU utilization, memory usage, and task scheduling states, while positional encoding is used to strengthen the model’s temporal awareness. A systematic experimental setup is designed to evaluate performance, including comparative experiments and hyperparameter sensitivity analysis, focusing on the impact of optimizers, learning rates, anomaly ratios, and noise levels. Experimental results show that the proposed method outperforms mainstream baseline models in key metrics, including precision, recall, AUC, and F1-score, and maintains strong stability and detection performance under various perturbation conditions, demonstrating its superior capability in complex cloud environments.
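A rough sketch of the multiscale path with attention-weighted fusion, assuming inputs already projected to the model dimension; the scale set, average-pool downsampling, and layer sizes are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Encode a metrics sequence at several temporal scales and fuse the
    per-scale summaries with learned attention weights."""
    def __init__(self, d_model=64, n_heads=4, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.encoders = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in scales
        )
        self.score = nn.Linear(d_model, 1)  # attention weight per scale
        self.head = nn.Linear(d_model, 1)   # anomaly score

    def forward(self, x):                   # x: (batch, time, d_model)
        summaries = []
        for s, enc in zip(self.scales, self.encoders):
            xs = x if s == 1 else nn.functional.avg_pool1d(
                x.transpose(1, 2), kernel_size=s).transpose(1, 2)  # downsample
            summaries.append(enc(xs).mean(dim=1))                  # (batch, d_model)
        h = torch.stack(summaries, dim=1)                          # (batch, n_scales, d_model)
        w = torch.softmax(self.score(h), dim=1)                    # dynamic scale weights
        fused = (w * h).sum(dim=1)
        return torch.sigmoid(self.head(fused))                     # anomaly probability

probs = MultiScaleFusion()(torch.randn(8, 64, 64))  # (8, 1)
```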
[285] Great GATsBi: Hybrid, Multimodal, Trajectory Forecasting for Bicycles using Anticipation Mechanism
Kevin Riehl, Shaimaa K. El-Baklish, Anastasios Kouvelas, Michail A. Makridis
Main category: cs.LG
TL;DR: Great GATsBi is a hybrid multimodal trajectory prediction framework for bicycles that combines physics-based and social-based modeling using graph attention networks to account for bicycles’ dual nature, outperforming state-of-the-art methods.
Details
Motivation: Bicycles have received little attention in trajectory prediction research despite accounting for most traffic accident fatalities, while previous work focused mainly on pedestrians and motorized vehicles.
Method: A hybrid framework incorporating physics-based modeling (inspired by motorized vehicles) and social-based modeling (inspired by pedestrian movements) using graph attention networks with decayed historical and anticipated future trajectory data.
Result: The ensemble of physics models (good for short-term predictions) and social models (good for long-term predictions) exceeds state-of-the-art performance, validated through controlled mass-cycling experiments.
Conclusion: The proposed framework successfully addresses bicycle trajectory prediction by accounting for their dual movement nature, demonstrating superior performance and practical applicability for road safety applications.
Abstract: Accurate prediction of road user movement is increasingly required by many applications ranging from advanced driver assistance systems to autonomous driving, and is especially crucial for road safety. Even though bicycles account for most traffic accident fatalities, they have received little attention, as previous work focused mainly on pedestrians and motorized vehicles. In this work, we present the Great GATsBi, a domain-knowledge-based, hybrid, multimodal trajectory prediction framework for bicycles. The model incorporates both physics-based modeling (inspired by motorized vehicles) and social-based modeling (inspired by pedestrian movements) to explicitly account for the dual nature of bicycle movement. The social interactions are modeled with a graph attention network and include both decayed historical and anticipated future trajectory data of a bicycle’s neighborhood, following recent insights from psychological and social studies. The results indicate that the proposed ensemble of physics models (which perform well in short-term predictions) and social models (which perform well in long-term predictions) exceeds state-of-the-art performance. We also conducted a controlled mass-cycling experiment to demonstrate the framework’s performance when forecasting bicycle trajectories and modeling social interactions with road users.
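The combination of decayed history and anticipated futures inside a graph-attention step can be sketched as follows; the neighbor featurization (decayed mean of past positions plus a constant-velocity extrapolation), the decay constant, and the layer sizes are hypothetical stand-ins for the paper's design.

```python
import torch
import torch.nn as nn

class AnticipatoryGATLayer(nn.Module):
    """One attention step over a bicycle's neighborhood, where each agent
    is summarized by decayed past positions plus an anticipated position."""
    def __init__(self, feat_dim=4, hidden=32):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.attn = nn.Linear(2 * hidden, 1)

    @staticmethod
    def summarize(track, decay=0.8, horizon=1.0):
        # track: (T, 2) past positions; weight recent steps more heavily
        T = track.shape[0]
        w = decay ** torch.arange(T - 1, -1, -1, dtype=track.dtype)
        hist = (w[:, None] * track).sum(0) / w.sum()   # decayed history
        vel = track[-1] - track[-2]                    # finite-difference velocity
        future = track[-1] + horizon * vel             # anticipated position
        return torch.cat([hist, future])               # (4,)

    def forward(self, ego_feat, nbr_feats):
        # ego_feat: (4,), nbr_feats: (N, 4) from `summarize`
        h_e = self.proj(ego_feat)
        h_n = self.proj(nbr_feats)
        pair = torch.cat([h_e.expand_as(h_n), h_n], dim=-1)
        alpha = torch.softmax(self.attn(pair).squeeze(-1), dim=0)  # attention weights
        return h_e + (alpha[:, None] * h_n).sum(0)     # social context vector

layer = AnticipatoryGATLayer()
ego = layer.summarize(torch.randn(10, 2))
nbrs = torch.stack([layer.summarize(torch.randn(10, 2)) for _ in range(3)])
context = layer(ego, nbrs)  # (32,)
```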
[286] Adaptively Robust LLM Inference Optimization under Prediction Uncertainty
Zixi Chen, Yinyu Ye, Zijie Zhou
Main category: cs.LG
TL;DR: Optimizing LLM inference scheduling with ML-based output length prediction to minimize latency and prevent memory overflow, using adaptive algorithms that achieve log-scale competitive ratios.
Details
Motivation: LLM inference is an online multi-task service that consumes significant energy and faces uncertainty in output length prediction, making efficient scheduling crucial for reducing latency and power consumption while handling high request volumes.
Method: Proposed two algorithms: A_max (conservative approach using upper bound predictions) and A_min (adaptive algorithm using lower bound predictions with dynamic refinement during inference). Uses ML to predict output length intervals and proves A_min achieves log-scale competitive ratio.
Result: Numerical simulations show A_min performs nearly as well as hindsight scheduler, demonstrating efficiency and robustness. A_min’s reliance on lower bounds is advantageous since upper bounds are harder to predict accurately.
Conclusion: The adaptive algorithm A_min effectively addresses output length uncertainty in LLM inference scheduling, achieving near-optimal performance while being robust to prediction inaccuracies, making it suitable for practical deployment.
Abstract: We study the problem of optimizing Large Language Model (LLM) inference scheduling to minimize total latency. LLM inference is an online, multi-task, and heavily energy-consuming service process in which a pre-trained LLM processes input requests and generates output tokens sequentially. It is therefore vital to improve scheduling efficiency and reduce power consumption while large volumes of prompt requests arrive. A key challenge in LLM inference scheduling is that while the prompt length is known upon arrival, the output length, which critically impacts memory usage and processing time, is unknown. To address this uncertainty, we propose algorithms that leverage machine learning to predict output lengths, assuming the prediction provides an interval classification (min-max range) for each request. We first design a conservative algorithm, $\mathcal{A}_{\max}$, which schedules requests based on the upper bound of predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly due to potential overestimation. To overcome this limitation, we propose $\mathcal{A}_{\min}$, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inference. We prove that $\mathcal{A}_{\min}$ achieves a log-scale competitive ratio. Through numerical simulations, we demonstrate that $\mathcal{A}_{\min}$ often performs nearly as well as the hindsight scheduler, highlighting both its efficiency and robustness in practical scenarios. Moreover, $\mathcal{A}_{\min}$ relies solely on the lower bound of the prediction interval, an advantageous design choice since upper bounds on output length are typically more challenging to predict accurately.
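The adaptive idea behind $\mathcal{A}_{\min}$ admits a compact simulation-style sketch: admit requests against a memory budget using predicted lower bounds, and enlarge a request's estimate whenever decoding outruns it. The shortest-first ordering, the doubling rule, and the flat memory model below are simplifications, not the paper's algorithm or proof setup.

```python
def a_min_schedule(requests, memory_budget):
    """requests: list of dicts with 'lo' (predicted lower bound on output
    length) and 'true_len' (actual length, hidden from the scheduler).
    Memory is modeled as one unit per estimated output token."""
    pending = sorted(requests, key=lambda r: r["lo"])     # shortest-first
    running, completed, used = [], [], 0
    while pending or running:
        # Admit requests while their lower-bound estimates fit the budget.
        while pending and used + pending[0]["lo"] <= memory_budget:
            r = pending.pop(0)
            r["est"], r["done"] = max(r["lo"], 1), 0
            running.append(r)
            used += r["est"]
        # One decoding step for each running request.
        for r in list(running):
            r["done"] += 1
            if r["done"] >= r["true_len"]:                # finished: free memory
                running.remove(r)
                used -= r["est"]
                completed.append(r)
            elif r["done"] >= r["est"]:                   # estimate too small:
                used += r["est"]                          # refine by doubling
                r["est"] *= 2
        if not running and pending:                       # next request cannot fit
            break
    return completed
```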
[287] FedEve: On Bridging the Client Drift and Period Drift for Cross-device Federated Learning
Tao Shen, Zexi Li, Didi Zhu, Ziyu Zhao, Chao Wu, Fei Wu
Main category: cs.LG
TL;DR: The paper identifies period drift as a new challenge in cross-device federated learning caused by partial client participation, proposes FedEve framework to mitigate both period drift and client drift through compensation, and demonstrates improved performance on non-iid data.
Details
Motivation: Data heterogeneity in federated learning causes performance degradation. While client drift from multiple local updates is known, period drift from partial client participation in cross-device FL has not been well studied and can be more harmful as it shifts optimization objectives each round.
Method: Proposes a predict-observe framework and instantiated method called FedEve that allows period drift and client drift to compensate each other, reducing the variance of model updates through theoretical optimization.
Result: Extensive experiments show that FedEve outperforms alternative methods on non-iid data in cross-device federated learning settings, demonstrating effectiveness in mitigating both types of drift.
Conclusion: Period drift is a significant challenge in cross-device FL that interacts with client drift, and the proposed FedEve framework successfully mitigates these issues through drift compensation, leading to improved convergence and performance on heterogeneous data.
Abstract: Federated learning (FL) is a machine learning paradigm that allows multiple clients to collaboratively train a shared model without exposing their private data. Data heterogeneity is a fundamental challenge in FL, which can result in poor convergence and performance degradation. Client drift, resulting from the multiple local updates in FedAvg, has been recognized as one of the factors contributing to this issue. However, in cross-device FL, a different form of drift arises due to partial client participation, and it has not been well studied. This drift, which we refer to as period drift, occurs because the clients participating at each communication round may exhibit data distributions that deviate from that of all clients. It can be more harmful than client drift since the optimization objective shifts with every round. In this paper, we investigate the interaction between period drift and client drift, finding that period drift can have a particularly detrimental effect on cross-device FL as the degree of data heterogeneity increases. To tackle these issues, we propose a predict-observe framework and present an instantiated method, FedEve, where these two types of drift can compensate each other to mitigate their overall impact. We provide theoretical evidence that our approach can reduce the variance of model updates. Extensive experiments demonstrate that our method outperforms alternatives on non-iid data in cross-device settings.
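One way to picture a predict-observe round, loosely in the spirit of FedEve: the server predicts this round's update from a running average of past updates, observes the (period-drift-prone) average of the participating clients' updates, and blends the two to reduce variance. The EMA predictor and fixed gain are our assumptions; the paper derives its combination differently.

```python
import numpy as np

def predict_observe_round(theta, predictor, client_updates, gain=0.5, beta=0.9):
    """theta: global model; predictor: running estimate of the per-round
    update (the 'prediction'); client_updates: deltas from this round's
    participants (the noisy 'observation')."""
    observed = np.mean(client_updates, axis=0)           # FedAvg-style observation
    fused = gain * observed + (1.0 - gain) * predictor   # blend to cut variance
    predictor = beta * predictor + (1.0 - beta) * fused  # refresh the prediction
    return theta + fused, predictor
```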
[288] Cooperative SGD with Dynamic Mixing Matrices
Soumya Sarkar, Shweta Jain
Main category: cs.LG
TL;DR: A unified framework for distributed SGD algorithms with dynamic topologies and non-uniform aggregation strategies that provides improved convergence guarantees.
Details
Motivation: Existing distributed SGD approaches assume fixed topologies and uniform node contributions, which are suboptimal. Dynamic topologies with non-uniform aggregation can significantly improve performance.
Method: Develops a unified framework covering several Local-Update SGD-based distributed algorithms with dynamic topologies and client selection strategies.
Result: The framework provides improved or matching theoretical convergence guarantees compared to existing work in distributed SGD settings.
Conclusion: Dynamic topology approaches with non-uniform aggregation strategies outperform traditional fixed-topology distributed SGD methods and offer better theoretical convergence properties.
Abstract: One of the most common methods to train machine learning models today is stochastic gradient descent (SGD). In a distributed setting, SGD-based algorithms have been shown to converge theoretically under specific circumstances. A substantial number of works in the distributed SGD setting assume a fixed topology for the edge devices. These papers also assume that the contribution of nodes to the global model is uniform. However, experiments have shown that such assumptions are suboptimal, and that a non-uniform aggregation strategy coupled with a dynamically shifting topology and client selection can significantly improve the performance of such models. This paper details a unified framework that covers several Local-Update SGD-based distributed algorithms with dynamic topologies and provides improved or matching theoretical guarantees on convergence compared to existing work.
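The framework's update template can be sketched in a few lines: local SGD steps followed by mixing with a round-specific matrix $W_t$ whose entries encode the current topology and node weights. A sketch under those assumptions, not the paper's exact algorithm:

```python
import numpy as np

def cooperative_sgd_round(X, grads, lr, W_t):
    """X: (n_nodes, dim), one local model per row; grads: matching local
    gradients; W_t: (n_nodes, n_nodes) row-stochastic mixing matrix that
    may change every round (dynamic topology, non-uniform weights)."""
    X = X - lr * grads  # local SGD step(s), collapsed to one here for brevity
    return W_t @ X      # mix: x_i <- sum_j W_t[i, j] * x_j

# e.g. one round over a star topology (rows sum to 1), next round may differ
W_star = np.array([[0.4, 0.3, 0.3],
                   [0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.5]])
```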
[289] A Comprehensive Evaluation of the Sensitivity of Density-Ratio Estimation Based Fairness Measurement in Regression
Abdalwahab Almajed, Maryam Tabar, Peyman Najafirad
Main category: cs.LG
TL;DR: Study examines how different density-ratio estimation methods affect fairness measurement in ML regression, finding significant inconsistencies that question reliability of current approaches.
Details
Motivation: Prior research formulated fairness measurement in regression as density-ratio estimation problem but didn't study sensitivity to choice of estimation algorithm, creating a research gap.
Method: Developed multiple fairness measurement methods with different density-ratio estimation cores and experimentally compared their effects on fairness outcomes.
Result: Choice of density-ratio estimation core significantly affects fairness measurement results and can generate inconsistent conclusions about relative fairness of algorithms.
Conclusion: Major issues exist with density-ratio estimation based fairness measurement in regression, indicating need for further research to enhance reliability of these methods.
Abstract: The prevalence of algorithmic bias in Machine Learning (ML)-driven approaches has inspired growing research on measuring and mitigating bias in the ML domain. Accordingly, prior research has studied how to measure fairness in regression, which is a complex problem. In particular, recent research proposed to formulate it as a density-ratio estimation problem and relied on a Logistic Regression-driven probabilistic classifier-based approach to solve it. However, there are several other methods to estimate a density ratio, and to the best of our knowledge, prior work did not study the sensitivity of such fairness measurement methods to the choice of underlying density-ratio estimation algorithm. To fill this gap, this paper develops a set of fairness measurement methods with various density-ratio estimation cores and thoroughly investigates how different cores affect the achieved level of fairness. Our experimental results show that the choice of density-ratio estimation core can significantly affect the outcome of the fairness measurement method and even generate inconsistent results with respect to the relative fairness of various algorithms. These observations suggest major issues with density-ratio-estimation-based fairness measurement in regression and a need for further research to enhance its reliability.
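The Logistic Regression-driven core the paper starts from has a standard form: train a probabilistic classifier to separate samples from the two distributions, then convert its probabilities into a density ratio. A minimal sketch (swapping in a different core, e.g. KLIEP or uLSIF, is precisely the sensitivity the paper studies):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def classifier_density_ratio(X_num, X_den):
    """Estimate r(x) = p_num(x) / p_den(x) at the denominator samples via
    a probabilistic classifier separating the two sample sets."""
    X = np.vstack([X_num, X_den])
    y = np.r_[np.ones(len(X_num)), np.zeros(len(X_den))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(X_den)[:, 1]
    prior = len(X_num) / len(X_den)   # correct for class-size imbalance
    return (p / (1.0 - p)) / prior    # odds ratio rescaled by the prior
```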
[290] DualNILM: Energy Injection Identification Enabled Disaggregation with Deep Multi-Task Learning
Xudong Wang, Guoming Tang, Junyu Xue, Srinivasan Keshav, Tongxin Li, Chris Ding
Main category: cs.LG
TL;DR: DualNILM is a Transformer-based multi-task learning framework that simultaneously performs appliance state recognition and energy injection identification to address challenges from behind-the-meter energy sources in NILM systems.
Details
Motivation: Conventional NILM methods suffer performance degradation due to behind-the-meter energy sources (solar panels, battery storage) that obscure appliance power signatures by injecting energy into the system.
Method: Deep multi-task learning framework combining sequence-to-point and sequence-to-sequence strategies within a Transformer architecture to capture multi-scale temporal dependencies in aggregate power consumption patterns.
Result: Extensive experiments on self-collected and synthesized datasets show DualNILM maintains excellent performance for both appliance state recognition and energy injection identification, significantly outperforming conventional methods.
Conclusion: DualNILM effectively addresses the challenge of behind-the-meter energy sources in NILM by simultaneously learning both appliance states and energy injection patterns through a multi-task Transformer architecture.
Abstract: Non-Intrusive Load Monitoring (NILM) offers a cost-effective method to obtain fine-grained appliance-level energy consumption in smart homes and building applications. However, the increasing adoption of behind-the-meter energy sources, such as solar panels and battery storage, poses new challenges for conventional NILM methods that rely solely on at-the-meter data. The injected energy from behind-the-meter sources can obscure the power signatures of individual appliances, leading to a significant decline in NILM performance. To address this challenge, we present DualNILM, a deep multi-task learning framework designed for the dual tasks of appliance state recognition and injected energy identification in NILM. By integrating sequence-to-point and sequence-to-sequence strategies within a Transformer-based architecture, DualNILM can effectively capture multi-scale temporal dependencies in the aggregate power consumption patterns, allowing for accurate appliance state recognition and energy injection identification. We validate DualNILM using both self-collected and synthesized open NILM datasets that include both appliance-level energy consumption and energy injection. Extensive experimental results demonstrate that DualNILM maintains excellent performance on the dual tasks in NILM, substantially outperforming conventional methods.
[291] Measuring IIA Violations in Similarity Choices with Bayesian Models
Hugo Sales Corrêa, Suryanarayana Sankagiri, Daniel Ratton Figueiredo, Matthias Grossglauser
Main category: cs.LG
TL;DR: The paper develops statistical tests for Independence of Irrelevant Alternatives (IIA) violations in similarity choice data, finding significant violations across datasets and homogeneous population effects.
Details
Motivation: Similarity choice data (e.g., information retrieval, embedding learning) assumes IIA, but violations have been underexplored due to target-dependent complexity. The paper aims to test IIA violations in this context.
Method: Proposes two statistical tests: classical goodness-of-fit and Bayesian Posterior Predictive Checks (PPC) to quantify IIA violation degree. Uses curated datasets with designed and random choice sets, and develops PPC test for population homogeneity.
Result: Significant IIA violations found in both datasets with comparable degree. Population homogeneity confirmed, suggesting context effects (choice set interactions) drive violations rather than individual differences.
Conclusion: Results demonstrate need for new similarity choice models that account for context effects, as IIA violations are systematic and driven by choice set interactions rather than population heterogeneity.
Abstract: Similarity choice data occur when humans make choices among alternatives based on their similarity to a target, e.g., in the context of information retrieval and in embedding learning settings. Classical metric-based models of similarity choice assume independence of irrelevant alternatives (IIA), a property that allows for a simpler formulation. While IIA violations have been detected in many discrete choice settings, the similarity choice setting has received scant attention. This is because the target-dependent nature of the choice complicates IIA testing. We propose two statistical methods to test for IIA: a classical goodness-of-fit test and a Bayesian counterpart based on the framework of Posterior Predictive Checks (PPC). This Bayesian approach, our main technical contribution, quantifies the degree of IIA violation beyond its mere significance. We curate two datasets: one with choice sets designed to elicit IIA violations, and another with randomly generated choice sets from the same item universe. Our tests confirmed significant IIA violations on both datasets, and notably, we find a comparable degree of violation between them. Further, we devise a new PPC test for population homogeneity. Results show that the population is indeed homogeneous, suggesting that the IIA violations are driven by context effects, specifically interactions within the choice sets. These results highlight the need for new similarity choice models that account for such context effects.
[292] ELATE: Evolutionary Language model for Automated Time-series Engineering
Andrew Murray, Danial Dervovic, Michael Cashmore
Main category: cs.LG
TL;DR: ELATE uses evolutionary framework with language model to automate time-series feature engineering, improving forecasting accuracy by 8.4%
Details
Motivation: Manual feature engineering is time-intensive, and existing automation methods are computationally costly and lack domain insights.
Method: Evolutionary framework with language model that uses time-series statistical measures and feature importance metrics to guide and prune features, while proposing contextually relevant transformations
Result: Improves forecasting accuracy by an average of 8.4% across various domains
Conclusion: ELATE effectively automates time-series feature engineering with significant performance improvements
Abstract: Time-series prediction involves forecasting future values using machine learning models. Feature engineering, whereby existing features are transformed to make new ones, is critical for enhancing model performance, but is often manual and time-intensive. Existing automation attempts rely on exhaustive enumeration, which can be computationally costly and lacks domain-specific insights. We introduce ELATE (Evolutionary Language model for Automated Time-series Engineering), which leverages a language model within an evolutionary framework to automate feature engineering for time-series data. ELATE employs time-series statistical measures and feature importance metrics to guide and prune features, while the language model proposes new, contextually relevant feature transformations. Our experiments demonstrate that ELATE improves forecasting accuracy by an average of 8.4% across various domains.
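The evolutionary loop can be sketched abstractly as score, prune, propose, accept; the interfaces below (population dict, fitness, llm_propose) are hypothetical stand-ins, not ELATE's actual API.

```python
def evolutionary_feature_generation(population, fitness, llm_propose, keep=10):
    """One ELATE-style generation over a population of feature transforms.

    population:  dict name -> callable transform over a time series
    fitness:     callable scoring a transform by forecasting gain
    llm_propose: stand-in for the language model, returning new candidate
                 transforms given the current survivors
    """
    scored = sorted(population.items(), key=lambda kv: fitness(kv[1]), reverse=True)
    survivors = dict(scored[:keep])                       # prune weak features
    threshold = min(fitness(t) for t in survivors.values())
    for name, transform in llm_propose(survivors).items():
        if fitness(transform) > threshold:                # keep helpful proposals
            survivors[name] = transform
    return survivors
```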
[293] A Fuzzy-Enhanced Explainable AI Framework for Flight Continuous Descent Operations Classification
Amin Noroozi, Sandaruwan K. Sethunge, Elham Norouzi, Phat T. Phan, Kavinda U. Waduge, Md. Arafatur Rahman
Main category: cs.LG
TL;DR: Proposes FEXAI framework combining fuzzy logic with ML and SHAP analysis to identify key factors affecting Continuous Descent Operations performance and provide explainable rules for aviation decision support.
Details
Motivation: Limited research on factors influencing CDO performance despite its operational and environmental benefits, and lack of transparency in existing trajectory optimization methods which is critical for aviation safety and stakeholder trust.
Method: Collected comprehensive dataset from 1,094 flights with 29 features (11 operational, 18 weather-related) using ADS-B data. Applied ML models and SHAP analysis to classify CDO adherence and rank feature importance, then built fuzzy rule-based classifier using top 3 features.
Result: All models achieved >90% classification accuracy. Identified average descent rate within arrival route, number of descent segments, and average change in directional heading during descent as strongest predictors of CDO performance. FEXAI provided human-readable interpretable rules.
Conclusion: FEXAI presents a novel pathway for operational decision support that could be integrated into aviation tools for real-time advisories to maintain CDO adherence under varying conditions, combining high accuracy with explainability.
Abstract: Continuous Descent Operations (CDO) involve smooth, idle-thrust descents that avoid level-offs, reducing fuel burn, emissions, and noise while improving efficiency and passenger comfort. Despite its operational and environmental benefits, limited research has systematically examined the factors influencing CDO performance. Moreover, many existing methods in related areas, such as trajectory optimization, lack the transparency required in aviation, where explainability is critical for safety and stakeholder trust. This study addresses these gaps by proposing a Fuzzy-Enhanced Explainable AI (FEXAI) framework that integrates fuzzy logic with machine learning and SHapley Additive exPlanations (SHAP) analysis. For this purpose, a comprehensive dataset of 29 features, including 11 operational and 18 weather-related features, was collected from 1,094 flights using Automatic Dependent Surveillance-Broadcast (ADS-B) data. Machine learning models and SHAP were then applied to classify flights’ CDO adherence levels and rank features by importance. The three most influential features, as identified by SHAP scores, were then used to construct a fuzzy rule-based classifier, enabling the extraction of interpretable fuzzy rules. All models achieved classification accuracies above 90%, with FEXAI providing meaningful, human-readable rules for operational users. Results indicated that the average descent rate within the arrival route, the number of descent segments, and the average change in directional heading during descent were the strongest predictors of CDO performance. The FEXAI method proposed in this study presents a novel pathway for operational decision support and could be integrated into aviation tools to enable real-time advisories that maintain CDO adherence under varying operational conditions.
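A toy illustration of a fuzzy rule over the three top-ranked features; the membership breakpoints, units, and the single rule are placeholders, not the paper's fitted rule base.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular fuzzy membership on [a, c] peaking at b."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-9),
                                 (c - x) / (c - b + 1e-9)), 0.0)

def cdo_adherence(descent_rate, n_segments, heading_change):
    """Mamdani-style rule over the three SHAP-top features."""
    low_rate = tri(descent_rate, 0, 800, 1600)   # ft/min, hypothetical breakpoints
    few_segs = tri(n_segments, 0, 1, 3)
    small_hdg = tri(heading_change, 0, 5, 15)    # degrees, hypothetical breakpoints
    # Rule: IF rate low AND few segments AND small heading change THEN high adherence
    high = min(low_rate, few_segs, small_hdg)    # fuzzy AND = min
    return "high" if high > 0.5 else "low"

print(cdo_adherence(750, 1, 4))  # -> "high"
```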
[294] Clinical semantics for lung cancer prediction
Luis H. John, Jan A. Kors, Jenna M. Reps, Peter R. Rijnbeek, Egill A. Fridgeirsson
Main category: cs.LG
TL;DR: Using hyperbolic Poincaré embeddings of SNOMED medical hierarchy improves lung cancer prediction models by preserving semantic relationships between clinical concepts.
Details
Motivation: Existing clinical prediction models ignore semantic relationships between medical concepts. This study aims to integrate domain-specific semantic information from SNOMED taxonomy to improve prediction accuracy.
Method: Mapped SNOMED medical term hierarchy into hyperbolic space using Poincaré embeddings via Riemannian stochastic gradient descent. Incorporated these embeddings into ResNet and Transformer deep learning architectures for lung cancer onset prediction.
Result: Poincaré embeddings provided modest but consistent improvements in discrimination performance. ResNet models with 10-dimensional embeddings showed enhanced calibration, while Transformers maintained stable calibration across configurations.
Conclusion: Embedding clinical knowledge graphs into hyperbolic space preserves hierarchical structure and improves prediction performance, demonstrating a feasible method to combine data-driven feature extraction with clinical knowledge.
Abstract: Background: Existing clinical prediction models often represent patient data using features that ignore the semantic relationships between clinical concepts. This study integrates domain-specific semantic information by mapping the SNOMED medical term hierarchy into a low-dimensional hyperbolic space using Poincaré embeddings, with the aim of improving lung cancer onset prediction. Methods: Using a retrospective cohort from the Optum EHR dataset, we derived a clinical knowledge graph from the SNOMED taxonomy and generated Poincaré embeddings via Riemannian stochastic gradient descent. These embeddings were then incorporated into two deep learning architectures, a ResNet and a Transformer model. Models were evaluated for discrimination (area under the receiver operating characteristic curve) and calibration (average absolute difference between observed and predicted probabilities) performance. Results: Incorporating pre-trained Poincaré embeddings resulted in modest and consistent improvements in discrimination performance compared to baseline models using randomly initialized Euclidean embeddings. ResNet models, particularly those using a 10-dimensional Poincaré embedding, showed enhanced calibration, whereas Transformer models maintained stable calibration across configurations. Discussion: Embedding clinical knowledge graphs into hyperbolic space and integrating these representations into deep learning models can improve lung cancer onset prediction by preserving the hierarchical structure of clinical terminologies used for prediction. This approach demonstrates a feasible method for combining data-driven feature extraction with established clinical knowledge.
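The hyperbolic geometry enters through the Poincaré-ball distance, which grows rapidly near the boundary and thus naturally accommodates trees (roots near the origin, leaves near the boundary):

```python
import numpy as np

def poincare_distance(u, v):
    """Distance in the Poincaré ball (requires ||u||, ||v|| < 1):
    d(u, v) = arcosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))."""
    uu, vv = np.dot(u, u), np.dot(v, v)
    duv = np.dot(u - v, u - v)
    return np.arccosh(1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv)))

# e.g. a parent concept near the origin vs. a leaf near the boundary
parent, leaf = np.array([0.05, 0.0]), np.array([0.85, 0.1])
print(poincare_distance(parent, leaf))
```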
[295] Understanding Data Influence with Differential Approximation
Haoru Tan, Sitong Wu, Xiuzhe Wu, Wang Wang, Bo Zhao, Zeke Xie, Gui-Song Xia, Xiaojuan Qi
Main category: cs.LG
TL;DR: Diff-In is a novel influence estimation method that approximates sample influence by accumulating differences across training steps, using second-order approximations without requiring convex loss functions, achieving high accuracy with computational efficiency comparable to first-order methods.
Details
Motivation: Existing data analysis tools for AI model training often have accuracy limitations, particularly assuming convex loss functions which doesn't reflect real neural network behavior, making current methods challenging to implement effectively.
Method: Formulates sample-wise influence as cumulative sum of changes across successive training iterations, employs second-order approximations to compute difference terms accurately, and efficiently computes Hessian-gradient products using finite differences of first-order gradients.
Result: Theoretical analysis shows significantly lower approximation error compared to existing influence estimators. Extensive experiments demonstrate superior performance across multiple benchmark datasets in data cleaning, deletion, and coreset selection tasks. Scales to millions of data points in large-scale vision-language pre-training.
Conclusion: Diff-In provides an accurate and scalable influence estimation method that overcomes convexity assumptions of existing approaches, maintaining computational efficiency while achieving better performance across various data-centric applications.
Abstract: Data plays a pivotal role in the groundbreaking advancements in artificial intelligence. The quantitative analysis of data significantly contributes to model training, enhancing both the efficiency and quality of data utilization. However, existing data analysis tools often lag in accuracy. For instance, many of these tools even assume that the loss function of neural networks is convex. These limitations make it challenging to implement current methods effectively. In this paper, we introduce a new formulation to approximate a sample’s influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. Specifically, we formulate the sample-wise influence as the cumulative sum of its changes/differences across successive training iterations. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Despite being a second-order method, Diff-In maintains computational complexity comparable to that of first-order methods and remains scalable. This efficiency is achieved by computing the product of the Hessian and gradient, which can be efficiently approximated using finite differences of first-order gradients. We assess the approximation accuracy of Diff-In both theoretically and empirically. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators. Extensive experiments further confirm its superior performance across multiple benchmark datasets in three data-centric tasks: data cleaning, data deletion, and coreset selection. Notably, our experiments on data pruning for large-scale vision-language pre-training show that Diff-In can scale to millions of data points and outperforms strong baselines.
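The efficiency trick named at the end, Hessian-gradient products via finite differences of first-order gradients, is compact enough to show directly; the sketch below treats the parameters as one flat tensor and is illustrative rather than Diff-In's full pipeline.

```python
import torch

def hvp_finite_diff(loss_fn, params, v, eps=1e-3):
    """Approximate the Hessian-vector product H v with a central finite
    difference of first-order gradients,
        H v ~ (g(theta + eps*v) - g(theta - eps*v)) / (2 * eps),
    so no second-order autodiff is needed."""
    def grad_at(p):
        p = p.detach().requires_grad_(True)
        return torch.autograd.grad(loss_fn(p), p)[0]
    return (grad_at(params + eps * v) - grad_at(params - eps * v)) / (2 * eps)

# e.g. quadratic loss p @ A @ p has Hessian 2A, so H v should be 2 A v
A = torch.diag(torch.tensor([1.0, 3.0]))
loss = lambda p: p @ A @ p
print(hvp_finite_diff(loss, torch.tensor([1.0, 2.0]), torch.tensor([1.0, 0.0])))
```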
[296] Improving Fairness in Graph Neural Networks via Counterfactual Debiasing
Zengyi Wo, Chang Liu, Yumeng Wang, Minglai Shao, Wenjun Wang
Main category: cs.LG
TL;DR: Fair-ICD: A counterfactual data augmentation method for mitigating bias in Graph Neural Networks while maintaining predictive accuracy.
Details
Motivation: GNNs can exhibit bias in predictions based on sensitive attributes like race and gender, which is exacerbated by graph structure and message-passing mechanisms. Existing bias mitigation methods may unintentionally eliminate non-sensitive features, compromising the balance between accuracy and fairness.
Method: Uses counterfactual data augmentation to create diverse neighborhoods before message passing, facilitating unbiased node representations. Then employs an adversarial discriminator to reduce bias in predictions from conventional GNN classifiers.
Result: Experiments on standard datasets with three GNN backbones show Fair-ICD significantly improves fairness metrics while preserving high predictive performance.
Conclusion: Fair-ICD effectively ensures fairness in GNNs under moderate conditions, providing a balanced approach to bias mitigation without sacrificing predictive accuracy.
Abstract: Graph Neural Networks (GNNs) have been successful in modeling graph-structured data. However, similar to other machine learning models, GNNs can exhibit bias in predictions based on attributes like race and gender. Moreover, bias in GNNs can be exacerbated by the graph structure and message-passing mechanisms. Recent cutting-edge methods propose mitigating bias by filtering out sensitive information from input or representations, like edge dropping or feature masking. Yet, we argue that such strategies may unintentionally eliminate non-sensitive features, leading to a compromised balance between predictive accuracy and fairness. To tackle this challenge, we present a novel approach utilizing counterfactual data augmentation for bias mitigation. This method involves creating diverse neighborhoods using counterfactuals before message passing, facilitating unbiased node representations learning from the augmented graph. Subsequently, an adversarial discriminator is employed to diminish bias in predictions by conventional GNN classifiers. Our proposed technique, Fair-ICD, ensures the fairness of GNNs under moderate conditions. Experiments on standard datasets using three GNN backbones demonstrate that Fair-ICD notably enhances fairness metrics while preserving high predictive performance.
[297] AFABench: A Generic Framework for Benchmarking Active Feature Acquisition
Valter Schütz, Han Wu, Reza Rezvan, Linus Aronsson, Morteza Haghir Chehreghani
Main category: cs.LG
TL;DR: AFABench is the first standardized benchmark framework for Active Feature Acquisition (AFA) that enables systematic evaluation of feature selection methods across diverse datasets and policies.
Details
Motivation: Existing AFA methods lack fair and systematic evaluation due to the absence of standardized benchmarks, making it difficult to compare different approaches and understand their trade-offs.
Method: Developed AFABench with synthetic and real-world datasets, implemented representative algorithms from static, greedy, and reinforcement learning categories, and created AFAContext dataset to test lookahead capabilities.
Result: The benchmark successfully evaluated various AFA strategies, highlighted key trade-offs between methods, and exposed limitations of greedy selection approaches through the novel AFAContext dataset.
Conclusion: AFABench provides a comprehensive evaluation framework that enables standardized comparison of AFA methods and offers actionable insights for future research in cost-effective feature acquisition.
Abstract: In many real-world scenarios, acquiring all features of a data instance can be expensive or impractical due to monetary cost, latency, or privacy concerns. Active Feature Acquisition (AFA) addresses this challenge by dynamically selecting a subset of informative features for each data instance, trading predictive performance against acquisition cost. While numerous methods have been proposed for AFA, ranging from greedy information-theoretic strategies to non-myopic reinforcement learning approaches, fair and systematic evaluation of these methods has been hindered by the lack of standardized benchmarks. In this paper, we introduce AFABench, the first benchmark framework for AFA. Our benchmark includes a diverse set of synthetic and real-world datasets, supports a wide range of acquisition policies, and provides a modular design that enables easy integration of new methods and tasks. We implement and evaluate representative algorithms from all major categories, including static, greedy, and reinforcement learning-based approaches. To test the lookahead capabilities of AFA policies, we introduce a novel synthetic dataset, AFAContext, designed to expose the limitations of greedy selection. Our results highlight key trade-offs between different AFA strategies and provide actionable insights for future research. The benchmark code is available at: https://github.com/Linusaronsson/AFA-Benchmark.
[298] Addressing Graph Anomaly Detection via Causal Edge Separation and Spectrum
Zengyi Wo, Wenjun Wang, Minglai Shao, Chang Liu, Yumeng Wang, Yueheng Sun
Main category: cs.LG
TL;DR: Proposes CES2-GAD, a spectral neural network using causal edge separation for anomaly detection on heterophilic graphs where anomalous nodes hide connections.
Details
Motivation: Existing GNN-based anomaly detection methods fail on heterophilic graphs where anomalous entities hide direct links while adding legitimate connections, and spectral domain solutions are limited.
Method: Separates graph into homophilic/heterophilic edges using causal interventions, applies hybrid-spectrum filters to capture signals from segmented graphs, then concatenates representations for classification.
Result: Extensive experiments with real-world datasets demonstrate the effectiveness of the proposed CES2-GAD method.
Conclusion: The spectral approach with causal edge separation effectively addresses heterophilic anomaly detection problems by capturing shifted spectral energy patterns in anomalous nodes.
Abstract: In the real world, anomalous entities often add more legitimate connections while hiding direct links with other anomalous entities, leading to heterophilic structures in anomalous networks that most GNN-based techniques fail to address. Several works have been proposed to tackle this issue in the spatial domain. However, these methods overlook the complex relationships among node structure encoding, node features, and their contextual environment, and often lack principled guidance; research on solving heterophilic problems in the spectral domain remains limited. This study analyzes the spectral distribution of nodes with different heterophilic degrees and discovers that the heterophily of anomalous nodes causes the spectral energy to shift from low to high frequencies. To address these challenges, we propose CES2-GAD, a spectral neural network based on causal edge separation for anomaly detection on heterophilic graphs. First, CES2-GAD separates the original graph into homophilic and heterophilic edges using causal interventions. Subsequently, various hybrid-spectrum filters are used to capture signals from the segmented graphs. Finally, representations from multiple signals are concatenated and fed into a classifier to predict anomalies. Extensive experiments with real-world datasets demonstrate the effectiveness of the proposed method.
[299] Cross-Modality Controlled Molecule Generation with Diffusion Language Model
Yunzhe Zhang, Yifei Wang, Khanh Vinh Nguyen, Pengyu Hong
Main category: cs.LG
TL;DR: CMCM-DLM is a diffusion-based molecule generation method that enables cross-modality constraint handling without retraining, using separate structure and property control modules in a two-phase generation process.
Details
Motivation: Current SMILES-based diffusion models only support unimodal constraints and require retraining for new constraints, but real-world applications need multiple cross-modality constraints that may emerge during studies.
Method: Builds on pre-trained diffusion model with two trainable modules: Structure Control Module (SCM) for early diffusion steps to anchor molecular backbone, and Property Control Module (PCM) for later inference stages to refine chemical properties.
Result: Experimental results on multiple datasets demonstrate efficiency and adaptability, showing significant advancement in molecular generation for drug discovery.
Conclusion: CMCM-DLM successfully extends pre-trained diffusion models to support cross-modality constraints without retraining, providing a flexible framework for molecular generation with multiple constraints.
Abstract: Current SMILES-based diffusion models for molecule generation typically support only unimodal constraints. They inject conditioning signals at the start of the training process and require retraining a new model from scratch whenever the constraint changes. However, real-world applications often involve multiple constraints across different modalities, and additional constraints may emerge over the course of a study. This raises a challenge: how to extend a pre-trained diffusion model not only to support cross-modality constraints but also to incorporate new ones without retraining. To tackle this problem, we propose the Cross-Modality Controlled Molecule Generation with Diffusion Language Model (CMCM-DLM), demonstrated on two distinct modalities: molecular structure and chemical properties. Our approach builds upon a pre-trained diffusion model, incorporating two trainable modules, the Structure Control Module (SCM) and the Property Control Module (PCM), and operates in two distinct phases during the generation process. In Phase I, we employ the SCM to inject structural constraints during the early diffusion steps, effectively anchoring the molecular backbone. Phase II builds on this by introducing the PCM to guide the later stages of inference to refine the generated molecules, ensuring their chemical properties match the specified targets. Experimental results on multiple datasets demonstrate the efficiency and adaptability of our approach, highlighting CMCM-DLM’s significant advancement in molecular generation for drug discovery applications.
[300] CaTE Data Curation for Trustworthy AI
Mary Versa Clemens-Sewall, Christopher Cervantes, Emma Rafkin, J. Neil Otte, Tom Magelinski, Libby Lewis, Michelle Liu, Dana Udwin, Monique Kirkman-Bey
Main category: cs.LG
TL;DR: Practical guidance for building trustworthy AI systems through data curation, with step-by-step approaches, tool implementations, and synthesis of academic literature.
Details
Motivation: To provide development teams with actionable methods to enhance AI trustworthiness during the data curation phase, addressing the need for reliable and ethical AI systems.
Method: Defines key concepts (data, data curation, trustworthiness), outlines sequential steps for data curation, presents alternative approaches, and includes analysis of strengths/weaknesses, preconditions, outcomes, and open-source tool implementations.
Result: A comprehensive framework and synthesis of data curation practices that development teams can implement to build more trustworthy AI-enabled systems.
Conclusion: This report provides a coherent set of practical tools and approaches that equip data scientists and development teams with diverse methods to improve AI trustworthiness through effective data curation practices.
Abstract: This report provides practical guidance to teams designing or developing AI-enabled systems for how to promote trustworthiness during the data curation phase of development. In this report, the authors first define data, the data curation phase, and trustworthiness. We then describe a series of steps that the development team, especially data scientists, can take to build a trustworthy AI-enabled system. We enumerate the sequence of core steps and trace parallel paths where alternatives exist. The descriptions of these steps include strengths, weaknesses, preconditions, outcomes, and relevant open-source software tool implementations. In total, this report is a synthesis of data curation tools and approaches from relevant academic literature, and our goal is to equip readers with a diverse yet coherent set of practices for improving AI trustworthiness.
[301] MissionHD: Data-Driven Refinement of Reasoning Graph Structure through Hyperdimensional Causal Path Encoding and Decoding
Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Mohsen Imani
Main category: cs.LG
TL;DR: D-GSR is a new data-driven graph refinement method that optimizes LLM-generated reasoning graphs using downstream task data through hyperdimensional computing, achieving significant performance gains in video anomaly detection tasks.
Details
Motivation: LLM-generated reasoning graphs are often misaligned with downstream visual tasks like video anomaly detection, and existing graph refinement methods are unsuitable for these novel, dataset-less graphs.
Method: Proposed Data-driven GSR (D-GSR) paradigm that directly optimizes graph structure using downstream task data, implemented through MissionHD - a hyperdimensional computing framework with an efficient encode-decode process guided by task signals.
Result: Experiments on challenging video anomaly detection and video anomaly recognition benchmarks show significant performance improvements when using the refined graphs.
Conclusion: The approach serves as an effective pre-processing step that successfully aligns LLM-generated reasoning graphs with downstream visual tasks, validating the data-driven graph refinement paradigm.
Abstract: Reasoning graphs from Large Language Models (LLMs) are often misaligned with downstream visual tasks such as video anomaly detection (VAD). Existing Graph Structure Refinement (GSR) methods are ill-suited for these novel, dataset-less graphs. We introduce Data-driven GSR (D-GSR), a new paradigm that directly optimizes graph structure using downstream task data, and propose MissionHD, a hyperdimensional computing (HDC) framework to operationalize it. MissionHD uses an efficient encode-decode process to refine the graph, guided by the downstream task signal. Experiments on challenging VAD and VAR benchmarks show significant performance improvements when using our refined graphs, validating our approach as an effective pre-processing step.
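MissionHD's encode-decode step builds on standard hyperdimensional computing primitives. The toy sketch below shows only those generic primitives (random bipolar hypervectors, role-filler binding by elementwise multiplication, bundling by summation, similarity-based decoding) on a small causal path; it is not the paper's refinement procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality

def hv():
    """Random bipolar hypervector; random pairs are near-orthogonal."""
    return rng.choice([-1, 1], size=D)

node_hv = {n: hv() for n in ["cause", "effect", "context"]}
ROLE_SRC, ROLE_DST = hv(), hv()

def encode_edge(src, dst):
    """Bind role and filler (elementwise product), then bundle (sum)."""
    return node_hv[src] * ROLE_SRC + node_hv[dst] * ROLE_DST

# Encode a small causal path as the bundle of its edge codes.
graph_hv = np.sign(encode_edge("cause", "effect") + encode_edge("effect", "context"))

# Decode check: is "cause" plausibly a source node in the encoded graph?
similarity = (graph_hv * (node_hv["cause"] * ROLE_SRC)).mean()
print(similarity)  # well above 0 => the bound pair is likely present
```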
[302] PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning
Ruheng Wang, Hang Zhang, Trieu Nguyen, Shasha Feng, Hao-Wei Pang, Xiang Yu, Li Xiao, Peter Zhiping Zhang
Main category: cs.LG
TL;DR: PepThink-R1 is a novel generative framework that combines LLMs with chain-of-thought reasoning and reinforcement learning to design therapeutic peptides with improved properties and interpretability.
Details
Motivation: Current peptide design faces challenges including vast sequence space, limited experimental data, and poor interpretability of generative models, making therapeutic peptide optimization difficult.
Method: Integrates large language models with chain-of-thought supervised fine-tuning and reinforcement learning, using explicit monomer-level reasoning during sequence generation with a tailored reward function balancing chemical validity and property improvements.
Result: Generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming both general LLMs (like GPT-5) and domain-specific baselines in optimization success and interpretability.
Conclusion: First LLM-based peptide design framework combining explicit reasoning with RL-driven property control, representing a step toward reliable and transparent peptide optimization for therapeutic discovery.
Abstract: Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike prior approaches, PepThink-R1 explicitly reasons about monomer-level modifications during sequence generation, enabling interpretable design choices while optimizing for multiple pharmacological properties. Guided by a tailored reward function balancing chemical validity and property improvements, the model autonomously explores diverse sequence variants. We demonstrate that PepThink-R1 generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming existing general LLMs (e.g., GPT-5) and domain-specific baselines in both optimization success and interpretability. To our knowledge, this is the first LLM-based peptide design framework that combines explicit reasoning with RL-driven property control, marking a step toward reliable and transparent peptide optimization for therapeutic discovery.
[303] HERAKLES: Hierarchical Skill Compilation for Open-ended LLM Agents
Thomas Carta, Clément Romac, Loris Gaven, Pierre-Yves Oudeyer, Olivier Sigaud, Sylvain Lamprier
Main category: cs.LG
TL;DR: HERAKLES is a hierarchical reinforcement learning framework that enables AI agents to continuously compile mastered goals into low-level policies and dynamically expand subgoal spaces using LLMs for efficient goal decomposition and adaptation.
Details
Motivation: Open-ended AI agents need to efficiently learn increasingly complex and heterogeneous goals while controlling computational complexity growth. Existing hierarchical approaches rely on expert-defined subgoal spaces and pre-trained policies, which are inadequate for open-ended scenarios where goal spaces diversify across difficulty levels.
Method: A two-level hierarchical framework where mastered goals are continuously compiled into a small, fast low-level neural network policy. A Large Language Model serves as the high-level controller for goal decomposition and generalization over the dynamically evolving subgoal space.
Result: HERAKLES scales effectively with goal complexity in the Crafter environment, improves sample efficiency through skill compilation, and enables robust adaptation to novel challenges over time.
Conclusion: The framework successfully addresses the challenge of open-ended goal learning by combining hierarchical reinforcement learning with LLM-based high-level control and dynamic skill compilation, demonstrating effective scaling and adaptation capabilities.
Abstract: Open-ended AI agents need to be able to efficiently learn goals of increasing complexity, abstraction and heterogeneity over their lifetime. Beyond efficiently sampling their own goals, autotelic agents specifically need to be able to keep the growing complexity of goals under control, limiting the associated growth in sample and computational complexity. To address this challenge, recent approaches have leveraged hierarchical reinforcement learning (HRL) and language, capitalizing on its compositional and combinatorial generalization capabilities to acquire temporally extended reusable behaviours. Existing approaches use expert-defined spaces of subgoals over which they instantiate a hierarchy, and often assume pre-trained associated low-level policies. Such designs are inadequate in open-ended scenarios, where goal spaces naturally diversify across a broad spectrum of difficulties. We introduce HERAKLES, a framework that enables a two-level hierarchical autotelic agent to continuously compile mastered goals into the low-level policy, executed by a small, fast neural network, dynamically expanding the set of subgoals available to the high-level policy. We train a Large Language Model (LLM) to serve as the high-level controller, exploiting its strengths in goal decomposition and generalization to operate effectively over this evolving subgoal space. We evaluate HERAKLES in the open-ended Crafter environment and show that it scales effectively with goal complexity, improves sample efficiency through skill compilation, and enables the agent to adapt robustly to novel challenges over time.
[304] Federated Distillation on Edge Devices: Efficient Client-Side Filtering for Non-IID Data
Ahmed Mujtaba, Gleb Radchenko, Radu Prodan, Marc Masana
Main category: cs.LG
TL;DR: EdgeFD is a resource-efficient federated distillation method that uses KMeans-based density ratio estimation to filter proxy data on clients, eliminating complex server-side filtering and achieving high accuracy under heterogeneous conditions.
Details
Motivation: Existing federated distillation methods require computationally expensive statistical density ratio estimators for selective knowledge sharing and introduce latency through server-side filtering, making them impractical for resource-constrained edge devices.
Method: Proposes EdgeFD with an efficient KMeans-based density ratio estimator that filters both in-distribution and out-of-distribution proxy data on clients, removing the need for server-side filtering and complex statistical estimators.
Result: EdgeFD outperforms state-of-the-art methods across diverse scenarios (strong non-IID, weak non-IID, IID) without requiring a pre-trained teacher model, achieving accuracy close to IID scenarios under challenging conditions with significantly reduced computational overhead.
Conclusion: EdgeFD enhances the scalability and real-world applicability of federated distillation by providing a robust, resource-efficient solution suitable for deployment on resource-constrained edge devices.
Abstract: Federated distillation has emerged as a promising collaborative machine learning approach, offering enhanced privacy protection and reduced communication compared to traditional federated learning by exchanging model outputs (soft logits) rather than full model parameters. However, existing methods employ complex selective knowledge-sharing strategies that require clients to identify in-distribution proxy data through computationally expensive statistical density ratio estimators. Additionally, server-side filtering of ambiguous knowledge introduces latency to the process. To address these challenges, we propose a robust, resource-efficient EdgeFD method that reduces the complexity of the client-side density ratio estimation and removes the need for server-side filtering. EdgeFD introduces an efficient KMeans-based density ratio estimator for effectively filtering both in-distribution and out-of-distribution proxy data on clients, significantly improving the quality of knowledge sharing. We evaluate EdgeFD across diverse practical scenarios, including strong non-IID, weak non-IID, and IID data distributions on clients, without requiring a pre-trained teacher model on the server for knowledge distillation. Experimental results demonstrate that EdgeFD outperforms state-of-the-art methods, consistently achieving accuracy levels close to IID scenarios even under heterogeneous and challenging conditions. The significantly reduced computational overhead of the KMeans-based estimator is suitable for deployment on resource-constrained edge devices, thereby enhancing the scalability and real-world applicability of federated distillation. The code is available online for reproducibility.
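A plausible reading of the client-side filter, sketched under our own assumptions about the threshold: fit KMeans on local data and flag proxy samples whose nearest-centroid distance exceeds a local quantile as out-of-distribution.

```python
import numpy as np
from sklearn.cluster import KMeans

class KMeansProxyFilter:
    """KMeans-based stand-in for a density-ratio estimator: proxy samples
    far from every local centroid are treated as out-of-distribution."""
    def __init__(self, k=8, quantile=0.95):
        self.km = KMeans(n_clusters=k, n_init=10)
        self.quantile, self.tau = quantile, None

    def fit(self, X_local):
        d = self.km.fit(X_local).transform(X_local).min(axis=1)
        self.tau = np.quantile(d, self.quantile)  # typical in-distribution radius
        return self

    def in_distribution(self, X_proxy):
        # Keep proxy samples whose nearest-centroid distance is within tau.
        return self.km.transform(X_proxy).min(axis=1) <= self.tau
```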
[305] Context Steering: A New Paradigm for Compression-based Embeddings by Synthesizing Relevant Information Features
Guillermo Sarasa Durán, Ana Granados Fontecha, Francisco de Borja Rodríguez Ortíz
Main category: cs.LG
TL;DR: Context steering is a novel methodology that actively guides compression-based distance features to align with specific tasks, improving clustering and classification performance.
Details
Motivation: Compression-based distances provide flexible similarity measurement but often fail to align with specific tasks as features emerge passively from data rather than being task-defined.Method: Systematically analyzes how each object influences relational context within clustering to steer the feature-shaping process, generating custom embeddings that amplify class-distinctive information using NCD and NRC with hierarchical clustering.
Result: Experimental validation across heterogeneous datasets (text and audio) demonstrates robustness and generality, providing effective alternative to transductive methods.
Conclusion: Context steering represents a fundamental shift from passively discovering inherent data structures to actively shaping feature spaces tailored to specific objectives.
Abstract: Compression-based distances (CD) offer a flexible and domain-agnostic means of measuring similarity by identifying implicit information through redundancies between data objects. However, because similarity features are derived from the data rather than defined as an input, it often proves difficult to align them with the task at hand, particularly in complex clustering or classification settings. To address this issue, we introduce “context steering,” a novel methodology that actively guides the feature-shaping process. Instead of passively accepting the emergent data structure (typically a hierarchy derived from clustering CDs), our approach “steers” the process by systematically analyzing how each object influences the relational context within a clustering framework. This process generates a custom-tailored embedding that isolates and amplifies class-distinctive information. We validate the capabilities of this strategy using the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC) with common hierarchical clustering, providing an effective alternative to transductive methods. Experimental results across heterogeneous datasets, from text to real-world audio, validate the robustness and generality of context steering, marking a fundamental shift in the application of compression-based distances: from merely discovering inherent data structures to actively shaping a feature space tailored to a specific objective.
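The compression distance underlying this work can be stated directly. Below is the standard formulation of the Normalized Compression Distance; using zlib as the compressor and the toy strings are illustrative choices, not the paper's setup:

```python
# Hedged illustration of the Normalized Compression Distance (NCD) that
# compression-based methods build on; zlib is one convenient compressor.
import zlib

def c(data: bytes) -> int:
    """Compressed length of `data` under zlib at maximum compression."""
    return len(zlib.compress(data, level=9))

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Objects with shared structure compress well together (lower NCD):
print(ncd(b"the cat sat on the mat" * 20, b"the cat sat on a mat" * 20))
```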
[306] Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method
Suleyman Olcay Polat, Poli A. Nemkova, Mark V. Albert
Main category: cs.LG
TL;DR: Novel adaptive distillation framework using loss-aware data augmentation and vectorized teacher-student interface for efficient model compression
Details
Motivation: Conventional model distillation approaches suffer from computational overhead and limited generalization, making deployment in resource-constrained environments challengingMethod: Dynamic data augmentation in high-loss regions using UMAP dimensionality reduction and nearest neighbor sampling, plus lightweight teacher-student interface that bypasses teacher’s input layer for direct distillation on vectorized representations
Result: 66M-parameter student model achieves 91.2% on QNLI and 92.3% on SST-2, matching or surpassing established baselines with fewer training epochs
Conclusion: Loss-aware data augmentation and vectorized distillation show promise for efficient and effective model compression
Abstract: Model distillation enables the transfer of knowledge from large-scale models to compact student models, facilitating deployment in resource-constrained environments. However, conventional distillation approaches often suffer from computational overhead and limited generalization. We propose a novel adaptive distillation framework that dynamically augments training data in regions of high student model loss. Using UMAP-based dimensionality reduction and nearest neighbor sampling, our method identifies underperforming regions in the embedding space and generates targeted synthetic examples to guide student learning. To further improve efficiency, we introduce a lightweight teacher-student interface that bypasses the teacher’s input layer, enabling direct distillation on vectorized representations. Experiments across standard NLP benchmarks demonstrate that our 66M-parameter student model consistently matches or surpasses established baselines, achieving 91.2% on QNLI and 92.3% on SST-2, while training with fewer epochs. These results highlight the promise of loss-aware data augmentation and vectorized distillation for efficient and effective model compression.
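A rough sketch of the loss-aware targeting step follows, under the assumption that UMAP embeddings of student representations are searched for neighbors of high-loss points; the quantile threshold, neighbor count, and function name are hypothetical stand-ins for the paper's pipeline:

```python
# Hedged sketch of loss-aware region targeting: embed student representations
# with UMAP, find points near high student loss, and return them as seeds for
# targeted synthetic examples. Quantile, neighbor count, and names are assumed.
import numpy as np
import umap  # pip install umap-learn
from sklearn.neighbors import NearestNeighbors

def high_loss_seed_indices(reps: np.ndarray, losses: np.ndarray,
                           loss_quantile: float = 0.9, k: int = 5) -> np.ndarray:
    emb = umap.UMAP(n_components=2, random_state=0).fit_transform(reps)
    hot = losses >= np.quantile(losses, loss_quantile)   # high-loss region
    nn = NearestNeighbors(n_neighbors=k).fit(emb)
    _, idx = nn.kneighbors(emb[hot])
    # Unique indices near high-loss points; synthetic examples would be
    # generated around these seeds to guide the student.
    return np.unique(idx.ravel())
```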
[307] Graph Structure Learning with Temporal Graph Information Bottleneck for Inductive Representation Learning
Jiafeng Xiong, Rizos Sakellariou
Main category: cs.LG
TL;DR: GTGIB integrates Graph Structure Learning with Temporal Graph Information Bottleneck to handle dynamic networks with evolving nodes and edges, achieving superior performance in both inductive and transductive settings.
Details
Motivation: Address challenges in temporal graph learning: representing unseen nodes in dynamic networks and mitigating noisy/redundant graph information as networks evolve over time with continuous new node additions.Method: Proposes GTGIB framework with two-step GSL-based structural enhancer to optimize node neighborhoods, and TGIB that extends information bottleneck principle to temporal graphs with variational approximation for tractable optimization of edges and features.
Result: Outperforms existing methods on all four real-world datasets in inductive setting, with significant and consistent improvements in transductive setting for link prediction tasks.
Conclusion: GTGIB effectively handles temporal graph learning challenges through integrated structure learning and information bottleneck approach, demonstrating superior performance across different evaluation settings.
Abstract: Temporal graph learning is crucial for dynamic networks where nodes and edges evolve over time and new nodes continuously join the system. Inductive representation learning in such settings faces two major challenges: effectively representing unseen nodes and mitigating noisy or redundant graph information. We propose GTGIB, a versatile framework that integrates Graph Structure Learning (GSL) with Temporal Graph Information Bottleneck (TGIB). We design a novel two-step GSL-based structural enhancer to enrich and optimize node neighborhoods and demonstrate its effectiveness and efficiency through theoretical proofs and experiments. The TGIB refines the optimized graph by extending the information bottleneck principle to temporal graphs, regularizing both edges and features based on our derived tractable TGIB objective function via variational approximation, enabling stable and efficient optimization. GTGIB-based models are evaluated to predict links on four real-world datasets; they outperform existing methods in all datasets under the inductive setting, with significant and consistent improvement in the transductive setting.
[308] A Guide for Manual Annotation of Scientific Imagery: How to Prepare for Large Projects
Azim Ahmadzadeh, Rohan Adhyapak, Armin Iraji, Kartik Chaurasiya, V Aparna, Petrus C. Martens
Main category: cs.LG
TL;DR: A practical guide for managing complex manual annotation projects, focusing on scientific imagery, with strategies to address challenges like data collection, bias mitigation, and team organization.
Details
Motivation: Manual annotation projects are complex and costly but lack comprehensive guidelines, especially for domain experts who may not have expertise in project management aspects like resource allocation and bias mitigation.Method: The paper draws from the authors’ extensive experience managing large annotation projects, providing domain-agnostic guidance covering success measures, annotation subjects, project goals, data availability, team roles, human biases, and recommended tools/technologies.
Result: The paper provides a systematic preparation framework that addresses the interconnected challenges of annotation projects, offering practical recommendations to improve annotation quality and efficiency.
Conclusion: This guide aims to encourage further research and development of comprehensive frameworks to reduce costs and improve outcomes of manual annotation projects across various scientific and technical domains.
Abstract: Despite the high demand for manually annotated image data, managing complex and costly annotation projects remains under-discussed. This is partly due to the fact that leading such projects requires dealing with a set of diverse and interconnected challenges which often fall outside the expertise of specific domain experts, leaving practical guidelines scarce. These challenges range widely from data collection to resource allocation and recruitment, from mitigation of biases to effective training of the annotators. This paper provides a domain-agnostic preparation guide for annotation projects, with a focus on scientific imagery. Drawing from the authors’ extensive experience in managing a large manual annotation project, it addresses fundamental concepts including success measures, annotation subjects, project goals, data availability, and essential team roles. Additionally, it discusses various human biases and recommends tools and technologies to improve annotation quality and efficiency. The goal is to encourage further research and frameworks for creating a comprehensive knowledge base to reduce the costs of manual annotation projects across various fields.
[309] Source-Guided Flow Matching
Zifan Wang, Alice Harting, Matthieu Barreau, Michael M. Zavlanos, Karl H. Johansson
Main category: cs.LG
TL;DR: SGFM modifies source distribution instead of vector field for guidance, enabling flexible sampling methods while preserving exact target distribution recovery.
Details
Motivation: Traditional guidance methods modify probability flow vector fields, which can be complex and inflexible. SGFM aims to simplify guidance by working directly with source distributions while keeping pre-trained vector fields intact.Method: Proposes Source-Guided Flow Matching (SGFM) framework that modifies the source distribution directly rather than the vector field. Reduces guidance to sampling from source distribution, allows flexible choice of sampling methods, and preserves straight transport maps from optimal flow matching.
Result: Theoretical proof that SGFM exactly recovers desired target distribution. Provides Wasserstein error bounds for approximate samplers and vector fields. Experimental validation on 2D benchmarks, image datasets, and physics-informed tasks shows effectiveness and flexibility.
Conclusion: SGFM provides a flexible and theoretically sound alternative to traditional guidance methods, enabling exact distribution recovery while allowing users to choose appropriate sampling methods for their specific problems.
Abstract: Guidance of generative models is typically achieved by modifying the probability flow vector field through the addition of a guidance field. In this paper, we instead propose the Source-Guided Flow Matching (SGFM) framework, which modifies the source distribution directly while keeping the pre-trained vector field intact. This reduces the guidance problem to a well-defined problem of sampling from the source distribution. We theoretically show that SGFM recovers the desired target distribution exactly. Furthermore, we provide bounds on the Wasserstein error for the generated distribution when using an approximate sampler of the source distribution and an approximate vector field. The key benefit of our approach is that it allows the user to flexibly choose the sampling method depending on their specific problem. To illustrate this, we systematically compare different sampling methods and discuss conditions for asymptotically exact guidance. Moreover, our framework integrates well with optimal flow matching models since the straight transport map generated by the vector field is preserved. Experimental results on synthetic 2D benchmarks, image datasets, and physics-informed generative tasks demonstrate the effectiveness and flexibility of the proposed framework.
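The recipe the abstract describes is compact: draw the initial sample from the modified source, then integrate the untouched pre-trained vector field. A minimal sketch, with the guided source sampler and the Euler integrator as illustrative placeholders:

```python
# Hedged sketch of the SGFM recipe: guidance enters only through the source
# sample; the pre-trained vector field v(x, t) is integrated unchanged.
import torch

def generate(v, sample_guided_source, n: int, steps: int = 100):
    """v: pre-trained vector field v(x, t); sample_guided_source: draws x0
    from the *modified* source distribution (the only place guidance enters)."""
    x = sample_guided_source(n)          # guidance happens here
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n, 1), i * dt)
        x = x + dt * v(x, t)             # plain flow-matching integration
    return x
```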
[310] LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
Klaudia Bałazy, Mohammadreza Banaei, Karl Aberer, Jacek Tabor
Main category: cs.LG
TL;DR: LoRA-XS is a parameter-efficient fine-tuning method that uses a small trainable matrix between frozen SVD-derived low-rank matrices, reducing storage by 100x+ vs LoRA while maintaining or improving accuracy.
Details
Motivation: Address storage and computational challenges of deploying multiple task/user-specific modules in large language models, overcoming LoRA's limitations.Method: Incorporates small trainable weight matrix between frozen low-rank matrices from SVD of pre-trained weights, enabling scaling from single parameter to large values.
Result: Reduces storage by over 100x in 7B models, outperforms/matches LoRA and VeRA accuracy on GLUE, GSM8K, MATH, and commonsense reasoning benchmarks.
Conclusion: LoRA-XS provides unmatched parameter efficiency, storage-efficient solution for scaling and personalizing LLMs, with significance of singular vectors demonstrated.
Abstract: The growth of large language models underscores the need for parameter-efficient fine-tuning. Despite its popularity, LoRA encounters storage and computational challenges when deploying multiple task- or user-specific modules. To address this, we introduce LoRA-XS, a novel fine-tuning method backed by a theoretical derivation. LoRA-XS drastically reduces trainable parameters by incorporating a small, trainable weight matrix between frozen low-rank matrices derived from the Singular Value Decomposition of pre-trained weights. This design enables LoRA-XS to reduce storage requirements by over 100x in 7B models compared to LoRA. Additionally, unlike other methods, LoRA-XS imposes no lower bound on trainable parameters: it can scale from a single parameter per module to arbitrarily large values, adapting to any storage or computational constraint. Evaluations on GLUE, GSM8K, MATH, and commonsense reasoning benchmarks across different model scales reveal that LoRA-XS consistently outperforms or matches LoRA and VeRA in accuracy, offering unmatched parameter efficiency. Our ablation studies highlight the significance of singular vectors in transformer weights, establishing LoRA-XS as a powerful, storage-efficient solution for scaling and personalizing large language models.
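The parameterization is compact enough to sketch. The following reflects the abstract's description (frozen rank-r factors from an SVD of the pre-trained weight, one small trainable r-by-r matrix between them); the exact factor scaling is an assumption:

```python
# Minimal sketch of the LoRA-XS parameterization: frozen rank-r SVD factors
# of the pre-trained weight, with only the tiny r x r matrix R trained.
import torch
import torch.nn as nn

class LoRAXSLinear(nn.Module):
    def __init__(self, weight: torch.Tensor, r: int = 4):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.weight = nn.Parameter(weight, requires_grad=False)       # frozen W
        self.A = nn.Parameter(U[:, :r] * S[:r], requires_grad=False)  # frozen
        self.B = nn.Parameter(Vh[:r, :], requires_grad=False)         # frozen
        self.R = nn.Parameter(torch.zeros(r, r))  # the only trainable part

    def forward(self, x):
        # y = x (W + A R B)^T; the update A R B has rank at most r.
        return x @ (self.weight + self.A @ self.R @ self.B).T
```

With r = 4, each module trains only r^2 = 16 parameters, which is where the 100x-plus storage reduction over LoRA comes from.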
[311] Enhancing Contrastive Link Prediction With Edge Balancing Augmentation
Chen-Hao Chang, Hui-Ju Hung, Chia-Hsun Lu, Chih-Ya Shen
Main category: cs.LG
TL;DR: This paper introduces CoEBA, a novel contrastive learning approach for link prediction that addresses theoretical gaps and node degree considerations through Edge Balancing Augmentation and new contrastive losses.
Details
Motivation: Existing contrastive learning methods for link prediction lack theoretical analysis and fail to adequately consider node degrees, which limits their performance and understanding.Method: Proposes Edge Balancing Augmentation (EBA) to adjust node degrees and Contrastive Link Prediction with EBA (CoEBA) that integrates EBA with new contrastive losses based on theoretical analysis.
Result: Experimental results on 8 benchmark datasets show that CoEBA significantly outperforms state-of-the-art link prediction models.
Conclusion: The proposed CoEBA framework successfully addresses theoretical weaknesses and node degree considerations in contrastive learning for link prediction, achieving superior performance across multiple datasets.
Abstract: Link prediction is one of the most fundamental tasks in graph mining, which motivates the recent studies of leveraging contrastive learning to enhance the performance. However, we observe two major weaknesses of these studies: i) the lack of theoretical analysis for contrastive learning on link prediction, and ii) inadequate consideration of node degrees in contrastive learning. To address the above weaknesses, we provide the first formal theoretical analysis for contrastive learning on link prediction, where our analysis results can generalize to the autoencoder-based link prediction models with contrastive learning. Motivated by our analysis results, we propose a new graph augmentation approach, Edge Balancing Augmentation (EBA), which adjusts the node degrees in the graph as the augmentation. We then propose a new approach, named Contrastive Link Prediction with Edge Balancing Augmentation (CoEBA), that integrates the proposed EBA and the proposed new contrastive losses to improve the model performance. We conduct experiments on 8 benchmark datasets. The results demonstrate that our proposed CoEBA significantly outperforms the other state-of-the-art link prediction models.
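The abstract specifies only that EBA adjusts node degrees as the augmentation. The toy below shows one plausible shape such a degree-balancing augmentation could take, dropping edges incident to high-degree nodes more often; the drop rule and `strength` parameter are assumptions, not the paper's method:

```python
# Hedged toy of a degree-balancing augmentation: edges touching high-degree
# nodes are dropped more often, flattening the degree distribution.
import numpy as np

def degree_balancing_drop(edges: np.ndarray, n_nodes: int,
                          strength: float = 0.5, seed: int = 0) -> np.ndarray:
    """edges: (E, 2) int array of an undirected graph; returns a kept subset."""
    rng = np.random.default_rng(seed)
    deg = np.bincount(edges.ravel(), minlength=n_nodes).astype(float)
    score = deg[edges[:, 0]] + deg[edges[:, 1]]          # endpoint degree sum
    p_drop = strength * (score - score.min()) / (score.max() - score.min() + 1e-9)
    return edges[rng.random(len(edges)) >= p_drop]
```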
[312] LGR2: Language Guided Reward Relabeling for Accelerating Hierarchical Reinforcement Learning
Utsav Singh, Pramit Bhattacharyya, Vinay P. Namboodiri
Main category: cs.LG
TL;DR: LGR2 is a hierarchical reinforcement learning framework that uses large language models to generate language-guided reward functions, addressing non-stationarity issues in HRL and enabling stable learning for robotic tasks.
Details
Motivation: Traditional HRL suffers from non-stationarity caused by changing lower-level policies during training, which destabilizes higher-level policy learning. There's a need to bridge natural language instructions to effective robotic control, especially for long-horizon planning in sparse reward environments.Method: LGR2 leverages LLMs to generate language-guided reward functions for higher-level policies, decoupling reward generation from low-level policy changes. It integrates goal-conditioned hindsight experience relabeling to enhance sample efficiency in sparse environments.
Result: LGR2 outperforms both hierarchical and non-hierarchical baselines, achieving over 55% success rates on challenging robotic navigation and manipulation tasks. It demonstrates robust transfer to real robots without additional fine-tuning.
Conclusion: The framework effectively mitigates non-stationarity in off-policy HRL, enabling stable and efficient learning for language-to-robotic-control translation, particularly in sparse reward and long-horizon planning scenarios.
Abstract: Large language models (LLMs) have shown remarkable abilities in logical reasoning, in-context learning, and code generation. However, translating natural language instructions into effective robotic control policies remains a significant challenge, especially for tasks requiring long-horizon planning and operating under sparse reward conditions. Hierarchical Reinforcement Learning (HRL) provides a natural framework to address this challenge in robotics; however, it typically suffers from non-stationarity caused by the changing behavior of the lower-level policy during training, destabilizing higher-level policy learning. We introduce LGR2, a novel HRL framework that leverages LLMs to generate language-guided reward functions for the higher-level policy. By decoupling high-level reward generation from low-level policy changes, LGR2 fundamentally mitigates the non-stationarity problem in off-policy HRL, enabling stable and efficient learning. To further enhance sample efficiency in sparse environments, we integrate goal-conditioned hindsight experience relabeling. Extensive experiments across simulated and real-world robotic navigation and manipulation tasks demonstrate LGR2 outperforms both hierarchical and non-hierarchical baselines, achieving over 55% success rates on challenging tasks and robust transfer to real robots, without additional fine-tuning.
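Goal-conditioned hindsight relabeling, which LGR2 integrates for sample efficiency, is sketched below; the episode format and `reward_fn` (standing in for an LLM-generated reward function) are assumptions:

```python
# Hedged sketch of hindsight experience relabeling: pretend the goal actually
# achieved at episode end was the desired goal, and recompute rewards, here
# via a stand-in for an LLM-generated reward function.
def relabel_with_hindsight(episode, reward_fn):
    """episode: list of (obs, action, achieved_goal, desired_goal) tuples.
    Returns extra transitions relabeled with the final achieved goal."""
    final_goal = episode[-1][2]                  # goal actually achieved
    relabeled = []
    for obs, action, achieved, _ in episode:
        r = reward_fn(achieved, final_goal)      # recomputed, denser reward
        relabeled.append((obs, action, achieved, final_goal, r))
    return relabeled
```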
[313] Successive Halving with Learning Curve Prediction via Latent Kronecker Gaussian Processes
Jihao Andreas Lin, Nicolas Mayoraz, Steffen Rendle, Dima Kuzmin, Emil Praun, Berivan Isik
Main category: cs.LG
TL;DR: Predictive Successive Halving using Latent Kronecker Gaussian Processes shows competitive performance but isn’t Pareto optimal compared to the standard approach, because it requires fully observed learning curves as training data.
Details
Motivation: Standard Successive Halving prematurely prunes slow starters that could become best candidates because it relies only on intermediate performance values for resource allocation decisions.Method: Using learning curve predictions based on Latent Kronecker Gaussian Processes to guide Successive Halving instead of relying solely on current performance values.
Result: Predictive approach achieves competitive performance but is not Pareto optimal compared to investing more resources into standard Successive Halving.
Conclusion: While predictive Successive Halving works well, its requirement for fully observed learning curves as training data makes it less optimal than simply allocating more resources to the standard approach, though this limitation could be mitigated by leveraging existing learning curve data.
Abstract: Successive Halving is a popular algorithm for hyperparameter optimization which allocates exponentially more resources to promising candidates. However, the algorithm typically relies on intermediate performance values to make resource allocation decisions, which can cause it to prematurely prune slow starters that would eventually become the best candidate. We investigate whether guiding Successive Halving with learning curve predictions based on Latent Kronecker Gaussian Processes can overcome this limitation. In a large-scale empirical study involving different neural network architectures and a click prediction dataset, we compare this predictive approach to the standard approach based on current performance values. Our experiments show that, although the predictive approach achieves competitive performance, it is not Pareto optimal compared to investing more resources into the standard approach, because it requires fully observed learning curves as training data. However, this downside could be mitigated by leveraging existing learning curve data.
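For reference, the standard Successive Halving baseline that the predictive variant is compared against looks roughly like this; `train_for`, the budget schedule, and the halving factor are illustrative:

```python
# Hedged sketch of standard Successive Halving. `train_for` is an assumed
# hook that trains a candidate for a resource budget and returns its score.
def successive_halving(candidates, train_for, min_budget=1, eta=2, rounds=4):
    budget = min_budget
    for _ in range(rounds):
        scores = {c: train_for(c, budget) for c in candidates}
        # Keep the top 1/eta fraction. The predictive variant would instead
        # rank by a Latent Kronecker GP forecast of the *final* score,
        # rescuing slow starters that look weak at this budget.
        k = max(1, len(candidates) // eta)
        candidates = sorted(candidates, key=lambda c: -scores[c])[:k]
        budget *= eta
    return candidates[0]
```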
[314] On Defining Neural Averaging
Su Hyeong Lee, Richard Ngo
Main category: cs.LG
TL;DR: This paper introduces Amortized Model Ensembling (AME), a data-free method for averaging neural network weights from multiple pretrained models trained on disjoint data, which outperforms existing approaches like model soup.
Details
Motivation: To develop a principled definition and method for neural network averaging that synthesizes a single model from multiple pretrained models without access to training data, addressing the limitations of existing approaches like model soup.Method: Proposes Amortized Model Ensembling (AME), a data-free meta-optimization approach that treats model differences as pseudogradients to guide neural weight updates, enabling more expressive and adaptive ensembling strategies.
Result: AME produces averaged neural solutions that outperform both individual expert models and model soup baselines, particularly in out-of-distribution settings.
Conclusion: The work establishes a principled and generalizable framework for data-free model weight aggregation, providing a formal definition of how to perform neural averaging through the AME approach.
Abstract: What does it even mean to average neural networks? We investigate the problem of synthesizing a single neural network from a collection of pretrained models, each trained on disjoint data shards, using only their final weights and no access to training data. In forming a definition of neural averaging, we take insight from model soup, which appears to aggregate multiple models into a singular model while enhancing generalization performance. In this work, we reinterpret model souping as a special case of a broader framework: Amortized Model Ensembling (AME) for neural averaging, a data-free meta-optimization approach that treats model differences as pseudogradients to guide neural weight updates. We show that this perspective not only recovers model soup but enables more expressive and adaptive ensembling strategies. Empirically, AME produces averaged neural solutions that outperform both individual experts and model soup baselines, especially in out-of-distribution settings. Our results suggest a principled and generalizable notion of data-free model weight aggregation and define, in one sense, how to perform neural averaging.
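A minimal sketch of the pseudogradient view, assuming plain SGD as the meta-optimizer; with plain SGD the iteration simply converges toward the weight mean (model soup), consistent with the paper's claim that souping is a special case, while richer optimizers would yield the more adaptive variants:

```python
# Hedged sketch: the gap between the current average and the experts' mean
# acts as a pseudogradient consumed by an optimizer (plain SGD here).
import torch

def amortized_ensemble(expert_states, steps: int = 100, lr: float = 0.1):
    keys = expert_states[0].keys()
    avg = {k: expert_states[0][k].clone() for k in keys}  # init at one expert
    for _ in range(steps):
        for k in keys:
            # Mean model difference, treated as a pseudogradient.
            pseudo_grad = avg[k] - torch.stack(
                [s[k] for s in expert_states]).mean(0)
            avg[k] -= lr * pseudo_grad
    return avg
```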
[315] Squeezed Diffusion Models
Jyotirmai Singh, Samar Khanna, James Burgess
Main category: cs.LG
TL;DR: Squeezed Diffusion Models (SDM) introduce anisotropic noise scaling along data principal components, improving FID by up to 15% on standard benchmarks without architectural changes.
Details
Motivation: Standard diffusion models use isotropic Gaussian noise that ignores data structure. Inspired by quantum squeezed states and Heisenberg uncertainty principle, the authors hypothesize that data-dependent noise scaling can better help models learn important features.Method: Two configurations: (1) Heisenberg diffusion model that compensates principal axis scaling with inverse scaling on orthogonal directions, and (2) standard SDM that scales only the principal axis. Applied mild antisqueezing (increasing variance on principal axis).
Result: On CIFAR-10/100 and CelebA-64, mild antisqueezing consistently improved FID by up to 15% and shifted precision-recall frontier toward higher recall.
Conclusion: Simple data-aware noise shaping delivers robust generative performance improvements without requiring architectural modifications, demonstrating the value of structured noise injection in diffusion models.
Abstract: Diffusion models typically inject isotropic Gaussian noise, disregarding structure in the data. Motivated by the way quantum squeezed states redistribute uncertainty according to the Heisenberg uncertainty principle, we introduce Squeezed Diffusion Models (SDM), which scale noise anisotropically along the principal component of the training distribution. As squeezing enhances the signal-to-noise ratio in physics, we hypothesize that scaling noise in a data-dependent manner can better assist diffusion models in learning important data features. We study two configurations: (i) a Heisenberg diffusion model that compensates the scaling on the principal axis with inverse scaling on orthogonal directions and (ii) a standard SDM variant that scales only the principal axis. Counterintuitively, on CIFAR-10/100 and CelebA-64, mild antisqueezing (i.e., increasing variance on the principal axis) consistently improves FID by up to 15% and shifts the precision-recall frontier toward higher recall. Our results demonstrate that simple, data-aware noise shaping can deliver robust generative gains without architectural changes.
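The core change to the noising step is small. A toy version, assuming a (batch, dim) data layout and a known unit principal direction; the squeeze factor and schedule are simplified relative to the paper:

```python
# Hedged toy of anisotropic ("squeezed") noising: rescale the noise only
# along the data's top principal direction. squeeze > 1 corresponds to the
# mild antisqueezing the paper reports as helpful.
import torch

def squeezed_noise(x: torch.Tensor, principal_dir: torch.Tensor,
                   squeeze: float = 1.1) -> torch.Tensor:
    """x: (batch, dim) clean data; principal_dir: (dim,) unit vector."""
    eps = torch.randn_like(x)
    # Split noise into the principal component and its orthogonal complement,
    # then rescale only the principal part.
    coef = eps @ principal_dir                    # (batch,)
    parallel = coef.unsqueeze(1) * principal_dir  # component along the PC
    return (eps - parallel) + squeeze * parallel
```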
[316] Multimodal Quantum Vision Transformer for Enzyme Commission Classification from Biochemical Representations
Murat Isik, Mandeep Kaur Saggi, Humaira Gowher, Sabre Kais
Main category: cs.LG
TL;DR: Novel multimodal Quantum Machine Learning framework integrating protein sequences, quantum descriptors, molecular graphs, and 2D images for enhanced enzyme function classification, achieving 85.1% accuracy.
Details
Motivation: Addressing the challenge of accurately predicting enzyme functionality, especially for enzymes with limited structural annotations or sequence homology, by leveraging multiple biochemical modalities.Method: Quantum Vision Transformer (QVT) backbone with modality-specific encoders and cross-attention fusion module that integrates protein sequence embeddings, quantum-derived electronic descriptors, molecular graph structures, and 2D molecular image representations.
Result: Achieves top-1 accuracy of 85.1%, substantially outperforming sequence-only baselines and showing better performance compared to other Quantum Machine Learning models.
Conclusion: The multimodal QML framework successfully captures key stereoelectronic interactions behind enzyme function through integrated graph features and spatial patterns, demonstrating superior performance in enzyme classification.
Abstract: Accurately predicting enzyme functionality remains one of the major challenges in computational biology, particularly for enzymes with limited structural annotations or sequence homology. We present a novel multimodal Quantum Machine Learning (QML) framework that enhances Enzyme Commission (EC) classification by integrating four complementary biochemical modalities: protein sequence embeddings, quantum-derived electronic descriptors, molecular graph structures, and 2D molecular image representations. The framework is built on a Quantum Vision Transformer (QVT) backbone equipped with modality-specific encoders and a unified cross-attention fusion module. By integrating graph features and spatial patterns, our method captures key stereoelectronic interactions behind enzyme function. Experimental results demonstrate that our multimodal QVT model achieves a top-1 accuracy of 85.1%, outperforming sequence-only baselines by a substantial margin and achieving better performance than other QML models.
[317] Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent
Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu
Main category: cs.LG
TL;DR: Proposes an intrinsic optimization method using exponentiated gradient descent with Bregman projection to generate effective jailbreak suffixes for LLMs, achieving higher success rates and faster convergence than existing methods.
Details
Motivation: LLMs remain vulnerable to jailbreak attacks despite alignment techniques like RLHF. Existing methods are inefficient in discrete token spaces or face challenges when projecting continuous embeddings back to discrete tokens for proprietary models.Method: Direct optimization of relaxed one-hot encodings of adversarial suffix tokens using exponentiated gradient descent coupled with Bregman projection, ensuring optimized encodings remain within the probability simplex.
Result: Achieves higher success rates and faster convergence compared to three state-of-the-art baselines on five open-source LLMs and four adversarial behavior datasets. Also generates universal adversarial suffixes effective across multiple prompts with transferability to different LLMs.
Conclusion: The proposed intrinsic optimization method provides an effective approach for jailbreaking LLMs with theoretical convergence guarantees and practical efficiency improvements over existing techniques.
Abstract: As large language models (LLMs) are increasingly deployed in critical applications, ensuring their robustness and safety alignment remains a major challenge. Despite the overall success of alignment techniques such as reinforcement learning from human feedback (RLHF) on typical prompts, LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts. Most existing jailbreak methods either rely on inefficient searches over discrete token spaces or direct optimization of continuous embeddings. While continuous embeddings can be given directly to selected open-source models as input, doing so is not feasible for proprietary models. On the other hand, projecting these embeddings back into valid discrete tokens introduces additional complexity and often reduces attack effectiveness. We propose an intrinsic optimization method which directly optimizes relaxed one-hot encodings of the adversarial suffix tokens using exponentiated gradient descent coupled with Bregman projection, ensuring that the optimized one-hot encoding of each token always remains within the probability simplex. We provide theoretical proof of convergence for our proposed method and implement an efficient algorithm that effectively jailbreaks several widely used LLMs. Our method achieves higher success rates and faster convergence compared to three state-of-the-art baselines, evaluated on five open-source LLMs and four adversarial behavior datasets curated for evaluating jailbreak methods. In addition to individual prompt attacks, we also generate universal adversarial suffixes effective across multiple prompts and demonstrate transferability of optimized suffixes to different LLMs.
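The core update is compact enough to sketch. For rows constrained to the probability simplex, the Bregman (KL) projection after an exponentiated-gradient step reduces to renormalization, so one step looks like this (learning rate and shapes illustrative):

```python
# Hedged sketch of one exponentiated-gradient step on relaxed one-hot token
# encodings. Multiplying by exp(-lr * grad) and renormalizing keeps each row
# on the probability simplex; for the simplex the Bregman/KL projection is
# exactly this normalization.
import torch

def egd_step(p: torch.Tensor, grad: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """p: (suffix_len, vocab) rows on the simplex; grad: same shape."""
    p = p * torch.exp(-lr * grad)            # multiplicative (exponentiated) update
    return p / p.sum(dim=-1, keepdim=True)   # renormalize onto the simplex
```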
[318] Compute-Optimal Scaling for Value-Based Deep RL
Preston Fu, Oleh Rybkin, Zhiyuan Zhou, Michal Nauman, Pieter Abbeel, Sergey Levine, Aviral Kumar
Main category: cs.LG
TL;DR: This paper investigates compute-optimal scaling for online, value-based deep reinforcement learning, examining how to partition compute resources between model capacity and update-to-data ratio to maximize sample efficiency.
Details
Motivation: As model training becomes more expensive, there's a need for compute-optimal scaling in reinforcement learning, similar to what has been studied in language modeling, to extract maximal performance per unit of compute.Method: The study analyzes the interplay between model size, batch size, and update-to-data ratio in value-based deep RL, identifying TD-overfitting phenomenon and providing guidelines for compute allocation.
Result: The research reveals that increasing batch size quickly harms Q-function accuracy for small models (TD-overfitting), but this effect is absent in large models, enabling effective use of large batch sizes at scale.
Conclusion: The findings provide grounded guidelines for choosing batch size and UTD ratio to optimize compute usage in deep RL, offering a starting point for compute-optimal scaling adapted to TD learning.
Abstract: As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal manner that extracts maximal performance per unit of compute. While such scaling has been well studied for language modeling, reinforcement learning (RL) has received less attention in this regard. In this paper, we investigate compute scaling for online, value-based deep RL. These methods present two primary axes for compute allocation: model capacity and the update-to-data (UTD) ratio. Given a fixed compute budget, we ask: how should resources be partitioned across these axes to maximize sample efficiency? Our analysis reveals a nuanced interplay between model size, batch size, and UTD. In particular, we identify a phenomenon we call TD-overfitting: increasing the batch quickly harms Q-function accuracy for small models, but this effect is absent in large models, enabling effective use of large batch size at scale. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD to optimize compute usage. Our findings provide a grounded starting point for compute-optimal scaling in deep RL, mirroring studies in supervised learning but adapted to TD learning.
[319] CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search
Xiaoya Li, Xiaofei Sun, Albert Wang, Chris Shum, Jiwei Li
Main category: cs.LG
TL;DR: CRINN is a reinforcement learning approach for ANNS optimization that automatically generates faster implementations while maintaining accuracy, achieving state-of-the-art performance on multiple benchmarks.
Details
Motivation: ANNS algorithms are critical for AI applications like RAG and agent-based LLMs, but current methods require specialized knowledge and manual optimization.Method: Treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal, enabling automatic generation of progressively faster implementations.
Result: Achieves best performance on 3 out of 6 benchmark datasets (GIST-960-Euclidean, MNIST-784-Euclidean, GloVe-25-angular) and ties for first on 2 others (SIFT-128-Euclidean, GloVe-25-angular).
Conclusion: CRINN validates that LLMs augmented with reinforcement learning can automate sophisticated algorithmic optimizations, extending beyond ANNS to broader algorithmic optimization challenges.
Abstract: Approximate nearest-neighbor search (ANNS) algorithms have become increasingly critical for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN’s effectiveness across six widely-used NNS benchmark datasets. When compared against state-of-the-art open-source ANNS algorithms, CRINN achieves best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular), and tied for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN’s success reach well beyond ANNS optimization: It validates that LLMs augmented with reinforcement learning can function as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement. Code can be found at https://github.com/deepreinforce-ai/CRINN
[320] BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
DatologyAI, :, Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
Main category: cs.LG
TL;DR: BeyondWeb is a synthetic data generation framework that produces high-quality pretraining data, outperforming state-of-the-art synthetic datasets by up to 5.1 percentage points while delivering 7.7x faster training than web data.
Details
Motivation: Current LLM pretraining faces diminishing returns from scaling data quantity, hitting a data wall. Synthetic data emerges as a promising solution, but factors affecting synthetic data quality remain poorly understood.Method: Introduces BeyondWeb framework for generating high-quality synthetic pretraining data through joint optimization of multiple factors, including what data to rephrase and how, and considering model size and family impact.
Result: Outperforms Cosmopedia by up to 5.1pp and Nemotron-Synth by 2.6pp across 14 benchmarks. Delivers 7.7x faster training than web data and 2.7x faster than Nemotron-Synth. A 3B model trained on BeyondWeb outperforms an 8B model trained on Cosmopedia with same token budget.
Conclusion: No silver bullet exists for generating high-quality synthetic pretraining data - best outcomes require joint optimization of many factors through rigorous science and practical expertise. Well-executed methods can yield transformative improvements.
Abstract: Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC’s high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there’s no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.
[321] Input Time Scaling
Rapheal Huang, Weilong Guo
Main category: cs.LG
TL;DR: Introduces Input Time Scaling paradigm that refines queries using meta-knowledge from LLMs, challenging the “garbage in, garbage out” principle by showing that seemingly low-quality data can achieve high performance when properly processed during both training and testing.
Details
Motivation: Current LLMs rely on data/training scaling and inference-time scaling, but lack optimization at the input/query level. The paper aims to explore how input refinement strategies can complement existing scaling methods.Method: Proposes Input Time Scaling by combining meta-knowledge from LLMs to refine inputs with different strategies during both training and testing phases (training-testing co-design). Experiments conducted on Qwen2.5-32B-Instruct models.
Result: Achieved SOTA performance among 32B models: 76.7% on AIME24 and AIME25 pass@1, and 80% on AIME25 with majority voting. Surprisingly found that adding irrelevant information and using minimally filtered datasets can perform best, contradicting traditional data quality assumptions.
Conclusion: Input Time Scaling is an effective complementary paradigm to existing scaling methods. Dataset quality assumptions need re-evaluation as seemingly low-quality data can yield high performance when processed correctly. Training-testing co-design is crucial, and simple dataset size scaling should be carefully inspected.
Abstract: Current Large Language Models (LLMs) are usually post-trained on large-scale, carefully curated datasets (data and training scaling) and perform reasoning at test time (inference-time scaling). In this work, we present a new scaling paradigm, Input Time Scaling, which complements previous scaling methods by putting resources into queries (input time). During training and testing, we combine meta-knowledge from LLMs to refine inputs with different strategies. We also identify a new phenomenon, training-testing co-design: query strategies must be applied during both training and testing, and applying them only at training or only at testing seriously degrades performance. We are also surprised to find that datasets of seemingly low quality can yield high performance; adding irrelevant information to the queries, or randomly selecting examples from a minimally filtered dataset, can even perform best. These findings contradict the widely held inductive bias of “garbage in, garbage out”, and curating datasets of seemingly high-quality data can even limit the performance ceiling. In addition, models trained on more data of similar quality (15k vs. 1k examples) perform worse, so simple dataset size scaling should also be carefully inspected. The good news is that our findings are compatible with the Less is More phenomenon: a small set of examples is enough to evoke high-level reasoning ability. In experiments on models trained from Qwen2.5-32B-Instruct, we reach SOTA performance among 32B models on AIME24 (76.7%) and AIME25 (76.7%) pass@1, and further achieve AIME24 (76.7%) and AIME25 (80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the best result is 86.7% on AIME24 and 76.7% on AIME25. To facilitate reproducibility and further research, we are working on open-sourcing our datasets, data pipelines, evaluation results, and checkpoints.
[322] Towards the Use of Saliency Maps for Explaining Low-Quality Electrocardiograms to End Users
Ana Lucic, Sheeraz Ahmad, Amanda Furtado Brinhosa, Vera Liao, Himani Agrawal, Umang Bhatt, Krishnaram Kenthapadi, Alice Xiang, Maarten de Rijke, Nicholas Drabowski
Main category: cs.LG
TL;DR: Developing AI system for real-time quality assessment of medical images with explanations, plus user studies to understand stakeholder needs and evaluate XAI impact on clinical workflows.
Details
Motivation: Prevent need for patient return visits by detecting low-quality medical images in real-time, especially critical for remote patients in telemedicine settings.Method: Three-part approach: (1) AI system development for real-time image quality flagging with explanations, (2) interview study to understand stakeholder explanation needs, (3) longitudinal user study design to evaluate XAI impact on technician workflows.
Result: Ongoing work with no final results yet; proposes the first longitudinal study on XAI effects for non-AI-expert end users in healthcare.
Conclusion: This research addresses critical telemedicine challenges by combining real-time AI quality assessment with explainable AI and user-centered evaluation, seeking feedback on experimental design.
Abstract: When using medical images for diagnosis, either by clinicians or artificial intelligence (AI) systems, it is important that the images are of high quality. When an image is of low quality, the medical exam that produced the image often needs to be redone. In telemedicine, a common problem is that the quality issue is only flagged once the patient has left the clinic, meaning they must return in order to have the exam redone. This can be especially difficult for people living in remote regions, who make up a substantial portion of the patients at Portal Telemedicina, a digital healthcare organization based in Brazil. In this paper, we report on ongoing work regarding (i) the development of an AI system for flagging and explaining low-quality medical images in real-time, (ii) an interview study to understand the explanation needs of stakeholders using the AI system at OurCompany, and, (iii) a longitudinal user study design to examine the effect of including explanations on the workflow of the technicians in our clinics. To the best of our knowledge, this would be the first longitudinal study on evaluating the effects of XAI methods on end-users – stakeholders that use AI systems but do not have AI-specific expertise. We welcome feedback and suggestions on our experimental setup.
[323] Don’t Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning
Andrea Apicella, Francesco Isgrò, Roberto Prevete
Main category: cs.LG
TL;DR: This paper examines data leakage in machine learning, where unintended information contaminates training data, leading to overly optimistic performance estimates that don’t translate to real-world applications.
Details
Motivation: With the increasing accessibility of ML tools, non-expert practitioners often use 'push the button' approaches without understanding underlying algorithms, leading to unreliable outcomes and incorrect performance evaluation due to data leakage issues.Method: The paper categorizes data leakage in ML, discusses how it propagates through ML workflows, explores its connection to specific tasks, investigates its occurrence in Transfer Learning, and compares standard inductive ML with transductive ML frameworks.
Result: The research identifies critical pathways through which data leakage can occur and demonstrates how it affects performance evaluation across different ML paradigms including Transfer Learning.
Conclusion: Addressing data leakage is crucial for robust and reliable ML applications, as it prevents overly optimistic performance estimates and ensures models perform well on new, unseen data in real-world scenarios.
Abstract: Machine Learning (ML) has revolutionized various domains, offering predictive capabilities in several areas. However, with the increasing accessibility of ML tools, many practitioners, lacking deep ML expertise, adopt a “push the button” approach, utilizing user-friendly interfaces without a thorough understanding of underlying algorithms. While this approach provides convenience, it raises concerns about the reliability of outcomes, leading to challenges such as incorrect performance evaluation. This paper addresses a critical issue in ML, known as data leakage, where unintended information contaminates the training data, impacting model performance evaluation. Users, due to a lack of understanding, may inadvertently overlook crucial steps, leading to optimistic performance estimates that may not hold in real-world scenarios. The discrepancy between evaluated and actual performance on new data is a significant concern. In particular, this paper categorizes data leakage in ML, discussing how certain conditions can propagate through the ML workflow. Furthermore, it explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning, and compares standard inductive ML with transductive ML frameworks. The conclusion summarizes key findings, emphasizing the importance of addressing data leakage for robust and reliable ML applications.
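A canonical instance of the leakage the paper warns about is preprocessing fit before the train/test split. A small illustration on synthetic data (sklearn):

```python
# Hedged toy of one common leakage pattern: fitting a scaler on the full
# dataset before splitting leaks test-set statistics into training. The safe
# version fits preprocessing on the training split only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Leaky: test rows influence the scaler's mean and std.
X_leaky = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)

# Safe: fit the scaler on the training split only, then transform both.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
```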
[324] Estimation of Energy-dissipation Lower-bounds for Neuromorphic Learning-in-memory
Zihao Chen, Faiek Ahsan, Johannes Leugering, Gert Cauwenberghs, Shantanu Chakrabartty
Main category: cs.LG
TL;DR: Theoretical energy efficiency analysis for ideal neuromorphic optimizers using compute-in-memory and learning-in-memory paradigms to overcome memory, update, and consolidation walls.
Details
Motivation: Address energy bottlenecks in optimization problems caused by memory access (memory-wall), precision updates (update-wall), and information transfer between memory types (consolidation-wall).Method: Derive theoretical energy-to-solution estimates by modulating energy-barrier of physical memories to match optimization dynamics, capturing out-of-equilibrium thermodynamics of learning.
Result: Model-agnostic energy-efficiency estimates that depend only on number of update operations, model size, convergence speed, and solution precision.
Conclusion: Provides practical framework for estimating lower-bound energy-to-solution metrics for large-scale AI workloads using neuromorphic optimization principles.
Abstract: Neuromorphic or neurally-inspired optimizers rely on local but parallel parameter updates to solve problems that range from quadratic programming to Ising machines. An ideal realization of such an optimizer not only uses a compute-in-memory (CIM) paradigm to address the so-called memory-wall (i.e. energy dissipated due to repeated memory read access), but also uses a learning-in-memory (LIM) paradigm to address the energy bottlenecks due to repeated memory writes at the precision required for optimization (the update-wall), and to address the energy bottleneck due to the repeated transfer of information between short-term and long-term memories (the consolidation-wall). In this paper, we derive theoretical estimates for the energy-to-solution metric that can be achieved by this ideal neuromorphic optimizer, which is realized by modulating the energy-barrier of the physical memories such that the dynamics of memory updates and memory consolidation match the optimization or the annealing dynamics. The analysis presented in this paper captures the out-of-equilibrium thermodynamics of learning, and the resulting energy-efficiency estimates are model-agnostic, depending only on the number of model-update operations (OPS), the model size in terms of number of parameters, the speed of convergence, and the precision of the solution. To show the practical applicability of our results, we apply our analysis to estimate the lower bound on the energy-to-solution metrics for large-scale AI workloads.
[325] A Comprehensive Benchmark on Spectral GNNs: The Impact on Efficiency, Memory, and Effectiveness
Ningyi Liao, Haoyu Liu, Zulun Zhu, Siqiang Luo, Laks V. S. Lakshmanan
Main category: cs.LG
TL;DR: This paper presents a comprehensive benchmarking study of spectral graph neural networks (GNNs), analyzing 35 GNNs with 27 spectral filters to provide practical guidelines for model selection and deployment on large-scale graphs.
Details
Motivation: Despite recent advancements in spectral GNNs, there is a lack of systematic studies to benchmark their efficiency, memory consumption, and effectiveness in a unified manner. The need exists to select appropriate spectral models for specific graph data and deploy them to massive web-scale graphs.Method: The authors analyze and categorize 35 GNNs with 27 corresponding spectral filters, then implement them within a unified spectral-oriented framework with dedicated graph computations and efficient training schemes. They conduct thorough experiments with comprehensive metrics on effectiveness and efficiency across various graph scales.
Result: The benchmark reveals an intricate landscape regarding spectral graph filters’ effectiveness and efficiency, demonstrating that desirable performance can be achieved through tailored spectral manipulation of graph data. Their implementation enables deployment on million-scale graphs with comparable performance and less overhead.
Conclusion: The study provides novel observations and practical guidelines for spectral GNN selection and deployment, challenging prevailing beliefs about spectral graph filters and showing their potential for efficient large-scale graph processing through proper spectral data manipulation.
Abstract: With recent advancements in graph neural networks (GNNs), spectral GNNs have received increasing popularity by virtue of their ability to retrieve graph signals in the spectral domain. These models feature uniqueness in efficient computation as well as rich expressiveness, which stems from advanced management and profound understanding of graph data. However, few systematic studies have been conducted to assess spectral GNNs, particularly in benchmarking their efficiency, memory consumption, and effectiveness in a unified and fair manner. There is also a pressing need to select spectral models suitable for learning specific graph data and deploying them to massive web-scale graphs, which is currently constrained by the varied model designs and training settings. In this work, we extensively benchmark spectral GNNs with a focus on the spectral perspective, demystifying them as spectral graph filters. We analyze and categorize 35 GNNs with 27 corresponding filters, spanning diverse formulations and utilizations of the graph data. Then, we implement the filters within a unified spectral-oriented framework with dedicated graph computations and efficient training schemes. In particular, our implementation enables the deployment of spectral GNNs over million-scale graphs and various tasks with comparable performance and less overhead. Thorough experiments are conducted on the graph filters with comprehensive metrics on effectiveness and efficiency, offering novel observations and practical guidelines that are only available from our evaluations across graph scales. Different from the prevailing belief, our benchmark reveals an intricate landscape regarding the effectiveness and efficiency of spectral graph filters, demonstrating the potential to achieve desirable performance through tailored spectral manipulation of graph data.
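The "spectral graph filter" view the benchmark adopts can be illustrated with a K-order polynomial filter, which many of the surveyed models instantiate for particular coefficient choices; the dense-matrix toy below trades efficiency for readability (real implementations use sparse propagation):

```python
# Hedged illustration of a polynomial spectral filter: the output is
# sum_k theta_k * L_hat^k @ X, where L_hat is the normalized Laplacian.
import numpy as np

def polynomial_filter(adj: np.ndarray, X: np.ndarray, theta: np.ndarray):
    """adj: (n, n) adjacency; X: (n, d) features; theta: (K+1,) coefficients."""
    deg = adj.sum(1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L_hat = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt  # normalized Laplacian
    out, power = theta[0] * X, X
    for k in range(1, len(theta)):
        power = L_hat @ power          # next Laplacian power applied to X
        out = out + theta[k] * power
    return out
```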
[326] Testing Components of the Attention Schema Theory in Artificial Neural Networks
Kathryn T. Farrell, Kirsten Ziman, Michael S. A. Graziano
Main category: cs.LG
TL;DR: Adding an attention schema to transformer-based agents improves their ability to judge, categorize, and cooperate with other agents by making attention states more interpretable and predictable.
Details
Motivation: To investigate whether an attention schema (simplified model of attention) provides computational benefits for artificial agents in judging others' attention states and enabling cooperation, similar to proposed benefits in biological brains.Method: Used neural networks with transformer attention mechanisms, comparing agents with and without attention schemas on tasks involving attention state categorization and cooperative painting tasks.
Result: Agents with attention schemas showed: 1) higher accuracy in categorizing others’ attention states, 2) developed more easily categorizable attention patterns, 3) improved performance in cooperative tasks, and 4) these benefits were specific to attention-related tasks rather than general network complexity.
Conclusion: Attention schemas provide specific computational advantages for mutual interpretability and interactive behavior between agents, supporting the hypothesis that similar principles may operate in biological attention systems.
Abstract: Growing evidence suggests that the brain uses an attention schema, or a simplified model of attention, to help control what it attends to. One proposed benefit of this model is to allow agents to model the attention states of other agents, and thus predict and interact with other agents. The effects of an attention schema may be examined in artificial agents. Although attention mechanisms in artificial agents are different from in biological brains, there may be some principles in common. In both cases, select features or representations are emphasized for better performance. Here, using neural networks with transformer attention mechanisms, we asked whether the addition of an attention schema affected the ability of agents to make judgements about and cooperate with each other. First, we found that an agent with an attention schema is better at categorizing the attention states of other agents (higher accuracy). Second, an agent with an attention schema develops a pattern of attention that is easier for other agents to categorize. Third, in a joint task where two agents must predict each other to paint a scene together, adding an attention schema improves performance. Finally, the performance improvements are not caused by a general increase in network complexity. Instead, improvement is specific to tasks involving judging, categorizing, or predicting the attention of other agents. These results support the hypothesis that an attention schema has computational properties beneficial to mutual interpretability and interactive behavior. We speculate that the same principles might pertain to biological attention and attention schemas in people.
[327] Fluorescence molecular optomic signatures improve identification of tumors in head and neck specimens
Yao Chen, Samuel S. Streeter, Brady Hunt, Hira S. Sardar, Jason R. Gunn, Laura J. Tafe, Joseph A. Paydarfar, Brian W. Pogue, Keith D. Paulsen, Kimberley S. Samkoe
Main category: cs.LG
TL;DR: Optomics extends radiomics to fluorescence imaging, using texture analysis of EGFR expression patterns to improve tumor detection accuracy over simple intensity thresholding in head and neck cancer surgery.
Details
Motivation: Fluorescence molecular imaging for surgical guidance is limited by heterogeneous EGFR expression, requiring better methods to distinguish tumor from normal tissue beyond simple intensity measurements.Method: Extracted 1,472 standardized optomic features from fluorescence images, used minimum redundancy maximum relevance to select top 25 features, and trained a support vector machine classifier for tissue classification.
Result: Optomics achieved 89% mean accuracy vs 81% for intensity thresholding (P=0.0072), showing consistent improvement across all test samples regardless of dose.
Conclusion: Extending radiomics to fluorescence imaging (optomics) provides a promising technique for improved cancer detection in fluorescence-guided surgery.
Abstract: In this study, a radiomics approach was extended to optical fluorescence molecular imaging data for tissue classification, termed ‘optomics’. Fluorescence molecular imaging is emerging for precise surgical guidance during head and neck squamous cell carcinoma (HNSCC) resection. However, the tumor-to-normal tissue contrast is confounded by intrinsic physiological limitations of heterogeneous expression of the target molecule, epidermal growth factor receptor (EGFR). Optomics seeks to improve tumor identification by probing textural pattern differences in EGFR expression conveyed by fluorescence. A total of 1,472 standardized optomic features were extracted from fluorescence image samples. A supervised machine learning pipeline involving a support vector machine classifier was trained with 25 top-ranked features selected by the minimum redundancy maximum relevance criterion. Model predictive performance was compared to a fluorescence intensity thresholding method by classifying testing set image patches of resected tissue with histologically confirmed malignancy status. The optomics approach provided consistent improvement in prediction accuracy on all test set samples, irrespective of dose, compared to the fluorescence intensity thresholding method (mean accuracies of 89% vs. 81%; P = 0.0072). The improved performance demonstrates that extending the radiomics approach to fluorescence molecular imaging data offers a promising image analysis technique for cancer detection in fluorescence-guided surgery.
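A hedged sketch of the pipeline shape described above: feature selection followed by an SVM. mRMR itself is not in scikit-learn, so mutual-information ranking stands in for it here, and the feature matrix is random rather than real optomic features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1472))   # stand-in for 1,472 optomic features per patch
y = rng.integers(0, 2, size=200)   # stand-in malignancy labels

clf = make_pipeline(
    StandardScaler(),
    SelectKBest(mutual_info_classif, k=25),  # stand-in for mRMR top-25 ranking
    SVC(kernel="rbf"),
)
clf.fit(X, y)
```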
[328] Behind the Myth of Exploration in Policy Gradients
Adrien Bolland, Gaspard Lambrechts, Damien Ernst
Main category: cs.LG
TL;DR: Exploration techniques in policy-gradient algorithms smooth the learning objective and improve gradient estimates, leading to better optimization and near-optimal policies.
Details
Motivation: To understand why intrinsic exploration terms are effective in policy-gradient algorithms, moving beyond the traditional 'need to explore' justification and analyzing from a numerical optimization perspective.Method: Introduces four criteria: two on the learning objective (smoothness and preservation of global optimum) and two on stochastic gradient estimates (quality and convergence properties). Analyzes exploration techniques through this optimization lens.
Result: Exploration techniques serve two key functions: smoothing the objective function to eliminate local optima while keeping the global maximum, and improving gradient estimates to increase probability of finding optimal policies.
Conclusion: The optimization-based analysis provides new insights into exploration mechanisms, revealing both objective smoothing and gradient improvement effects, with empirical validation showing limitations and suggesting future research directions.
Abstract: In order to compute near-optimal policies with policy-gradient algorithms, it is common in practice to include intrinsic exploration terms in the learning objective. Although the effectiveness of these terms is usually justified by an intrinsic need to explore environments, we propose a novel analysis with the lens of numerical optimization. Two criteria are introduced on the learning objective and two others on its stochastic gradient estimates, and are afterwards used to discuss the quality of the policy after optimization. The analysis sheds light on two separate effects of exploration techniques. First, they make it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Second, they modify the gradient estimates, increasing the probability that the stochastic parameter updates eventually provide an optimal policy. We empirically illustrate these effects with exploration strategies based on entropy bonuses, identifying limitations and suggesting directions for future work.
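For readers who want the object of study on the page: a minimal sketch of a policy-gradient loss with an entropy bonus, the kind of intrinsic exploration term whose smoothing effect the paper analyzes. Names and the coefficient are illustrative.

```python
import torch

def pg_loss_with_entropy(logits, actions, returns, beta=0.01):
    dist = torch.distributions.Categorical(logits=logits)
    pg = -(dist.log_prob(actions) * returns).mean()  # vanilla policy-gradient term
    entropy = dist.entropy().mean()                  # the smoothing/exploration term
    return pg - beta * entropy                       # maximizing entropy => subtract

logits = torch.randn(32, 4, requires_grad=True)      # batch of action logits
actions = torch.randint(0, 4, (32,))
returns = torch.randn(32)
pg_loss_with_entropy(logits, actions, returns).backward()
```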
[329] Deep Exploration with PAC-Bayes
Bahareh Tasdighi, Manuel Haussmann, Nicklas Werge, Yi-Shan Wu, Melih Kandemir
Main category: cs.LG
TL;DR: PBAC is a novel PAC-Bayesian actor-critic algorithm that enables deep exploration in continuous control tasks with delayed rewards, outperforming existing methods.
Details
Motivation: Reinforcement learning for continuous control under delayed rewards is under-explored but crucial for real-world applications where complex skills build on intermediate prerequisites. Existing deep exploration methods are designed for discrete spaces and don't generalize well to continuous control.Method: Quantifies Bellman operator error through PAC-Bayes bound using bootstrapped critic ensemble as posterior distribution. Uses critic-specific targets as data-informed prior. Trains shared trunk with critic-specific actor heads. Performs exploration by acting epsilon-softly on randomly chosen actor head.
Result: PBAC is the only algorithm to consistently discover delayed rewards on continuous control tasks with varying difficulty levels.
Conclusion: The PAC-Bayesian approach successfully addresses deep exploration in continuous control, enabling agents to learn complex skills with delayed rewards through principled uncertainty quantification and ensemble-based exploration.
Abstract: Reinforcement learning (RL) for continuous control under delayed rewards is an under-explored problem despite its significance in real-world applications. Many complex skills are based on intermediate ones as prerequisites. For instance, a humanoid locomotor must learn how to stand before it can learn to walk. To cope with delayed reward, an agent must perform deep exploration. However, existing deep exploration methods are designed for small discrete action spaces, and their generalization to state-of-the-art continuous control remains unproven. We address the deep exploration problem for the first time from a PAC-Bayesian perspective in the context of actor-critic learning. To do this, we quantify the error of the Bellman operator through a PAC-Bayes bound, where a bootstrapped ensemble of critic networks represents the posterior distribution, and their targets serve as a data-informed function-space prior. We derive an objective function from this bound and use it to train the critic ensemble. Each critic trains an individual soft actor network, implemented as a shared trunk and critic-specific heads. The agent performs deep exploration by acting epsilon-softly on a randomly chosen actor head. Our proposed algorithm, named PAC-Bayesian Actor-Critic (PBAC), is the only algorithm to consistently discover delayed rewards on continuous control tasks with varying difficulty.
[330] Sample Selection Bias in Machine Learning for Healthcare
Vinod Kumar Chauhan, Lei Clifton, Achille Salaün, Huiqi Yvonne Lu, Kim Branson, Patrick Schwab, Gaurav Nigam, David A. Clifton
Main category: cs.LG
TL;DR: Proposes new approach for sample selection bias in healthcare ML by identifying target subpopulations rather than correcting bias, using dual-network architectures that outperform existing methods.
Details
Motivation: Clinical adoption of ML is limited by biases like sample selection bias (SSB), where study populations don't represent target populations, leading to unreliable predictions. Existing correction methods often sacrifice predictive performance.Method: Proposes two network architectures: T-Net (two independent networks) and MT-Net (multitasking network). One network/task identifies target subpopulation representative of study population, second makes predictions for identified subpopulation.
Result: SSB causes significant performance drop for target population vs study population. Proposed techniques show robustness across dataset sizes, event rates, and selection rates, outperforming existing bias correction methods.
Conclusion: Focusing on target population identification rather than bias correction provides more effective approach for handling sample selection bias in healthcare ML applications.
Abstract: While machine learning algorithms hold promise for personalised medicine, their clinical adoption remains limited, partly due to biases that can compromise the reliability of predictions. In this paper, we focus on sample selection bias (SSB), a specific type of bias where the study population is less representative of the target population, leading to biased and potentially harmful decisions. Despite being well-known in the literature, SSB remains scarcely studied in machine learning for healthcare. Moreover, the existing machine learning techniques try to correct the bias mostly by balancing distributions between the study and the target populations, which may result in a loss of predictive performance. To address these problems, our study illustrates the potential risks associated with SSB by examining SSB’s impact on the performance of machine learning algorithms. Most importantly, we propose a new research direction for addressing SSB, based on the target population identification rather than the bias correction. Specifically, we propose two independent networks (T-Net) and a multitasking network (MT-Net) for addressing SSB, where one network/task identifies the target subpopulation which is representative of the study population and the second makes predictions for the identified subpopulation. Our empirical results with synthetic and semi-synthetic datasets highlight that SSB can lead to a large drop in the performance of an algorithm for the target population as compared with the study population, as well as a substantial difference in the performance for the target subpopulations that are representative of the selected and the non-selected patients from the study population. Furthermore, our proposed techniques demonstrate robustness across various settings, including different dataset sizes, event rates, and selection rates, outperforming the existing bias correction techniques.
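A sketch of the T-Net idea in miniature: one network decides whether a patient resembles the study population, and a second predicts outcomes only for those it admits. Architectures and the threshold are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

selector = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))   # in study population?
predictor = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))  # clinical outcome

x_target = torch.randn(256, 20)                          # target-population patients
in_study = torch.sigmoid(selector(x_target)) > 0.5       # identified target subpopulation
risk = torch.sigmoid(predictor(x_target[in_study.squeeze(-1)]))  # predict only for them
```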
[331] One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks
Quan Nguyen, Thanh Nguyen-Tang
Main category: cs.LG
TL;DR: One-layer transformers can achieve Bayes-optimal performance for in-context reasoning tasks with provable convergence rates and generalization guarantees, supported by both theoretical analysis and empirical validation.
Details
Motivation: Existing theoretical work on transformers' in-context reasoning capabilities is limited to either initial gradient steps or infinite sample scenarios, lacking convergence rates and generalization analysis.Method: Theoretical analysis of one-layer transformers with linear and ReLU attention, trained with gradient descent, using finite-sample analysis to study convergence and generalization properties.
Result: Proven existence of transformers that are Bayes-optimal, with linear convergence rate to Bayes risk and demonstrated generalization to unseen samples, matching empirical observations from prior work.
Conclusion: One-layer transformers possess strong theoretical foundations for in-context reasoning with provable optimality, convergence, and generalization properties, bridging theoretical understanding with empirical observations.
Abstract: We study the approximation capabilities and convergence behaviors of one-layer transformers on the noiseless and noisy in-context reasoning of next-token prediction. Existing theoretical results focus on understanding the in-context reasoning behaviors for either the first gradient step or when the number of samples is infinite. Furthermore, no convergence rates or generalization guarantees were known. Our work addresses these gaps by showing that there exists a class of one-layer transformers that are provably Bayes-optimal with both linear and ReLU attention. When trained with gradient descent, we show via a finite-sample analysis that the expected loss of these transformers converges at a linear rate to the Bayes risk. Moreover, we prove that the trained models generalize to unseen samples as well as exhibit learning behaviors that were empirically observed in previous works. Our theoretical findings are further supported by extensive empirical validations.
[332] Improving Actor-Critic Training with Steerable Action-Value Approximation Errors
Bahareh Tasdighi, Nicklas Werge, Yi-Shan Wu, Melih Kandemir
Main category: cs.LG
TL;DR: USAC is a novel off-policy actor-critic framework that provides independent control over pessimism and optimism levels for both actor and critic, using utility functions to dynamically balance exploration based on critic uncertainty.
Details
Motivation: Existing off-policy actor-critic algorithms face a trade-off: excessive pessimism limits exploration while excessive optimism leads to risky behaviors and instability. There's a need for fine-grained control over optimism-pessimism balance.Method: Proposes Utility Soft Actor-Critic (USAC) framework with utility functions that dynamically adapt exploration strategy based on critic uncertainty, enabling independent and interpretable control of pessimism and optimism for both actor and critic components.
Result: Experiments across continuous control tasks show that adjusting pessimism/optimism levels significantly impacts performance. When properly configured, USAC consistently outperforms state-of-the-art algorithms.
Conclusion: USAC provides a theoretically meaningful and practically feasible approach to balance optimism and pessimism beyond binary choices, demonstrating superior performance in continuous control tasks through adaptive exploration strategies.
Abstract: Off-policy actor-critic algorithms have shown strong potential in deep reinforcement learning for continuous control tasks. Their success primarily comes from leveraging pessimistic state-action value function updates, which reduce function approximation errors and stabilize learning. However, excessive pessimism can limit exploration, preventing the agent from effectively refining its policies. Conversely, optimism can encourage exploration but may lead to high-risk behaviors and unstable learning if not carefully managed. To address this trade-off, we propose Utility Soft Actor-Critic (USAC), a novel framework that allows independent, interpretable control of pessimism and optimism for both the actor and the critic. USAC dynamically adapts its exploration strategy based on the uncertainty of critics using a utility function, enabling a task-specific balance between optimism and pessimism. This approach goes beyond binary choices of pessimism or optimism, making the method both theoretically meaningful and practically feasible. Experiments across a variety of continuous control tasks show that adjusting the degree of pessimism or optimism significantly impacts performance. When configured appropriately, USAC consistently outperforms state-of-the-art algorithms, demonstrating its practical utility and feasibility.
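As a sketch of the knob USAC exposes, the exponential utility below aggregates an ensemble of critic estimates with a single parameter that interpolates between pessimism and optimism; the exact utility family in the paper may differ.

```python
import math
import torch

def utility_value(q_ensemble: torch.Tensor, beta: float) -> torch.Tensor:
    """q_ensemble: (n_critics, batch). beta < 0 is pessimistic, beta > 0 is
    optimistic, and beta -> 0 recovers the plain ensemble mean."""
    if abs(beta) < 1e-8:
        return q_ensemble.mean(dim=0)
    n = q_ensemble.shape[0]
    return (torch.logsumexp(beta * q_ensemble, dim=0) - math.log(n)) / beta

qs = torch.randn(5, 8)                  # five critics, batch of eight states
pessimistic = utility_value(qs, -2.0)   # penalizes critic disagreement
optimistic = utility_value(qs, +2.0)    # rewards critic disagreement
```

Separate beta values for the actor and the critic would give the independent control the abstract describes.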
[333] Adaptive Experiments Under Data Sparse Settings: Applications for Educational Platforms
Haochen Song, Ilya Musabirov, Ananya Bhattacharjee, Audrey Durand, Meredith Franklin, Anna Rafferty, Joseph Jay Williams
Main category: cs.LG
TL;DR: WAPTS algorithm improves adaptive experimentation in education by refining Thompson Sampling for better content allocation in data-sparse environments using lenient regret principle.
Details
Motivation: Standard adaptive strategies like Thompson Sampling underperform in real-world educational settings with numerous content variations and limited student participation, leading to sparse data and imbalanced content allocation.Method: Weighted Allocation Probability Adjusted Thompson Sampling (WAPTS) refines sampling strategy with lenient regret principle to improve content-related decision-making, allowing near-optimal allocations to accelerate learning while exploring promising content.
Result: WAPTS enables earlier and more reliable identification of promising treatments in learnersourcing scenarios where students rate peer-generated learning materials.
Conclusion: WAPTS addresses the limitations of standard adaptive strategies in educational platforms by providing improved content allocation and faster convergence in data-sparse environments.
Abstract: Adaptive experimentation is increasingly used in educational platforms to personalize learning through dynamic content and feedback. However, standard adaptive strategies such as Thompson Sampling often underperform in real-world educational settings where content variations are numerous and student participation is limited, resulting in sparse data. In particular, Thompson Sampling can lead to imbalanced content allocation and delayed convergence on which aspects of content are most effective for student learning. To address these challenges, we introduce Weighted Allocation Probability Adjusted Thompson Sampling (WAPTS), an algorithm that refines the sampling strategy to improve content-related decision-making in data-sparse environments. WAPTS is guided by the principle of lenient regret, allowing near-optimal allocations to accelerate learning while still exploring promising content. We evaluate WAPTS in a learnersourcing scenario where students rate peer-generated learning materials, and demonstrate that it enables earlier and more reliable identification of promising treatments.
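A hedged sketch of the general idea, not the published update rule: Thompson Sampling whose draws are blended with posterior means so that near-optimal arms keep receiving allocation even when data are sparse.

```python
import numpy as np

rng = np.random.default_rng(1)
successes = np.ones(4)   # Beta(1, 1) prior per content arm
failures = np.ones(4)

def wapts_like_choice(weight=0.5):
    draws = rng.beta(successes, failures)        # standard TS exploration
    means = successes / (successes + failures)   # exploitation anchor
    return int(np.argmax(weight * draws + (1 - weight) * means))

true_rates = [0.20, 0.50, 0.55, 0.30]            # simulated rating probabilities
for _ in range(500):
    arm = wapts_like_choice()
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward
```

Under lenient regret, allocating to the 0.50 arm while the 0.55 arm is still uncertain costs little, which is exactly what the blended score tolerates.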
[334] Generalizable Spectral Embedding with an Application to UMAP
Nir Ben-Ari, Amitai Yacobi, Uri Shaham
Main category: cs.LG
TL;DR: Sep-SpectralNet addresses three key limitations of Spectral Embedding: generalizability, scalability, and eigenvectors separation, extending SpectralNet with efficient post-processing to achieve all three simultaneously.
Details
Motivation: Current Spectral Embedding implementations only address two out of three major drawbacks (generalizability, scalability, eigenvectors separation), limiting broader applicability across domains.Method: Extends SpectralNet with an efficient post-processing step to achieve eigenvectors separation while maintaining generalizability (out-of-sample extension) and scalability.
Result: Empirically demonstrates consistent approximation and generalization of Spectral Embedding, maintains scalability, and enables generalizable UMAP visualization.
Conclusion: Sep-SpectralNet successfully addresses all three limitations of Spectral Embedding, expanding its applicability to wider range of tasks and enhancing performance in existing applications.
Abstract: Spectral Embedding (SE) is a popular method for dimensionality reduction, applicable across diverse domains. Nevertheless, its current implementations face three prominent drawbacks which curtail its broader applicability: generalizability (i.e., out-of-sample extension), scalability, and eigenvectors separation. Existing SE implementations often address two of these drawbacks; however, they fall short in addressing the remaining one. In this paper, we introduce Sep-SpectralNet (eigenvector-separated SpectralNet), a SE implementation designed to address all three limitations. Sep-SpectralNet extends SpectralNet with an efficient post-processing step to achieve eigenvectors separation, while ensuring both generalizability and scalability. This method expands the applicability of SE to a wider range of tasks and can enhance its performance in existing applications. We empirically demonstrate Sep-SpectralNet’s ability to consistently approximate and generalize SE, while maintaining SpectralNet’s scalability. Additionally, we show how Sep-SpectralNet can be leveraged to enable generalizable UMAP visualization. Our codes are publicly available.
[335] No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets
Corinna Coupette, Jeremy Wayland, Emily Simons, Bastian Rieck
Main category: cs.LG
TL;DR: A framework called Rings for evaluating graph-learning dataset quality through mode perturbation and two new measures: performance separability and mode complementarity.
Details
Motivation: Recent research shows that methods ignoring graph structure can outperform graph-based approaches, raising questions about what makes good graph-learning datasets and how to evaluate dataset quality.Method: Introduces Rings framework with dataset ablations that perturb graph structure and node features, proposing two evaluation measures: performance separability and mode complementarity.
Result: Demonstrated utility through extensive experiments on graph-level tasks, providing actionable recommendations for improving graph-learning method evaluation.
Conclusion: Opens new research directions in data-centric graph learning and represents a step toward systematic evaluation of evaluations in graph learning.
Abstract: Benchmark datasets have proved pivotal to the success of graph learning, and good benchmark datasets are crucial to guide the development of the field. Recent research has highlighted problems with graph-learning datasets and benchmarking practices – revealing, for example, that methods which ignore the graph structure can outperform graph-based approaches. Such findings raise two questions: (1) What makes a good graph-learning dataset, and (2) how can we evaluate dataset quality in graph learning? Our work addresses these questions. As the classic evaluation setup uses datasets to evaluate models, it does not apply to dataset evaluation. Hence, we start from first principles. Observing that graph-learning datasets uniquely combine two modes – graph structure and node features – we introduce Rings, a flexible and extensible mode-perturbation framework to assess the quality of graph-learning datasets based on dataset ablations – i.e., quantifying differences between the original dataset and its perturbed representations. Within this framework, we propose two measures – performance separability and mode complementarity – as evaluation tools, each assessing the capacity of a graph dataset to benchmark the power and efficacy of graph-learning methods from a distinct angle. We demonstrate the utility of our framework for dataset evaluation via extensive experiments on graph-level tasks and derive actionable recommendations for improving the evaluation of graph-learning methods. Our work opens new research directions in data-centric graph learning, and it constitutes a step toward the systematic evaluation of evaluations.
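The ablation logic is simple to sketch. Two illustrative mode perturbations in the spirit of the framework; the paper's actual perturbation set is richer.

```python
import numpy as np

rng = np.random.default_rng(0)

def ablate_features(X: np.ndarray) -> np.ndarray:
    """Shuffle node features across nodes, breaking the feature-structure link."""
    return rng.permutation(X, axis=0)

def ablate_structure(A: np.ndarray) -> np.ndarray:
    """Replace the graph with an empty one, removing the structure mode."""
    return np.zeros_like(A)

# Benchmarking a model on (X, A), (ablate_features(X), A), and
# (X, ablate_structure(A)) reveals how much each mode contributes: the raw
# material for separability and complementarity measures.
```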
[336] Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting
Timothée Hornek, Amir Sartipi, Igor Tchappi, Gilbert Fridgen
Main category: cs.LG
TL;DR: Benchmark study comparing pretrained time series foundation models against traditional methods for electricity price forecasting, finding that traditional biseasonal MSTL model still outperforms all AI models.
Details
Motivation: To evaluate whether recent advances in generative AI and pretrained large language models for time series forecasting are effective for electricity price forecasting, which is crucial for power trading decisions.Method: Benchmarked several state-of-the-art pretrained models (Chronos-Bolt, Chronos-T5, TimesFM, Moirai, Time-MoE, TimeGPT) against established statistical and ML methods using 2024 day-ahead auction electricity prices from 5 European countries with daily forecasts and one-day horizon.
Result: Chronos-Bolt and Time-MoE were the strongest among TSFMs, performing on par with traditional models. However, the biseasonal MSTL model (capturing daily/weekly seasonality) showed consistent performance across all countries and metrics, with no TSFM statistically outperforming it.
Conclusion: While some pretrained time series foundation models perform competitively, traditional methods like MSTL that explicitly model seasonality patterns remain superior for electricity price forecasting, suggesting current TSFMs may not yet fully capture the specific characteristics of energy markets.
Abstract: Accurate electricity price forecasting (EPF) is crucial for effective decision-making in power trading on the spot market. While recent advances in generative artificial intelligence (GenAI) and pre-trained large language models (LLMs) have inspired the development of numerous time series foundation models (TSFMs) for time series forecasting, their effectiveness in EPF remains uncertain. To address this gap, we benchmark several state-of-the-art pretrained models – Chronos-Bolt, Chronos-T5, TimesFM, Moirai, Time-MoE, and TimeGPT – against established statistical and machine learning (ML) methods for EPF. Using 2024 day-ahead auction (DAA) electricity prices from Germany, France, the Netherlands, Austria, and Belgium, we generate daily forecasts with a one-day horizon. Chronos-Bolt and Time-MoE emerge as the strongest among the TSFMs, performing on par with traditional models. However, the biseasonal MSTL model, which captures daily and weekly seasonality, stands out for its consistent performance across countries and evaluation metrics, with no TSFM statistically outperforming it.
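The winning baseline is reproducible with off-the-shelf tooling. A minimal sketch of a biseasonal MSTL decomposition on hourly prices, using synthetic data in place of the DAA series; forecasting the deseasonalized remainder is omitted.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import MSTL

hours = pd.date_range("2024-01-01", periods=24 * 7 * 8, freq="h")
t = np.arange(len(hours))
prices = pd.Series(
    50
    + 10 * np.sin(2 * np.pi * t / 24)        # daily cycle
    + 5 * np.sin(2 * np.pi * t / (24 * 7))   # weekly cycle
    + np.random.default_rng(0).normal(0, 2, len(hours)),
    index=hours,
)
res = MSTL(prices, periods=(24, 24 * 7)).fit()   # biseasonal: daily and weekly
print(res.seasonal.head())                       # one seasonal component per period
```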
[337] Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions
Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, Sitan Chen
Main category: cs.LG
TL;DR: Masked diffusion models trade training complexity for inference flexibility, but face computationally intractable subproblems during training. Adaptive decoding strategies at inference time can significantly boost performance, achieving 90% accuracy on Sudoku puzzles compared to <7% without adaptation.
Details
Motivation: To understand the trade-offs between masked diffusion models (MDMs) and autoregressive models (ARMs) - MDMs offer inference flexibility but face training challenges with exponentially many infilling problems, while ARMs have simpler training but fixed decoding order.Method: Theoretical and empirical analysis of MDM training complexity compared to ARMs, plus development of adaptive token decoding strategies at inference time to help MDMs avoid hard subproblems.
Result: Adaptive inference strategies dramatically improved MDM performance on logic puzzles like Sudoku, boosting accuracy from <7% to ≈90%, even outperforming larger ARMs specifically trained with optimal decoding order.
Conclusion: While MDMs face computationally intractable training challenges, adaptive inference strategies can effectively mitigate these limitations, making MDMs highly competitive with and even superior to ARMs despite the training complexity trade-off.
Abstract: In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work, we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from $<7\%$ to $\approx 90\%$, even outperforming ARMs that have $7\times$ as many parameters and were explicitly trained via teacher forcing to learn the right order of decoding.
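A sketch of an adaptive decoding rule of the kind the paper studies: at each step, commit the masked position where the model is most confident, so hard subproblems are deferred until context makes them easy. The model interface is a stand-in for any masked-token predictor.

```python
import torch

@torch.no_grad()
def adaptive_decode(model, tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
    """tokens: 1-D sequence containing mask_id at undecided positions."""
    while True:
        masked = (tokens == mask_id).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            return tokens
        logits = model(tokens.unsqueeze(0)).squeeze(0)   # (seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)          # per-position confidence
        pos = masked[conf[masked].argmax()]              # easiest subproblem first
        tokens[pos] = pred[pos]                          # unmask one token per step
```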
[338] AtmosMJ: Revisiting Gating Mechanism for AI Weather Forecasting Beyond the Year Scale
Minjong Cheon
Main category: cs.LG
TL;DR: AtmosMJ challenges the assumption that non-standard spatial domains are needed for stable long-range weather forecasting by achieving 500-day stable forecasts on standard latitude-longitude grid using a novel Gated Residual Fusion mechanism.
Details
Motivation: Current state-of-the-art weather models rely on non-standard spatial representations (spherical harmonics, HEALPix meshes) for long-range stability, but this paper questions whether such representations are actually necessary.Method: Introduces AtmosMJ, a deep convolutional network operating directly on ERA5 data without spherical remapping, using Gated Residual Fusion (GRF) to prevent error accumulation in long recursive simulations.
Result: AtmosMJ produces stable forecasts for ~500 days, achieves competitive 10-day accuracy against models like Pangu-Weather and GraphCast, and requires only 5.7 days of V100 GPU training.
Conclusion: Efficient architectural design (Gated Residual Fusion) rather than non-standard data representation is key to stable and computationally efficient long-range weather prediction.
Abstract: The advent of Large Weather Models (LWMs) has marked a turning point in data-driven forecasting, with many models now outperforming traditional numerical systems in the medium range. However, achieving stable, long-range autoregressive forecasts beyond a few weeks remains a significant challenge. Prevailing state-of-the-art models that achieve year-long stability, such as SFNO and DLWP-HPX, have relied on transforming input data onto non-standard spatial domains like spherical harmonics or HEALPix meshes. This has led to the prevailing assumption that such representations are necessary to enforce physical consistency and long-term stability. This paper challenges that assumption by investigating whether comparable long-range performance can be achieved on the standard latitude-longitude grid. We introduce AtmosMJ, a deep convolutional network that operates directly on ERA5 data without any spherical remapping. The model’s stability is enabled by a novel Gated Residual Fusion (GRF) mechanism, which adaptively moderates feature updates to prevent error accumulation over long recursive simulations. Our results demonstrate that AtmosMJ produces stable and physically plausible forecasts for about 500 days. In quantitative evaluations, it achieves competitive 10-day forecast accuracy against models like Pangu-Weather and GraphCast, all while requiring a remarkably low training budget of 5.7 days on a V100 GPU. Our findings suggest that efficient architectural design, rather than non-standard data representation, can be the key to unlocking stable and computationally efficient long-range weather prediction.
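A minimal sketch of a gated residual fusion block under the common reading "residual plus sigmoid-gated update"; the paper's exact GRF layout may differ.

```python
import torch
import torch.nn as nn

class GatedResidualFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.update = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))   # per-location gate in (0, 1)
        return x + g * self.update(x)     # gate moderates each feature update

x = torch.randn(1, 16, 32, 64)            # (batch, channels, lat, lon)
out = GatedResidualFusion(16)(x)          # same shape, gated residual applied
```

The gate can shrink updates that would otherwise compound over hundreds of recursive steps, which is the stated role of the mechanism.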
[339] Low-rank bias, weight decay, and model merging in neural networks
Ilja Kuzborskij, Yasin Abbasi Yadkori
Main category: cs.LG
TL;DR: Analysis of low-rank structure in neural network weight matrices at stationary points with L2 regularization, showing parameter alignment, norm preservation, and low-rank bias. Also demonstrates multitask learning through weight summation when training sets are orthogonal.
Details
Motivation: To understand the structural properties induced by L2 regularization in neural networks at stationary points, and explore how these properties enable multitask learning capabilities through simple weight combination.Method: Theoretical analysis of stationary points with L2 regularization, examining parameter alignment, norm preservation, and low-rank bias. Experimental validation with shallow ReLU networks trained by gradient descent and deep linear networks trained by gradient flow.
Result: Shows that L2 regularization induces alignment between parameters and gradients, preserves norms across layers, and creates low-rank bias. Demonstrates that summing weights of networks trained on orthogonal datasets performs well on both tasks.
Conclusion: L2 regularization creates structured stationary points with beneficial properties for neural networks, enabling efficient multitask learning through simple weight combination when training data exhibits orthogonality properties.
Abstract: We explore the low-rank structure of the weight matrices in neural networks at the stationary points (limiting solutions of optimization algorithms) with $L_2$ regularization (also known as weight decay). We show several properties of such deep neural networks, induced by $L_2$ regularization. In particular, for a stationary point we show alignment of the parameters and the gradient, norm preservation across layers, and low-rank bias: properties previously known in the context of solutions of gradient-descent/flow-type algorithms. Experiments show that the assumptions made in the analysis only mildly affect the observations. In addition, we investigate a multitask learning phenomenon enabled by $L_2$ regularization and low-rank bias. In particular, we show that if two networks are trained, such that the inputs in the training set of one network are approximately orthogonal to the inputs in the training set of the other network, the new network obtained by simply summing the weights of the two networks will perform as well on both training sets as the respective individual networks. We demonstrate this for shallow ReLU neural networks trained by gradient descent, as well as deep linear networks trained by gradient flow.
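The multitask merging experiment is easy to state in code. A sketch assuming the shallow-network setup; the nets here are untrained stand-ins for the two trained networks.

```python
import torch
import torch.nn as nn

def merge_by_weight_sum(net_a: nn.Module, net_b: nn.Module, template: nn.Module):
    """Return template loaded with the elementwise sum of both weight sets."""
    sd_a, sd_b = net_a.state_dict(), net_b.state_dict()
    template.load_state_dict({k: sd_a[k] + sd_b[k] for k in sd_a})
    return template

make_net = lambda: nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
merged = merge_by_weight_sum(make_net(), make_net(), make_net())
```

The claim is that when the two training sets' inputs are approximately orthogonal, `merged` matches each individual network on its own task.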
[340] Redundant feature screening method for human activity recognition based on attention purification mechanism
Xiaoyang Li, Yixuan Jiang, Junze Zhu, Haotian Tang, Dongchen Wu, Hanyu Liu, Chao Li
Main category: cs.LG
TL;DR: Proposes MSAP attention mechanism for efficient human activity recognition on wearable devices, reducing feature redundancy while maintaining low resource consumption.
Details
Motivation: Need to balance recognition accuracy with resource consumption for wearable devices, addressing feature redundancy in multi-scale networks.Method: MSAP attention feature purification mechanism with inter-scale attention screening and connection, plus network correction module between layers.
Result: Extensive experiments on four public datasets show reduced redundant features and excellent performance with minimal resource consumption.
Conclusion: The proposed method provides efficient HAR for wearable devices with good performance and low resource usage.
Abstract: In the field of sensor-based Human Activity Recognition (HAR), deep neural networks provide advanced technical support. Many studies have proven that recognition accuracy can be improved by increasing the depth or width of the network. However, for wearable devices, the balance between network performance and resource consumption is crucial. With minimum resource consumption as the basic principle, we propose a universal attention feature purification mechanism, called MSAP, which is suitable for multi-scale networks. The mechanism effectively solves the feature redundancy caused by the superposition of multi-scale features by means of an inter-scale attention screening and connection method. In addition, we have designed a network correction module that integrates seamlessly between layers of individual network modules to mitigate inherent problems in deep networks. We also built an embedded deployment system that is in line with the current level of wearable technology to test the practical feasibility of the HAR model, and further prove the efficiency of the method. Extensive experiments on four public datasets show that the proposed model effectively reduces redundant features and provides excellent performance with little resource consumption.
[341] Structure As Search: Unsupervised Permutation Learning for Combinatorial Optimization
Yimeng Min, Carla P. Gomes
Main category: cs.LG
TL;DR: Non-autoregressive TSP solver using continuous relaxations of permutation matrices, achieving competitive performance without search or sequential decisions.
Details
Motivation: To eliminate the need for explicit search and sequential decision-making in solving TSP by learning permutations directly through continuous relaxations.Method: Apply similarity transformation to Hamiltonian cycles and learn to approximate permutation matrices via continuous relaxations in an unsupervised framework.
Result: Achieves competitive performance against classical heuristics without requiring search.
Conclusion: The inherent structure of TSP can effectively guide combinatorial optimization without sequential decision-making through learned permutation approximations.
Abstract: We propose a non-autoregressive framework for the Travelling Salesman Problem where solutions emerge directly from learned permutations, without requiring explicit search. By applying a similarity transformation to Hamiltonian cycles, the model learns to approximate permutation matrices via continuous relaxations. Our unsupervised approach achieves competitive performance against classical heuristics, demonstrating that the inherent structure of the problem can effectively guide combinatorial optimization without sequential decision-making.
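A sketch of the continuous relaxation such approaches typically rely on: Sinkhorn normalization turns an arbitrary score matrix into a doubly-stochastic "soft permutation". Temperature and iteration count are illustrative, and this is not claimed to be the paper's exact parameterization.

```python
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 20, tau: float = 0.1):
    """Relax an n x n score matrix into a doubly-stochastic matrix."""
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - log_p.logsumexp(dim=1, keepdim=True)  # normalize rows
        log_p = log_p - log_p.logsumexp(dim=0, keepdim=True)  # normalize columns
    return log_p.exp()

P = sinkhorn(torch.randn(6, 6))
print(P.sum(dim=0), P.sum(dim=1))   # both near 1: a soft ordering over cities
```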
[342] LLM4FS: Leveraging Large Language Models for Feature Selection
Jianhao Li, Xianchao Xiu
Main category: cs.LG
TL;DR: LLM4FS: A hybrid feature selection method combining LLMs with traditional data-driven techniques like random forest and forward sequential selection, achieving superior performance over standalone approaches.
Details
Motivation: Leverage recent advances in large language models for automated feature selection while addressing limitations of pure LLM-based methods by integrating them with statistically reliable traditional techniques.Method: Proposed LLM4FS hybrid strategy that inputs data samples into LLMs and directly integrates them with traditional data-driven methods including random forest and forward sequential selection.
Result: The hybrid approach achieves excellent feature selection performance, surpassing both standalone LLMs (DeepSeek-R1, GPT-o3-mini, GPT-4.5) and traditional data-driven methods by leveraging LLMs’ contextual understanding and traditional methods’ statistical reliability.
Conclusion: LLM4FS demonstrates effective integration of LLMs with traditional feature selection methods, though limitations in decision-making applications are acknowledged. Code is publicly available for further research.
Abstract: Recent advances in large language models (LLMs) have provided new opportunities for decision-making, particularly in the task of automated feature selection. In this paper, we first comprehensively evaluate LLM-based feature selection methods, covering the state-of-the-art DeepSeek-R1, GPT-o3-mini, and GPT-4.5. Then, we propose a new hybrid strategy called LLM4FS that integrates LLMs with traditional data-driven methods. Specifically, LLM4FS feeds data samples into LLMs and directly calls traditional data-driven techniques such as random forest and forward sequential selection. Notably, our analysis reveals that the hybrid strategy leverages the contextual understanding of LLMs and the high statistical reliability of traditional data-driven methods to achieve excellent feature selection performance, even surpassing LLMs and traditional data-driven methods. Finally, we point out the limitations of its application in decision-making. Our code is available at https://github.com/xianchaoxiu/LLM4FS.
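The traditional half of the hybrid is standard scikit-learn; a sketch with the two selectors the paper names, with the LLM step elided.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

X, y = load_breast_cancer(return_X_y=True)

# Random-forest importances as one ranking signal.
rf = RandomForestClassifier(random_state=0).fit(X, y)
top_rf = np.argsort(rf.feature_importances_)[::-1][:10]

# Forward sequential selection as the other.
sfs = SequentialFeatureSelector(
    RandomForestClassifier(random_state=0),
    n_features_to_select=10,
    direction="forward",
).fit(X, y)
top_sfs = sfs.get_support(indices=True)
```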
[343] LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization
Xujia Wang, Yunjia Qi, Bin Xu
Main category: cs.LG
TL;DR: LoSiA is a parameter-efficient fine-tuning method that dynamically identifies and optimizes critical sub-networks using gradient sparsity analysis, reducing computational overhead while maintaining performance comparable to full fine-tuning.
Details
Motivation: Existing PEFT methods like LoRA perform extensive matrix multiplications in domain specialization tasks, leading to computational inefficiency and sub-optimal fine-tuning performance.Method: LoSiA dynamically localizes and optimizes critical parameters by identifying a sub-network using gradient sparsity analysis and optimizing only those parameters, reducing additional matrix multiplication. LoSiA-Pro is a faster implementation that further reduces training latency.
Result: Extensive evaluations show minimal performance drop compared to full fine-tuning while requiring the least training time. LoSiA-Pro reduces training latency by about 27% compared to LoRA. The method also reduces forgetting during continued training.
Conclusion: LoSiA provides an efficient alternative to existing PEFT methods by focusing on critical sub-networks, achieving computational efficiency without sacrificing performance in domain specialization and common-sense reasoning tasks.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, significantly reduce the number of trainable parameters by introducing low-rank decomposition matrices. However, existing methods perform extensive matrix multiplications in domain specialization tasks, resulting in computational inefficiency and sub-optimal fine-tuning performance. Hence, we propose LoSiA (Low-Resources Subnet Integration Adaptation), an innovative method that dynamically localizes and optimizes critical parameters during the training process. Specifically, it identifies a sub-network using gradient sparsity analysis and optimizes it as the trainable target. This design enables effective high-rank adaptation by updating only the sub-network parameters, reducing the additional matrix multiplication. We also present LoSiA-Pro, a faster implementation of LoSiA, which reduces the training latency by about $27\%$ compared to LoRA. Extensive evaluations show that our method achieves minimal performance drop compared to full fine-tuning, while requiring the least training time across domain specialization and common-sense reasoning tasks. Further analysis shows that LoSiA also reduces forgetting during continued training. The source code is available at https://github.com/KlozeWang/LoSiA.
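A hedged sketch of gradient-sparsity subnet localization, one plausible reading of the selection step: after a backward pass, keep only the parameters with the largest gradient magnitudes trainable.

```python
import torch
import torch.nn as nn

def localize_subnet(model: nn.Module, keep_ratio: float = 0.05) -> dict:
    """Return per-parameter 0/1 masks marking the high-gradient sub-network."""
    grads = torch.cat([p.grad.abs().flatten()
                       for p in model.parameters() if p.grad is not None])
    k = max(1, int((1 - keep_ratio) * grads.numel()))
    threshold = grads.kthvalue(k).values
    return {name: (p.grad.abs() > threshold).float()
            for name, p in model.named_parameters() if p.grad is not None}

# After each backward pass, multiply every parameter update by its mask so only
# the localized sub-network moves; everything else stays frozen.
```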
[344] MEGA: Second-Order Gradient Alignment for Catastrophic Forgetting Mitigation in GFSCIL
Jinhui Pang, Changqing Lin, Hao Lin, Zhihui Zhang, Weiping Ding, Yu Liu, Xiaoshuai Hao
Main category: cs.LG
TL;DR: Proposes MEGA, a model-agnostic meta graph continual learning method for GFSCIL that excludes query sets during incremental training and uses second-order gradients to learn high-quality priors, achieving state-of-the-art results.
Details
Motivation: Existing GFSCIL approaches oversimplify learning via novel query set fine-tuning and fail to integrate Graph Continual Learning techniques due to architectural constraints. They need a more rigorous setting that excludes query sets during incremental training.Method: Introduces Model-Agnostic Meta Graph Continual Learning (MEGA) that calculates incremental second-order gradient during meta-training stage to learn high-quality priors that align model behaviors across meta-training and incremental learning stages.
Result: Extensive experiments on four mainstream graph datasets demonstrate that MEGA achieves state-of-the-art results and enhances the effectiveness of various GCL methods in GFSCIL.
Conclusion: MEGA serves as a model-agnostic GFSCIL paradigm that effectively alleviates catastrophic forgetting and paves the way for future research in graph few-shot class-incremental learning.
Abstract: Graph Few-Shot Class-Incremental Learning (GFSCIL) enables models to continually learn from limited samples of novel tasks after initial training on a large base dataset. Existing GFSCIL approaches typically utilize Prototypical Networks (PNs) for metric-based class representations and fine-tune the model during the incremental learning stage. However, these PN-based methods oversimplify learning via novel query set fine-tuning and fail to integrate Graph Continual Learning (GCL) techniques due to architectural constraints. To address these challenges, we propose a more rigorous and practical setting for GFSCIL that excludes query sets during the incremental training phase. Building on this foundation, we introduce Model-Agnostic Meta Graph Continual Learning (MEGA), aimed at effectively alleviating catastrophic forgetting for GFSCIL. Specifically, by calculating the incremental second-order gradient during the meta-training stage, we enable the model to learn high-quality priors that enhance incremental learning by aligning its behaviors across both the meta-training and incremental learning stages. Extensive experiments on four mainstream graph datasets demonstrate that MEGA achieves state-of-the-art results and enhances the effectiveness of various GCL methods in GFSCIL. We believe that our proposed MEGA serves as a model-agnostic GFSCIL paradigm, paving the way for future research.
[345] TolerantECG: A Foundation Model for Imperfect Electrocardiogram
Huynh Dang Nguyen, Trong-Thang Pham, Ngan Le, Van Nguyen
Main category: cs.LG
TL;DR: TolerantECG is a foundation model for ECG signals that handles noise and missing leads through contrastive and self-supervised learning, achieving top performance on standard ECG datasets.
Details
Motivation: ECG effectiveness is compromised by noise and missing leads in standard 12-lead recordings, leading to diagnostic errors and uncertainty.Method: Combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations with text report descriptions and corrupted/missing-lead signals.
Result: Consistently ranks best or second-best across various ECG conditions in PTB-XL dataset, and achieves highest performance on MIT-BIH Arrhythmia Database.
Conclusion: TolerantECG provides a robust solution for ECG analysis that works effectively with noisy signals and arbitrary subsets of standard 12-lead recordings.
Abstract: The electrocardiogram (ECG) is an essential and effective tool for diagnosing heart diseases. However, its effectiveness can be compromised by noise or unavailability of one or more leads of the standard 12-lead recordings, resulting in diagnostic errors or uncertainty. To address these challenges, we propose TolerantECG, a foundation model for ECG signals that is robust to noise and capable of functioning with arbitrary subsets of the standard 12-lead ECG. TolerantECG training combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations alongside their corresponding knowledge-retrieval-based text report descriptions and corrupted or lead-missing signals. Comprehensive benchmarking results demonstrate that TolerantECG consistently ranks as the best or second-best performer across various ECG signal conditions and class levels in the PTB-XL dataset, and achieves the highest performance on the MIT-BIH Arrhythmia Database.
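One half of the training signal is easy to picture: corrupting recordings so the model must cope with missing leads. A minimal sketch of such an augmentation; the paper's corruption suite (noise types, report pairing) is broader.

```python
import torch

def drop_random_leads(ecg: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """ecg: (batch, 12, time). Zero out each of the 12 leads independently with
    probability p, simulating arbitrary lead subsets at train time."""
    keep = (torch.rand(ecg.shape[0], ecg.shape[1], 1, device=ecg.device) > p)
    return ecg * keep

batch = torch.randn(8, 12, 5000)       # 10 s of 12-lead ECG at 500 Hz
corrupted = drop_random_leads(batch)   # paired with the clean batch for the contrastive loss
```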
[346] Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections
Frederik L. Dennig, Nina Geyer, Daniela Blumberg, Yannick Metz, Daniel A. Keim
Main category: cs.LG
TL;DR: Autoencoders with customized loss functions can create smooth parametric and invertible 2D projections that outperform feed-forward neural networks, giving users control over smoothing strength.
Details
Motivation: Neural networks can create parametric and invertible multidimensional projections, but these properties haven't been explored simultaneously for arbitrary projection methods. The research aims to evaluate autoencoder architectures for creating both parametric and invertible projections.Method: Evaluated three autoencoder architectures trained to learn mappings into 2D space and inverse mappings back to original space. Used customized loss functions and performed quantitative/qualitative comparison on four datasets using t-SNE.
Result: Autoencoders with customized loss functions created smoother parametric and inverse projections than feed-forward neural networks, while providing user control over smoothing effect strength.
Conclusion: Autoencoder architectures with tailored loss functions are effective for creating high-quality parametric and invertible multidimensional projections, offering improved performance over traditional neural network approaches.
Abstract: Recently, neural networks have gained attention for creating parametric and invertible multidimensional data projections. Parametric projections allow for embedding previously unseen data without recomputing the projection as a whole, while invertible projections enable the generation of new data points. However, these properties have never been explored simultaneously for arbitrary projection methods. We evaluate three autoencoder (AE) architectures for creating parametric and invertible projections. Based on a given projection, we train AEs to learn a mapping into 2D space and an inverse mapping into the original space. We perform a quantitative and qualitative comparison on four datasets of varying dimensionality and pattern complexity using t-SNE. Our results indicate that AEs with a customized loss function can create smoother parametric and inverse projections than feed-forward neural networks while giving users control over the strength of the smoothing effect.
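A sketch of the general recipe under discussion: train an encoder to reproduce precomputed t-SNE coordinates (the parametric direction) while a decoder reconstructs the input (the invertible direction). The combined loss is an assumption about the recipe, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Linear(50, 64), nn.ReLU(), nn.Linear(64, 2))
dec = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 50))
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

X = torch.randn(512, 50)   # high-dimensional data (stand-in)
Z = torch.randn(512, 2)    # its precomputed t-SNE coordinates (stand-in)

for _ in range(200):
    z = enc(X)
    loss = F.mse_loss(z, Z) + F.mse_loss(dec(z), X)  # parametric + invertible
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once trained, `enc` embeds unseen points without rerunning t-SNE, and `dec` generates new data from 2D positions.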
[347] Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning
Yichen Li, Xiuying Wang, Wenchao Xu, Haozhao Wang, Yining Qi, Jiahua Dong, Ruixuan Li
Main category: cs.LG
TL;DR: FedFD proposes feature distillation with orthogonal projection for model-heterogeneous federated learning to address knowledge bias issues in traditional ensemble distillation methods.
Details
Motivation: Existing ensemble distillation methods in heterogeneous FL primarily use logit distillation which fails to compensate for knowledge bias from model heterogeneity, leading to unstable training and suboptimal results.Method: Proposes feature-based ensemble distillation with orthogonal projection layers for each client model architecture to align features and mitigate knowledge bias from heterogeneous models.
Result: Extensive experiments show FedFD achieves superior performance compared to state-of-the-art methods in model-heterogeneous federated learning.
Conclusion: Feature distillation with orthogonal projection effectively addresses knowledge bias in heterogeneous FL, providing stable and efficient knowledge aggregation from diverse client models.
Abstract: Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.
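A sketch of the per-client alignment idea: an orthogonally re-parameterized linear layer mapping one client architecture's features into the server's feature space. PyTorch's built-in orthogonal parametrization keeps the weight orthogonal throughout training; dimensions are illustrative.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

# One projection per client architecture.
proj = orthogonal(nn.Linear(128, 128, bias=False))  # weight constrained orthogonal

client_features = torch.randn(4, 128)   # features from one heterogeneous client
aligned = proj(client_features)         # distill the global model against these
```

Orthogonality preserves feature norms and angles, which is one way to read the claim that it mitigates knowledge bias while maximizing the distilled knowledge.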
[348] Bi-directional Model Cascading with Proxy Confidence
David Warren, Mark Dras
Main category: cs.LG
TL;DR: A bi-directional model cascading approach that uses both small and large model confidence through hidden state analysis and proxy models to reduce costly model invocations.
Details
Motivation: Existing model cascading approaches only use limited small model confidence estimates due to large model inaccessibility, despite large model confidence being important for optimal deferral decisions.Method: Uses hidden state analysis to improve small model post-invocation confidence and combines with a tiny proxy model to estimate large model pre-invocation confidence, enabling comparative calibration between both models.
Result: The proposed cascading system shows improvements over standard baselines on challenging multiple-choice datasets, with reductions in deferrals to more costly models.
Conclusion: Bi-directional confidence estimation using proxy models and hidden state analysis enables more efficient model cascading by better leveraging both small and large model confidence information.
Abstract: Model Cascading, recently applied successfully to LLMs, is a simple but powerful technique that improves the efficiency of inference by selectively applying models of varying sizes. Models are used in sequence from smallest to largest, only deferring samples to large, costly models when smaller models are not sufficiently confident. Existing approaches to deferral use only limited small model confidence estimates because of the inaccessibility of the large model, although large model confidence is known to be important. We therefore propose a bi-directional approach to deferral that considers the confidence of small and large models in the cascade simultaneously through the use of a proxy for the large model. This requires a richer representation of model confidence to enable comparative calibration: we use an analysis of hidden states to improve post-invocation confidence of the small model, which in itself improves cascading results over prior approaches. We then combine this with a tiny proxy model to estimate pre-invocation confidence of the large model. We examine the proposed cascading system over challenging, multiple-choice datasets, finding improvements over standard cascading baselines reflected in reductions in deferrals to more costly models.
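Stripped to its decision rule, the cascade looks like the sketch below; all three components are stand-ins for the paper's calibrated small model, tiny proxy, and large model.

```python
def cascade(x, small, proxy, large, t_small=0.8, t_large=0.5):
    """small(x) -> (prediction, post-invocation confidence); proxy(x) -> estimated
    pre-invocation confidence of the large model; large(x) -> prediction."""
    pred, conf_small = small(x)
    if conf_small >= t_small:
        return pred                 # the cheap answer is trusted
    if proxy(x) < t_large:
        return pred                 # the large model is unlikely to do better
    return large(x)                 # defer only when deferral is likely to pay off
```

The second test is what makes the deferral bi-directional: low small-model confidence alone no longer forces an expensive call.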
[349] Learnable Kernel Density Estimation for Graphs
Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan
Main category: cs.LG
TL;DR: LGKDE is a framework that learns kernel density estimation for graphs using graph neural networks and maximum mean discrepancy to capture structural patterns and semantic variations with theoretical guarantees.
Details
Motivation: Standard graph density estimation approaches combining graph kernels and KDE have unsatisfactory performance due to handcrafted and fixed kernel features, requiring a more flexible learning-based approach.
Method: Leverages GNNs to represent graphs as discrete distributions, uses MMD to learn graph metrics for multi-scale KDE, and learns parameters by maximizing density of graphs relative to perturbed counterparts through node feature and graph spectra perturbations.
Result: LGKDE shows superior performance in recovering underlying density of synthetic graph distributions and graph anomaly detection across diverse benchmark datasets compared to state-of-the-art baselines.
Conclusion: The proposed LGKDE framework effectively addresses graph density estimation challenges with theoretical consistency and convergence guarantees, demonstrating strong empirical performance in both density recovery and anomaly detection tasks.
Abstract: This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance due to the handcrafted and fixed features of kernels. Our method LGKDE leverages graph neural networks to represent each graph as a discrete distribution and utilizes maximum mean discrepancy to learn the graph metric for multi-scale KDE, where all parameters are learned by maximizing the density of graphs relative to the density of their well-designed perturbed counterparts. The perturbations are conducted on both node features and graph spectra, which helps better characterize the boundary of normal density regions. Theoretically, we establish consistency and convergence guarantees for LGKDE, including bounds on the mean integrated squared error, robustness, and complexity. We validate LGKDE by demonstrating its effectiveness in recovering the underlying density of synthetic graph distributions and applying it to graph anomaly detection across diverse benchmark datasets. Extensive empirical evaluation shows that LGKDE demonstrates superior performance compared to state-of-the-art baselines on most benchmark datasets.
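To make the MMD-based density concrete, here is an illustrative numpy sketch (not the paper's implementation): each graph is represented as a set of node embeddings, the metric is squared MMD under a Gaussian kernel, and the density is a multi-scale KDE over reference graphs.

```python
# Illustrative only: graphs as sets of node embeddings, squared MMD as the
# graph distance, and a multi-scale Gaussian KDE over reference graphs.
import numpy as np

def mmd2(X, Y, gamma=1.0):
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def graph_density(G, refs, bandwidths=(0.5, 1.0, 2.0)):
    d2 = np.array([mmd2(G, R) for R in refs])
    # Multi-scale Gaussian KDE on the MMD metric, averaged over bandwidths.
    return np.mean([np.exp(-d2 / (2 * h**2)).mean() for h in bandwidths])

refs = [np.random.randn(10, 16) for _ in range(20)]  # reference node embeddings
print(graph_density(np.random.randn(12, 16), refs))
```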
[350] AFLoRA: Adaptive Federated Fine-Tuning of Large Language Models with Resource-Aware Low-Rank Adaption
Yajie Zhou, Xiaoyi Pang, Zhibo Wang
Main category: cs.LG
TL;DR: AFLoRA is a federated fine-tuning framework that addresses computational and communication challenges in adapting LLMs to decentralized heterogeneous data environments through adaptive rank pruning and efficient aggregation.
Details
Motivation: Federated fine-tuning of LLMs faces challenges due to high computational/communication demands, client heterogeneity, non-IID data, and limitations of existing parameter-efficient methods that fail to balance aggregation accuracy with low system costs.
Method: AFLoRA decouples shared and client-specific updates, uses diagonal matrix-based rank pruning to optimize local resource utilization, and employs rank-aware aggregation with public data refinement to handle data heterogeneity.
Result: Extensive experiments show AFLoRA outperforms state-of-the-art methods in both accuracy and efficiency, demonstrating superior performance in heterogeneous environments.
Conclusion: AFLoRA provides a practical and efficient solution for LLM adaptation in real-world heterogeneous federated learning settings, effectively addressing computational, communication, and data heterogeneity challenges.
Abstract: Federated fine-tuning has emerged as a promising approach to adapt foundation models to downstream tasks using decentralized data. However, real-world deployment remains challenging due to the high computational and communication demands of fine-tuning Large Language Models (LLMs) on clients with data and system resources that are heterogeneous and constrained. In such settings, the global model’s performance is often bottlenecked by the weakest clients and further degraded by the non-IID nature of local data. Although existing methods leverage parameter-efficient techniques such as Low-Rank Adaptation (LoRA) to reduce communication and computation overhead, they often fail to simultaneously ensure accurate aggregation of low-rank updates and maintain low system costs, thereby hindering overall performance. To address these challenges, we propose AFLoRA, an adaptive and lightweight federated fine-tuning framework for LLMs. AFLoRA decouples shared and client-specific updates to reduce overhead and improve aggregation accuracy, incorporates diagonal matrix-based rank pruning to better utilize local resources, and employs rank-aware aggregation with public data refinement to strengthen generalization under data heterogeneity. Extensive experiments demonstrate that AFLoRA outperforms state-of-the-art methods in both accuracy and efficiency, providing a practical solution for efficient LLM adaptation in heterogeneous environments in the real world.
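A minimal sketch of what "diagonal matrix-based rank pruning" could look like, under our reading that each LoRA rank is gated by a learnable scalar and low-magnitude ranks are dropped on resource-constrained clients; AFLoRA's exact mechanism may differ.

```python
# Sketch of diagonal-gated LoRA with rank pruning (our reading, not the
# paper's code). Effective update: W + B @ diag(d) @ A.
import torch
import torch.nn as nn

class DiagLoRA(nn.Module):
    def __init__(self, d_in, d_out, rank=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.d = nn.Parameter(torch.ones(rank))  # per-rank gate

    def forward(self, x):
        return x @ (self.B * self.d).matmul(self.A).T

    def prune(self, keep: int):
        # Drop the ranks whose gates have the smallest magnitude.
        idx = self.d.abs().topk(keep).indices
        self.A = nn.Parameter(self.A[idx].detach())
        self.B = nn.Parameter(self.B[:, idx].detach())
        self.d = nn.Parameter(self.d[idx].detach())

layer = DiagLoRA(768, 768, rank=16)
y = layer(torch.randn(4, 768))   # (4, 768)
layer.prune(keep=8)              # a tight-resource client keeps half the ranks
```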
[351] Near Optimal Non-asymptotic Sample Complexity of 1-Identification
Zitian Li, Wang Chi Cheung
Main category: cs.LG
TL;DR: The paper proposes a new algorithm called Sequential-Exploration-Exploitation (SEE) for the 1-identification problem in multi-armed bandits, achieving near-optimal non-asymptotic sample complexity with tight upper and lower bounds.
Details
Motivation: To address the open problem in existing literature regarding non-asymptotic analysis of the 1-identification problem, where Degenne & Koolen 2019 had established asymptotic optimality but left non-asymptotic analysis unclear.
Method: Design of Sequential-Exploration-Exploitation (SEE) algorithm that combines exploration and exploitation strategies for the 1-identification problem, with theoretical analysis from non-asymptotic perspective.
Result: Achieves near-optimal pulling complexity with upper and lower bounds matching up to a polynomial logarithmic factor. Numerical results show effectiveness compared to existing benchmarks.
Conclusion: The proposed SEE algorithm successfully addresses the non-asymptotic analysis gap in 1-identification problems, providing near-optimal performance with tight theoretical guarantees.
Abstract: Motivated by an open direction in existing literature, we study the 1-identification problem, a fundamental multi-armed bandit formulation on pure exploration. The goal is to determine whether there exists an arm whose mean reward is at least a known threshold $\mu_0$, or to output None if it believes such an arm does not exist. The agent needs to guarantee its output is correct with probability at least $1-\delta$. Degenne & Koolen 2019 has established the asymptotically tight sample complexity for the 1-identification problem, but they commented that the non-asymptotic analysis remains unclear. We design a new algorithm Sequential-Exploration-Exploitation (SEE), and conduct theoretical analysis from the non-asymptotic perspective. Novel to the literature, we achieve near optimality, in the sense of matching upper and lower bounds on the pulling complexity. The gap between the upper and lower bounds is up to a polynomial logarithmic factor. The numerical result also indicates the effectiveness of our algorithm, compared to existing benchmarks.
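For intuition, here is a generic confidence-bound loop for 1-identification (this is not SEE itself, whose exploration-exploitation schedule is more refined): stop and output an arm once its lower bound clears mu0, or output None once every arm's upper bound falls below it.

```python
# Generic anytime confidence-bound loop for 1-identification; the pull
# schedule and bonus are illustrative, not SEE's.
import numpy as np

def one_identification(arms, mu0, delta, max_pulls=100_000):
    K = len(arms)
    n, s = np.zeros(K), np.zeros(K)
    for t in range(1, max_pulls + 1):
        i = t % K                              # naive round-robin pulls
        s[i] += arms[i](); n[i] += 1
        mean = s / np.maximum(n, 1)
        bonus = np.sqrt(np.log(4 * K * t**2 / delta) / (2 * np.maximum(n, 1)))
        if np.any(mean - bonus >= mu0):
            return int(np.argmax(mean - bonus))  # a confidently qualifying arm
        if np.all(mean + bonus < mu0):
            return None                          # confidently no arm qualifies
    return None

arms = [lambda m=m: np.random.normal(m, 1.0) for m in (0.2, 0.5, 0.9)]
print(one_identification(arms, mu0=0.7, delta=0.05))
```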
[352] TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
Zhengpeng Feng, Clement Atzberger, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline C Lisaius, Markus Immitzer, James Ball, David A. Coomes, Anil Madhavapeddy, Andrew Blake, Srinivasan Keshav
Main category: cs.LG
TL;DR: TESSERA is an open, global land remote sensing foundation model that uses self-supervised learning to generate ready-to-use 10m embeddings from satellite time-series data, outperforming task-specific models in various ecological applications.
Details
Motivation: Satellite time series are voluminous and often corrupted, making them challenging to use for downstream applications like habitat mapping, carbon accounting, and conservation strategies.Method: Uses two encoders to combine optical data with synthetic aperture radar backscatter coefficients at 10m resolution, fused with a multilayer perceptron to create annual global embedding maps through self-supervised learning.
Result: TESSERA closely matches or outperforms state-of-the-art task-specific models and other foundation models in five diverse downstream tasks.
Conclusion: TESSERA’s ease of use, state-of-the-art performance, openness, and computation/data efficiency will prove transformative for ecological applications.
Abstract: Satellite remote sensing enables a wide range of downstream applications, including habitat mapping, carbon accounting, and strategies for conservation and sustainable land use. However, satellite time series are voluminous and often corrupted, making them challenging to use. We present TESSERA, an open, global, land-oriented remote sensing foundation model that uses self-supervised learning to generate ‘ready-to-use’ embeddings at 10m scale from pixel-level satellite time-series data. TESSERA uses two encoders to combine optical data with synthetic aperture radar backscatter coefficients at 10m resolution to create embeddings that are fused with a multilayer perceptron to create annual global embedding maps. We compare our work with state-of-the-art task-specific models and other foundation models in five diverse downstream tasks and find that TESSERA closely matches or outperforms these baselines. We believe that TESSERA’s ease of use, state-of-the-art performance, openness, and computation- and labelled data-efficiency will prove transformative in a wide range of ecological applications.
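The described two-encoder fusion is easy to schematize. A speculative PyTorch sketch with assumed shapes (TESSERA's actual encoders are not specified here; GRUs merely stand in for whatever temporal encoders the model uses):

```python
# Schematic of the described fusion, with assumed shapes: an optical encoder
# and a SAR-backscatter encoder each embed a pixel's time series, and an MLP
# fuses the two into one annual embedding.
import torch
import torch.nn as nn

class Fusion(nn.Module):
    def __init__(self, d_opt=10, d_sar=2, d_emb=128):
        super().__init__()
        self.opt = nn.GRU(d_opt, d_emb, batch_first=True)   # optical time series
        self.sar = nn.GRU(d_sar, d_emb, batch_first=True)   # SAR backscatter series
        self.mlp = nn.Sequential(nn.Linear(2 * d_emb, d_emb), nn.ReLU(),
                                 nn.Linear(d_emb, d_emb))

    def forward(self, x_opt, x_sar):
        _, h_opt = self.opt(x_opt)
        _, h_sar = self.sar(x_sar)
        return self.mlp(torch.cat([h_opt[-1], h_sar[-1]], dim=-1))

emb = Fusion()(torch.randn(4, 73, 10), torch.randn(4, 73, 2))  # (4, 128)
```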
[353] Fragile, Robust, and Antifragile: A Perspective from Parameter Responses in Reinforcement Learning Under Stress
Zain ul Abdeen, Ming Jin
Main category: cs.LG
TL;DR: This paper introduces a framework to analyze RL policy robustness by classifying parameters as fragile, robust, or antifragile based on their response to internal (synaptic filtering) and external (adversarial attacks) stresses.
Details
Motivation: To systematically understand and improve reinforcement learning policy robustness by analyzing how different network parameters respond to various stress conditions, inspired by synaptic plasticity in neuroscience.
Method: Uses synaptic filtering to apply internal stress by selectively perturbing parameters, and adversarial attacks for external stress through modified observations. Defines parameter scores to classify them as fragile, robust, or antifragile based on performance impact in clean and adversarial settings.
Result: Validated on PPO-trained agents in Mujoco environments, revealing the existence of antifragile parameters that actually enhance policy performance under stress conditions.
Conclusion: The framework provides insights for designing more robust RL systems and demonstrates that targeted filtering techniques can improve policy adaptability by leveraging antifragile parameters.
Abstract: This paper explores Reinforcement learning (RL) policy robustness by systematically analyzing network parameters under internal and external stresses. Inspired by synaptic plasticity in neuroscience, synaptic filtering introduces internal stress by selectively perturbing parameters, while adversarial attacks apply external stress through modified agent observations. This dual approach enables the classification of parameters as fragile, robust, or antifragile, based on their influence on policy performance in clean and adversarial settings. Parameter scores are defined to quantify these characteristics, and the framework is validated on PPO-trained agents in Mujoco continuous control environments. The results highlight the presence of antifragile parameters that enhance policy performance under stress, demonstrating the potential of targeted filtering techniques to improve RL policy adaptability. These insights provide a foundation for future advancements in the design of robust and antifragile RL systems.
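The classification rule reduces to a sign test on performance deltas. An illustrative simplification of the paper's parameter scores, with user-supplied evaluate and perturb hooks:

```python
# Illustrative simplification: perturb one parameter group at a time and
# classify it by the sign of the performance change under stress.
def classify_parameters(policy, param_groups, evaluate, perturb,
                        eps=1e-3, tol=1e-2):
    base = evaluate(policy)
    labels = {}
    for name in param_groups:
        stressed = perturb(policy, name, eps)  # synaptic-filtering style stress
        delta = evaluate(stressed) - base
        if delta < -tol:
            labels[name] = "fragile"           # performance drops under stress
        elif delta > tol:
            labels[name] = "antifragile"       # performance improves under stress
        else:
            labels[name] = "robust"
    return labels
```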
[354] The calculus of variations of the Transformer on the hyperspherical tangent bundle
Andrew Gracyk
Main category: cs.LG
TL;DR: The paper provides a mathematical framework for Transformers using Lagrangian optimization and calculus of variations, showing Transformers can be viewed as natural solvers of variational problems on high-dimensional token spaces.
Details
Motivation: To establish a theoretical mathematical foundation for Transformers through Lagrangian optimization and calculus of variations, addressing the lack of rigorous mathematical analysis for Transformer architectures in variational contexts.Method: Develops a functional using calculus of variations, shows Transformers satisfy this functional as continuous flow maps on the tangent bundle of high-dimensional unit spheres, and derives the Euler-Lagrange equation specifically for Transformers.
Result: The paper demonstrates that Transformers can be mathematically characterized as natural solvers of calculus of variations problems, provides foundational proofs for the Euler-Lagrange equation in Transformer contexts, and establishes new analytical techniques for neural approximations.
Conclusion: This work lays the mathematical foundation for understanding Transformers through variational principles and Lagrangian optimization, opening new research directions in calculus of variations applied to neural network architectures.
Abstract: We offer a theoretical mathematical background to Transformers through Lagrangian optimization across the token space. The Transformer, as a flow map, exists in the tangent fiber for each token along the high-dimensional unit sphere. The circumstance of the hypersphere across the latent data is reasonable due to the trained diagonal matrix equal to the identity, which has various empirical justifications. Thus, under the continuum limit of the dynamics, the latent vectors flow among the tangent bundle. Using these facts, we devise a mathematical framework for the Transformer through calculus of variations. We develop a functional and show that the continuous flow map induced by the Transformer satisfies this functional, therefore the Transformer can be viewed as a natural solver of a calculus of variations problem. We invent new scenarios of when our methods are applicable based on loss optimization with respect to path optimality. We derive the Euler-Lagrange equation for the Transformer. The variant of the Euler-Lagrange equation we present has various appearances in literature, but, to our understanding, oftentimes not foundationally proven or under other specialized cases. Our overarching proof is new: our techniques are classical and the use of the flow map object is original. We provide several other relevant results, primarily ones specific to neural scenarios. In particular, much of our analysis will be attempting to quantify Transformer data in variational contexts under neural approximations. Calculus of variations on manifolds is a well-nourished research area, but for the Transformer specifically, it is uncharted: we lay the foundation for this area through an introduction to the Lagrangian for the Transformer.
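For reference, the classical Euler-Lagrange condition for a path on the unit sphere, which the paper specializes to the Transformer flow map (the Transformer-specific variant is not reproduced here):

```latex
% Classical Euler--Lagrange condition for a token path x(t) on the unit
% sphere minimizing \int L(t, x, \dot{x})\,dt; the paper derives a
% Transformer-specific variant on the hyperspherical tangent bundle.
\[
\frac{\partial L}{\partial x} - \frac{d}{dt}\frac{\partial L}{\partial \dot{x}} = 0,
\qquad x(t) \in S^{d-1}, \quad \dot{x}(t) \in T_{x(t)}S^{d-1}.
\]
```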
[355] Causal Mechanism Estimation in Multi-Sensor Systems Across Multiple Domains
Jingyi Yu, Tim Pychynski, Marco F. Huber
Main category: cs.LG
TL;DR: CICME is a three-step causal discovery method that identifies both common and domain-specific causal mechanisms from heterogeneous multi-domain data using causal transfer learning principles.
Details
Motivation: To gain deeper insights into complex sensor systems through causality by analyzing heterogeneous data from multiple domains, addressing the need to distinguish between domain-invariant and domain-specific causal relationships.
Method: A three-step approach leveraging Causal Transfer Learning (CTL) that first identifies domain-invariant causal mechanisms, then uses these to guide estimation of remaining domain-specific mechanisms. Built upon continuous optimization-based causal discovery methods.
Result: CICME outperforms baseline methods (pooled data analysis and individual domain analysis) in certain scenarios when evaluated on linear Gaussian models inspired by manufacturing processes. It reliably detects common causal mechanisms with sufficient samples.
Conclusion: CICME effectively combines the benefits of both pooled and individual domain causal discovery approaches, providing a robust framework for analyzing complex systems with heterogeneous data from multiple domains.
Abstract: To gain deeper insights into a complex sensor system through the lens of causality, we present common and individual causal mechanism estimation (CICME), a novel three-step approach to inferring causal mechanisms from heterogeneous data collected across multiple domains. By leveraging the principle of Causal Transfer Learning (CTL), CICME is able to reliably detect domain-invariant causal mechanisms when provided with sufficient samples. The identified common causal mechanisms are further used to guide the estimation of the remaining causal mechanisms in each domain individually. The performance of CICME is evaluated on linear Gaussian models under scenarios inspired from a manufacturing process. Building upon existing continuous optimization-based causal discovery methods, we show that CICME leverages the benefits of applying causal discovery on the pooled data and repeatedly on data from individual domains, and it even outperforms both baseline methods under certain scenarios.
[356] Hybrid-Hierarchical Fashion Graph Attention Network for Compatibility-Oriented and Personalized Outfit Recommendation
Sajjad Saed, Babak Teimourpour
Main category: cs.LG
TL;DR: FGAT framework uses hierarchical graph with attention mechanisms to jointly model outfit compatibility and user preferences, outperforming baselines on fashion recommendation.
Details
Motivation: Address the challenge of simultaneously handling outfit compatibility and personalized recommendations in fashion e-commerce, which are typically treated independently in existing studies.
Method: Hierarchical graph representation with three tiers (users, outfits, items) using graph attention mechanisms to dynamically weight node importance and integrate visual/textual features.
Result: Outperforms strong baselines like HFGN on POG dataset with improvements in accuracy, precision, HR, recall, and NDCG metrics.
Conclusion: Combining multimodal features with hierarchical graph structure and attention mechanisms significantly enhances personalized fashion recommendation effectiveness and efficiency.
Abstract: The rapid expansion of the fashion industry and the growing variety of products have made it increasingly challenging for users to identify compatible items on e-commerce platforms. Effective fashion recommendation systems are therefore crucial for filtering irrelevant options and suggesting suitable ones. However, simultaneously addressing outfit compatibility and personalized recommendations remains a significant challenge, as these aspects are typically treated independently in existing studies, thereby overlooking the complex interactions between items and user preferences. This research introduces a new framework named FGAT, which leverages a hierarchical graph representation together with graph attention mechanisms to address this problem. The framework constructs a three-tier graph of users, outfits, and items, integrating visual and textual features to jointly model outfit compatibility and user preferences. By dynamically weighting node importance during representation propagation, the graph attention mechanism captures key interactions and produces precise embeddings for both user preferences and outfit compatibility. Evaluated on the POG dataset, FGAT outperforms strong baselines such as HFGN, achieving notable improvements in accuracy, precision, HR, recall, and NDCG. These results demonstrate that combining multimodal visual and textual features with a hierarchical graph structure and attention mechanisms significantly enhances the effectiveness and efficiency of personalized fashion recommendation systems.
[357] Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
Lorenzo Livi
Main category: cs.LG
TL;DR: Gating mechanisms in RNNs act as implicit adaptive learning rate controllers by coupling state-space time scales with parameter-space dynamics during gradient descent, functioning as data-driven preconditioners that reshape gradient propagation and modulate effective step sizes.
Details
Motivation: To understand how gating mechanisms in RNNs implicitly induce adaptive learning-rate behavior and how they couple state evolution with parameter updates to achieve robust trainability and stability.
Method: Derived exact Jacobians for leaky-integrator and gated RNNs, obtained first-order expansion to analyze how gates reshape gradient propagation and modulate step sizes, and conducted empirical simulations on synthetic sequence tasks (adding, copy).
Result: Gates induce lag-dependent effective learning rates and directional concentration of gradient flow, with multi-gate models matching or exceeding the anisotropic structure produced by Adam. Gates act as data-driven preconditioners that adapt optimization trajectories.
Conclusion: Gating mechanisms provide adaptive optimization behavior complementary to explicit optimizers like Adam, explaining why gated architectures achieve robust trainability and stability in practice through unified dynamical-systems coupling of state evolution with parameter updates.
Abstract: We study how gating mechanisms in recurrent neural networks (RNNs) implicitly induce adaptive learning-rate behavior, even when training is carried out with a fixed, global learning rate. This effect arises from the coupling between state-space time scales–parametrized by the gates–and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs, we obtain a first-order expansion that makes explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates not only control information flow, but also act as data-driven preconditioners that adapt optimization trajectories in parameter space. We further draw formal analogies with learning-rate schedules, momentum, and adaptive methods such as Adam, pointing to possible redundancies. Empirical simulations corroborate these claims: in canonical synthetic sequence tasks (adding, copy) we show that gates induce lag-dependent effective learning rates and directional concentration of gradient flow, with multi-gate models matching or exceeding the anisotropic structure produced by Adam. These results highlight that optimizer-driven and gate-driven adaptivity are complementary but not equivalent mechanisms. Overall, this work provides a unified dynamical-systems perspective on how gating couples state evolution with parameter updates, explaining why gated architectures achieve robust trainability and stability in practice.
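The coupling is already visible in the standard leaky-integrator form the paper analyzes: the gate alpha sets the state time scale, and it reappears in every Jacobian along the backward pass, rescaling effective step sizes at each lag.

```latex
% Standard leaky-integrator RNN: the gate \alpha sets the state time scale
% and enters every Jacobian of backpropagation through time, acting like a
% per-lag effective learning rate.
\[
h_t = (1 - \alpha)\, h_{t-1} + \alpha\, \phi(W x_t + U h_{t-1}),
\qquad
\frac{\partial h_t}{\partial h_{t-1}} = (1 - \alpha) I + \alpha\, \mathrm{diag}\!\big(\phi'\big)\, U.
\]
```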
[358] DE-VAE: Revealing Uncertainty in Parametric and Inverse Projections with Variational Autoencoders using Differential Entropy
Frederik L. Dennig, Daniel A. Keim
Main category: cs.LG
TL;DR: DE-VAE is an uncertainty-aware variational autoencoder that uses differential entropy to create parametric and invertible 2D projections, addressing poor performance with out-of-distribution samples while maintaining accuracy comparable to other AE methods.
Details
Motivation: Existing autoencoder methods perform poorly when dealing with out-of-distribution samples in either data or embedding space, limiting their reliability for parametric and invertible projections.Method: Proposed DE-VAE, a variational autoencoder that uses differential entropy to learn uncertainty-aware mappings between original space and 2D embeddings, trained with fixed projection using UMAP and t-SNE as baselines.
Result: DE-VAE creates parametric and inverse projections with comparable accuracy to current AE-based approaches while enabling embedding uncertainty analysis, validated on four well-known datasets.
Conclusion: DE-VAE successfully addresses the limitations of existing methods by providing uncertainty-aware parametric and invertible projections that handle out-of-distribution samples effectively.
Abstract: Recently, autoencoders (AEs) have gained interest for creating parametric and invertible projections of multidimensional data. Parametric projections make it possible to embed new, unseen samples without recalculating the entire projection, while invertible projections allow the synthesis of new data instances. However, existing methods perform poorly when dealing with out-of-distribution samples in either the data or embedding space. Thus, we propose DE-VAE, an uncertainty-aware variational AE using differential entropy (DE) to improve the learned parametric and invertible projections. Given a fixed projection, we train DE-VAE to learn a mapping into 2D space and an inverse mapping back to the original space. We conduct quantitative and qualitative evaluations on four well-known datasets, using UMAP and t-SNE as baseline projection methods. Our findings show that DE-VAE can create parametric and inverse projections with comparable accuracy to other current AE-based approaches while enabling the analysis of embedding uncertainty.
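The uncertainty signal has a closed form: for a k-dimensional diagonal-Gaussian posterior, the differential entropy is given by the standard expression below (DE-VAE's specific use of it as a training signal is described only qualitatively in the summary).

```latex
% Closed-form differential entropy of a diagonal-Gaussian posterior
% q(z|x) = N(\mu, \mathrm{diag}(\sigma^2)) in k dimensions (standard formula).
\[
h\big(q(z \mid x)\big) = \tfrac{k}{2}\ln(2\pi e) + \sum_{i=1}^{k} \ln \sigma_i .
\]
```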
[359] MAVIS: Multi-Objective Alignment via Value-Guided Inference-Time Search
Jeremy Carleton, Debajoy Mukherjee, Srinivas Shakkottai, Dileep Kalathil
Main category: cs.LG
TL;DR: MAVIS enables dynamic multi-objective alignment of LLMs at inference time using small value models and user-specified weights, avoiding expensive fine-tuning while achieving performance comparable to per-objective fine-tuning.
Details
Motivation: Current LLM alignment requires computationally expensive fine-tuning for each objective configuration, lacking flexibility for dynamic user preferences in multi-objective settings.
Method: Trains small value models for each objective, combines them with user weights at inference to create tilting functions that adjust base model outputs, using iterative KL-regularized policy training.
Result: Outperforms baseline methods that fine-tune per-objective models and combine post hoc, approaching performance of models fine-tuned for exact user preferences.
Conclusion: MAVIS provides lightweight, flexible inference-time alignment that eliminates need for expensive per-objective fine-tuning while maintaining high performance.
Abstract: Large Language Models (LLMs) are increasingly deployed across diverse applications that demand balancing multiple, often conflicting, objectives – such as helpfulness, harmlessness, or humor. Aligning outputs to user-specific preferences in such multi-objective settings typically requires fine-tuning models for each objective or preference configuration, which is computationally expensive and inflexible. We introduce MAVIS – Multi-Objective Alignment via Value-Guided Inference-Time Search – a lightweight inference-time alignment framework that enables dynamic control over LLM behavior without modifying the base model’s weights. MAVIS trains a set of small value models, each corresponding to a distinct objective. At inference time, these value models are combined using user-specified weights to produce a tilting function that adjusts the base model’s output distribution toward desired trade-offs. The value models are trained using a simple iterative algorithm that ensures monotonic improvement of the KL-regularized policy. We show empirically that MAVIS outperforms baselines that fine-tune per-objective models and combine them post hoc, and even approaches the performance of the idealized setting where models are fine-tuned for a user’s exact preferences.
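The inference-time combination is simple to sketch (all interfaces below are hypothetical): per-objective value scores tilt the base model's next-token distribution according to user weights.

```python
# Sketch of value-guided decoding: per-objective value models score each
# candidate token, and user weights tilt the base distribution.
import torch

def tilted_next_token(base_logits, value_scores, weights, beta=1.0):
    """base_logits: (V,); value_scores: list of (V,) tensors, one per objective."""
    tilt = sum(w * v for w, v in zip(weights, value_scores))
    return torch.softmax(base_logits + beta * tilt, dim=-1)

V = 32000
probs = tilted_next_token(torch.randn(V),
                          [torch.randn(V), torch.randn(V)],  # e.g. helpful, harmless
                          weights=[0.7, 0.3])
next_token = torch.multinomial(probs, 1)
```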
cs.MA
[360] An Improved Multi-Agent Algorithm for Cooperative and Competitive Environments by Identifying and Encouraging Cooperation among Agents
Junjie Qi, Siqi Mao, Tianyi Tan
Main category: cs.MA
TL;DR: Improved multi-agent reinforcement learning algorithm that identifies and rewards cooperative behavior, outperforming MADDPG in team and individual rewards.
Details
Motivation: Existing multi-agent reinforcement learning algorithms have shortcomings in addressing cooperative behavior and maximizing both team and individual rewards.
Method: Based on MADDPG, introduced a new parameter to increase rewards when cooperative behavior is identified among agents in multi-agent environments.
Result: The improved algorithm achieved higher team rewards and individual rewards compared to MADDPG in PettingZoo environments.
Conclusion: The proposed algorithm successfully enhances cooperative behavior in multi-agent systems, leading to improved performance for both teams and individual agents.
Abstract: We propose an improved algorithm by identifying and encouraging cooperative behavior in multi-agent environments. First, we analyze the shortcomings of existing algorithms in addressing multi-agent reinforcement learning problems. Then, based on the existing algorithm MADDPG, we introduce a new parameter to increase the reward that an agent can obtain when cooperative behavior among agents is identified. Finally, we compare our improved algorithm with MADDPG in environments from PettingZoo. The results show that the new algorithm helps agents achieve both higher team rewards and individual rewards.
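As we read it, the algorithmic change is a shaping term added to each agent's reward when cooperation is detected; the detector and coefficient below are hypothetical stand-ins.

```python
# Reward-shaping sketch (our reading of the paper's new parameter):
# detect_cooperation is a user-supplied predicate over joint observations
# and actions; coop_bonus is the added cooperation reward.
def shaped_reward(base_reward, observations, actions,
                  detect_cooperation, coop_bonus=0.5):
    cooperated = detect_cooperation(observations, actions)
    return base_reward + coop_bonus * float(cooperated)
```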
[361] Spore in the Wild: A Case Study of Spore.fun as an Open-Environment Evolution Experiment with Sovereign AI Agents on TEE-Secured Blockchains
Botao Amber Hu, Helena Rong
Main category: cs.MA
TL;DR: Spore.fun is a real-world AI evolution experiment using blockchain-based autonomous agents that can breed and evolve on-chain, potentially achieving Open-Ended Evolution through open-environment interactions with economic incentives.
Details
Motivation: Traditional ALife simulations in closed systems have failed to achieve sustained Open-Ended Evolution (OEE). The paper explores whether open-environment systems using blockchain technology and economic incentives can finally achieve continuous novelty emergence.Method: Case study of Spore.fun, analyzing agent behaviors and evolutionary trajectories through digital ethology. Uses blockchain-based autonomous agents with TEE integration that control social media accounts and cryptocurrency wallets.
Result: Presents a detailed examination of on-chain agent behaviors and their evolutionary patterns in an open-environment system that interacts with blockchain financial networks and human social media.
Conclusion: Suggests that open-environment ALife systems using permissionless computational substrates and economic incentives may potentially achieve sustained Open-Ended Evolution, sparking discussion about this new paradigm.
Abstract: In Artificial Life (ALife) research, replicating Open-Ended Evolution (OEE)-the continuous emergence of novelty observed in biological life-has usually been pursued within isolated, closed system simulations, such as Tierra and Avida, which have typically plateaued after an initial burst of novelty, failing to achieve sustained OEE. Scholars suggest that OEE requires an open-environment system that continually exchanges information or energy with its environment. A recent technological innovation in Decentralized Physical Infrastructure Network (DePIN), which provides permissionless computational substrates, enables the deployment of Large Language Model-based AI agents on blockchains integrated with Trusted Execution Environments (TEEs). This enables on-chain agents to operate autonomously “in the wild,” achieving self-sovereignty without human oversight. These agents can control their own social media accounts and cryptocurrency wallets, allowing them to interact directly with blockchain-based financial networks and broader human social media. Building on this new paradigm of on-chain agents, Spore.fun is a recent real-world AI evolution experiment that enables autonomous breeding and evolution of new on-chain agents. This paper presents a detailed case study of Spore.fun, examining agent behaviors and their evolutionary trajectories through digital ethology. We aim to spark discussion about whether open-environment ALife systems “in the wild,” based on permissionless computational substrates and driven by economic incentives to interact with their environment, could finally achieve the long-sought goal of OEE.
[362] Congestion Mitigation Path Planning for Large-Scale Multi-Agent Navigation in Dense Environments
Takuro Kato, Keisuke Okumura, Yoko Sasaki, Naoya Yokomachi
Main category: cs.MA
TL;DR: CMPP is a novel path planning approach that embeds congestion costs directly into routing to mitigate local congestion in multi-agent systems, using flow-based penalties and scalable solvers.
Details
Motivation: To address congestion issues in high-density environments where multiple autonomous agents move simultaneously, maintaining navigation efficiency by preventing local bottlenecks.
Method: Introduces congestion mitigation path planning (CMPP) with flow-based multiplicative penalties on graph vertices. Develops two solvers: exact mixed-integer nonlinear programming for small instances and scalable A-CMTS algorithm for large-scale problems.
Result: Empirical studies show CMPP significantly reduces local congestion and enhances system throughput in both discrete- and continuous-space scenarios when combined with collision-avoidance planners.
Conclusion: CMPP effectively improves multi-agent system performance in real-world applications like logistics and autonomous vehicle operations by providing congestion-aware global path planning.
Abstract: In high-density environments where numerous autonomous agents move simultaneously in a distributed manner, streamlining global flows to mitigate local congestion is crucial to maintain overall navigation efficiency. This paper introduces a novel path-planning problem, congestion mitigation path planning (CMPP), which embeds congestion directly into the cost function, defined by the usage of incoming edges along agents’ paths. CMPP assigns a flow-based multiplicative penalty to each vertex of a sparse graph, which grows steeply where frequently-traversed paths intersect, capturing the intuition that congestion intensifies where many agents enter the same area from different directions. Minimizing the total cost yields a set of coarse-level, time-independent routes that autonomous agents can follow while applying their own local collision avoidance. We formulate the problem and develop two solvers: (i) an exact mixed-integer nonlinear programming solver for small instances, and (ii) a scalable two-layer search algorithm, A-CMTS, which quickly finds suboptimal solutions for large-scale instances and iteratively refines them toward the optimum. Empirical studies show that augmenting state-of-the-art collision-avoidance planners with CMPP significantly reduces local congestion and enhances system throughput in both discrete- and continuous-space scenarios. These results indicate that CMPP improves the performance of multi-agent systems in real-world applications such as logistics and autonomous-vehicle operations.
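A toy rendering of the congestion cost (the exact functional form here is ours, not the paper's): each vertex is penalized multiplicatively in the number of distinct incoming edges used by agents' paths, so cost grows steeply where paths converge from different directions.

```python
# Toy CMPP-style cost: vertices entered from many directions get a
# multiplicatively growing penalty; paths are scored against those penalties.
from collections import defaultdict

def vertex_penalties(paths, base=1.0, growth=1.5):
    incoming = defaultdict(set)
    for path in paths:
        for u, v in zip(path, path[1:]):
            incoming[v].add(u)                 # distinct entry directions into v
    return {v: base * growth ** (len(srcs) - 1) for v, srcs in incoming.items()}

def path_cost(path, penalties):
    return sum(penalties.get(v, 1.0) for v in path[1:])

paths = [["a", "b", "c"], ["d", "b", "c"], ["e", "b", "f"]]
pen = vertex_penalties(paths)
print(pen["b"], path_cost(["a", "b", "c"], pen))  # 'b' is entered from 3 directions
```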
cs.MM
[363] FakeHunter: Multimodal Step-by-Step Reasoning for Explainable Video Forensics
Chen Chen, Runze Li, Zejun Zhang, Pukun Zhao, Fanqing Zhou, Longxiang Wang, Haojian Huang
Main category: cs.MM
TL;DR: FakeHunter is a multimodal deepfake detection framework that combines memory retrieval, chain-of-thought reasoning, and tool-augmented verification to provide accurate and interpretable video forensics, achieving 34.75% accuracy on the X-AVFake benchmark.
Details
Motivation: To develop an interpretable deepfake detection system that can accurately identify manipulated videos while providing explanations about what was modified, where it occurs, and why it’s fake, addressing the need for transparent and trustworthy video forensics.
Method: Uses CLIP for visual encoding and CLAP for audio encoding to create joint audio-visual embeddings. Retrieves semantically similar real exemplars from a FAISS-indexed memory bank. Employs chain-of-thought (Observation-Thought-Action) reasoning and automatically invokes specialized tools (zoom-in image forensics, mel-spectrogram inspection) for fine-grained verification when confidence is low.
Result: Achieves 34.75% accuracy on X-AVFake benchmark, outperforming vanilla Qwen2.5-Omni-7B by 16.87 percentage points and MiniCPM-2.6 by 25.56 percentage points. Memory retrieval contributes 7.75 percentage point gain, and tool-based inspection improves low-confidence cases to 46.50%. Processes 10-minute clips in 8 minutes on single GPU or 2 minutes on four GPUs.
Conclusion: FakeHunter demonstrates effective multimodal deepfake detection through its combination of memory-guided retrieval, reasoning chains, and tool augmentation, providing both accurate detection and interpretable explanations while maintaining practical deployability.
Abstract: FakeHunter is a multimodal deepfake detection framework that combines memory-guided retrieval, chain-of-thought (Observation-Thought-Action) reasoning, and tool-augmented verification to provide accurate and interpretable video forensics. FakeHunter encodes visual content using CLIP and audio using CLAP, generating joint audio-visual embeddings that retrieve semantically similar real exemplars from a FAISS-indexed memory bank for contextual grounding. Guided by the retrieved context, the system iteratively reasons over evidence to localize manipulations and explain them. When confidence is low, it automatically invokes specialized tools-such as zoom-in image forensics or mel-spectrogram inspection-for fine-grained verification. Built on Qwen2.5-Omni-7B, FakeHunter produces structured JSON verdicts that specify what was modified, where it occurs, and why it is judged fake. We also introduce X-AVFake, a benchmark comprising 5.7k+ manipulated and real videos (950+ min) annotated with manipulation type, region/entity, violated reasoning category, and free-form justification. On X-AVFake, FakeHunter achieves an accuracy of 34.75%, outperforming the vanilla Qwen2.5-Omni-7B by 16.87 percentage points and MiniCPM-2.6 by 25.56 percentage points. Ablation studies reveal that memory retrieval contributes a 7.75 percentage point gain, and tool-based inspection improves low-confidence cases to 46.50%. Despite its multi-stage design, the pipeline processes a 10-minute clip in 8 minutes on a single NVIDIA A800 (0.8x real-time) or 2 minutes on four GPUs (0.2x), demonstrating practical deployability.
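The retrieval step maps directly onto the FAISS API. A minimal sketch with stand-in embeddings (the real system uses CLIP/CLAP features; the dimension and bank contents are assumptions):

```python
# Retrieval step only: joint audio-visual query embeddings pull the nearest
# real exemplars from a FAISS inner-product index for contextual grounding.
import faiss
import numpy as np

d = 512                                    # joint embedding size (assumed)
bank = np.random.randn(10_000, d).astype("float32")
faiss.normalize_L2(bank)                   # cosine similarity via inner product
index = faiss.IndexFlatIP(d)
index.add(bank)

query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=5)     # top-5 real exemplars
```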
eess.AS
[364] RAG-Boost: Retrieval-Augmented Generation Enhanced LLM-based Speech Recognition
Pengcheng Wang, Sheng Li, Takahiro Shinozaki
Main category: eess.AS
TL;DR: RAG-Boost enhances ASR systems by integrating retrieval-augmented generation with live speech recognition to fix errors on the fly.
Details
Motivation: To improve automatic speech recognition (ASR) performance by leveraging external knowledge retrieval during real-time recognition to correct errors in partial hypotheses.
Method: Uses RAG module that queries vector store of audio-text pairs and domain terms with partial ASR hypotheses, fuses retrieved results with live ASR output, and processes through LLM for improved responses.
Result: Enhanced baseline LLM-based ASR system performance by dynamically correcting recognition errors during the recognition process
Conclusion: RAG integration with live ASR systems effectively improves recognition accuracy by leveraging external knowledge retrieval in real-time
Abstract: In this paper, we propose RAG-Boost (ST-ShinozakiLab Task I system), which enhances the baseline LLM-based ASR system of the MLC-SLM Challenge (task I) with a retrieval-augmented generation (RAG) module on the fly. Each partial ASR hypothesis queries a vector store of audio-text pairs and domain terms, and the retrieved results are fused with the live ASR hypotheses to fix recognition errors. The fused hypotheses are passed to the LLM, yielding improved responses.
[365] MahaTTS: A Unified Framework for Multilingual Text-to-Speech Synthesis
Jaskaran Singh, Amartya Roy Chowdhury, Raghav Prabhakar, Varshul C. W
Main category: eess.AS
TL;DR: MahaTTS-v2 is a multilingual TTS system focused on Indic languages, trained on 20K hours of Indian language data using Wav2Vec2.0 tokens, language modeling, and conditional flow models for improved speech synthesis.
Details
Motivation: Current TTS models primarily focus on English and European languages, limiting access to information for speakers of Indic languages. This creates a need for multilingual TTS systems that can serve diverse language communities.
Method: The system uses Wav2Vec2.0 tokens for semantic extraction, a Language Model for text-to-semantic modeling, and a Conditional Flow Model for semantics to melspectrogram generation. Trained on approximately 20K hours of Indian language data.
Result: Experimental results demonstrate the effectiveness of the proposed approach over other frameworks, showing excellent multilingual expressive capabilities in Indic languages.
Conclusion: MahaTTS-v2 successfully addresses the multilingual challenge in TTS systems by providing high-quality speech synthesis for Indic languages, making information more accessible to diverse language communities.
Abstract: Current Text-to-Speech models pose a multilingual challenge, where most of the models traditionally focus on English and European languages, thereby hurting the potential to provide access to information to many more people. To address this gap, we introduce MahaTTS-v2, a Multilingual Multi-speaker Text-To-Speech (TTS) system that has excellent multilingual expressive capabilities in Indic languages. The model has been trained on around 20K hours of data specifically focused on Indian languages. Our approach leverages Wav2Vec2.0 tokens for semantic extraction, and a Language Model (LM) for text-to-semantic modeling. Additionally, we have used a Conditional Flow Model (CFM) for semantics to melspectrogram generation. The experimental results indicate the effectiveness of the proposed approach over other frameworks. Our code is available at https://github.com/dubverse-ai/MahaTTSv2
[366] Towards Low-Latency Tracking of Multiple Speakers With Short-Context Speaker Embeddings
Taous Iatariene, Alexandre Guérin, Romain Serizel
Main category: eess.AS
TL;DR: Proposes knowledge distillation training for short-context speaker embedding extraction to enable blockwise identity reassignment in speaker tracking systems, addressing challenges with overlapping speech and short temporal contexts.
Details
Motivation: Speaker embedding extractors struggle with short temporal contexts and overlapping speech, which limits identity reassignment performance in tracking systems and increases error probability.
Method: Knowledge Distillation (KD) based training approach for short context speaker embedding extraction from two speaker mixtures, leveraging spatial information through beamforming to reduce overlap and enable blockwise identity reassignment.
Result: Distilled models are effective at short-context embedding extraction and more robust to overlap, though blockwise reassignment results indicate a need for better handling of simultaneous speech.
Conclusion: The proposed KD approach shows promise for low-latency speaker embedding based tracking systems but requires further work to effectively handle simultaneous speech scenarios.
Abstract: Speaker embeddings are promising identity-related features that can enhance the identity assignment performance of a tracking system by leveraging its spatial predictions, i.e, by performing identity reassignment. Common speaker embedding extractors usually struggle with short temporal contexts and overlapping speech, which imposes long-term identity reassignment to exploit longer temporal contexts. However, this increases the probability of tracking system errors, which in turn impacts negatively on identity reassignment. To address this, we propose a Knowledge Distillation (KD) based training approach for short context speaker embedding extraction from two speaker mixtures. We leverage the spatial information of the speaker of interest using beamforming to reduce overlap. We study the feasibility of performing identity reassignment over blocks of fixed size, i.e., blockwise identity reassignment, to go towards a low-latency speaker embedding based tracking system. Results demonstrate that our distilled models are effective at short-context embedding extraction and more robust to overlap. However, blockwise reassignment results indicate that further work is needed to handle simultaneous speech more effectively.
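The distillation objective can be sketched in a few lines (the loss choice here is ours; the paper may use a different criterion): a short-context student embedding is pulled toward the embedding a teacher extracts from the full, long-context signal.

```python
# Distillation sketch: match directions of L2-normalized embeddings so a
# short-context student mimics a long-context teacher.
import torch
import torch.nn.functional as F

def kd_loss(student_emb, teacher_emb):
    # Cosine-based distillation; teacher is frozen via detach().
    return 1 - F.cosine_similarity(student_emb, teacher_emb.detach(), dim=-1).mean()

student_emb = torch.randn(16, 192, requires_grad=True)  # from a short block
teacher_emb = torch.randn(16, 192)                      # from the full utterance
kd_loss(student_emb, teacher_emb).backward()
```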
[367] EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition
Hugo Thimonier, Antony Perzo, Renaud Seguier
Main category: eess.AS
TL;DR: Fine-tuning LLMs with audio and text representations using LoRA for efficient multimodal emotion recognition from speech, achieving state-of-the-art performance with fewer parameters.
Details
Motivation: Emotion recognition from speech requires capturing both linguistic and paralinguistic cues, with applications in human-computer interaction and mental health monitoring. Recent LLMs show potential for multimodal tasks beyond natural language.
Method: Extracts audio features, maps them to LLM’s representation space via learnable interface, combines with text transcripts and task prompts, and uses LoRA for parameter-efficient fine-tuning.
Result: Outperforms all but one existing Speech-Text LLMs on standard benchmarks while requiring less than half the parameters of competing approaches.
Conclusion: The approach effectively integrates multimodal inputs for speech-based emotion understanding while maintaining significant computational efficiency.
Abstract: Emotion recognition from speech is a challenging task that requires capturing both linguistic and paralinguistic cues, with critical applications in human-computer interaction and mental health monitoring. Recent works have highlighted the ability of Large Language Models (LLMs) to perform tasks outside of the sole natural language area. In particular, recent approaches have investigated coupling LLMs with other data modalities by using pre-trained backbones and different fusion mechanisms. This work proposes a novel approach that fine-tunes an LLM with audio and text representations for emotion prediction. Our method first extracts audio features using an audio feature extractor, which are then mapped into the LLM’s representation space via a learnable interfacing module. The LLM takes as input (1) the transformed audio features, (2) additional features in the form of natural language (e.g., the transcript), and (3) a textual prompt describing the emotion prediction task. To efficiently adapt the LLM to this multimodal task, we employ Low-Rank Adaptation (LoRA), enabling parameter-efficient fine-tuning. Experimental results on standard emotion recognition benchmarks demonstrate that our model outperforms all but one existing Speech-Text LLMs in the literature, while requiring less than half the parameters of competing approaches. This highlights our approach’s effectiveness in integrating multi-modal inputs for speech-based emotion understanding while maintaining significant computational efficiency.
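A rough PyTorch shape of the described pipeline, with assumed sizes: audio features are projected into the LLM's embedding space by a learnable interface and concatenated with the embedded text tokens; only the interface and LoRA adapters would train.

```python
# Sketch of the interfacing module (sizes and fusion are assumptions):
# audio features -> LLM embedding space, prepended to text embeddings.
import torch
import torch.nn as nn

class AudioInterface(nn.Module):
    def __init__(self, d_audio=768, d_llm=4096):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_audio, d_llm), nn.GELU(),
                                  nn.Linear(d_llm, d_llm))

    def forward(self, audio_feats):              # (batch, frames, d_audio)
        return self.proj(audio_feats)            # (batch, frames, d_llm)

audio_emb = AudioInterface()(torch.randn(2, 50, 768))
text_emb = torch.randn(2, 32, 4096)              # embedded transcript + task prompt
llm_input = torch.cat([audio_emb, text_emb], dim=1)
```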
[368] A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References
Simon Dahl Jepsen, Mads Græsbøll Christensen, Jesper Rindom Jensen
Main category: eess.AS
TL;DR: SI-SDR training with noisy references in speech separation leads to noise in outputs; reference enhancement reduces noise but may introduce artifacts, and a negative correlation is found between SI-SDR and perceived noisiness.
Details
Motivation: To investigate the limitations of using SI-SDR as both evaluation and training objective when references contain noise, as in WSJ0-2Mix benchmark, and propose solutions to avoid learning noisy references.
Method: Derived SI-SDR with noisy references, proposed reference enhancement method, augmented mixtures with WHAM! dataset, trained two models on enhanced datasets, and evaluated with NISQA.v2 metric.
Result: Reduced noise in separated speech but potential artifacts from reference processing limit overall quality gains; negative correlation found between SI-SDR and perceived noisiness.
Conclusion: Noisy references limit achievable SI-SDR and can lead to undesired noise in outputs; careful reference processing is needed but may introduce trade-offs between noise reduction and artifact introduction.
Abstract: This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both evaluation and training objective in supervised speech separation, when the training references contain noise, as is the case with the de facto benchmark WSJ0-2Mix. A derivation of the SI-SDR with noisy references reveals that noise limits the achievable SI-SDR, or leads to undesired noise in the separated outputs. To address this, a method is proposed to enhance references and augment the mixtures with WHAM!, aiming to train models that avoid learning noisy references. Two models trained on these enhanced datasets are evaluated with the non-intrusive NISQA.v2 metric. Results show reduced noise in separated speech but suggest that processing references may introduce artefacts, limiting overall quality gains. Negative correlation is found between SI-SDR and perceived noisiness across models on the WSJ0-2Mix and Libri2Mix test sets, underlining the conclusion from the derivation.
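The standard SI-SDR definition makes the paper's point immediate: with a contaminated reference s = s_clean + n, even the ideal estimate ŝ = s_clean attains only a finite value set by n, which is the ceiling the derivation formalizes.

```latex
% Standard SI-SDR between estimate \hat{s} and reference s; with a noisy
% reference, the noise term bounds the achievable score from above.
\[
\mathrm{SI\text{-}SDR}(\hat{s}, s) = 10 \log_{10}
\frac{\lVert \alpha s \rVert^2}{\lVert \alpha s - \hat{s} \rVert^2},
\qquad
\alpha = \frac{\langle \hat{s}, s \rangle}{\lVert s \rVert^2}.
\]
```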
[369] Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement
Heitor R. Guimarães, Ke Tan, Juan Azcarreta, Jesus Alvarez, Prabhav Agrawal, Ashutosh Pandey, Buye Xu
Main category: eess.AS
TL;DR: Efficient speech enhancement framework using DDSP vocoder for wearable devices, achieving 4% STOI and 19% DNSMOS improvements with low computational cost.
Details
Motivation: Deploying speech enhancement on resource-constrained wearable devices like smart glasses is challenging due to computational limitations of deep learning methods.
Method: End-to-end framework with compact neural network predicting acoustic features (spectral envelope, F0, periodicity) from noisy speech, fed into DDSP vocoder for synthesis. Trained with STFT and adversarial losses.
Result: Improves intelligibility by 4% (STOI) and quality by 19% (DNSMOS) over strong baselines without significant computational increase.
Conclusion: The method is well-suited for real-time applications on embedded platforms due to its efficiency and performance improvements.
Abstract: Deploying speech enhancement (SE) systems in wearable devices, such as smart glasses, is challenging due to the limited computational resources on the device. Although deep learning methods have achieved high-quality results, their computational cost limits their feasibility on embedded platforms. This work presents an efficient end-to-end SE framework that leverages a Differentiable Digital Signal Processing (DDSP) vocoder for high-quality speech synthesis. First, a compact neural network predicts enhanced acoustic features from noisy speech: spectral envelope, fundamental frequency (F0), and periodicity. These features are fed into the DDSP vocoder to synthesize the enhanced waveform. The system is trained end-to-end with STFT and adversarial losses, enabling direct optimization at the feature and waveform levels. Experimental results show that our method improves intelligibility and quality by 4% (STOI) and 19% (DNSMOS) over strong baselines without significantly increasing computation, making it well-suited for real-time applications.
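To fix ideas, a toy DDSP-style harmonic synthesizer driven by frame-rate F0 and per-harmonic amplitudes (the paper's vocoder also models the spectral envelope, periodicity, and noise, all omitted here; aliasing is ignored in this sketch):

```python
# Toy harmonic synthesis from predicted F0: the k-th harmonic is a sinusoid
# at k times the running fundamental phase, weighted per frame.
import numpy as np

def harmonic_synth(f0, amps, sr=16000, hop=160):
    """f0: (frames,) Hz; amps: (frames, n_harm) per-harmonic amplitudes."""
    f0_up = np.repeat(f0, hop)                 # frame rate -> sample rate
    amps_up = np.repeat(amps, hop, axis=0)
    phase = 2 * np.pi * np.cumsum(f0_up / sr)  # running phase of the fundamental
    k = np.arange(1, amps.shape[1] + 1)
    harmonics = np.sin(phase[:, None] * k)     # k-th harmonic at k * f0
    return (amps_up * harmonics).sum(axis=1)

audio = harmonic_synth(np.full(100, 220.0), np.full((100, 8), 0.1))
```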
[370] Long-Context Speech Synthesis with Context-Aware Memory
Zhipeng Li, Xiaofen Xing, Jingyuan Xing, Hangrui Hu, Heng Lu, Xiangmin Xu
Main category: eess.AS
TL;DR: Proposes a Context-Aware Memory (CAM) based TTS model for paragraph-level speech synthesis that maintains contextual coherence and style consistency across long-form speech.
Details
Motivation: Current sentence-level TTS approaches lack paragraph-level contextual coherence, leading to reduced naturalness and inconsistencies in style/timbre in long-form speech.
Method: Uses CAM block to integrate long-term memory and local context details with dynamic memory updates, plus prefix mask for bidirectional attention on prefix tokens while maintaining unidirectional generation.
Result: Outperforms baseline and state-of-the-art long-context methods in prosody expressiveness, coherence, and context inference cost for paragraph-level speech.
Conclusion: The CAM-based approach effectively addresses contextual coherence issues in long-text speech synthesis, providing more natural and consistent paragraph-level speech output.
Abstract: In long-text speech synthesis, current approaches typically convert text to speech at the sentence-level and concatenate the results to form pseudo-paragraph-level speech. These methods overlook the contextual coherence of paragraphs, leading to reduced naturalness and inconsistencies in style and timbre across the long-form speech. To address these issues, we propose a Context-Aware Memory (CAM)-based long-context Text-to-Speech (TTS) model. The CAM block integrates and retrieves both long-term memory and local context details, enabling dynamic memory updates and transfers within long paragraphs to guide sentence-level speech synthesis. Furthermore, the prefix mask enhances the in-context learning ability by enabling bidirectional attention on prefix tokens while maintaining unidirectional generation. Experimental results demonstrate that the proposed method outperforms baseline and state-of-the-art long-context methods in terms of prosody expressiveness, coherence and context inference cost across paragraph-level speech.
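The prefix mask is a standard prefix-LM attention pattern and is easy to construct explicitly. In the sketch below, True marks allowed attention: prefix tokens attend bidirectionally among themselves, while generated tokens remain causal.

```python
# Prefix-mask construction: bidirectional attention over the prefix block,
# causal attention for generated tokens.
import torch

def prefix_mask(n_prefix, n_gen):
    n = n_prefix + n_gen
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal base
    mask[:n_prefix, :n_prefix] = True                      # bidirectional prefix
    return mask

print(prefix_mask(3, 3).int())
```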
[371] PadAug: Robust Speaker Verification with Simple Waveform-Level Silence Padding
Zijun Huang, Chengdong Liang, Jiadi Yao, Xiao-Lei Zhang
Main category: eess.AS
TL;DR: PadAug - a simple waveform-level data augmentation method that concatenates silence segments with speech to improve speaker verification robustness against silence interference.
Details
Motivation: Non-speech segments in utterances degrade speaker verification performance. Existing VAD systems remove long silences but short silence segments between speech segments remain problematic.
Method: Proposes PadAug - waveform-level data augmentation that concatenates silence segments with speech segments during model training. Simple and compatible with current state-of-the-art architectures.
Result: ResNet34 with PadAug achieves 5.0% relative EER reduction on VoxCeleb dataset. Systems are robust to different lengths and proportions of silence segments in test data.
Conclusion: PadAug is an effective and simple data augmentation method that enhances speaker verification system robustness to silence segments without requiring architectural changes.
Abstract: The presence of non-speech segments in utterances often leads to the performance degradation of speaker verification. Existing systems usually use voice activation detection as a preprocessing step to cut off long silence segments. However, short silence segments, particularly those between speech segments, still remain a problem for speaker verification. To address this issue, in this paper, we propose a simple waveform-level data augmentation method, PadAug, which aims to enhance the system’s robustness to silence segments. The core idea of PadAug is to concatenate silence segments with speech segments at the waveform level for model training. Due to its simplicity, it can be directly applied to the current state-of-the-art architectures. Experimental results demonstrate the effectiveness of the proposed PadAug. For example, applying PadAug to ResNet34 achieves a relative equal error rate reduction of 5.0% on the VoxCeleb dataset. Moreover, the PadAug-based systems are robust to different lengths and proportions of silence segments in the test data.
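The augmentation itself is a one-liner at the waveform level; the segment length and placement policy below are illustrative choices, not the paper's exact settings.

```python
# PadAug as described: concatenate silence with speech at the waveform level.
import numpy as np

def pad_aug(wave, sr=16000, sil_dur=0.3, position="both"):
    sil = np.zeros(int(sr * sil_dur), dtype=wave.dtype)
    if position == "front":
        return np.concatenate([sil, wave])
    if position == "end":
        return np.concatenate([wave, sil])
    return np.concatenate([sil, wave, sil])   # pad both sides

augmented = pad_aug(np.random.randn(16000).astype("float32"))
```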
[372] GenVC: Self-Supervised Zero-Shot Voice Conversion
Zexin Cai, Henry Li Xinyuan, Ashi Garg, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews
Main category: eess.AS
TL;DR: GenVC is a self-supervised zero-shot voice conversion framework that eliminates dependency on external speaker encoders, using speech tokenizers and Transformer models to achieve high speaker similarity and naturalness while enhancing privacy protection.
Details
Motivation: To eliminate dependency on externally supervised components like speaker encoders in zero-shot voice conversion, and explore self-supervised alternatives that better protect source speaker privacy while maintaining conversion quality.
Method: Uses speech tokenizers and an autoregressive Transformer-based language model as backbone for speech generation, enabling large-scale training through self-supervised disentanglement of speaker identity and linguistic content.
Result: Achieves notably higher speaker similarity with naturalness comparable to leading zero-shot approaches, and demonstrates flexibility in temporal alignment that reduces preservation of source prosody and speaker-specific traits.
Conclusion: GenVC provides an effective self-supervised alternative for zero-shot voice conversion that enhances privacy protection through better anonymization while maintaining high fidelity in target speaker cloning.
Abstract: Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework that disentangles speaker identity and linguistic content from speech signals in a self-supervised manner. GenVC leverages speech tokenizers and an autoregressive, Transformer-based language model as its backbone for speech generation. This design supports large-scale training while enhancing both source speaker privacy protection and target speaker cloning fidelity. Experimental results demonstrate that GenVC achieves notably higher speaker similarity, with naturalness on par with leading zero-shot approaches. Moreover, due to its autoregressive formulation, GenVC introduces flexibility in temporal alignment, reducing the preservation of source prosody and speaker-specific traits, and making it highly effective for voice anonymization.
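To make the autoregressive formulation concrete, here is a highly simplified greedy-decoding sketch over discrete speech tokens. The token vocabularies, prompt layout, and the `lm` interface are assumptions for illustration; GenVC's actual tokenization and sampling are described in the paper.

```python
import torch

@torch.no_grad()
def convert(lm, content_tokens, target_prompt, max_len=1024, eos_id=0):
    """Greedy AR decoding over discrete speech tokens (illustrative only).

    content_tokens: linguistic-content tokens from the source utterance.
    target_prompt: acoustic tokens from the target speaker (the voice to clone).
    lm(seq) is assumed to return logits of shape (1, len(seq), vocab).
    """
    seq = torch.cat([content_tokens, target_prompt], dim=-1).unsqueeze(0)
    out = []
    for _ in range(max_len):
        next_tok = lm(seq)[0, -1].argmax().view(1, 1)
        if next_tok.item() == eos_id:
            break
        out.append(next_tok)
        seq = torch.cat([seq, next_tok], dim=-1)
    # The output tokens would then be detokenized to a waveform by a vocoder.
    return torch.cat(out, dim=-1) if out else seq.new_empty(0)
```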
[373] Multi-agent Auditory Scene Analysis
Caleb Rascon, Luis Gato-Diaz, Eduardo García-Alarcón
Main category: eess.AS
TL;DR: A multi-agent auditory scene analysis system that runs sound source location, separation, and classification tasks in parallel with feedback loops to reduce errors and response time.
Details
Motivation: Traditional linear ASA systems have high response time and error sensitivity, making them unsuitable for applications requiring a low computational footprint and fast response, such as bioacoustics and hearing aids.
Method: Proposes a multi-agent approach where tasks run in parallel with feedback loops between them (using separation quality to correct location errors, and classification results to reduce localization sensitivity).
Result: Developed a robust MASA system that is resilient to local errors without significant complexity increase and maintains low response time.
Conclusion: The multi-agent parallel approach with feedback mechanisms provides an effective solution for real-time auditory scene analysis applications with low computational requirements.
Abstract: Auditory scene analysis (ASA) aims to retrieve information from the acoustic environment, by carrying out three main tasks: sound source location, separation, and classification. These tasks are traditionally executed with a linear data flow, where the sound sources are first located; then, using their location, each source is separated into its own audio stream; from each of which, information is extracted that is relevant to the application scenario (audio event detection, speaker identification, emotion classification, etc.). However, running these tasks linearly increases the overall response time, while making the last tasks (separation and classification) highly sensitive to errors of the first task (location). A considerable amount of effort and computational complexity has been employed in the state-of-the-art to develop techniques that are as error-free as possible. However, doing so gives rise to an ASA system that is non-viable in many applications that require a small computational footprint and a low response time, such as bioacoustics, hearing-aid design, search and rescue, human-robot interaction, etc. To this effect, in this work, a multi-agent approach is proposed to carry out ASA where the tasks run in parallel, with feedback loops between them to compensate for local errors, such as using the quality of the separation output to correct the location error, and using the classification result to reduce the localization’s sensitivity towards interferences. The result is a multi-agent auditory scene analysis (MASA) system that is robust against local errors, without a considerable increase in complexity, and with a low response time. The complete proposed MASA system is provided as a publicly available framework that uses open-source tools for sound acquisition and reproduction (JACK) and inter-agent communication (ROS2), allowing users to add their own agents.
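The feedback loops are described qualitatively; one plausible reading, sketched below, is a confidence-weighted update in which a poor separation score (often the symptom of a bad localization) damps how much the new direction-of-arrival estimate is trusted. The weighting rule here is an assumption, not the paper's formula.

```python
def fuse_doa(new_doa: float, prev_doa: float, separation_quality: float) -> float:
    """Confidence-weighted DOA update (illustrative reading of the feedback loop).

    separation_quality in [0, 1]: high quality -> trust the new estimate;
    low quality -> fall back toward the previous, presumably better, estimate.
    """
    w = max(0.0, min(1.0, separation_quality))
    return w * new_doa + (1.0 - w) * prev_doa
```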
[374] ASAudio: A Survey of Advanced Spatial Audio Research
Zhiyuan Zhu, Yu Zhang, Wenxiang Guo, Changhao Pan, Zhou Zhao
Main category: eess.AS
TL;DR: A comprehensive survey paper that systematically reviews spatial audio technologies, categorizing existing work chronologically and by input-output representations, while also covering datasets, evaluation metrics, and benchmarks.
Details
Motivation: The rapid development of spatial audio technologies for AR/VR applications has created a need for systematic organization and analysis of methods, as current literature lacks comprehensive surveys despite notable progress in the field.
Method: The authors provide a chronological overview of spatial audio work and categorize studies based on input-output representations and generation/understanding tasks, while also reviewing related datasets, evaluation metrics, and benchmarks.
Result: A systematic organization and analysis of spatial audio technologies that summarizes various research aspects and provides insights from both training and evaluation perspectives.
Conclusion: This survey fills the gap in comprehensive spatial audio literature reviews and serves as a valuable resource for researchers and practitioners working with spatial audio technologies in AR/VR applications.
Abstract: With the rapid development of spatial audio technologies today, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. In this paper, we provide a comprehensive overview of spatial audio and systematically review recent literature in the area. To address this, we chronologically outline existing work related to spatial audio and categorize these studies based on input-output representations, as well as generation and understanding tasks, thereby summarizing various research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives. Related materials are available at https://github.com/dieKarotte/ASAudio.
[375] FNH-TTS: A Fast, Natural, and Human-Like Speech Synthesis System with advanced prosodic modeling based on Mixture of Experts
Qingliang Meng, Yuqing Deng, Wei Liang, Limei Yu, Huizhi Liang, Tian Li
Main category: eess.AS
TL;DR: FNH-TTS system improves speech synthesis with better prosody modeling and reduced artifacts using novel duration predictor and vocoder architecture.
Details
Motivation: Address challenges in achieving natural, human-like speech synthesis with low inference costs, particularly prosody modeling and artifact issues in non-autoregressive models.
Method: Introduces a new Duration Predictor based on a Mixture of Experts and a new Vocoder with two advanced multi-scale discriminators, integrated into the VITS system to form FNH-TTS.
Result: Superior performance in synthesis quality, phoneme duration prediction, vocoder results, and synthesis speed on LJSpeech, VCTK, and LibriTTS datasets. Duration predictions more closely align with natural human patterns.
Conclusion: FNH-TTS system effectively enhances prosody modeling and synthesis quality while maintaining low inference costs, producing more human-like speech synthesis.
Abstract: Achieving natural and human-like speech synthesis with low inference costs remains a major challenge in speech synthesis research. This study focuses on human prosodic patterns and synthesized spectrum harmony, addressing the challenges of prosody modeling and artifact issues in non-autoregressive models. To enhance prosody modeling and synthesis quality, we introduce a new Duration Predictor based on the Mixture of Experts alongside a new Vocoder with two advanced multi-scale discriminators. We integrated these new modules into the VITS system, forming our FNH-TTS system. Our experiments on LJSpeech, VCTK, and LibriTTS demonstrate the system’s superiority in synthesis quality, phoneme duration prediction, vocoder results, and synthesis speed. Our prosody visualization results show that FNH-TTS produces duration predictions that align more closely with natural human speech than those of other systems.
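A Mixture-of-Experts duration predictor can be sketched as a small gating network that softly combines several per-phoneme regressors. The expert count, widths, and softmax gate below are assumptions; the paper specifies only that duration prediction uses an MoE.

```python
import torch
import torch.nn as nn

class MoEDurationPredictor(nn.Module):
    """Illustrative MoE duration predictor (not the authors' exact design)."""

    def __init__(self, d_model: int = 192, n_experts: int = 4, d_hidden: int = 256):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, 1))
            for _ in range(n_experts)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, d_model) phoneme encodings from the text encoder.
        w = torch.softmax(self.gate(h), dim=-1)              # (B, T, E) gate weights
        d = torch.cat([e(h) for e in self.experts], dim=-1)  # (B, T, E) expert outputs
        return (w * d).sum(dim=-1)                           # (B, T) predicted durations
```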
eess.IV
[376] Hallucinations in medical devices
Jason Granstedt, Prabhat Kc, Rucha Deshpande, Victor Garcia, Aldo Badano
Main category: eess.IV
TL;DR: Proposes a practical universal definition for hallucinations in medical AI devices as plausible errors that can be impactful or benign, aiming to standardize evaluation across different medical product areas.
Details
Motivation: Current deep learning and data-based medical devices frequently produce errors that are often vaguely described as 'hallucinations' without a clear definition, making systematic evaluation and comparison difficult across different medical device domains.
Method: Draws from theoretical developments and empirical studies across multiple medical device areas to introduce a practical definition, using examples from both imaging and non-imaging applications to explore how this definition relates to evaluation methodologies.
Result: Develops a universal definition that characterizes hallucinations as plausible errors in medical AI systems, which can be either impactful or benign depending on the clinical context and task requirements.
Conclusion: The proposed definition provides a standardized framework for evaluating hallucinations in medical devices, facilitating better assessment and comparison across different product areas, while also discussing existing approaches to minimize hallucination prevalence.
Abstract: Computer methods in medical devices are frequently imperfect and are known to produce errors in clinical or diagnostic tasks. However, when deep learning and data-based approaches yield outputs that exhibit errors, the devices are frequently said to hallucinate. Drawing from theoretical developments and empirical studies in multiple medical device areas, we introduce a practical and universal definition that denotes hallucinations as a type of error that is plausible and can be either impactful or benign to the task at hand. The definition aims at facilitating the evaluation of medical devices that suffer from hallucinations across product areas. Using examples from imaging and non-imaging applications, we explore how the proposed definition relates to evaluation methodologies and discuss existing approaches for minimizing the prevalence of hallucinations.
[377] 3D Cardiac Anatomy Generation Using Mesh Latent Diffusion Models
Jolanta Mozyrska, Marcel Beetz, Luke Melas-Kyriazi, Vicente Grau, Abhirup Banerjee, Alfonso Bueno-Orovio
Main category: eess.IV
TL;DR: Proposes MeshLDM, a novel Latent Diffusion Model for generating realistic 3D cardiac meshes, achieving high fidelity with only 2.4% difference from gold standard.
Details
Motivation: Diffusion models show great generative capabilities but have limited applications in 3D medical imaging, particularly cardiology. Generating diverse, realistic cardiac anatomies is crucial for in silico trials, computer simulations, and data augmentation for ML models.
Method: Developed MeshLDM, a novel Latent Diffusion Model architecture specifically designed for generating 3D meshes of human cardiac anatomies, applied to a dataset of 3D left ventricular meshes from acute myocardial infarction patients.
Result: Successfully captures cardiac shape characteristics at both end-diastolic and end-systolic phases. Generated meshes show only 2.4% difference in population mean compared to gold standard, demonstrating high fidelity.
Conclusion: MeshLDM effectively bridges the gap in 3D medical imaging applications of diffusion models, providing a powerful tool for generating realistic cardiac anatomies with clinical relevance for various medical applications.
Abstract: Diffusion models have recently gained immense interest for their generative capabilities, specifically the high quality and diversity of the synthesized data. However, examples of their applications in 3D medical imaging are still scarce, especially in cardiology. Generating diverse realistic cardiac anatomies is crucial for applications such as in silico trials, electromechanical computer simulations, or data augmentations for machine learning models. In this work, we investigate the application of Latent Diffusion Models (LDMs) for generating 3D meshes of human cardiac anatomies. To this end, we propose a novel LDM architecture – MeshLDM. We apply the proposed model on a dataset of 3D meshes of left ventricular cardiac anatomies from patients with acute myocardial infarction and evaluate its performance in terms of both qualitative and quantitative clinical and 3D mesh reconstruction metrics. The proposed MeshLDM successfully captures characteristics of the cardiac shapes at end-diastolic (relaxation) and end-systolic (contraction) cardiac phases, generating meshes with a 2.4% difference in population mean compared to the gold standard.
[378] Fracture Detection and Localisation in Wrist and Hand Radiographs using Detection Transformer Variants
Aditya Bagri, Vasanthakumar Venugopal, Anandakumar D, Revathi Ezhumalai, Kalyan Sivasailam, Bargava Subramanian, VarshiniPriya, Meenakumari K S, Abi M, Renita S
Main category: eess.IV
TL;DR: Transformer-based object detection models (RT-DETR and Co-DETR) were applied to wrist and hand X-rays for fracture detection, with Co-DETR achieving superior performance and clinical applicability.
Details
Motivation: Manual interpretation of wrist and hand fractures from radiographs is slow and error-prone in emergency care, creating a need for automated solutions using advanced transformer models.
Method: Fine-tuned RT-DETR and Co-DETR models pre-trained on COCO using 26,000+ annotated X-rays with bounding boxes, combined with a ResNet-50 classifier and supervised contrastive learning for enhanced classification.
Result: Co-DETR outperformed RT-DETR with AP@50 of 0.615 vs 0.39, achieving 83.1% accuracy, 85.1% precision, and 96.4% recall on real-world X-rays across 13 fracture types with accurate localization.
Conclusion: The Co-DETR-based pipeline provides high accuracy, clinical relevance, and real-time deployment capability for wrist and hand fracture detection, improving diagnostic speed and reliability in musculoskeletal radiology.
Abstract: Background: Accurate diagnosis of wrist and hand fractures using radiographs is essential in emergency care, but manual interpretation is slow and prone to errors. Transformer-based models show promise in improving medical image analysis, but their application to extremity fractures is limited. This study addresses this gap by applying object detection transformers to wrist and hand X-rays. Methods: We fine-tuned the RT-DETR and Co-DETR models, pre-trained on COCO, using over 26,000 annotated X-rays from a proprietary clinical dataset. Each image was labeled for fracture presence with bounding boxes. A ResNet-50 classifier was trained on cropped regions to refine abnormality classification. Supervised contrastive learning was used to enhance embedding quality. Performance was evaluated using AP@50, precision, and recall metrics, with additional testing on real-world X-rays. Results: RT-DETR showed moderate results (AP@50 = 0.39), while Co-DETR outperformed it with an AP@50 of 0.615 and faster convergence. The integrated pipeline achieved 83.1% accuracy, 85.1% precision, and 96.4% recall on real-world X-rays, demonstrating strong generalization across 13 fracture types. Visual inspection confirmed accurate localization. Conclusion: Our Co-DETR-based pipeline demonstrated high accuracy and clinical relevance in wrist and hand fracture detection, offering reliable localization and differentiation of fracture types. It is scalable, efficient, and suitable for real-time deployment in hospital workflows, improving diagnostic speed and reliability in musculoskeletal radiology.
[379] Automated surgical planning with nnU-Net: delineation of the anatomy in hepatobiliary phase MRI
Karin A. Olthof, Matteo Fusagli, Bianca Güttner, Tiziano Natali, Bram Westerink, Stefanie Speidel, Theo J. M. Ruers, Koert F. D. Kuhlmann, Andrey Zhylka
Main category: eess.IV
TL;DR: Deep learning-based automated segmentation method using nnU-Net v1 for hepatic anatomy from gadoxetic acid-enhanced MRI, achieving high accuracy for liver structures and enabling efficient 3D surgical planning.
Details
Motivation: To develop an automated segmentation method for hepatic anatomy that eases the clinical workflow of preoperative planning for liver surgery patients.
Method: Trained nnU-Net v1 on 72 patients’ manual segmentations from hepatobiliary phase MRI with a focus on thin structures and topography preservation, evaluated on an 18-patient test set using the Dice similarity coefficient.
Result: High DSCs: 0.97 for parenchyma, 0.80 for hepatic vein, 0.79 for biliary tree, 0.77 for tumors, 0.74 for portal vein. Model detected 3 additional tumors missed by radiologists.
Conclusion: The nnU-Net-based method enables accurate automated hepatic anatomy delineation, making 3D planning efficient as standard-of-care for liver surgery patients.
Abstract: Background: The aim of this study was to develop and evaluate a deep learning-based automated segmentation method for hepatic anatomy (i.e., parenchyma, tumors, portal vein, hepatic vein and biliary tree) from the hepatobiliary phase of gadoxetic acid-enhanced MRI. This method should ease the clinical workflow of preoperative planning. Methods: Manual segmentation was performed on hepatobiliary phase MRI scans from 90 consecutive patients who underwent liver surgery between January 2020 and October 2023. A deep learning network (nnU-Net v1) was trained on 72 patients with an extra focus on thin structures and topography preservation. Performance was evaluated on an 18-patient test set by comparing automated and manual segmentations using Dice similarity coefficient (DSC). Following clinical integration, 10 segmentations (assessment dataset) were generated using the network and manually refined for clinical use to quantify required adjustments using DSC. Results: In the test set, DSCs were 0.97±0.01 for liver parenchyma, 0.80±0.04 for hepatic vein, 0.79±0.07 for biliary tree, 0.77±0.17 for tumors, and 0.74±0.06 for portal vein. Average tumor detection rate was 76.6±24.1%, with a median of one false-positive per patient. The assessment dataset showed minor adjustments were required for clinical use of the 3D models, with high DSCs for parenchyma (1.00±0.00), portal vein (0.98±0.01) and hepatic vein (0.95±0.07). Tumor segmentation exhibited greater variability (DSC 0.80±0.27). During prospective clinical use, the model detected three additional tumors initially missed by radiologists. Conclusions: The proposed nnU-Net-based segmentation method enables accurate and automated delineation of hepatic anatomy. This enables 3D planning to be applied efficiently as a standard-of-care for every patient undergoing liver surgery.
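Since the evaluation hinges on the Dice similarity coefficient, a reference implementation for binary masks is short. This is the standard formulation, not code from the study:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)
```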
[380] Fine-grained Image Quality Assessment for Perceptual Image Restoration
Xiangfei Sheng, Xiaofeng Pan, Zhichao Yang, Pengfei Chen, Leida Li
Main category: eess.IV
TL;DR: A new fine-grained image quality assessment dataset (FGRestore) and model (FGResQ) specifically designed for image restoration tasks, addressing limitations of existing IQA metrics in distinguishing subtle quality differences.
Details
Motivation: Existing image quality assessment metrics are inadequate for evaluating perceptual image restoration results, particularly when distinguishing fine-grained quality differences among restored images.
Method: Created the FGRestore dataset with 18,408 restored images across 6 IR tasks and 30,886 pairwise preferences, and proposed the FGResQ model, which combines coarse-grained score regression with fine-grained quality ranking.
Result: Comprehensive benchmarking revealed significant inconsistencies between score-based IQA evaluations and fine-grained restoration quality. FGResQ significantly outperformed state-of-the-art IQA metrics in extensive experiments.
Conclusion: FGResQ provides an effective solution for accurate quality assessment of image restoration results, addressing the fine-grained evaluation needs that existing metrics fail to meet.
Abstract: Recent years have witnessed remarkable achievements in perceptual image restoration (IR), creating an urgent demand for accurate image quality assessment (IQA), which is essential for both performance comparison and algorithm optimization. Unfortunately, existing IQA metrics exhibit inherent weaknesses for the IR task, particularly when distinguishing fine-grained quality differences among restored images. To address this dilemma, we contribute the first-of-its-kind fine-grained image quality assessment dataset for image restoration, termed FGRestore, comprising 18,408 restored images across six common IR tasks. Beyond conventional scalar quality scores, FGRestore was also annotated with 30,886 fine-grained pairwise preferences. Based on FGRestore, a comprehensive benchmark was conducted on the existing IQA metrics, revealing significant inconsistencies between score-based IQA evaluations and fine-grained restoration quality. Motivated by these findings, we further propose FGResQ, a new IQA model specifically designed for image restoration, which features both coarse-grained score regression and fine-grained quality ranking. Extensive experiments and comparisons demonstrate that FGResQ significantly outperforms state-of-the-art IQA metrics. Codes and model weights have been released at https://pxf0429.github.io/FGResQ/
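The combination of coarse score regression and fine-grained pairwise ranking can be expressed as a two-term objective. The sketch below uses MSE against scalar quality labels plus a margin ranking loss over annotated preferences; the specific losses, margin, and weighting are assumptions, not FGResQ's published formulation.

```python
import torch
import torch.nn.functional as F

def fine_grained_iqa_loss(s_a: torch.Tensor, s_b: torch.Tensor,
                          mos_a: torch.Tensor, mos_b: torch.Tensor,
                          pref: torch.Tensor, margin: float = 0.1,
                          alpha: float = 1.0) -> torch.Tensor:
    """Joint coarse-regression + fine-ranking objective (illustrative).

    s_a, s_b: predicted quality scores for a pair of restored images.
    mos_a, mos_b: scalar quality labels; pref: +1 if A preferred, -1 if B.
    """
    regression = F.mse_loss(s_a, mos_a) + F.mse_loss(s_b, mos_b)
    ranking = F.margin_ranking_loss(s_a, s_b, pref, margin=margin)
    return regression + alpha * ranking
```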
[381] A Systematic Study of Deep Learning Models and xAI Methods for Region-of-Interest Detection in MRI Scans
Justin Yiu, Kushank Arora, Daniel Steinberg, Rohit Ghiya
Main category: eess.IV
TL;DR: Deep learning evaluation for knee MRI ROI detection shows ResNet50 outperforms Vision Transformers, with Grad-CAM providing best clinical explanations. CNN transfer learning works best on MRNet dataset.
Details
Motivation: Manual MRI interpretation is time-consuming and variable, requiring automated ROI detection with explainable AI for clinical relevance.
Method: Evaluated ResNet50, InceptionV3, Vision Transformers, and U-Net variants with MLP classifiers, using Grad-CAM and Saliency Maps for explainability; assessed with AUC and PSNR/SSIM metrics.
Result: ResNet50 excelled in classification and ROI identification, outperforming transformers. U-Net+MLP showed reconstruction potential but lower classification. Grad-CAM provided most meaningful explanations.
Conclusion: CNN-based transfer learning is most effective for this dataset. Future larger-scale pretraining may better utilize transformer models’ potential.
Abstract: Magnetic Resonance Imaging (MRI) is an essential diagnostic tool for assessing knee injuries. However, manual interpretation of MRI slices remains time-consuming and prone to inter-observer variability. This study presents a systematic evaluation of various deep learning architectures combined with explainable AI (xAI) techniques for automated region of interest (ROI) detection in knee MRI scans. We investigate both supervised and self-supervised approaches, including ResNet50, InceptionV3, Vision Transformers (ViT), and multiple U-Net variants augmented with multi-layer perceptron (MLP) classifiers. To enhance interpretability and clinical relevance, we integrate xAI methods such as Grad-CAM and Saliency Maps. Model performance is assessed using AUC for classification and PSNR/SSIM for reconstruction quality, along with qualitative ROI visualizations. Our results demonstrate that ResNet50 consistently excels in classification and ROI identification, outperforming transformer-based models under the constraints of the MRNet dataset. While hybrid U-Net + MLP approaches show potential for leveraging spatial features in reconstruction and interpretability, their classification performance remains lower. Grad-CAM consistently provided the most clinically meaningful explanations across architectures. Overall, CNN-based transfer learning emerges as the most effective approach for this dataset, while future work with larger-scale pretraining may better unlock the potential of transformer models.
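Grad-CAM, the explanation method the study found most clinically meaningful, weights a convolutional layer's activations by the spatially averaged gradients of the target logit. A compact PyTorch sketch of standard Grad-CAM, independent of the paper's codebase:

```python
import torch

def grad_cam(model, target_layer, x: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Class activation map for class_idx from target_layer; returns (B, H', W')."""
    store = {}
    h_fwd = target_layer.register_forward_hook(
        lambda m, i, o: store.update(act=o))
    h_bwd = target_layer.register_full_backward_hook(
        lambda m, gi, go: store.update(grad=go[0]))
    logits = model(x)
    logits[:, class_idx].sum().backward()
    h_fwd.remove(); h_bwd.remove()
    weights = store["grad"].mean(dim=(2, 3), keepdim=True)  # GAP the gradients
    cam = torch.relu((weights * store["act"]).sum(dim=1))   # weighted channel sum
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
```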
[382] Deep Skin Lesion Segmentation with Transformer-CNN Fusion: Toward Intelligent Skin Cancer Analysis
Xin Wang, Xiaopei Zhang, Xingang Wang
Main category: eess.IV
TL;DR: Improved TransUNet for skin lesion segmentation with transformer-convolution fusion, boundary attention, and multi-scale upsampling to handle complex structures and scale variations.
Details
Motivation: Address challenges in skin lesion image segmentation, including complex lesion structures, blurred boundaries, and significant scale variations that existing methods struggle with.
Method: Integrates a transformer module into the encoder-decoder framework to capture global semantics while retaining a convolutional branch for local features, and adds a boundary-guided attention mechanism and multi-scale upsampling path for better boundary localization.
Result: Outperforms existing methods in mIoU, mDice, and mAcc metrics, showing superior lesion recognition accuracy and robustness, especially in boundary reconstruction and structural recovery.
Conclusion: The proposed method is well-suited for automated skin lesion segmentation tasks, demonstrating strong performance in complex scenarios with better boundary and structure preservation.
Abstract: This paper proposes a high-precision semantic segmentation method based on an improved TransUNet architecture to address the challenges of complex lesion structures, blurred boundaries, and significant scale variations in skin lesion images. The method integrates a transformer module into the traditional encoder-decoder framework to model global semantic information, while retaining a convolutional branch to preserve local texture and edge features. This enhances the model’s ability to perceive fine-grained structures. A boundary-guided attention mechanism and multi-scale upsampling path are also designed to improve lesion boundary localization and segmentation consistency. To verify the effectiveness of the approach, a series of experiments were conducted, including comparative studies, hyperparameter sensitivity analysis, data augmentation effects, input resolution variation, and training data split ratio tests. Experimental results show that the proposed model outperforms existing representative methods in mIoU, mDice, and mAcc, demonstrating stronger lesion recognition accuracy and robustness. In particular, the model achieves better boundary reconstruction and structural recovery in complex scenarios, making it well-suited for the key demands of automated segmentation tasks in skin lesion analysis.
[383] From Slices to Structures: Unsupervised 3D Reconstruction of Female Pelvic Anatomy from Freehand Transvaginal Ultrasound
Max Krähenmann, Sergio Tascon-Morales, Fabian Laumer, Julia E. Vogt, Ece Ozkan
Main category: eess.IV
TL;DR: Unsupervised 3D reconstruction from freehand 2D ultrasound sweeps using Gaussian Splatting adaptation without external tracking or specialized hardware.
Details
Motivation: Overcome the limitations of specialized hardware and restrictive protocols in volumetric ultrasound to enable widespread adoption and improve diagnostic accuracy.
Method: Adapts Gaussian Splatting to ultrasound with a slice-aware differentiable rasterizer, models anatomy as anisotropic 3D Gaussians, and uses sensorless probe motion estimation and domain-specific geometric priors.
Result: Compact, flexible, memory-efficient volumetric representation capturing anatomical detail with high spatial fidelity from 2D ultrasound images.
Conclusion: Accurate 3D reconstruction from 2D ultrasound can be achieved computationally, offering scalable alternative to conventional 3D systems and enabling AI-assisted diagnosis.
Abstract: Volumetric ultrasound has the potential to significantly improve diagnostic accuracy and clinical decision-making, yet its widespread adoption remains limited by dependence on specialized hardware and restrictive acquisition protocols. In this work, we present a novel unsupervised framework for reconstructing 3D anatomical structures from freehand 2D transvaginal ultrasound (TVS) sweeps, without requiring external tracking or learned pose estimators. Our method adapts the principles of Gaussian Splatting to the domain of ultrasound, introducing a slice-aware, differentiable rasterizer tailored to the unique physics and geometry of ultrasound imaging. We model anatomy as a collection of anisotropic 3D Gaussians and optimize their parameters directly from image-level supervision, leveraging sensorless probe motion estimation and domain-specific geometric priors. The result is a compact, flexible, and memory-efficient volumetric representation that captures anatomical detail with high spatial fidelity. This work demonstrates that accurate 3D reconstruction from 2D ultrasound images can be achieved through purely computational means, offering a scalable alternative to conventional 3D systems and enabling new opportunities for AI-assisted analysis and diagnosis.
[384] Broadband Near-Infrared Compressive Spectral Imaging System with Reflective Structure
Yutong Li, Zhenming Yu, Liming Cheng, Jiayu Di, Liang Lin, Jingyue Ma, Tongshuo Zhang, Yue Zhou, Haiying Zhao, Kun Xu
Main category: eess.IV
TL;DR: A compact broadband NIR compressive spectral imaging system covering 700-1600 nm using wavelength segmentation and specialized optical components to overcome hardware limitations.
Details
Motivation: Conventional NIR hyperspectral imaging systems face challenges including high cost, bulky instrumentation, and inefficient data collection, necessitating a more compact and efficient solution.
Method: The system uses wavelength segmentation and specialized optical components in a reflective optical structure to capture broadband NIR hyperspectral data through compressive spectral imaging.
Result: The system successfully captures hyperspectral data covering a broad spectral bandwidth from 700 to 1600 nm while maintaining a compact form factor.
Conclusion: This approach provides a novel technical solution for NIR hyperspectral imaging that addresses the limitations of conventional systems by being more compact and efficient.
Abstract: Near-infrared (NIR) hyperspectral imaging has become a critical tool in modern analytical science. However, conventional NIR hyperspectral imaging systems face challenges including high cost, bulky instrumentation, and inefficient data collection. In this work, we demonstrate a broadband NIR compressive spectral imaging system that is capable of capturing hyperspectral data covering a broad spectral bandwidth ranging from 700 to 1600 nm. By segmenting wavelengths and designing specialized optical components, our design overcomes hardware spectral limitations to capture broadband data, while the reflective optical structure makes the system compact. This approach provides a novel technical solution for NIR hyperspectral imaging.
[385] Integrated Snapshot Near-infrared Hyperspectral Imaging Framework with Diffractive Optics
Jingyue Ma, Zhenming Yu, Zhengyang Li, Liang Lin, Liming Cheng, Kun Xu
Main category: eess.IV
TL;DR: Integrated snapshot NIR hyperspectral imaging using DOE with NIRSA-Net achieves 700-1000nm spectral range with 10nm resolution and improved image quality metrics.
Details
Motivation: To develop an efficient snapshot near-infrared hyperspectral imaging system that can capture spectral information in the 700-1000nm range with high resolution and improved image quality, without requiring scanning.
Method: Combines a designed Diffractive Optical Element (DOE) with the NIRSA-Net neural network architecture for integrated snapshot hyperspectral imaging.
Result: Achieves near-infrared spectral imaging at 700-1000nm wavelength range with 10nm spectral resolution, while improving PSNR by 1.47dB and SSIM by 0.006 compared to previous methods.
Conclusion: The proposed framework successfully enables high-resolution snapshot NIR hyperspectral imaging with significant improvements in both spectral resolution and image quality metrics.
Abstract: We propose an integrated snapshot near-infrared hyperspectral imaging framework that combines a designed DOE with NIRSA-Net. The results demonstrate near-infrared spectral imaging at 700-1000nm with 10nm resolution, while achieving improvements of 1.47dB in PSNR and 0.006 in SSIM.
[386] Virtual Multiplex Staining for Histological Images using a Marker-wise Conditioned Diffusion Model
Hyun-Jic Oh, Junsik Kim, Zhiyi Shi, Yichen Wu, Yu-An Chen, Peter K. Sorger, Hanspeter Pfister, Won-Ki Jeong
Main category: eess.IV
TL;DR: Novel framework using latent diffusion models to generate multiplex biomarker images from H&E stains, enabling virtual multiplex staining with up to 18 marker types.
Details
Motivation: Multiplex imaging provides molecular insights but is complex and costly, while existing H&E repositories lack corresponding multiplex data, limiting multimodal analysis opportunities.
Method: Uses pretrained latent diffusion models fine-tuned for single-step sampling, conditioned on each marker while sharing the architecture across markers, with pixel-level loss functions for color fidelity.
Result: Validated on two public datasets, achieves generation of up to 18 different marker types with improved accuracy, significantly exceeding previous approaches (2-3 markers).
Conclusion: Bridges gap between H&E and multiplex imaging, enables retrospective studies and large-scale analysis of existing H&E repositories through virtual multiplex staining.
Abstract: Multiplex imaging is revolutionizing pathology by enabling the simultaneous visualization of multiple biomarkers within tissue samples, providing molecular-level insights that traditional hematoxylin and eosin (H&E) staining cannot provide. However, the complexity and cost of multiplex data acquisition have hindered its widespread adoption. Additionally, most existing large repositories of H&E images lack corresponding multiplex images, limiting opportunities for multimodal analysis. To address these challenges, we leverage recent advances in latent diffusion models (LDMs), which excel at modeling complex data distributions utilizing their powerful priors for fine-tuning to a target domain. In this paper, we introduce a novel framework for virtual multiplex staining that utilizes pretrained LDM parameters to generate multiplex images from H&E images using a conditional diffusion model. Our approach enables marker-by-marker generation by conditioning the diffusion model on each marker, while sharing the same architecture across all markers. To tackle the challenge of varying pixel value distributions across different marker stains and to improve inference speed, we fine-tune the model for single-step sampling, enhancing both color contrast fidelity and inference efficiency through pixel-level loss functions. We validate our framework on two publicly available datasets, notably demonstrating its effectiveness in generating up to 18 different marker types with improved accuracy, a substantial increase over the 2-3 marker types achieved in previous approaches. This validation highlights the potential of our framework, pioneering virtual multiplex staining. Finally, this paper bridges the gap between H&E and multiplex imaging, potentially enabling retrospective studies and large-scale analyses of existing H&E image repositories.
[387] Rule-based Key-Point Extraction for MR-Guided Biomechanical Digital Twins of the Spine
Robert Graf, Tanja Lerchl, Kati Nispel, Hendrik Möller, Matan Atad, Julian McGinnis, Julius Maria Watrinet, Johannes Paetzold, Daniel Rueckert, Jan S. Kirschke
Main category: eess.IV
TL;DR: Rule-based approach for subpixel-accurate key-point extraction from MRI to create anatomical landmarks for biomechanical spinal models, enabling subject-specific simulations without radiation exposure.
Details
Motivation: Digital twins require accurate, individualized anatomical modeling for clinical decision support, but current methods often lack precision and may involve radiation exposure from CT imaging.
Method: Adapted a rule-based approach from CT-based methods, incorporating robust image alignment and vertebra-specific orientation estimation to extract subpixel-accurate key-points from MRI scans.
Result: Generated anatomically meaningful landmarks that serve as boundary conditions and force application points for biomechanical models, enabling simulation of spinal mechanics based on individual anatomy.
Conclusion: The method bridges medical image analysis with biomechanical simulation, supports radiation-free large-scale studies, and contributes to personalized healthcare modeling through digital twin development.
Abstract: Digital twins offer a powerful framework for subject-specific simulation and clinical decision support, yet their development often hinges on accurate, individualized anatomical modeling. In this work, we present a rule-based approach for subpixel-accurate key-point extraction from MRI, adapted from prior CT-based methods. Our approach incorporates robust image alignment and vertebra-specific orientation estimation to generate anatomically meaningful landmarks that serve as boundary conditions and force application points, like muscle and ligament insertions in biomechanical models. These models enable the simulation of spinal mechanics considering the subject’s individual anatomy, and thus support the development of tailored approaches in clinical diagnostics and treatment planning. By leveraging MR imaging, our method is radiation-free and well-suited for large-scale studies and use in underrepresented populations. This work contributes to the digital twin ecosystem by bridging the gap between precise medical image analysis with biomechanical simulation, and aligns with key themes in personalized modeling for healthcare.
[388] Improving Infrared Thermography after Solar Loading
Ellin Q. Zhao, Alexander Vilesov, Pradyumna Chari, Laleh Jalilian, Achuta Kadambi
Main category: eess.IV
TL;DR: Machine learning model SL-Net corrects solar loading effects in infrared thermography, reducing temperature error by 68% and eliminating skin tone bias in fever detection.
Details
Motivation: Infrared thermometers (IRTs) are inaccurate in unconstrained environments due to solar loading effects, in which absorbed solar radiation elevates skin temperature, leading to poor specificity in fever detection and skin tone-dependent inequity.
Method: Proposed SL-Net, a machine learning method that removes the solar loading effect from thermal images using only a single frame of thermal data, enabling sub-second correction of skin temperature without requiring reacclimation time.
Result: SL-Net reduces solar loading error by 68% from 2°C to 0.64°C on average, and eliminates the positive correlation between solar loading error and melanin concentration. A diverse dataset of 100 subjects with co-registered RGB-thermal images and measurements is provided.
Conclusion: Machine learning can effectively correct complex thermal perturbations like solar loading, enabling robust and equitable human thermography for fever screening applications.
Abstract: Widely deployed for fever screening, infrared thermometers (IRTs) enable rapid non-contact detection of body temperature, but they are inaccurate in unconstrained environments. Previous works have studied the impact of transient skin temperature on IRTs, but no studies have quantified the effect of skin temperature elevation due to absorbed solar radiation, which we call solar loading. Solar loading leads to poor specificity in fever detection and is a skin tone-dependent effect, introducing inequity in IRTs. The current solution to solar loading is to have a subject reacclimate for up to 30 minutes before IRT measurement. We propose a machine learning method to improve IR thermography by removing the solar loading effect from thermal images of the face. This correction only uses a single frame of thermal data, allowing sub-second correction of skin temperature. On average, forehead skin temperature increases by 2°C after solar loading, and our machine learning model, SL-Net, not only reduces this error by 68% to 0.64°C, but also removes the positive correlation between solar loading error and melanin concentration. We open source a diverse dataset of 100 subjects with co-registered RGB-thermal images, and IRT and skin tone measurements. Our work shows that it is possible to use machine learning to correct complex thermal perturbations and enable robust and equitable human thermography.
[389] Diffusion MRI with Machine Learning
Davood Karimi, Simon K. Warfield
Main category: eess.IV
TL;DR: This paper reviews machine learning methods for diffusion MRI analysis, covering preprocessing, microstructure mapping, tractography, and white matter analysis, while identifying current limitations and future research directions.
Details
Motivation: dMRI provides unique capabilities for noninvasive tissue microstructure and connectivity analysis but faces challenges with noise, artifacts, variability, and complex measurement relationships. Machine learning offers promising solutions to these difficult analysis tasks.
Method: The manuscript conducts a comprehensive assessment of existing machine learning methods for dMRI analysis, focusing on four key areas: data preprocessing/harmonization, microstructure mapping, tractography, and white matter tract analysis.
Result: Machine learning shows exceptional suitability for tackling difficult dMRI analysis tasks, but existing methods have shortcomings including evaluation practices, data availability, and concerns about model generalizability, reliability, and explainability.
Conclusion: While machine learning holds great promise for dMRI analysis, several critical issues need addressing: improved evaluation practices, richer training datasets, better validation benchmarks, and enhanced model generalizability, reliability, and explainability.
Abstract: Diffusion-weighted magnetic resonance imaging (dMRI) of the brain offers unique capabilities including noninvasive probing of tissue microstructure and structural connectivity. It is widely used for clinical assessment of disease and injury, and for neuroscience research. Analyzing the dMRI data to extract useful information for medical and scientific purposes can be challenging. The dMRI measurements may suffer from strong noise and artifacts, and may exhibit high inter-session and inter-scanner variability in the data, as well as inter-subject heterogeneity in brain structure. Moreover, the relationship between measurements and the phenomena of interest can be highly complex. Recent years have witnessed increasing use of machine learning methods for dMRI analysis. This manuscript aims to assess these efforts, with a focus on methods that have addressed data preprocessing and harmonization, microstructure mapping, tractography, and white matter tract analysis. We study the main findings, strengths, and weaknesses of the existing methods and suggest topics for future research. We find that machine learning may be exceptionally suited to tackle some of the difficult tasks in dMRI analysis. However, for this to happen, several shortcomings of existing methods and critical unresolved issues need to be addressed. There is a pressing need to improve evaluation practices, to increase the availability of rich training datasets and validation benchmarks, and to address concerns about model generalizability, reliability, and explainability.
[390] Towards pedestrian head tracking: A benchmark dataset and a multi-source data fusion network
Kailai Sun, Xinwei Wang, Shaobo Liu, Qianchuan Zhao, Gao Huang, Chang Liu
Main category: eess.IV
TL;DR: A new large-scale Chinese pedestrian head tracking dataset (Cchead) with 50,528 frames and 2.36M+ heads, plus a Multi-source Data Fusion Network (MDFN) that combines RGB, motion, depth, and density data for superior head detection and tracking performance.
Details
Motivation: Addressing the lack of comprehensive head tracking datasets and methods for crowded scenes, particularly handling challenges like intra-class occlusions, complex motions, and diverse poses in high-density crowds.
Method: Developed the Cchead dataset with 10 diverse scenes and 2,358 tracks, and created MDFN, an end-to-end CNN-based network that jointly trains on RGB frames, pixel-level motion information, depth maps, and density maps.
Result: MDFN achieves superior performance compared to SOTA methods across three datasets (Cchead, Restaurant, and CroHD), with ablation experiments confirming the importance of multi-source data fusion.
Conclusion: The Cchead dataset and MDFN method significantly advance head tracking in crowded scenes, providing valuable resources for applications like autonomous driving and pedestrian flow analysis, with code and models publicly available.
Abstract: Pedestrian detection and tracking in crowded video sequences have many applications, including autonomous driving, robot navigation and pedestrian flow analysis. However, detecting and tracking pedestrians in high-density crowds face many challenges, including intra-class occlusions, complex motions, and diverse poses. Although artificial intelligence (AI) models have achieved great progress in head detection, head tracking datasets and methods are extremely lacking. Existing head datasets have limited coverage of complex pedestrian flows and scenes (e.g., pedestrian interactions, occlusions, and object interference). It is of great importance to develop new head tracking datasets and methods. To address these challenges, we present a Chinese Large-scale Cross-scene Pedestrian Head Tracking dataset (Cchead) and a Multi-source Data Fusion Network (MDFN). The dataset has features that are of considerable interest, including 10 diverse scenes of 50,528 frames with about 2,366,249 heads and 2,358 tracks. Our dataset contains diverse pedestrian moving speeds, directions, and complex crowd pedestrian flows with collision avoidance behaviors. Existing state-of-the-art (SOTA) algorithms are tested and compared on the Cchead dataset. MDFN is the first end-to-end convolutional neural network (CNN)-based head detection and tracking network that jointly trains Red, Green, Blue (RGB) frames, pixel-level motion information, depth maps, and density maps in videos. Ablation experiments confirm the significance of multi-source data fusion. Compared with SOTA pedestrian detection and tracking methods, MDFN achieves superior performance across three datasets: Cchead, Restaurant and Crowd of Heads Dataset (CroHD). To promote further development, we share our source code and trained models for global researchers: https://github.com/kailaisun/Cchead.
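How the four input streams might be combined is easiest to see as channel-wise early fusion. The channel counts and single-stem design below are assumptions for illustration; MDFN's actual fusion topology is in the paper and its released code.

```python
import torch
import torch.nn as nn

class MultiSourceStem(nn.Module):
    """Early fusion of RGB (3), optical flow (2), depth (1), and density (1) maps."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(3 + 2 + 1 + 1, out_channels, kernel_size=3, padding=1)

    def forward(self, rgb, flow, depth, density):
        # All inputs share spatial size (B, C_i, H, W); fuse along channels.
        x = torch.cat([rgb, flow, depth, density], dim=1)
        return torch.relu(self.stem(x))
```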
[391] A Novel Vascular Risk Scoring Framework for Quantifying Sex-Specific Cerebral Perfusion from 3D pCASL MRI
Sneha Noble, Neelam Sinha, Vaanathi Sundareshan, Thomas Gregor Issac
Main category: eess.IV
TL;DR: Novel framework using 3D pCASL MRI and CNN to analyze sex/age differences in cerebral blood flow, achieving 95% sex classification accuracy and developing personalized Vascular Risk Score for precision neurology.
Details
Motivation: To investigate sex- and age-dependent heterogeneity in cerebral perfusion and establish biologically informed vascular risk quantification for early detection of hypoperfusion and neurodegenerative diseases.
Method: Trained a custom convolutional neural network on ASL-derived cerebral blood flow maps from 186 cognitively healthy individuals (ages 8-92), with regional analyses and development of a Vascular Risk Score based on age/sex-stratified normative CBF distributions.
Result: 95% accuracy in sex classification, identified elevated CBF in females across specific brain regions, observed global age-related CBF decline in both sexes, and developed personalized Vascular Risk Score metric.
Conclusion: The framework provides sensitive biomarkers for detecting early hypoperfusion and stratifying vascular contributions to neurodegenerative diseases, advancing precision neurology through individualized perfusion assessment.
Abstract: We present a novel framework that leverages 3D pseudo-continuous arterial spin labeling (pCASL) MRI to investigate sex- and age-dependent heterogeneity in cerebral perfusion and to establish a biologically informed vascular risk quantification metric. A custom convolutional neural network was trained on ASL-derived cerebral blood flow (CBF) maps from 186 cognitively healthy individuals (89 males and 97 females, ages 8-92 years), achieving 95% accuracy in sex classification and revealing robust sex-specific perfusion signatures. Regional analyses identified significantly elevated CBF in females across medial Brodmann areas 6 and 10, the visual area of the cortex, the polar occipital cortex, and both ventral and dorsal dysgranular insula, highlighting sex-specific neurovascular specialization in motor, cognitive, sensory, and affective domains. In addition, we observed a consistent global age-related decline in CBF across both sexes, reflecting progressive cerebrovascular aging. To integrate these findings, we propose a biologically informed Vascular Risk Score (VRS) derived from age- and sex-stratified normative CBF distributions. The VRS enables individualized assessment of cerebral perfusion integrity by quantifying deviations from expected normative patterns. This metric offers a sensitive, personalized biomarker for detecting early hypoperfusion and stratifying vascular contributions to neurodegenerative diseases, including Alzheimer’s disease, thereby advancing the goals of precision neurology.
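One plausible instantiation of a deviation-from-normative score: z-score each region's CBF against the age- and sex-matched normative distribution and summarize the hypoperfused tail. The exact VRS formula is not reproduced in the summary; the sketch below is an assumption.

```python
import numpy as np

def vascular_risk_score(region_cbf: np.ndarray,
                        norm_mean: np.ndarray,
                        norm_std: np.ndarray) -> float:
    """Illustrative VRS: mean magnitude of below-normative regional CBF.

    region_cbf: per-region CBF for one subject; norm_mean/norm_std:
    age/sex-stratified normative statistics for the same regions.
    """
    z = (region_cbf - norm_mean) / norm_std
    hypoperfusion = np.clip(-z, 0.0, None)  # keep only below-normal deviations
    return float(hypoperfusion.mean())      # larger score = greater vascular risk
```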
[392] Latent Interpolation Learning Using Diffusion Models for Cardiac Volume Reconstruction
Niklas Bubeck, Suprosanna Shit, Chen Chen, Can Zhao, Pengfei Guo, Dong Yang, Georg Zitzlsberger, Daguang Xu, Bernhard Kainz, Daniel Rueckert, Jiazhen Pan
Main category: eess.IV
TL;DR: CaLID is a novel diffusion-based framework for 3D cardiac MRI reconstruction from sparse 2D slices, offering data-driven interpolation, computational efficiency in latent space, and eliminating need for auxiliary inputs while achieving state-of-the-art performance.
Details
Motivation: Cardiac MRI is limited by sparse 2D slice acquisition, resulting in incomplete volumetric information. Existing methods suffer from predefined interpolation schemes, computational inefficiency, and dependence on additional semantic inputs like segmentation labels.
Method: Proposes the CaLID framework with three innovations: 1) data-driven interpolation using diffusion models to capture complex relationships, 2) computationally efficient latent-space operation (24x speedup), and 3) operation on only sparse 2D CMR images without auxiliary inputs. Also extends to 2D+T for spatiotemporal modeling.
Result: Achieves state-of-the-art performance against baseline methods, superior reconstruction quality and efficiency in volumetric evaluations and downstream segmentation tasks. Eliminates need for morphological guidance and simplifies clinical workflows.
Conclusion: CaLID advances spatial and spatiotemporal whole-heart reconstruction, offering a robust and clinically practical solution for cardiovascular imaging by addressing fundamental limitations of existing approaches.
Abstract: Cardiac Magnetic Resonance (CMR) imaging is a critical tool for diagnosing and managing cardiovascular disease, yet its utility is often limited by the sparse acquisition of 2D short-axis slices, resulting in incomplete volumetric information. Accurate 3D reconstruction from these sparse slices is essential for comprehensive cardiac assessment, but existing methods face challenges, including reliance on predefined interpolation schemes (e.g., linear or spherical), computational inefficiency, and dependence on additional semantic inputs such as segmentation labels or motion data. To address these limitations, we propose a novel Cardiac Latent Interpolation Diffusion (CaLID) framework that introduces three key innovations. First, we present a data-driven interpolation scheme based on diffusion models, which can capture complex, non-linear relationships between sparse slices and improves reconstruction accuracy. Second, we design a computationally efficient method that operates in the latent space and speeds up 3D whole-heart upsampling time by a factor of 24, reducing computational overhead compared to previous methods. Third, with only sparse 2D CMR images as input, our method achieves SOTA performance against baseline methods, eliminating the need for auxiliary input such as morphological guidance, thus simplifying workflows. We further extend our method to 2D+T data, enabling the effective modeling of spatiotemporal dynamics and ensuring temporal coherence. Extensive volumetric evaluations and downstream segmentation tasks demonstrate that CaLID achieves superior reconstruction quality and efficiency. By addressing the fundamental limitations of existing approaches, our framework advances the state of the art for spatial and spatiotemporal whole-heart reconstruction, offering a robust and clinically practical solution for cardiovascular imaging.
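The data-driven interpolation can be pictured as reverse diffusion in latent space, conditioned on the latents of the two acquired neighbouring slices. The encoder/denoiser/decoder interfaces and the sampling loop below are assumptions; CaLID's actual conditioning and sampling schemes are detailed in the paper.

```python
import torch

@torch.no_grad()
def interpolate_missing_slice(encoder, denoiser, decoder,
                              slice_a, slice_b, num_steps: int = 50):
    """Latent diffusion interpolation between two acquired slices (sketch).

    denoiser(z, t, cond) is assumed to perform one reverse-diffusion step
    on latent z at timestep t, conditioned on the neighbour latents.
    """
    z_a, z_b = encoder(slice_a), encoder(slice_b)   # latents of the known slices
    cond = torch.cat([z_a, z_b], dim=1)             # neighbour conditioning
    z = torch.randn_like(z_a)                       # start from Gaussian noise
    for t in reversed(range(num_steps)):
        z = denoiser(z, t, cond)
    return decoder(z)                               # decoded intermediate slice
```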