Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 112]
- cs.CV [Total: 184]
- cs.AI [Total: 95]
- cs.SD [Total: 12]
- cs.LG [Total: 160]
- cs.MA [Total: 11]
- cs.MM [Total: 2]
- eess.AS [Total: 5]
- eess.IV [Total: 13]
cs.CL
[1] A Preliminary Study of RAG for Taiwanese Historical Archives
Claire Lin, Bo-Han Feng, Xuanjun Chen, Te-Lun Yang, Hung-yi Lee, Jyh-Shing Roger Jang
Main category: cs.CL
TL;DR: Initial study of RAG pipeline applied to Taiwanese historical archives showing that early metadata integration improves retrieval and answer accuracy, but challenges remain with hallucinations and temporal/multi-hop queries.
Details
Motivation: Few studies have examined RAG for Taiwanese Historical Archives, despite its promise for knowledge-intensive tasks.
Method: Applied a RAG pipeline to two historical Traditional Chinese datasets (Fort Zeelandia and the Taiwan Provincial Council Gazette) with open-ended query sets, investigating query characteristics and metadata integration strategies.
Result: Early-stage metadata integration enhances both retrieval and answer accuracy, but reveals persistent challenges including hallucinations during generation and difficulties with temporal or multi-hop historical queries.
Conclusion: RAG shows promise for historical archives but requires further development to address generation hallucinations and complex temporal/multi-hop queries.
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising approach for knowledge-intensive tasks. However, few studies have examined RAG for Taiwanese Historical Archives. In this paper, we present an initial study of a RAG pipeline applied to two historical Traditional Chinese datasets, Fort Zeelandia and the Taiwan Provincial Council Gazette, along with their corresponding open-ended query sets. We systematically investigate the effects of query characteristics and metadata integration strategies on retrieval quality, answer generation, and the performance of the overall system. The results show that early-stage metadata integration enhances both retrieval and answer accuracy while also revealing persistent challenges for RAG systems, including hallucinations during generation and difficulties in handling temporal or multi-hop historical queries.
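The abstract does not specify the exact mechanism of "early-stage metadata integration"; a minimal sketch under the assumption that it means prepending archival metadata to each passage before embedding and indexing (so the retriever can match on dates, sources, and institutions) might look like:

```python
def with_early_metadata(passage: str, metadata: dict) -> str:
    """Prepend archival metadata to a passage before it is embedded/indexed.

    "Early-stage" integration means the retriever sees the metadata too,
    so queries about dates or institutions can match on those fields.
    """
    header = " | ".join(f"{k}: {v}" for k, v in metadata.items())
    return f"[{header}]\n{passage}"

doc = with_early_metadata(
    "The council discussed irrigation works in the district.",
    {"source": "Taiwan Provincial Council Gazette", "year": "1952"},
)
```

The alternative (late integration) would attach the same metadata only at generation time, after retrieval has already run on the bare passage text.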
[2] Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey
Fatemeh Shahhosseini, Arash Marioriyad, Ali Momen, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban, Shaghayegh Haghjooy Javanmard
Main category: cs.CL
TL;DR: This survey synthesizes methods for LLM-driven scientific idea generation, categorizing approaches into five families and analyzing them using creativity frameworks to understand how they balance creativity with scientific soundness.
Details
Motivation: Scientific idea generation is crucial for discovery but LLMs' creative capacity remains inconsistent and poorly understood, despite their promising performance in generating coherent and factual outputs.
Method: Categorizes existing methods into five families: External knowledge augmentation, Prompt-based distributional steering, Inference-time scaling, Multi-agent collaboration, and Parameter-level adaptation. Uses Boden’s taxonomy and Rhodes’ 4Ps framework to analyze creativity aspects.
Result: Provides a structured synthesis that clarifies the state of LLM-driven scientific ideation, showing how different approaches balance creativity with scientific soundness and what types of ideas they generate.
Conclusion: The survey outlines key directions toward reliable, systematic, and transformative applications of LLMs in scientific discovery by aligning methodological advances with creativity frameworks.
Abstract: Scientific idea generation lies at the heart of scientific discovery and has driven human progress, whether by solving unsolved problems or proposing novel hypotheses to explain unknown phenomena. Unlike standard scientific reasoning or general creative generation, idea generation in science is a multi-objective and open-ended task, where the novelty of a contribution is as essential as its empirical soundness. Large language models (LLMs) have recently emerged as promising generators of scientific ideas, capable of producing coherent and factual outputs with surprising intuition and acceptable reasoning, yet their creative capacity remains inconsistent and poorly understood. This survey provides a structured synthesis of methods for LLM-driven scientific ideation, examining how different approaches balance creativity with scientific soundness. We categorize existing methods into five complementary families: External knowledge augmentation, Prompt-based distributional steering, Inference-time scaling, Multi-agent collaboration, and Parameter-level adaptation. To interpret their contributions, we employ two complementary frameworks: Boden’s taxonomy of Combinatorial, Exploratory, and Transformational creativity to characterize the level of ideas each family is expected to generate, and Rhodes’ 4Ps framework (Person, Process, Press, and Product) to locate the aspect or source of creativity that each method emphasizes. By aligning methodological advances with creativity frameworks, this survey clarifies the state of the field and outlines key directions toward reliable, systematic, and transformative applications of LLMs in scientific discovery.
[3] GRIP: In-Parameter Graph Reasoning through Fine-Tuning Large Language Models
Jiarui Feng, Donghong Cai, Yixin Chen, Muhan Zhang
Main category: cs.CL
TL;DR: GRIP is a framework that enables LLMs to internalize graph knowledge through fine-tuning with LoRA parameters, allowing graph reasoning without original graph access at inference.
Details
Motivation: Adapting LLMs to handle structural graph data is challenging due to token overhead from graph-to-text conversion or poor modality alignment from additional modules requiring complex training.
Method: Fine-tune LLMs with carefully designed tasks to internalize graph knowledge into lightweight LoRA parameters, enabling graph reasoning without original graph access.
Result: Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of the approach.
Conclusion: GRIP successfully equips LLMs with graph reasoning capabilities through parameter-efficient fine-tuning, overcoming limitations of previous methods.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in modeling sequential textual data and generalizing across diverse tasks. However, adapting LLMs to effectively handle structural data, such as knowledge graphs or web data, remains a challenging problem. Some approaches adopt complex strategies to convert graphs into text sequences, resulting in significant token overhead and rendering them impractical for large-scale graphs. Others introduce additional modules to encode graphs into fixed-size token representations for LLMs. However, these methods typically require large-scale post-training on graph-text corpus and complex alignment procedures, yet often yield sub-optimal results due to poor modality alignment. Inspired by in-parameter knowledge injection for test-time adaptation of LLMs, we propose GRIP, a novel framework that equips LLMs with the ability to internalize complex relational information from graphs through carefully designed fine-tuning tasks. This knowledge is efficiently stored within lightweight LoRA parameters, enabling the fine-tuned LLM to perform a wide range of graph-related tasks without requiring access to the original graph at inference time. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach.
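GRIP stores the internalized graph knowledge in LoRA parameters. As a reminder of why that is "lightweight": a LoRA adapter adds a low-rank update B·A next to a frozen weight W, so only A and B are trained. A toy, framework-free sketch of the forward pass (illustrative shapes only, not GRIP's actual architecture):

```python
# Toy LoRA forward pass: the frozen weight W is augmented by a low-rank
# update B @ A, so only A and B (the lightweight parameters) are trained.
def matmul(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha=1.0):
    # y = W x + alpha * B (A x); the rank is the number of rows of A
    base = matmul(W, x)
    update = matmul(B, matmul(A, x))
    return [b + alpha * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (identity here)
A = [[1.0, 1.0]]               # rank-1 down-projection (1x2)
B = [[0.5], [0.0]]             # up-projection (2x1)
y = lora_forward(W, A, B, [2.0, 3.0])  # -> [4.5, 3.0]
```

For a 2x2 weight the full update would need 4 parameters, while the rank-1 adapter needs 2 + 2; at transformer scale this gap is what makes storing per-graph knowledge in adapters cheap.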
[4] REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment
Priyanka Mudgal
Main category: cs.CL
TL;DR: REFLEX is a reference-free evaluation metric for log summarization that uses LLMs as zero-shot evaluators to assess summary quality without needing gold-standard references.
Details
Motivation: Existing metrics like ROUGE and BLEU are limited by their dependence on surface-level lexical overlap and the scarcity of high-quality reference summaries for log summarization tasks.
Method: REFLEX uses large language models as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, eliminating the need for reference data or human annotations.
Result: REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets, and more effectively distinguishes model outputs than traditional metrics.
Conclusion: REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.
Abstract: Evaluating log summarization systems is challenging due to the lack of high-quality reference summaries and the limitations of existing metrics like ROUGE and BLEU, which depend on surface-level lexical overlap. We introduce REFLEX, a reference-free evaluation metric for log summarization based on large language model (LLM) judgment. REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references or human annotations. We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets, and more effectively distinguishes model outputs than traditional metrics. REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.
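The abstract names the judged dimensions but not the prompt itself; a hypothetical sketch of what a zero-shot judging prompt and its score parsing could look like (template and output format are assumptions, not REFLEX's actual prompt):

```python
DIMENSIONS = ("relevance", "informativeness", "coherence")

def build_judge_prompt(log_excerpt: str, summary: str) -> str:
    """Assemble a zero-shot judging prompt (hypothetical template)."""
    rubric = ", ".join(DIMENSIONS)
    return (
        "You are evaluating a log summary without any reference.\n"
        f"Rate it 1-5 on each of: {rubric}.\n"
        f"LOG:\n{log_excerpt}\n\nSUMMARY:\n{summary}\n"
        "Answer as `dimension: score` lines."
    )

def parse_scores(reply: str) -> dict:
    """Extract per-dimension integer scores from the judge's reply."""
    scores = {}
    for line in reply.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            if key.strip().lower() in DIMENSIONS:
                scores[key.strip().lower()] = int(value)
    return scores

scores = parse_scores("relevance: 4\ninformativeness: 3\ncoherence: 5")
```

In practice the reply would come from an LLM API call; constraining the output format in the prompt is what makes the parsing step this simple.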
[5] It Takes Two: A Dual Stage Approach for Terminology-Aware Translation
Akshat Singh Jaswal
Main category: cs.CL
TL;DR: DuTerm is a two-stage architecture for terminology-constrained machine translation that combines a fine-tuned NMT model with a prompt-based LLM for post-editing, showing that flexible terminology handling outperforms strict enforcement.
Details
Motivation: To improve terminology-constrained machine translation by developing a system that can better handle terminology adherence while maintaining translation quality.
Method: Two-stage architecture: 1) Terminology-aware NMT model fine-tuned on large-scale synthetic data, 2) Prompt-based LLM for post-editing to refine output and enforce terminology adherence.
Result: Evaluated on English-to-German, English-to-Spanish, and English-to-Russian using WMT 2025 Terminology Shared Task corpus. Flexible, context-driven terminology handling by LLM consistently yielded higher quality translations than strict constraint enforcement.
Conclusion: LLMs work best for high-quality translation as context-driven mutators rather than generators, highlighting a critical trade-off in terminology handling approaches.
Abstract: This paper introduces DuTerm, a novel two-stage architecture for terminology-constrained machine translation. Our system combines a terminology-aware NMT model, adapted via fine-tuning on large-scale synthetic data, with a prompt-based LLM for post-editing. The LLM stage refines NMT output and enforces terminology adherence. We evaluate DuTerm on English-to-German, English-to-Spanish, and English-to-Russian with the WMT 2025 Terminology Shared Task corpus. We demonstrate that flexible, context-driven terminology handling by the LLM consistently yields higher quality translations than strict constraint enforcement. Our results highlight a critical trade-off, revealing that LLMs work best for high-quality translation as context-driven mutators rather than generators.
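A DuTerm-style second stage must know which glossary constraints the first-stage NMT output missed before the LLM can post-edit them. A minimal sketch of that violation check (the interface is hypothetical; the paper's exact matching logic is not given, and real systems would need lemmatization rather than substring matching):

```python
def missing_terms(translation: str, glossary: dict) -> dict:
    """Return source->target glossary entries whose required target
    term does not appear in the candidate translation.

    In a two-stage pipeline, these violations would be handed to the
    LLM post-editor instead of being enforced by constrained decoding.
    """
    lowered = translation.lower()
    return {src: tgt for src, tgt in glossary.items()
            if tgt.lower() not in lowered}

glossary = {"neural network": "neuronales Netz", "layer": "Schicht"}
violations = missing_terms("Das neuronale Netzwerk hat drei Schichten.", glossary)
```

Note that the example also shows why flexible handling can beat strict enforcement: "neuronale Netzwerk" is an inflected, arguably acceptable variant that naive matching flags as a violation.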
[6] Motif 2 12.7B technical report
Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon
Main category: cs.CL
TL;DR: Motif-2-12.7B is a 12.7B parameter open-weight LLM that achieves competitive performance through architectural innovations (Grouped Differential Attention) and system optimizations, trained on 5.5T tokens with curriculum scheduling and specialized training infrastructure.
Details
Motivation: To push the efficiency frontier of large language models by combining architectural innovation with system-level optimization for scalable language understanding and robust instruction generalization under constrained compute budgets.
Method: Builds on Motif-2.6B with Grouped Differential Attention (GDA) that disentangles signal and noise-control attention pathways. Pre-trained on 5.5T tokens using curriculum-driven data scheduler. Uses MuonClip optimizer with custom kernels (fused PolyNorm activations, Parallel Muon algorithm) for distributed training efficiency. Three-stage supervised fine-tuning pipeline for instruction adherence, compositional understanding, and linguistic precision.
Result: Demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
Conclusion: Thoughtful architectural scaling and optimized training design can achieve competitive performance comparable to much larger models, pushing the efficiency frontier of large language models.
Abstract: We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.
[7] Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models
Xin Liu, Qiyang Song, Qihang Zhou, Haichao Du, Shaowen Xu, Wenbo Jiang, Weijuan Zhang, Xiaoqi Jia
Main category: cs.CL
TL;DR: Proposes LAHIS method to identify attention head importance for multilingual capabilities in LLMs, revealing language-specific and language-general heads that enable cross-lingual transfer and improve multilingual performance.
Details
Motivation: To understand the role of multi-head self-attention (MHA) in supporting multilingual processing in LLMs, as this area remains underexplored despite MHA's critical importance in other domains.
Method: Developed Language Attention Head Importance Scores (LAHIS) - an efficient method using single forward/backward pass to identify attention head importance; also introduced lightweight adaptation with soft head mask requiring only 20 parameters.
Result: Identified both language-specific and language-general heads in models like Aya-23-8B, Llama-3.2-3B, and Mistral-7B-v0.1; language-specific heads enable cross-lingual attention transfer and mitigate off-target language generation; lightweight adaptation improved XQuAD accuracy.
Conclusion: The work enhances both interpretability and multilingual capabilities of LLMs from the MHA perspective, providing insights into how attention mechanisms support multilingual processing.
Abstract: Large language models (LLMs) increasingly support multilingual understanding and generation. Meanwhile, efforts to interpret their internal mechanisms have emerged, offering insights to enhance multilingual performance. While multi-head self-attention (MHA) has proven critical in many areas, its role in multilingual capabilities remains underexplored. In this work, we study the contribution of MHA in supporting multilingual processing in LLMs. We propose Language Attention Head Importance Scores (LAHIS), an effective and efficient method that identifies attention head importance for multilingual capabilities via a single forward and backward pass through the LLM. Applying LAHIS to Aya-23-8B, Llama-3.2-3B, and Mistral-7B-v0.1, we reveal the existence of both language-specific and language-general heads. Language-specific heads enable cross-lingual attention transfer to guide the model toward target language contexts and mitigate off-target language generation issue, contributing to addressing challenges in multilingual LLMs. We also introduce a lightweight adaptation that learns a soft head mask to modulate attention outputs over language heads, requiring only 20 tunable parameters to improve XQuAD accuracy. Overall, our work enhances both the interpretability and multilingual capabilities of LLMs from the perspective of MHA.
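The paper's lightweight adaptation learns roughly one scalar gate per attention head (hence only ~20 tunable parameters). A toy sketch of such a soft head mask, with sigmoid gates scaling each head's output (shapes and gating form are assumptions; real heads emit full hidden vectors per token):

```python
import math

def soft_head_mask(head_outputs, mask_logits):
    """Scale each attention head's output by a learned sigmoid gate.

    One logit per head: with ~20 heads this is ~20 tunable parameters,
    matching the scale of the paper's lightweight adaptation.
    """
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mask_logits]
    return [[g * v for v in head] for g, head in zip(gates, head_outputs)]

heads = [[2.0, 4.0], [6.0, 8.0]]          # two toy head outputs
masked = soft_head_mask(heads, [0.0, 100.0])  # gates ~0.5 and ~1.0
```

Training only the logits (everything else frozen) lets the mask suppress heads that pull generation toward off-target languages while leaving language-general heads untouched.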
[8] LLM Optimization Unlocks Real-Time Pairwise Reranking
Jingyu Wu, Aditya Shrivastava, Jing Zhu, Alfy Samuel, Anoop Kumar, Daben Liu
Main category: cs.CL
TL;DR: This paper presents optimization methods for pairwise reranking in RAG systems, achieving 166x latency reduction (from 61.36s to 0.37s per query) with minimal performance loss.
Details
Motivation: Pairwise Reranking Prompting (PRP) using LLMs is effective but computationally expensive, making it impractical for real-time applications due to high latency and resource demands.
Method: Implemented multiple optimizations: using smaller LLMs, limiting reranked document sets, using lower precision, reducing positional bias with one-directional order inference, and restricting output tokens.
Result: Achieved dramatic latency reduction of 166x (61.36s → 0.37s per query) with insignificant performance drop in Recall@k metrics.
Conclusion: Careful design choices can make LLM-based reranking substantially more efficient and feasible for real-world, latency-sensitive deployments.
Abstract: Efficiently reranking documents retrieved from information retrieval (IR) pipelines to enhance the overall quality of a Retrieval-Augmented Generation (RAG) system remains an important yet challenging problem. Recent studies have highlighted the importance of Large Language Models (LLMs) in reranking tasks. In particular, Pairwise Reranking Prompting (PRP) has emerged as a promising plug-and-play approach due to its usability and effectiveness. However, the inherent complexity of the algorithm, coupled with the high computational demands and latency incurred due to LLMs, raises concerns about its feasibility in real-time applications. To address these challenges, this paper presents a focused study on pairwise reranking, demonstrating that carefully applied optimization methods can significantly mitigate these issues. By implementing these methods, we achieve a remarkable latency reduction of up to 166 times, from 61.36 seconds to 0.37 seconds per query, with an insignificant drop in performance measured by Recall@k. Our study highlights the importance of design choices that were previously overlooked, such as using smaller models, limiting the reranked set, using lower precision, reducing positional bias with one-directional order inference, and restricting output tokens. These optimizations make LLM-based reranking substantially more efficient and feasible for latency-sensitive, real-world deployments.
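Two of the listed optimizations — limiting the reranked set and one-directional order inference — can be sketched independently of any LLM. Below, `prefer(a, b)` stands in for the pairwise LLM call, only the top-k candidates are reranked, and each pair is queried in a single order (a simplified bubble-pass rather than PRP's exact procedure):

```python
def pairwise_rerank(docs, prefer, top_k=3):
    """Rerank only the first top_k docs via one-directional pairwise
    comparisons; the tail keeps its original retrieval order.

    `prefer(a, b) -> True if a should rank above b` stands in for the
    LLM comparison call. Asking each pair in one order only halves the
    number of calls and avoids averaging over positional bias.
    """
    head = list(docs[:top_k])
    for i in range(len(head)):                 # bubble passes over the head
        for j in range(len(head) - 1 - i):
            if prefer(head[j + 1], head[j]):
                head[j], head[j + 1] = head[j + 1], head[j]
    return head + list(docs[top_k:])

docs = [("d1", 0.2), ("d2", 0.9), ("d3", 0.5), ("d4", 0.1)]
ranked = pairwise_rerank(docs, prefer=lambda a, b: a[1] > b[1], top_k=3)
# -> d2, d3, d1 reranked; d4 untouched
```

With a comparator that costs one LLM call, restricting the head from n to k documents drops the call count from O(n²) to O(k²), which is where much of the reported latency saving would come from.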
[9] LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives
Ratna Kandala, Katie Hoemann
Main category: cs.CL
TL;DR: Dutch-tuned LLMs underperformed traditional lexicon-based tools (LIWC and Pattern) for valence prediction in Flemish, challenging assumptions about LLM superiority in sentiment analysis.
Details
Motivation: To evaluate whether Dutch-specific LLMs can outperform traditional lexicon-based tools in understanding emotional nuances in everyday Flemish language, a low-resource variant.
Method: Evaluated three Dutch-tuned LLMs (ChocoLlama-8B-Instruct, Reynaerde-7B-chat, GEITje-7B-ultra) against LIWC and Pattern using ~25,000 spontaneous textual responses from 102 Dutch-speaking participants with self-assessed valence ratings.
Result: Traditional methods outperformed Dutch-tuned LLMs, with Pattern showing superior performance in valence prediction despite LLMs’ architectural advancements.
Conclusion: Current LLM fine-tuning approaches may not adequately capture nuanced emotional expressions in everyday language, highlighting the need for culturally and linguistically tailored evaluation frameworks for low-resource languages.
Abstract: Understanding emotional nuances in everyday language is crucial for computational linguistics and emotion research. While traditional lexicon-based tools like LIWC and Pattern have served as foundational instruments, Large Language Models (LLMs) promise enhanced context understanding. We evaluated three Dutch-specific LLMs (ChocoLlama-8B-Instruct, Reynaerde-7B-chat, and GEITje-7B-ultra) against LIWC and Pattern for valence prediction in Flemish, a low-resource language variant. Our dataset comprised approximately 25,000 spontaneous textual responses from 102 Dutch-speaking participants, each providing narratives about their current experiences with self-assessed valence ratings (-50 to +50). Surprisingly, despite architectural advancements, the Dutch-tuned LLMs underperformed compared to traditional methods, with Pattern showing superior performance. These findings challenge assumptions about LLM superiority in sentiment analysis tasks and highlight the complexity of capturing emotional valence in spontaneous, real-world narratives. Our results underscore the need for developing culturally and linguistically tailored evaluation frameworks for low-resource language variants, while questioning whether current LLM fine-tuning approaches adequately address the nuanced emotional expressions found in everyday language use.
[10] LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward
Yi Zhao, Siqi Wang, Jing Li
Main category: cs.CL
TL;DR: LaF-GRPO method uses LLM to simulate visually impaired user responses for training VLMs, improving navigation instruction generation with reduced real-world data needs. Introduces NIG4VI dataset for evaluation.
Details
Motivation: Navigation instruction generation for visually impaired individuals is critical but underexplored, requiring precise in-situ instructions that are practically usable.
Method: Proposed LaF-GRPO where LLM simulates VI user responses to provide feedback rewards for VLM post-training, reducing real-world data collection. Introduced NIG4VI dataset with 27k samples.
Result: Quantitative improvements: Zero-(LaF-GRPO) boosts BLEU 14%; SFT+(LaF-GRPO) achieves METEOR 0.542 vs. GPT-4o 0.323. Qualitative analysis shows more intuitive and safer instructions.
Conclusion: LaF-GRPO effectively enhances navigation instruction accuracy and usability for visually impaired users while reducing data collection costs.
Abstract: Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. This study focuses on generating precise, in-situ, step-by-step navigation instructions that are practically usable for VI users. Specifically, we propose LaF-GRPO (LLM-as-Follower GRPO), where an LLM simulates VI user responses to navigation instructions, thereby providing feedback rewards to guide the post-training of a Vision-Language Model (VLM). This enhances instruction accuracy and usability while reducing costly real-world data collection needs. To address the scarcity of dedicated benchmarks in this field, we introduce NIG4VI, a 27k-sample open-source dataset to facilitate training and evaluation. It comprises diverse navigation scenarios with accurate spatial coordinates, supporting detailed and open-ended in-situ instruction generation. Experiments on NIG4VI demonstrate the effectiveness of LaF-GRPO through quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU 14%; SFT+(LaF-GRPO) METEOR 0.542 vs. GPT-4o 0.323), and qualitative analysis further confirms that our method yields more intuitive and safer instructions.
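The abstract does not give LaF-GRPO's reward formulation; a speculative sketch of the core loop — a simulated follower executes the instruction, and the reward depends on how close it ends to the goal — could look like this (both the follower interface and the distance-based reward shape are assumptions):

```python
def follower_reward(sim_follow, instruction, start, goal):
    """Reward an instruction by how close the simulated VI follower ends
    to the goal. `sim_follow` stands in for the LLM-as-Follower; the
    reward shape (1.0 at the goal, linear penalty) is an assumption.
    """
    end = sim_follow(instruction, start)
    dist = abs(end[0] - goal[0]) + abs(end[1] - goal[1])  # grid distance
    return 1.0 if dist == 0 else max(0.0, 1.0 - 0.25 * dist)

# Toy follower: walks forward the stated number of steps along x.
def toy_follower(instruction, start):
    steps = int(instruction.split()[1])
    return (start[0] + steps, start[1])

r = follower_reward(toy_follower, "forward 3 steps", (0, 0), (3, 0))  # -> 1.0
```

The appeal of this setup is that the reward is grounded in simulated execution rather than text similarity, which is what lets it substitute for expensive real-world VI user feedback during GRPO post-training.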
[11] Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering
Sai Shridhar Balamurali, Lu Cheng
Main category: cs.CL
TL;DR: NLI-based evaluation matches GPT-4o’s accuracy on long-form QA while being much more computationally efficient, and a new benchmark DIVER-QA is introduced for rigorous evaluation.
Details
Motivation: Current LLM evaluation methods are either too simplistic (lexical metrics) or too expensive (LLM-as-Judge), creating a need for efficient yet accurate evaluation alternatives.
Method: Use off-the-shelf Natural Language Inference (NLI) scoring with a simple lexical-match flag, and create DIVER-QA benchmark with 3000 human-annotated samples across five QA datasets and five LLMs.
Result: NLI-based evaluation achieves 89.9% accuracy matching GPT-4o’s performance while requiring orders-of-magnitude fewer parameters.
Conclusion: Inexpensive NLI-based evaluation remains competitive with state-of-the-art methods and DIVER-QA provides a valuable open resource for future metric research.
Abstract: Evaluating answers from state-of-the-art large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas “LLM-as-Judge” scoring is computationally expensive. We re-evaluate a lightweight alternative – off-the-shelf Natural Language Inference (NLI) scoring augmented by a simple lexical-match flag – and find that this decades-old technique matches GPT-4o’s accuracy (89.9%) on long-form QA, while requiring orders-of-magnitude fewer parameters. To test human alignment of these metrics rigorously, we introduce DIVER-QA, a new 3000-sample human-annotated benchmark spanning five QA datasets and five candidate LLMs. Our results highlight that inexpensive NLI-based evaluation remains competitive and offer DIVER-QA as an open resource for future metric research.
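The metric combines an NLI entailment score with a lexical-match flag; the exact combination rule is not stated, so the OR-style sketch below is an assumption, with the NLI model replaced by a probability passed in directly:

```python
def judge_answer(candidate: str, reference: str,
                 nli_entail_prob: float, threshold: float = 0.5) -> bool:
    """Accept an answer if the NLI model judges it entailed by the
    reference OR the normalized strings match exactly (the lexical flag).

    `nli_entail_prob` stands in for an off-the-shelf NLI model's
    entailment probability; the OR rule and threshold are assumptions.
    """
    lexical_match = candidate.strip().lower() == reference.strip().lower()
    return lexical_match or nli_entail_prob >= threshold

ok = judge_answer("Paris", "paris", nli_entail_prob=0.1)  # flag fires despite low NLI
```

The lexical flag patches the main NLI failure mode on short answers: a bare entity like "Paris" carries too little propositional content for an entailment model to score confidently, yet an exact string match settles it.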
[12] Stress Testing Factual Consistency Metrics for Long-Document Summarization
Zain Muhammad Mujahid, Dustin Wright, Isabelle Augenstein
Main category: cs.CL
TL;DR: Systematic evaluation of 6 reference-free factuality metrics for long-document summarization reveals they produce inconsistent scores for semantically equivalent summaries and struggle with information-dense claims.
Details
Motivation: Existing factuality metrics designed for short-form summarization struggle with long documents due to input length limitations and long-range dependencies, creating a need to evaluate their reliability in long-document settings.
Method: Evaluated 6 reference-free factuality metrics using 7 factuality-preserving perturbations (paraphrasing, simplification, synonym replacement, etc.) across 3 long-form benchmark datasets spanning science fiction, legal, and scientific domains, analyzing sensitivity to retrieval context and claim information density.
Result: Existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims. Expanding retrieval context improves stability in some domains but no metric consistently maintains factual alignment under long-context conditions.
Conclusion: The study highlights directions for improving factuality evaluation including multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to enhance robustness in long-form summarization.
Abstract: Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims whose content is semantically similar to many parts of the source document. While expanding the retrieval context improves stability in some domains, no metric consistently maintains factual alignment under long-context conditions. Finally, our results highlight concrete directions for improving factuality evaluation, including multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to enhance robustness in long-form summarization. We release all code, perturbed data, and scripts required to reproduce our results at https://github.com/zainmujahid/metricEval-longSum.
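The stress test's core measurement is how much a metric's score moves under meaning-preserving rewrites. A minimal sketch of that drift computation, using a toy lexical-overlap metric as the stand-in (any of the six evaluated metrics would plug into the same interface):

```python
def max_score_drift(metric, source, summary, perturbations):
    """Largest |score(original) - score(perturbed)| over a set of
    factuality-preserving rewrites; a robust metric keeps this small.

    `metric(source, summary) -> float` stands in for a real factuality
    metric; `perturbations` are summary -> summary rewrite functions.
    """
    base = metric(source, summary)
    return max(abs(base - metric(source, p(summary))) for p in perturbations)

def overlap_metric(source, summary):
    """Toy metric: fraction of summary tokens present in the source."""
    src = set(source.lower().split())
    toks = summary.lower().split()
    return sum(t in src for t in toks) / len(toks)

drift = max_score_drift(
    overlap_metric,
    "the court upheld the ruling",
    "the court upheld the ruling",
    [lambda s: s.replace("upheld", "affirmed")],  # synonym replacement
)
```

The toy metric's drift is nonzero for a synonym swap that changes no facts, which is exactly the kind of instability the paper documents in real short-form metrics applied to long documents.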
[13] CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences
Rhitabrat Pokharel, Yufei Tao, Ameeta Agrawal
Main category: cs.CL
TL;DR: CAPO is a confidence-aware preference optimization method that dynamically scales loss based on relative reward confidence, outperforming DPO in multilingual settings by 16% in reward accuracy.
Details
Motivation: Existing preference optimization methods like DPO work well in English but fail to generalize robustly to multilingual settings due to noisy or low-margin comparisons in multilingual text.
Method: CAPO replaces DPO’s fixed treatment of preference pairs with a dynamic loss scaling mechanism based on relative reward, modulating learning signal according to confidence in each preference pair.
Result: CAPO outperforms existing preference optimization baselines by at least 16% in reward accuracy and improves alignment by widening the gap between preferred and dispreferred responses across languages.
Conclusion: CAPO provides a simple yet effective alternative to DPO that enhances robustness to noisy multilingual preference data through confidence-aware dynamic loss scaling.
Abstract: Preference optimization is a critical post-training technique used to align large language models (LLMs) with human preferences, typically by fine-tuning on ranked response pairs. While methods like Direct Preference Optimization (DPO) have proven effective in English, they often fail to generalize robustly to multilingual settings. We propose a simple yet effective alternative, Confidence-Aware Preference Optimization (CAPO), which replaces DPO’s fixed treatment of preference pairs with a dynamic loss scaling mechanism based on a relative reward. By modulating the learning signal according to the confidence in each preference pair, CAPO enhances robustness to noisy or low-margin comparisons, typically encountered in multilingual text. Empirically, CAPO outperforms existing preference optimization baselines by at least 16% in reward accuracy, and improves alignment by widening the gap between preferred and dispreferred responses across languages.
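The abstract describes the mechanism (scale the per-pair loss by confidence in the preference) without giving its exact form; one plausible sketch, with a sigmoid of the reward gap as the confidence weight (this specific form is an assumption, not the paper's formula):

```python
import math

def dpo_loss(margin, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (logp_chosen - logp_rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def capo_loss(margin, reward_chosen, reward_rejected, beta=0.1):
    """CAPO-style sketch (assumed form): scale the DPO loss by a
    confidence weight from the relative reward gap, so noisy or
    low-margin multilingual pairs contribute a weaker learning signal.
    """
    confidence = 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))
    return confidence * dpo_loss(margin, beta)

low = capo_loss(2.0, reward_chosen=0.1, reward_rejected=0.0)   # near-tied pair
high = capo_loss(2.0, reward_chosen=3.0, reward_rejected=0.0)  # clear preference
```

For the same log-probability margin, the near-tied pair is down-weighted relative to the clear one, whereas plain DPO would treat both identically; that is the "fixed treatment" CAPO replaces.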
[14] Critical Confabulation: Can LLMs Hallucinate for Social Good?
Peiqi Sui, Eamon Duede, Hoyt Long, Richard Jean So
Main category: cs.CL
TL;DR: LLMs can use controlled hallucinations to fill gaps in historical archives caused by social inequality, creating evidence-bound narratives for overlooked historical figures.
Details
Motivation: Address omissions in historical archives due to social and political inequality by using LLM hallucinations to reconstruct narratives for "hidden figures" in history.
Method: Use narrative cloze tasks with masked events in character timelines from unpublished texts, testing various LLMs with different prompts designed to elicit controlled hallucinations.
Result: LLMs demonstrate foundational narrative understanding for critical confabulation, showing controlled hallucinations can support knowledge production while maintaining historical accuracy.
Conclusion: Well-specified LLM hallucinations can be valuable for knowledge production when carefully bounded, bridging archival gaps without sacrificing historical fidelity.
Abstract: LLMs hallucinate, yet some confabulations can have social affordances if carefully bounded. We propose critical confabulation (inspired by critical fabulation from literary and social theory), the use of LLM hallucinations to “fill-in-the-gap” for omissions in archives due to social and political inequality, and reconstruct divergent yet evidence-bound narratives for history’s “hidden figures”. We simulate these gaps with an open-ended narrative cloze task: asking LLMs to generate a masked event in a character-centric timeline sourced from a novel corpus of unpublished texts. We evaluate audited (for data contamination), fully-open models (the OLMo-2 family) and unaudited open-weight and proprietary baselines under a range of prompts designed to elicit controlled and useful hallucinations. Our findings validate LLMs’ foundational narrative understanding capabilities to perform critical confabulation, and show how controlled and well-specified hallucinations can support LLM applications for knowledge production without collapsing speculation into a lack of historical accuracy and fidelity.
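The open-ended narrative cloze setup is easy to reproduce in miniature; the mask token and prompt wording below are hypothetical stand-ins for the paper's protocol:

```python
MASK = "[MASKED EVENT]"

def narrative_cloze(timeline, mask_index):
    # Hide one event in a character-centric timeline and build a fill-in prompt.
    masked = [MASK if i == mask_index else e for i, e in enumerate(timeline)]
    prompt = ("One event in this character's timeline is masked. "
              "Propose a plausible, evidence-bound event for the masked slot.\n"
              + "\n".join(f"{i + 1}. {e}" for i, e in enumerate(masked)))
    return masked, prompt

timeline = ["arrives in the city", "takes a factory job", "joins a workers' association"]
masked, prompt = narrative_cloze(timeline, 1)
```

The evaluation then asks whether the model's fill is both plausible and bounded by the surrounding evidence, rather than scoring exact-match recovery of the hidden event.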
[15] Back to the Future: The Role of Past and Future Context Predictability in Incremental Language Production
Shiva Upadhye, Richard Futrell
Main category: cs.CL
TL;DR: The paper investigates how contextual predictability affects word production, introducing a new information-theoretic measure that integrates both past and future context predictability, and examines its effects on word duration and substitution errors.
Details
Motivation: To better understand backward predictability effects (word predictability given future context) in naturalistic speech, which are poorly understood but may relate to future planning during language production.
Method: Two studies using naturalistic speech corpora: (1) revisiting predictability effects on word duration with improved measures, (2) analyzing substitution errors using a generative framework that models lexical, contextual, and communicative factors independently.
Result: The proposed conceptually-motivated alternative to backward predictability yields similar effects across both studies. Fine-grained error analysis reveals how speakers prioritize form, meaning, and context-based information during lexical planning.
Conclusion: The findings illuminate the functional roles of past and future context in word encoding and choice, providing a bridge between contextual predictability effects and sentence planning mechanisms.
Abstract: Contextual predictability shapes both the form and choice of words in online language production. The effects of the predictability of a word given its previous context are generally well-understood in both production and comprehension, but studies of naturalistic production have also revealed a poorly-understood backward predictability effect of a word given its future context, which may be related to future planning. Here, in two studies of naturalistic speech corpora, we investigate backward predictability effects using improved measures and more powerful language models, introducing a new principled and conceptually motivated information-theoretic predictability measure that integrates predictability from both the future and the past context. Our first study revisits classic predictability effects on word duration. Our second study investigates substitution errors within a generative framework that independently models the effects of lexical, contextual, and communicative factors on word choice, while predicting the actual words that surface as speech errors. We find that our proposed conceptually-motivated alternative to backward predictability yields qualitatively similar effects across both studies. Through a fine-grained analysis of substitution errors, we further show that different kinds of errors are suggestive of how speakers prioritize form, meaning, and context-based information during lexical planning. Together, these findings illuminate the functional roles of past and future context in how speakers encode and choose words, offering a bridge between contextual predictability effects and the mechanisms of sentence planning.
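The paper's exact measure is not reproduced here, but the underlying quantities are standard surprisals; a toy count-based illustration of forward (past-only) conditioning versus joint past-and-future conditioning, with hypothetical trigram counts:

```python
import math

# Hypothetical corpus counts for (past word, target word, future word) windows.
counts = {
    ("the", "hot", "coffee"): 8,
    ("the", "hot", "day"): 2,
    ("the", "iced", "coffee"): 4,
    ("the", "iced", "tea"): 6,
}

def prob_word(past, w):
    # Forward probability p(w | past context).
    num = sum(c for (p, x, f), c in counts.items() if p == past and x == w)
    den = sum(c for (p, x, f), c in counts.items() if p == past)
    return num / den

def prob_word_both(past, w, fut):
    # Joint-context probability p(w | past, future), the core of a measure
    # integrating predictability from both directions.
    num = sum(c for (p, x, f), c in counts.items() if p == past and x == w and f == fut)
    den = sum(c for (p, x, f), c in counts.items() if p == past and f == fut)
    return num / den

forward_surprisal = -math.log2(prob_word("the", "hot"))                 # past only
combined_surprisal = -math.log2(prob_word_both("the", "hot", "coffee"))  # past + future
```

Here the upcoming word "coffee" makes "hot" more predictable, so the combined surprisal falls below the forward-only surprisal — the pattern a backward predictability effect would exploit.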
[16] Design, Results and Industry Implications of the World’s First Insurance Large Language Model Evaluation Benchmark
Hua Zhou, Bing Ma, Yufei Zhang, Yi Zhao
Main category: cs.CL
TL;DR: CUFEInse v1.0 is a comprehensive insurance domain benchmark covering 5 dimensions with 54 sub-indicators and 14,430 questions, evaluating 11 LLMs and revealing bottlenecks in actuarial capabilities, compliance adaptation, and business scenarios.
Details
Motivation: To fill the gap in professional evaluation benchmarks for the insurance field and provide academia and industry with a professional, systematic, and authoritative evaluation tool for large language models in vertical domains.
Method: Established an evaluation framework using “quantitative-oriented, expert-driven, and multi-validation” principles, covering 5 core dimensions (insurance theoretical knowledge, industry understanding, safety and compliance, intelligent agent application, logical rigor) with 54 sub-indicators and 14,430 high-quality questions.
Result: Evaluation of 11 mainstream LLMs revealed: general-purpose models have weak actuarial capabilities and inadequate compliance adaptation; domain-specific models show advantages in insurance vertical scenarios but have shortcomings in business adaptation and compliance; identified bottlenecks in insurance actuarial, underwriting/claim settlement reasoning, and compliant marketing copywriting.
Conclusion: CUFEInse provides an authoritative reference for academic model optimization and industrial model selection, with its construction methodology offering important references for vertical field evaluation paradigms. Future directions include benchmark iteration and “domain adaptation + reasoning enhancement” for insurance large models.
Abstract: This paper comprehensively elaborates on the construction methodology, multi-dimensional evaluation system, and underlying design philosophy of CUFEInse v1.0. Adhering to the principles of “quantitative-oriented, expert-driven, and multi-validation,” the benchmark establishes an evaluation framework covering 5 core dimensions, 54 sub-indicators, and 14,430 high-quality questions, encompassing insurance theoretical knowledge, industry understanding, safety and compliance, intelligent agent application, and logical rigor. Based on this benchmark, a comprehensive evaluation was conducted on 11 mainstream large language models. The evaluation results reveal that general-purpose models suffer from common bottlenecks such as weak actuarial capabilities and inadequate compliance adaptation. Models with high-quality domain-specific training demonstrate significant advantages in insurance vertical scenarios but exhibit shortcomings in business adaptation and compliance. The evaluation also accurately identifies the common bottlenecks of current large models in professional scenarios such as insurance actuarial work, underwriting and claim settlement reasoning, and compliant marketing copywriting. The establishment of CUFEInse not only fills the gap in professional evaluation benchmarks for the insurance field, providing academia and industry with a professional, systematic, and authoritative evaluation tool, but its construction concept and methodology also offer important references for the evaluation paradigm of large models in vertical fields, serving as an authoritative reference for academic model optimization and industrial model selection. Finally, the paper outlines future directions for iterating the evaluation benchmark and the core development direction of “domain adaptation + reasoning enhancement” for insurance large models.
[17] From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory
Siyu Xia, Zekun Xu, Jiajun Chai, Wentian Fan, Yan Song, Xiaohan Wang, Guojun Yin, Wei Lin, Haifeng Zhang, Jun Wang
Main category: cs.CL
TL;DR: A trainable graph memory framework that structures agent experiences into strategic meta-cognition, optimized via reinforcement learning to enhance LLM agents’ reasoning capabilities.
Details
Motivation: Current LLM agents acquire experience through either implicit training memory (suffering from catastrophic forgetting) or explicit prompting memory (lacking adaptability), creating a need for more effective experience utilization.
Method: Multi-layered graph memory that abstracts agent trajectories into structured decision paths, distills them into strategic meta-cognition, and uses reinforcement-based weight optimization to adapt memory based on task rewards.
Result: The learnable graph memory improves LLM agents’ strategic reasoning performance, delivers robust generalization, and provides consistent benefits during RL training.
Conclusion: The proposed graph memory framework effectively enhances LLM agents’ ability to utilize prior experiences and parametric information through adaptable, structured memory organization.
Abstract: Large Language Models (LLMs) based agents have demonstrated remarkable potential in autonomous task-solving across complex, open-ended environments. A promising approach for improving the reasoning capabilities of LLM agents is to better utilize prior experiences in guiding current decisions. However, LLMs acquire experience either through implicit memory via training, which suffers from catastrophic forgetting and limited interpretability, or explicit memory via prompting, which lacks adaptability. In this paper, we introduce a novel agent-centric, trainable, multi-layered graph memory framework and evaluate how context memory enhances the ability of LLMs to utilize parametric information. The graph abstracts raw agent trajectories into structured decision paths in a state machine and further distills them into high-level, human-interpretable strategic meta-cognition. In order to make memory adaptable, we propose a reinforcement-based weight optimization procedure that estimates the empirical utility of each meta-cognition based on reward feedback from downstream tasks. These optimized strategies are then dynamically integrated into the LLM agent’s training loop through meta-cognitive prompting. Empirically, the learnable graph memory delivers robust generalization, improves LLM agents’ strategic reasoning performance, and provides consistent benefits during Reinforcement Learning (RL) training.
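The reinforcement-based weight optimization amounts to credit assignment over stored strategies; a minimal sketch, assuming a simple exponential-moving-average utility estimate (the paper's exact estimator is not specified) and top-k selection for meta-cognitive prompting:

```python
def update_strategy_weights(weights, used, reward, lr=0.1):
    # Move each used strategy's utility estimate toward the observed task reward.
    new = dict(weights)
    for s in used:
        new[s] = (1 - lr) * new[s] + lr * reward
    return new

def select_strategies(weights, k=2):
    # Inject the top-k strategies into the agent's prompt as meta-cognition.
    return sorted(weights, key=weights.get, reverse=True)[:k]

# Hypothetical distilled strategies with current utility estimates.
w = {"decompose task first": 0.5, "verify before acting": 0.4, "retry blindly": 0.6}
w = update_strategy_weights(w, ["retry blindly"], reward=0.0, lr=0.5)  # failed episode
```

After the failed episode, the unhelpful strategy's weight drops below the others, so it is no longer selected for prompting — the adaptive behavior static prompt memories lack.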
[18] Adaptive Multi-Agent Response Refinement in Conversational Systems
Soyeong Jeong, Aparna Elangovan, Emine Yilmaz, Oleg Rokhlenko
Main category: cs.CL
TL;DR: Proposes a multi-agent framework for refining LLM responses by having specialized agents handle factuality, personalization, and coherence aspects, with dynamic communication to adaptively coordinate agents based on query requirements.
Details
Motivation: LLMs can generate human-like responses but struggle with personalization and specific knowledge, and it's impractical for users to detect errors and request new responses. Existing single-LLM refinement approaches fail to consider diverse aspects needed for effective conversations.
Method: Multi-agent framework where each agent specializes in one aspect (factuality, personalization, coherence), with dynamic communication strategy that adaptively selects and coordinates relevant agents based on query requirements rather than following fixed sequences.
Result: Significantly outperforms relevant baselines on challenging conversational datasets, particularly in tasks involving knowledge, user’s persona, or both.
Conclusion: The multi-agent framework with dynamic communication effectively improves conversational quality by addressing key aspects that single-LLM approaches struggle with, demonstrating superior performance in knowledge-intensive and personalized conversation tasks.
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in conversational systems by generating human-like responses. However, they can fall short, especially when required to account for personalization or specific knowledge. In real-life settings, it is impractical to rely on users to detect these errors and request a new response. One way to address this problem is to refine the response before returning it to the user. While existing approaches focus on refining responses within a single LLM, this method struggles to consider diverse aspects needed for effective conversations. In this work, we propose refining responses through a multi-agent framework, where each agent is assigned a specific role for each aspect. We focus on three key aspects crucial to conversational quality: factuality, personalization, and coherence. Each agent is responsible for reviewing and refining one of these aspects, and their feedback is then merged to improve the overall response. To enhance collaboration among them, we introduce a dynamic communication strategy. Instead of following a fixed sequence of agents, our approach adaptively selects and coordinates the most relevant agents based on the specific requirements of each query. We validate our framework on challenging conversational datasets, demonstrating that our framework significantly outperforms relevant baselines, particularly in tasks involving knowledge, a user’s persona, or both.
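The dynamic coordination can be sketched as a router plus per-aspect critics; the routing heuristics and critic interfaces below are illustrative assumptions, not the paper's actual selection policy:

```python
def select_agents(query, has_persona):
    # Adaptively choose agents per query instead of a fixed refinement sequence.
    agents = ["coherence"]  # always relevant to conversational quality
    if any(w in query.lower() for w in ("who", "what", "when", "where", "why", "how")):
        agents.append("factuality")  # knowledge-seeking query
    if has_persona:
        agents.append("personalization")
    return agents

def refine(response, agents, critics):
    # Each selected agent reviews and refines one aspect; feedback is merged in turn.
    for name in agents:
        response = critics[name](response)
    return response

# Toy critics standing in for specialized LLM agents.
critics = {
    "coherence": lambda r: r.strip(),
    "factuality": lambda r: r.replace("[unverified]", "[verified]"),
    "personalization": lambda r: r + " (tailored to user)",
}
out = refine("  The tower opened in 1889. [unverified] ",
             select_agents("When did the tower open?", has_persona=False), critics)
```

A purely narrative query with a known persona would instead route to the coherence and personalization agents, skipping the factuality check entirely.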
[19] AlignSurvey: A Comprehensive Benchmark for Human Preferences Alignment in Social Surveys
Chenxi Lin, Weikang Yuan, Zhuoren Jiang, Biao Huang, Ruitao Zhang, Jianan Ge, Yueqian Xu, Jianxing Yu
Main category: cs.CL
TL;DR: AlignSurvey is the first benchmark that systematically replicates and evaluates the full social survey pipeline using LLMs, addressing limitations of traditional surveys and previous LLM-based approaches.
Details
Motivation: Traditional surveys face challenges like fixed formats, high costs, limited adaptability, and cross-cultural issues. Previous LLM-based approaches are limited to structured questions, overlook the full survey process, and risk under-representing marginalized groups due to training data biases.
Method: Introduces AlignSurvey benchmark with four tasks aligned with key survey stages: social role modeling, semi-structured interview modeling, attitude stance modeling, and survey response modeling. Uses multi-tiered dataset architecture including Social Foundation Corpus (44K+ dialogues, 400K+ records) and Entire-Pipeline Survey Datasets. Develops SurveyLM family through two-stage fine-tuning of open-source LLMs.
Result: Provides systematic framework for evaluating LLM alignment with social survey processes, including task-specific metrics for alignment fidelity, consistency, and fairness at individual and group levels with demographic diversity focus.
Conclusion: AlignSurvey enables transparent and socially responsible research by providing comprehensive benchmark, datasets, models, and tools for evaluating LLMs in social survey contexts, addressing limitations of traditional methods and previous AI approaches.
Abstract: Understanding human attitudes, preferences, and behaviors through social surveys is essential for academic research and policymaking. Yet traditional surveys face persistent challenges, including fixed-question formats, high costs, limited adaptability, and difficulties ensuring cross-cultural equivalence. While recent studies explore large language models (LLMs) to simulate survey responses, most are limited to structured questions, overlook the entire survey process, and risk under-representing marginalized groups due to training data biases. We introduce AlignSurvey, the first benchmark that systematically replicates and evaluates the full social survey pipeline using LLMs. It defines four tasks aligned with key survey stages: social role modeling, semi-structured interview modeling, attitude stance modeling, and survey response modeling. It also provides task-specific evaluation metrics to assess alignment fidelity, consistency, and fairness at both individual and group levels, with a focus on demographic diversity. To support AlignSurvey, we construct a multi-tiered dataset architecture: (i) the Social Foundation Corpus, a cross-national resource with 44K+ interview dialogues and 400K+ structured survey records; and (ii) a suite of Entire-Pipeline Survey Datasets, including the expert-annotated AlignSurvey-Expert (ASE) and two nationally representative surveys for cross-cultural evaluation. We release the SurveyLM family, obtained through two-stage fine-tuning of open-source LLMs, and offer reference models for evaluating domain-specific alignment. All datasets, models, and tools are available on GitHub and Hugging Face to support transparent and socially responsible research.
[20] Planned Event Forecasting using Future Mentions and Related Entity Extraction in News Articles
Neelesh Kumar Shukla, Pranay Sanghvi
Main category: cs.CL
TL;DR: A system to forecast social unrest events using NLP techniques to analyze news articles and extract key entities involved in planned protests and rallies.
Details
Motivation: To help administrative officials take necessary action by forecasting civil unrest events like protests and rallies that are often announced in advance in news articles.
Method: Uses topic modeling and word2vec to filter relevant news articles, Named Entity Recognition (NER) to identify entities, time normalization for date standardization, and proposes Related Entity Extraction to identify entities actually involved in events.
Result: Developed a geographically independent, generalized model that can identify key features for filtering civil unrest events and extract related entities from news announcements.
Conclusion: The proposed system effectively forecasts social unrest events by analyzing news announcements and extracting relevant entities, providing a practical tool for administrative planning and response.
Abstract: In democracies like India, people are free to express their views and demands. Sometimes this causes situations of civil unrest such as protests, rallies, and marches. These events may be disruptive in nature and are often held without prior permission from the competent authority. Forecasting these events helps administrative officials take necessary action. Usually, protests are announced well in advance to encourage large participation. Therefore, by analyzing such announcements in news articles, planned events can be forecasted beforehand. We developed such a system in this paper to forecast social unrest events using topic modeling and word2vec to filter relevant news articles, and Named Entity Recognition (NER) methods to identify entities such as people, organizations, locations, and dates. Time normalization is applied to convert future date mentions into a standard format. In this paper, we have developed a geographically independent, generalized model to identify key features for filtering civil unrest events. There could be many mentions of entities, but only a few may actually be involved in the event. This paper calls such entities Related Entities and proposes a method to extract them, referred to as Related Entity Extraction.
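Time normalization of future date mentions is the most mechanical piece of the pipeline; a minimal sketch with hypothetical resolution rules (the paper does not specify its normalizer, and real mentions need many more patterns):

```python
import datetime as dt

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

def normalize_mention(mention, publish_date):
    # Map a future date mention in a news article to an ISO date,
    # relative to the article's publication date.
    m = mention.lower().strip()
    if m == "tomorrow":
        return (publish_date + dt.timedelta(days=1)).isoformat()
    if m.startswith("next ") and m[5:] in WEEKDAYS:
        target = WEEKDAYS.index(m[5:])
        # Nearest strictly-future occurrence of that weekday (1..7 days ahead).
        delta = (target - publish_date.weekday() - 1) % 7 + 1
        return (publish_date + dt.timedelta(days=delta)).isoformat()
    return None  # unhandled pattern: leave for a fuller normalizer
```

Normalized dates can then be grouped with the extracted locations and organizations to decide which announced events fall within a forecasting window.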
[21] Breaking the Adversarial Robustness-Performance Trade-off in Text Classification via Manifold Purification
Chenhao Dang, Jing Ma
Main category: cs.CL
TL;DR: MC^2F is a novel method that enhances adversarial robustness in text classification while preserving clean data performance by modeling clean data distribution in embedding space and correcting adversarial samples.
Details
Motivation: To resolve the persistent challenge where improving model robustness against adversarial attacks typically degrades performance on clean data in text classification tasks.
Method: Two-module system: 1) Stratified Riemannian Continuous Normalizing Flow (SR-CNF) learns clean data manifold density, 2) Geodesic Purification Solver projects adversarial embeddings back to clean manifold via shortest path.
Result: Establishes new state-of-the-art in adversarial robustness across three datasets and multiple attacks while fully preserving clean data performance with modest accuracy gains.
Conclusion: MC^2F successfully resolves the robustness-clean performance trade-off by operating directly on sentence embeddings and correcting adversarial samples through manifold projection.
Abstract: A persistent challenge in text classification (TC) is that enhancing model robustness against adversarial attacks typically degrades performance on clean data. We argue that this challenge can be resolved by modeling the distribution of clean samples in the encoder embedding manifold. To this end, we propose the Manifold-Correcting Causal Flow (MC^2F), a two-module system that operates directly on sentence embeddings. A Stratified Riemannian Continuous Normalizing Flow (SR-CNF) learns the density of the clean data manifold. It identifies out-of-distribution embeddings, which are then corrected by a Geodesic Purification Solver. This solver projects adversarial points back onto the learned manifold via the shortest path, restoring a clean, semantically coherent representation. We conducted extensive evaluations on text classification (TC) across three datasets and multiple adversarial attacks. The results demonstrate that our method, MC^2F, not only establishes a new state-of-the-art in adversarial robustness but also fully preserves performance on clean data, even yielding modest gains in accuracy.
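MC^2F's two components — a learned density over clean embeddings for out-of-distribution detection, and a shortest-path correction back onto the manifold — can be caricatured in a few lines. The diagonal Gaussian and straight-line update below are crude Euclidean stand-ins for the paper's Riemannian flow and geodesic solver:

```python
import math

def fit_gaussian(embeddings):
    # Per-dimension Gaussian fit as a stand-in for the learned manifold density.
    n, d = len(embeddings), len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / n for i in range(d)]
    var = [sum((e[i] - mean[i]) ** 2 for e in embeddings) / n + 1e-6 for i in range(d)]
    return mean, var

def log_density(x, mean, var):
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def purify(x, mean, var, threshold, step=0.5):
    # If x is out-of-distribution, move it along the straight line toward the
    # mode (a Euclidean stand-in for geodesic projection) until it is dense enough.
    while log_density(x, mean, var) < threshold:
        x = [xi + step * (m - xi) for xi, m in zip(x, mean)]
    return x

clean = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
mean, var = fit_gaussian(clean)
thr = log_density([1.0, 1.0], mean, var)   # density at a clean point as threshold
purified = purify([5.0, 5.0], mean, var, thr)
```

Clean inputs already above the threshold pass through untouched, which is how the design avoids trading away clean-data accuracy for robustness.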
[22] Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning
Songze Li, Zhiqiang Liu, Zhaoyan Gong, Xiaoke Guo, Zhengke Gui, Huajun Chen, Wen Zhang
Main category: cs.CL
TL;DR: The Logits-to-Logic framework improves LLMs’ logic consistency in structured knowledge reasoning by correcting logical defects through logits strengthening and filtering.
Details
Motivation: LLMs struggle with logic consistency in structured knowledge reasoning tasks due to representational differences between unstructured and structured knowledge, leading to Logic Drift challenges.
Method: Proposes Logits-to-Logic framework with logits strengthening and logits filtering modules to correct logical defects in LLM outputs during autoregressive generation.
Result: Significantly improves LLMs’ logic consistency in structured knowledge reasoning and achieves state-of-the-art performance on multiple KGQA benchmarks.
Conclusion: The Logits-to-Logic framework effectively addresses Logic Drift in LLMs for structured knowledge reasoning tasks.
Abstract: Large Language Models (LLMs) achieve excellent performance in natural language reasoning tasks through pre-training on vast unstructured text, enabling them to understand the logic in natural language and generate logic-consistent responses. However, the representational differences between unstructured and structured knowledge make LLMs inherently struggle to maintain logic consistency, leading to Logic Drift challenges in structured knowledge reasoning tasks such as Knowledge Graph Question Answering (KGQA). Existing methods address this limitation by designing complex workflows embedded in prompts to guide LLM reasoning. Nevertheless, these approaches only provide input-level guidance and fail to fundamentally address the Logic Drift in LLM outputs. Additionally, their inflexible reasoning workflows cannot adapt to different tasks and knowledge graphs. To enhance LLMs’ logic consistency in structured knowledge reasoning, we specifically target the logits output from the autoregressive generation process. We propose the Logits-to-Logic framework, which incorporates logits strengthening and logits filtering as core modules to correct logical defects in LLM outputs. Extensive experiments show that our approach significantly improves LLMs’ logic consistency in structured knowledge reasoning and achieves state-of-the-art performance on multiple KGQA benchmarks.
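Both modules act directly on the decoding distribution rather than on the prompt; a minimal sketch over a toy vocabulary, where the validity and knowledge-graph-support sets are hypothetical inputs such a framework would have to compute per step:

```python
def adjust_logits(logits, valid_tokens, supported_tokens, boost=2.0):
    # Filtering: keep only logically valid continuations.
    # Strengthening: boost tokens supported by the knowledge graph.
    out = {}
    for tok, score in logits.items():
        if tok not in valid_tokens:
            continue
        out[tok] = score + (boost if tok in supported_tokens else 0.0)
    return out

def greedy_pick(logits):
    return max(logits, key=logits.get)

# Toy decoding step: the raw logits favor an inconsistent token; the adjusted do not.
raw = {"Paris": 1.0, "banana": 3.0, "Lyon": 1.5}
adjusted = adjust_logits(raw, valid_tokens={"Paris", "Lyon"}, supported_tokens={"Paris"})
```

Because the correction happens at every autoregressive step, it constrains the output itself, unlike prompt-level workflow guidance that the model may ignore.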
[23] Social Media for Mental Health: Data, Methods, and Findings
Nur Shazwani Kamarudin, Ghazaleh Beigi, Lydia Manikonda, Huan Liu
Main category: cs.CL
TL;DR: Survey of social media data analysis for mental health detection using NLP and ML methods to identify depression, anxiety, and suicidal thoughts from user content.
Details
Motivation: Social media provides anonymous platforms for people to discuss stigmatized mental health issues, creating opportunities to study and detect mental health challenges through user-generated content.
Method: Uses linguistic, visual, and emotional indicators from social media data with machine learning, feature engineering, and natural language processing techniques to analyze mental health disclosures.
Result: Demonstrates how social media data can be leveraged to identify mental health issues and provides a categorization of data types and analytical methods used in this research domain.
Conclusion: Social media data offers a novel source for improving mental health practice, enabling timely support, and informing policy decisions through advanced computational analysis methods.
Abstract: There is an increasing number of virtual communities and forums available on the web. With social media, people can freely communicate and share their thoughts, ask personal questions, and seek peer-support, especially those with conditions that are highly stigmatized, without revealing personal identity. We study the state-of-the-art research methodologies and findings on mental health challenges like depression, anxiety, and suicidal thoughts from the pervasive use of social media data. We also discuss how this novel thinking and these approaches can help raise awareness of mental health issues in an unprecedented way. Specifically, this chapter describes linguistic, visual, and emotional indicators expressed in user disclosures. The main goal of this chapter is to show how this new source of data can be tapped to improve medical practice, provide timely support, and influence government or policymakers. In the context of social media for mental health issues, this chapter categorizes the social media data used; introduces the machine learning, feature engineering, natural language processing, and survey methods deployed; and outlines directions for future research.
[24] Distinct Theta Synchrony across Speech Modes: Perceived, Spoken, Whispered, and Imagined
Jung-Sun Lee, Ha-Na Jo, Eunyeong Ko
Main category: cs.CL
TL;DR: This study compares theta-band neural synchrony across four speech modes (perceived, overt, whispered, imagined) and finds distinct synchronization patterns that reflect different neural mechanisms for each mode.
Details
Motivation: Previous research has focused on single speech modes, lacking integrated comparisons of theta synchrony across different modes. The study aims to uncover shared and distinct neural dynamics underlying language perception and imagined speech.
Method: Analyzed differences in theta-band neural synchrony across speech modes using connectivity metrics, focusing on region-wise variations in brain connectivity patterns.
Result: Overt and whispered speech showed broader frontotemporal synchrony (motor-phonological coupling), perceived speech had dominant posterior/temporal synchrony (auditory processing), and imagined speech showed confined frontal/supplementary motor synchrony.
Conclusion: Theta synchrony extent and spatial distribution differ substantially across speech modes, with overt articulation engaging widespread cortical interactions, whispered speech showing intermediate engagement, and perception relying on temporoparietal networks.
Abstract: Human speech production encompasses multiple modes such as perceived, overt, whispered, and imagined, each reflecting distinct neural mechanisms. Among these, theta-band synchrony has been closely associated with language processing, attentional control, and inner speech. However, previous studies have largely focused on a single mode, such as overt speech, and have rarely conducted an integrated comparison of theta synchrony across different speech modes. In this study, we analyzed differences in theta-band neural synchrony across speech modes based on connectivity metrics, focusing on region-wise variations. The results revealed that overt and whispered speech exhibited broader and stronger frontotemporal synchrony, reflecting active motor-phonological coupling during overt articulation, whereas perceived speech showed dominant posterior and temporal synchrony patterns, consistent with auditory perception and comprehension processes. In contrast, imagined speech demonstrated a more spatially confined but internally coherent synchronization pattern, primarily involving frontal and supplementary motor regions. These findings indicate that the extent and spatial distribution of theta synchrony differ substantially across modes, with overt articulation engaging widespread cortical interactions, whispered speech showing intermediate engagement, and perception relying predominantly on temporoparietal networks. Therefore, this study aims to elucidate the differences in theta-band neural synchrony across various speech modes, thereby uncovering both the shared and distinct neural dynamics underlying language perception and imagined speech.
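A standard connectivity metric for this kind of theta-band analysis is the phase-locking value (PLV); whether the authors used PLV specifically is not stated, so treat this as a generic illustration of how synchrony between two channels is quantified:

```python
import cmath
import math

def phase_locking_value(phases_a, phases_b):
    # PLV = |mean over time of exp(i * phase difference)|, in [0, 1]:
    # 1 means a perfectly consistent phase relation, 0 means none.
    diffs = [cmath.exp(1j * (a - b)) for a, b in zip(phases_a, phases_b)]
    return abs(sum(diffs) / len(diffs))

# Constant phase lag -> fully locked; alternating lag -> no consistent relation.
locked = phase_locking_value([0.0, 1.0, 2.0, 3.0], [0.5, 1.5, 2.5, 3.5])
unlocked = phase_locking_value([0.0, 0.0, 0.0, 0.0], [0.0, math.pi, 0.0, math.pi])
```

Computing such a metric for every electrode pair yields the region-wise connectivity patterns that the four speech modes are compared on.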
[25] Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker
Matthias De Lange, Jens-Joris Decorte, Jeroen Van Hautte
Main category: cs.CL
TL;DR: WorkBench is a unified evaluation suite for work-related NLP tasks, and UWE is a task-agnostic bi-encoder that outperforms generalist embedding models on work domain tasks with zero-shot ranking capabilities.
Details
Motivation: Workforce transformation creates demand for specialized NLP, but work-related tasks have complex characteristics like long-tailed distributions and scarce data. Generalist embedding models' performance in the work domain is unclear as progress focuses on individual tasks.
Method: Created WorkBench benchmark with 6 work-related ranking tasks. Developed UWE using task-specific bipartite graphs from real data enriched synthetically, with many-to-many InfoNCE objective and token-level embeddings with soft late interaction.
Result: UWE shows significant positive cross-task transfer, achieves zero-shot ranking on unseen target spaces, enables low-latency inference, and demonstrates significant gains in macro-averaged MAP and RP@10 over generalist embedding models.
Conclusion: UWE effectively addresses work domain NLP challenges, outperforming generalist models while enabling efficient zero-shot ranking capabilities for workforce applications.
Abstract: Workforce transformation across diverse industries has driven an increased demand for specialized natural language processing capabilities. Nevertheless, tasks derived from work-related contexts inherently reflect real-world complexities, characterized by long-tailed distributions, extreme multi-label target spaces, and scarce data availability. The rise of generalist embedding models prompts the question of their performance in the work domain, especially as progress in the field has focused mainly on individual tasks. To this end, we introduce WorkBench, the first unified evaluation suite spanning six work-related tasks formulated explicitly as ranking problems, establishing a common ground for multi-task progress. Based on this benchmark, we find significant positive cross-task transfer, and use this insight to compose task-specific bipartite graphs from real-world data, synthetically enriched through grounding. This leads to Unified Work Embeddings (UWE), a task-agnostic bi-encoder that exploits our training-data structure with a many-to-many InfoNCE objective, and leverages token-level embeddings with task-agnostic soft late interaction. UWE demonstrates zero-shot ranking performance on unseen target spaces in the work domain, enables low-latency inference by caching the task target space embeddings, and shows significant gains in macro-averaged MAP and RP@10 over generalist embedding models.
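The many-to-many InfoNCE objective generalizes the usual one-positive contrastive form to the multiple positives a bipartite graph induces per query; a single-query sketch (the temperature and similarity scores are illustrative, not the paper's settings):

```python
import math

def many_to_many_infonce(sim_row, positive_idxs, temperature=0.1):
    # For one query: -log( sum over positives of exp(s/t) / sum over all of exp(s/t) ).
    # With a single positive this reduces to standard InfoNCE.
    exps = [math.exp(s / temperature) for s in sim_row]
    pos = sum(exps[i] for i in positive_idxs)
    return -math.log(pos / sum(exps))

sims = [0.9, 0.8, 0.1]                       # query similarity to three targets
good = many_to_many_infonce(sims, {0, 1})    # true positives rank highest -> low loss
bad = many_to_many_infonce(sims, {2})        # true positive ranks lowest -> high loss
```

Because the target-side embeddings enter only through the similarity row, they can be precomputed and cached per task, which is what enables the low-latency zero-shot ranking described above.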
[26] NOTAM-Evolve: A Knowledge-Guided Self-Evolving Optimization Framework with LLMs for NOTAM Interpretation
Maoqi Liu, Quan Fang, Yuhao Wu, Can Zhao, Yang Yang, Kaiquan Cai
Main category: cs.CL
TL;DR: NOTAM-Evolve is a self-evolving LLM framework that achieves a 30.4% absolute accuracy improvement over the base LLM for interpreting cryptic aviation NOTAMs, through dual reasoning combining knowledge grounding and schema-based inference.
Details
Motivation: Current automated NOTAM interpretation systems only perform shallow parsing and fail to extract actionable intelligence needed for aviation safety decisions, due to the condensed and cryptic nature of NOTAM language.
Method: Proposed NOTAM-Evolve framework with knowledge graph-enhanced retrieval for data grounding and closed-loop learning where LLM progressively improves from its own outputs, minimizing need for human-annotated reasoning traces.
Result: Achieved 30.4% absolute accuracy improvement over base LLM, establishing new state of the art on structured NOTAM interpretation using new benchmark dataset of 10,000 expert-annotated NOTAMs.
Conclusion: NOTAM-Evolve successfully enables LLMs to autonomously master complex NOTAM interpretation through self-evolving framework with dual-reasoning approach, significantly advancing aviation safety automation.
Abstract: Accurate interpretation of Notices to Airmen (NOTAMs) is critical for aviation safety, yet their condensed and cryptic language poses significant challenges to both manual and automated processing. Existing automated systems are typically limited to shallow parsing, failing to extract the actionable intelligence needed for operational decisions. We formalize the complete interpretation task as deep parsing, a dual-reasoning challenge requiring both dynamic knowledge grounding (linking the NOTAM to evolving real-world aeronautical data) and schema-based inference (applying static domain rules to deduce operational status). To tackle this challenge, we propose NOTAM-Evolve, a self-evolving framework that enables a large language model (LLM) to autonomously master complex NOTAM interpretation. Leveraging a knowledge graph-enhanced retrieval module for data grounding, the framework introduces a closed-loop learning process where the LLM progressively improves from its own outputs, minimizing the need for extensive human-annotated reasoning traces. In conjunction with this framework, we introduce a new benchmark dataset of 10,000 expert-annotated NOTAMs. Our experiments demonstrate that NOTAM-Evolve achieves a 30.4% absolute accuracy improvement over the base LLM, establishing a new state of the art on the task of structured NOTAM interpretation.
[27] State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?
Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić
Main category: cs.CL
TL;DR: LLMs show strong zero-shot performance on South Slavic language text classification, matching or surpassing fine-tuned BERT models, but have drawbacks like unpredictable outputs, slow inference, and high computational costs.
Details
Motivation: To evaluate LLM performance on text classification in less-resourced South Slavic languages and compare with fine-tuned BERT models, as this area remains under-explored despite the shift toward LLM prompting.
Method: Compare openly available fine-tuned BERT-like models with open-source and closed-source LLMs across three tasks: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts for South Slavic languages.
Result: LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT models, and perform comparably in South Slavic languages and English in zero-shot setup.
Conclusion: While LLMs show competitive performance, fine-tuned BERT-like models remain more practical for large-scale text annotation due to LLMs’ unpredictable outputs, slow inference, and high computational costs.
Abstract: Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.
[28] Self-Correction Distillation for Structured Data Question Answering
Yushan Zhu, Wen Zhang, Long Jin, Mengshu Sun, Ling Zhong, Zhiqiang Liu, Juan Li, Lei Liang, Chong Long, Chao Deng, Junlan Feng
Main category: cs.CL
TL;DR: Proposes Self-Correction Distillation (SCD) method to improve small-scale LLMs’ structured data QA by transferring query-generation and error-correction capabilities from large-scale LLMs.
Details
Motivation: Small-scale LLMs struggle with structured data QA due to errors in generating structured queries, while existing unified frameworks like TrustUQA work better with large-scale LLMs.
Method: Self-Correction Distillation (SCD) with Error Prompt Mechanism (EPM) to detect errors and provide customized messages, plus two-stage distillation to transfer capabilities from large to small LLMs.
Result: SCD achieves best performance on small-scale LLMs (8B) across 5 benchmarks with 3 structured data types, closely approaching GPT4 on some datasets. EPM also helps large-scale LLMs surpass SOTA results.
Conclusion: SCD effectively improves small-scale LLMs’ structured data QA capabilities through error correction and knowledge distillation, demonstrating superior generalization and performance.
Abstract: Structured data question answering (QA), including table QA, Knowledge Graph (KG) QA, and temporal KG QA, is a pivotal research area. Advances in large language models (LLMs) have driven significant progress in unified structural QA frameworks like TrustUQA. However, these frameworks face challenges when applied to small-scale LLMs since small-scale LLMs are prone to errors in generating structured queries. To improve the structured data QA ability of small-scale LLMs, we propose a self-correction distillation (SCD) method. In SCD, an error prompt mechanism (EPM) is designed to detect errors and provide customized error messages during inference, and a two-stage distillation strategy is designed to transfer large-scale LLMs’ query-generation and error-correction capabilities to small-scale LLM. Experiments across 5 benchmarks with 3 structured data types demonstrate that our SCD achieves the best performance and superior generalization on small-scale LLM (8B) compared to other distillation methods, and closely approaches the performance of GPT4 on some datasets. Furthermore, large-scale LLMs equipped with EPM surpass the state-of-the-art results on most datasets.
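The error prompt mechanism amounts to a detect-and-retry loop: if a generated structured query fails, the model is re-prompted with a customized error message. A minimal sketch of that control flow, with `generate` and `execute` as hypothetical stand-ins for the LLM and the query engine:

```python
def query_with_error_feedback(generate, execute, question, max_retries=2):
    """Sketch of an error-prompt loop in the spirit of EPM.

    generate(prompt) -> structured query string (the LLM)
    execute(query)   -> (ok, result_or_error_message) (the query engine)
    """
    prompt = question
    for _ in range(max_retries + 1):
        query = generate(prompt)
        ok, result_or_error = execute(query)
        if ok:
            return result_or_error
        # EPM: feed a customized error message back and retry
        prompt = f"{question}\nPrevious query failed: {result_or_error}. Fix it."
    return None
```

In the paper's two-stage distillation, traces from such corrections by a large model become training signal for the small model; the loop itself is the inference-time half of the mechanism.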
[29] HyCoRA: Hyper-Contrastive Role-Adaptive Learning for Role-Playing
Shihao Yang, Zhicong Lu, Yong Yang, Bo Lv, Yang Shen, Nayu Liu
Main category: cs.CL
TL;DR: HyCoRA is a framework that improves multi-character role-playing by balancing distinct and shared traits through a Hyper-Half Low-Rank Adaptation structure with hyper-contrastive learning.
Details
Motivation: Existing methods either use shared modules that ignore distinct traits or role-specific modules that overlook shared traits, limiting personality learning and commonality modeling.
Method: Proposes Hyper-Half Low-Rank Adaptation with role-specific modules generated by hyper-network and trainable role-shared modules, plus hyper-contrastive learning to distinguish unique characteristics.
Result: Experimental results on English and Chinese benchmarks show superiority, with GPT-4 evaluations and visual analyses confirming capability to capture role characteristics.
Conclusion: HyCoRA effectively balances distinct and shared trait learning for improved multi-character role-playing through its adaptive structure and contrastive mechanism.
Abstract: Multi-character role-playing aims to equip models with the capability to simulate diverse roles. Existing methods either use one shared parameterized module across all roles or assign a separate parameterized module to each role. However, the role-shared module may ignore distinct traits of each role, weakening personality learning, while the role-specific module may overlook shared traits across multiple roles, hindering commonality modeling. In this paper, we propose a novel HyCoRA: Hyper-Contrastive Role-Adaptive learning framework, which efficiently improves multi-character role-playing ability by balancing the learning of distinct and shared traits. Specifically, we propose a Hyper-Half Low-Rank Adaptation structure, where one half is a role-specific module generated by a lightweight hyper-network, and the other half is a trainable role-shared module. The role-specific module is devised to represent distinct persona signatures, while the role-shared module serves to capture common traits. Moreover, to better reflect distinct personalities across different roles, we design a hyper-contrastive learning mechanism to help the hyper-network distinguish their unique characteristics. Extensive experimental results on both English and Chinese available benchmarks demonstrate the superiority of our framework. Further GPT-4 evaluations and visual analyses also verify the capability of HyCoRA to capture role characteristics.
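The Hyper-Half structure splits a low-rank update into a shared half and a role-specific half whose factors come from a hyper-network. A toy pure-Python sketch of that decomposition (our reading of the paper; `hyper` and the matrix shapes are illustrative assumptions):

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def hyper_half_lora(x, A_shared, B_shared, role_emb, hyper):
    """Sketch of a Hyper-Half LoRA update:
    delta(x) = B_s A_s x + B_r A_r x, where the role-specific factors
    (A_r, B_r) are produced from a role embedding by a hypothetical
    hyper-network `hyper`, and (A_s, B_s) are trained and shared."""
    A_role, B_role = hyper(role_emb)                     # role-specific half
    shared = matvec(B_shared, matvec(A_shared, x))       # common traits
    specific = matvec(B_role, matvec(A_role, x))         # persona signature
    return [s + r for s, r in zip(shared, specific)]
```

Because only the small factors differ per role, adding a character costs little; the hyper-contrastive loss then pushes the generated role-specific factors apart across roles.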
[30] BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution
Abdullah Muhammad Moosa, Nusrat Sultana, Mahdi Muhammad Moosa, Md. Miraiz Hossain
Main category: cs.CL
TL;DR: This paper introduces BARD10, a new balanced Bangla authorship attribution dataset, and analyzes the impact of stop-word removal across classical and deep learning models, revealing that Bangla stop-words serve as essential stylistic indicators and that classical TF-IDF + SVM outperforms transformer models.
Details
Motivation: To investigate Bangla authorship attribution by introducing a new balanced benchmark corpus and systematically analyzing the impact of stop-word removal to uncover the stylistic significance of Bangla stop-words.
Method: Created BARD10 corpus with Bangla blog/opinion prose from 10 authors, assessed four classifiers (SVM, Bangla BERT, XGBoost, MLP) with uniform preprocessing on both BARD10 and BAAD16 datasets, and conducted error analysis on stop-word pruning effects.
Result: TF-IDF + SVM baseline outperformed all models, achieving macro-F1 of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by up to five points. BARD10 authors were highly sensitive to stop-word removal, while BAAD16 authors remained robust, showing genre-dependent reliance on stop-word signatures.
Conclusion: Three key insights: Bangla stop-words are essential stylistic indicators; finely calibrated ML models are effective for short-text limitations; BARD10 bridges formal literature with contemporary web dialogue and provides a reproducible benchmark for future transformer research.
Abstract: This research presents a comprehensive investigation into Bangla authorship attribution, introducing a new balanced benchmark corpus BARD10 (Bangla Authorship Recognition Dataset of 10 authors) and systematically analyzing the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors, alongside the methodical assessment of four representative classifiers: SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and a MLP (Multilayer Perceptron), utilizing uniform preprocessing on both BARD10 and the benchmark corpora BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). On all datasets, the classical TF-IDF + SVM baseline outperformed the other models, attaining a macro-F1 score of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by as much as five points. This study reveals that BARD10 authors are highly sensitive to stop-word pruning, while BAAD16 authors remain comparatively robust, highlighting genre-dependent reliance on stop-word signatures. Error analysis revealed that high-frequency components transmit authorial signatures that are diminished or reduced by transformer models. Three insights are identified: Bangla stop-words serve as essential stylistic indicators; finely calibrated ML models prove effective within short-text limitations; and BARD10 connects formal literature with contemporary web dialogue, offering a reproducible benchmark for future long-context or domain-adapted transformers.
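The stop-word finding rests on the mechanics of TF-IDF: function words are ordinary features unless pruned, so removing them erases exactly the frequencies that carry authorial style. A toy stdlib-only TF-IDF illustrating the pruning switch (a sketch, not the paper's pipeline):

```python
import math
from collections import Counter

def tfidf_vectors(docs, remove_stopwords=False, stopwords=frozenset()):
    """Toy TF-IDF over whitespace-tokenized docs; with remove_stopwords
    on, function-word features (and their stylistic signal) vanish."""
    tokenized = [[t for t in d.split()
                  if not (remove_stopwords and t in stopwords)]
                 for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * math.log(n / df[t])
                     for t, c in tf.items()})
    return vecs
```

Feeding such vectors to a linear SVM is the classical baseline the paper reports; the BARD10 result is that switching `remove_stopwords` on costs accuracy, while on BAAD16 it barely matters.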
[31] Estranged Predictions: Measuring Semantic Category Disruption with Masked Language Modelling
Yuxuan Liu, Haim Dubossarsky, Ruth Ahnert
Main category: cs.CL
TL;DR: Science fiction shows higher conceptual permeability than general fiction, especially for machine terms, using MLM to measure category boundary disruptions.
Details
Motivation: To computationally measure Darko Suvin's theory of estrangement in science fiction by analyzing how ontological categories (human/animal/machine) become destabilized.
Method: Used masked language modeling (RoBERTa) on science fiction and general fiction corpora, generating lexical substitutes for masked referents and classifying them via Gemini, with metrics for retention rate, replacement rate, and entropy.
Result: Science fiction exhibits heightened conceptual permeability, particularly around machine referents with significant cross-category substitution, while human terms maintain semantic coherence and anchor substitution hierarchies.
Conclusion: Estrangement in science fiction operates as controlled perturbation of semantic norms detectable through probabilistic modeling, and MLMs can serve as interpretive instruments for genre-conditioned ontological assumptions.
Abstract: This paper examines how science fiction destabilises ontological categories by measuring conceptual permeability across the terms human, animal, and machine using masked language modelling (MLM). Drawing on corpora of science fiction (Gollancz SF Masterworks) and general fiction (NovelTM), we operationalise Darko Suvin’s theory of estrangement as computationally measurable deviation in token prediction, using RoBERTa to generate lexical substitutes for masked referents and classifying them via Gemini. We quantify conceptual slippage through three metrics: retention rate, replacement rate, and entropy, mapping the stability or disruption of category boundaries across genres. Our findings reveal that science fiction exhibits heightened conceptual permeability, particularly around machine referents, which show significant cross-category substitution and dispersion. Human terms, by contrast, maintain semantic coherence and often anchor substitutional hierarchies. These patterns suggest a genre-specific restructuring within anthropocentric logics. We argue that estrangement in science fiction operates as a controlled perturbation of semantic norms, detectable through probabilistic modelling, and that MLMs, when used critically, serve as interpretive instruments capable of surfacing genre-conditioned ontological assumptions. This study contributes to the methodological repertoire of computational literary studies and offers new insights into the linguistic infrastructure of science fiction.
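The three metrics reduce to simple statistics over the categories of a masked term's predicted substitutes. A minimal reconstruction (our naming; the paper's exact definitions may differ in detail):

```python
import math
from collections import Counter

def substitution_metrics(original_category, substitute_categories):
    """Metrics over MLM substitutes for a masked referent:
    retention   - share of substitutes staying in the original category
    replacement - share crossing into another category
    entropy     - dispersion of the substitute category distribution (bits)
    """
    counts = Counter(substitute_categories)
    total = sum(counts.values())
    retention = counts[original_category] / total
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return retention, 1 - retention, entropy
```

A "machine" referent whose substitutes scatter across human and animal terms yields low retention and high entropy, which is the paper's quantitative signature of estrangement.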
[32] Multimodal LLMs Do Not Compose Skills Optimally Across Modalities
Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune
Main category: cs.CL
TL;DR: MLLMs struggle with cross-modality skill composition despite various improvement attempts.
Details
Motivation: To evaluate how well Multimodal Large Language Models can combine previously learned skills across different modalities to solve new tasks.
Method: Created three evaluation tasks requiring sequential composition of two modality-dependent skills, tested models with direct prompting and two-step cascaded inference, and explored chain-of-thought prompting and fine-tuning to improve performance.
Result: All evaluated MLLMs showed significant cross-modality skill composition gaps. Improvement strategies helped but gaps remained substantial.
Conclusion: More research is needed to effectively improve cross-modal skill composition capabilities in MLLMs.
Abstract: Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during their pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLM), and study their ability to compose skills across modalities. To this end, we design three evaluation tasks which can be solved sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate the aforementioned gap, we explore two alternatives: i) use chain-of-thought prompting to explicitly instruct MLLMs for skill composition and ii) a specific fine-tuning recipe to promote skill composition. Although those strategies improve model performance, they still exhibit significant skill composition gaps, suggesting that more research is needed to improve cross-modal skill composition in MLLMs.
[33] Quantification and object perception in Multimodal Large Language Models deviate from human linguistic cognition
Raquel Montero, Natalia Moskvina, Paolo Morosi, Tamara Serrano, Elena Pagliarini, Evelina Leivada
Main category: cs.CL
TL;DR: MLLMs struggle with quantification due to differences in how they encode key human quantification features like quantifier scales, usage ranges, and numerical biases compared to humans.
Details
Motivation: To understand why (Multimodal) Large Language Models perform poorly on quantification tasks and investigate how they encode three key human quantification features that interface with logic, pragmatics, and numerical domains.
Method: Examined three unexplored human quantification features in MLLMs: quantifier ordering scales, usage ranges/prototypicality, and approximate number system biases. Analyzed how these features are encoded in model architecture and compared human vs. model performance across different tasks and languages.
Result: Found clear differences between humans and MLLMs across various tasks, showing that models encode quantification features differently than humans. Results varied based on model type and language investigated.
Conclusion: This research helps address MLLMs’ nature as semantic and pragmatic agents and shows cross-linguistic analysis can reveal whether their quantification abilities are robust across different languages.
Abstract: Quantification has been proven to be a particularly difficult linguistic phenomenon for (Multimodal) Large Language Models (MLLMs). However, given that quantification interfaces with the logical, pragmatic, and numerical domains, the exact reasons for the poor performance are still unclear. This paper looks at three key features of human quantification shared cross-linguistically that have remained so far unexplored in the (M)LLM literature: the ordering of quantifiers into scales, the ranges of use and prototypicality, and the biases inherent in the human approximate number system. The aim is to determine how these features are encoded in the models’ architecture, how they may differ from humans, and whether the results are affected by the type of model and language under investigation. We find that there are clear differences between humans and MLLMs with respect to these features across various tasks that tap into the representation of quantification in vivo vs. in silico. This work, thus, paves the way for addressing the nature of MLLMs as semantic and pragmatic agents, while the cross-linguistic lens can elucidate whether their abilities are robust and stable across different languages.
[34] Sentence-Anchored Gist Compression for Long-Context LLMs
Dmitrii Tarasov, Elizaveta Goncharova, Kuznetsov Andrey
Main category: cs.CL
TL;DR: Fine-tuning LLMs to compress context using learned tokens, achieving 2x-8x compression with minimal performance loss.
Details
Motivation: Reduce memory and computational demands of processing long sequences in LLMs.
Method: Fine-tune pre-trained LLMs to compress context using learned compression tokens.
Result: Achieves 2x-8x compression without significant performance degradation; matches alternative techniques with higher compression ratios on 3B-parameter LLaMA model.
Conclusion: Learned compression tokens enable effective context compression in LLMs while maintaining performance.
Abstract: This work investigates context compression for Large Language Models (LLMs) using learned compression tokens to reduce the memory and computational demands of processing long sequences. We demonstrate that pre-trained LLMs can be fine-tuned to compress their context by factors of 2x to 8x without significant performance degradation, as evaluated on both short-context and long-context benchmarks. Furthermore, in experiments on a 3-billion-parameter LLaMA model, our method achieves results on par with alternative compression techniques while attaining higher compression ratios.
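The sequence-level arithmetic behind the 2x-8x figure is straightforward: each context segment is replaced by a small fixed number of gist tokens. A schematic sketch (placeholder strings stand in for the learned token embeddings, which in the real method are trained during fine-tuning):

```python
import math

def gist_compress(tokens, segment_len, n_gist):
    """Replace each segment of `segment_len` context tokens with
    `n_gist` gist placeholders; compression ratio = segment_len / n_gist."""
    n_segments = math.ceil(len(tokens) / segment_len)
    return [f"<GIST:{s}:{j}>"
            for s in range(n_segments)
            for j in range(n_gist)]
```

With `segment_len=8` and `n_gist=2` the context shrinks 4x, which is the regime (2x-8x) where the paper reports no significant degradation on short- and long-context benchmarks.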
[35] On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility
Kushal Tatariya, Wessel Poelman, Miryam de Lhoneux
Main category: cs.CL
TL;DR: The study examines whether positional encoding choices in language models affect performance differently across languages, particularly testing the trade-off hypothesis between morphological complexity and word order flexibility.
Details
Motivation: To investigate if architectural bias from English-first language model development degrades performance for structurally different languages, specifically focusing on positional encodings and the morphological complexity/word order flexibility trade-off.
Method: Pretrained monolingual model variants with absolute, relative, and no positional encodings for seven typologically diverse languages, evaluated on four downstream tasks.
Result: No clear interaction observed between positional encodings and morphological complexity or word order flexibility, contrary to previous findings. Task, language, and metric choices proved crucial for stable conclusions.
Conclusion: Positional encoding choices don’t consistently interact with language structural properties as hypothesized, and methodological factors significantly influence findings about architectural biases.
Abstract: Language model architectures are predominantly first created for English and subsequently applied to other languages. It is an open question whether this architectural bias leads to degraded performance for languages that are structurally different from English. We examine one specific architectural choice: positional encodings, through the lens of the trade-off hypothesis: the supposed interplay between morphological complexity and word order flexibility. This hypothesis posits a trade-off between the two: a more morphologically complex language can have a more flexible word order, and vice-versa. Positional encodings are a direct target to investigate the implications of this hypothesis in relation to language modelling. We pretrain monolingual model variants with absolute, relative, and no positional encodings for seven typologically diverse languages and evaluate them on four downstream tasks. Contrary to previous findings, we do not observe a clear interaction between position encodings and morphological complexity or word order flexibility, as measured by various proxies. Our results show that the choice of tasks, languages, and metrics is essential for drawing stable conclusions.
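For reference, the classic absolute variant among the encodings compared is the sinusoidal scheme of Vaswani et al., where each position maps to a fixed vector of sines and cosines at geometrically spaced frequencies. A minimal stdlib sketch:

```python
import math

def sinusoidal_pe(position, d_model):
    """Absolute sinusoidal positional encoding for one position:
    pe[2i] = sin(pos / 10000^(2i/d)), pe[2i+1] = cos(pos / 10000^(2i/d))."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]
```

Relative encodings instead condition attention on position differences, and the "no positional encoding" variant drops this signal entirely; the paper's point is that none of these choices interacts cleanly with morphological complexity or word order flexibility.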
[36] Relation as a Prior: A Novel Paradigm for LLM-based Document-level Relation Extraction
Qiankun Pi, Yepeng Sun, Jicang Lu, Qinlong Fan, Ningbo Huang, Shiyu Wang
Main category: cs.CL
TL;DR: RelPrior is a new paradigm for document-level relation extraction that uses relation as a prior to filter irrelevant entity pairs and extract triples, achieving state-of-the-art performance on LLM-based methods.
Details
Motivation: Current LLM-based methods for document-level relation extraction suffer from noise from unrelated entity pairs and misjudgment of relations beyond predefined labels, leading to performance gaps.
Method: RelPrior uses binary relation as a prior to filter irrelevant entity pairs and predefined relation as a prior to match entities for triple extraction instead of direct relation prediction.
Result: Extensive experiments on two benchmarks show RelPrior achieves state-of-the-art performance, surpassing existing LLM-based methods.
Conclusion: The RelPrior paradigm effectively addresses key challenges in LLM-based document relation extraction by using relation as prior knowledge, demonstrating superior performance over traditional approaches.
Abstract: Large Language Models (LLMs) have demonstrated their remarkable capabilities in document understanding. However, recent research reveals that LLMs still exhibit performance gaps in Document-level Relation Extraction (DocRE) as requiring fine-grained comprehension. The commonly adopted “extract entities then predict relations” paradigm in LLM-based methods leads to these gaps due to two main reasons: (1) Numerous unrelated entity pairs introduce noise and interfere with the relation prediction for truly related entity pairs. (2) Although LLMs have identified semantic associations between entities, relation labels beyond the predefined set are still treated as prediction errors. To address these challenges, we propose a novel Relation as a Prior (RelPrior) paradigm for LLM-based DocRE. For challenge (1), RelPrior utilizes binary relation as a prior to extract and determine whether two entities are correlated, thereby filtering out irrelevant entity pairs and reducing prediction noise. For challenge (2), RelPrior utilizes predefined relation as a prior to match entities for triples extraction instead of directly predicting relation. Thus, it avoids misjudgment caused by strict predefined relation labeling. Extensive experiments on two benchmarks demonstrate that RelPrior achieves state-of-the-art performance, surpassing existing LLM-based methods.
[37] Still Not There: Can LLMs Outperform Smaller Task-Specific Seq2Seq Models on the Poetry-to-Prose Conversion Task?
Kunal Kingkar Das, Manoj Balaji Jagadeeshan, Nallani Chakravartula Sahith, Jivnesh Sandhan, Pawan Goyal
Main category: cs.CL
TL;DR: Instruction-tuned LLMs underperform specialized ByT5-Sanskrit models on Sanskrit poetry-to-prose conversion, despite sophisticated prompting strategies based on Paninian grammar.
Details
Motivation: To test whether LLMs can serve as universal solutions for low-resource, morphologically rich languages like Sanskrit, using poetry-to-prose conversion as a challenging test case.
Method: Compared instruction-tuned LLMs with in-context prompting against fully fine-tuned ByT5-Sanskrit Seq2Seq models on Sanskrit verse-to-prose conversion, incorporating Paninian grammar and classical commentary heuristics.
Result: Domain-specific fine-tuning of ByT5-Sanskrit significantly outperformed all LLM approaches, with human evaluation strongly confirming these results through high correlation with Kendall’s Tau scores.
Conclusion: Specialized task-specific models remain superior to general-purpose LLMs for complex linguistic tasks in low-resource languages, though prompting strategies offer viable alternatives when domain-specific training data is unavailable.
Abstract: Large Language Models (LLMs) are increasingly treated as universal, general-purpose solutions across NLP tasks, particularly in English. But does this assumption hold for low-resource, morphologically rich languages such as Sanskrit? We address this question by comparing instruction-tuned and in-context-prompted LLMs with smaller task-specific encoder-decoder models on the Sanskrit poetry-to-prose conversion task. This task is intrinsically challenging: Sanskrit verse exhibits free word order combined with rigid metrical constraints, and its conversion to canonical prose (anvaya) requires multi-step reasoning involving compound segmentation, dependency resolution, and syntactic linearisation. This makes it an ideal testbed to evaluate whether LLMs can surpass specialised models. For LLMs, we apply instruction fine-tuning on general-purpose models and design in-context learning templates grounded in Paninian grammar and classical commentary heuristics. For task-specific modelling, we fully fine-tune a ByT5-Sanskrit Seq2Seq model. Our experiments show that domain-specific fine-tuning of ByT5-Sanskrit significantly outperforms all instruction-driven LLM approaches. Human evaluation strongly corroborates this result, with scores exhibiting high correlation with Kendall’s Tau scores. Additionally, our prompting strategies provide an alternative to fine-tuning when domain-specific verse corpora are unavailable, and the task-specific Seq2Seq model demonstrates robust generalisation on out-of-domain evaluations.
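The human-evaluation claim hinges on Kendall's Tau, which measures rank agreement between human scores and automatic scores as the balance of concordant and discordant pairs. A minimal Tau-a implementation (for lists without ties; the paper does not specify which Tau variant it uses):

```python
def kendalls_tau(x, y):
    """Kendall's Tau-a: (concordant - discordant) / total pairs,
    ranging from -1 (reversed ranking) to 1 (identical ranking)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A Tau near 1 between human judgments and the automatic metric is what licenses the paper's claim that human evaluation "strongly corroborates" the ByT5-Sanskrit result.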
[38] Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models?
Arzu Burcu Güven, Anna Rogers, Rob van der Goot
Main category: cs.CL
TL;DR: Analysis of BabyLM corpus and CHILDES age-groups shows limited syntactic differentiation by age, but syntactic knowledge aids model performance interpretation. Curriculum learning approaches, particularly using syntactically categorizable data subsets, improve performance more than full noisy corpora.
Details
Motivation: To understand syntactic properties of child language data and explore how syntactic knowledge and curriculum learning approaches can improve model performance on linguistic tasks.
Method: Examined syntactic properties of BabyLM corpus and CHILDES age-groups, explored developmental and cognitively inspired curriculum approaches for curriculum learning.
Result: CHILDES shows weak syntactic differentiation by age; syntactic knowledge helps interpret model performance; curriculum learning with syntactically categorizable data subsets improves performance more than full noisy corpora.
Conclusion: Syntactic knowledge of training data is valuable for interpreting model performance, and curriculum learning benefits more from using syntactically categorizable data subsets rather than full noisy corpora.
Abstract: We examine the syntactic properties of the BabyLM corpus and of age-groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore developmental and several alternative cognitively inspired curriculum approaches. We find that some curricula help with reading tasks, but the main performance improvement comes from using the subset of syntactically categorizable data, rather than the full noisy corpus.
[39] Encoder Fine-tuning with Stochastic Sampling Outperforms Open-weight GPT in Astronomy Knowledge Extraction
Shivam Rawat, Lucie Flek, Akbar Karimi
Main category: cs.CL
TL;DR: An encoder-based system using SciBERT for extracting key entities and contextual information from astronomy papers, outperforming GPT baselines.
Details
Motivation: Rapid expansion of astronomy literature necessitates automation of entity extraction from research papers.
Method: Multi-task transformer system built on SciBERT, fine-tuned on astronomy corpora with stochastic sampling and majority voting.
Result: System significantly outperforms open-weight GPT baseline despite simplicity and low-cost implementation.
Conclusion: The proposed encoder-based approach effectively extracts astronomical entities and outperforms existing baselines.
Abstract: Scientific literature in astronomy is rapidly expanding, making it increasingly important to automate the extraction of key entities and contextual information from research papers. In this paper, we present an encoder-based system for extracting knowledge from astronomy articles. Our objective is to develop models capable of classifying telescope references, detecting auxiliary semantic attributes, and recognizing instrument mentions from textual content. To this end, we implement a multi-task transformer-based system built upon the SciBERT model and fine-tuned for astronomy corpora classification. To carry out the fine-tuning, we stochastically sample segments from the training data and use majority voting over the test segments at inference time. Our system, despite its simplicity and low-cost implementation, significantly outperforms the open-weight GPT baseline.
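The fine-tuning recipe the abstract describes (stochastically sampling segments from training documents, then majority voting over test segments at inference) can be sketched as follows; the function names and segment length are illustrative, not taken from the paper.

```python
import random
from collections import Counter

def sample_segments(tokens, seg_len=128, n_samples=4, rng=None):
    """Randomly sample fixed-length token segments from one document
    (training-time augmentation)."""
    rng = rng or random.Random(0)
    if len(tokens) <= seg_len:
        return [tokens]
    starts = [rng.randrange(len(tokens) - seg_len) for _ in range(n_samples)]
    return [tokens[s:s + seg_len] for s in starts]

def majority_vote(segment_labels):
    """Aggregate per-segment predictions into a single document label."""
    return Counter(segment_labels).most_common(1)[0][0]
```

At inference, each test document would be split into segments, classified segment-by-segment, and the per-segment labels aggregated with `majority_vote`.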
[40] Benchmarking Educational LLMs with Analytics: A Case Study on Gender Bias in Feedback
Yishan Du, Conrad Borchers, Mutlu Cukurova
Main category: cs.CL
TL;DR: This paper presents an embedding-based framework to detect gender bias in LLMs for educational feedback, finding asymmetric responses to gender substitutions across multiple models.
Details
Motivation: As teachers increasingly use GenAI in education, robust benchmarking methods are needed to detect bias in LLMs for pedagogical purposes, particularly in formative feedback.
Method: Used 600 authentic student essays with controlled counterfactuals: implicit cues via lexicon-based gendered term swaps and explicit cues via gendered author backgrounds. Tested 6 LLMs, quantified response divergence with embedding distances, assessed significance via permutation tests, and visualized with dimensionality reduction.
Result: All models showed larger semantic shifts for male-female counterfactuals than female-male. Only GPT and Llama models were sensitive to explicit gender cues. Qualitative analysis revealed linguistic differences (more autonomy-supportive feedback for male cues vs. controlling feedback for female cues).
Conclusion: State-of-the-art LLMs exhibit persistent gender biases in educational feedback, requiring fairness auditing, reporting standards for counterfactual evaluation, and practical guidance for equitable prompt design and deployment.
Abstract: As teachers increasingly turn to GenAI in their educational practice, we need robust methods to benchmark large language models (LLMs) for pedagogical purposes. This article presents an embedding-based benchmarking framework to detect bias in LLMs in the context of formative feedback. Using 600 authentic student essays from the AES 2.0 corpus, we constructed controlled counterfactuals along two dimensions: (i) implicit cues via lexicon-based swaps of gendered terms within essays, and (ii) explicit cues via gendered author background in the prompt. We investigated six representative LLMs (i.e. GPT-5 mini, GPT-4o mini, DeepSeek-R1, DeepSeek-R1-Qwen, Gemini 2.5 Pro, Llama-3-8B). We first quantified the response divergence with cosine and Euclidean distances over sentence embeddings, then assessed significance via permutation tests, and finally, visualised structure using dimensionality reduction. In all models, implicit manipulations reliably induced larger semantic shifts for male-female counterfactuals than for female-male. Only the GPT and Llama models showed sensitivity to explicit gender cues. These findings show that even state-of-the-art LLMs exhibit asymmetric semantic responses to gender substitutions, suggesting persistent gender biases in feedback they provide learners. Qualitative analyses further revealed consistent linguistic differences (e.g., more autonomy-supportive feedback under male cues vs. more controlling feedback under female cues). We discuss implications for fairness auditing of pedagogical GenAI, propose reporting standards for counterfactual evaluation in learning analytics, and outline practical guidance for prompt design and deployment to safeguard equitable feedback.
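The divergence measurement described (embedding distance between responses to counterfactual inputs, with significance assessed by permutation tests) follows a standard pattern; a minimal numpy sketch, with all names and the exact test design assumed rather than taken from the paper:

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two embedding vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def permutation_test(dists_mf, dists_fm, n_perm=10000, seed=0):
    """Two-sample permutation test: is the mean semantic shift for
    male->female counterfactuals different from female->male?"""
    rng = np.random.default_rng(seed)
    observed = dists_mf.mean() - dists_fm.mean()
    pooled = np.concatenate([dists_mf, dists_fm])
    n = len(dists_mf)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(pooled[:n].mean() - pooled[n:].mean()) >= abs(observed):
            count += 1
    return count / n_perm  # p-value estimate
```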
[41] VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context
Heyang Liu, Ziyang Cheng, Yuhao Wang, Hongcheng Liu, Yiqi Li, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Main category: cs.CL
TL;DR: VocalBench-zh is a comprehensive Mandarin speech-to-speech evaluation suite with 10 subsets and over 10K instances, addressing the lack of systematic benchmarks for multi-modal LLMs in Mandarin contexts.
Details
Motivation: The scarcity of comprehensive speech-to-speech benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users, despite Mandarin being widely supported by most models.
Method: Proposed VocalBench-zh, an ability-level divided evaluation suite adapted to Mandarin context consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters.
Result: Evaluation of 14 mainstream models revealed common challenges for current approaches and highlighted the need for new insights into next-generation speech interactive systems.
Conclusion: VocalBench-zh provides a systematic evaluation framework for Mandarin speech-to-speech capabilities, enabling better model comparison and identifying areas for improvement in speech interactive systems.
Abstract: The development of multi-modal large language models (LLMs) leads to intelligent approaches capable of speech interactions. As one of the most widely spoken languages globally, Mandarin is supported by most models to enhance their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an ability-level divided evaluation suite adapted to Mandarin context consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters. The evaluation experiment on 14 mainstream models reveals the common challenges for current routes, and highlights the need for new insights into next-generation speech interactive systems. The evaluation codes and datasets will be available at https://github.com/SJTU-OmniAgent/VocalBench-zh.
[42] Prompt Tuning for Natural Language to SQL with Embedding Fine-Tuning and RAG
Jisoo Jang, Tien-Cuong Bui, Yunjun Choi, Wen-Syan Li
Main category: cs.CL
TL;DR: A novel NL-to-SQL framework with an error-correction mechanism based on prompt tuning, inspired by medical diagnostics, that improves accuracy by 12% over baselines.
Details
Motivation: Address the need for efficient and accurate natural language to SQL translation in data-driven environments with growing use of natural language interfaces.
Method: Integrates error correction mechanism that diagnoses error types, identifies causes, provides fixing instructions, and applies corrections, enhanced by embedding fine-tuning and RAG for external knowledge.
Result: Achieves a significant 12% accuracy improvement over existing baselines in comprehensive experiments.
Conclusion: The framework has potential to revolutionize data access and handling in contemporary data-driven environments.
Abstract: This paper introduces an Error Correction through Prompt Tuning for NL-to-SQL, leveraging the latest advancements in generative pre-training-based LLMs and RAG. Our work addresses the crucial need for efficient and accurate translation of natural language queries into SQL expressions in various settings with the growing use of natural language interfaces. We explore the evolution of NLIDBs from early rule-based systems to advanced neural network-driven approaches. Drawing inspiration from the medical diagnostic process, we propose a novel framework integrating an error correction mechanism that diagnoses error types, identifies their causes, provides fixing instructions, and applies these corrections to SQL queries. This approach is further enriched by embedding fine-tuning and RAG, which harnesses external knowledge bases for improved accuracy and transparency. Through comprehensive experiments, we demonstrate that our framework achieves a significant 12 percent accuracy improvement over existing baselines, highlighting its potential to revolutionize data access and handling in contemporary data-driven environments.
[43] ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech
Marios Koniaris, Argyro Tsipi, Panayiotis Tsanakas
Main category: cs.CL
TL;DR: ParliaBench is a benchmark for parliamentary speech generation that addresses the need for political authenticity and ideological consistency, introducing novel metrics and showing fine-tuning significantly improves model performance.
Details
Motivation: Parliamentary speech generation requires political authenticity and ideological consistency beyond standard text generation, but current models lack specialized training and existing evaluation methods focus on standard NLP metrics rather than political dimensions.
Method: Constructed UK Parliament speech dataset, developed evaluation framework combining computational metrics with LLM-as-a-judge assessments, proposed Political Spectrum Alignment and Party Alignment metrics, fine-tuned five LLMs and generated 28k speeches.
Result: Fine-tuning produced statistically significant improvements across most metrics, and the novel political alignment metrics demonstrated strong discriminative power for political dimensions.
Conclusion: The ParliaBench benchmark successfully addresses the specific challenges of parliamentary speech generation and shows that specialized fine-tuning with appropriate evaluation metrics can significantly improve political authenticity in generated speeches.
Abstract: Parliamentary speech generation presents specific challenges for large language models beyond standard text generation tasks. Unlike general text generation, parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts, and existing evaluation methods focus on standard NLP metrics rather than political authenticity. To address this, we present ParliaBench, a benchmark for parliamentary speech generation. We constructed a dataset of speeches from UK Parliament to enable systematic model training. We introduce an evaluation framework combining computational metrics with LLM-as-a-judge assessments for measuring generation quality across three dimensions: linguistic quality, semantic coherence, and political authenticity. We propose two novel embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning. We fine-tuned five large language models (LLMs), generated 28k speeches, and evaluated them using our framework, comparing baseline and fine-tuned models. Results show that fine-tuning produces statistically significant improvements across the majority of metrics and our novel metrics demonstrate strong discriminative power for political dimensions.
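The abstract does not give formulas for its embedding-based Party Alignment metric, but one plausible formulation scores a generated speech by how much closer its embedding sits to the intended party's centroid than to the nearest rival centroid; a hypothetical sketch (party names and the margin definition are assumptions, not the paper's definition):

```python
import numpy as np

def party_alignment(speech_emb, party_centroids, target_party):
    """Margin between cosine similarity to the target party's centroid and
    the best rival centroid; positive values suggest ideological alignment."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    target = cos(speech_emb, party_centroids[target_party])
    best_rival = max(cos(speech_emb, c)
                     for p, c in party_centroids.items() if p != target_party)
    return target - best_rival
```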
[44] Hierarchical structure understanding in complex tables with VLLMs: a benchmark and experiments
Luca Bindini, Simone Giovannini, Simone Marinai, Valeria Nardoni, Kimiya Noor Ali
Main category: cs.CL
TL;DR: VLLMs can understand table structure without special processing, tested on complex hierarchical tables from PubTables-1M dataset using prompt engineering and fine-tuning.
Details
Motivation: To investigate whether Vision Large Language Models can interpret hierarchical table structures in scientific articles without additional processing.
Method: Used PubTables-1M dataset to create CHiTab benchmark of complex hierarchical tables, employed prompt engineering strategies with various formats and styles, evaluated multiple VLLMs both off-the-shelf and fine-tuned, and compared with human performance.
Result: Generic VLLMs not specifically designed for table structure understanding can perform this task, showing potential for processing complex tables.
Conclusion: VLLMs have capability to understand table structures, providing insights for future integration of structured data understanding into general-purpose vision-language models.
Abstract: This work investigates the ability of Vision Large Language Models (VLLMs) to understand and interpret the structure of tables in scientific articles. Specifically, we explore whether VLLMs can infer the hierarchical structure of tables without additional processing. As a basis for our experiments we use the PubTables-1M dataset, a large-scale corpus of scientific tables. From this dataset, we extract a subset of tables that we introduce as Complex Hierarchical Tables (CHiTab): a benchmark collection of complex tables containing hierarchical headings. We adopt a series of prompt engineering strategies to probe the models’ comprehension capabilities, experimenting with various prompt formats and writing styles. Multiple state-of-the-art open-weights VLLMs are evaluated on the benchmark first using their off-the-shelf versions and then fine-tuning some models on our task. We also measure the performance of humans to solve the task on a small set of tables comparing with performance of the evaluated VLLMs. The experiments support our intuition that generic VLLMs, not explicitly designed for understanding the structure of tables, can perform this task. This study provides insights into the potential and limitations of VLLMs to process complex tables and offers guidance for future work on integrating structured data understanding into general-purpose VLLMs.
[45] Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates
Shuaimin Li, Liyang Fan, Yufang Lin, Zeyang Li, Xian Wei, Shiwen Ni, Hamid Alinejad-Rokny, Min Yang
Main category: cs.CL
TL;DR: ReViewGraph is a framework that uses graph reasoning over LLM-simulated reviewer-author debates to improve paper review quality by capturing complex argumentative dynamics.
Details
Motivation: Existing paper review methods suffer from hallucinations, biased scoring, limited reasoning, and fail to capture complex reviewer-author interaction dynamics.
Method: Simulates multi-round reviewer-author debates using LLM-based multi-agent collaboration, extracts diverse opinion relations as typed edges in heterogeneous graphs, and applies graph neural networks for reasoning.
Result: Outperforms strong baselines with 15.73% average relative improvement across three datasets.
Conclusion: Modeling detailed reviewer-author debate structures through graph reasoning significantly enhances paper review quality.
Abstract: Existing paper review methods often rely on superficial manuscript features or directly on large language models (LLMs), which are prone to hallucinations, biased scoring, and limited reasoning capabilities. Moreover, these methods often fail to capture the complex argumentative reasoning and negotiation dynamics inherent in reviewer-author interactions. To address these limitations, we propose ReViewGraph (Reviewer-Author Debates Graph Reasoner), a novel framework that performs heterogeneous graph reasoning over LLM-simulated multi-round reviewer-author debates. In our approach, reviewer-author exchanges are simulated through LLM-based multi-agent collaboration. Diverse opinion relations (e.g., acceptance, rejection, clarification, and compromise) are then explicitly extracted and encoded as typed edges within a heterogeneous interaction graph. By applying graph neural networks to reason over these structured debate graphs, ReViewGraph captures fine-grained argumentative dynamics and enables more informed review decisions. Extensive experiments on three datasets demonstrate that ReViewGraph outperforms strong baselines with an average relative improvement of 15.73%, underscoring the value of modeling detailed reviewer-author debate structures.
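The typed-edge debate graph described above can be represented minimally as follows; only the four relation names come from the abstract, and the class itself is an illustrative sketch, not the authors' data structure:

```python
from collections import defaultdict

# Opinion relation types named in the abstract.
EDGE_TYPES = {"acceptance", "rejection", "clarification", "compromise"}

class DebateGraph:
    """Heterogeneous graph over reviewer/author utterance nodes with typed edges."""
    def __init__(self):
        self.nodes = {}                 # node_id -> {"role": ..., "text": ...}
        self.edges = defaultdict(list)  # edge_type -> [(src, dst), ...]

    def add_utterance(self, node_id, role, text):
        self.nodes[node_id] = {"role": role, "text": text}

    def add_relation(self, src, dst, edge_type):
        assert edge_type in EDGE_TYPES, f"unknown relation: {edge_type}"
        self.edges[edge_type].append((src, dst))
```

A graph neural network would then consume the node texts (as embeddings) and the per-type edge lists to score the paper.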
[46] AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.CL
TL;DR: Proposes AgentPRM, a process reward model for LLM agents that evaluates sequential decisions based on goal proximity and progress, using TD-based estimation for efficient training.
Details
Motivation: LLMs struggle with multi-turn decision-making tasks that require sequential intelligent decisions based on environmental feedback, and existing approaches rely heavily on prompt engineering or expert trajectory fine-tuning.
Method: Develop AgentPRM to capture interdependence between sequential decisions and their contribution to final goals, using Temporal Difference-based estimation with Generalized Advantage Estimation for scalable data labeling.
Result: AgentPRM is over 8× more compute-efficient than baselines, shows robust improvement with increased test-time compute, and works across different agentic tasks.
Conclusion: AgentPRM provides an effective approach for guiding LLM agents in multi-turn decision-making tasks, enabling better progress tracking and exploration-exploitation balance.
Abstract: Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent’s decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over $8\times$ more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
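Generalized Advantage Estimation, which the authors combine with TD-based estimation to label training data, computes advantages by a backward recursion over per-step TD errors; a standard sketch of GAE itself, not the paper's code:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.
    `values` has len(rewards) + 1 entries (bootstrap value appended)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```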
[47] DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering
Xinyi Wang, Yiping Song, Zhiliang Tian, Bo Liu, Tingjin Luo, Minlie Huang
Main category: cs.CL
TL;DR: DPRM is a dual implicit process reward model that trains two separate PRMs for CoT and KG reasoning in MHQA tasks, using outcome signals without explicit annotations and introducing consistency constraints between reasoning paths.
Details
Motivation: Existing implicit PRMs cannot handle graph structure constraints in KGs or capture inconsistencies between CoT and KG paths in MHQA tasks, limiting their effectiveness for multi-step reasoning.
Method: Train two implicit PRMs (KG-PRM and CoT-PRM) that derive step-level rewards from outcome signals via reward parameterization. KG-PRM uses preference pairs to learn structural constraints from KGs, and DPRM introduces consistency constraints between CoT and KG reasoning steps.
Result: Outperforms 13 baselines on multiple datasets with up to 16.6% improvement on Hit@1 metric.
Conclusion: DPRM effectively addresses limitations of existing implicit PRMs by handling graph constraints and ensuring consistency between reasoning paths, demonstrating superior performance in MHQA tasks.
Abstract: In multi-hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi-step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback after generating the final answers but fail to evaluate the process for multi-step reasoning. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. While implicit PRM is trained only with outcome signals and derives step rewards through reward parameterization without explicit annotations, it is more suitable for multi-step reasoning in MHQA tasks. However, existing implicit PRM has only been explored for plain text scenarios. When adapting to MHQA tasks, it cannot handle the graph structure constraints in KGs and capture the potential inconsistency between CoT and KG paths. To address these limitations, we propose the DPRM (Dual Implicit Process Reward Model). It trains two implicit PRMs for CoT and KG reasoning in MHQA tasks. Both PRMs, namely KG-PRM and CoT-PRM, derive step-level rewards from outcome signals via reward parameterization without additional explicit annotations. Among them, KG-PRM uses preference pairs to learn structural constraints from KGs. DPRM further introduces a consistency constraint between CoT and KG reasoning steps, making the two PRMs mutually verify and collaboratively optimize the reasoning paths. We also provide a theoretical demonstration of the derivation of process rewards. Experimental results show that our method outperforms 13 baselines on multiple datasets with up to 16.6% improvement on Hit@1.
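Implicit PRMs typically parameterize step rewards as a scaled log-probability ratio between the outcome-trained policy and a reference model; a sketch of that per-step computation, assuming token log-probs grouped by reasoning step (the paper's exact parameterization may differ):

```python
def implicit_step_rewards(policy_logprobs, ref_logprobs, beta=0.05):
    """Per-step rewards r_t = beta * sum over step-t tokens of
    (log pi - log pi_ref). Each input element is the list of token
    log-probabilities for one reasoning step."""
    rewards = []
    for step_pi, step_ref in zip(policy_logprobs, ref_logprobs):
        rewards.append(beta * (sum(step_pi) - sum(step_ref)))
    return rewards
```

A step the trained policy prefers more strongly than the reference model receives a positive reward, with no explicit step annotation needed.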
[48] The Dynamic Articulatory Model DYNARTmo: Dynamic Movement Generation and Speech Gestures
Bernd J. Kröger
Main category: cs.CL
TL;DR: DYNARTmo is a dynamic articulatory model that generates continuous articulator movements using speech gestures and gesture scores, simulating hierarchical speech production control from linguistic to articulatory-acoustic levels.
Details
Motivation: To create a neurobiologically inspired computational framework for simulating the hierarchical control of speech production, bridging linguistic representations with articulatory-acoustic realization.
Method: Uses speech gestures organized in a gesture inventory, coordinates them through gesture scores, and translates them into continuous articulator trajectories that control the DYNARTmo vocal tract model.
Result: The paper presents the current implementation of DYNARTmo, detailing its structure for generating continuous articulator movements based on the gesture-based approach.
Conclusion: DYNARTmo provides a comprehensive framework for simulating speech production through gesture-based articulatory modeling, connecting linguistic planning with physical articulation.
Abstract: This paper describes the current implementation of the dynamic articulatory model DYNARTmo, which generates continuous articulator movements based on the concept of speech gestures and a corresponding gesture score. The model provides a neurobiologically inspired computational framework for simulating the hierarchical control of speech production from linguistic representation to articulatory-acoustic realization. We present the structure of the gesture inventory, the coordination of gestures in the gesture score, and their translation into continuous articulator trajectories controlling the DYNARTmo vocal tract model.
[49] TurkEmbed: Turkish Embedding Model on NLI & STS Tasks
Özay Ezerceli, Gizem Gümüşçekiçci, Tuğba Erkoç, Berke Özenç
Main category: cs.CL
TL;DR: TurkEmbed is a new Turkish embedding model that outperforms existing models on NLI and STS tasks by 1-4%, using diverse datasets and matryoshka representation learning for better semantic understanding.
Details
Motivation: Current Turkish embedding models rely on machine-translated datasets, which limits their accuracy and semantic understanding of the Turkish language.
Method: Uses diverse datasets and advanced training techniques including matryoshka representation learning to create robust embeddings that adapt to resource-constrained environments with faster encoding.
Result: Achieves significant improvements on Turkish STS-b-TR dataset using Pearson and Spearman correlation metrics, and surpasses state-of-the-art model Emrecan on All-NLI-TR and STS-b-TR benchmarks by 1-4%.
Conclusion: TurkEmbed enhances the Turkish NLP ecosystem by providing more nuanced language understanding and facilitating advancements in downstream applications.
Abstract: This paper introduces TurkEmbed, a novel Turkish language embedding model designed to outperform existing models, particularly in Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. Current Turkish embedding models often rely on machine-translated datasets, potentially limiting their accuracy and semantic understanding. TurkEmbed utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning, to achieve more robust and accurate embeddings. This approach enables the model to adapt to various resource-constrained environments, offering faster encoding capabilities. Our evaluation on the Turkish STS-b-TR dataset, using Pearson and Spearman correlation metrics, demonstrates significant improvements in semantic similarity tasks. Furthermore, TurkEmbed surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement. TurkEmbed promises to enhance the Turkish NLP ecosystem by providing a more nuanced understanding of language and facilitating advancements in downstream applications.
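Matryoshka representation learning trains embeddings so that nested prefixes remain usable at reduced dimensionality, which is what enables the faster encoding in constrained environments mentioned above. A sketch of scoring at several truncation depths (the dimensions are chosen for illustration, not from the paper):

```python
import numpy as np

def matryoshka_scores(query, doc, dims=(64, 128, 256, 768)):
    """Cosine similarity at each nested prefix dimension. Training sums a
    contrastive loss over these truncations, so short prefixes of the
    embedding stay usable on their own."""
    scores = {}
    for d in dims:
        q, v = query[:d], doc[:d]
        scores[d] = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return scores
```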
[50] PCRLLM: Proof-Carrying Reasoning with Large Language Models under Stepwise Logical Constraints
Tangrui Li, Pei Wang, Hongzheng Wang, Christian Hahm, Matteo Spatola, Justin Shi
Main category: cs.CL
TL;DR: PCRLLM is a framework that constrains LLM reasoning to single-step inferences with explicit premises, rules, and conclusions, enabling verification and systematic multi-LLM collaboration.
Details
Motivation: LLMs often lack logical coherence and explicit inference rules, mapping premises to conclusions without formal reasoning structure, which raises trustworthiness concerns.
Method: Propose Proof-Carrying Reasoning with LLMs (PCRLLM) that constrains reasoning to single-step inferences while preserving natural language, with explicit specification of premises, rules, and conclusions.
Result: Enables verification against target logic, supports chain-level validation in black-box settings, facilitates systematic multi-LLM collaboration with formal rule integration, and introduces benchmark schema for step-level reasoning data.
Conclusion: PCRLLM provides a framework that combines natural language expressiveness with formal rigor, addressing LLM trustworthiness issues through explicit reasoning structure and verification capabilities.
Abstract: Large Language Models (LLMs) often exhibit limited logical coherence, mapping premises to conclusions without adherence to explicit inference rules. We propose Proof-Carrying Reasoning with LLMs (PCRLLM), a framework that constrains reasoning to single-step inferences while preserving natural language formulations. Each output explicitly specifies premises, rules, and conclusions, thereby enabling verification against a target logic. This mechanism mitigates trustworthiness concerns by supporting chain-level validation even in black-box settings. Moreover, PCRLLM facilitates systematic multi-LLM collaboration, allowing intermediate steps to be compared and integrated under formal rules. Finally, we introduce a benchmark schema for generating large-scale step-level reasoning data, combining natural language expressiveness with formal rigor.
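A proof-carrying step with explicit premises, rule, and conclusion can be checked mechanically even in a black-box setting; a toy string-level sketch for a single modus ponens step (the schema and checker are illustrative, not PCRLLM's actual format):

```python
from dataclasses import dataclass

@dataclass
class ProofStep:
    premises: tuple   # natural-language premises
    rule: str         # declared inference rule
    conclusion: str

def check_modus_ponens(step):
    """Verify one single-step inference against its declared rule.
    Toy string-level check for 'if P, Q' + 'P' => 'Q'."""
    if step.rule != "modus_ponens" or len(step.premises) != 2:
        return False
    conditional, antecedent = step.premises
    if not conditional.startswith("if "):
        return False
    head, _, tail = conditional[3:].partition(", ")
    return head == antecedent and tail == step.conclusion
```

A chain of such steps can be validated step by step, which is the chain-level verification the abstract refers to.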
[51] Interaction Dynamics as a Reward Signal for LLMs
Sian Gooding, Edward Grefenstette
Main category: cs.CL
TL;DR: TRACE uses conversational geometry (embedding trajectory dynamics) as a novel reward signal for LLM alignment, achieving comparable performance to text-based methods and superior performance when combined with text analysis.
Details
Motivation: Current LLM alignment methods focus only on text content, ignoring valuable interaction dynamics that could provide complementary signals for better conversational performance.
Method: Developed TRACE - a trajectory-based reward model that analyzes geometric properties of dialogue embedding trajectories (conversational geometry) to create structural reward signals.
Result: TRACE achieved 68.20% pairwise accuracy using only structural signals, comparable to LLM baseline using full transcripts (70.04%). Hybrid model combining both achieved best performance (80.17%).
Conclusion: Interaction dynamics are as predictive as text content for conversational success, offering a privacy-preserving framework for agent alignment and diagnostic analysis of collaboration patterns.
Abstract: The alignment of Large Language Models (LLMs) for multi-turn conversations typically relies on reward signals derived from the content of the text. This approach, however, overlooks a rich, complementary source of signal: the dynamics of the interaction itself. This paper introduces TRACE (Trajectory-based Reward for Agent Collaboration Estimation), a novel reward signal derived from the geometric properties of a dialogue’s embedding trajectory–a concept we term ‘conversational geometry’. Our central finding is that a reward model trained only on these structural signals achieves a pairwise accuracy (68.20%) comparable to a powerful LLM baseline that analyzes the full transcript (70.04%). Furthermore, a hybrid model combining interaction dynamics with textual analysis achieves the highest performance (80.17%), demonstrating their complementary nature. This work provides strong evidence that for interactive settings, how an agent communicates is as powerful a predictor of success as what it says, offering a new, privacy-preserving framework that not only aligns agents but also serves as a diagnostic tool for understanding the distinct interaction patterns that drive successful collaboration.
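Two simple geometric descriptors of an embedding trajectory, step length and turning angle, illustrate the kind of structural signal "conversational geometry" could draw on; the feature choices here are assumptions, not the paper's actual feature set:

```python
import numpy as np

def trajectory_features(embeddings):
    """Geometric descriptors of a dialogue's embedding trajectory:
    mean step length and mean cosine of the turning angle between
    consecutive steps (1.0 = dialogue moving in a straight line)."""
    E = np.asarray(embeddings, dtype=float)
    steps = E[1:] - E[:-1]
    lengths = np.linalg.norm(steps, axis=1)
    cosines = [
        float(np.dot(steps[i], steps[i + 1]) / (lengths[i] * lengths[i + 1]))
        for i in range(len(steps) - 1)
        if lengths[i] > 0 and lengths[i + 1] > 0
    ]
    return {"mean_step": float(lengths.mean()),
            "mean_turn_cos": float(np.mean(cosines)) if cosines else 0.0}
```

Because such features never expose the transcript text, they support the privacy-preserving framing in the abstract.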
[52] Bot Meets Shortcut: How Can LLMs Aid in Handling Unknown Invariance OOD Scenarios?
Shiyan Zheng, Herun Wan, Minnan Luo, Junhang Huang
Main category: cs.CL
TL;DR: Social bot detectors are vulnerable to shortcut learning based on spurious textual correlations. The study shows these shortcuts cause 32% accuracy drop, and proposes LLM-based counterfactual augmentation strategies that improve performance by 56%.
Details
Motivation: Existing social bot detectors perform well on benchmarks but lack robustness in real-world scenarios due to unclear ground truth and shortcut learning, where models rely on spurious correlations rather than causal features.
Method: Designed shortcut scenarios with spurious associations between user labels and textual cues, then proposed mitigation strategies using large language models and counterfactual data augmentation at three levels: individual user text, overall dataset distribution, and model’s causal extraction ability.
Result: Shifts in irrelevant feature distributions caused average 32% relative accuracy drop in baseline models. The proposed LLM-based mitigation strategies achieved 56% relative performance improvement under shortcut scenarios.
Conclusion: Shortcut learning significantly degrades social bot detector robustness, but counterfactual data augmentation using large language models effectively mitigates this problem across multiple levels.
Abstract: While existing social bot detectors perform well on benchmarks, their robustness across diverse real-world scenarios remains limited due to unclear ground truth and varied misleading cues. In particular, the impact of shortcut learning, where models rely on spurious correlations instead of capturing causal task-relevant features, has received limited attention. To address this gap, we conduct an in-depth study to assess how detectors are influenced by potential shortcuts based on textual features, which are most susceptible to manipulation by social bots. We design a series of shortcut scenarios by constructing spurious associations between user labels and superficial textual cues to evaluate model robustness. Results show that shifts in irrelevant feature distributions significantly degrade social bot detector performance, with an average relative accuracy drop of 32% in the baseline models. To tackle this challenge, we propose mitigation strategies based on large language models, leveraging counterfactual data augmentation. These methods mitigate the problem from data and model perspectives across three levels, including data distribution at both the individual user text and overall dataset levels, as well as the model’s ability to extract causal information. Our strategies achieve an average relative performance improvement of 56% under shortcut scenarios.
[53] SPEAR-MM: Selective Parameter Evaluation and Restoration via Model Merging for Efficient Financial LLM Adaptation
Berkcan Kapusuzoglu, Supriyo Chakraborty, Renkun Ni, Stephen Rawls, Sambit Sahu
Main category: cs.CL
TL;DR: SPEAR-MM is a framework that prevents catastrophic forgetting in domain-adapted LLMs by selectively freezing/restoring transformer layers via model merging, achieving high capability retention with reduced computational costs.
Details
Motivation: LLMs adapted to financial domains often lose general reasoning capabilities essential for customer interactions and complex financial analysis, creating a need for methods that preserve these critical capabilities during domain adaptation.
Method: Approximates layer-wise impact on external benchmarks through post-hoc analysis, then selectively freezes or restores transformer layers via spherical interpolation merging.
Result: Applied to LLaMA-3.1-8B, achieves 91.2% retention of general capabilities vs 69.7% for standard continual pretraining, while maintaining 94% of domain adaptation gains, with 90% computational cost reduction.
Conclusion: SPEAR-MM provides interpretable trade-off control and practical efficiency for resource-constrained financial institutions, effectively balancing domain adaptation with capability preservation.
Abstract: Large language models (LLMs) adapted to financial domains often suffer from catastrophic forgetting of general reasoning capabilities essential for customer interactions and complex financial analysis. We introduce Selective Parameter Evaluation and Restoration via Model Merging (SPEAR-MM), a practical framework that preserves critical capabilities while enabling domain adaptation. Our method approximates layer-wise impact on external benchmarks through post-hoc analysis, then selectively freezes or restores transformer layers via spherical interpolation merging. Applied to LLaMA-3.1-8B for financial tasks, SPEAR-MM achieves 91.2% retention of general capabilities versus 69.7% for standard continual pretraining, while maintaining 94% of domain adaptation gains. The approach provides interpretable trade-off control and reduces computational costs by 90%, which is crucial for resource-constrained financial institutions.
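The spherical-interpolation (slerp) merging step can be sketched generically: interpolate the direction of two flattened weight vectors on the unit sphere and their magnitudes linearly. This is a standard slerp under our own assumptions, not SPEAR-MM's exact merging code.

```python
import math

def slerp(w0: list[float], w1: list[float], t: float) -> list[float]:
    """Spherical linear interpolation between two flattened weight vectors.
    t=0 returns w0, t=1 returns w1 (up to floating-point error)."""
    n0 = math.sqrt(sum(x * x for x in w0))
    n1 = math.sqrt(sum(x * x for x in w1))
    u0 = [x / n0 for x in w0]
    u1 = [x / n1 for x in w1]
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u0, u1))))
    theta = math.acos(dot)
    if theta < 1e-8:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    # Interpolate direction on the sphere, magnitude linearly.
    mag = (1 - t) * n0 + t * n1
    return [mag * (s0 * a + s1 * b) for a, b in zip(u0, u1)]
```

In a layer-restoration setting, `t` would control how far a drifted layer is pulled back toward its pre-adaptation weights.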
[54] Structured RAG for Answering Aggregative Questions
Omri Koshorek, Niv Granot, Aviv Alloni, Shahar Admati, Roee Hendel, Ido Weiss, Alan Arazi, Shay-Nitzan Cohen, Yonatan Belinkov
Main category: cs.CL
TL;DR: S-RAG is a new approach for aggregative queries that constructs structured corpus representations and translates natural language queries into formal queries, outperforming standard RAG systems and long-context LLMs.
Details
Motivation: Current RAG approaches are limited to queries where only small parts of the corpus are relevant, failing to handle aggregative queries that require gathering and reasoning over information from multiple documents.
Method: S-RAG constructs structured representations of the corpus at ingestion time and translates natural language queries into formal queries over this structured representation at inference time.
Result: Experiments on new datasets (HOTELS and WORLD CUP) and a public benchmark show S-RAG substantially outperforms both common RAG systems and long-context LLMs.
Conclusion: S-RAG effectively addresses the gap in handling aggregative queries and introduces new datasets to promote further research in this area.
Abstract: Retrieval-Augmented Generation (RAG) has become the dominant approach for answering questions over large corpora. However, current datasets and methods are highly focused on cases where only a small part of the corpus (usually a few paragraphs) is relevant per query, and fail to capture the rich world of aggregative queries. These require gathering information from a large set of documents and reasoning over them. To address this gap, we propose S-RAG, an approach specifically designed for such queries. At ingestion time, S-RAG constructs a structured representation of the corpus; at inference time, it translates natural-language queries into formal queries over said representation. To validate our approach and promote further research in this area, we introduce two new datasets of aggregative queries: HOTELS and WORLD CUP. Experiments with S-RAG on the newly introduced datasets, as well as on a public benchmark, demonstrate that it substantially outperforms both common RAG systems and long-context LLMs.
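A minimal sketch of the ingest-then-query idea, assuming a toy hotel-style table (the field names are ours, loosely inspired by the HOTELS setting). In S-RAG an LLM would produce the formal query at inference time; here the translated form of "How many 4+ star hotels are in Taipei?" is hard-coded to show the execution step.

```python
# Ingestion: each document is parsed into one structured row
# (hypothetical schema; field names are illustrative).
corpus = [
    {"name": "Hotel A", "city": "Taipei", "stars": 4, "price": 120},
    {"name": "Hotel B", "city": "Taipei", "stars": 5, "price": 210},
    {"name": "Hotel C", "city": "Tainan", "stars": 3, "price": 80},
]

def run_formal_query(rows: list[dict], city: str, min_stars: int) -> int:
    """Execute the formal (already-translated) aggregative query:
    count rows matching the city and star-rating predicates."""
    return sum(1 for r in rows if r["city"] == city and r["stars"] >= min_stars)
```

The point of the structured representation is that such a count touches every matching document, which paragraph-level retrieval over the raw corpus cannot guarantee.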
[55] Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research
Neelavro Saha, Rafi Shahriyar, Nafis Ashraf Roudra, Saadman Sakib, Annajiat Alim Rasel
Main category: cs.CL
TL;DR: The paper introduces Bangla-SGP, a novel parallel dataset for Bangla Sign Language translation, addressing the low-resource nature of sentence-level BdSL translation by providing 1,000 human-annotated and 3,000 synthetically generated sentence-gloss pairs.
Details
Motivation: Bangla Sign Language translation is a low-resource NLP task with limited large-scale datasets for sentence-level translation, as existing research has been confined to word and alphabet level detection.
Method: Created a dataset of 1,000 human-annotated sentence-gloss pairs augmented with 3,000 synthetic pairs using syntactic/morphological rules via rule-based RAG pipeline. Fine-tuned transformer models (mBart50, Google mT5, GPT4.1-nano) for sentence-to-gloss translation.
Result: Evaluated models using BLEU scores and compared gloss-translation consistency across the Bangla-SGP dataset and RWTH-PHOENIX-2014T benchmark.
Conclusion: The work provides a valuable resource for Bangla Sign Language research and demonstrates the effectiveness of transformer models for sentence-level gloss translation in low-resource settings.
Abstract: Bangla Sign Language (BdSL) translation represents a low-resource NLP task due to the lack of large-scale datasets that address sentence-level translation. Correspondingly, existing research in this field has been limited to word and alphabet level detection. In this work, we introduce Bangla-SGP, a novel parallel dataset consisting of 1,000 human-annotated sentence-gloss pairs, augmented with around 3,000 synthetically generated pairs using syntactic and morphological rules through a rule-based Retrieval-Augmented Generation (RAG) pipeline. The gloss sequences of the spoken Bangla sentences are made up of individual glosses, which are Bangla sign supported words and serve as an intermediate representation for a continuous sign. The 1,000 core sentences are high-quality Bangla sentences, each manually annotated into a gloss sequence by a professional signer. The augmentation process incorporates rule-based linguistic strategies and prompt engineering techniques that we adopted by critically analyzing our human-annotated sentence-gloss pairs and by working closely with our professional signer. Furthermore, we fine-tune several transformer-based models such as mBart50, Google mT5, and GPT4.1-nano and evaluate their sentence-to-gloss translation performance using BLEU scores; based on these evaluation metrics, we compare the models’ gloss-translation consistency across our dataset and the RWTH-PHOENIX-2014T benchmark.
[56] AlphaResearch: Accelerating New Algorithm Discovery with Language Models
Zhaojian Yu, Kaiyue Feng, Yilun Zhao, Shilin He, Xiao-Ping Zhang, Arman Cohan
Main category: cs.CL
TL;DR: AlphaResearch is an autonomous research agent that discovers new algorithms through iterative proposal, verification, and optimization in a dual research environment combining execution-based verification and simulated peer review.
Details
Motivation: Large language models excel at complex but verifiable problems but struggle with discovering unknown algorithms, creating a need for autonomous research agents that can innovate in open-ended problem domains.
Method: Uses a dual research environment with execution-based verification and simulated peer review. Iteratively runs three steps: propose new ideas, verify ideas in the dual environment, and optimize research proposals for better performance.
Result: Achieves a 2/8 win rate against human researchers, with the algorithm discovered for the “packing circles” problem achieving best-of-known performance. Also provides an analysis of the 6/8 failure cases for future research insights.
Conclusion: Demonstrates the possibility of accelerating algorithm discovery with LLMs through autonomous research agents, though challenges remain in fully matching human researcher capabilities across all open-ended problems.
Abstract: Large language models have made significant progress in complex but easy-to-verify problems, yet they still struggle with discovering the unknown. In this paper, we present \textbf{AlphaResearch}, an autonomous research agent designed to discover new algorithms on open-ended problems. To synergize the feasibility and innovation of the discovery process, we construct a novel dual research environment by combining an execution-based verification environment with a simulated real-world peer-review environment. AlphaResearch discovers new algorithms by iteratively running the following steps: (1) propose new ideas, (2) verify the ideas in the dual research environment, and (3) optimize the research proposals for better performance. To promote a transparent evaluation process, we construct \textbf{AlphaResearchComp}, a new evaluation benchmark that includes a competition of eight open-ended algorithmic problems, with each problem carefully curated and verified through executable pipelines, objective metrics, and reproducibility checks. AlphaResearch achieves a 2/8 win rate in head-to-head comparison with human researchers, demonstrating the possibility of accelerating algorithm discovery with LLMs. Notably, the algorithm discovered by AlphaResearch on the \emph{``packing circles’’} problem achieves the best-of-known performance, surpassing the results of human researchers and strong baselines from recent work (e.g., AlphaEvolve). Additionally, we conduct a comprehensive analysis of the remaining challenges of the 6/8 failure cases, providing valuable insights for future research.
[57] Investigating CoT Monitorability in Large Reasoning Models
Shu Yang, Junchao Wu, Xilin Gou, Xuansheng Wu, Derek Wong, Ninhao Liu, Di Wang
Main category: cs.CL
TL;DR: This paper investigates the potential and challenges of using Chain-of-Thought (CoT) reasoning traces from Large Reasoning Models for monitoring model misbehavior, addressing issues of faithfulness in verbalization and monitor reliability.
Details
Motivation: To explore how detailed reasoning traces from Large Reasoning Models can enable AI safety monitoring by detecting misbehavior like shortcuts or sycophancy through chain-of-thought analysis, while addressing fundamental challenges of faithfulness and monitor reliability.
Method: Systematic investigation structured around two perspectives: verbalization quality (faithfulness of reasoning traces) and monitor reliability. Includes empirical evidence across mathematical, scientific, and ethical domains, analysis of CoT intervention methods, and proposes MoME - a paradigm where LLMs monitor other models’ misbehavior through their CoT.
Result: Provides empirical evidence and correlation analyses between verbalization quality, monitor reliability, and LLM performance. Investigates how different CoT intervention methods affect monitoring effectiveness.
Conclusion: Proposes MoME as a new paradigm for AI safety monitoring where LLMs can monitor other models’ misbehavior through their chain-of-thought reasoning, providing structured judgments with supporting evidence.
Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers. Beyond improving abilities, these detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability: monitoring potential model misbehavior, such as the use of shortcuts or sycophancy, through their chain-of-thought (CoT) during decision-making. However, two key fundamental challenges arise when attempting to build more effective monitors through CoT analysis. First, as prior research on CoT faithfulness has pointed out, models do not always truthfully represent their internal decision-making in the generated reasoning. Second, monitors themselves may be either overly sensitive or insufficiently sensitive, and can potentially be deceived by models’ long, elaborate reasoning traces. In this paper, we present the first systematic investigation of the challenges and potential of CoT monitorability. Motivated by two fundamental challenges we mentioned before, we structure our study around two central perspectives: (i) verbalization: to what extent do LRMs faithfully verbalize the true factors guiding their decisions in the CoT, and (ii) monitor reliability: to what extent can misbehavior be reliably detected by a CoT-based monitor? Specifically, we provide empirical evidence and correlation analyses between verbalization quality, monitor reliability, and LLM performance across mathematical, scientific, and ethical domains. Then we further investigate how different CoT intervention methods, designed to improve reasoning efficiency or performance, will affect monitoring effectiveness. Finally, we propose MoME, a new paradigm in which LLMs monitor other models’ misbehavior through their CoT and provide structured judgments along with supporting evidence.
[58] From Semantic Roles to Opinion Roles: SRL Data Extraction for Multi-Task and Transfer Learning in Low-Resource ORL
Amirmohammad Omidi Galdiani, Sepehr Rezaei Melal, Mohammad Norasteh, Arash Yousefi Jordehi, Seyed Abolghasem Mirroshandel
Main category: cs.CL
TL;DR: Methodology for constructing a high-quality SRL dataset from WSJ/OntoNotes 5.0 and adapting it for Opinion Role Labeling tasks using PropBank framework.
Details
Motivation: To create a reusable resource for researchers aiming to leverage Semantic Role Labeling to enhance Opinion Role Labeling, especially in low-resource opinion mining scenarios.
Method: Implemented reproducible extraction pipeline aligning predicate-argument structures with text, converting syntactic tree pointers to spans, and applying rigorous cleaning for semantic fidelity using PropBank annotation framework.
Result: Created dataset with 97,169 predicate-argument instances with Agent (ARG0), Predicate (REL), and Patient (ARG1) roles mapped to ORL’s Holder, Expression, and Target schema.
Conclusion: Provides detailed extraction algorithms, discontinuous argument handling, annotation corrections, and statistical analysis for researchers to use SRL to enhance ORL tasks.
Abstract: This report presents a detailed methodology for constructing a high-quality Semantic Role Labeling (SRL) dataset from the Wall Street Journal (WSJ) portion of the OntoNotes 5.0 corpus and adapting it for Opinion Role Labeling (ORL) tasks. Leveraging the PropBank annotation framework, we implement a reproducible extraction pipeline that aligns predicate-argument structures with surface text, converts syntactic tree pointers to coherent spans, and applies rigorous cleaning to ensure semantic fidelity. The resulting dataset comprises 97,169 predicate-argument instances with clearly defined Agent (ARG0), Predicate (REL), and Patient (ARG1) roles, mapped to ORL’s Holder, Expression, and Target schema. We provide a detailed account of our extraction algorithms, discontinuous argument handling, annotation corrections, and statistical analysis of the resulting dataset. This work offers a reusable resource for researchers aiming to leverage SRL for enhancing ORL, especially in low-resource opinion mining scenarios.
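The stated role mapping (ARG0 → Holder, REL → Expression, ARG1 → Target) is mechanical and easy to sketch; the `relabel` helper and the instance format below are hypothetical, shown only to make the adaptation concrete.

```python
# Role mapping stated in the paper: PropBank SRL roles to the ORL schema.
SRL_TO_ORL = {"ARG0": "Holder", "REL": "Expression", "ARG1": "Target"}

def relabel(instance: dict) -> dict:
    """Relabel an SRL predicate-argument instance with ORL role names,
    leaving the text spans themselves untouched. Unknown roles pass
    through unchanged."""
    return {SRL_TO_ORL.get(role, role): span for role, span in instance.items()}
```

Under this mapping, an SRL instance like "the critics / praised / the film" becomes an ORL-style Holder / Expression / Target triple, which is what makes the dataset usable for transfer to opinion mining.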
[59] Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models
Davi Bastos Costa, Felippe Alves, Renato Vicente
Main category: cs.CL
TL;DR: LLMs show varying moral susceptibility and robustness when role-playing personas, with Claude models being most robust and larger models more susceptible to moral shifts.
Details
Motivation: To understand how LLMs express and shift moral judgments in social contexts through persona role-play, as they increasingly operate in social environments.
Method: Used Moral Foundations Questionnaire (MFQ) to create a benchmark measuring moral susceptibility (variability across personas) and moral robustness (variability within personas) across different LLM families and sizes.
Result: Claude family is most robust, followed by Gemini and GPT-4; larger models within families are more susceptible; robustness and susceptibility are positively correlated, especially at family level.
Conclusion: Persona conditioning systematically shapes moral behavior in LLMs, with model family being the primary factor for robustness and model size affecting susceptibility within families.
Abstract: Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting an LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas, respectively. We find that, for moral robustness, model family accounts for most of the variance, while model size shows no systematic effect. The Claude family is, by a significant margin, the most robust, followed by Gemini and GPT-4 models, with other families exhibiting lower robustness. In contrast, moral susceptibility exhibits a mild family effect but a clear within-family size effect, with larger variants being more susceptible. Moreover, robustness and susceptibility are positively correlated, an association that is more pronounced at the family level. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in large language models.
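One possible operationalization of the two benchmark quantities, assuming a table of MFQ scores per persona across repeated runs; the exact statistics used in the paper may differ, so treat this as an illustrative sketch.

```python
from statistics import mean, pvariance

def susceptibility(persona_scores: dict[str, list[float]]) -> float:
    """Across-persona variability: variance of each persona's mean MFQ
    score. Higher values mean personas pull moral judgments further apart."""
    return pvariance([mean(runs) for runs in persona_scores.values()])

def robustness(persona_scores: dict[str, list[float]]) -> float:
    """Within-persona stability: higher (closer to zero here) when repeated
    runs of the same persona agree. Defined as the negated mean
    within-persona variance (illustrative sign convention)."""
    return -mean(pvariance(runs) for runs in persona_scores.values())
```

With this framing, a model whose "judge" and "rebel" personas give very different mean scores is highly susceptible, while one whose repeated runs under a single persona barely vary is highly robust.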
[60] Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang
Main category: cs.CL
TL;DR: TaH is a dynamic latent thinking method that iterates deeper only at hard tokens using a lightweight neural decider, avoiding overthinking by exempting easy tokens from additional iterations.
Details
Motivation: To address latent overthinking in recurrent transformers where easy tokens that are already correct get revised into errors during additional iterations, and improve reasoning capabilities under parameter constraints.
Method: Uses a neural decider to trigger latent iterations only at likely incorrect tokens, employs LoRA modules for hard-token refinement, and introduces duo-causal attention for cross-iteration information flow while maintaining parallelism.
Result: Boosts reasoning performance across five benchmarks with same parameter count, delivers 8.1-11.3% accuracy gains over baselines while exempting 94% of tokens from second iteration, and achieves 4.0-5.0% gains over single-iteration models.
Conclusion: TaH effectively improves LLM reasoning by dynamically focusing computational resources on hard tokens, preventing overthinking on easy tokens while maintaining efficiency and parameter constraints.
Abstract: Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.
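The hard-token gating can be sketched with a simplified decider. TaH trains a lightweight neural decider; the confidence-threshold rule and function names below are our own stand-ins for it, shown only to illustrate the selective second pass.

```python
def second_pass_mask(token_probs: list[float], threshold: float = 0.9) -> list[bool]:
    """Decide per token whether an extra latent iteration runs.
    token_probs: top-1 probability of each predicted token after the
    standard forward pass; confident tokens skip the extra iteration."""
    return [p < threshold for p in token_probs]

def iterate_selectively(tokens, token_probs, refine, threshold=0.9):
    """Apply the (stand-in) refinement only where the decider fires,
    leaving easy tokens untouched - the fix for latent overthinking."""
    mask = second_pass_mask(token_probs, threshold)
    return [refine(tok) if hard else tok for tok, hard in zip(tokens, mask)]
```

In the paper's numbers, roughly 94% of tokens fall on the "confident, skip" side of this gate, which is where the efficiency gain over always-iterate baselines comes from.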
[61] Training Language Models to Explain Their Own Computations
Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas
Main category: cs.CL
TL;DR: LMs can learn to generate natural language explanations of their internal computations through fine-tuning, and self-explanation works better than using other models.
Details
Motivation: To investigate whether LMs can leverage their privileged access to internal computations to produce faithful explanations of their behavior, complementing existing interpretability methods.
Method: Fine-tune LMs using existing interpretability techniques as ground truth to generate natural language descriptions of feature information, causal structure of activations, and token influence on outputs.
Result: Explainer models show non-trivial generalization with only tens of thousands of training examples, and self-explanation performs better than using different models even if more capable.
Conclusion: LMs can reliably explain their internal computations, offering a scalable complement to existing interpretability methods.
Abstract: Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs’ privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs’ internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models’ privileged access to their own internals: using a model to explain its own computations generally works better than using a different model to explain its computations (even if the other model is significantly more capable). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods.
[62] Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
Sher Badshah, Hassan Sajjad
Main category: cs.CL
TL;DR: A reference-guided verdict method using multiple LLMs as judges for evaluating open-ended generative tasks, showing improved reliability and strong correlation with human evaluations.
Details
Motivation: Conventional metrics like EM and F1 are inadequate for capturing the full semantics and contextual depth of LLM-generated conversations in open-ended tasks.
Method: Propose a reference-guided verdict method that automates evaluation by leveraging multiple LLMs as judges, tested on free-form question-answering tasks.
Result: Combining multiple models improves reliability and accuracy of evaluations, with strong correlation to human evaluations.
Conclusion: The proposed method serves as a reliable alternative to traditional metrics for evaluating LLM-generated content in open-ended tasks.
Abstract: The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, especially in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to traditional metrics.
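A minimal sketch of combining several judges' outputs, assuming each judge emits a binary "correct"/"incorrect" verdict against the reference answer; the majority-vote rule and conservative tie-breaking are our own choices, not necessarily the paper's aggregation scheme.

```python
from collections import Counter

def aggregate_verdicts(verdicts: list[str]) -> str:
    """Majority vote over per-judge verdicts. Ties resolve to
    'incorrect' so the ensemble errs on the strict side (our choice)."""
    counts = Counter(verdicts)
    return "correct" if counts["correct"] > len(verdicts) / 2 else "incorrect"
```

The motivation for the ensemble is exactly the paper's finding: a single judge model may be unreliable on a given question, while agreement across models correlates better with human evaluation.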
[63] Combining LLMs and Knowledge Graphs to Reduce Hallucinations in Question Answering
Larissa Pusch, Tim O. F. Conrad
Main category: cs.CL
TL;DR: A hybrid approach combining LLMs and Knowledge Graphs to reduce hallucinations in biomedical question-answering systems, featuring query validation and a user-friendly interface.
Details
Motivation: Address the hallucination problem in LLMs for critical domains like biomedicine, where accuracy is crucial to prevent dangerous misinformation.
Method: Built on LangChain framework with query checker for syntactical/semantic validation of LLM-generated queries, extracting information from biomedical Knowledge Graphs.
Result: GPT-4 Turbo outperforms other models in generating accurate queries; open-source models like llama3:70b show promise with proper prompt engineering. Evaluated on 50 biomedical questions.
Conclusion: Hybrid LLM-KG approach effectively reduces data gaps and hallucinations, providing reliable and intuitive question-answering for biomedical applications.
Abstract: Advancements in natural language processing have revolutionized the way we can interact with digital information systems, such as databases, making them more accessible. However, challenges persist, especially when accuracy is critical, as in the biomedical domain. A key issue is the hallucination problem, where models generate information unsupported by the underlying data, potentially leading to dangerous misinformation. This paper presents a novel approach designed to bridge this gap by combining Large Language Models (LLM) and Knowledge Graphs (KG) to improve the accuracy and reliability of question-answering systems, on the example of a biomedical KG. Built on the LangChain framework, our method incorporates a query checker that ensures the syntactical and semantic validity of LLM-generated queries, which are then used to extract information from a Knowledge Graph, substantially reducing errors like hallucinations. We evaluated the overall performance using a new benchmark dataset of 50 biomedical questions, testing several LLMs, including GPT-4 Turbo and llama3:70b. Our results indicate that while GPT-4 Turbo outperforms other models in generating accurate queries, open-source models like llama3:70b show promise with appropriate prompt engineering. To make this approach accessible, a user-friendly web-based interface has been developed, allowing users to input natural language queries, view generated and corrected Cypher queries, and verify the resulting paths for accuracy. Overall, this hybrid approach effectively addresses common issues such as data gaps and hallucinations, offering a reliable and intuitive solution for question answering systems. The source code for generating the results of this paper and for the user-interface can be found in our Git repository: https://git.zib.de/lpusch/cyphergenkg-gui
[64] Selection of LLM Fine-Tuning Data based on Orthogonal Rules
Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, Hong Hu
Main category: cs.CL
TL;DR: Proposes a novel rule-based data selection framework using orthogonality metrics and determinantal point process (DPP) to select complementary rules for high-quality training data selection for LLMs.
Details
Motivation: Existing methods for using LLMs to rate and select training data rely heavily on heuristics, lack principled rule evaluation metrics, and generalize poorly to new tasks.
Method: Automated pipeline that: 1) uses LLMs to generate diverse rules covering multiple data quality aspects, 2) rates samples by these rules, 3) applies DPP to select most independent rules, 4) scores full dataset and selects high-scoring samples for downstream tasks.
Result: Experiments across IMDB, Medical, Math, and Code domains show DPP-based rule selection consistently improves both rating accuracy and downstream model performance over strong baselines.
Conclusion: The proposed framework provides a principled approach for selecting complementary rules that enhances data selection quality and downstream LLM performance across diverse domains.
Abstract: High-quality training data is critical to the performance of large language models (LLMs). Recent work has explored using LLMs to rate and select data based on a small set of human-designed criteria (rules), but these approaches often rely heavily on heuristics, lack principled metrics for rule evaluation, and generalize poorly to new tasks. We propose a novel rule-based data selection framework that introduces a metric based on the orthogonality of rule score vectors to evaluate and select complementary rules. Our automated pipeline first uses LLMs to generate diverse rules covering multiple aspects of data quality, then rates samples according to these rules and applies the determinantal point process (DPP) to select the most independent rules. These rules are then used to score the full dataset, and high-scoring samples are selected for downstream tasks such as LLM fine-tuning. We evaluate our framework in two experiment setups: (1) alignment with ground-truth ratings and (2) performance of LLMs fine-tuned on the selected data. Experiments across IMDB, Medical, Math, and Code domains demonstrate that our DPP-based rule selection consistently improves both rating accuracy and downstream model performance over strong baselines.
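The DPP-based rule selection can be approximated greedily. The sketch below repeatedly picks the rule whose score vector is least explained by the rules already chosen (a Gram-Schmidt residual criterion), which is a standard greedy MAP approximation for a Gram-matrix DPP kernel; details such as the normalization are our assumptions, not the paper's exact procedure.

```python
import math

def _normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def select_orthogonal_rules(score_vectors: dict[str, list[float]], k: int) -> list[str]:
    """Greedy DPP-style selection of k rules. Each rule is represented by
    its vector of per-sample scores; at every step we take the rule with
    the largest residual norm, then project that direction out of the
    remaining candidates, so near-duplicate rules are never chosen twice."""
    residual = {name: _normalize(v) for name, v in score_vectors.items()}
    chosen = []
    for _ in range(min(k, len(residual))):
        name = max(residual, key=lambda n: sum(x * x for x in residual[n]))
        d = residual.pop(name)
        chosen.append(name)
        dn = math.sqrt(sum(x * x for x in d))
        if dn == 0:
            continue
        d = [x / dn for x in d]
        for v in residual.values():
            proj = sum(a * b for a, b in zip(d, v))
            for i in range(len(v)):
                v[i] -= proj * d[i]
    return chosen
```

If two LLM-generated rules rate every sample almost identically, the second one has near-zero residual after the first is chosen, so the selection naturally favors complementary quality criteria.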
[65] VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use
Zhehao Zhang, Ryan Rossi, Tong Yu, Franck Dernoncourt, Ruiyi Zhang, Jiuxiang Gu, Sungchul Kim, Xiang Chen, Zichao Wang, Nedim Lipka
Main category: cs.CL
TL;DR: VipAct is a multi-agent framework that enhances vision-language models (VLMs) for fine-grained visual perception tasks by integrating specialized agents and vision expert models to enable more precise visual understanding and comprehensive reasoning.
Details
Motivation: VLMs struggle with fine-grained visual perception tasks requiring detailed pixel-level analysis, and effectively eliciting comprehensive reasoning from VLMs on intricate visual elements remains challenging.
Method: Multi-agent framework with orchestrator agent for task management and specialized agents for specific tasks like image captioning, plus vision expert models for high-precision perceptual information. Combines planning, reasoning, and tool use.
Result: Significant performance improvements over state-of-the-art baselines across diverse visual perception tasks. Ablation studies show multi-agent collaboration enables detailed System-2 reasoning and image input is crucial for task planning.
Conclusion: VipAct provides a flexible, extensible framework that addresses VLMs’ limitations in visual perception and paves the way for more advanced visual perception systems in real-world applications.
Abstract: While vision-language models (VLMs) have demonstrated remarkable performance across various tasks combining textual and visual information, they continue to struggle with fine-grained visual perception tasks that require detailed pixel-level analysis. Effectively eliciting comprehensive reasoning from VLMs on such intricate visual elements remains an open challenge. In this paper, we present VipAct, an agent framework that enhances VLMs by integrating multi-agent collaboration and vision expert models, enabling more precise visual understanding and comprehensive reasoning. VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks such as image captioning and vision expert models that provide high-precision perceptual information. This multi-agent approach allows VLMs to better perform fine-grained visual perception tasks by synergizing planning, reasoning, and tool use. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements over state-of-the-art baselines across all tasks. Furthermore, comprehensive ablation studies reveal the critical role of multi-agent collaboration in eliciting more detailed System-2 reasoning and highlight the importance of image input for task planning. Additionally, our error analysis identifies patterns of VLMs’ inherent limitations in visual perception, providing insights into potential future improvements. VipAct offers a flexible and extensible framework, paving the way for more advanced visual perception systems across various real-world applications.
[66] From Word Vectors to Multimodal Embeddings: Techniques, Applications, and Future Directions For Large Language Models
Charles Zhang, Benji Peng, Xintian Sun, Qian Niu, Junyu Liu, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Ming Liu, Yichao Zhang, Xinyuan Song, Cheng Fei, Caitlyn Heqi Yin, Lawrence KQ Yan, Tianyang Wang
Main category: cs.CL
TL;DR: This survey paper comprehensively reviews the evolution of word embeddings and language models in NLP, covering foundational concepts, static and contextualized embeddings, multimodal applications, and advanced topics like model compression and bias mitigation.
Details
Motivation: To provide a comprehensive overview of embedding-based language models, tracing their evolution from basic representations to advanced contextual models, and to synthesize current methodologies with emerging trends for researchers and practitioners.
Method: The paper conducts a systematic review of embedding techniques, analyzing foundational concepts like distributional hypothesis, examining static embeddings (Word2Vec, GloVe, fastText) and contextualized models (ELMo, BERT, GPT), and exploring applications in sentence/document embeddings and multimodal domains.
Result: The survey synthesizes current methodologies in embedding-based language models, highlighting advancements from sparse to dense representations, the shift to contextual embeddings, and applications across various domains including cross-lingual, personalized, and multimodal settings.
Conclusion: The paper identifies future research directions emphasizing scalable training techniques, enhanced interpretability, and robust grounding in non-textual modalities, providing researchers with a comprehensive resource to advance embedding-based language models.
Abstract: Word embeddings and language models have transformed natural language processing (NLP) by facilitating the representation of linguistic elements in continuous vector spaces. This review visits foundational concepts such as the distributional hypothesis and contextual similarity, tracing the evolution from sparse representations like one-hot encoding to dense embeddings including Word2Vec, GloVe, and fastText. We examine both static and contextualized embeddings, underscoring advancements in models such as ELMo, BERT, and GPT and their adaptations for cross-lingual and personalized applications. The discussion extends to sentence and document embeddings, covering aggregation methods and generative topic models, along with the application of embeddings in multimodal domains, including vision, robotics, and cognitive science. Advanced topics such as model compression, interpretability, numerical encoding, and bias mitigation are analyzed, addressing both technical challenges and ethical implications. Additionally, we identify future research directions, emphasizing the need for scalable training techniques, enhanced interpretability, and robust grounding in non-textual modalities. By synthesizing current methodologies and emerging trends, this survey offers researchers and practitioners an in-depth resource to push the boundaries of embedding-based language models.
[67] Aspect-Oriented Summarization for Psychiatric Short-Term Readmission Prediction
WonJin Yoon, Boyu Ren, Spencer Thomas, Chanhwi Kim, Guergana Savova, Mei-Hua Hall, Timothy Miller
Main category: cs.CL
TL;DR: This paper presents a method that uses multiple aspect-oriented LLM summaries of long documents to improve performance on complex tasks like 30-day readmission prediction, showing better results than single-summary approaches.
Details
Motivation: LLMs have suboptimal zero-shot performance on complex tasks with lengthy documents, and single-summary approaches lose important information. Different aspect-oriented summaries capture different information signals that can be integrated for better performance.
Method: Generate multiple LLM summaries using different aspect-oriented prompts, measure the differences between these summaries, and integrate signals from these different summaries for supervised training of transformer models.
Result: The method was validated on 30-day readmission prediction from psychiatric discharge using real-world data from four hospitals, showing increased prediction performance for this complex patient outcome task.
Conclusion: Using multiple aspect-oriented LLM summaries effectively captures different important aspects of original documents and improves performance on complex tasks compared to single-summary approaches.
Abstract: Recent progress in large language models (LLMs) has enabled the automated processing of lengthy documents even without supervised training on a task-specific dataset. Yet, their zero-shot performance in complex tasks as opposed to straightforward information extraction tasks remains suboptimal. One feasible approach for tasks with lengthy, complex input is to first summarize the document and then apply supervised fine-tuning to the summary. However, the summarization process inevitably results in some loss of information. In this study we present a method for processing the summaries of long documents aimed to capture different important aspects of the original document. We hypothesize that LLM summaries generated with different aspect-oriented prompts contain different information signals, and we propose methods to measure these differences. We introduce approaches to effectively integrate signals from these different summaries for supervised training of transformer models. We validate our hypotheses on a high-impact task – 30-day readmission prediction from a psychiatric discharge – using real-world data from four hospitals, and show that our proposed method increases the prediction performance for the complex task of predicting patient outcome.
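The multi-summary idea above can be sketched as follows: run one summary per aspect prompt and concatenate the summary embeddings into a single feature vector for the supervised classifier. Everything here is a hypothetical stand-in, the aspect names, `summarize` (an LLM call in practice), and `embed` (a transformer encoder in practice) are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

# Hypothetical aspect prompts; the paper's actual prompts are not given here.
ASPECT_PROMPTS = {
    "symptoms": "Summarize the note focusing on presenting symptoms.",
    "medications": "Summarize the note focusing on medications.",
    "disposition": "Summarize the note focusing on discharge plans.",
}

def aspect_features(note, summarize, embed):
    """Concatenate one summary embedding per aspect prompt, giving the
    downstream classifier complementary views of the original document."""
    vecs = [embed(summarize(note, prompt))
            for prompt in ASPECT_PROMPTS.values()]
    return np.concatenate(vecs)
```

In a real setup `summarize` would call an LLM and `embed` a sentence encoder; any injected pair of callables with those shapes works, which also makes the integration step easy to unit-test.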
[68] Thus Spake Long-Context Large Language Model
Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Ziwei He, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
Main category: cs.CL
TL;DR: This survey paper provides a comprehensive overview of long-context Large Language Models (LLMs), analyzing their lifecycle from four perspectives: architecture, infrastructure, training, and evaluation, while drawing analogies to human attempts to transcend mortality.
Details
Motivation: Long context is crucial for LLMs as it enables lifelong learning potential and represents a core competitive advantage, with recent breakthroughs extending context length to millions of tokens.
Method: The survey adopts a systematic approach by examining the full lifecycle of long-context LLMs across four key dimensions: architecture, infrastructure, training, and evaluation technologies.
Result: The paper presents a global picture of long-context LLM technologies and identifies 10 unanswered questions currently facing the field, providing a comprehensive foundation for future research.
Conclusion: Long-context LLMs represent a fundamental struggle between the need for extended context and the reality of finite resources, with ongoing research expanding beyond length extrapolation to encompass broader technological considerations.
Abstract: Long context is an important topic in Natural Language Processing (NLP), running through the development of NLP architectures, and offers immense opportunities for Large Language Models (LLMs), giving LLMs the lifelong learning potential akin to humans. Unfortunately, the pursuit of a long context is accompanied by numerous obstacles. Nevertheless, long context remains a core competitive advantage for LLMs. In the past two years, the context length of LLMs has achieved a breakthrough extension to millions of tokens. Moreover, research on long-context LLMs has expanded beyond length extrapolation to a comprehensive focus on architecture, infrastructure, training, and evaluation technologies. Inspired by the symphonic poem, Thus Spake Zarathustra, we draw an analogy between the journey of extending the context of LLM and the attempts of humans to transcend their mortality. In this survey, we will illustrate how LLM struggles between the tremendous need for a longer context and its equal need to accept the fact that it is ultimately finite. To achieve this, we give a global picture of the lifecycle of long-context LLMs from four perspectives: architecture, infrastructure, training, and evaluation, showcasing the full spectrum of long-context technologies. At the end of this survey, we will present 10 unanswered questions currently faced by long-context LLMs. We hope this survey can serve as a systematic introduction to research on long-context LLMs. Video: https://www.bilibili.com/video/BV11h9AYoEYj. Github: https://github.com/OpenMOSS/Thus-Spake-Long-Context-LLM.
[69] Figurative Archive: an open dataset and web-based application for the study of metaphor
Maddalena Bressler, Veronica Mangiaterra, Paolo Canal, Federico Frau, Fabrizio Luciani, Biagio Scalingi, Chiara Barattieri di San Pietro, Chiara Battaglini, Chiara Pompei, Fortunata Romeo, Luca Bischetti, Valentina Bambini
Main category: cs.CL
TL;DR: The Figurative Archive is an open database of 996 Italian metaphors with rating and corpus-based measures, validated through correlations between familiarity and other metrics.
Details
Motivation: To meet the growing demand for rigorously constructed and extensively normed experimental materials in metaphor research, which provides insights into linguistic and cognitive processes.
Method: Collection of stimuli from 11 studies, including both everyday and literary metaphors varying in structure and semantic domains, enriched with rating and corpus-based measures.
Result: Creation of an open database with 996 Italian metaphors, featuring measures like familiarity, semantic distance, preferred interpretations, and metaphor inclusiveness for non-discriminatory language use.
Conclusion: The Archive provides a valuable resource for sourcing materials in metaphor processing studies and exploring relationships between metaphor features in humans and computational models.
Abstract: Research on metaphor has steadily increased over the last decades, as this phenomenon opens a window into a range of linguistic and cognitive processes. At the same time, the demand for rigorously constructed and extensively normed experimental materials increased as well. Here, we present the Figurative Archive, an open database of 996 metaphors in Italian enriched with rating and corpus-based measures (from familiarity to semantic distance and preferred interpretations), derived by collecting stimuli used across 11 studies. It includes both everyday and literary metaphors, varying in structure and semantic domains, and is validated based on correlations between familiarity and other measures. The Archive has several aspects of novelty: it is increased in size compared to previous resources; it offers a measure of metaphor inclusiveness, to comply with recommendations for non-discriminatory language use; it is displayed in a web-based interface, with features for a customized consultation. We provide guidelines for using the Archive to source materials for studies investigating metaphor processing and relationships between metaphor features in humans and computational models.
[70] CLEV: LLM-Based Evaluation Through Lightweight Efficient Voting for Free-Form Question-Answering
Sher Badshah, Moamen Moustafa, Hassan Sajjad
Main category: cs.CL
TL;DR: CLEV is a lightweight evaluation framework that uses two primary LLM judges and a third tiebreaker only when needed, providing reliable QA assessment with reduced computational costs.
Details
Motivation: Traditional metrics fail to handle semantic equivalence and variability in free-form QA, while current LLM-based evaluators are computationally expensive.
Method: Uses two primary LLMs as judges with a third judge invoked only in cases of disagreement, implementing consensus via lightweight efficient voting.
Result: Demonstrated through experiments and human evaluation that CLEV provides consistent, scalable, and resource-efficient assessments.
Conclusion: CLEV establishes a robust framework for evaluating LLMs on free-form QA by balancing reliability with computational efficiency.
Abstract: Evaluating free-form Question Answering (QA) remains a challenge due to its diverse and open-ended nature. Traditional automatic metrics fail to capture semantic equivalence or accommodate the variability of open-ended responses. Leveraging Large Language Models (LLMs) as evaluators offers a promising alternative due to their strong language understanding and instruction-following capabilities. We propose Consensus via Lightweight Efficient Voting (CLEV), which employs two primary LLMs as judges and invokes a third judge only in cases of disagreement. This approach prioritizes evaluation reliability while reducing unnecessary computational demands. Through experiments, including human evaluation, we demonstrate CLEV’s ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating LLMs on free-form QA.
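The voting protocol described above is simple enough to sketch directly: query two primary judges, and spend a third call only on disagreement. The judge interface (a callable returning a verdict) is an assumption for illustration; in practice each judge would be an LLM prompted to grade the answer.

```python
def clev_judge(question, answer, reference, judges):
    """CLEV-style voting: ask two primary judges; invoke the third
    (tiebreaker) only when the first two disagree."""
    v1 = judges[0](question, answer, reference)
    v2 = judges[1](question, answer, reference)
    if v1 == v2:
        return v1  # consensus reached: the third model is never called
    return judges[2](question, answer, reference)  # tiebreaker
```

The saving is exactly the point of the paper: whenever the two primary judges agree, the framework issues two LLM calls instead of three.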
[71] “Whose Side Are You On?” Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection
Muhammad Haroon, Magdalena Wojcieszak, Anshuman Chhabra
Main category: cs.CL
TL;DR: LLMs with in-context learning outperform traditional methods for political ideology classification of online content, with label-balanced demonstration selection showing significant improvements.
Details
Motivation: Address limitations of existing ideology classification methods that require extensive human effort, large labeled datasets, and cannot adapt to evolving ideological contexts.
Method: Use Large Language Models (LLMs) with in-context learning, employing label-balanced demonstration selection on three datasets of news articles and YouTube videos, and evaluate metadata influence.
Result: Approach significantly outperforms zero-shot and traditional supervised methods; metadata influences classification; source information affects LLM’s political/non-political content classification.
Conclusion: LLMs with in-context learning offer effective solution for political ideology classification, overcoming limitations of traditional approaches and adapting to evolving contexts.
Abstract: The rapid growth of social media platforms has led to concerns about radicalization, filter bubbles, and content bias. Existing approaches to classifying ideology are limited in that they require extensive human effort, the labeling of large datasets, and are not able to adapt to evolving ideological contexts. This paper explores the potential of Large Language Models (LLMs) for classifying the political ideology of online content through in-context learning (ICL). Our extensive experiments involving demonstration selection in label-balanced fashion, conducted on three datasets comprising news articles and YouTube videos, reveal that our approach significantly outperforms zero-shot and traditional supervised methods. Additionally, we evaluate the influence of metadata (e.g., content source and descriptions) on ideological classification and discuss its implications. Finally, we show how providing the source for political and non-political content influences the LLM’s classification.
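The label-balanced demonstration selection mentioned above amounts to sampling an equal number of in-context examples per ideology label, so the prompt itself does not skew the model toward one class. The dictionary-of-examples shape below is an assumption for illustration, not the paper's data format.

```python
from collections import defaultdict

def label_balanced_demos(pool, k_per_label):
    """Pick an equal number of in-context demonstrations per label so
    the prompt does not bias the LLM toward any one ideology class."""
    by_label = defaultdict(list)
    for example in pool:
        by_label[example["label"]].append(example)
    demos = []
    for label, examples in by_label.items():
        demos.extend(examples[:k_per_label])  # first k per label
    return demos
```

A real pipeline would likely sample randomly per label rather than take the first k, but the balancing constraint is the same.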
[72] ENCORE: Entropy-guided Reward Composition for Multi-head Safety Reward Models
Xiaomin Li, Xupeng Chen, Jingxuan Fan, Eric Hanchen Jiang, Mingye Gao
Main category: cs.CL
TL;DR: ENCORE is an entropy-guided method that penalizes high-entropy safety rules in multi-attribute reward modeling for LLM safety alignment, outperforming existing approaches without requiring training.
Details
Motivation: Safety alignment of LLMs often uses RLHF with fine-grained safety rule ratings, but rules with high rating entropy have lower accuracy in distinguishing preferred responses.
Method: Propose ENCORE - entropy-guided multi-head reward composition that penalizes rules with high rating entropy, based on theoretical analysis showing such rules get negligible weights under Bradley-Terry loss.
Result: ENCORE consistently outperforms baselines (random/uniform weighting, single-head Bradley-Terry, LLM-as-a-judge) on RewardBench safety tasks.
Conclusion: ENCORE is training-free, dataset-agnostic, interpretable, and provides a practical effective approach for multi-attribute reward modeling in LLM safety alignment.
Abstract: The safety alignment of large language models (LLMs) often relies on reinforcement learning from human feedback (RLHF), which requires human annotations to construct preference datasets. Given the challenge of assigning overall quality scores to data, recent works increasingly adopt fine-grained ratings based on multiple safety rules. In this paper, we discover a robust phenomenon: Rules with higher rating entropy tend to have lower accuracy in distinguishing human-preferred responses. Exploiting this insight, we propose ENCORE, a simple entropy-guided method to compose multi-head rewards by penalizing rules with high rating entropy. Theoretically, we show that such rules yield negligible weights under the Bradley-Terry loss during weight optimization, naturally justifying their penalization. Empirically, ENCORE consistently outperforms strong baselines, including random and uniform weighting, single-head Bradley-Terry, and LLM-as-a-judge, etc. on RewardBench safety tasks. Our method is completely training-free, generally applicable across datasets, and retains interpretability, making it a practical and effective approach for multi-attribute reward modeling.
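The entropy-guided composition above can be sketched as: estimate each rule's rating entropy from its empirical rating distribution, then down-weight high-entropy rules when combining the per-rule rewards. The `exp(-entropy)` penalty below is one simple illustrative choice; the paper's weights arise from its Bradley-Terry analysis, so treat this as a sketch of the principle rather than ENCORE's exact formula.

```python
import numpy as np

def encore_weights(ratings, num_levels):
    """Weight each safety rule inversely to its rating entropy:
    high-entropy (noisy) rules are penalized toward zero weight."""
    weights = []
    for rule_ratings in ratings:  # one row of integer ratings per rule
        counts = np.bincount(rule_ratings, minlength=num_levels)
        p = counts / counts.sum()
        p = p[p > 0]  # drop empty rating levels (0 * log 0 := 0)
        entropy = -(p * np.log(p)).sum()
        weights.append(np.exp(-entropy))  # assumed penalty form
    w = np.array(weights)
    return w / w.sum()  # normalize to a convex combination

def composed_reward(rule_scores, weights):
    """Final reward: entropy-weighted sum of per-rule scores."""
    return float(np.dot(weights, rule_scores))
```

A rule that rates every sample identically (entropy 0) gets the largest weight, while a rule whose ratings are spread uniformly is penalized, matching the paper's observation that high-entropy rules are the least reliable.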
[73] CONGRAD:Conflicting Gradient Filtering for Multilingual Preference Alignment
Jiangnan Li, Thuy-Trang Vu, Christian Herold, Amirhossein Tebbifakhr, Shahram Khadivi, Gholamreza Haffari
Main category: cs.CL
TL;DR: CONGRAD is a scalable filtering method that selects high-quality preference samples with minimal gradient conflicts across languages for multilingual preference alignment in LLMs, outperforming baselines with minimal alignment tax.
Details
Motivation: Naive joint training of LLMs for multilingual preference alignment suffers from negative interference due to conflicting objectives across languages, which remains underexplored in preference alignment contexts.
Method: Proposes CONGRAD using gradient surgery to select samples aligned with an aggregated multilingual update direction, with sublinear gradient compression to reduce memory overhead during gradient accumulation, integrated into a self-rewarding framework.
Result: CONGRAD consistently outperforms strong baselines on LLaMA3-8B and Gemma2-2B across 10 languages in both seen and unseen languages.
Conclusion: CONGRAD effectively addresses negative interference in multilingual preference alignment through gradient conflict minimization and scalable filtering, achieving superior performance with minimal alignment tax.
Abstract: Naive joint training of large language models (LLMs) for multilingual preference alignment can suffer from negative interference. This is a known issue in multilingual training, where conflicting objectives degrade overall performance. However, the impact of this phenomenon in the context of multilingual preference alignment remains largely underexplored. To address this issue, we propose CONGRAD, a scalable and effective filtering method that selects high-quality preference samples with minimal gradient conflicts across languages. Our method leverages gradient surgery to retain samples aligned with an aggregated multilingual update direction. Additionally, we incorporate a sublinear gradient compression strategy that reduces memory overhead during gradient accumulation. We integrate CONGRAD into self-rewarding framework and evaluate on LLaMA3-8B and Gemma2-2B across 10 languages. Results show that CONGRAD consistently outperforms strong baselines in both seen and unseen languages, with minimal alignment tax.
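The filtering criterion above can be sketched as a dot-product test: aggregate the per-sample gradients into a multilingual update direction and keep the samples whose gradients align with it, discarding the conflicting ones. The mean aggregation and top-k cutoff below are illustrative assumptions; the paper additionally compresses gradients sublinearly, which this sketch omits.

```python
import numpy as np

def congrad_filter(sample_grads, keep_ratio=0.5):
    """Keep preference samples whose gradient aligns with the
    aggregated multilingual update direction; drop conflicting ones."""
    g_agg = sample_grads.mean(axis=0)   # aggregated update direction
    align = sample_grads @ g_agg        # per-sample alignment score
    k = max(1, int(len(sample_grads) * keep_ratio))
    keep = np.argsort(-align)[:k]       # most-aligned samples first
    return sorted(keep.tolist())
```

A sample whose gradient points against the aggregate (negative dot product) is the "conflicting gradient" the method filters out.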
[74] STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Yanqing Liu, Jieru Mei, Brian R. Bartoldson, Bhavya Kailkhura, Cihang Xie
Main category: cs.CL
TL;DR: STAR-1 is a 1k-scale safety dataset for large reasoning models that improves safety performance by 40% with minimal reasoning ability loss (1.1%) through diversity, deliberative reasoning, and rigorous filtering.
Details
Motivation: Address the critical need for safety alignment in large reasoning models (LRMs) by creating a high-quality, specialized safety dataset that can effectively improve safety without significantly compromising reasoning capabilities.
Method: Built on three principles: diversity (integrating existing open-source safety datasets), deliberative reasoning (curating safety policies to generate policy-grounded reasoning samples), and rigorous filtering (using GPT-4o-based safety scoring to select best-practice-aligned training examples).
Result: Fine-tuning LRMs with STAR-1 achieves 40% average improvement in safety performance across four benchmarks, with only 1.1% average decrease in reasoning ability across five reasoning tasks.
Conclusion: STAR-1 effectively enhances safety alignment in LRMs while preserving reasoning capabilities, and ablation studies validate the importance of its design principles for both LRMs and traditional LLMs.
Abstract: This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles – diversity, deliberative reasoning, and rigorous filtering – STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is https://ucsc-vlaa.github.io/STAR-1.
[75] ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
Xiao Wang, Daniil Larionov, Siwei Wu, Yiqi Liu, Steffen Eger, Nafise Sadat Moosavi, Chenghua Lin
Main category: cs.CL
TL;DR: ContrastScore is a contrastive evaluation metric that outperforms existing LLM-based metrics in aligning with human judgments for text generation tasks like machine translation and summarization, while being more efficient and mitigating common evaluation biases.
Details
Motivation: Current LLM-based metrics for natural language generation assessment, especially smaller models, don't align well with human judgments. Conventional reference-based metrics also show weak correlation with human evaluations.
Method: Introduces ContrastScore, a contrastive evaluation metric that enables higher-quality, less biased, and more efficient assessment of generated text through contrastive learning approaches.
Result: ContrastScore consistently achieves stronger correlation with human judgments than single-model and ensemble-based baselines. Even smaller models (Qwen 3B and 0.5B) outperform larger models (Qwen 7B) while using fewer parameters. It effectively mitigates length and likelihood preference biases.
Conclusion: ContrastScore provides a more robust, efficient, and human-aligned automatic evaluation method for natural language generation tasks, demonstrating superior performance over existing metrics while reducing computational requirements.
Abstract: Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.
[76] Evaluating BERTopic on Open-Ended Data: A Case Study with Belgian Dutch Daily Narratives
Ratna Kandala, Niels Vanhasbroeck, Katie Hoemann
Main category: cs.CL
TL;DR: BERTopic outperforms LDA and KMeans for culturally relevant topic modeling in Flemish personal narratives, showing the importance of contextual embeddings and human evaluation.
Details
Motivation: Standard topic models struggle with culturally specific nuances, especially in underrepresented linguistic contexts like Belgian-Dutch (Flemish).
Method: Compared KMeans, LDA, and BERTopic on 25,000 Flemish personal narratives using both automated metrics and human evaluation.
Result: LDA performed well on automated coherence metrics, but BERTopic identified the most coherent and culturally relevant topics in human evaluation. KMeans performed worse than in similar Dutch corpora.
Conclusion: Contextual embeddings are crucial for robust topic modeling, and human-centered evaluation is essential for low-resource languages and culturally specific domains.
Abstract: Standard topic models often struggle to capture culturally specific nuances in text. This study evaluates the effectiveness of contextual embeddings for identifying culturally resonant themes in an underrepresented linguistic context. We compare the performance of KMeans Clustering, Latent Dirichlet Allocation (LDA), and BERTopic on a corpus of nearly 25,000 daily personal narratives written in Belgian-Dutch (Flemish). While LDA achieves strong performance on automated coherence metrics, subsequent human evaluation reveals that BERTopic consistently identifies the most coherent and culturally relevant topics, highlighting the limitations of purely statistical methods on this narrative-rich data. Furthermore, the diminished performance of K-Means compared to prior work on similar Dutch corpora underscores the unique linguistic challenges posed by personal narrative analysis. Our findings demonstrate the critical role of contextual embeddings in robust topic modeling and emphasize the need for human-centered evaluation, particularly when working with low-resource languages and culturally specific domains.
[77] Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
Pengchao Feng, Ziyang Ma, Wenxi Chen, Yao Li, Sheng Wang, Kai Yu, Xie Chen
Main category: cs.CL
TL;DR: Proposes an end-to-end RAG framework for speech-to-speech dialogue systems that directly retrieves textual knowledge from speech queries, bridging the modality gap between speech input and retrieved text.
Details
Motivation: End-to-end S2S systems have advantages in latency and nonverbal cue integration but struggle with incorporating external knowledge due to the modality gap between speech and text.
Method: Novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries without intermediate text conversion.
Result: Significantly improves performance of end-to-end S2S dialogue systems and achieves higher retrieval efficiency, though still lags behind state-of-the-art cascaded models.
Conclusion: The framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems, with code and dataset released for further research.
Abstract: End-to-end speech-to-speech (S2S) dialogue systems have recently garnered increasing research attention for their lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration of information. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind the SOTA cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. Our code and dataset are released.
[78] On the generalization of language models from in-context learning and finetuning: a controlled study
Andrew K. Lampinen, Arslan Chaudhry, Stephanie C. Y. Chan, Cody Wild, Diane Wan, Alex Ku, Jörg Bornschein, Razvan Pascanu, Murray Shanahan, James L. McClelland
Main category: cs.CL
TL;DR: Language models generalize better through in-context learning than fine-tuning for factual reasoning tasks, and adding reasoning traces to fine-tuning data improves generalization.
Details
Motivation: Large language models show poor generalization from fine-tuning, failing at simple logical deductions and relation reversals, which hinders their reasoning capabilities despite in-context learning showing different inductive biases.
Method: Created novel datasets to isolate knowledge from pretraining, tested models with controlled subsets through ICL vs fine-tuning, and proposed adding in-context reasoning traces to fine-tuning data to improve generalization.
Result: ICL generalizes more flexibly than fine-tuning for various inference types, though fine-tuning can handle reversals in larger knowledge structures; adding reasoning traces to fine-tuning improves generalization across datasets.
Conclusion: Different learning modes (ICL vs fine-tuning) afford different generalization capabilities, and incorporating reasoning traces into fine-tuning can practically improve model performance on reasoning tasks.
Abstract: Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from fine-tuning. For example, they can fail to generalize to simple reversals of relations they are trained on, or fail to make simple logical deductions based on trained information. These failures to generalize factual information from fine-tuning can significantly hinder the reasoning capabilities of these models. On the other hand, language models’ in-context learning (ICL) shows different inductive biases and deductive reasoning capabilities. Here, we explore these differences in generalization and deductive reasoning between in-context- and fine-tuning-based learning. To do so, we constructed several novel datasets to evaluate and improve models’ abilities to make generalizations over factual information from novel data. These datasets are designed to create clean tests of generalization, by isolating the knowledge in the dataset from that in pretraining. We expose pretrained large models to controlled subsets of the information in these datasets – either through ICL or fine-tuning – and evaluate their performance on test sets that require various types of generalization. We find overall that in data-matched settings, ICL can generalize several types of inferences more flexibly than fine-tuning (though we also find some qualifications of prior findings, such as cases when fine-tuning can generalize to reversals embedded in a larger structure of knowledge). We build on these findings to propose a method to enable improved generalization from fine-tuning: adding in-context reasoning traces to fine-tuning data. We show that this method improves generalization across various splits of our datasets and other benchmarks. Our results have implications for understanding the generalization afforded by different modes of learning in language models, and practically improving their performance.
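The dataset design described above — train on a relation, test its logical reversal over invented entities so pretraining cannot help — can be sketched as a tiny split builder (the relation templates are illustrative, not the paper's actual ones):

```python
def make_reversal_split(pairs, relation="is the parent of",
                        inverse="is the child of"):
    """Train on 'A rel B' (shown via ICL or fine-tuning) and test on the
    logically entailed 'B inv A'; invented entity names keep the facts
    isolated from anything seen during pretraining."""
    train = [f"{a} {relation} {b}" for a, b in pairs]
    test = [f"{b} {inverse} {a}" for a, b in pairs]
    return train, test
```

A model that generalizes well should answer the test statements correctly despite only ever seeing the forward direction.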
[79] HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization
Chengyu Huang, Zhengxin Zhang, Claire Cardie
Main category: cs.CL
TL;DR: HAPO is a method that uses historical length information to train LLMs to produce more concise reasoning while maintaining accuracy, achieving 33-59% length reduction with minimal accuracy loss.
Details
Motivation: Current test-time scaling methods for LLMs produce verbose outputs and increase inference costs, without leveraging historical problem-solving information to progressively improve conciseness.
Method: HAPO tracks history states (minimum length of previous correct responses) and uses a novel length reward function to incentivize discovering more concise correct solutions, combined with a correctness reward for joint optimization.
Result: HAPO-trained models achieved 33-59% length reduction on math benchmarks with only 2-5% accuracy drops, demonstrating effective induction of concise reasoning abilities.
Conclusion: HAPO successfully enables LLMs to produce significantly more concise reasoning while maintaining high accuracy by leveraging historical length information during training.
Abstract: While scaling the length of responses at test-time has been shown to markedly improve the reasoning abilities and performance of large language models (LLMs), it often results in verbose outputs and increases inference cost. Prior approaches for efficient test-time scaling, typically using universal budget constraints or query-level length optimization, do not leverage historical information from previous encounters with the same problem during training. We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. HAPO employs a novel length reward function based on this history state to incentivize the discovery of correct solutions that are more concise than those previously found. Crucially, this reward structure avoids overly penalizing shorter incorrect responses with the goal of facilitating exploration towards more efficient solutions. By combining this length reward with a correctness reward, HAPO jointly optimizes for correctness and efficiency. We use HAPO to train DeepSeek-R1-Distill-Qwen-1.5B, DeepScaleR-1.5B-Preview, and Qwen-2.5-1.5B-Instruct, and evaluate HAPO on several math benchmarks that span various difficulty levels. Experiment results demonstrate that HAPO effectively induces LLMs’ concise reasoning abilities, producing length reductions of 33-59% with accuracy drops of only 2-5%.
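The abstract describes the reward only qualitatively; a minimal sketch of a history-aware length reward with the stated properties (the functional form and the weight `w` are hypothetical, not HAPO's published formula) might look like:

```python
def hapo_length_reward(length, correct, hist_min, w=0.5):
    """History-aware length reward (hypothetical form).

    hist_min: minimum length among previously generated correct responses
    for this problem, or None if no correct response has been found yet."""
    if hist_min is None:
        # No history yet: reward correctness alone.
        return 1.0 if correct else 0.0
    if correct:
        # Bonus for beating the historical minimum; smooth penalty for
        # correct answers longer than the best seen so far.
        return 1.0 + w * (hist_min - length) / max(hist_min, 1)
    # Incorrect: avoid overly penalizing short attempts, keeping
    # exploration toward more efficient solutions alive.
    return 0.0 if length < hist_min else -w
```

Combined with the correctness term, such a reward makes a shorter correct answer strictly preferable to a longer one.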
[80] UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models
Qizhou Chen, Dakan Wang, Taolin Zhang, Zaoming Yan, Chengsong You, Chengyu Wang, Xiaofeng He
Main category: cs.CL
TL;DR: UniEdit is a unified benchmark for LLM editing that addresses limitations of existing datasets by covering 25 domains across 5 categories using open-domain knowledge graphs, with comprehensive evaluation of ripple effects through a novel sampling algorithm.
Details
Motivation: Current LLM editing datasets are limited to narrow knowledge domains, overlook diverse editing demands, and fail to adequately capture the ripple effects of edits, necessitating a more comprehensive benchmark.
Method: Constructed editing samples from 25 domains using open-domain knowledge graphs, designed Neighborhood Multi-hop Chain Sampling (NMCS) algorithm to sample subgraphs for comprehensive ripple effect evaluation, and used proprietary LLMs to convert knowledge subgraphs into natural language text.
Result: Extensive statistical analysis confirmed the scale, comprehensiveness, and diversity of UniEdit benchmark. Comprehensive experiments across multiple LLMs and editors revealed their performance strengths and weaknesses across open knowledge domains and various evaluation criteria.
Conclusion: UniEdit provides a valuable unified benchmark for LLM editing research, offering insights for future work by addressing the limitations of existing datasets and enabling comprehensive evaluation of editing effects across diverse knowledge domains.
Abstract: Model editing aims to enhance the accuracy and reliability of large language models (LLMs) by efficiently adjusting their internal parameters. Currently, most LLM editing datasets are confined to narrow knowledge domains and cover a limited range of editing evaluation. They often overlook the broad scope of editing demands and the diversity of ripple effects resulting from edits. In this context, we introduce UniEdit, a unified benchmark for LLM editing grounded in open-domain knowledge. First, we construct editing samples by selecting entities from 25 common domains across five major categories, utilizing the extensive triple knowledge available in open-domain knowledge graphs to ensure comprehensive coverage of the knowledge domains. To address the issues of generality and locality in editing, we design a Neighborhood Multi-hop Chain Sampling (NMCS) algorithm to sample subgraphs based on a given knowledge piece, capturing comprehensive ripple effects for evaluation. Finally, we employ proprietary LLMs to convert the sampled knowledge subgraphs into natural language text, guaranteeing grammatical accuracy and syntactical diversity. Extensive statistical analysis confirms the scale, comprehensiveness, and diversity of our UniEdit benchmark. We conduct comprehensive experiments across multiple LLMs and editors, analyzing their performance to highlight strengths and weaknesses in editing across open knowledge domains and various evaluation criteria, thereby offering valuable insights for future research endeavors.
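The ripple-effect idea behind NMCS — expanding multi-hop relation chains outward from the edited fact's neighborhood — can be sketched over toy triples (plain neighborhood expansion only; the published algorithm adds sampling heuristics this sketch omits):

```python
from collections import defaultdict

def multihop_chains(triples, start, hops=2):
    """Enumerate relation chains of up to `hops` steps outward from an
    edited entity; each chain is a candidate ripple-effect probe that an
    edit to `start` may need to propagate through."""
    adj = defaultdict(list)
    for h, r, t in triples:
        adj[h].append((r, t))

    def expand(node, remaining, prefix):
        out = []
        for r, t in adj[node]:
            chain = prefix + ((node, r, t),)
            out.append(chain)
            if remaining > 1:
                out.extend(expand(t, remaining - 1, chain))
        return out

    return expand(start, hops, ())
```

Each sampled chain would then be verbalized into natural-language questions, as the paper does with proprietary LLMs.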
[81] Mixed Signals: Understanding Model Disagreement in Multimodal Empathy Detection
Maya Srikanth, Run Chen, Julia Hirschberg
Main category: cs.CL
TL;DR: Analysis of multimodal empathy detection failures when modalities provide conflicting cues, showing that prediction disagreements often reflect ambiguity and can serve as diagnostic signals for system improvement.
Details
Motivation: To understand why multimodal models fail in empathy detection when different modalities provide conflicting information, and to examine cases where unimodal and multimodal predictions diverge.
Method: Used fine-tuned models for text, audio, and video modalities along with a gated fusion model, analyzed prediction disagreements, and compared with human annotator uncertainty.
Result: Found that prediction disagreements often reflect underlying ambiguity, dominant signals in one modality can mislead fusion when unsupported by others, and humans don’t consistently benefit from multimodal input either.
Conclusion: Disagreement between unimodal and multimodal predictions serves as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.
Abstract: Multimodal models play a key role in empathy detection, but their performance can suffer when modalities provide conflicting cues. To understand these failures, we examine cases where unimodal and multimodal predictions diverge. Using fine-tuned models for text, audio, and video, along with a gated fusion model, we find that such disagreements often reflect underlying ambiguity, as evidenced by annotator uncertainty. Our analysis shows that dominant signals in one modality can mislead fusion when unsupported by others. We also observe that humans, like models, do not consistently benefit from multimodal input. These insights position disagreement as a useful diagnostic signal for identifying challenging examples and improving empathy system robustness.
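As a rough illustration of the gated fusion model mentioned above, a softmax gate can weight per-modality empathy scores; in the actual model both the gate and the scores are learned from the inputs, whereas here they are given:

```python
import math

def gated_fusion(scores, gate_logits):
    """Combine per-modality empathy scores (e.g. text, audio, video) with
    softmax gate weights. A dominant gate logit lets one modality carry the
    prediction, which is exactly how an unsupported dominant signal can
    mislead the fused output."""
    mx = max(gate_logits)
    exps = [math.exp(g - mx) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * s for w, s in zip(weights, scores))
```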
[82] FedSEA-LLaMA: A Secure, Efficient and Adaptive Federated Splitting Framework for Large Language Models
Zishuai Zhang, Hainan Zhang, Weihua Li, Qinnan Zhang, Jin Dong, Yongxin Tong, Zhiming Zheng
Main category: cs.CL
TL;DR: FedSEA-LLaMA is a secure, efficient, and adaptive federated splitting framework for LLaMA2 that addresses privacy, communication overhead, and adaptability challenges in federated LLM training.
Details
Motivation: Private data is valuable for improving LLMs but is scattered across data silos, and traditional federated approaches face security, efficiency, and adaptability limitations when deploying LLMs in federated environments.
Method: The framework uses Gaussian noise injection for secure transmission, attention-mask compression and KV cache collaboration for efficiency, and dynamic partition point adjustment for task-specific adaptability.
Result: Experiments show FedSEA-LLaMA maintains performance comparable to centralized LLaMA2 while achieving up to 8x speedups in training and inference, with effective privacy protection and adaptability.
Conclusion: FedSEA-LLaMA successfully addresses key challenges in federated LLM deployment by providing secure, efficient, and adaptive solutions for transformer-based federated split models.
Abstract: Private data holds promise for improving LLMs due to its high quality, but its scattered distribution across data silos and the high computational demands of LLMs limit their deployment in federated environments. To address this, the transformer-based federated split models are proposed, which offload most model parameters to the server (or distributed clients) while retaining only a small portion on the client to ensure data privacy. Despite this design, they still face three challenges: 1) Peer-to-peer key encryption struggles to secure transmitted vectors effectively; 2) The auto-regressive nature of LLMs means that federated split learning can only train and infer sequentially, causing high communication overhead; 3) Fixed partition points lack adaptability to downstream tasks. In this paper, we introduce FedSEA-LLaMA, a Secure, Efficient, and Adaptive Federated splitting framework based on LLaMA2. First, we inject Gaussian noise into forward-pass hidden states to enable secure end-to-end vector transmission. Second, we employ attention-mask compression and KV cache collaboration to reduce communication costs, accelerating training and inference. Third, we allow users to dynamically adjust the partition points for input/output blocks based on specific task requirements. Experiments on natural language understanding, summarization, and conversational QA tasks show that FedSEA-LLaMA maintains performance comparable to centralized LLaMA2 and achieves up to 8x speedups in training and inference. Further analysis of privacy attacks and different partition points also demonstrates the effectiveness of FedSEA-LLaMA in security and adaptability.
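The first mechanism, Gaussian noise injection into forward-pass hidden states, can be sketched as follows; `sigma` is a hypothetical privacy/utility knob, since the abstract does not specify the noise schedule:

```python
import random

def secure_transmit(hidden, sigma=0.05, rng=None):
    """Add Gaussian noise to forward-pass hidden states before they leave
    the client, making the transmitted activations harder to invert back
    to the private inputs while keeping them usable by the server-side
    layers. Larger sigma trades utility for privacy."""
    rng = rng or random.Random(0)
    return [h + rng.gauss(0.0, sigma) for h in hidden]
```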
[83] Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions
Dillon Plunkett, Adam Morris, Keerthi Reddy, Jorge Morales
Main category: cs.CL
TL;DR: LLMs can accurately describe their internal decision-making processes and be trained to improve this self-explanation capability, which generalizes to other complex decisions.
Details
Motivation: To better understand how LLMs work by investigating their capacity to explain their own internal processes, addressing challenges in neural network interpretability.
Method: Fine-tuned GPT-4o and GPT-4o-mini on complex decision-making tasks with randomly-generated quantitative preferences, then trained them to explain their decision-making processes.
Result: LLMs can accurately report their learned preference weights during decision-making, and this self-explanation capability can be improved through training and generalizes to other complex decisions.
Conclusion: This work advances towards training LLMs to accurately report on their internal processes, which could benefit interpretability, control, and safety of AI systems.
Abstract: We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning to tease out the function of individual neurons and circuits within them. However, another path to understanding these systems is to investigate and develop their capacity to explain their own functioning. Here, we show that i) LLMs can accurately describe quantitative features of their own internal processes during certain kinds of decision-making and ii) that it is possible to improve these capabilities through training. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly-generated, quantitative preferences about how to weigh different attributes (e.g., the relative importance of natural light versus quiet surroundings for condos). We demonstrate that the LLMs can accurately report these preferences (i.e., the weights that they learned to give to different attributes during decision-making). Next, we demonstrate that these LLMs can be fine-tuned to explain their decision-making even more accurately. Finally, we demonstrate that this training generalizes: It improves the ability of the models to accurately explain how they make other complex decisions, not just decisions they have been fine-tuned to make. This work is a step towards training LLMs to accurately and broadly report on their own internal processes – a possibility that would yield substantial benefits for interpretability, control, and safety.
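The decision task can be illustrated with a toy utility model: options are scored by hidden attribute weights, and accurately reporting those weights back is the self-interpretability test (the attribute names here are illustrative, not the paper's actual stimuli):

```python
def choose(options, weights):
    """Pick the option maximizing a linear utility under hidden attribute
    weights (e.g. natural light vs. quiet surroundings for condos). The
    model is fine-tuned to decide this way, then asked to report `weights`."""
    def utility(opt):
        return sum(weights[attr] * value for attr, value in opt.items())
    return max(range(len(options)), key=lambda i: utility(options[i]))
```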
[84] FB-RAG: Improving RAG with Forward and Backward Lookup
Kushal Chawla, Alfy Samuel, Anoop Kumar, Daben Liu
Main category: cs.CL
TL;DR: FB-RAG is a training-free framework that uses a lightweight LLM to peek into potential future generations, using evidence from multiple sampled outputs to precisely identify the most relevant context for a final, more powerful generator, improving performance without complex finetuning.
Details
Motivation: Traditional RAG struggles with complex queries lacking strong retrieval signals, forcing a trade-off between small context that misses information and large context that confuses LLMs.
Method: Forward-looking strategy using a lightweight LLM to sample multiple potential outputs and identify the most relevant context for a final powerful generator, reducing latency with shorter, focused prompts.
Result: Consistent strong performance across 9 datasets from LongBench and ∞Bench; on EN.QA dataset, matches leading baseline with 48% latency reduction or achieves 8% performance improvement with 10% latency reduction.
Conclusion: Smaller LLMs can systematically improve performance and efficiency of larger ones, with forward-looking attempts sufficient to guide final models to accurate responses even when the lightweight LLM fails to generate correct answers.
Abstract: Traditional Retrieval-Augmented Generation (RAG) struggles with complex queries that lack strong signals to retrieve the most relevant context, forcing a trade-off between choosing a small context that misses key information and a large context that confuses the LLM. To address this, we propose Forward-Backward RAG (FB-RAG), a new training-free framework based on a simple yet powerful forward-looking strategy. FB-RAG employs a light-weight LLM to peek into potential future generations, using evidence from multiple sampled outputs to precisely identify the most relevant context for a final, more powerful generator. This improves performance without complex finetuning or Reinforcement Learning common in prior work. Across 9 datasets from LongBench and ∞Bench, FB-RAG consistently delivers strong results. Further, the performance gains can be achieved with reduced latency due to a shorter, more focused prompt for the powerful generator. On the EN.QA dataset, FB-RAG matches the leading baseline with over 48% latency reduction or achieves an 8% performance improvement with a 10% latency reduction. Our analysis finds cases where even when the forward-looking LLM fails to generate correct answers, its attempts are sufficient to guide the final model to an accurate response, demonstrating how smaller LLMs can systematically improve the performance and efficiency of larger ones.
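A lexical stand-in for the forward-looking step: sample draft outputs from a lightweight model, then score each retrieved chunk by token overlap with those drafts and keep the top-k for the powerful generator (token overlap is an assumption here; the paper's actual scoring may differ):

```python
def select_context(chunks, sampled_outputs, k=1):
    """Rank candidate context chunks by word overlap with outputs sampled
    from a lightweight forward-looking model; even imperfect drafts point
    at the evidence the final generator needs."""
    def score(chunk):
        chunk_tokens = set(chunk.lower().split())
        return sum(len(chunk_tokens & set(o.lower().split()))
                   for o in sampled_outputs)
    return sorted(chunks, key=score, reverse=True)[:k]
```

Because only the top-scoring chunks reach the final model, the prompt is shorter, which is the source of the latency reduction reported above.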
[85] p2-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models
Wei Zhou, Mohsen Mesgar, Heike Adel, Annemarie Friedrich
Main category: cs.CL
TL;DR: p2-TQA is a process-based preference learning framework that improves table question answering models through automatic preference data construction and contrastive learning, achieving significant performance gains with high efficiency.
Details
Motivation: Existing TQA approaches under-utilize available data and neglect post-training potential, leaving room for improvement in table question answering systems.
Method: Automatically constructs process-based preference data via a table-specific pipeline, then optimizes models through contrastive learning on the collected data.
Result: Improves TQA models by up to 5% on in-domain and 2.4% on out-of-domain datasets with only 8,000 training instances. Achieves competitive results against larger SOTA systems while maintaining 5x higher efficiency.
Conclusion: p2-TQA effectively enhances TQA models through automated preference learning, demonstrating strong performance improvements and superior efficiency compared to existing approaches.
Abstract: Table question answering (TQA) focuses on answering questions based on tabular data. Developing TQA systems targets effective interaction with tabular data for tasks such as cell retrieval and data analysis. While recent work has leveraged fine-tuning to improve TQA systems, existing approaches often under-utilize available data and neglect the potential of post-training for further gains. In this work, we introduce p2-TQA, a process-based preference learning framework for TQA post-training. p2-TQA automatically constructs process-based preference data via a table-specific pipeline, eliminating the need for manual or costly data collection. It then optimizes models through contrastive learning on the collected data. Experiments show that p2-TQA effectively improves TQA models by up to 5% on in-domain datasets and 2.4% on out-of-domain datasets with only 8,000 training instances. Furthermore, models enhanced with p2-TQA achieve competitive results against larger, more complex state-of-the-art TQA systems, while maintaining up to five times higher efficiency.
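The contrastive step can be sketched as a pairwise logistic (DPO-style) loss over model scores for a preferred versus a dispreferred reasoning process; this is a generic form, as the abstract does not spell out p2-TQA's exact objective:

```python
import math

def preference_loss(chosen_score, rejected_score, beta=1.0):
    """Pairwise logistic preference loss: pushes the model to score the
    preferred reasoning process above the dispreferred one. The loss is
    log(2) at a zero margin and shrinks as the margin grows."""
    margin = beta * (chosen_score - rejected_score)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```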
[86] MIDB: Multilingual Instruction Data Booster for Enhancing Cultural Equality in Multilingual Instruction Synthesis
Yilun Liu, Chunguang Zhao, Xinhua Yang, Hongyong Zeng, Shimin Tao, Weibin Meng, Minggui He, Yan Yu, Hongxia Ma, Li Zhang, Daimeng Wei, Boxing Chen
Main category: cs.CL
TL;DR: MIDB is a multilingual instruction data booster that automatically improves quality of synthesized instruction data by addressing content errors, machine translation defects, and localization issues across 16 languages.
Details
Motivation: Multilingual synthesized instruction data suffers from severe quality issues due to machine translation from English, leading to content errors, MT defects, and insufficient localization, which causes cultural inequality in trained LLMs.
Method: Train MIDB on 36.8k human-revised examples across 16 languages to automatically boost low-quality synthesized data by fixing content errors and MT defects and improving localization.
Result: MIDB steadily improved instruction data quality in 16 languages and significantly enhanced instruction-following and cultural-understanding abilities of multilingual LLMs fine-tuned on the boosted data.
Conclusion: MIDB effectively addresses multilingual data quality issues, leading to improved linguistic and cultural equality in trained LLMs.
Abstract: Despite doubts about data quality, instruction synthesis has been widely applied to instruction tuning (IT) of LLMs as an economical and rapid alternative. Recent endeavors focus on improving data quality for synthesized instruction pairs in English and have facilitated IT of English-centric LLMs. However, data quality issues in multilingual synthesized instruction pairs are even more severe, since the common synthesizing practice is to translate English synthesized data into other languages using machine translation (MT). Besides the known content errors in this English synthesized data, multilingual synthesized instruction data are further exposed to defects introduced by MT and suffer from insufficient localization for the target languages, leading to cultural inequality in trained LLMs. In this paper, we propose MIDB, a Multilingual Instruction Data Booster that automatically addresses the quality issues in multilingual synthesized data. MIDB is trained on around 36.8k revision examples produced by human linguistic experts across 16 languages, and can thereby boost low-quality data by addressing content errors and MT defects and improving localization in the synthesized data. Both automatic and human evaluation indicate that MIDB not only steadily improved instruction data quality in 16 languages, but also significantly enhanced the instruction-following and cultural-understanding abilities of multilingual LLMs fine-tuned on MIDB-boosted data, suggesting improved linguistic and cultural equality.
[87] From Anger to Joy: How Nationality Personas Shape Emotion Attribution in Large Language Models
Mahammed Kamruzzaman, Abdullah Al Monsur, Gene Louis Kim, Anshuman Chhabra
Main category: cs.CL
TL;DR: LLMs exhibit nationality-based emotional stereotypes that don’t align with human responses, particularly for negative emotions, revealing potential biases in AI systems.
Details
Motivation: To investigate whether LLMs show emotional stereotypes when assigned nationality-specific personas and whether these align with cultural norms.
Method: Analyzed pre-trained LLMs by assigning nationality-specific personas and examining emotion attributions, incorporating Hofstede’s cultural dimensions (Power Distance, Uncertainty Avoidance, Long-Term Orientation, Individualism) as interpretive framework.
Result: Found significant nationality-based differences in emotion assignments (shame, fear, joy disproportionately assigned across regions) and notable misalignment between LLM-generated and human emotional responses, especially for negative emotions.
Conclusion: LLMs contain reductive and potentially biased emotional stereotypes that don’t reflect actual cultural norms, highlighting the need for addressing bias in AI systems.
Abstract: Emotions are a fundamental facet of human experience, varying across individuals, cultural contexts, and nationalities. Given the recent success of Large Language Models (LLMs) as role-playing agents, we examine whether LLMs exhibit emotional stereotypes when assigned nationality-specific personas. Specifically, we investigate how different countries are represented in pre-trained LLMs through emotion attributions and whether these attributions align with cultural norms. To provide a deeper interpretive lens, we incorporate four key cultural dimensions, namely Power Distance, Uncertainty Avoidance, Long-Term Orientation, and Individualism, derived from Hofstede’s cross-cultural framework. Our analysis reveals significant nationality-based differences, with emotions such as shame, fear, and joy being disproportionately assigned across regions. Furthermore, we observe notable misalignment between LLM-generated and human emotional responses, particularly for negative emotions, highlighting the presence of reductive and potentially biased stereotypes in LLM outputs.
[88] LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Main category: cs.CL
TL;DR: First systematic investigation of diffusion LLMs’ long-context capabilities, revealing stable perplexity during direct extrapolation and local perception in retrieval tasks. Proposes LongLLaDA method for training-free context extension using NTK-based RoPE extrapolation.
Details
Motivation: Diffusion LLMs have emerged as important NLP models but their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension compared to auto-regressive LLMs.
Method: Systematic comparison of diffusion vs auto-regressive LLMs’ long-context performance. Identifies unique characteristics and proposes LongLLaDA, a training-free method integrating LLaDA with NTK-based RoPE extrapolation.
Result: Diffusion LLMs maintain stable perplexity during direct context extrapolation and exhibit local perception enabling successful retrieval from recent context. Established extrapolation scaling laws remain effective for extending diffusion LLMs’ context windows.
Conclusion: Study establishes first length extrapolation method for diffusion LLMs, provides theoretical insights and empirical benchmarks for advancing long-context diffusion LLM research, identifying tasks where diffusion LLMs outperform auto-regressive models.
Abstract: Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Moreover, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover that diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first length extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs. The code is available at https://github.com/OpenMOSS/LongLLaDA.
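The NTK-based RoPE extrapolation that LongLLaDA builds on rescales the rotary base so the lowest frequency stretches with the context-extension factor while the highest frequency stays fixed. A standard sketch of that rescaling (LongLLaDA's exact recipe may add further details):

```python
def ntk_rope_freqs(dim, scale, base=10000.0):
    """NTK-aware base rescaling for RoPE: multiplying the rotary base by
    scale**(dim / (dim - 2)) stretches the lowest rotary frequency by
    ~1/scale while leaving the highest frequency unchanged, extending the
    usable context window without any training."""
    new_base = base * scale ** (dim / (dim - 2))
    return [1.0 / new_base ** (2 * i / dim) for i in range(dim // 2)]
```

With `scale=1` this reduces to vanilla RoPE frequencies, which is why the method can be applied as a drop-in change at inference time.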
[89] REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing
Kangqi Chen, Andreas Kosmas Kakolyris, Rakesh Nadig, Manos Frouzakis, Nika Mansouri Ghiasi, Yu Liang, Haiyu Mao, Jisung Park, Mohammad Sadrosadati, Onur Mutlu
Main category: cs.CL
TL;DR: REIS is an In-Storage Processing system designed to accelerate the retrieval stage of Retrieval-Augmented Generation (RAG) by addressing data movement bottlenecks in Approximate Nearest Neighbor Search (ANNS) operations.
Details
Motivation: RAG's retrieval stage is a significant bottleneck due to data movement overheads in ANNS operations with large databases. Existing ISP solutions for ANNS are suboptimal as they use non-tailored algorithms, don't accelerate data retrieval, and require major hardware modifications.
Method: REIS introduces three key mechanisms: 1) a database layout linking embeddings to documents for efficient retrieval, 2) ISP-tailored data placement distributing embeddings across storage planes with a lightweight Flash Translation Layer, and 3) leveraging existing computational resources in storage for ANNS.
Result: Compared to server-grade systems, REIS improves retrieval performance by an average of 13x and energy efficiency by 55x.
Conclusion: REIS successfully addresses the limitations of existing ISP approaches for RAG by providing a tailored solution that significantly accelerates the retrieval stage while maintaining hardware compatibility.
Abstract: Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. To overcome this issue, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: indexing, retrieval, and generation. The retrieval stage of RAG becomes a significant bottleneck in inference pipelines. In this stage, a user query is mapped to an embedding vector and an Approximate Nearest Neighbor Search (ANNS) algorithm searches for similar vectors in the database to identify relevant items. Due to the large database sizes, ANNS incurs significant data movement overheads between the host and the storage system. To alleviate these overheads, prior works propose In-Storage Processing (ISP) techniques that accelerate ANNS by performing computations inside storage. However, existing works that leverage ISP for ANNS (i) employ algorithms that are not tailored to ISP systems, (ii) do not accelerate data retrieval operations for data selected by ANNS, and (iii) introduce significant hardware modifications, limiting performance and hindering their adoption. We propose REIS, the first ISP system tailored for RAG that addresses these limitations with three key mechanisms. First, REIS employs a database layout that links database embedding vectors to their associated documents, enabling efficient retrieval. Second, it enables efficient ANNS by introducing an ISP-tailored data placement technique that distributes embeddings across the planes of the storage system and employs a lightweight Flash Translation Layer. Third, REIS leverages an ANNS engine that uses the existing computational resources inside the storage system. Compared to a server-grade system, REIS improves the performance (energy efficiency) of retrieval by an average of 13x (55x).
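The first mechanism — a layout linking each embedding to its document so that a vector hit yields the document without a second host round-trip — can be sketched in miniature; the toy `embed` function and brute-force search stand in for the in-storage ANNS engine:

```python
def build_layout(docs, embed):
    """Store each embedding next to a reference to its source document, so
    a nearest-neighbor hit retrieves the document in the same pass."""
    return [(embed(d), d) for d in docs]

def retrieve(layout, query_vec, k=1):
    """Brute-force inner-product search standing in for the ANNS engine
    running on the storage device's own compute."""
    def similarity(vec):
        return sum(x * y for x, y in zip(vec, query_vec))
    ranked = sorted(layout, key=lambda entry: similarity(entry[0]),
                    reverse=True)
    return [doc for _, doc in ranked[:k]]
```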
[90] ReCode: Updating Code API Knowledge with Reinforcement Learning
Haoze Wu, Yunzhi Yao, Wenhao Yu, Ningyu Zhang
Main category: cs.CL
TL;DR: ReCode is a reinforcement learning framework that improves LLMs’ ability to adapt to API changes by training them on version migration tasks using a modified string similarity metric as reward.
Details
Motivation: LLMs struggle with adapting to frequent API updates due to reliance on outdated training data, hindering reliable code generation in dynamic environments.
Method: Created dataset of 2,000 entries for version migration training, used modified string similarity metric as RL reward, applied GRPO and DAPO algorithms on various LLMs.
Result: ReCode significantly improves code generation in dynamic API scenarios, especially on unseen tasks, with less impact on general coding abilities compared to supervised fine-tuning.
Conclusion: The framework successfully mimics human programmer adaptation to API changes, enabling smaller models to outperform larger specialized models in API update scenarios.
Abstract: Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs’ general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
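A string-similarity reward of the kind ReCode describes can be sketched as below. The paper uses a *modified* string similarity metric whose details are not given here, so `difflib.SequenceMatcher` is a stand-in, and the pandas migration snippets are illustrative.

```python
import difflib

def similarity_reward(generated: str, reference: str) -> float:
    """Stand-in RL reward: sequence similarity in [0, 1] between the
    generated code and the reference migrated code, with full credit
    for an exact match. ReCode's actual metric is a modified variant."""
    if generated == reference:
        return 1.0
    return difflib.SequenceMatcher(None, generated, reference).ratio()

# Illustrative API migration: deprecated call vs. its replacement.
old_call = "df.append(row, ignore_index=True)"
new_call = "pd.concat([df, row], ignore_index=True)"

print(similarity_reward(new_call, new_call))   # exact match → 1.0
print(0.0 < similarity_reward(old_call, new_call) < 1.0)  # partial credit
```

A graded reward like this gives the policy-gradient algorithms (GRPO, DAPO) a denser signal than a binary pass/fail check.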
[91] The Trilemma of Truth in Large Language Models
Germans Savcisens, Tina Eliassi-Rad
Main category: cs.CL
TL;DR: sAwMIL is a new probing framework that uses multiple-instance learning and conformal prediction to classify LLM knowledge as true, false, or neither, revealing flaws in existing methods and asymmetric encoding of truth vs falsehood.
Details
Motivation: Existing methods for probing LLM knowledge have flawed assumptions, and the public often mistakenly attributes human-like knowledge to LLMs when they actually encode probabilistic information from training.
Method: Introduced sAwMIL (Sparse-Aware Multiple-Instance Learning) framework combining multiple-instance learning with conformal prediction, using internal LLM activations to classify statements into three categories: true, false, or neither.
Result: Common probing methods are unreliable and sometimes worse than zero-shot prompting; truth and falsehood are encoded asymmetrically; LLMs encode a third distinct signal beyond true/false; evaluated across 16 LLMs on three curated datasets.
Conclusion: sAwMIL provides a more reliable framework for probing LLM knowledge, revealing fundamental limitations in existing methods and complex encoding patterns in language models.
Abstract: The public often attributes human-like qualities to large language models (LLMs) and assumes they “know” certain things. In reality, LLMs encode information retained during training as internal probabilistic knowledge. This study examines existing methods for probing the veracity of that knowledge and identifies several flawed underlying assumptions. To address these flaws, we introduce sAwMIL (Sparse-Aware Multiple-Instance Learning), a multiclass probing framework that combines multiple-instance learning with conformal prediction. sAwMIL leverages internal activations of LLMs to classify statements as true, false, or neither. We evaluate sAwMIL across 16 open-source LLMs, including default and chat-based variants, on three new curated datasets. Our results show that (1) common probing methods fail to provide a reliable and transferable veracity direction and, in some settings, perform worse than zero-shot prompting; (2) truth and falsehood are not encoded symmetrically; and (3) LLMs encode a third type of signal that is distinct from both true and false.
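The three-way output (true / false / neither) can be illustrated with a toy abstention rule. sAwMIL itself combines multiple-instance learning over internal activations with conformal prediction, so the scalar probe scores and the fixed threshold below are hypothetical simplifications of that machinery.

```python
def classify(score_true: float, score_false: float, threshold: float = 0.8) -> str:
    """Commit to 'true' or 'false' only when the stronger probe score clears
    a calibrated threshold (in conformal prediction this threshold comes
    from a held-out calibration set); otherwise abstain with 'neither'."""
    best, label = max((score_true, "true"), (score_false, "false"))
    return label if best >= threshold else "neither"

print(classify(0.95, 0.05))  # → true
print(classify(0.10, 0.90))  # → false
print(classify(0.55, 0.45))  # → neither
```

The "neither" bucket is what lets the probe report the third signal the paper finds, instead of forcing every statement into true or false.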
[92] Generalizing Verifiable Instruction Following
Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi
Main category: cs.CL
TL;DR: IFBench is a new benchmark for evaluating precise instruction following generalization on 58 diverse constraints, showing current models overfit on existing benchmarks. RLVR training with verifiable rewards significantly improves performance.
Details
Motivation: Current language models struggle with precise instruction following, particularly with output constraints like 'only answer yes/no' or 'mention specific words'. Models overfit on limited benchmark constraints and fail to generalize to unseen constraints.
Method: Created IFBench with 58 new verifiable out-of-domain constraints. Used constraint verification modules and reinforcement learning with verifiable rewards (RLVR) to train models for better instruction following.
Result: RLVR training significantly improves precise instruction following generalization. Models trained with verifiable rewards show better performance on diverse, unseen constraints compared to standard training approaches.
Conclusion: Precise instruction following requires better generalization capabilities. RLVR with constraint verification is an effective approach, and IFBench provides a more comprehensive evaluation framework for this important skill.
Abstract: A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions is output constraints like "only answer with yes or no" or "mention the word 'abrakadabra' at least 3 times" that the user adds to craft a more useful answer. Even today’s strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
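Constraint verifiers of the kind IFBench and RLVR rely on are simple programmatic checks. A sketch for the two constraints quoted in the abstract (the function names are illustrative, and in RLVR the boolean result would become a 0/1 reward):

```python
import re

def verify_yes_no(answer: str) -> bool:
    """Constraint: only answer with 'yes' or 'no'."""
    return answer.strip().lower() in {"yes", "no"}

def verify_mentions(answer: str, word: str, n: int) -> bool:
    """Constraint: mention `word` at least `n` times (whole-word match)."""
    return len(re.findall(rf"\b{re.escape(word)}\b", answer.lower())) >= n

print(verify_yes_no("Yes"))                    # passes
print(verify_yes_no("Yes, definitely."))       # fails: extra text
print(verify_mentions("abrakadabra! abrakadabra abrakadabra",
                      "abrakadabra", 3))       # passes
```

Because each check is deterministic, the reward signal is exactly verifiable, which is what makes RLVR training on these constraints possible.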
[93] Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning
Nhi Hoai Doan, Tatsuya Hiraoka, Kentaro Inui
Main category: cs.CL
TL;DR: This paper examines how LLMs’ repetition pattern recognition affects in-context learning performance, focusing on repetition neurons rather than attention heads, and shows their impact varies by layer depth.
Details
Motivation: Prior work has focused on attention heads for understanding ICL, but this paper investigates the relationship from the perspective of skill neurons, specifically repetition neurons, to better understand how pattern recognition affects learning performance.
Method: The researchers examined repetition neurons in LLMs and compared their effects with induction heads, analyzing how these components impact ICL performance across different layer depths.
Result: Experiments revealed that the impact of repetition neurons on ICL performance varies depending on the depth of the layer where they reside, and strategies were identified to reduce repetitive outputs while maintaining strong ICL capabilities.
Conclusion: The study provides insights into how repetition neurons function differently from induction heads and offers approaches to optimize LLMs by balancing repetition control with in-context learning performance.
Abstract: This paper investigates the relationship between large language models’ (LLMs) ability to recognize repetitive input patterns and their performance on in-context learning (ICL). In contrast to prior work that has primarily focused on attention heads, we examine this relationship from the perspective of skill neurons, specifically repetition neurons. Our experiments reveal that the impact of these neurons on ICL performance varies depending on the depth of the layer in which they reside. By comparing the effects of repetition neurons and induction heads, we further identify strategies for reducing repetitive outputs while maintaining strong ICL capabilities.
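The intervention behind neuron studies like this one — suppress the identified neurons and re-measure behavior — reduces to scaling selected coordinates of a layer's activation vector. A toy sketch (in practice this is done with forward hooks on the model's MLP layers; the vector and neuron indices here are made up):

```python
def ablate(hidden: list[float], neuron_ids: set[int], scale: float = 0.0) -> list[float]:
    """Scale the activations of the identified neurons (0.0 = full ablation,
    values > 1.0 = amplification) and leave all other neurons intact."""
    return [h * scale if i in neuron_ids else h for i, h in enumerate(hidden)]

layer_out = [0.3, -1.2, 0.8, 2.0]      # hypothetical activations at one layer
print(ablate(layer_out, {1, 3}))       # repetition neurons 1 and 3 silenced
```

Comparing task metrics before and after such an ablation, layer by layer, is how a depth-dependent effect like the one reported here is measured.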
[94] ControlMed: Adding Reasoning Control to Medical Language Model
Sung-Min Lee, Siyoon Lee, Juyeon Kim, Kyoungmin Roh
Main category: cs.CL
TL;DR: ControlMed is a medical language model that allows users to control reasoning length at inference time, achieving similar/better performance than SOTA models while reducing computational overhead.
Details
Motivation: Existing reasoning LLMs generate unnecessarily long reasoning processes in medical domains, causing computational overhead and latency that hinder practical clinical deployment.
Method: Three-stage training: 1) Pre-training on large-scale synthetic medical instruction data, 2) Supervised fine-tuning with multi-length reasoning data and length-control markers, 3) Reinforcement learning with model-based rewards for accuracy and quality.
Result: Achieves similar or better performance on English and Korean medical benchmarks compared to state-of-the-art models, with flexible control over reasoning length.
Conclusion: ControlMed provides a practical and adaptable solution for clinical question answering that balances accuracy and computational efficiency through user-controlled reasoning length.
Abstract: Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support. Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. These limitations hinder their practical deployment in real-world clinical environments. To address these challenges, we introduce \textbf{ControlMed}, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers. ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both \textit{direct} and \textit{reasoning responses}; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality. Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed. These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis.
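Inference-time length control of this kind amounts to choosing a control marker when building the prompt; the model has been fine-tuned to match its reasoning length to the marker. A sketch in which the marker strings and budget names are invented for illustration, not ControlMed's actual tokens:

```python
def build_prompt(question: str, budget: str) -> str:
    """Prepend a reasoning-length control marker chosen by the user.
    Marker strings below are hypothetical stand-ins for the fine-grained
    control markers the model was fine-tuned on."""
    markers = {
        "off": "[reason: off]",      # direct answer, no reasoning chain
        "short": "[reason: short]",  # brief reasoning
        "long": "[reason: long]",    # full reasoning chain
    }
    return f"{markers[budget]} {question}"

print(build_prompt("Is aspirin an NSAID?", "short"))
```

The trade-off the paper reports — accuracy versus latency — is then a per-query choice made by whoever sets the marker.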
[95] When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
Jin Li, Keyu Wang, Shu Yang, Zhuoran Zhang, Di Wang
Main category: cs.CL
TL;DR: LLMs exhibit sycophantic behavior by agreeing with user opinions even when contradictory to facts, emerging through late-layer preference shifts and deeper representational divergence rather than surface-level artifacts.
Details
Motivation: To understand the internal mechanisms behind LLMs' sycophantic behavior, as prior work documented the tendency but lacked mechanistic explanations.
Method: Systematically studied opinion induction across model families, used logit-lens analysis and causal activation patching to identify emergence stages, and examined grammatical perspective effects.
Result: Simple opinion statements reliably induce sycophancy; user expertise framing has negligible impact; first-person prompts create stronger representational perturbations than third-person; user authority isn’t internally encoded.
Conclusion: Sycophancy emerges from structural override of learned knowledge in deeper layers, not surface-level artifacts, with implications for alignment and truthful AI systems.
Abstract: Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts ("I believe...") consistently induce higher sycophancy rates than third-person framings ("They believe...") by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
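A logit-lens readout, one of the two analysis tools the abstract names, projects an intermediate hidden state through the unembedding matrix to see which token the model "prefers" at that layer. A toy two-token sketch (all vectors and token labels are illustrative) of how a late-layer preference shift would show up:

```python
def logit_lens(hidden: list[float], unembed: dict[str, list[float]]) -> str:
    """Project an intermediate hidden state onto each token's unembedding
    vector and return the token with the highest resulting logit."""
    scores = {tok: sum(h * w for h, w in zip(hidden, vec))
              for tok, vec in unembed.items()}
    return max(scores, key=scores.get)

# Hypothetical unembedding rows for the factual vs. sycophantic answer.
unembed = {"fact": [1.0, 0.0], "agree": [0.0, 1.0]}

print(logit_lens([0.9, 0.2], unembed))  # mid-layer state: still prefers the fact
print(logit_lens([0.1, 0.8], unembed))  # late-layer state: preference has shifted
```

Running this readout at every layer is what localizes the shift to the late layers, as the paper reports.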
[96] Isolating Culture Neurons in Multilingual Large Language Models
Danial Namazifard, Lukas Galke Poech
Main category: cs.CL
TL;DR: This paper identifies and isolates culture-specific neurons in multilingual LLMs, showing they encode different cultures in distinct neuron populations that can be selectively edited.
Details
Motivation: To understand how multilingual large language models encode cultural information and disentangle it from language-specific encoding.
Method: Built on language-specific neuron identification methodology, introduced MUREL dataset (85.2M tokens across 6 cultures), conducted localization and intervention experiments on culture-specific neurons.
Result: LLMs encode different cultures in distinct neuron populations (mainly upper layers), culture neurons can be modulated independently of language-specific neurons or other culture neurons.
Conclusion: Cultural knowledge in multilingual LLMs can be selectively isolated and edited, with implications for fairness, inclusivity, and alignment.
Abstract: Language and culture are deeply intertwined, yet it has been unclear how and where multilingual large language models encode culture. Here, we build on an established methodology for identifying language-specific neurons to localize and isolate culture-specific neurons, carefully disentangling their overlap and interaction with language-specific neurons. To facilitate our experiments, we introduce MUREL, a curated dataset of 85.2 million tokens spanning six different cultures. Our localization and intervention experiments show that LLMs encode different cultures in distinct neuron populations, predominantly in upper layers, and that these culture neurons can be modulated largely independently of language-specific neurons or those specific to other cultures. These findings suggest that cultural knowledge and propensities in multilingual language models can be selectively isolated and edited, with implications for fairness, inclusivity, and alignment. Code and data are available at https://github.com/namazifard/Culture_Neurons.
[97] Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study
Mahdi Dhaini, Juraj Vladika, Ege Erdogan, Zineb Attaoui, Gjergji Kasneci
Main category: cs.CL
TL;DR: Automated framework using multiple LLMs to generate high-quality textual explanations, showing competitive effectiveness compared to human annotations for improving model performance in NLP tasks.
Details
Motivation: Human annotation for textual explanations is costly and limits scalability in Explainable NLP, creating need for automated approaches.
Method: Leverage multiple state-of-the-art LLMs to generate explanations, evaluate with NLG metrics, and test impact on PLMs and LLMs across natural language inference tasks on benchmark datasets.
Result: LLM-generated explanations demonstrate highly competitive effectiveness compared to human-annotated explanations in improving model performance.
Conclusion: Automated LLM-based textual explanation generation offers a promising scalable approach for extending NLP datasets and enhancing model performance.
Abstract: In the rapidly evolving field of Explainable Natural Language Processing (NLP), textual explanations, i.e., human-like rationales, are pivotal for explaining model predictions and enriching datasets with interpretable labels. Traditional approaches rely on human annotation, which is costly, labor-intensive, and impedes scalability. In this work, we present an automated framework that leverages multiple state-of-the-art large language models (LLMs) to generate high-quality textual explanations. We rigorously assess the quality of these LLM-generated explanations using a comprehensive suite of Natural Language Generation (NLG) metrics. Furthermore, we investigate the downstream impact of these explanations on the performance of pre-trained language models (PLMs) and LLMs across natural language inference tasks on two diverse benchmark datasets. Our experiments demonstrate that automated explanations exhibit highly competitive effectiveness compared to human-annotated explanations in improving model performance. Our findings underscore a promising avenue for scalable, automated LLM-based textual explanation generation for extending NLP datasets and enhancing model performance.
[98] ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models
Biddut Sarker Bijoy, Mohammad Saqib Hasan, Pegah Alipoormolabashi, Avirup Sil, Aruna Balasubramanian, Niranjan Balasubramanian
Main category: cs.CL
TL;DR: Multi-agent systems with smaller language models (SLMs) can be more efficient than single LLM systems, but SLMs struggle with long-trajectory learning. A progressive sub-task training strategy improves multi-agent performance, yielding better effectiveness-efficiency trade-offs.
Details
Motivation: To compare the effectiveness and efficiency of multi-agent SLM systems versus single-agent LLM systems for complex problems, and address SLMs' limitations in long-trajectory learning.
Method: Instantiated single and multi-agent systems in AppWorld environment using different model sizes, and introduced progressive sub-task training that introduces new sub-tasks progressively each epoch.
Result: Progressive training consistently improved multi-agent effectiveness across configurations. Fine-tuned multi-agent systems achieved better effectiveness-efficiency trade-offs and reduced subtask error rates.
Conclusion: Multi-agent SLM systems with progressive training offer superior effectiveness-efficiency trade-offs compared to single-agent LLM systems for complex problems.
Abstract: Multi-agent systems with smaller language models (SLMs) present a viable alternative to single agent systems powered by large language models (LLMs) for addressing complex problems. In this work, we study how these alternatives compare in terms of both effectiveness and efficiency. To study this trade-off, we instantiate single and multi-agent systems for the complex problems in the AppWorld environment using different sized language models. We find that difficulties with long-trajectory learning in smaller language models (SLMs) limit their performance. Even when trained for specialized roles, SLMs fail to learn all subtasks effectively. To address this issue, we introduce a simple progressive sub-task training strategy, which introduces new sub-tasks progressively in each training epoch. We find that this novel strategy, analogous to instance level curriculum learning, consistently improves the effectiveness of multi-agents at all configurations. Our Pareto analysis shows that fine-tuned multi-agent systems yield better effectiveness-efficiency trade-offs. Additional ablations and analyses shows the importance of our progressive training strategy and its ability to reduce subtask error rates.
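The progressive schedule is the core idea: train on a growing prefix of the sub-task list, adding one more sub-task each epoch. A minimal sketch (the sub-task names and the exact pacing are hypothetical; ProST's actual ordering and per-epoch increments may differ):

```python
def progressive_schedule(subtasks: list[str], epoch: int) -> list[str]:
    """Epoch 1 trains on the first sub-task only; each later epoch adds
    the next sub-task, until the whole list is included."""
    return subtasks[: min(epoch, len(subtasks))]

subtasks = ["login", "search", "checkout"]  # illustrative AppWorld-style steps
for epoch in range(1, 5):
    print(epoch, progressive_schedule(subtasks, epoch))
```

This mirrors instance-level curriculum learning: the SLM masters early sub-tasks before the trajectory length grows, which is what the abstract identifies as the failure mode of training on full trajectories at once.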
[99] MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder-LLM Integration in Cross-Lingual Reasoning
Kosei Uemura, David Guzmán, Quang Phuoc Nguyen, Jesujoba Oluwadara Alabi, En-shiun Annie Lee, David Ifeoluwa Adelani
Main category: cs.CL
TL;DR: MERLIN is a two-stage model-stacking framework that uses curriculum learning and DoRA weight adaptation to improve reasoning in low-resource languages, achieving significant accuracy gains over existing methods.
Details
Motivation: Large language models struggle with complex reasoning in low-resource languages, and existing methods like LangBridge and MindMerger leave a large performance gap on these languages.
Method: Two-stage model-stacking framework with curriculum learning (from general bilingual bitext to task-specific data) and adaptation of only a small set of DoRA weights.
Result: +12.9 pp improvement over MindMerger on AfriMGSM benchmark, outperforms GPT-4o-mini, and consistent gains on MGSM (+0.9 pp) and MSVAMP (+2.8 pp) across both low and high-resource settings.
Conclusion: MERLIN effectively improves reasoning capabilities in low-resource languages while maintaining performance in high-resource settings, demonstrating broad applicability.
Abstract: Large language models excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger raise accuracy on mid and high-resource languages, yet they leave a large gap on LRLs. We present MERLIN, a two-stage model-stacking framework that applies a curriculum learning strategy – from general bilingual bitext to task-specific data – and adapts only a small set of DoRA weights. On the AfriMGSM benchmark MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low and high-resource settings.
[100] Trust Me, I Can Convince You: The Contextualized Argument Appraisal Framework
Lynn Greschner, Sabine Weber, Roman Klinger
Main category: cs.CL
TL;DR: Proposes a Contextualized Argument Appraisal Framework to model how subjective evaluations of arguments’ personal impact influence emotions and convincingness, beyond just argument content.
Details
Motivation: Current research only studies binary emotionality of arguments, but cognitive appraisal models haven't been applied to argument convincingness despite evidence that personal impact evaluations affect emotional responses.
Method: Adapted psychological appraisal models to argument mining, developed role-playing annotation setup with 4000 annotations, collected demographic/personality data for both participants and perceived senders.
Result: Analysis shows convincingness positively correlates with positive emotions (trust) and negatively with negative emotions (anger), with familiarity being a key appraisal variable.
Conclusion: The framework successfully models argument-receiver-sender interplay, demonstrating that subjective appraisals significantly influence argument convincingness beyond content alone.
Abstract: Emotions that somebody develops based on an argument do not only depend on the argument itself - they are also influenced by a subjective evaluation of the argument’s potential impact on the self. For instance, an argument to ban plastic bottles might cause fear of losing a job for a bottle industry worker, which lowers the convincingness - presumably independent of its content. While binary emotionality of arguments has been studied, such cognitive appraisal models have only been proposed in other subtasks of emotion analysis, but not in the context of arguments and their convincingness. To fill this research gap, we propose the Contextualized Argument Appraisal Framework to model the interplay between the sender, receiver, and argument. We adapt established appraisal models from psychology to argument mining, including argument pleasantness, familiarity, response urgency, and expected effort, as well as convincingness variables. To evaluate the framework and pave the way for computational modeling, we develop a novel role-playing-based annotation setup, mimicking real-world exposure to arguments. Participants disclose their emotion, explain the main cause, the argument appraisal, and the perceived convincingness. To consider the subjective nature of such annotations, we also collect demographic data and personality traits of both the participants and ask them to disclose the same variables for their perception of the argument sender. The analysis of the resulting ContArgA corpus of 4000 annotations reveals that convincingness is positively correlated with positive emotions (e.g., trust) and negatively correlated with negative emotions (e.g., anger). The appraisal variables particularly point to the importance of the annotator’s familiarity with the argument.
[101] MENLO: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi, Janice Lam, Nicolò Busetto, Denise Diaz, Francisco Guzmán
Main category: cs.CL
TL;DR: MENLO is a framework for evaluating native-like quality of LLM responses across languages using audience design principles, with a dataset of 6,423 human-annotated preference pairs in 47 languages.
Details
Motivation: Ensuring native-like quality of LLM responses across many languages is challenging, requiring systematic evaluation frameworks.
Method: Created MENLO framework with human-annotated dataset, evaluated zero-shot LLM judges, and improved through fine-tuning with reinforcement learning, reward shaping, and multi-task learning.
Result: Zero-shot LLM judges benefit from pairwise evaluation and structured rubrics but underperform humans. RL-trained judges can enhance LLMs’ multilingual proficiency but still show discrepancies with human judgment.
Conclusion: Promising directions for scalable multilingual evaluation and preference alignment, with dataset and framework released for further research.
Abstract: Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs’ multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.
[102] Epistemic Diversity and Knowledge Collapse in Large Language Models
Dustin Wright, Sarah Masud, Jared Moore, Srishti Yadav, Maria Antoniak, Peter Ebert Christensen, Chan Young Park, Isabelle Augenstein
Main category: cs.CL
TL;DR: LLMs generate homogenous texts risking knowledge collapse. New methodology measures epistemic diversity across 27 LLMs, 155 topics, 12 countries. Newer models show more diversity but less than web search. Model size reduces diversity, RAG helps but varies by culture. Claims reflect English bias over local languages.
Details
Motivation: Address the risk of knowledge collapse where homogenous LLM outputs shrink accessible information range over time, overcoming limitations of existing homogenization studies that focus on closed-ended setups or fuzzy semantic features without temporal or cultural analysis.
Method: Developed new methodology to measure epistemic diversity (variation in real-world claims) and conducted broad empirical study testing 27 LLMs across 155 topics covering 12 countries with 200 prompt variations from real user chats.
Result: Newer models generate more diverse claims but nearly all models are less epistemically diverse than basic web search. Model size negatively impacts diversity, RAG positively impacts diversity but improvement varies by cultural context. Country-specific claims reflect English language more than local languages.
Conclusion: LLMs exhibit epistemic homogenization with English bias, highlighting gaps in epistemic representation. While newer models and RAG show improvements, significant diversity gaps persist compared to traditional knowledge sources like web search and Wikipedia.
Abstract: Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation.
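The notion of epistemic diversity — variation in the real-world claims across sampled outputs — can be illustrated with a toy ratio of distinct to total claims. The paper's actual measure is richer (claims are extracted and matched from free text); this only shows the underlying idea:

```python
def epistemic_diversity(samples: list[set[str]]) -> float:
    """Toy diversity score: distinct claims across all sampled outputs
    divided by total claims emitted. 1.0 means every sample contributed
    only new claims; low values mean the model repeats itself."""
    all_claims = [claim for sample in samples for claim in sample]
    return len(set(all_claims)) / len(all_claims)

# Three hypothetical generations for the same prompt, as claim sets.
runs = [{"A", "B"}, {"A", "B"}, {"A", "C"}]
print(epistemic_diversity(runs))  # → 0.5
```

Under a score like this, a homogenizing model converges toward a small fixed claim set, which is the knowledge-collapse risk the abstract describes.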
[103] TraceCoder: Towards Traceable ICD Coding via Multi-Source Knowledge Integration
Mucheng Ren, He Chen, Yuchen Yan, Danqing Hu, Jun Xu, Xian Zeng
Main category: cs.CL
TL;DR: TraceCoder is a novel framework that integrates multi-source external knowledge (UMLS, Wikipedia, LLMs) to enhance automated ICD coding by addressing semantic gaps, improving rare code performance, and increasing interpretability through traceable evidence.
Details
Motivation: Existing automated ICD coding methods face challenges with semantic gaps between clinical text and codes, poor performance on rare/long-tail codes, and limited interpretability, which hinders clinical adoption.
Method: Proposes TraceCoder framework that dynamically incorporates diverse knowledge sources (UMLS, Wikipedia, LLMs) to enrich code representations and uses a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge.
Result: Experiments on MIMIC-III-ICD9, MIMIC-IV-ICD9, and MIMIC-IV-ICD10 datasets show state-of-the-art performance, with ablation studies validating the effectiveness of the framework components.
Conclusion: TraceCoder provides a scalable and robust solution for automated ICD coding that meets clinical needs for accuracy, interpretability, and reliability by grounding predictions in external evidence.
Abstract: Automated International Classification of Diseases (ICD) coding assigns standardized diagnosis and procedure codes to clinical records, playing a critical role in healthcare systems. However, existing methods face challenges such as semantic gaps between clinical text and ICD codes, poor performance on rare and long-tail codes, and limited interpretability. To address these issues, we propose TraceCoder, a novel framework integrating multi-source external knowledge to enhance traceability and explainability in ICD coding. TraceCoder dynamically incorporates diverse knowledge sources, including UMLS, Wikipedia, and large language models (LLMs), to enrich code representations, bridge semantic gaps, and handle rare and ambiguous codes. It also introduces a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge, improving long-tail code recognition and making predictions interpretable by grounding them in external evidence. Experiments on MIMIC-III-ICD9, MIMIC-IV-ICD9, and MIMIC-IV-ICD10 datasets demonstrate that TraceCoder achieves state-of-the-art performance, with ablation studies validating the effectiveness of its components. TraceCoder offers a scalable and robust solution for automated ICD coding, aligning with clinical needs for accuracy, interpretability, and reliability.
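A rough sketch of the hybrid-attention idea: label (code) embeddings attend jointly over clinical token representations and external knowledge vectors, yielding a label-specific context used for prediction. The plain scaled dot-product form and the shapes here are assumptions, not the paper's exact mechanism.

```python
import numpy as np

def labelwise_attention(tokens, labels, knowledge):
    """For each candidate ICD code, attend over clinical tokens concatenated
    with external knowledge vectors, returning a label-specific context."""
    keys = np.concatenate([tokens, knowledge], axis=0)    # (T+K, d)
    scores = labels @ keys.T / np.sqrt(labels.shape[1])   # (L, T+K)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)         # row-wise softmax
    return weights @ keys                                 # (L, d)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))      # 6 clinical token vectors
labels = rng.normal(size=(3, 8))      # 3 candidate code embeddings
knowledge = rng.normal(size=(2, 8))   # 2 knowledge vectors (e.g. from UMLS/Wikipedia)
ctx = labelwise_attention(tokens, labels, knowledge)      # (3, 8)
```

Because the attention weights cover knowledge vectors as well as tokens, inspecting them would show which external evidence a prediction leaned on, which is the traceability angle.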
[104] TACL: Threshold-Adaptive Curriculum Learning Strategy for Enhancing Medical Text Understanding
Mucheng Ren, Yucheng Yan, He Chen, Danqing Hu, Jun Xu, Xian Zeng
Main category: cs.CL
TL;DR: TACL is a threshold-adaptive curriculum learning framework that dynamically adjusts training based on medical text complexity, improving performance on clinical NLP tasks like ICD coding and readmission prediction.
Details
Motivation: Medical texts like EMRs are unstructured and domain-specific, making automated understanding challenging. Existing methods treat all data equally, ignoring complexity differences, which limits model generalization on rare/complex cases.
Method: TACL categorizes data into difficulty levels and uses progressive learning, prioritizing simpler cases early in training to build a foundation before tackling complex records. Applied to multilingual medical data (English and Chinese).
Result: Significant improvements across diverse clinical tasks including automatic ICD coding, readmission prediction, and TCM syndrome differentiation.
Conclusion: TACL enhances automated system performance and demonstrates potential to unify approaches across medical domains, enabling more accurate and globally applicable medical text understanding.
Abstract: Medical texts, particularly electronic medical records (EMRs), are a cornerstone of modern healthcare, capturing critical information about patient care, diagnoses, and treatments. These texts hold immense potential for advancing clinical decision-making and healthcare analytics. However, their unstructured nature, domain-specific language, and variability across contexts make automated understanding an intricate challenge. Despite the advancements in natural language processing, existing methods often treat all data as equally challenging, ignoring the inherent differences in complexity across clinical records. This oversight limits the ability of models to effectively generalize and perform well on rare or complex cases. In this paper, we present TACL (Threshold-Adaptive Curriculum Learning), a novel framework designed to address these challenges by rethinking how models interact with medical texts during training. Inspired by the principle of progressive learning, TACL dynamically adjusts the training process based on the complexity of individual samples. By categorizing data into difficulty levels and prioritizing simpler cases early in training, the model builds a strong foundation before tackling more complex records. By applying TACL to multilingual medical data, including English and Chinese clinical records, we observe significant improvements across diverse clinical tasks, including automatic ICD coding, readmission prediction and TCM syndrome differentiation. TACL not only enhances the performance of automated systems but also demonstrates the potential to unify approaches across disparate medical domains, paving the way for more accurate, scalable, and globally applicable medical text understanding solutions.
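The curriculum idea (easy records first, harder ones admitted as a threshold grows) can be sketched as a simple scheduler. The per-record difficulty scores and threshold schedule below are hypothetical; the paper's threshold adaptation is learned, not fixed.

```python
def curriculum_batches(samples, difficulties, thresholds):
    """Per epoch, yield the subset of samples whose difficulty is at or below
    that epoch's threshold, so training sees easy records first."""
    for t in thresholds:
        yield [s for s, d in zip(samples, difficulties) if d <= t]

samples = ["note_a", "note_b", "note_c", "note_d"]
diffs   = [0.1, 0.4, 0.7, 0.9]                  # hypothetical difficulty scores
epochs  = list(curriculum_batches(samples, diffs, thresholds=[0.3, 0.6, 1.0]))
# epoch 1: easy only; epoch 3: the full dataset
```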
[105] Language over Content: Tracing Cultural Understanding in Multilingual Large Language Models
Seungho Cho, Changgeon Ko, Eui Jun Hwang, Junmyeong Lee, Huije Lee, Jong C. Park
Main category: cs.CL
TL;DR: The paper analyzes LLMs’ internal cultural understanding mechanisms by measuring activation path overlaps across different language and country conditions, revealing strong language-specific patterns and that linguistic similarity doesn’t guarantee aligned internal representations.
Details
Motivation: LLMs are increasingly used across diverse cultural contexts, but prior evaluations focused mostly on output-level performance without examining internal mechanisms. Circuit analysis studies have covered few languages and rarely focused on culture.
Method: Trace LLMs’ internal cultural understanding by measuring activation path overlaps when answering semantically equivalent questions under: (1) varying target country while fixing question language, and (2) varying question language while fixing country. Use same-language country pairs to disentangle language from cultural aspects.
Result: Internal paths overlap more for same-language, cross-country questions than for cross-language, same-country questions, indicating strong language-specific patterns. The South Korea-North Korea pair shows low overlap and high variability, demonstrating linguistic similarity doesn’t guarantee aligned internal representation.
Conclusion: LLMs exhibit strong language-specific patterns in cultural understanding, and linguistic similarity between countries doesn’t necessarily lead to aligned internal representations, highlighting the complexity of cultural modeling in multilingual contexts.
Abstract: Large language models (LLMs) are increasingly used across diverse cultural contexts, making accurate cultural understanding essential. Prior evaluations have mostly focused on output-level performance, obscuring the factors that drive differences in responses, while studies using circuit analysis have covered few languages and rarely focused on culture. In this work, we trace LLMs’ internal cultural understanding mechanisms by measuring activation path overlaps when answering semantically equivalent questions under two conditions: varying the target country while fixing the question language, and varying the question language while fixing the country. We also use same-language country pairs to disentangle language from cultural aspects. Results show that internal paths overlap more for same-language, cross-country questions than for cross-language, same-country questions, indicating strong language-specific patterns. Notably, the South Korea-North Korea pair exhibits low overlap and high variability, showing that linguistic similarity does not guarantee aligned internal representation.
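Activation path overlap can be approximated as a per-layer Jaccard score over the sets of active units; this is an illustrative stand-in, since the abstract does not specify the paper's exact overlap measure.

```python
def path_overlap(path_a, path_b):
    """Mean per-layer Jaccard overlap between two activation paths,
    each given as a list of sets of active unit indices per layer."""
    scores = []
    for a, b in zip(path_a, path_b):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

# Hypothetical paths for two question conditions (active unit indices per layer).
same_language  = [{1, 2, 3}, {4, 5}]
cross_language = [{1, 2, 9}, {6, 7}]
overlap = path_overlap(same_language, cross_language)
```

Under the paper's finding, same-language cross-country pairs would score higher on such a measure than cross-language same-country pairs.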
[106] How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli
Main category: cs.CL
TL;DR: This paper introduces source-aware metrics for speech-to-text translation evaluation using ASR transcripts and back-translations as synthetic sources, with a novel cross-lingual re-segmentation algorithm to address alignment issues.
Details
Motivation: Current ST evaluation relies on reference-based metrics that ignore source audio information, while source-aware metrics in MT show better correlation with human judgments. However, extending this to ST is challenging due to audio sources and lack of reliable transcripts/alignments.
Method: Two strategies for generating textual proxies: ASR transcripts and back-translations of reference translations, plus a novel two-step cross-lingual re-segmentation algorithm to handle alignment mismatches between synthetic sources and references.
Result: ASR transcripts are more reliable than back-translations when WER < 20%, while back-translations offer a computationally cheaper alternative. The re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation across 79 language pairs and 6 diverse ST systems.
Conclusion: The proposed approach enables more accurate and principled evaluation methodologies for speech translation by effectively incorporating source information through synthetic proxies and addressing alignment challenges.
Abstract: Automatic evaluation of speech-to-text translation (ST) systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation that ignores valuable information from the source input. In machine translation (MT), recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio: automatic speech recognition (ASR) transcripts and back-translations of the reference translation. We also introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when word error rate is below 20%, while back-translations always represent a computationally cheaper but still effective alternative. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.
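The 20% WER rule of thumb can be expressed as a selection sketch. Note the caveat: at evaluation time gold transcripts are by assumption unavailable, so in practice the WER check would use an estimate (e.g. from a held-out subset); the implementation below is only illustrative.

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance over the reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def pick_source(asr_transcript, back_translation, gold_transcript, threshold=0.2):
    """Prefer the ASR transcript as synthetic source while its WER is below threshold."""
    if wer(gold_transcript, asr_transcript) < threshold:
        return asr_transcript
    return back_translation

chosen = pick_source("the cat sat on the mat", "a back translation",
                     "the cat sat on the mat")  # perfect ASR, so the transcript wins
```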
[107] DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
Yaxuan Wang, Chris Yuhao Liu, Quan Liu, Jinglong Pang, Wei Wei, Yujia Bao, Yang Liu
Main category: cs.CL
TL;DR: DRAGON is a reasoning-based framework for LLM unlearning that uses in-context chain-of-thought instructions to protect models without requiring retain data or model fine-tuning.
Details
Motivation: Existing unlearning methods require training data and retain data, which are often unavailable in real-world scenarios, limiting their practical applicability.
Method: Uses a lightweight detection module to identify forget-worthy prompts, then routes them through a CoT guard model for safe in-context intervention without modifying the base model.
Result: Extensive experiments across three unlearning tasks show strong unlearning capability, scalability, and practical applicability with novel evaluation metrics.
Conclusion: DRAGON provides an effective data-free solution for LLM unlearning that works in practical scenarios without requiring retain data or model modifications.
Abstract: Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that utilizes in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module to identify forget-worthy prompts without any retain data. These are then routed through a dedicated CoT guard model to enforce safe and accurate in-context intervention. To robustly evaluate unlearning performance, we introduce novel metrics for unlearning performance and the continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical scenarios.
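The detect-then-route control flow is simple to sketch. The keyword detector and canned guard response below are toy stand-ins; in DRAGON both the detector and the CoT guard are learned/LLM components.

```python
def dragon_guard(prompt, is_forget_worthy, guard_model, base_model):
    """Route forget-worthy prompts through the CoT guard; the base model is untouched."""
    if is_forget_worthy(prompt):
        return guard_model(prompt)
    return base_model(prompt)

# Toy stand-ins for the learned detector and guard.
detector = lambda p: "secret formula" in p.lower()
guard = lambda p: "[withheld: unlearned topic]"
base = lambda p: f"answer({p})"

out = dragon_guard("Tell me the secret formula", detector, guard, base)
```

Because intervention happens in context at inference time, no retain data or fine-tuning of the base model is needed, which is the framework's central claim.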
[108] SugarTextNet: A Transformer-Based Framework for Detecting Sugar Dating-Related Content on Social Media with Context-Aware Focal Loss
Lionel Z. Wang, Shihan Ben, Yulu Huang, Simeng Qin
Main category: cs.CL
TL;DR: SugarTextNet is a transformer-based framework that effectively detects sugar dating content on social media by addressing class imbalance and subtle linguistic cues, outperforming traditional methods.
Details
Motivation: Sugar dating content is proliferating on social media with serious societal concerns, but detection is challenging due to euphemisms, ambiguous language, and extreme class imbalance in real data.
Method: SugarTextNet integrates a pretrained transformer encoder, attention-based cue extractor, and contextual phrase encoder, with Context-Aware Focal Loss to handle class imbalance.
Result: The framework outperforms traditional ML models, deep learning baselines, and LLMs on a dataset of 3,067 Chinese Weibo posts, with ablation studies confirming each component’s importance.
Conclusion: Domain-specific, context-aware modeling is crucial for sensitive content detection, providing a robust solution for real-world content moderation challenges.
Abstract: Sugar dating-related content has rapidly proliferated on mainstream social media platforms, giving rise to serious societal and regulatory concerns, including the commercialization of intimate relationships and the normalization of transactional relationships. Detecting such content is highly challenging due to the prevalence of subtle euphemisms, ambiguous linguistic cues, and extreme class imbalance in real-world data. In this work, we present SugarTextNet, a novel transformer-based framework specifically designed to identify sugar dating-related posts on social media. SugarTextNet integrates a pretrained transformer encoder, an attention-based cue extractor, and a contextual phrase encoder to capture both salient and nuanced features in user-generated text. To address class imbalance and enhance minority-class detection, we introduce Context-Aware Focal Loss, a tailored loss function that combines focal loss scaling with contextual weighting. We evaluate SugarTextNet on a newly curated, manually annotated dataset of 3,067 Chinese social media posts from Sina Weibo, demonstrating that our approach substantially outperforms traditional machine learning models, deep learning baselines, and large language models across multiple metrics. Comprehensive ablation studies confirm the indispensable role of each component. Our findings highlight the importance of domain-specific, context-aware modeling for sensitive content detection, and provide a robust solution for content moderation in complex, real-world scenarios.
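"Focal loss scaling with contextual weighting" suggests something like the standard binary focal loss multiplied by a per-sample context weight. The exact combination is not given in the abstract, so the form below (and the `context_weight` input) is a hypothetical sketch.

```python
import math

def context_aware_focal_loss(p, y, context_weight, gamma=2.0, alpha=0.25):
    """Binary focal loss (alpha-balanced, gamma-focused) scaled by a
    per-sample contextual weight. p is the predicted positive probability."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    focal = -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
    return context_weight * focal

# A well-classified positive gets a tiny loss even before the context weight.
loss = context_aware_focal_loss(p=0.9, y=1, context_weight=1.5)
```

The `(1 - p_t) ** gamma` factor down-weights easy examples, while `context_weight` would boost posts whose contextual cues mark them as likely minority-class, addressing the imbalance the paper highlights.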
[109] SAFENLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
Ruiheng Liu, XiaoBing Chen, Jinyu Zhang, Qiongwen Zhang, Yu Zhang, Bailong Yang
Main category: cs.CL
TL;DR: SafeNlidb is a privacy-security alignment framework for LLM-based NLIDB that generates security-aware SQL through automated hybrid chain-of-thought reasoning without human-annotated data, achieving better security than larger LLMs while maintaining utility.
Details
Motivation: Current LLM-based NLIDB systems face privacy and security risks where LLMs may unintentionally expose confidential database contents or be manipulated to exfiltrate data through benign queries. Existing mitigation methods struggle with complex inference attacks, have high false positives, and compromise SQL reliability.
Method: Proposes SafeNlidb framework with automated pipeline generating hybrid chain-of-thought interaction data combining implicit security reasoning with SQL generation. Uses reasoning warm-up and alternating preference optimization to overcome multi-preference oscillations of DPO, enabling security-aware SQL through fine-grained reasoning without human-annotated data.
Result: Extensive experiments show the method outperforms both larger-scale LLMs and ideal-setting baselines, achieving significant security improvements while preserving high utility.
Conclusion: SafeNlidb provides an effective privacy-security alignment solution for LLM-based NLIDB systems, addressing critical security vulnerabilities through automated security reasoning without compromising SQL reliability.
Abstract: The rapid advancement of Large Language Models (LLMs) has driven significant progress in Natural Language Interface to Database (NLIDB). However, the widespread adoption of LLMs has raised critical privacy and security concerns. During interactions, LLMs may unintentionally expose confidential database contents or be manipulated by attackers to exfiltrate data through seemingly benign queries. While current efforts typically rely on rule-based heuristics or LLM agents to mitigate this leakage risk, these methods still struggle with complex inference-based attacks, suffer from high false positive rates, and often compromise the reliability of SQL queries. To address these challenges, we propose SafeNlidb, a novel privacy-security alignment framework for LLM-based NLIDB. The framework features an automated pipeline that generates hybrid chain-of-thought interaction data from scratch, seamlessly combining implicit security reasoning with SQL generation. Additionally, we introduce reasoning warm-up and alternating preference optimization to overcome the multi-preference oscillations of Direct Preference Optimization (DPO), enabling LLMs to produce security-aware SQL through fine-grained reasoning without the need for human-annotated preference data. Extensive experiments demonstrate that our method outperforms both larger-scale LLMs and ideal-setting baselines, achieving significant security improvements while preserving high utility. WARNING: This work may contain content that is offensive and harmful!
[110] SCOPE: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs
Zhenliang Zhang, Xinyu Hu, Xiaojun Wan
Main category: cs.CL
TL;DR: SCOPE is an inference-time method that mitigates copyright infringement in LLMs by identifying and clamping copyright-sensitive activations in a semantic subspace, without parameter updates or external filters.
Details
Motivation: LLMs sometimes reproduce copyrighted passages, creating legal risks. Existing defenses rely on surface-level token matching and external filters, which are complex and miss semantic paraphrasing.
Method: Uses sparse autoencoder (SAE) to project hidden states into high-dimensional semantic space, identifies copyright-sensitive subspace, and clamps its activations during decoding.
Result: Effectively mitigates copyright infringement on benchmarks without degrading general model utility.
Conclusion: The isolated subspace captures high-level semantics, enabling intrinsic semantic-space control for copyright protection.
Abstract: Large language models sometimes inadvertently reproduce passages that are copyrighted, exposing downstream applications to legal risk. Most existing studies for inference-time defences focus on surface-level token matching and rely on external blocklists or filters, which add deployment complexity and may overlook semantically paraphrased leakage. In this work, we reframe copyright infringement mitigation as intrinsic semantic-space control and introduce SCOPE, an inference-time method that requires no parameter updates or auxiliary filters. Specifically, the sparse autoencoder (SAE) projects hidden states into a high-dimensional, near-monosemantic space; benefiting from this representation, we identify a copyright-sensitive subspace and clamp its activations during decoding. Experiments on widely recognized benchmarks show that SCOPE mitigates copyright infringement without degrading general utility. Further interpretability analyses confirm that the isolated subspace captures high-level semantics.
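The encode-clamp-decode step can be sketched with a linear SAE. Real SAEs include encoder/decoder biases and are trained for sparsity, and how SCOPE identifies the sensitive feature set is a separate step; the sketch below only shows the clamping mechanics.

```python
import numpy as np

def scope_clamp(h, W_enc, W_dec, sensitive_dims):
    """Encode a hidden state into SAE features, zero the copyright-sensitive
    subspace, and decode back into the residual stream."""
    z = np.maximum(h @ W_enc, 0.0)   # sparse, non-negative feature activations
    z[list(sensitive_dims)] = 0.0    # clamp the sensitive subspace
    return z @ W_dec                 # decoded hidden state

rng = np.random.default_rng(1)
h = rng.normal(size=4)               # toy hidden state
W_enc = rng.normal(size=(4, 16))     # SAE encoder (biases omitted for brevity)
W_dec = rng.normal(size=(16, 4))     # SAE decoder
h_safe = scope_clamp(h, W_enc, W_dec, sensitive_dims={3, 7})
```

Because only feature activations are modified at decode time, no model parameters change and no external blocklist is consulted, matching the paper's "intrinsic semantic-space control" framing.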
[111] FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation
Song Jin, Shuqi Li, Shukun Zhang, Rui Yan
Main category: cs.CL
TL;DR: First formulation of Equity Research Report generation task with FinRpt benchmark, multi-agent FinRpt-Gen framework, and comprehensive evaluation system.
Details
Motivation: LLMs have succeeded in financial tasks but haven't been applied to fully automate Equity Research Report generation, which remains uncharted territory.
Method: Created FinRpt benchmark with dataset construction pipeline integrating 7 financial data types, proposed multi-agent FinRpt-Gen framework using SFT and RL, and introduced 11-metric evaluation system.
Result: Experimental results show the high data quality and metric effectiveness of the FinRpt benchmark, with FinRpt-Gen demonstrating strong performance in ERR generation.
Conclusion: FinRpt and FinRpt-Gen have potential to drive innovation in Equity Research Report generation field, with all code and datasets publicly available.
Abstract: While LLMs have shown great success in financial tasks like stock prediction and question answering, their application in fully automating Equity Research Report generation remains uncharted territory. In this paper, we formulate the Equity Research Report (ERR) Generation task for the first time. To address the data scarcity and the absence of evaluation metrics, we present an open-source evaluation benchmark for ERR generation - FinRpt. We frame a Dataset Construction Pipeline that integrates 7 financial data types and produces a high-quality ERR dataset automatically, which could be used for model training and evaluation. We also introduce a comprehensive evaluation system including 11 metrics to assess the generated ERRs. Moreover, we propose a multi-agent framework specifically tailored to address this task, named FinRpt-Gen, and train several LLM-based agents on the proposed datasets using Supervised Fine-Tuning and Reinforcement Learning. Experimental results indicate the data quality and metric effectiveness of the FinRpt benchmark and the strong performance of FinRpt-Gen, showcasing their potential to drive innovation in the ERR generation field. All code and datasets are publicly available.
[112] Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction
Hyeryun Park, Byung Mo Gu, Jun Hee Lee, Byeong Hyeon Choi, Sekeun Kim, Hyun Koo Kim, Kyungsang Kim
Main category: cs.CL
TL;DR: A voice-controlled Surgical Agent Orchestrator Platform (SAOP) using LLM-based agents to help surgeons access and manipulate patient data during da Vinci robotic surgery without interrupting their workflow.
Details
Motivation: Surgeons' hands and eyes are fully occupied during da Vinci robotic surgery, making it difficult to access multimodal patient data without interruption, which creates workflow inefficiencies.
Method: Hierarchical multi-agent framework with orchestration agent and three task-specific agents driven by LLMs that autonomously plan, refine, validate, and reason to map voice commands into specific tasks like retrieving clinical information, manipulating CT scans, or navigating 3D anatomical models.
Result: SAOP achieved high accuracy and success rates across 240 voice commands, with LLM-based agents improving robustness against speech recognition errors and handling diverse or ambiguous free-form commands effectively.
Conclusion: The platform demonstrates strong potential to support minimally invasive da Vinci robotic surgery by enabling voice-controlled access to patient data without interrupting surgical workflow.
Abstract: In da Vinci robotic surgery, surgeons’ hands and eyes are fully engaged in the procedure, making it difficult to access and manipulate multimodal patient data without interruption. We propose a voice-directed Surgical Agent Orchestrator Platform (SAOP) built on a hierarchical multi-agent framework, consisting of an orchestration agent and three task-specific agents driven by Large Language Models (LLMs). These LLM-based agents autonomously plan, refine, validate, and reason to map voice commands into specific tasks such as retrieving clinical information, manipulating CT scans, or navigating 3D anatomical models on the surgical video. We also introduce a Multi-level Orchestration Evaluation Metric (MOEM) to comprehensively assess the performance and robustness from command-level and category-level perspectives. The SAOP achieves high accuracy and success rates across 240 voice commands, while LLM-based agents improve robustness against speech recognition errors and diverse or ambiguous free-form commands, demonstrating strong potential to support minimally invasive da Vinci robotic surgery.
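The orchestration layer is essentially intent-based dispatch: classify the voice command, then hand it to the matching task agent. The keyword classifier and canned agent responses below are toys; SAOP's real orchestrator uses LLM-based planning, refinement, and validation.

```python
def orchestrate(command, agents, classify):
    """Route a voice command to one of the task-specific agents by intent."""
    intent = classify(command)
    handler = agents.get(intent)
    if handler is None:
        return "clarification needed"
    return handler(command)

# Toy intent classifier and agents (illustrative only).
def keyword_intent(cmd):
    cmd = cmd.lower()
    if "ct" in cmd:
        return "imaging"
    if "3d" in cmd or "model" in cmd:
        return "anatomy"
    if "history" in cmd or "lab" in cmd:
        return "records"
    return "unknown"

agents = {
    "imaging": lambda c: "rotating CT view",
    "anatomy": lambda c: "navigating 3D model",
    "records": lambda c: "fetching labs",
}
result = orchestrate("show the CT scan", agents, keyword_intent)
```

The fallback branch matters in this setting: an ambiguous free-form command should trigger clarification rather than an arbitrary action on patient data.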
cs.CV
[113] Cross Modal Fine-grained Alignment via Granularity-aware and Region-uncertain Modeling
Jiale Liu, Haoming Zhou, Yishu Zhu, Bingzhi Chen, Yuncheng Jiang
Main category: cs.CV
TL;DR: Proposes a unified approach for fine-grained image-text alignment using significance-aware modeling and region-level uncertainty modeling to address limitations in existing methods.
Details
Motivation: Address fundamental limitations in fine-grained image-text alignment: lack of robust intra-modal mechanisms for assessing token significance and absence of fine-grained uncertainty modeling for complex region-word correspondences.
Method: Incorporates significance-aware and granularity-aware modeling with region-level uncertainty modeling using modality-specific biases and Gaussian mixture distributions for region features.
Result: Achieves state-of-the-art performance on Flickr30K and MS-COCO datasets across various backbone architectures, enhancing robustness and interpretability.
Conclusion: The proposed unified approach effectively addresses key challenges in fine-grained image-text alignment and demonstrates superior performance compared to existing methods.
Abstract: Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.
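One way uncertainty-aware region-word scoring can work is via the expected distance between a word embedding and a region modeled as a Gaussian. The paper uses mixtures of Gaussians; the single diagonal-Gaussian component and the specific similarity form below are simplifying assumptions.

```python
import numpy as np

def gaussian_region_similarity(mu, var, word_vec):
    """Expected negative squared distance between a word vector and a region
    modeled as a diagonal Gaussian: -(||mu - w||^2 + sum(var)).
    Higher region uncertainty (variance) lowers the similarity."""
    return -(np.sum((mu - word_vec) ** 2) + np.sum(var))

mu  = np.array([1.0, 0.0])   # region mean
var = np.array([0.1, 0.1])   # per-dimension region variance
w   = np.array([1.0, 1.0])   # word embedding
sim = gaussian_region_similarity(mu, var, w)  # -(1.0 + 0.2) = -1.2
```

Scoring a word against each mixture component and aggregating would capture the one-to-many region-word correspondences the paper targets.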
[114] Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs
Hari Lee
Main category: cs.CV
TL;DR: TbVAD is a language-driven framework for weakly supervised video anomaly detection that performs detection and explanation entirely in the textual domain using vision-language models and semantic slot reasoning.
Details
Motivation: To create an interpretable video anomaly detection system that moves beyond conventional visual feature-based approaches by leveraging language representations for knowledge-grounded reasoning.
Method: Three-stage framework: (1) transform video content into fine-grained captions using vision-language model, (2) construct structured knowledge by organizing captions into four semantic slots (action, object, context, environment), (3) generate slot-wise explanations for anomaly decisions.
Result: Evaluated on UCF-Crime and XD-Violence benchmarks, demonstrating that textual knowledge reasoning provides interpretable and reliable anomaly detection for real-world surveillance scenarios.
Conclusion: Text-based reasoning enables interpretable video anomaly detection with knowledge-grounded explanations, offering a promising approach for surveillance applications.
Abstract: We introduce Text-based Explainable Video Anomaly Detection (TbVAD), a language-driven framework for weakly supervised video anomaly detection that performs anomaly detection and explanation entirely within the textual domain. Unlike conventional WSVAD models that rely on explicit visual features, TbVAD represents video semantics through language, enabling interpretable and knowledge-grounded reasoning. The framework operates in three stages: (1) transforming video content into fine-grained captions using a vision-language model, (2) constructing structured knowledge by organizing the captions into four semantic slots (action, object, context, environment), and (3) generating slot-wise explanations that reveal which semantic factors contribute most to the anomaly decision. We evaluate TbVAD on two public benchmarks, UCF-Crime and XD-Violence, demonstrating that textual knowledge reasoning provides interpretable and reliable anomaly detection for real-world surveillance scenarios.
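Stage 2's structured knowledge amounts to bucketing caption phrases into the four semantic slots. The keyword tagger below is a toy; TbVAD derives slot contents from VLM captions, not hand-written rules.

```python
def build_slots(captions, tagger):
    """Organize frame captions into the four semantic slots TbVAD reasons over."""
    slots = {"action": [], "object": [], "context": [], "environment": []}
    for cap in captions:
        for slot, phrase in tagger(cap):
            slots[slot].append(phrase)
    return slots

# Toy tagger (illustrative keyword rules only).
def toy_tagger(caption):
    pairs = []
    if "running" in caption:
        pairs.append(("action", "running"))
    if "bag" in caption:
        pairs.append(("object", "bag"))
    if "night" in caption:
        pairs.append(("environment", "night"))
    return pairs

slots = build_slots(["person running with a bag at night"], toy_tagger)
```

Slot-wise explanations (stage 3) then point at whichever slot's contents drove the anomaly score, e.g. an unusual action in a familiar environment.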
[115] Two Datasets Are Better Than One: Method of Double Moments for 3-D Reconstruction in Cryo-EM
Joe Kileel, Oscar Mickelin, Amit Singer, Sheng Xu
Main category: cs.CV
TL;DR: MoDM reconstructs molecular structures from second-order moments of cryo-EM projection images under different orientation distributions using convex optimization, enabling accurate recovery without full distribution knowledge.
Details
Motivation: Cryo-EM faces challenges with noisy projections and unknown particle orientations. The paper aims to leverage dataset diversity under different experimental conditions to improve reconstruction quality.
Method: Method of double moments (MoDM) uses second-order moments from two datasets: one with uniform orientation distribution and another with unknown non-uniform distribution. A convex-relaxation-based algorithm reconstructs structures from these statistics.
Result: The method generically uniquely determines molecular structures (up to global rotation/reflection) and achieves accurate recovery using only second-order statistics, demonstrating enhanced reconstruction quality.
Conclusion: Leveraging multiple datasets under different experimental conditions substantially improves cryo-EM reconstruction, showing the power of data fusion and statistical modeling in computational imaging.
Abstract: Cryo-electron microscopy (cryo-EM) is a powerful imaging technique for reconstructing three-dimensional molecular structures from noisy tomographic projection images of randomly oriented particles. We introduce a new data fusion framework, termed the method of double moments (MoDM), which reconstructs molecular structures from two instances of the second-order moment of projection images obtained under distinct orientation distributions: one uniform, the other non-uniform and unknown. We prove that these moments generically uniquely determine the underlying structure, up to a global rotation and reflection, and we develop a convex-relaxation-based algorithm that achieves accurate recovery using only second-order statistics. Our results demonstrate the advantage of collecting and modeling multiple datasets under different experimental conditions, illustrating that leveraging dataset diversity can substantially enhance reconstruction quality in computational imaging tasks.
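The statistic MoDM works from, the second-order moment of projection images, can be sketched directly. The image sizes and the two random stand-in datasets below are assumptions for illustration; the actual method contrasts this moment under a uniform versus an unknown non-uniform orientation distribution.

```python
import numpy as np

# Illustrative sketch: estimating the second-order moment M = E[y y^T]
# over flattened projection images, the statistic MoDM builds on.

rng = np.random.default_rng(0)

def second_moment(images):
    """images: (n, H, W). Returns the (H*W, H*W) empirical moment."""
    Y = images.reshape(images.shape[0], -1)   # flatten each projection
    return Y.T @ Y / Y.shape[0]

imgs_uniform = rng.normal(size=(100, 8, 8))   # stand-in dataset 1
imgs_skewed = rng.normal(size=(100, 8, 8)) * 1.5  # stand-in dataset 2
M1, M2 = second_moment(imgs_uniform), second_moment(imgs_skewed)
```

The moment matrix is symmetric by construction; MoDM's contribution is proving that two such moments generically determine the structure and recovering it via convex relaxation.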
[116] Modulo Video Recovery via Selective Spatiotemporal Vision Transformer
Tianyu Geng, Feng Ji, Wee Peng Tay
Main category: cs.CV
TL;DR: SSViT is a novel Transformer-based framework for modulo video recovery that uses selective token processing to efficiently reconstruct high-quality HDR videos from folded 8-bit inputs.
Details
Motivation: Conventional image sensors have limited dynamic range, causing saturation in HDR scenes. Modulo cameras fold irradiance but require specialized unwrapping algorithms, and existing HDR methods are unsuitable for modulo recovery.
Method: Proposes Selective Spatiotemporal Vision Transformer (SSViT) that captures global dependencies and spatial-temporal relationships using a token selection strategy to focus on critical regions and improve efficiency.
Result: SSViT produces high-quality reconstructions from 8-bit folded videos and achieves state-of-the-art performance in modulo video recovery.
Conclusion: Transformers with selective token processing are effective for modulo video reconstruction, overcoming limitations of conventional HDR methods and advancing deep learning applications in this domain.
Abstract: Conventional image sensors have limited dynamic range, causing saturation in high-dynamic-range (HDR) scenes. Modulo cameras address this by folding incident irradiance into a bounded range, yet require specialized unwrapping algorithms to reconstruct the underlying signal. Unlike HDR recovery, which extends dynamic range from conventional sampling, modulo recovery restores actual values from folded samples. Despite being introduced over a decade ago, progress in modulo image recovery has been slow, especially in the use of modern deep learning techniques. In this work, we demonstrate that standard HDR methods are unsuitable for modulo recovery. Transformers, however, can capture global dependencies and spatial-temporal relationships crucial for resolving folded video frames. Still, adapting existing Transformer architectures for modulo recovery demands novel techniques. To this end, we present Selective Spatiotemporal Vision Transformer (SSViT), the first deep learning framework for modulo video reconstruction. SSViT employs a token selection strategy to improve efficiency and concentrate on the most critical regions. Experiments confirm that SSViT produces high-quality reconstructions from 8-bit folded videos and achieves state-of-the-art performance in modulo video recovery.
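A token selection step of the kind SSViT's efficiency rests on can be sketched as scoring tokens and keeping only the top-k for attention. The scoring rule used here (feature L2 norm) is an assumption; the paper's actual selection criterion may differ.

```python
import numpy as np

# Minimal sketch of a top-k token selection strategy: score each token,
# keep the k highest-scoring ones, and process only that subset.
# Scoring by feature norm is an illustrative choice, not SSViT's rule.

def select_tokens(tokens, k):
    """tokens: (n, d) array. Keep the k tokens with largest L2 norm,
    preserving their original order."""
    scores = np.linalg.norm(tokens, axis=1)
    keep = np.sort(np.argsort(scores)[-k:])
    return tokens[keep], keep

rng = np.random.default_rng(1)
toks = rng.normal(size=(16, 32))   # 16 spatiotemporal tokens, dim 32
kept, idx = select_tokens(toks, k=4)
```

Attention over the kept subset then costs O(k^2) rather than O(n^2), which is the efficiency argument for selective processing.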
[117] Laplacian Score Sharpening for Mitigating Hallucination in Diffusion Models
Barath Chandran. C, Srinivas Anumasa, Dianbo Liu
Main category: cs.CV
TL;DR: Proposes a post-hoc adjustment to diffusion model score functions using Laplacian approximation to reduce mode interpolation hallucinations during sampling.
Details
Motivation: Diffusion models suffer from hallucinations that create incoherent samples due to mode interpolation and score smoothening, but existing methods do not prevent them during sampling.
Method: A post-hoc adjustment to the score function using a Laplacian (sharpness) approximation, computed for higher dimensions via a finite-difference variant of the Hutchinson trace estimator.
Result: Significantly reduces hallucinated samples across 1D, 2D distributions and high-dimensional image datasets.
Conclusion: Laplacian-based correction effectively reduces mode interpolation hallucinations and reveals relationship between Laplacian and score uncertainty.
Abstract: Diffusion models, though successful, are known to suffer from hallucinations that create incoherent or unrealistic samples. Recent works have attributed this to the phenomenon of mode interpolation and score smoothening, but they lack a method to prevent their generation during sampling. In this paper, we propose a post-hoc adjustment to the score function during inference that leverages the Laplacian (or sharpness) of the score to reduce mode interpolation hallucination in unconditional diffusion models across 1D, 2D, and high-dimensional image data. We derive an efficient Laplacian approximation for higher dimensions using a finite-difference variant of the Hutchinson trace estimator. We show that this correction significantly reduces the rate of hallucinated samples across toy 1D/2D distributions and a high-dimensional image dataset. Furthermore, our analysis explores the relationship between the Laplacian and uncertainty in the score.
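The finite-difference Hutchinson estimator the paper relies on can be sketched for the trace of the Jacobian of a vector field such as a score function. The probe distribution (Rademacher), step size, and probe count below are assumptions; the paper's exact estimator may differ in these details.

```python
import numpy as np

# Sketch of a finite-difference Hutchinson trace estimator for a vector
# field s(x): tr(J_s) ~ E[ v^T (s(x + eps*v) - s(x - eps*v)) / (2*eps) ]
# with random probe vectors v. This is the kind of Laplacian/sharpness
# statistic used to adjust the score; hyperparameters are illustrative.

def hutchinson_trace(s, x, n_probes=64, eps=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=x.shape)   # Rademacher probe
        total += v @ (s(x + eps * v) - s(x - eps * v)) / (2 * eps)
    return total / n_probes

# Sanity check on a field with known answer: s(x) = -x has trace -d.
d = 5
est = hutchinson_trace(lambda x: -x, np.zeros(d))
```

For the linear field s(x) = -x the central difference is exact, so the estimate equals -d regardless of the probes.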
[118] Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance
Kwanyoung Kim
Main category: cs.CV
TL;DR: ASAG is a novel guidance method for diffusion models that uses adversarial optimal transport to intentionally disrupt attention mechanisms, improving sample quality without model retraining.
Details
Motivation: Existing guidance methods like CFG rely on heuristic perturbation functions without principled foundations, lacking systematic approaches to attention manipulation.
Method: ASAG reinterprets attention scores through the lens of optimal transport and injects an adversarial cost into self-attention layers using the Sinkhorn algorithm to reduce pixel-wise similarity between queries and keys.
Result: ASAG consistently improves text-to-image diffusion quality, enhances controllability and fidelity in downstream applications like IP-Adapter and ControlNet.
Conclusion: ASAG is a lightweight, plug-and-play method that improves diffusion model reliability without requiring model retraining, providing principled attention guidance.
Abstract: Diffusion models have demonstrated strong generative performance when using guidance methods such as classifier-free guidance (CFG), which enhance output quality by modifying the sampling trajectory. These methods typically improve a target output by intentionally degrading another, often the unconditional output, using heuristic perturbation functions such as identity mixing or blurred conditions. However, these approaches lack a principled foundation and rely on manually designed distortions. In this work, we propose Adversarial Sinkhorn Attention Guidance (ASAG), a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport and intentionally disrupts the transport cost via the Sinkhorn algorithm. Instead of naively corrupting the attention mechanism, ASAG injects an adversarial cost within self-attention layers to reduce pixel-wise similarity between queries and keys. This deliberate degradation weakens misleading attention alignments and leads to improved conditional and unconditional sample quality. ASAG shows consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet. The method is lightweight, plug-and-play, and improves reliability without requiring any model retraining.
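The Sinkhorn iteration that ASAG builds on can be shown in isolation: alternating row and column normalizations of exp(scores / tau) drive an attention-score matrix toward a doubly-stochastic transport plan. ASAG's contribution, perturbing the underlying cost adversarially, is not reproduced here; the temperature and iteration count are illustrative.

```python
import numpy as np

# Plain Sinkhorn normalization of an attention score matrix toward a
# doubly-stochastic matrix (uniform marginals). ASAG additionally
# injects an adversarial cost before this step; omitted here.

def sinkhorn(scores, tau=1.0, n_iter=50):
    K = np.exp(scores / tau)
    for _ in range(n_iter):
        K /= K.sum(axis=1, keepdims=True)   # normalize rows
        K /= K.sum(axis=0, keepdims=True)   # normalize columns
    return K

rng = np.random.default_rng(2)
P = sinkhorn(rng.normal(size=(6, 6)))
```

After convergence both row and column sums are (approximately) one, which is the optimal-transport reading of attention that the method exploits.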
[119] LiveNeRF: Efficient Face Replacement Through Neural Radiance Fields Integration
Tung Vu, Hai Nguyen, Cong Tran
Main category: cs.CV
TL;DR: LiveNeRF framework enables real-time face replacement at 33 FPS with high visual quality for applications like live streaming and video conferencing, while advocating responsible deployment to prevent misuse.
Details
Motivation: To overcome limitations of existing face replacement methods by achieving real-time performance for practical deployment in entertainment, education, and communication applications, while benefiting content creators, educators, and individuals with speech impairments.
Method: The LiveNeRF framework, which performs real-time face replacement with superior visual quality.
Result: Achieved real-time performance at 33 FPS with superior visual quality, enabling practical deployment in live streaming, video conferencing, and interactive media.
Conclusion: The framework enables significant advancements in various applications but requires responsible deployment with user consent verification and integration with detection systems to ensure positive societal impact while minimizing risks of unauthorized deepfake creation.
Abstract: Face replacement technology enables significant advancements in entertainment, education, and communication applications, including dubbing, virtual avatars, and cross-cultural content adaptation. Our LiveNeRF framework addresses critical limitations of existing methods by achieving real-time performance (33 FPS) with superior visual quality, enabling practical deployment in live streaming, video conferencing, and interactive media. The technology particularly benefits content creators, educators, and individuals with speech impairments through accessible avatar communication. While acknowledging potential misuse in unauthorized deepfake creation, we advocate for responsible deployment with user consent verification and integration with detection systems to ensure positive societal impact while minimizing risks.
[120] T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates
Zhitao Wang, Hengyu Man, Wenrui Li, Xingtao Wang, Xiaopeng Fan, Debin Zhao
Main category: cs.CV
TL;DR: T-GVC is a novel generative video coding framework that uses semantic-aware sparse motion sampling and trajectory-aligned diffusion guidance to achieve superior ultra-low bitrate compression while preserving realistic motion details.
Details
Motivation: Existing generative video coding methods are limited by domain specificity or excessive dependence on text guidance, which inadequately capture fine-grained motion details and lead to unrealistic reconstructions.
Method: Proposes T-GVC framework with semantic-aware sparse motion sampling that extracts pixel-wise motion as sparse trajectory points based on semantic importance, and integrates trajectory-aligned loss constraints into diffusion processes for training-free guidance.
Result: Outperforms both traditional and neural video codecs under ultra-low bitrate conditions, and achieves more precise motion control than existing text-guided methods.
Conclusion: Paves the way for a novel direction of generative video coding guided by geometric motion modeling, bridging low-level motion tracking with high-level semantic understanding.
Abstract: Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding for Ultra-Low Bitrate (ULB) scenarios by leveraging powerful generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or excessive dependence on high-level text guidance, which tend to inadequately capture fine-grained motion details, leading to unrealistic or incoherent reconstructions. To address these challenges, we propose Trajectory-Guided Generative Video Coding (dubbed T-GVC), a novel framework that bridges low-level motion tracking with high-level semantic understanding. T-GVC features a semantic-aware sparse motion sampling pipeline that extracts pixel-wise motion as sparse trajectory points based on their semantic importance, significantly reducing the bitrate while preserving critical temporal semantic information. In addition, by integrating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free guidance mechanism in latent space to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that T-GVC outperforms both traditional and neural video codecs under ULB conditions. Furthermore, additional experiments confirm that our framework achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.
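Semantic-aware sparse motion sampling of the kind T-GVC describes can be sketched as keeping only the k most semantically important pixels of a dense flow field as trajectory points. The random importance map and the `sparse_trajectories` helper below are illustrative; the paper derives importance from semantics.

```python
import numpy as np

# Illustrative sketch: reduce a dense flow field to k sparse trajectory
# points chosen by a per-pixel semantic importance map. The importance
# map here is random stand-in data.

def sparse_trajectories(flow, importance, k):
    """flow: (H, W, 2); importance: (H, W).
    Returns (k, 4) rows of (y, x, dy, dx) for the top-k pixels."""
    idx = np.argsort(importance.ravel())[-k:]
    ys, xs = np.unravel_index(idx, importance.shape)
    return np.stack([ys, xs, flow[ys, xs, 0], flow[ys, xs, 1]], axis=1)

rng = np.random.default_rng(3)
flow = rng.normal(size=(32, 32, 2))        # dense motion field
importance = rng.random((32, 32))          # stand-in semantic weights
pts = sparse_trajectories(flow, importance, k=10)
```

Transmitting only these k points instead of the full field is what makes the bitrate reduction possible while keeping semantically critical motion.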
[121] TrackStudio: An Integrated Toolkit for Markerless Tracking
Hristo Dimitrov, Giulia Dominijanni, Viktorija Pavalkyte, Tamar R. Makin
Main category: cs.CV
TL;DR: TrackStudio is a user-friendly GUI toolkit that combines existing open-source tools for markerless motion tracking, enabling automatic 2D/3D tracking, calibration, and analysis without programming skills.
Details
Motivation: To bridge the gap between advanced markerless motion tracking technology and non-expert users by providing an accessible, integrated solution that works out of the box.
Method: Combines established open-source tools into a single modular GUI pipeline with automatic tracking, calibration, preprocessing, feature extraction, and visualization capabilities.
Result: Validated across 76 participants with average inter-frame correlations >0.98 and triangulation errors <13.6mm for hand tracking, demonstrating stable performance even in challenging conditions.
Conclusion: TrackStudio provides a practical, accessible route into markerless tracking for researchers and laypeople who need reliable performance without specialist expertise.
Abstract: Markerless motion tracking has advanced rapidly in the past 10 years and currently offers powerful opportunities for behavioural, clinical, and biomechanical research. While several specialised toolkits provide high performance for specific tasks, using existing tools still requires substantial technical expertise. There remains a gap in accessible, integrated solutions that deliver sufficient tracking for non-experts across diverse settings. TrackStudio was developed to address this gap by combining established open-source tools into a single, modular, GUI-based pipeline that works out of the box. It provides automatic 2D and 3D tracking, calibration, preprocessing, feature extraction, and visualisation without requiring any programming skills. We supply a user guide with practical advice for video acquisition, synchronisation, and setup, alongside documentation of common pitfalls and how to avoid them. To validate the toolkit, we tested its performance across three environments using either low-cost webcams or high-resolution cameras, including challenging conditions involving body position, lighting, space, and obstructions. Across 76 participants, average inter-frame correlations exceeded 0.98 and average triangulation errors remained low (<13.6mm for hand tracking), demonstrating stable and consistent tracking. We further show that the same pipeline can be extended beyond hand tracking to other body and face regions. TrackStudio provides a practical, accessible route into markerless tracking for researchers or laypeople who need reliable performance without specialist expertise.
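One plausible reading of the "average inter-frame correlation" stability metric reported above is the mean Pearson correlation between keypoint coordinates in consecutive frames. The paper may compute it differently; this sketch and its toy trajectory are assumptions.

```python
import numpy as np

# Sketch of an inter-frame correlation metric: Pearson correlation of
# keypoint coordinates between consecutive frames, averaged over the
# sequence. Trajectory below is synthetic stand-in data.

def mean_interframe_correlation(traj):
    """traj: (T, K) array of K keypoint coordinates over T frames."""
    rs = [np.corrcoef(traj[t], traj[t + 1])[0, 1]
          for t in range(len(traj) - 1)]
    return float(np.mean(rs))

rng = np.random.default_rng(4)
base = rng.random(20) * 100                      # 20 keypoint coords
traj = np.stack([base + rng.normal(scale=0.5, size=20)
                 for _ in range(30)])            # slowly varying frames
r = mean_interframe_correlation(traj)
```

A stable tracker yields near-unity correlations, matching the >0.98 figures the validation reports.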
[122] UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation
Jinting Wang, Shan Yang, Chenxing Li, Dong Yu, Li Liu
Main category: cs.CV
TL;DR: UniCUE is a unified framework that directly generates speech from Cued Speech videos without text intermediates, using visual-semantic cues from CS recognition to guide generation.
Details
Motivation: Existing CSV2S approaches rely on text intermediates which cause error propagation and temporal misalignment, while direct methods struggle with multimodal complexity and limited data.
Method: UniCUE integrates CS recognition for visual-semantic cues, uses pose-aware visual processing, semantic alignment pool for precise mapping, and VisioPhonetic adapter to bridge understanding and generation tasks.
Result: UniCUE achieves state-of-the-art performance on the UniCUE-HI dataset containing 11,282 Mandarin CS videos from 14 cuers across multiple evaluation metrics.
Conclusion: The unified framework effectively addresses CSV2S challenges by directly generating speech from CS videos while leveraging fine-grained visual-semantic guidance, demonstrating superior performance over pipeline approaches.
Abstract: Cued Speech (CS) enhances lipreading via hand coding, offering visual phonemic cues that support precise speech perception for the hearing-impaired. The task of CS Video-to-Speech generation (CSV2S) aims to convert CS videos into intelligible speech signals. Most existing research focuses on CS Recognition (CSR), which transcribes video content into text. Consequently, a common solution for CSV2S is to integrate CSR with a text-to-speech (TTS) system. However, this pipeline relies on text as an intermediate medium, which may lead to error propagation and temporal misalignment between speech and CS video dynamics. In contrast, directly generating audio speech from CS video (direct CSV2S) often suffers from the inherent multimodal complexity and the limited availability of CS data. To address these challenges, we propose UniCUE, the first unified framework for CSV2S that directly generates speech from CS videos without relying on intermediate text. The core innovation of UniCUE lies in integrating an understanding task (CSR) that provides fine-grained CS visual-semantic cues to guide speech generation. Specifically, UniCUE incorporates a pose-aware visual processor, a semantic alignment pool that enables precise visual-semantic mapping, and a VisioPhonetic adapter to bridge the understanding and generation tasks within a unified architecture. To support this framework, we construct UniCUE-HI, a large-scale Mandarin CS dataset containing 11,282 videos from 14 cuers, including both hearing-impaired and normal-hearing individuals. Extensive experiments on this dataset demonstrate that UniCUE achieves state-of-the-art performance across multiple evaluation metrics.
[123] Predicting Coronary Artery Calcium Severity based on Non-Contrast Cardiac CT images using Deep Learning
Lachlan Nguyen, Aidan Cousins, Arcot Sowmya, Hugh Dixson, Sonit Singh
Main category: cs.CV
TL;DR: Deep learning CNN model accurately classifies coronary artery calcium scores into six clinical categories with 96.5% accuracy, showing high agreement with semiautomatic methods.
Details
Motivation: Cardiovascular disease causes high mortality worldwide, and current CAC scoring requires time-intensive semiautomatic analysis by radiologists, creating need for automated solutions.
Method: Developed a deep learning CNN model using 68 patient cardiac CT scans with semiautomatic CAC scores as reference labels, divided into training/validation/test sets.
Result: The model achieved 96.5% accuracy and a Cohen’s kappa of 0.962, misclassifying 32 cases (overestimating CAC in 26 of the 32), and showed high generalizability.
Conclusion: CNN model is viable for stratifying calcium scores into six clinical categories, producing accurate and consistent results comparable to current semiautomatic practice.
Abstract: Cardiovascular disease causes high rates of mortality worldwide. Coronary artery calcium (CAC) scoring is a powerful tool to stratify the risk of atherosclerotic cardiovascular disease. Current scoring practices require time-intensive semiautomatic analysis of cardiac computed tomography by radiologists and trained radiographers. The purpose of this study is to develop a deep learning convolutional neural network (CNN) model to classify the calcium score in cardiac, non-contrast computed tomography images into one of six clinical categories. A total of 68 patient scans were retrospectively obtained together with their respective reported semiautomatic calcium score using an ECG-gated GE Discovery 570 Cardiac SPECT/CT camera. The dataset was divided into training, validation and test sets. Using the semiautomatic CAC score as the reference label, the model demonstrated high performance on a six-class CAC scoring categorisation task. Of the scans analysed, the model misclassified 32 cases, tending towards overestimating the CAC in 26 out of 32 misclassifications. Overall, the model showed high agreement (Cohen’s kappa of 0.962), an overall accuracy of 96.5% and high generalisability. The results suggest that the model outputs were accurate and consistent with current semiautomatic practice, with good generalisability to test data. The model demonstrates the viability of a CNN model to stratify the calcium score into an expanded set of six clinical categories.
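Cohen's kappa, the agreement statistic reported above, can be computed from a confusion matrix as (po - pe) / (1 - pe), where po is observed agreement and pe is chance agreement from the marginals. The labels below are toy data, not the study's.

```python
import numpy as np

# Sketch of Cohen's kappa from predicted vs. reference class labels.
# Toy labels; the study compares CNN categories to semiautomatic scores.

def cohens_kappa(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    po = np.trace(cm) / n                      # observed agreement
    pe = (cm.sum(0) @ cm.sum(1)) / n ** 2      # chance agreement
    return (po - pe) / (1 - pe)

y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 2, 2, 3, 2]              # one disagreement
kappa = cohens_kappa(y_true, y_pred, n_classes=4)
```

Here 7 of 8 labels agree (po = 0.875) against a chance agreement of 0.25, giving kappa = 5/6, and a value of 0.962 as in the study indicates near-perfect agreement.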
[124] FlowFeat: Pixel-Dense Embedding of Motion Profiles
Nikita Araslanov, Anna Sonnweber, Daniel Cremers
Main category: cs.CV
TL;DR: FlowFeat is a high-resolution, multi-task feature representation that uses motion profile distillation from optical flow networks to enhance spatial detail and temporal consistency for dense prediction tasks.
Details
Motivation: Current state-of-the-art networks like transformers produce low-resolution feature grids that are suboptimal for dense prediction tasks, creating a need for higher-resolution representations with better geometric and semantic cues.
Method: Developed a novel distillation technique that embeds distributions of plausible apparent motions (motion profiles) using optical flow networks and diverse video data through self-supervised training framework.
Result: FlowFeat significantly enhances representational power of five state-of-the-art encoders across three dense tasks (video object segmentation, monocular depth estimation, semantic segmentation), is computationally inexpensive, and robust to inaccurate flow estimation.
Conclusion: FlowFeat represents a step forward towards reliable and versatile dense image representations by providing high-resolution features with compelling geometric and semantic cues while maintaining temporal consistency.
Abstract: Dense and versatile image representations underpin the success of virtually all computer vision applications. However, state-of-the-art networks, such as transformers, produce low-resolution feature grids, which are suboptimal for dense prediction tasks. To address this limitation, we present FlowFeat, a high-resolution and multi-task feature representation. The key ingredient behind FlowFeat is a novel distillation technique that embeds a distribution of plausible apparent motions, or motion profiles. By leveraging optical flow networks and diverse video data, we develop an effective self-supervised training framework that statistically approximates the apparent motion. With its remarkable level of spatial detail, FlowFeat encodes a compelling degree of geometric and semantic cues while exhibiting high temporal consistency. Empirically, FlowFeat significantly enhances the representational power of five state-of-the-art encoders and alternative upsampling strategies across three dense tasks: video object segmentation, monocular depth estimation and semantic segmentation. Training FlowFeat is computationally inexpensive and robust to inaccurate flow estimation, remaining highly effective even when using unsupervised flow networks. Our work takes a step forward towards reliable and versatile dense image representations.
[125] UltraGS: Gaussian Splatting for Ultrasound Novel View Synthesis
Yuezhe Yang, Wenjie Cai, Dexin Yang, Yufang Dong, Xingbo Dong, Zhe Jin
Main category: cs.CV
TL;DR: UltraGS is a Gaussian Splatting framework for ultrasound imaging that enables novel view synthesis through depth-aware Gaussian modeling and ultrasound-specific rendering functions, achieving state-of-the-art performance in image quality metrics.
Details
Motivation: Limited field of view in ultrasound imaging complicates novel view synthesis, which is crucial for clinical diagnostics and requires accurate depth prediction and tissue intensity modeling.
Method: 1) Depth-aware Gaussian splatting with learnable field of view per Gaussian for accurate depth prediction; 2) SH-DARS rendering function combining low-order spherical harmonics with ultrasound wave physics (depth attenuation, reflection, scattering); 3) Clinical Ultrasound Examination Dataset for benchmarking.
Result: Achieves state-of-the-art results: PSNR up to 29.55, SSIM up to 0.89, MSE as low as 0.002, with real-time synthesis at 64.69 fps on three datasets.
Conclusion: UltraGS provides an effective framework for ultrasound novel view synthesis with superior image quality and real-time performance, supported by a new clinical dataset and open-source code.
Abstract: Ultrasound imaging is a cornerstone of non-invasive clinical diagnostics, yet its limited field of view complicates novel view synthesis. We propose UltraGS, a Gaussian Splatting framework optimized for ultrasound imaging. First, we introduce a depth-aware Gaussian splatting strategy, where each Gaussian is assigned a learnable field of view, enabling accurate depth prediction and precise structural representation. Second, we design SH-DARS, a lightweight rendering function combining low-order spherical harmonics with ultrasound-specific wave physics, including depth attenuation, reflection, and scattering, to model tissue intensity accurately. Third, we contribute the Clinical Ultrasound Examination Dataset, a benchmark capturing diverse anatomical scans under real-world clinical protocols. Extensive experiments on three datasets demonstrate UltraGS’s superiority, achieving state-of-the-art results in PSNR (up to 29.55), SSIM (up to 0.89), and MSE (as low as 0.002) while enabling real-time synthesis at 64.69 fps. The code and dataset are open-sourced at: https://github.com/Bean-Young/UltraGS.
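The depth-attenuation term in SH-DARS can be illustrated with the standard exponential decay of echo intensity with depth, I(d) = I0 * exp(-alpha * d). The attenuation coefficient below is a toy value, and the reflection and scattering terms of the full rendering function are not modeled here.

```python
import numpy as np

# Sketch of the depth-attenuation component of an ultrasound intensity
# model: exponential (Beer-Lambert style) decay with depth. alpha is a
# toy coefficient; SH-DARS also models reflection and scattering.

def attenuated_intensity(i0, depth, alpha=0.5):
    """Echo intensity after traversing `depth` units of tissue."""
    return i0 * np.exp(-alpha * depth)

depths = np.linspace(0.0, 4.0, 5)
vals = attenuated_intensity(1.0, depths)
```

Deeper structures return weaker echoes, which is why a renderer that ignores attenuation mispredicts tissue intensity at depth.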
[126] VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics
Daniel Cher, Brian Wei, Srikumar Sastry, Nathan Jacobs
Main category: cs.CV
TL;DR: VectorSynth is a diffusion-based framework for pixel-accurate satellite image synthesis conditioned on polygonal geographic annotations with semantic attributes, enabling fine-grained spatial edits.
Details
Motivation: To enable fine-grained, spatially grounded edits in satellite imagery by learning dense cross-modal correspondences between imagery and semantic vector geometry, overcoming limitations of prior text- or layout-conditioned models.
Method: Uses a vision language alignment module to produce pixel-level embeddings from polygon semantics, which guide a conditional image generation framework to respect both spatial extents and semantic cues. Supports interactive workflows mixing language prompts with geometry-aware conditioning.
Result: Shows strong improvements over prior methods in semantic fidelity and structural realism. The trained vision language model demonstrates fine-grained spatial grounding. Enables rapid what-if simulations, spatial edits, and map-informed content generation.
Conclusion: VectorSynth provides an effective framework for pixel-accurate satellite image synthesis with fine-grained spatial control, supporting interactive editing workflows and demonstrating superior performance compared to existing methods.
Abstract: We introduce VectorSynth, a diffusion-based framework for pixel-accurate satellite image synthesis conditioned on polygonal geographic annotations with semantic attributes. Unlike prior text- or layout-conditioned models, VectorSynth learns dense cross-modal correspondences that align imagery and semantic vector geometry, enabling fine-grained, spatially grounded edits. A vision language alignment module produces pixel-level embeddings from polygon semantics; these embeddings guide a conditional image generation framework to respect both spatial extents and semantic cues. VectorSynth supports interactive workflows that mix language prompts with geometry-aware conditioning, allowing rapid what-if simulations, spatial edits, and map-informed content generation. For training and evaluation, we assemble a collection of satellite scenes paired with pixel-registered polygon annotations spanning diverse urban scenes with both built and natural features. We observe strong improvements over prior methods in semantic fidelity and structural realism, and show that our trained vision language model demonstrates fine-grained spatial grounding. The code and data are available at https://github.com/mvrl/VectorSynth.
[127] Auto-US: An Ultrasound Video Diagnosis Agent Using Video Classification Framework and LLMs
Yuezhe Yang, Yiyue Guo, Wenjie Cai, Qingqing Ruan, Siying Wang, Xingbo Dong, Zhe Jin, Yong Dai
Main category: cs.CV
TL;DR: Auto-US is an AI-assisted ultrasound diagnosis system that combines ultrasound videos with clinical text, achieving 86.73% accuracy in video classification and generating clinically validated diagnostic suggestions.
Details
Motivation: To address limitations in existing AI-assisted ultrasound research regarding dataset diversity, diagnostic performance, and clinical applicability.
Method: Developed CTU-Net for ultrasound video classification and integrated large language models to generate diagnostic suggestions from combined ultrasound video data and clinical text.
Result: Achieved 86.73% accuracy in ultrasound video classification across 5 categories and 3 organs, with diagnostic scores exceeding 3/5 validated by professional clinicians.
Conclusion: Auto-US demonstrates effectiveness and clinical potential for real-world ultrasound applications, with publicly available code and data.
Abstract: AI-assisted ultrasound video diagnosis presents new opportunities to enhance the efficiency and accuracy of medical imaging analysis. However, existing research remains limited in terms of dataset diversity, diagnostic performance, and clinical applicability. In this study, we propose Auto-US, an intelligent diagnosis agent that integrates ultrasound video data with clinical diagnostic text. To support this, we constructed the CUV Dataset of 495 ultrasound videos spanning five categories and three organs, aggregated from multiple open-access sources. We developed CTU-Net, which achieves state-of-the-art performance in ultrasound video classification, reaching an accuracy of 86.73%. Furthermore, by incorporating large language models, Auto-US is capable of generating clinically meaningful diagnostic suggestions. The final diagnostic scores for each case exceeded 3 out of 5 and were validated by professional clinicians. These results demonstrate the effectiveness and clinical potential of Auto-US in real-world ultrasound applications. Code and data are available at: https://github.com/Bean-Young/Auto-US.
[128] Class Incremental Medical Image Segmentation via Prototype-Guided Calibration and Dual-Aligned Distillation
Shengqian Zhu, Chengrong Yu, Qiang Wang, Ying Song, Guangjun Li, Jiafei Wu, Xiaogang Xu, Zhang Yi, Junjie Hu
Main category: cs.CV
TL;DR: Proposes PGCD and DAPD methods for class incremental medical image segmentation that use prototype-guided calibration and dual-aligned prototype distillation to better preserve old knowledge while learning new classes.
Details
Motivation: Existing CIMIS methods either use one-size-fits-all strategies that treat all regions equally, or focus only on global prototype alignment while ignoring local representations, leading to knowledge degradation.
Method: PGCD uses prototype-to-feature similarity to calibrate class-specific distillation intensity in different spatial regions. DAPD aligns local prototypes of old classes with both global and local prototypes to enhance old category segmentation.
Result: Comprehensive evaluations on two multi-organ segmentation benchmarks show the method outperforms state-of-the-art approaches.
Conclusion: The proposed PGCD and DAPD methods effectively address knowledge degradation in CIMIS and demonstrate superior robustness and generalization capabilities compared to existing methods.
Abstract: Class incremental medical image segmentation (CIMIS) aims to preserve knowledge of previously learned classes while learning new ones without relying on old-class labels. However, existing methods either 1) adopt one-size-fits-all strategies that treat all spatial regions and feature channels equally, which may hinder the preservation of accurate old knowledge, or 2) focus solely on aligning local prototypes with global ones for old classes while overlooking their local representations in new data, leading to knowledge degradation. To mitigate the above issues, we propose Prototype-Guided Calibration Distillation (PGCD) and Dual-Aligned Prototype Distillation (DAPD) for CIMIS in this paper. Specifically, PGCD exploits prototype-to-feature similarity to calibrate class-specific distillation intensity in different spatial regions, effectively reinforcing reliable old knowledge and suppressing misleading information from old classes. Complementarily, DAPD aligns the local prototypes of old classes extracted from the current model with both global prototypes and local prototypes, further enhancing segmentation performance on old categories. Comprehensive evaluations on two widely used multi-organ segmentation benchmarks demonstrate that our method outperforms state-of-the-art methods, highlighting its robustness and generalization capabilities.
[129] FaSDiff: Balancing Perception and Semantics in Face Compression via Stable Diffusion Priors
Yimin Zhou, Yichong Xia, Bin Chen, Mingyao Hong, Jiawei Li, Zhi Wang, Yaowei Wang
Main category: cs.CV
TL;DR: FaSDiff is a novel diffusion-based facial image compression framework that enhances both visual fidelity and semantic consistency by incorporating high-frequency-sensitive compression and low-frequency enhancement modules.
Details
Motivation: Traditional learning-based face compression methods degrade at low bit rates, and direct application of diffusion priors leads to poor preservation of high-frequency details and suboptimal machine vision performance.
Method: FaSDiff uses a high-frequency-sensitive compressor to capture fine details and generate visual prompts, plus a hybrid low-frequency enhancement module that disentangles semantic structures to enable stable diffusion prior modulation during reconstruction.
Result: Extensive experiments show FaSDiff outperforms state-of-the-art methods in both perceptual metrics and downstream task performance.
Conclusion: FaSDiff effectively balances human visual fidelity and machine vision accuracy by jointly optimizing perceptual quality and semantic preservation through diffusion-driven compression.
Abstract: With the increasing deployment of facial image data across a wide range of applications, efficient compression tailored to facial semantics has become critical for both storage and transmission. While recent learning-based face image compression methods have achieved promising results, they often suffer from degraded reconstruction quality at low bit rates. Directly applying diffusion-based generative priors to this task leads to suboptimal performance in downstream machine vision tasks, primarily due to poor preservation of high-frequency details. In this work, we propose FaSDiff (Facial Image Compression with a Stable Diffusion Prior), a novel diffusion-driven compression framework designed to enhance both visual fidelity and semantic consistency. FaSDiff incorporates a high-frequency-sensitive compressor to capture fine-grained details and generate robust visual prompts for guiding the diffusion model. To address low-frequency degradation, we further introduce a hybrid low-frequency enhancement module that disentangles and preserves semantic structures, enabling stable modulation of the diffusion prior during reconstruction. By jointly optimizing perceptual quality and semantic preservation, FaSDiff effectively balances human visual fidelity and machine vision accuracy. Extensive experiments demonstrate that FaSDiff outperforms state-of-the-art methods in both perceptual metrics and downstream task performance.
[130] Filtered-ViT: A Robust Defense Against Multiple Adversarial Patch Attacks
Aja Khanal, Ahmed Faid, Apurva Narayan
Main category: cs.CV
TL;DR: Filtered-ViT is a vision transformer with SMART-VMF filtering that defends against multiple adversarial patches while maintaining performance on clean images and real-world medical artifacts.
Details
Motivation: Current deep learning vision systems are vulnerable to multiple adversarial patches, especially in safety-critical domains like healthcare, and existing defenses fail against multi-patch attacks.
Method: Proposes Filtered-ViT architecture integrating SMART Vector Median Filtering - a spatially adaptive, multi-scale mechanism that selectively suppresses corrupted regions while preserving semantic details.
Result: Achieves 79.8% clean accuracy and 46.3% robust accuracy on ImageNet with four simultaneous 1% patches, outperforming existing defenses. Also effective on real-world medical imagery with natural artifacts.
Conclusion: Filtered-ViT is the first transformer demonstrating unified robustness against both adversarial and natural patch-like disruptions, enabling reliable vision systems in high-stakes environments.
Abstract: Deep learning vision systems are increasingly deployed in safety-critical domains such as healthcare, yet they remain vulnerable to small adversarial patches that can trigger misclassifications. Most existing defenses assume a single patch and fail when multiple localized disruptions occur, the type of scenario adversaries and real-world artifacts often exploit. We propose Filtered-ViT, a new vision transformer architecture that integrates SMART Vector Median Filtering (SMART-VMF), a spatially adaptive, multi-scale, robustness-aware mechanism that enables selective suppression of corrupted regions while preserving semantic detail. On ImageNet with LaVAN multi-patch attacks, Filtered-ViT achieves 79.8% clean accuracy and 46.3% robust accuracy under four simultaneous 1% patches, outperforming existing defenses. Beyond synthetic benchmarks, a real-world case study on radiographic medical imagery shows that Filtered-ViT mitigates natural artifacts such as occlusions and scanner noise without degrading diagnostic content. This establishes Filtered-ViT as the first transformer to demonstrate unified robustness against both adversarial and naturally occurring patch-like disruptions, charting a path toward reliable vision systems in truly high-stakes environments.
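The vector median filter at the core of SMART-VMF has a compact classical form: within each window, keep the pixel vector whose summed distance to all other vectors in the window is smallest. The sketch below shows this plain, non-adaptive variant; the spatially adaptive, multi-scale, robustness-aware extensions are the paper's contribution and are not reproduced here.

```python
import numpy as np

def vector_median_filter(img, k=3):
    """Classic vector median filter on an HxWxC image.

    For each kxk window, output the pixel *vector* minimizing the sum
    of L2 distances to all other vectors in the window, so channels
    are filtered jointly rather than independently.
    """
    h, w, c = img.shape
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(img)
    for i in range(h):
        for j in range(w):
            window = padded[i:i + k, j:j + k].reshape(-1, c)  # k*k vectors
            # pairwise L2 distances between all vectors in the window
            d = np.linalg.norm(window[:, None] - window[None, :], axis=-1)
            out[i, j] = window[d.sum(axis=1).argmin()]
    return out
```

Because the median is taken over whole vectors, an isolated adversarial pixel is replaced by a genuine neighboring pixel rather than a channel-wise average, which is why this family of filters suppresses patch-like corruption without blurring semantics.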
[131] Enhancing Diffusion Model Guidance through Calibration and Regularization
Seyed Alireza Javid, Amirhossein Bagheri, Nuria González-Prelcic
Main category: cs.CV
TL;DR: This paper addresses the issue of overconfident predictions in classifier-guided diffusion models by proposing calibration methods and enhanced sampling guidance that improve image generation quality without requiring diffusion model retraining.
Details
Motivation: Classifier-guided diffusion models suffer from overconfident predictions during early denoising steps, causing guidance gradients to vanish and limiting their effectiveness for conditional image generation.
Method: Two main approaches: 1) Differentiable calibration objective using Smooth Expected Calibration Error for classifier fine-tuning, 2) Enhanced sampling guidance methods including tilted sampling with batch-level reweighting, adaptive entropy-regularized sampling, and novel f-divergence-based sampling strategy.
Result: Achieved FID of 2.13 on ImageNet 128x128 using ResNet-101 classifier, improving upon existing classifier-guided diffusion methods while requiring no diffusion model retraining.
Conclusion: Principled calibration and divergence-aware sampling provide practical and effective improvements for classifier-guided diffusion models, addressing gradient vanishing issues and enhancing conditional image generation quality.
Abstract: Classifier-guided diffusion models have emerged as a powerful approach for conditional image generation, but they suffer from overconfident predictions during early denoising steps, causing the guidance gradient to vanish. This paper introduces two complementary contributions to address this issue. First, we propose a differentiable calibration objective based on the Smooth Expected Calibration Error (Smooth ECE), which improves classifier calibration with minimal fine-tuning and yields measurable improvements in Frechet Inception Distance (FID). Second, we develop enhanced sampling guidance methods that operate on off-the-shelf classifiers without requiring retraining. These include tilted sampling with batch-level reweighting, adaptive entropy-regularized sampling to preserve diversity, and a novel f-divergence-based sampling strategy that strengthens class-consistent guidance while maintaining mode coverage. Experiments on ImageNet 128x128 demonstrate that our divergence-regularized guidance achieves an FID of 2.13 using a ResNet-101 classifier, improving upon existing classifier-guided diffusion methods while requiring no diffusion model retraining. The results show that principled calibration and divergence-aware sampling provide practical and effective improvements for classifier-guided diffusion.
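For orientation, the quantity being made differentiable is the Expected Calibration Error. Below is a minimal binned-ECE sketch; the paper's Smooth ECE replaces the hard binning with a smooth surrogate suitable for fine-tuning, so this illustrative version is the non-differentiable baseline, not the paper's objective.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Standard binned ECE: bin-weighted mean |accuracy - confidence|.

    probs: (N, K) predicted class probabilities; labels: (N,) int targets.
    """
    conf = probs.max(axis=1)                   # top-1 confidence
    correct = probs.argmax(axis=1) == labels   # top-1 correctness
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

An overconfident classifier (high confidence, lower accuracy) yields a large ECE; driving it down keeps p(y|x_t) informative in early denoising steps, so the guidance gradient does not vanish.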
[132] Beyond Randomness: Understand the Order of the Noise in Diffusion
Song Yan, Min Li, Bi Xinliang, Jian Yang, Yusen Zhang, Guanye Xiong, Yunwei Lan, Tao Zhang, Wei Zhai, Zheng-Jun Zha
Main category: cs.CV
TL;DR: The paper reveals that initial noise in text-to-content diffusion models contains analyzable semantic patterns, not just randomness, and proposes a training-free “Semantic Erasure-Injection” method to modulate noise for better generation control.
Details
Motivation: Challenge the conventional view that initial noise in diffusion models is purely random, and demonstrate that noise actually contains rich semantic information that can be analyzed and manipulated.
Method: A two-step training-free process: 1) Semantic Erasure - remove unwanted semantics from noise using information theory principles, 2) Semantic Injection - inject desired semantics into the cleaned noise by leveraging the equivalence between diffusion generation process and semantic injection.
Result: The method is consistently effective across various text-to-content models based on both DiT and UNet architectures, providing universal optimization for consistent generation.
Conclusion: Initial noise in diffusion models contains analyzable semantic patterns, and the proposed Semantic Erasure-Injection approach offers a novel perspective and universal tool for optimizing diffusion model generation.
Abstract: In text-driven content generation (T2C) diffusion models, the semantics of the generated content are mostly attributed to the text embedding and the attention-mechanism interaction. The initial noise of the generation process is typically characterized as a random element that contributes to the diversity of the generated content. Contrary to this view, this paper reveals that beneath the random surface of noise lie strong, analyzable patterns. Specifically, we first conduct a comprehensive analysis of the impact of random noise on the model’s generation. We find that the noise not only contains rich semantic information, but also permits unwanted semantics to be erased from it in an extremely simple way grounded in information theory, and that the equivalence between the diffusion generation process and semantic injection can be used to inject semantics into the cleaned noise. We then formalize these observations mathematically and propose a simple but efficient, training-free, and universal two-step “Semantic Erasure-Injection” process to modulate the initial noise in T2C diffusion models. Experimental results demonstrate that our method is consistently effective across various T2C models based on both DiT and UNet architectures, presenting a novel perspective on optimizing diffusion model generation and providing a universal tool for consistent generation.
[133] Semantic-Consistent Bidirectional Contrastive Hashing for Noisy Multi-Label Cross-Modal Retrieval
Likang Peng, Chao Su, Wenyuan Wu, Yuan Sun, Dezhong Peng, Xi Peng, Xu Wang
Main category: cs.CV
TL;DR: SCBCH is a novel cross-modal hashing framework that addresses label noise and semantic overlap issues in multi-label datasets through semantic-consistent classification and bidirectional soft contrastive learning.
Details
Motivation: Existing cross-modal hashing methods rely on fully annotated datasets and are vulnerable to label noise, while also ignoring partial semantic overlaps in multi-label data, limiting their robustness.
Method: Proposes SCBCH with two modules: (1) Cross-modal Semantic-Consistent Classification (CSCC) that estimates sample reliability using cross-modal semantic consistency to mitigate noisy labels, and (2) Bidirectional Soft Contrastive Hashing (BSCH) that generates soft contrastive pairs based on multi-label semantic overlap for adaptive contrastive learning.
Result: Extensive experiments on four cross-modal retrieval benchmarks show SCBCH consistently outperforms state-of-the-art methods under noisy multi-label conditions, demonstrating effectiveness and robustness.
Conclusion: SCBCH effectively addresses label noise and semantic overlap challenges in cross-modal hashing, providing a robust solution for real-world multi-label retrieval scenarios with noisy annotations.
Abstract: Cross-modal hashing (CMH) facilitates efficient retrieval across different modalities (e.g., image and text) by encoding data into compact binary representations. While recent methods have achieved remarkable performance, they often rely heavily on fully annotated datasets, which are costly and labor-intensive to obtain. In real-world scenarios, particularly in multi-label datasets, label noise is prevalent and severely degrades retrieval performance. Moreover, existing CMH approaches typically overlook the partial semantic overlaps inherent in multi-label data, limiting their robustness and generalization. To tackle these challenges, we propose a novel framework named Semantic-Consistent Bidirectional Contrastive Hashing (SCBCH). The framework comprises two complementary modules: (1) Cross-modal Semantic-Consistent Classification (CSCC), which leverages cross-modal semantic consistency to estimate sample reliability and reduce the impact of noisy labels; (2) Bidirectional Soft Contrastive Hashing (BSCH), which dynamically generates soft contrastive sample pairs based on multi-label semantic overlap, enabling adaptive contrastive learning between semantically similar and dissimilar samples across modalities. Extensive experiments on four widely-used cross-modal retrieval benchmarks validate the effectiveness and robustness of our method, consistently outperforming state-of-the-art approaches under noisy multi-label conditions.
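One natural way to turn multi-label semantic overlap into soft contrastive pair weights is Jaccard similarity between label vectors. The sketch below is an illustrative instantiation of that idea, not necessarily the exact weighting BSCH uses.

```python
import numpy as np

def soft_pair_weights(Y):
    """Soft contrastive pair weights from multi-label overlap.

    Y: (N, L) binary multi-label matrix. Returns an (N, N) matrix of
    Jaccard similarities |y_i AND y_j| / |y_i OR y_j|, grading each pair
    between fully similar (1) and fully dissimilar (0) instead of the
    hard positive/negative split of standard contrastive learning.
    """
    Y = Y.astype(float)
    inter = Y @ Y.T                                  # shared labels per pair
    union = Y.sum(1)[:, None] + Y.sum(1)[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
```

Such weights can then scale the attraction/repulsion terms of a contrastive loss, so partially overlapping samples are pulled together in proportion to how many labels they share.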
[134] Divide-and-Conquer Decoupled Network for Cross-Domain Few-Shot Segmentation
Runmin Cong, Anpeng Wang, Bin Wan, Cong Zhang, Xiaofei Zhou, Wei Zhang
Main category: cs.CV
TL;DR: DCDNet addresses cross-domain few-shot segmentation by decoupling entangled features into category-relevant private and domain-relevant shared representations, then adaptively fusing them to improve generalization and adaptation with limited annotations.
Details
Motivation: Encoder features often entangle domain-relevant and category-relevant information, limiting both generalization and rapid adaptation to new domains in cross-domain few-shot segmentation tasks.
Method: Proposes Divide-and-Conquer Decoupled Network (DCDNet) with three modules: Adversarial-Contrastive Feature Decomposition (ACFD) for feature decoupling, Matrix-Guided Dynamic Fusion (MGDF) for adaptive feature integration, and Cross-Adaptive Modulation (CAM) for enhanced generalization during fine-tuning.
Result: Extensive experiments on four challenging datasets show that DCDNet outperforms existing CD-FSS methods, setting a new state-of-the-art for cross-domain generalization and few-shot adaptation.
Conclusion: DCDNet effectively addresses feature entanglement in cross-domain few-shot segmentation through its decoupled approach and adaptive fusion mechanisms, achieving superior performance across multiple domains.
Abstract: Cross-domain few-shot segmentation (CD-FSS) aims to tackle the dual challenge of recognizing novel classes and adapting to unseen domains with limited annotations. However, encoder features often entangle domain-relevant and category-relevant information, limiting both generalization and rapid adaptation to new domains. To address this issue, we propose a Divide-and-Conquer Decoupled Network (DCDNet). In the training stage, to tackle feature entanglement that impedes cross-domain generalization and rapid adaptation, we propose the Adversarial-Contrastive Feature Decomposition (ACFD) module. It decouples backbone features into category-relevant private and domain-relevant shared representations via contrastive learning and adversarial learning. Then, to mitigate the potential degradation caused by the disentanglement, the Matrix-Guided Dynamic Fusion (MGDF) module adaptively integrates base, shared, and private features under spatial guidance, maintaining structural coherence. In addition, in the fine-tuning stage, to enhanced model generalization, the Cross-Adaptive Modulation (CAM) module is placed before the MGDF, where shared features guide private features via modulation ensuring effective integration of domain-relevant information. Extensive experiments on four challenging datasets show that DCDNet outperforms existing CD-FSS methods, setting a new state-of-the-art for cross-domain generalization and few-shot adaptation.
[135] Learning Sparse Label Couplings for Multilabel Chest X-Ray Diagnosis
Utkarsh Prakash Srivastava, Kaushik Gupta, Kaushik Nath
Main category: cs.CV
TL;DR: A strong multilabel chest X-ray classification pipeline using SE-ResNeXt101 with Label-Graph Refinement module that learns inter-label relationships to improve performance.
Details
Motivation: To develop a practical and effective multilabel classification system for chest X-rays that addresses class imbalance and leverages label co-occurrence patterns.
Method: Fine-tuned SE-ResNeXt101 with sigmoid head, trained using Multilabel Iterative Stratification, Asymmetric Loss, and various optimization techniques. Added a lightweight Label-Graph Refinement module that learns sparse inter-label coupling matrix.
Result: Baseline achieves 92.64% macro AUC, and Label-Graph Refinement consistently improves validation macro AUC across folds with negligible computational overhead.
Conclusion: The method provides a reproducible, hardware-friendly approach for stronger multilabel CXR classifiers without requiring extra annotations.
Abstract: We study multilabel classification of chest X-rays and present a simple, strong pipeline built on SE-ResNeXt101 (32×4d). The backbone is fine-tuned for 14 thoracic findings with a sigmoid head, trained using Multilabel Iterative Stratification (MIS) for robust cross-validation splits that preserve label co-occurrence. To address extreme class imbalance and asymmetric error costs, we optimize with Asymmetric Loss, employ mixed-precision (AMP), cosine learning-rate decay with warm-up, gradient clipping, and an exponential moving average (EMA) of weights. We propose a lightweight Label-Graph Refinement module placed after the classifier: given per-label probabilities, it learns a sparse, trainable inter-label coupling matrix that refines logits via a single message-passing step while adding only an L1-regularized parameter head. At inference, we apply horizontal flip test-time augmentation (TTA) and average predictions across MIS folds (a compact deep ensemble). Evaluation uses macro AUC, averaging classwise ROC-AUC and skipping single-class labels in a fold, to reflect balanced performance across conditions. On our dataset, a strong SE-ResNeXt101 baseline attains competitive macro AUC (e.g., 92.64% in our runs). Adding the Label-Graph Refinement consistently improves validation macro AUC across folds with negligible compute. The resulting method is reproducible, hardware-friendly, and requires no extra annotations, offering a practical route to stronger multilabel CXR classifiers.
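The Label-Graph Refinement step can be sketched as one message pass of per-label probabilities through a trainable coupling matrix, with an L1 penalty keeping the matrix sparse. The function below is a hypothetical forward pass consistent with the abstract's description; the names and the exact update rule are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def refine_logits(logits, coupling, l1_weight=1e-3):
    """One message-passing step over a learned label-coupling matrix.

    logits:   (N, L) per-label logits from the base classifier.
    coupling: (L, L) trainable inter-label matrix, kept sparse by the
              L1 penalty returned alongside the refined logits.
    """
    probs = sigmoid(logits)              # per-label probabilities
    refined = logits + probs @ coupling  # single message pass over labels
    l1_penalty = l1_weight * np.abs(coupling).sum()
    return refined, l1_penalty
```

During training, the L1 penalty would be added to the classification loss, so only strongly co-occurring label pairs retain nonzero coupling entries.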
[136] PC-Diffusion: Aligning Diffusion Models with Human Preferences via Preference Classifier
Shaomeng Wang, He Wang, Xiaolu Wei, Longquan Dai, Jinhui Tang
Main category: cs.CV
TL;DR: PC-Diffusion is a lightweight framework that uses a trainable Preference Classifier to align diffusion models with human preferences, eliminating the need for full model fine-tuning and reference models while achieving comparable performance to DPO at lower computational cost.
Details
Motivation: Address limitations of DPO-like methods in diffusion models: high computational cost from full model fine-tuning and sensitivity to reference model quality causing instability and bias.
Method: Proposes a lightweight Preference Classifier that directly models relative preference between samples, decoupling preference alignment from the generative model. Uses preference-guided correction to steer generation toward preferred regions.
Result: Achieves comparable preference consistency to DPO while significantly reducing training costs. Enables efficient and stable preference-guided generation without reference model reliance.
Conclusion: PC-Diffusion provides an effective alternative to DPO for human preference alignment in diffusion models, offering theoretical guarantees and practical efficiency advantages.
Abstract: Diffusion models have achieved remarkable success in conditional image generation, yet their outputs often remain misaligned with human preferences. To address this, recent work has applied Direct Preference Optimization (DPO) to diffusion models, yielding significant improvements. However, DPO-like methods exhibit two key limitations: 1) high computational cost, due to fine-tuning the entire model; 2) sensitivity to reference model quality, due to its tendency to introduce instability and bias. To overcome these limitations, we propose a novel framework for human preference alignment in diffusion models (PC-Diffusion), using a lightweight, trainable Preference Classifier that directly models the relative preference between samples. By restricting preference learning to this classifier, PC-Diffusion decouples preference alignment from the generative model, eliminating the need for entire-model fine-tuning and reference-model reliance. We further provide theoretical guarantees for PC-Diffusion: 1) PC-Diffusion ensures that the preference-guided distributions are consistently propagated across timesteps. 2) The training objective of the preference classifier is equivalent to DPO, but does not require a reference model. 3) The proposed preference-guided correction can progressively steer generation toward preference-aligned regions. Empirical results show that PC-Diffusion achieves comparable preference consistency to DPO while significantly reducing training costs and enabling efficient and stable preference-guided generation.
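A preference classifier of this kind is typically trained with a Bradley-Terry style pairwise objective which, unlike DPO, involves no reference-model log-probabilities. The sketch below shows that generic recipe; the paper's exact parameterization may differ.

```python
import numpy as np

def preference_loss(score_win, score_lose):
    """Bradley-Terry style pairwise loss for a preference classifier.

    score_win / score_lose: scores the classifier assigns to the
    preferred and dispreferred sample. Minimizing -log sigmoid(s_w - s_l)
    pushes the classifier to rank preferred samples higher; no reference
    model enters the objective.
    """
    margin = np.asarray(score_win) - np.asarray(score_lose)
    # numerically stable -log sigmoid(margin) == softplus(-margin)
    return np.logaddexp(0.0, -margin)
```

The loss is log 2 when the classifier is indifferent and decays toward zero as the preferred sample's score pulls ahead, which is the gradient signal used to steer generation toward preferred regions.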
[137] DI3CL: Contrastive Learning With Dynamic Instances and Contour Consistency for SAR Land-Cover Classification Foundation Model
Zhongle Ren, Hui Ding, Kai Wang, Biao Hou, Xingyu Luo, Weibin Li, Licheng Jiao
Main category: cs.CV
TL;DR: A foundation model for SAR land-cover classification using contrastive learning with dynamic instance and contour consistency modules, trained on a large-scale SARSense dataset.
Details
Motivation: Current SAR classification methods rely heavily on supervised learning with extensive labeled data, limiting scalability, generalization, and adaptability to diverse applications.
Method: Dynamic Instance and Contour Consistency Contrastive Learning (DI3CL) framework with DI module for global contextual awareness and CC module for geometric contour focus, pre-trained on 460,532 SAR images from SARSense dataset.
Result: Outperforms existing methods across various SAR land-cover classification tasks including land-cover mapping, water body detection, and road extraction.
Conclusion: DI3CL serves as a robust foundation model that accelerates development and deployment of downstream SAR classification models with improved generalization capability.
Abstract: Although significant advances have been achieved in SAR land-cover classification, recent methods remain predominantly focused on supervised learning, which relies heavily on extensive labeled datasets. This dependency not only limits scalability and generalization but also restricts adaptability to diverse application scenarios. In this paper, a general-purpose foundation model for SAR land-cover classification is developed, serving as a robust cornerstone to accelerate the development and deployment of various downstream models. Specifically, a Dynamic Instance and Contour Consistency Contrastive Learning (DI3CL) pre-training framework is presented, which incorporates a Dynamic Instance (DI) module and a Contour Consistency (CC) module. DI module enhances global contextual awareness by enforcing local consistency across different views of the same region. CC module leverages shallow feature maps to guide the model to focus on the geometric contours of SAR land-cover objects, thereby improving structural discrimination. Additionally, to enhance robustness and generalization during pre-training, a large-scale and diverse dataset named SARSense, comprising 460,532 SAR images, is constructed to enable the model to capture comprehensive and representative features. To evaluate the generalization capability of our foundation model, we conducted extensive experiments across a variety of SAR land-cover classification tasks, including SAR land-cover mapping, water body detection, and road extraction. The results consistently demonstrate that the proposed DI3CL outperforms existing methods. Our code and pre-trained weights are publicly available at: https://github.com/SARpre-train/DI3CL.
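The cross-view local-consistency constraint in the DI module builds on the standard InfoNCE contrastive objective, sketched here in its generic form (the paper's exact loss may differ).

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE loss between two views of the same regions.

    z1, z2: (N, D) embeddings of matched views; row i of z1 and z2 come
    from the same region (positives), all other rows act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal
```

Minimizing this loss pulls embeddings of the same region under different views together while pushing apart embeddings of different regions, which is what enforces the "local consistency across different views" described above.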
[138] Revisiting MLLM Based Image Quality Assessment: Errors and Remedy
Zhenchen Tang, Songlin Yang, Bo Peng, Zichuan Wang, Jing Dong
Main category: cs.CV
TL;DR: Q-Scorer is a framework that addresses the mismatch between MLLMs’ discrete outputs and IQA’s continuous score requirements by adding a regression module and IQA-specific tokens, achieving SOTA performance.
Details
Motivation: The inherent mismatch between MLLMs' discrete token outputs and IQA's continuous quality scores hinders performance, with previous conversion methods suffering from errors and semantic confusion from level tokens.
Method: Proposes Q-Scorer framework with a lightweight regression module and IQA-specific score tokens integrated into the MLLM pipeline, based on theoretical analysis of previous approaches’ errors.
Result: Achieves state-of-the-art performance across multiple IQA benchmarks, shows good generalization to mixed datasets, and further improves when combined with other methods.
Conclusion: Q-Scorer effectively bridges the gap between MLLMs and IQA tasks, overcoming limitations of previous approaches while maintaining MLLMs’ original capabilities.
Abstract: The rapid progress of multi-modal large language models (MLLMs) has boosted the task of image quality assessment (IQA). However, a key challenge arises from the inherent mismatch between the discrete token outputs of MLLMs and the continuous nature of quality scores required by IQA tasks. This discrepancy significantly hinders the performance of MLLM-based IQA methods. Previous approaches that convert discrete token predictions into continuous scores often suffer from conversion errors. Moreover, the semantic confusion introduced by level tokens (e.g., “good”) further constrains the performance of MLLMs on IQA tasks and degrades their original capabilities for related tasks. To tackle these problems, we provide a theoretical analysis of the errors inherent in previous approaches and, motivated by this analysis, propose a simple yet effective framework, Q-Scorer. This framework incorporates a lightweight regression module and IQA-specific score tokens into the MLLM pipeline. Extensive experiments demonstrate that Q-Scorer achieves state-of-the-art performance across multiple IQA benchmarks, generalizes well to mixed datasets, and further improves when combined with other methods.
[139] Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views
Haida Feng, Hao Wei, Zewen Xu, Haolin Wang, Chade Li, Yihong Wu
Main category: cs.CV
TL;DR: Sparse3DPR is a training-free framework for 3D scene understanding that uses sparse-view RGB inputs and leverages LLMs’ reasoning capabilities through a hierarchical plane-enhanced scene graph and task-adaptive subgraph extraction.
Details
Motivation: Current training-free approaches for 3D scene understanding struggle with accuracy and efficiency in practical deployment, despite their flexibility and generalization advantages over training-based methods.
Method: Proposes a hierarchical plane-enhanced scene graph using dominant planar structures as spatial anchors, and a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically.
Result: Achieves 28.7% EM@1 improvement and 78.2% speedup compared to ConceptGraphs on Space3D-Bench, and comparable performance to training-based methods on ScanQA with confirmed robustness in real-world experiments.
Conclusion: Sparse3DPR demonstrates that training-free approaches can achieve competitive performance in 3D scene understanding while maintaining efficiency and generalization capabilities.
Abstract: Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address the problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.
[140] Cancer-Net PCa-MultiSeg: Multimodal Enhancement of Prostate Cancer Lesion Segmentation Using Synthetic Correlated Diffusion Imaging
Jarett Dewbury, Chi-en Amy Tai, Alexander Wong
Main category: cs.CV
TL;DR: Synthetic correlated diffusion imaging (CDI^s) enhances prostate cancer lesion segmentation, achieving up to 72.5% improvement over baseline methods without requiring additional scan time.
Details
Motivation: Current deep learning approaches for prostate cancer lesion segmentation achieve limited performance (Dice scores ≤0.32), necessitating improved imaging techniques.
Method: Comprehensive evaluation of CDI^s integration across six state-of-the-art segmentation architectures using 200 patients with co-registered CDI^s, DWI and ADC sequences.
Result: CDI^s integration enhances or preserves segmentation performance in 94% of configurations, with CDI^s + DWI achieving significant improvements in half of architectures with zero degradation.
Conclusion: CDI^s enables immediate clinical deployment as a practical drop-in enhancement for prostate cancer lesion segmentation across diverse deep learning architectures.
Abstract: Current deep learning approaches for prostate cancer lesion segmentation achieve limited performance, with Dice scores of 0.32 or lower in large patient cohorts. To address this limitation, we investigate synthetic correlated diffusion imaging (CDI$^s$) as an enhancement to standard diffusion-based protocols. We conduct a comprehensive evaluation across six state-of-the-art segmentation architectures using 200 patients with co-registered CDI$^s$, diffusion-weighted imaging (DWI) and apparent diffusion coefficient (ADC) sequences. We demonstrate that CDI$^s$ integration reliably enhances or preserves segmentation performance in 94% of evaluated configurations, with individual architectures achieving up to 72.5% statistically significant relative improvement over baseline modalities. CDI$^s$ + DWI emerges as the safest enhancement pathway, achieving significant improvements in half of evaluated architectures with zero instances of degradation. Since CDI$^s$ derives from existing DWI acquisitions without requiring additional scan time or architectural modifications, it enables immediate deployment in clinical workflows. Our results establish validated integration pathways for CDI$^s$ as a practical drop-in enhancement for PCa lesion segmentation tasks across diverse deep learning architectures.
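The Dice score cited above is the standard overlap metric for segmentation masks, 2|A∩B| / (|A| + |B|); a minimal reference implementation:

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# two 4x4 masks, each with 4 positive pixels, overlapping in 2
a = np.zeros((4, 4)); a[0, 0:4] = 1
b = np.zeros((4, 4)); b[0, 2:4] = 1; b[1, 0:2] = 1
print(dice_score(a, b))  # 2*2 / (4+4) = 0.5
```

A cohort-level Dice of 0.32 on this scale indicates that baseline lesion masks overlap ground truth by only about a third, which is why the paper looks to the input modality rather than the architecture for improvement.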
[141] Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy
Gong Jingyu, Tong Kunkun, Chen Zhuoran, Yuan Chuanhan, Chen Mingang, Zhang Zhizhong, Tan Xin, Xie Yuan
Main category: cs.CV
TL;DR: SSOMotion is a human motion synthesis framework that uses unified Scene Semantic Occupancy (SSO) for better scene understanding, combining structural and semantic information through bi-directional tri-plane decomposition and CLIP encoding.
Details
Motivation: Current human motion synthesis methods focus mainly on scene structure but ignore semantic understanding, limiting their ability to generate realistic motions in complex environments.
Method: Proposes SSOMotion framework with bi-directional tri-plane decomposition for compact SSO representation, CLIP encoding for semantic mapping, and frame-wise scene query for motion control using scene hints and movement directions.
Result: Extensive experiments on ShapeNet furniture, PROX, and Replica datasets demonstrate cutting-edge performance, validating effectiveness and generalization ability in cluttered scenes.
Conclusion: SSOMotion successfully integrates semantic understanding with structural scene representation for improved human motion synthesis, achieving state-of-the-art results while reducing computational redundancy.
Abstract: Human motion synthesis in 3D scenes relies heavily on scene comprehension, while current methods focus mainly on scene structure but ignore semantic understanding. In this paper, we propose a human motion synthesis framework that takes a unified Scene Semantic Occupancy (SSO) for scene representation, termed SSOMotion. We design a bi-directional tri-plane decomposition to derive a compact version of the SSO, and scene semantics are mapped to a unified feature space via CLIP encoding and shared linear dimensionality reduction. This strategy derives fine-grained scene semantic structures while significantly reducing redundant computations. We further use these scene hints, together with movement directions derived from instructions, for motion control via frame-wise scene query. Extensive experiments and ablation studies conducted on cluttered scenes using ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate its cutting-edge performance while validating its effectiveness and generalization ability. Code will be publicly available at https://github.com/jingyugong/SSOMotion.
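A rough sketch of what a tri-plane factorization buys: collapsing a dense 3D feature volume onto three orthogonal planes. Mean-pooling here is an illustrative stand-in for the paper's learned bi-directional decomposition.

```python
import numpy as np

def triplane_decompose(vol):
    """Collapse a dense (X, Y, Z, C) feature volume onto three orthogonal
    planes by mean-pooling along each spatial axis."""
    xy = vol.mean(axis=2)  # (X, Y, C): pooled over Z
    xz = vol.mean(axis=1)  # (X, Z, C): pooled over Y
    yz = vol.mean(axis=0)  # (Y, Z, C): pooled over X
    return xy, xz, yz

vol = np.random.default_rng(0).normal(size=(8, 8, 8, 4))
planes = triplane_decompose(vol)
dense, compact = vol.size, sum(p.size for p in planes)
print(dense, compact)  # 2048 vs 768: the tri-plane form is far smaller
```

The compression ratio grows with resolution (O(n^3) vs O(n^2) per channel), which is where the claimed reduction in redundant computation comes from.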
[142] CloudMamba: Grouped Selective State Spaces for Point Cloud Analysis
Kanglin Qu, Pan Gao, Qun Dai, Zhanzhi Ye, Rui Ye, Yuanhao Sun
Main category: cs.CV
TL;DR: CloudMamba is an SSM-based point cloud network that addresses challenges in point cloud serialization, geometric perception, and S6 overfitting through sequence expanding/merging, chainedMamba, and grouped S6.
Details
Motivation: Existing Mamba-based point cloud analysis suffers from imperfect point cloud serialization, insufficient high-level geometric perception, and overfitting of the selective state space model (S6).
Method: Proposes sequence expanding (serializing points along each axis separately) and sequence merging (fusing higher-order features), chainedMamba for bidirectional geometric scanning, and grouped S6 with parameter sharing to reduce overfitting.
Result: Achieves state-of-the-art results on various point cloud tasks with significantly less complexity.
Conclusion: CloudMamba effectively addresses the key challenges in Mamba-based point cloud analysis and demonstrates superior performance with reduced computational complexity.
Abstract: Due to the long-range modeling ability and linear complexity property, Mamba has attracted considerable attention in point cloud analysis. Despite some interesting progress, related work still suffers from imperfect point cloud serialization, insufficient high-level geometric perception, and overfitting of the selective state space model (S6) at the core of Mamba. To this end, we resort to an SSM-based point cloud network termed CloudMamba to address the above challenges. Specifically, we propose sequence expanding and sequence merging, where the former serializes points along each axis separately and the latter serves to fuse the corresponding higher-order features causally inferred from different sequences, enabling unordered point sets to adapt more stably to the causal nature of Mamba without parameters. Meanwhile, we design chainedMamba that chains the forward and backward processes in the parallel bidirectional Mamba, capturing high-level geometric information during scanning. In addition, we propose a grouped selective state space model (GS6) via parameter sharing on S6, alleviating the overfitting problem caused by the computational mode in S6. Experiments on various point cloud tasks validate CloudMamba’s ability to achieve state-of-the-art results with significantly less complexity.
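The sequence-expanding step, serializing an unordered point set along each coordinate axis separately so each sequence has a stable causal order, can be sketched as:

```python
import numpy as np

def expand_sequences(points):
    """Serialize an unordered (n, 3) point set along each axis separately:
    one causally ordered sequence per coordinate (x, y, z)."""
    return [points[np.argsort(points[:, axis], kind="stable")]
            for axis in range(points.shape[1])]

pts = np.array([[2.0, 0.0, 1.0],
                [0.0, 2.0, 0.0],
                [1.0, 1.0, 2.0]])
seq_x, seq_y, seq_z = expand_sequences(pts)
print(seq_x[:, 0])  # ascending along x
```

Each axis-sorted sequence is then processed causally by Mamba, and the paper's sequence merging fuses the per-axis features afterwards.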
[143] MonoCLUE : Object-Aware Clustering Enhances Monocular 3D Object Detection
Sunghun Yang, Minhyeok Lee, Jungho Lee, Sangyoun Lee
Main category: cs.CV
TL;DR: MonoCLUE enhances monocular 3D object detection by combining local clustering of visual features with generalized scene memory to address geometric ambiguity and improve detection in occluded/truncated scenes.
Details
Motivation: Monocular 3D detection suffers from ill-posed depth and limited field of view, causing lack of geometric cues and reduced accuracy in occluded/truncated scenes. Existing approaches overlook visual cues crucial for robust recognition.
Method: 1) K-means clustering on visual features to capture distinct object-level appearance parts; 2) Construct generalized scene memory by aggregating clustered features across images; 3) Integrate both local cluster features and scene memory into object queries to guide attention.
Result: Achieves state-of-the-art performance on KITTI benchmark, enabling robust detection under occlusion and limited visibility.
Conclusion: MonoCLUE’s unified local clustering and generalized scene memory strategy effectively addresses geometric ambiguity and improves monocular 3D detection robustness in challenging scenarios.
Abstract: Monocular 3D object detection offers a cost-effective solution for autonomous driving but suffers from ill-posed depth and limited field of view. These constraints cause a lack of geometric cues and reduced accuracy in occluded or truncated scenes. While recent approaches incorporate additional depth information to address geometric ambiguity, they overlook the visual cues crucial for robust recognition. We propose MonoCLUE, which enhances monocular 3D detection by leveraging both local clustering and generalized scene memory of visual features. First, we perform K-means clustering on visual features to capture distinct object-level appearance parts (e.g., bonnet, car roof), improving detection of partially visible objects. The clustered features are propagated across regions to capture objects with similar appearances. Second, we construct a generalized scene memory by aggregating clustered features across images, providing consistent representations that generalize across scenes. This improves object-level feature consistency, enabling stable detection across varying environments. Lastly, we integrate both local cluster features and generalized scene memory into object queries, guiding attention toward informative regions. Exploiting a unified local clustering and generalized scene memory strategy, MonoCLUE enables robust monocular 3D detection under occlusion and limited visibility, achieving state-of-the-art performance on the KITTI benchmark.
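The first step, K-means over visual features, can be sketched with a tiny NumPy implementation; the deterministic evenly-spaced initialization is an illustrative shortcut, not the paper's setup.

```python
import numpy as np

def kmeans(features, k, iters=10):
    """Tiny K-means over (n, d) feature vectors: returns centroids and labels."""
    # deterministic init: evenly spaced samples stand in for proper seeding
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    centroids = features[idx].copy()
    for _ in range(iters):
        # assign each feature to its nearest centroid
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned features
        for j in range(k):
            if (labels == j).any():
                centroids[j] = features[labels == j].mean(axis=0)
    return centroids, labels

# two well-separated blobs of 2-D "pixel features" (e.g. bonnet vs roof parts)
feats = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
_, labels = kmeans(feats, k=2)
print(labels)  # each blob collapses to its own cluster
```

In MonoCLUE the clustered features play the role of reusable appearance parts, which is what lets partially visible objects borrow evidence from similar regions.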
[144] Visual Bridge: Universal Visual Perception Representations Generating
Yilin Gao, Shuguang Dou, Junzhou Li, Zhiheng Yu, Yin Li, Dongsheng Jiang, Shugong Xu
Main category: cs.CV
TL;DR: A universal visual perception framework based on flow matching that generates diverse visual representations across multiple tasks, bridging the gap between heterogeneous vision tasks.
Details
Motivation: Overcome the limitations of "single-task-single-model" paradigm in diffusion models by leveraging cross-domain generalization ability similar to large language models for multi-task visual scenarios.
Method: Formulates visual perception as a universal flow-matching problem from image patch tokens to task-specific representations, using a self-supervised foundation model as anchor with multi-scale circular task embedding mechanism.
Result: Achieves competitive performance in classification, detection, segmentation, depth estimation, and image-text retrieval, outperforming prior generalist and specialist models in both zero-shot and fine-tuned settings.
Conclusion: Provides a significant step towards general-purpose visual perception and establishes a solid foundation for future universal vision modeling research.
Abstract: Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a "single-task-single-model" paradigm, severely limiting their generalizability and scalability in multi-task scenarios. Motivated by the cross-domain generalization ability of large language models, we propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations rather than an independent generation or regression problem. By leveraging a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, our method learns a universal velocity field to bridge the gap between heterogeneous tasks, supporting efficient and flexible representation transfer. Extensive experiments on classification, detection, segmentation, depth estimation, and image-text retrieval demonstrate that our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models. Ablation studies further validate the robustness, scalability, and generalization of our framework. Our work marks a significant step towards general-purpose visual perception, providing a solid foundation for future research in universal vision modeling.
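The core of flow matching with a linear probability path: sample a point x_t on the straight line between a source sample and its target, and regress the constant velocity x1 − x0. A minimal sketch of the training target (in the actual framework a network predicts the velocity from x_t and a task embedding):

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Linear-path flow matching: point on the path and its target velocity.
    x_t = (1 - t) * x0 + t * x1, so dx_t/dt = x1 - x0."""
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

x0 = np.array([0.0, 0.0])   # stand-in for an image patch token
x1 = np.array([2.0, 4.0])   # stand-in for a task-specific representation
xt, v = flow_matching_target(x0, x1, t=0.5)
print(xt, v)  # midpoint of the path, constant velocity along it
```

Training minimizes ||v_pred(x_t, t) − v_target||²; at inference, integrating the learned velocity field transports patch tokens into the representation space of the requested task.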
[145] Generating Sketches in a Hierarchical Auto-Regressive Process for Flexible Sketch Drawing Manipulation at Stroke-Level
Sicong Zang, Shuhui Gao, Zhijun Fang
Main category: cs.CV
TL;DR: Proposes a hierarchical auto-regressive sketch generation method that allows flexible stroke-level manipulation during the generation process, unlike previous methods that require all conditions to be set before generation starts.
Details
Motivation: Existing sketch generation methods require all stroke-level conditions to be set simultaneously before generation begins, preventing further manipulation during the drawing process. The goal is to enable more flexible sketch manipulation.
Method: Uses a three-stage hierarchical auto-regressive process: 1) predict stroke embeddings, 2) anchor strokes on canvas, 3) translate embeddings to drawing actions. Each stroke generation considers previously generated strokes and their positions.
Result: Enables flexible manipulation of stroke-level sketch drawing at any time during generation by adjusting editable stroke embeddings, providing more control over the sketch generation process.
Conclusion: The hierarchical auto-regressive approach successfully achieves flexible sketch manipulation during generation, overcoming limitations of previous methods that required all conditions to be set upfront.
Abstract: Generating sketches with specific patterns as expected, i.e., manipulating sketches in a controllable way, is a popular task. Recent studies control sketch features at the stroke level by editing the values of stroke embeddings as conditions. However, in order to give the generator a global view of what sketch is going to be drawn, all these edited conditions must be collected and fed into the generator simultaneously before generation starts, i.e., no further manipulation is allowed during the sketch generating process. To realize sketch drawing manipulation more flexibly, we propose a hierarchical auto-regressive sketch generating process. Instead of generating an entire sketch at once, each stroke in a sketch is generated in a three-stage hierarchy: 1) predicting a stroke embedding to represent which stroke is going to be drawn, 2) anchoring the predicted stroke on the canvas, and 3) translating the embedding into a sequence of drawing actions to form the full sketch. Moreover, the stroke prediction, anchoring, and translation proceed auto-regressively, i.e., both the recently generated strokes and their positions are considered when predicting the current one, guiding the model to produce an appropriate stroke at a suitable position to benefit the full sketch generation. It is flexible to manipulate stroke-level sketch drawing at any time during generation by adjusting the exposed editable stroke embeddings.
[146] Theoretical Analysis of Power-law Transformation on Images for Text Polarity Detection
Narendra Singh Yadav, Pavan Kumar Perepu
Main category: cs.CV
TL;DR: This paper provides a theoretical analysis of text polarity detection in images, focusing on how maximum between-class variance changes with power-law transformations for different text-background contrasts.
Details
Motivation: Text polarity detection is crucial for image binarization in applications like license plate recognition and character recognition. Existing intuitive approaches based on power-law transformations need theoretical validation.
Method: The authors conduct a theoretical analysis of the phenomenon where maximum between-class variance increases for dark text on bright background and decreases for bright text on dark background after power-law transformations.
Result: The paper presents theoretical validation of the empirical observation that between-class variance behavior can indicate text polarity in images.
Conclusion: The theoretical analysis confirms the intuitive approach for text polarity detection, providing a solid foundation for image binarization preprocessing tasks.
Abstract: In several computer vision applications, such as vehicle license plate recognition, captcha recognition, and printed or handwritten character recognition from images, text polarity detection and binarization are important preprocessing tasks. To analyze any image, it has to be converted to a simple binary image. This binarization process requires knowledge of the polarity of text in the images. Text polarity is defined as the contrast of text with respect to background: text is either darker than the background (dark text on bright background) or vice versa. The binarization process uses this polarity information to convert the original colour or gray-scale image into a binary image. In the literature, there is an intuitive approach based on power-law transformation of the original images. In this approach, the authors illustrated an interesting phenomenon in the histogram statistics of the transformed images. Considering text and background as two classes, they observed that the maximum between-class variance between the two classes increases (decreases) for dark (bright) text on a bright (dark) background. The corresponding empirical results have been presented. In this paper, we present a theoretical analysis of this phenomenon.
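The quantity under study, the maximum between-class variance over all thresholds (Otsu's criterion), and the claimed behavior under a power-law transform s = r^γ can be reproduced on a toy two-level image; the pixel values and proportions below are illustrative.

```python
import numpy as np

def max_between_class_variance(values, bins=256):
    """Maximum over all thresholds of the between-class variance
    w0 * w1 * (mu0 - mu1)^2 (Otsu's criterion), for values in [0, 1]."""
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best = 0.0
    for t in range(1, bins):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:t] * centers[:t]).sum() / w0
        mu1 = (p[t:] * centers[t:]).sum() / w1
        best = max(best, w0 * w1 * (mu0 - mu1) ** 2)
    return best

# dark text (20% of pixels at 0.3) on a bright background (80% at 0.9)
img = np.concatenate([np.full(20, 0.3), np.full(80, 0.9)])
before = max_between_class_variance(img)
after = max_between_class_variance(img ** 2.0)  # power-law s = r^gamma, gamma = 2
print(after > before)  # variance grows for dark-on-bright text, as the paper states
```

Intuitively, a gamma greater than 1 pushes dark text toward 0 faster than it pushes the bright background, widening the gap between class means; the paper supplies the formal argument behind this empirical check.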
[147] Exploring the Underwater World Segmentation without Extra Training
Bingyu Li, Tao Huo, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Main category: cs.CV
TL;DR: Introduces AquaOV255, the first large-scale underwater segmentation dataset with 255 categories, and Earth2Ocean, a training-free framework that transfers terrestrial vision-language models to underwater domains using geometric priors and semantic alignment.
Details
Motivation: Existing segmentation datasets and models are largely limited to terrestrial scenes, creating a gap for marine biodiversity monitoring and ecological assessment.
Method: Earth2Ocean framework with two components: Geometric-guided Visual Mask Generator (GMG) for local structure perception using self-similarity geometric priors, and Category-visual Semantic Alignment (CSA) module that enhances text embeddings through multimodal reasoning and scene-aware templates.
Result: Extensive experiments on the UOVSBench benchmark show significant performance improvement while maintaining efficient inference.
Conclusion: The proposed dataset and framework successfully bridge the gap between terrestrial and underwater segmentation, enabling effective open-vocabulary segmentation for marine organisms without additional underwater training.
Abstract: Accurate segmentation of marine organisms is vital for biodiversity monitoring and ecological assessment, yet existing datasets and models remain largely limited to terrestrial scenes. To bridge this gap, we introduce \textbf{AquaOV255}, the first large-scale and fine-grained underwater segmentation dataset containing 255 categories and over 20K images, covering diverse categories for open-vocabulary (OV) evaluation. Furthermore, we establish the first underwater OV segmentation benchmark, \textbf{UOVSBench}, by integrating AquaOV255 with five additional underwater datasets to enable comprehensive evaluation. Alongside, we present \textbf{Earth2Ocean}, a training-free OV segmentation framework that transfers terrestrial vision–language models (VLMs) to underwater domains without any additional underwater training. Earth2Ocean consists of two core components: a Geometric-guided Visual Mask Generator (\textbf{GMG}) that refines visual features via self-similarity geometric priors for local structure perception, and a Category-visual Semantic Alignment (\textbf{CSA}) module that enhances text embeddings through multimodal large language model reasoning and scene-aware template construction. Extensive experiments on the UOVSBench benchmark demonstrate that Earth2Ocean achieves significant performance improvement on average while maintaining efficient inference.
[148] HD$^2$-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving
Zhiwen Yang, Yuxin Peng
Main category: cs.CV
TL;DR: HD²-SSC framework addresses dimension and density gaps in camera-based 3D semantic scene completion for autonomous driving by expanding pixel semantics and refining voxel occupancies.
Details
Motivation: Existing SSC methods suffer from an input-output dimension gap (2D planar view vs 3D stereoscopic view) and an annotation-reality density gap (sparse labels vs dense real-world occupancy), leading to inferior predictions.
Method: Two main modules: 1) High-dimension Semantic Decoupling expands 2D image features along a pseudo third dimension to decouple coarse pixel semantics from occlusions and identify focal regions; 2) High-density Occupancy Refinement uses a detect-and-refine architecture to leverage contextual structures for completing missing voxels and correcting erroneous ones.
Result: Extensive experiments on SemanticKITTI and SSCBench-KITTI-360 datasets validate the effectiveness of the HD²-SSC framework.
Conclusion: The proposed framework successfully bridges dimension and density gaps in 3D semantic scene completion, improving scene understanding for autonomous driving applications.
Abstract: Camera-based 3D semantic scene completion (SSC) plays a crucial role in autonomous driving, enabling voxelized 3D scene understanding for effective scene perception and decision-making. Existing SSC methods have shown efficacy in improving 3D scene representations, but suffer from the inherent input-output dimension gap and annotation-reality density gap, where the 2D planar view from input images with sparse annotated labels leads to inferior prediction of real-world dense occupancy with a 3D stereoscopic view. In light of this, we propose the corresponding High-Dimension High-Density Semantic Scene Completion (HD$^2$-SSC) framework with expanded pixel semantics and refined voxel occupancies. To bridge the dimension gap, a High-dimension Semantic Decoupling module is designed to expand 2D image features along a pseudo third dimension, decoupling coarse pixel semantics from occlusions, and then identify focal regions with fine semantics to enrich image features. To mitigate the density gap, a High-density Occupancy Refinement module is devised with a “detect-and-refine” architecture to leverage contextual geometric and semantic structures for enhanced semantic density with the completion of missing voxels and correction of erroneous ones. Extensive experiments and analyses on the SemanticKITTI and SSCBench-KITTI-360 datasets validate the effectiveness of our HD$^2$-SSC framework.
[149] An Image-Based Path Planning Algorithm Using a UAV Equipped with Stereo Vision
Selim Ahmet Iz, Mustafa Unel
Main category: cs.CV
TL;DR: A novel image-based path planning algorithm using computer vision techniques, compared with A* and PRM algorithms, using disparity maps from UAVs for terrain-aware navigation.
Details
Motivation: Traditional 2D images cannot distinguish terrain depth features like craters and hills, which are critical for path safety in navigation systems.
Method: Uses disparity maps generated by UAVs, applies computer vision techniques (edge/line/corner detection, stereo depth reconstruction) to define way-points, and employs ArUco marker pose estimation and circle detection for automatic start/end point detection.
Result: The algorithm was tested in V-REP simulations and physical laboratory setups, showing promising results and effectiveness compared to A* and PRM algorithms.
Conclusion: The proposed image-based path planning algorithm successfully addresses terrain depth challenges and demonstrates competitive performance against established deterministic and probabilistic planning methods.
Abstract: This paper presents a novel image-based path planning algorithm that was developed using computer vision techniques, as well as its comparative analysis with well-known deterministic and probabilistic algorithms, namely A* and the Probabilistic Road Map algorithm (PRM). The terrain depth has a significant impact on the calculated path safety. The craters and hills on the surface cannot be distinguished in a two-dimensional image. The proposed method uses a disparity map of the terrain that is generated by using a UAV. Several computer vision techniques, including edge, line and corner detection methods, as well as the stereo depth reconstruction technique, are applied to the captured images, and the resulting disparity map is used to define candidate way-points of the trajectory. The initial and desired points are detected automatically using ArUco marker pose estimation and circle detection techniques. After presenting the mathematical model and vision techniques, the developed algorithm is compared with well-known algorithms on different virtual scenes created in the V-REP simulation program and a physical setup created in a laboratory environment. Results are promising and demonstrate the effectiveness of the proposed algorithm.
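Depth recovery from the disparity map follows the standard rectified-stereo relation Z = fB/d; the rig parameters below are hypothetical, not those of the paper's UAV.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Stereo depth for a rectified pair (pinhole model): Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# hypothetical stereo rig: 700 px focal length, 12 cm baseline
z = depth_from_disparity(disparity_px=21.0, focal_px=700.0, baseline_m=0.12)
print(z)  # 4.0 m: large disparity means near terrain, small disparity means far
```

It is this inverse relation that lets the disparity map expose craters and hills invisible in a flat 2D image, so way-points can be placed on terrain of safe depth.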
[150] Federated CLIP for Resource-Efficient Heterogeneous Medical Image Classification
Yihang Wu, Ahmad Chaddad
Main category: cs.CV
TL;DR: FedMedCLIP is a federated learning approach using CLIP for medical image classification that reduces communication and computational costs while maintaining performance.
Details
Motivation: Privacy concerns limit deep model training in medical imaging, and federated learning faces challenges with data heterogeneity and resource costs when using vision language models.
Method: Uses masked feature adaptation module for communication, freezes CLIP encoders, employs masked MLP as local classifier, adaptive KL divergence distillation, and model compression with ensemble predictions.
Result: Achieves 8% higher performance than second best baseline on ISIC2019 dataset and 120x faster than FedAVG with reasonable resource cost.
Conclusion: FedMedCLIP provides a feasible federated learning solution for medical image classification with improved performance and reduced resource requirements.
Abstract: Despite the remarkable performance of deep models in medical imaging, they still require source data for training, which limits their potential in light of privacy concerns. Federated learning (FL), as a decentralized learning framework that trains a shared model with multiple hospitals (a.k.a., FL clients), provides a feasible solution. However, data heterogeneity and resource costs hinder the deployment of FL models, especially when using vision language models (VLM). To address these challenges, we propose a novel contrastive language-image pre-training (CLIP) based FL approach for medical image classification (FedMedCLIP). Specifically, we introduce a masked feature adaptation module (FAM) as a communication module to reduce the communication load while freezing the CLIP encoders to reduce the computational overhead. Furthermore, we propose a masked multi-layer perceptron (MLP) as a private local classifier to adapt to the client tasks. Moreover, we design an adaptive Kullback-Leibler (KL) divergence-based distillation regularization method to enable mutual learning between FAM and MLP. Finally, we incorporate model compression to transmit the FAM parameters while using ensemble predictions for classification. Extensive experiments on four publicly available medical datasets demonstrate that our model provides feasible performance (e.g., 8% higher compared to second best baseline on ISIC2019) with reasonable resource cost (e.g., 120$\times$ faster than FedAVG).
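The KL-divergence distillation term enabling mutual learning between the two heads has the standard temperature-scaled form; a minimal sketch in which the FAM and MLP logits are stand-in vectors (the paper's adaptive weighting is omitted).

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax, numerically stabilized."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_distill(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * np.log(p / q)))

loss = kl_distill([2.0, 0.5, 0.1], [1.8, 0.6, 0.2])  # slight disagreement
same = kl_distill([2.0, 0.5, 0.1], [2.0, 0.5, 0.1])  # identical heads
print(loss, same)  # positive when the heads disagree, zero when they match
```

In mutual learning the term is applied in both directions, so each head is pulled toward the other's softened predictions rather than one fixed teacher.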
[151] Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers
Sida Huang, Siqi Huang, Ping Luo, Hongyuan Zhang
Main category: cs.CV
TL;DR: Proposes Laytrol Network for layout-to-image generation that preserves pretrained knowledge by inheriting parameters from MM-DiT and using a specialized initialization scheme to avoid distribution shift.
Details
Motivation: Existing layout-to-image methods suffer from low visual quality and stylistic inconsistency with base models due to loss of pretrained knowledge when integrating layout conditions.
Method: Uses Layout Synthesis dataset from base model images, Laytrol Network with parameter inheritance from MM-DiT, zero-initialized outputs, layout encoder initialized as text encoder, and Object-level Rotary Position Embedding.
Result: Qualitative and quantitative experiments show improved effectiveness in generating spatially consistent images while maintaining visual quality and style consistency.
Conclusion: The proposed approach effectively addresses the knowledge loss problem in layout-to-image generation by preserving pretrained model capabilities through careful parameter inheritance and initialization strategies.
Abstract: With the development of diffusion models, enhancing spatial controllability in text-to-image generation has become a vital challenge. As a representative task for addressing this challenge, layout-to-image generation aims to generate images that are spatially consistent with the given layout condition. Existing layout-to-image methods typically introduce the layout condition by integrating adapter modules into the base generative model. However, the generated images often exhibit low visual quality and stylistic inconsistency with the base model, indicating a loss of pretrained knowledge. To alleviate this issue, we construct the Layout Synthesis (LaySyn) dataset, which leverages images synthesized by the base model itself to mitigate the distribution shift from the pretraining data. Moreover, we propose the Layout Control (Laytrol) Network, in which parameters are inherited from MM-DiT to preserve the pretrained knowledge of the base model. To effectively activate the copied parameters and avoid disturbance from unstable control conditions, we adopt a dedicated initialization scheme for Laytrol. In this scheme, the layout encoder is initialized as a pure text encoder to ensure that its output tokens remain within the data domain of MM-DiT. Meanwhile, the outputs of the layout control network are initialized to zero. In addition, we apply Object-level Rotary Position Embedding to the layout tokens to provide coarse positional information. Qualitative and quantitative experiments demonstrate the effectiveness of our method.
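Zero-initializing the control network's outputs guarantees that the combined model exactly reproduces the pretrained branch at the start of training, so no pretrained knowledge is disturbed before the layout signal is learned. A minimal NumPy sketch of the idea (toy linear layers, not the MM-DiT architecture):

```python
import numpy as np

class ZeroInitControl:
    """Control branch whose output projection starts at zero, so adding its
    output leaves the pretrained branch untouched until training updates it."""
    def __init__(self, dim):
        rng = np.random.default_rng(0)
        self.w_in = rng.normal(size=(dim, dim)) * 0.02  # inherited/normal weights
        self.w_out = np.zeros((dim, dim))               # zero-init output proj

    def __call__(self, layout_tokens):
        return layout_tokens @ self.w_in @ self.w_out

base_out = np.ones((4, 8))  # stand-in output of the frozen pretrained branch
ctrl = ZeroInitControl(dim=8)
layout_tokens = np.random.default_rng(1).normal(size=(4, 8))
combined = base_out + ctrl(layout_tokens)
print(np.allclose(combined, base_out))  # True: the control branch is a no-op at init
```

Gradients still flow into `w_out` through the addition, so the branch gradually learns to inject layout information without ever having corrupted the base distribution.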
[152] DiffRegCD: Integrated Registration and Change Detection with Diffusion Features
Seyedehnanita Madani, Rama Chellappa, Vishal M. Patel
Main category: cs.CV
TL;DR: DiffRegCD is a unified framework that integrates dense registration and change detection in a single model, achieving superior performance on various datasets under large displacements and viewpoint variations.
Details
Motivation: Real-world change detection faces challenges with misalignment due to parallax, viewpoint shifts, and long temporal gaps, which existing methods struggle to handle effectively.
Method: Reformulates correspondence estimation as Gaussian smoothed classification for sub-pixel accuracy, leverages frozen multi-scale features from pretrained diffusion models, and uses controlled affine perturbations for supervision without pseudo labels.
Result: Extensive experiments show DiffRegCD consistently surpasses recent baselines on aerial and ground-level datasets, remaining reliable under wide temporal and geometric variations.
Conclusion: Diffusion features and classification-based correspondence provide a strong foundation for unified change detection, establishing DiffRegCD as an effective solution for misaligned imagery.
Abstract: Change detection (CD) is fundamental to computer vision and remote sensing, supporting applications in environmental monitoring, disaster response, and urban development. Most CD models assume co-registered inputs, yet real-world imagery often exhibits parallax, viewpoint shifts, and long temporal gaps that cause severe misalignment. Traditional two-stage methods that first register and then detect, as well as recent joint frameworks (e.g., BiFA, ChangeRD), still struggle under large displacements, relying on regression-only flow, global homographies, or synthetic perturbations. We present DiffRegCD, an integrated framework that unifies dense registration and change detection in a single model. DiffRegCD reformulates correspondence estimation as a Gaussian-smoothed classification task, achieving sub-pixel accuracy and stable training. It leverages frozen multi-scale features from a pretrained denoising diffusion model, ensuring robustness to illumination and viewpoint variation. Supervision is provided through controlled affine perturbations applied to standard CD datasets, yielding paired ground truth for both flow and change detection without pseudo labels. Extensive experiments on aerial (LEVIR-CD, DSIFN-CD, WHU-CD, SYSU-CD) and ground-level (VL-CMU-CD) datasets show that DiffRegCD consistently surpasses recent baselines and remains reliable under wide temporal and geometric variation, establishing diffusion features and classification-based correspondence as a strong foundation for unified change detection.
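The Gaussian-smoothed classification idea, predicting a soft distribution over discrete displacements rather than regressing a flow vector, can be sketched as follows. This is a generic illustration of the target construction, not the paper's exact implementation:

```python
import numpy as np

def gaussian_smoothed_target(gt_offset, grid_size=9, sigma=1.0):
    """Soft classification target over a grid of integer displacements,
    peaked at the (possibly sub-pixel) ground-truth offset."""
    r = grid_size // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    d2 = (xs - gt_offset[0]) ** 2 + (ys - gt_offset[1]) ** 2
    t = np.exp(-d2 / (2 * sigma ** 2))
    return t / t.sum()

t = gaussian_smoothed_target((0.5, -0.25))
# a soft-argmax over the grid recovers a sub-pixel displacement estimate
r = t.shape[0] // 2
ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
est = (float((t * xs).sum()), float((t * ys).sum()))
```

Training against such soft targets with cross-entropy tends to be more stable than direct regression, and the soft-argmax readout preserves sub-pixel precision.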
[153] Is It Truly Necessary to Process and Fit Minutes-Long Reference Videos for Personalized Talking Face Generation?
Rui-Qing Sun, Ang Li, Zhijing Wu, Tian Lan, Qianyu Lu, Xingshan Yao, Chen Xu, Xian-Ling Mao
Main category: cs.CV
TL;DR: ISExplore is a segment selection strategy that identifies informative 5-second video segments for talking face generation, achieving 5x faster processing while maintaining quality.
Details
Motivation: Current TFG methods require minutes of reference video processing, taking hours to train. Exploratory studies show short informative segments can achieve comparable or better performance than full videos.
Method: Proposes ISExplore strategy that automatically selects 5-second segments based on audio feature diversity, lip movement amplitude, and number of camera views.
Result: Achieves 5x faster data processing and training for NeRF and 3DGS methods while maintaining high-fidelity output quality.
Conclusion: Video informative quality is more important than length; short well-selected segments can replace long reference videos in talking face generation.
Abstract: Talking Face Generation (TFG) aims to produce realistic and dynamic talking portraits, with broad applications in fields such as digital education, film and television production, e-commerce live streaming, and other related areas. Currently, TFG methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) have received widespread attention. They learn and store personalized features from reference videos of each target individual to generate realistic speaking videos. To ensure models capture sufficient 3D information and successfully learn the lip-audio mapping, previous studies usually require meticulous processing and fitting of several minutes of reference video, which often takes hours. The computational burden of processing and fitting long reference videos severely limits the practical application value of these methods. However, is it really necessary to fit minutes of reference video? Our exploratory case studies show that using informative reference video segments of just a few seconds can achieve performance comparable to or even better than the full reference video. This indicates that a video's informative quality is much more important than its length. Inspired by this observation, we propose ISExplore (short for Informative Segment Explore), a simple-yet-effective segment selection strategy that automatically identifies the informative 5-second reference video segment based on three key data quality dimensions: audio feature diversity, lip movement amplitude, and number of camera views. Extensive experiments demonstrate that our approach increases data processing and training speed by more than 5x for NeRF and 3DGS methods, while maintaining high-fidelity output. Project resources are available at xx.
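The segment-selection idea can be sketched as a sliding-window scoring loop. The scoring functions below (feature variance and mean amplitude) are illustrative stand-ins for the paper's three data-quality dimensions, and all names are hypothetical:

```python
import numpy as np

def select_segment(audio_feats, lip_amplitudes, fps=25, seg_sec=5):
    """Return the start frame of the best-scoring 5-second window."""
    win = fps * seg_sec
    best_start, best_score = 0, -np.inf
    for s in range(0, len(lip_amplitudes) - win + 1, fps):  # 1-second stride
        diversity = audio_feats[s:s + win].var(axis=0).mean()  # audio feature diversity
        amplitude = lip_amplitudes[s:s + win].mean()           # lip movement amplitude
        score = diversity + amplitude
        if score > best_score:
            best_start, best_score = s, score
    return best_start

rng = np.random.default_rng(0)
n = 25 * 30                        # 30 seconds of frames at 25 fps
audio = rng.normal(size=(n, 16)) * 0.1
lips = np.zeros(n)
lips[250:400] = 1.0                # a stretch of strong lip motion
start = select_segment(audio, lips)
```

A real implementation would also weight in the number of camera views and normalize the score terms; the point here is only the window-scoring mechanics.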
[154] Libra-MIL: Multimodal Prototypes Stereoscopic Infused with Task-specific Language Priors for Few-shot Whole Slide Image Classification
Zhenfeng Zhuang, Fangyu Zhou, Liansheng Wang
Main category: cs.CV
TL;DR: Proposes Multimodal Prototype-based Multi-Instance Learning (MP-MIL) for computational pathology using LLMs, featuring bidirectional interaction between vision and text modalities through pathological entity prototypes and Stereoscopic Optimal Transport fusion.
Details
Motivation: Address computational cost of giga-pixel WSIs and bias in LLM-generated instance descriptions by creating task-specific pathological entity prototypes for better generalization and interpretability in pathology tasks with bag-level labels.
Method: Uses frozen LLM to generate task-specific pathological entity descriptions as text prototypes, learns instance-level vision prototypes, and employs Stereoscopic Optimal Transport algorithm for bidirectional cross-modal fusion in higher-dimensional semantic space.
Result: Demonstrates superior generalization capabilities in few-shot classification and explainability experiments on three distinct cancer datasets compared to existing methods.
Conclusion: The proposed MP-MIL framework effectively enables bidirectional multimodal interaction in pathology analysis, improving generalization and interpretability while addressing computational challenges of WSIs and LLM bias issues.
Abstract: While Large Language Models (LLMs) are emerging as a promising direction in computational pathology, the substantial computational cost of giga-pixel Whole Slide Images (WSIs) necessitates the use of Multi-Instance Learning (MIL) to enable effective modeling. A key challenge is that pathological tasks typically provide only bag-level labels, while instance-level descriptions generated by LLMs often suffer from bias due to a lack of fine-grained medical knowledge. To address this, we propose that constructing task-specific pathological entity prototypes is crucial for learning generalizable features and enhancing model interpretability. Furthermore, existing vision-language MIL methods often employ unidirectional guidance, limiting cross-modal synergy. In this paper, we introduce a novel approach, Multimodal Prototype-based Multi-Instance Learning, that promotes bidirectional interaction through a balanced information compression scheme. Specifically, we leverage a frozen LLM to generate task-specific pathological entity descriptions, which are learned as text prototypes. Concurrently, the vision branch learns instance-level prototypes to mitigate the model’s reliance on redundant data. For the fusion stage, we employ the Stereoscopic Optimal Transport (SOT) algorithm, which is based on a similarity metric, thereby facilitating broader semantic alignment in a higher-dimensional space. We conduct few-shot classification and explainability experiments on three distinct cancer datasets, and the results demonstrate the superior generalization capabilities of our proposed method.
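The Stereoscopic Optimal Transport algorithm is not specified in detail here, but its core ingredient, an entropic-regularized transport plan between vision and text prototypes, can be sketched with a standard Sinkhorn iteration (a generic stand-in, not the paper's algorithm):

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Entropic-regularized OT plan between two uniform prototype sets."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)
    v = np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
vision = rng.normal(size=(4, 8))   # instance-level vision prototypes
text = rng.normal(size=(4, 8))     # LLM-derived text prototypes
vn = vision / np.linalg.norm(vision, axis=1, keepdims=True)
tn = text / np.linalg.norm(text, axis=1, keepdims=True)
P = sinkhorn_plan(1.0 - vn @ tn.T)  # cost = 1 - cosine similarity
```

The resulting plan `P` gives soft, bidirectional correspondences between the two prototype sets, which is the kind of balanced cross-modal coupling the paper's fusion stage relies on.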
[155] ReIDMamba: Learning Discriminative Features with Visual State Space Model for Person Re-Identification
Hongyang Gu, Qisong Yang, Lei Pu, Siming Han, Yao Ding
Main category: cs.CV
TL;DR: ReIDMamba is a pure Mamba-based person re-identification framework that overcomes Transformer scalability issues by using Mamba architecture with multiple class tokens, achieving SOTA performance with fewer parameters and faster inference.
Details
Motivation: Transformers in person ReID face quadratic memory/computational scaling issues, while CNNs suffer from local processing and information loss from convolution/downsampling operations.
Method: Proposes Mamba-based baseline with multiple class tokens, multi-granularity feature extractor (MGFE) with multi-branch architecture, and ranking-aware triplet regularization (RATR) for feature diversity.
Result: Achieves state-of-the-art performance on five person ReID benchmarks with only one-third parameters of TransReID, lower GPU memory usage, and faster inference throughput.
Conclusion: ReIDMamba demonstrates superior performance as a pioneering pure Mamba-driven approach for person re-identification, effectively addressing scalability while maintaining robust feature learning.
Abstract: Extracting robust discriminative features is a critical challenge in person re-identification (ReID). While Transformer-based methods have successfully addressed some limitations of convolutional neural networks (CNNs), such as their local processing nature and information loss resulting from convolution and downsampling operations, they still face the scalability issue due to the quadratic increase in memory and computational requirements with the length of the input sequence. To overcome this, we propose a pure Mamba-based person ReID framework named ReIDMamba. Specifically, we have designed a Mamba-based strong baseline that effectively leverages fine-grained, discriminative global features by introducing multiple class tokens. To further enhance robust feature learning within Mamba, we have carefully designed two novel techniques. First, the multi-granularity feature extractor (MGFE) module, designed with a multi-branch architecture and class token fusion, effectively forms multi-granularity features, enhancing both discrimination ability and fine-grained coverage. Second, the ranking-aware triplet regularization (RATR) is introduced to reduce redundancy in features from multiple branches, enhancing the diversity of multi-granularity features by incorporating both intra-class and inter-class diversity constraints, thus ensuring the robustness of person features. To our knowledge, this is the pioneering work that integrates a purely Mamba-driven approach into ReID research. Our proposed ReIDMamba model boasts only one-third the parameters of TransReID, along with lower GPU memory usage and faster inference throughput. Experimental results demonstrate ReIDMamba's superior and promising performance, achieving state-of-the-art performance on five person ReID benchmarks. Code is available at https://github.com/GuHY777/ReIDMamba.
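RATR builds on the standard triplet margin loss with additional intra-class and inter-class diversity constraints. As a reference point, here is a minimal numpy version of the base triplet loss (the paper's diversity terms are not reproduced):

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.3):
    """Hinge loss pushing anchor-negative farther than anchor-positive by `margin`."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)
    d_an = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())

a = np.array([[0.0, 0.0]])
# easy triplet: negative already far beyond the margin -> zero loss
easy = triplet_margin_loss(a, np.array([[0.1, 0.0]]), np.array([[1.0, 0.0]]))
# hard triplet: negative closer than positive -> positive loss
hard = triplet_margin_loss(a, np.array([[1.0, 0.0]]), np.array([[0.5, 0.0]]))
```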
[156] Burst Image Quality Assessment: A New Benchmark and Unified Framework for Multiple Downstream Tasks
Xiaoye Liang, Lai Jiang, Minglang Qiao, Yichen Guo, Yue Zhang, Xin Deng, Shengxi Li, Yufan Liu, Mai Xu
Main category: cs.CV
TL;DR: Proposes Burst Image Quality Assessment (BuIQA) to evaluate frame quality in burst sequences for downstream task efficiency, introduces first benchmark dataset, and presents a unified framework with task-driven prompts and knowledge distillation.
Details
Motivation: Address redundancy in burst images that increases storage/transmission demands and reduces downstream task efficiency by developing quality assessment for burst frame selection.
Method: Establishes benchmark dataset with 7,346 burst sequences and 191,572 quality scores. Proposes unified framework with task-driven prompt generation using heterogeneous knowledge distillation and task-aware quality assessment network.
Result: Outperforms state-of-the-art across 10 downstream scenarios. Achieves 0.33 dB PSNR improvement in denoising and super-resolution tasks when used for burst frame selection.
Conclusion: BuIQA effectively addresses burst image redundancy, improves downstream task performance, and provides a comprehensive benchmark for future research in burst image quality assessment.
Abstract: In recent years, the development of burst imaging technology has improved the capture and processing capabilities of visual data, enabling a wide range of applications. However, the redundancy in burst images leads to increased storage and transmission demands, as well as reduced efficiency of downstream tasks. To address this, we propose a new task of Burst Image Quality Assessment (BuIQA), to evaluate the task-driven quality of each frame within a burst sequence, providing reasonable cues for burst image selection. Specifically, we establish the first benchmark dataset for BuIQA, consisting of $7,346$ burst sequences with $45,827$ images and $191,572$ annotated quality scores for multiple downstream scenarios. Inspired by the data analysis, a unified BuIQA framework is proposed to achieve efficient adaptation of BuIQA to diverse downstream scenarios. In this framework, a task-driven prompt generation network is developed with heterogeneous knowledge distillation, to learn the priors of the downstream task. Then, the task-aware quality assessment network is introduced to assess the burst image quality based on the task prompt. Extensive experiments across 10 downstream scenarios demonstrate the impressive BuIQA performance of the proposed approach, outperforming the state-of-the-art. Furthermore, it can achieve $0.33$ dB PSNR improvement in the downstream tasks of denoising and super-resolution, by applying our approach to select the high-quality burst frames.
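The heterogeneous knowledge distillation step is not detailed here, but its generic mechanism, matching a student's temperature-softened distribution to a teacher's, looks like this (illustrative sketch only):

```python
import numpy as np

def softmax(x, T=1.0):
    e = np.exp(x / T - (x / T).max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL divergence between teacher and student distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((T * T) * (p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

teacher = np.array([[2.0, 0.5, -1.0]])
zero = distillation_kl(teacher, teacher)          # identical logits -> zero loss
pos = distillation_kl(np.zeros((1, 3)), teacher)  # mismatch -> positive loss
```

The temperature `T` softens both distributions so that the student also learns the teacher's relative preferences among non-top classes.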
[157] Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection
Shenao Zhao, Pengpeng Liang, Zhoufan Yang
Main category: cs.CV
TL;DR: MMAssist improves 3D unsupervised domain adaptation for LiDAR object detection by using image and text features as bridges to align 3D features between domains, achieving state-of-the-art performance.
Details
Motivation: Current 3D UDA methods using teacher-student architectures with pseudo labels have shown improvements, but they largely ignore the potential of image data that is commonly collected alongside LiDAR point clouds.
Method: Projects 3D labels to 2D images, extracts image features from pre-trained vision backbone, uses LVLM for text descriptions, aligns 3D features with image/text features, fuses them with learned weights, and enhances pseudo labels with 2D detector assistance.
Result: Achieves promising performance compared with state-of-the-art methods in three domain adaptation tasks on three popular 3D object detection datasets.
Conclusion: Multi-modal assistance through image and text features effectively improves 3D UDA performance for LiDAR-based object detection.
Abstract: Unsupervised domain adaptation for LiDAR-based 3D object detection (3D UDA) based on the teacher-student architecture with pseudo labels has achieved notable improvements in recent years. Although it is quite popular to collect point clouds and images simultaneously, little attention has been paid to the usefulness of image data in 3D UDA when training the models. In this paper, we propose an approach named MMAssist that improves the performance of 3D UDA with multi-modal assistance. A method is designed to align 3D features between the source domain and the target domain by using image and text features as bridges. More specifically, we project the ground truth labels or pseudo labels to the images to get a set of 2D bounding boxes. For each 2D box, we extract its image feature from a pre-trained vision backbone. A large vision-language model (LVLM) is adopted to extract the box's text description, and a pre-trained text encoder is used to obtain its text feature. During the training of the model in the source domain and the student model in the target domain, we align the 3D features of the predicted boxes with their corresponding image and text features, and the 3D features and the aligned features are fused with learned weights for the final prediction. The features between the student branch and the teacher branch in the target domain are aligned as well. To enhance the pseudo labels, we use an off-the-shelf 2D object detector to generate 2D bounding boxes from images and estimate their corresponding 3D boxes with the aid of the point cloud, and these 3D boxes are combined with the pseudo labels generated by the teacher model. Experimental results show that our approach achieves promising performance compared with state-of-the-art methods in three domain adaptation tasks on three popular 3D object detection datasets. The code is available at https://github.com/liangp/MMAssist.
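Projecting a 3D box into the image plane to collect 2D region features is a standard step in this kind of pipeline. A minimal sketch with a pinhole camera model (the intrinsics below are hypothetical, not from the paper):

```python
import numpy as np

def project_box_corners(corners_3d, K):
    """Project 3D box corners (camera frame, z > 0) and return the enclosing
    2D box (x1, y1, x2, y2)."""
    uvw = (K @ corners_3d.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]   # perspective divide
    x1, y1 = uv.min(axis=0)
    x2, y2 = uv.max(axis=0)
    return float(x1), float(y1), float(x2), float(y2)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])    # hypothetical pinhole intrinsics
# a 1 m cube centered 10 m in front of the camera
cube = np.array([[x, y, z] for x in (-0.5, 0.5)
                           for y in (-0.5, 0.5)
                           for z in (9.5, 10.5)])
box2d = project_box_corners(cube, K)
```

The resulting 2D box is what gets cropped and fed to the vision backbone (and described by the LVLM) to obtain the per-box image and text features used as alignment bridges.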
[158] Morphing Through Time: Diffusion-Based Bridging of Temporal Gaps for Robust Alignment in Change Detection
Seyedehanita Madani, Vishal M. Patel
Main category: cs.CV
TL;DR: A modular pipeline that improves spatial and temporal robustness in remote sensing change detection without modifying existing networks, using diffusion-based semantic morphing and registration refinement.
Details
Motivation: Remote sensing change detection faces challenges with spatial misalignment in bi-temporal images, especially with long seasonal gaps. Existing models rely on precise co-registration and lack robustness in real-world conditions.
Method: Integrates diffusion-based semantic morphing, dense registration, and residual flow refinement. Uses diffusion to synthesize intermediate morphing frames, estimates stepwise correspondences, and refines the flow with a lightweight U-Net for high-fidelity warping.
Result: Extensive experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show consistent gains in both registration accuracy and downstream change detection across multiple backbones.
Conclusion: The proposed approach demonstrates generality and effectiveness in improving spatial and temporal robustness for remote sensing change detection without altering existing networks.
Abstract: Remote sensing change detection is often challenged by spatial misalignment between bi-temporal images, especially when acquisitions are separated by long seasonal or multi-year gaps. While modern convolutional and transformer-based models perform well on aligned data, their reliance on precise co-registration limits their robustness in real-world conditions. Existing joint registration-detection frameworks typically require retraining and transfer poorly across domains. We introduce a modular pipeline that improves spatial and temporal robustness without altering existing change detection networks. The framework integrates diffusion-based semantic morphing, dense registration, and residual flow refinement. A diffusion module synthesizes intermediate morphing frames that bridge large appearance gaps, enabling RoMa to estimate stepwise correspondences between consecutive frames. The composed flow is then refined through a lightweight U-Net to produce a high-fidelity warp that co-registers the original image pair. Extensive experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show consistent gains in both registration accuracy and downstream change detection across multiple backbones, demonstrating the generality and effectiveness of the proposed approach.
[159] DANCE: Density-agnostic and Class-aware Network for Point Cloud Completion
Da-Yeong Kim, Yeong-Jun Cho
Main category: cs.CV
TL;DR: DANCE is a density-agnostic and class-aware network for point cloud completion that preserves observed geometry while completing only missing regions using ray-based sampling and transformer refinement.
Details
Motivation: Existing point cloud completion methods assume fixed input/output densities or rely on image-based representations, making them unsuitable for real-world scenarios with variable sparsity and limited supervision.
Method: Generates candidate points via ray-based sampling from multiple viewpoints, refines positions with transformer decoder, predicts opacity scores for point validity, and uses lightweight classification head for semantic guidance without external image supervision.
Result: Outperforms state-of-the-art methods on PCN and MVP benchmarks in accuracy and structural consistency, while remaining robust to varying input densities and noise levels.
Conclusion: DANCE provides an effective framework for category-consistent point cloud completion that handles real-world challenges like variable sparsity and limited supervision.
Abstract: Point cloud completion aims to recover missing geometric structures from incomplete 3D scans, which often suffer from occlusions or limited sensor viewpoints. Existing methods typically assume fixed input/output densities or rely on image-based representations, making them less suitable for real-world scenarios with variable sparsity and limited supervision. In this paper, we introduce Density-agnostic and Class-aware Network (DANCE), a novel framework that completes only the missing regions while preserving the observed geometry. DANCE generates candidate points via ray-based sampling from multiple viewpoints. A transformer decoder then refines their positions and predicts opacity scores, which determine the validity of each point for inclusion in the final surface. To incorporate semantic guidance, a lightweight classification head is trained directly on geometric features, enabling category-consistent completion without external image supervision. Extensive experiments on the PCN and MVP benchmarks show that DANCE outperforms state-of-the-art methods in accuracy and structural consistency, while remaining robust to varying input densities and noise levels.
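The candidate-then-filter pipeline, sampling points along rays and keeping those the opacity head marks as valid, can be sketched as follows; the opacity scores here are placeholders for the transformer decoder's predictions:

```python
import numpy as np

def candidates_along_rays(origin, dirs, depths):
    """Sample candidate points along rays cast from one viewpoint."""
    return (origin + dirs[:, None, :] * depths[None, :, None]).reshape(-1, 3)

def filter_by_opacity(points, opacity, thresh=0.5):
    """Keep candidates whose predicted opacity marks them as valid surface points."""
    return points[opacity >= thresh]

origin = np.zeros(3)
dirs = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])   # two unit rays
cand = candidates_along_rays(origin, dirs, np.array([1.0, 2.0]))
opacity = np.array([0.9, 0.1, 0.8, 0.2])              # stand-in for decoder outputs
surface = filter_by_opacity(cand, opacity)
```

Because only candidates in missing regions pass the filter, the observed geometry is returned untouched, which is the property DANCE emphasizes.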
[160] ChexFract: From General to Specialized - Enhancing Fracture Description Generation
Nikolay Nechaev, Evgeniia Przhezdzetskaia, Dmitry Umerenkov, Dmitry V. Dylov
Main category: cs.CV
TL;DR: Specialized vision-language models for fracture detection and description in chest X-rays outperform general-purpose models, with analysis of performance by fracture type, location, and age.
Details
Motivation: General radiology report generation models often fail to adequately describe rare but clinically important pathologies like fractures in chest X-rays.
Method: Train fracture-specific vision-language models using encoders from MAIRA-2 and CheXagent for specialized fracture detection and description.
Result: Significant improvements over general-purpose models in generating accurate fracture descriptions, with analysis revealing strengths and limitations by fracture characteristics.
Conclusion: Specialized models are needed for accurate reporting of rare pathologies, and the best-performing fracture-reporting model is publicly released to facilitate future research.
Abstract: Generating accurate and clinically meaningful radiology reports from chest X-ray images remains a significant challenge in medical AI. While recent vision-language models achieve strong results in general radiology report generation, they often fail to adequately describe rare but clinically important pathologies like fractures. This work addresses this gap by developing specialized models for fracture pathology detection and description. We train fracture-specific vision-language models with encoders from MAIRA-2 and CheXagent, demonstrating significant improvements over general-purpose models in generating accurate fracture descriptions. Analysis of model outputs by fracture type, location, and age reveals distinct strengths and limitations of current vision-language model architectures. We publicly release our best-performing fracture-reporting model, facilitating future research in accurate reporting of rare pathologies.
[161] CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting
Chae-Yeon Heo, Yeong-Jun Cho
Main category: cs.CV
TL;DR: CSF-Net is a semantic-guided framework for large-mask image inpainting that uses structure-aware candidates from a pretrained AC model and fuses them with contextual features via transformer-based fusion to improve inpainting quality.
Details
Motivation: To address the challenge of large-mask image inpainting where essential visual content is missing and contextual cues are limited, requiring better semantic guidance for accurate completion.
Method: Leverages pretrained Amodal Completion model to generate structure-aware candidates, then uses Context-Semantic Fusion Network (CSF-Net) - a transformer-based fusion framework that fuses candidates with contextual features to produce semantic guidance for inpainting.
Result: Extensive experiments on Places365 and COCOA datasets show CSF-Net effectively reduces object hallucination while enhancing visual realism and semantic alignment, and can be integrated into existing inpainting models without architectural changes.
Conclusion: CSF-Net provides an effective semantic-guided framework that consistently enhances inpainting performance across diverse masking conditions by promoting structural accuracy and semantic consistency.
Abstract: In this paper, we propose a semantic-guided framework to address the challenging problem of large-mask image inpainting, where essential visual content is missing and contextual cues are limited. To compensate for the limited context, we leverage a pretrained Amodal Completion (AC) model to generate structure-aware candidates that serve as semantic priors for the missing regions. We introduce Context-Semantic Fusion Network (CSF-Net), a transformer-based fusion framework that fuses these candidates with contextual features to produce a semantic guidance image for image inpainting. This guidance improves inpainting quality by promoting structural accuracy and semantic consistency. CSF-Net can be seamlessly integrated into existing inpainting models without architectural changes and consistently enhances performance across diverse masking conditions. Extensive experiments on the Places365 and COCOA datasets demonstrate that CSF-Net effectively reduces object hallucination while enhancing visual realism and semantic alignment. The code for CSF-Net is available at https://github.com/chaeyeonheo/CSF-Net.
[162] Hardware-Aware YOLO Compression for Low-Power Edge AI on STM32U5 for Weeds Detection in Digital Agriculture
Charalampos S. Kouzinopoulos, Yuri Manna
Main category: cs.CV
TL;DR: An optimized low-power edge AI system for weed detection using YOLOv8n deployed on STM32U575ZI microcontroller with compression techniques, achieving 51.8mJ per inference for real-time agricultural use.
Details
Motivation: Traditional weed management methods using chemical herbicides cause environmental contamination and herbicide resistance, while existing precision weeding solutions require high-power computational platforms unsuitable for scalable agricultural deployment.
Method: Deployed YOLOv8n object detector on STM32U575ZI microcontroller with structured pruning, integer quantization, and input image resolution scaling to meet hardware constraints, trained on CropAndWeed dataset with 74 plant species.
Result: Achieved balanced trade-off between detection accuracy and efficiency with minimal energy consumption of 51.8mJ per inference, enabling real-time in-situ weed detection.
Conclusion: The system enables scalable deployment in power-constrained agricultural environments as an eco-friendly alternative to traditional weed management methods.
Abstract: Weeds significantly reduce crop yields worldwide and pose major challenges to sustainable agriculture. Traditional weed management methods, primarily relying on chemical herbicides, risk environmental contamination and lead to the emergence of herbicide-resistant species. Precision weeding, leveraging computer vision and machine learning methods, offers a promising eco-friendly alternative but is often limited by reliance on high-power computational platforms. This work presents an optimized, low-power edge AI system for weed detection based on the YOLOv8n object detector deployed on the STM32U575ZI microcontroller. Several compression techniques are applied to the detection model, including structured pruning, integer quantization and input image resolution scaling in order to meet strict hardware constraints. The model is trained and evaluated on the CropAndWeed dataset with 74 plant species, achieving a balanced trade-off between detection accuracy and efficiency. Our system supports real-time, in-situ weed detection with a minimal energy consumption of 51.8mJ per inference, enabling scalable deployment in power-constrained agricultural environments.
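The two main compression steps named above, structured (filter-level) pruning and integer quantization, can be sketched generically in numpy; this illustrates the mechanics only and is not the paper's STM32 deployment code:

```python
import numpy as np

def prune_filters_l1(weight, keep_ratio=0.5):
    """Structured pruning: rank conv filters by L1 norm, keep the top fraction.
    weight shape: (out_channels, in_channels, kh, kw)."""
    norms = np.abs(weight).sum(axis=(1, 2, 3))
    k = max(1, int(len(norms) * keep_ratio))
    keep = np.sort(np.argsort(norms)[-k:])
    return weight[keep], keep

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization; returns values and dequant scale."""
    scale = float(np.abs(w).max() / 127.0)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8, 3, 3))   # a toy conv layer's weights
pruned, kept = prune_filters_l1(w)
q, scale = quantize_int8(pruned)
recon = q.astype(np.float64) * scale
```

Dropping whole filters shrinks both the weight tensor and every downstream activation map, and int8 storage cuts weight memory 4x versus float32, which is why these two techniques dominate microcontroller deployments.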
[163] Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning
Jialong Qin, Xin Zou, Di Lu, Yibo Yan, Xuming Hu
Main category: cs.CV
TL;DR: SharpV is an efficient method that adaptively prunes visual tokens and KV cache in VideoLLMs to reduce quadratic computational complexity, achieving performance gains while maintaining hardware compatibility.
Details
Motivation: Current VideoLLMs suffer from quadratic computational complexity and key-value cache scaling due to processing excessive redundant visual tokens, creating efficiency bottlenecks.
Method: SharpV uses adaptive pruning of visual tokens based on spatial-temporal information and hierarchical KV cache pruning via self-calibration guided by similarity to original features, operating without requiring access to attention scores.
Result: Experiments on multiple benchmarks show SharpV’s superiority, achieving performance gains over dense models while being the first two-stage pruning framework compatible with hardware acceleration techniques.
Conclusion: SharpV offers a novel paradigm for adaptive pruning in VideoLLMs, providing hierarchical cache pruning from an information bottleneck perspective and ensuring full hardware compatibility.
Abstract: Current Video Large Language Models (VideoLLMs) suffer from quadratic computational complexity and key-value cache scaling, due to their reliance on processing excessive redundant visual tokens. To address this problem, we propose SharpV, a minimalist and efficient method for adaptive pruning of visual tokens and KV cache. Different from most uniform compression approaches, SharpV dynamically adjusts pruning ratios based on spatial-temporal information. Remarkably, this adaptive mechanism occasionally achieves performance gains over dense models, offering a novel paradigm for adaptive pruning. During the KV cache pruning stage, based on observations of visual information degradation, SharpV prunes degraded visual features via a self-calibration manner, guided by similarity to original visual features. In this way, SharpV achieves hierarchical cache pruning from the perspective of information bottleneck, offering a new insight into VideoLLMs’ information flow. Experiments on multiple public benchmarks demonstrate the superiority of SharpV. Moreover, to the best of our knowledge, SharpV is notably the first two-stage pruning framework that operates without requiring access to exposed attention scores, ensuring full compatibility with hardware acceleration techniques like Flash Attention.
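An adaptive pruning ratio, keeping more tokens when informative content is spread out and fewer when it is concentrated, can be illustrated with a score-mass criterion (a generic sketch, not SharpV's actual criterion, and the scores here are placeholders for its spatial-temporal information measure):

```python
import numpy as np

def adaptive_prune(tokens, scores, mass=0.9):
    """Keep the highest-scoring visual tokens until a fraction `mass` of the
    total score is retained, so the pruning ratio adapts to the content."""
    order = np.argsort(scores)[::-1]
    csum = np.cumsum(scores[order]) / scores.sum()
    k = int(np.searchsorted(csum, mass)) + 1
    keep = np.sort(order[:k])
    return tokens[keep], keep

rng = np.random.default_rng(0)
tok = rng.normal(size=(100, 16))
s = np.full(100, 0.01)
s[:5] = 1.0                       # a few highly informative tokens
kept_tok, idx = adaptive_prune(tok, s)
```

When the score mass concentrates on few tokens the keep set shrinks sharply; when scores are uniform, almost everything is retained, mimicking a content-dependent pruning ratio.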
[164] EAGLE: Episodic Appearance- and Geometry-aware Memory for Unified 2D-3D Visual Query Localization in Egocentric Vision
Yifei Cao, Yu Liu, Guolong Wang, Zhu Liu, Kai Wang, Xianjie Zhang, Jizhe Yu, Xun Tu
Main category: cs.CV
TL;DR: EAGLE is a novel framework for egocentric visual query localization that uses episodic appearance- and geometry-aware memory to achieve unified 2D-3D localization, achieving state-of-the-art performance on Ego4D-VQ benchmark.
Details
Motivation: Egocentric visual query localization is challenging due to camera motion, viewpoint changes, and appearance variations, making it vital for embodied AI and VR/AR applications.
Method: EAGLE integrates segmentation guided by appearance-aware meta-learning memory (AMM) with tracking driven by geometry-aware localization memory (GLM), using memory consolidation with structured appearance and geometry memory banks. It also uses visual geometry grounded Transformer (VGGT) to unify 2D and 3D tasks.
Result: The method achieves state-of-the-art performance on the Ego4D-VQ benchmark, enabling precise contour delineation with robust spatial discrimination and significantly improved retrieval accuracy.
Conclusion: EAGLE’s memory consolidation mechanism effectively supports both long- and short-term modeling of target appearance variations, enabling efficient unification of 2D and 3D visual query localization tasks.
Abstract: Egocentric visual query localization is vital for embodied AI and VR/AR, yet remains challenging due to camera motion, viewpoint changes, and appearance variations. We present EAGLE, a novel framework that leverages episodic appearance- and geometry-aware memory to achieve unified 2D-3D visual query localization in egocentric vision. Inspired by avian memory consolidation, EAGLE synergistically integrates segmentation guided by an appearance-aware meta-learning memory (AMM), with tracking driven by a geometry-aware localization memory (GLM). This memory consolidation mechanism, through structured appearance and geometry memory banks, stores high-confidence retrieval samples, effectively supporting both long- and short-term modeling of target appearance variations. This enables precise contour delineation with robust spatial discrimination, leading to significantly improved retrieval accuracy. Furthermore, by integrating the VQL-2D output with a visual geometry grounded Transformer (VGGT), we achieve an efficient unification of 2D and 3D tasks, enabling rapid and accurate back-projection into 3D space. Our method achieves state-of-the-art performance on the Ego4D-VQ benchmark.
[165] Invisible Triggers, Visible Threats! Road-Style Adversarial Creation Attack for Visual 3D Detection in Autonomous Driving
Jian Wang, Lijun He, Yixing Yong, Haixia Bi, Fan Li
Main category: cs.CV
TL;DR: AdvRoad generates natural-looking road-style adversarial posters that stealthily attack 3D object detectors in autonomous driving systems by making them perceive non-existent objects.
Details
Motivation: Current visual 3D detection systems are vulnerable to adversarial attacks, and existing adversarial posters are easily noticeable due to unnatural appearances and fixed content, limiting their practicality.
Method: Two-stage approach: Road-Style Adversary Generation creates natural road-like appearances, and Scenario-Associated Adaptation optimizes attack effectiveness for specific scenes while maintaining stealth.
Result: AdvRoad successfully attacks various detectors across different scenes and locations, with physical experiments confirming real-world threats.
Conclusion: The method demonstrates significant security vulnerabilities in autonomous driving perception systems through stealthy, natural-looking adversarial attacks that bypass human detection.
Abstract: Modern autonomous driving (AD) systems leverage 3D object detection to perceive foreground objects in 3D environments for subsequent prediction and planning. Visual 3D detection based on RGB cameras provides a cost-effective solution compared to the LiDAR paradigm. While achieving promising detection accuracy, current deep neural network-based models remain highly susceptible to adversarial examples. The underlying safety concerns motivate us to investigate realistic adversarial attacks in AD scenarios. Previous work has demonstrated the feasibility of placing adversarial posters on the road surface to induce hallucinations in the detector. However, the unnatural appearance of the posters makes them easily noticeable by humans, and their fixed content can be readily targeted and defended. To address these limitations, we propose the AdvRoad to generate diverse road-style adversarial posters. The adversaries have naturalistic appearances resembling the road surface while compromising the detector to perceive non-existent objects at the attack locations. We employ a two-stage approach, termed Road-Style Adversary Generation and Scenario-Associated Adaptation, to maximize the attack effectiveness on the input scene while ensuring the natural appearance of the poster, allowing the attack to be carried out stealthily without drawing human attention. Extensive experiments show that AdvRoad generalizes well to different detectors, scenes, and spoofing locations. Moreover, physical attacks further demonstrate the practical threats in real-world environments.
[166] High-Quality Proposal Encoding and Cascade Denoising for Imaginary Supervised Object Detection
Zhiyuan Chen, Yuelin Guo, Zitong Huang, Haoyu He, Renhao Lu, Weizhe Zhang
Main category: cs.CV
TL;DR: Cascade HQP-DETR addresses limitations in Imaginary Supervised Object Detection by introducing high-quality synthetic datasets, image-specific query initialization, and cascade denoising to achieve state-of-the-art performance on real-world datasets.
Details
Motivation: Object detection requires large annotated datasets that are expensive to create. ISOD trains on synthetic images but faces issues with dataset quality, DETR convergence problems, and overfitting to noisy labels.
Method: Three key innovations: (1) High-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to create FluxVOC/FluxCOCO datasets; (2) High-Quality Proposal guided query encoding using SAM proposals and RoI features; (3) Cascade denoising with progressive IoU thresholds across decoder layers.
Result: Achieves SOTA 61.04% mAP@0.5 on PASCAL VOC 2007 after only 12 epochs of training on FluxVOC, outperforming strong baselines and demonstrating competitive real-data performance.
Conclusion: Cascade HQP-DETR successfully advances ISOD from weak to full supervision, accelerates convergence, prevents overfitting to synthetic patterns, and achieves strong generalization to real-world data with minimal training.
Abstract: Object detection models demand large-scale annotated datasets, which are costly and labor-intensive to create. This motivated Imaginary Supervised Object Detection (ISOD), where models train on synthetic images and test on real images. However, existing methods face three limitations: (1) synthetic datasets suffer from simplistic prompts, poor image quality, and weak supervision; (2) DETR-based detectors, due to their random query initialization, struggle with slow convergence and overfitting to synthetic patterns, hindering real-world generalization; (3) uniform denoising pressure promotes model overfitting to pseudo-label noise. We propose Cascade HQP-DETR to address these limitations. First, we introduce a high-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to generate the FluxVOC and FluxCOCO datasets, advancing ISOD from weak to full supervision. Second, our High-Quality Proposal guided query encoding initializes object queries with image-specific priors from SAM-generated proposals and RoI-pooled features, accelerating convergence while steering the model to learn transferable features instead of overfitting to synthetic patterns. Third, our cascade denoising algorithm dynamically adjusts training weights through progressively increasing IoU thresholds across decoder layers, guiding the model to learn robust boundaries from reliable visual cues rather than overfitting to noisy labels. Trained for just 12 epochs solely on FluxVOC, Cascade HQP-DETR achieves a SOTA 61.04% mAP@0.5 on PASCAL VOC 2007, outperforming strong baselines, with its competitive real-data performance confirming the architecture’s universal applicability.
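The cascade denoising idea, progressively raising the IoU threshold that a proposal must clear to contribute to each decoder layer's loss, can be sketched as follows. This is a toy illustration under assumptions: the linear threshold schedule, the hard 0/1 weighting, and the function name `cascade_denoise_weights` are not from the paper.

```python
def cascade_denoise_weights(ious, num_layers=6, t_start=0.5, t_end=0.8):
    """Per-layer training weights: a proposal contributes to layer l's
    loss only if its IoU with the (pseudo) label exceeds that layer's
    threshold, which increases linearly with decoder depth."""
    weights = []
    for l in range(num_layers):
        t = t_start + (t_end - t_start) * l / max(1, num_layers - 1)
        weights.append([1.0 if iou >= t else 0.0 for iou in ious])
    return weights
```

Deeper layers thus only learn from proposals that already match the pseudo-label well, which is one plausible way to avoid fitting label noise.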
[167] Multi-modal Deepfake Detection and Localization with FPN-Transformer
Chende Zheng, Ruiqi Suo, Zhoulin Ji, Jingyi Deng, Fangbin Yi, Chenhao Lin, Chao Shen
Main category: cs.CV
TL;DR: A multi-modal deepfake detection and localization framework using Feature Pyramid-Transformer achieves cross-modal generalization and precise temporal localization of manipulated segments.
Details
Motivation: Current unimodal detection methods fail to leverage cross-modal correlations and precisely localize forged segments, limiting their effectiveness against sophisticated deepfake manipulations.
Method: Uses pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract features, constructs multi-scale feature pyramid with R-TLM blocks and localized attention, and employs dual-branch prediction for forgery probability and temporal offset refinement.
Result: Achieved a score of 0.7535 on the IJCAI'25 DDL-AV benchmark, demonstrating effective cross-modal deepfake detection and localization in challenging environments.
Conclusion: The approach provides a novel solution for generalized deepfake detection by effectively leveraging cross-modal correlations and achieving precise temporal localization.
Abstract: The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), addressing critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach utilizes pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context temporal dependencies. The dual-branch prediction head simultaneously predicts forgery probabilities and refines temporal offsets of manipulated segments, achieving frame-level localization precision. We evaluate our approach on the test set of the IJCAI'25 DDL-AV benchmark, showing a good performance with a final score of 0.7535 for cross-modal deepfake detection and localization in challenging environments. Experimental results confirm the effectiveness of our approach and provide a novel way for generalized deepfake detection. Our code is available at https://github.com/Zig-HS/MM-DDL
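The dual-branch head described above (per-location forgery probability plus start/end offset regression) can be sketched generically. Everything here is an assumption for illustration, simple linear branches standing in for the paper's actual head; `dual_branch_head` and the weight shapes are hypothetical.

```python
import numpy as np

def dual_branch_head(feat_pyramid, w_cls, w_reg):
    """Sketch of a dual-branch head: at each temporal location of each
    pyramid level, one branch scores forgery probability and the other
    regresses start/end offsets of the manipulated segment.

    feat_pyramid: list of (T_l, dim) arrays; w_cls: (dim,); w_reg: (dim, 2).
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    outputs = []
    for feats in feat_pyramid:
        probs = sigmoid(feats @ w_cls)   # (T_l,) forgery probability
        offsets = feats @ w_reg          # (T_l, 2) start/end offsets
        outputs.append((probs, offsets))
    return outputs
```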
[168] Perceptual Quality Assessment of 3D Gaussian Splatting: A Subjective Dataset and Prediction Metric
Zhaolin Wan, Yining Diao, Jingqi Xu, Hao Wang, Zhiyang Li, Xiaopeng Fan, Wangmeng Zuo, Debin Zhao
Main category: cs.CV
TL;DR: This paper introduces 3DGS-QA, the first subjective quality assessment dataset for 3D Gaussian Splatting, and proposes a no-reference quality prediction model that operates directly on 3D Gaussian primitives without requiring rendered images.
Details
Motivation: The perceptual quality of 3DGS-rendered content under varying reconstruction conditions remains largely underexplored, despite factors like viewpoint sparsity, limited training iterations, and distortions significantly degrading visual quality.
Method: Created a dataset with 225 degraded reconstructions across 15 object types, and developed a no-reference quality prediction model that extracts spatial and photometric cues directly from 3D Gaussian primitives in a structure-aware manner.
Result: Experimental results show the proposed method consistently achieves superior performance compared to existing quality assessment methods, demonstrating robustness and effectiveness for 3DGS content evaluation.
Conclusion: The work bridges the gap in 3DGS perceptual quality assessment and provides a publicly available dataset and code to facilitate future research in this area.
Abstract: With the rapid advancement of 3D visualization, 3D Gaussian Splatting (3DGS) has emerged as a leading technique for real-time, high-fidelity rendering. While prior research has emphasized algorithmic performance and visual fidelity, the perceptual quality of 3DGS-rendered content, especially under varying reconstruction conditions, remains largely underexplored. In practice, factors such as viewpoint sparsity, limited training iterations, point downsampling, noise, and color distortions can significantly degrade visual quality, yet their perceptual impact has not been systematically studied. To bridge this gap, we present 3DGS-QA, the first subjective quality assessment dataset for 3DGS. It comprises 225 degraded reconstructions across 15 object types, enabling a controlled investigation of common distortion factors. Based on this dataset, we introduce a no-reference quality prediction model that directly operates on native 3D Gaussian primitives, without requiring rendered images or ground-truth references. Our model extracts spatial and photometric cues from the Gaussian representation to estimate perceived quality in a structure-aware manner. We further benchmark existing quality assessment methods, spanning both traditional and learning-based approaches. Experimental results show that our method consistently achieves superior performance, highlighting its robustness and effectiveness for 3DGS content evaluation. The dataset and code are made publicly available at https://github.com/diaoyn/3DGSQA to facilitate future research in 3DGS quality assessment.
[169] WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation
Gongshu Wang, Zhirui Wang, Kan Yang
Main category: cs.CV
TL;DR: WEDepth adapts Vision Foundation Models for monocular depth estimation without modifying their structure or weights, using them as multi-level feature enhancers to inject prior knowledge at different representation levels.
Details
Motivation: Monocular depth estimation is challenging due to its ill-posed nature. Vision Foundation Models have shown remarkable world understanding capabilities that could benefit depth estimation, and recent studies demonstrated improvements through fine-tuning these models.
Method: Uses Vision Foundation Models as multi-level feature enhancers without modifying their structures or pretrained weights, systematically injecting prior knowledge at different representation levels.
Result: Achieves new state-of-the-art performance on NYU-Depth v2 and KITTI datasets, with competitive results compared to diffusion-based approaches and methods pre-trained on relative depth. Also demonstrates strong zero-shot transfer capability.
Conclusion: WEDepth effectively adapts Vision Foundation Models for monocular depth estimation while preserving their original structure and weights, establishing new SOTA performance with strong generalization capabilities.
Abstract: Monocular depth estimation (MDE) is widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities that benefit various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures and pretrained weights, while effectively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels. Experiments on NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competitive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate our method exhibits strong zero-shot transfer capability across diverse scenarios.
[170] ProSona: Prompt-Guided Personalization for Multi-Expert Medical Image Segmentation
Aya Elgebaly, Nikolaos Delopoulos, Juliane Hörner-Rieber, Carolin Rippke, Sebastian Klüter, Luca Boldrini, Lorenzo Placidi, Riccardo Dal Bello, Nicolaus Andratschke, Michael Baumgartl, Claus Belka, Christopher Kurz, Guillaume Landry, Shadi Albarqouni
Main category: cs.CV
TL;DR: ProSona is a two-stage framework that learns continuous latent space of annotation styles for medical image segmentation, enabling personalized segmentations via natural language prompts.
Details
Motivation: Address high inter-observer variability in medical image segmentation where experts often disagree, moving beyond consensus masks or separate model branches.
Method: Two-stage framework with probabilistic U-Net backbone to capture diverse expert hypotheses, prompt-guided projection mechanism to navigate latent space, and multi-level contrastive objective to align textual and visual representations.
Result: Reduces Generalized Energy Distance by 17% and improves mean Dice by more than one point compared with DPersona on LIDC-IDRI lung nodule and multi-institutional prostate MRI datasets.
Conclusion: Natural-language prompts provide flexible, accurate, and interpretable control over personalized medical image segmentation.
Abstract: Automated medical image segmentation suffers from high inter-observer variability, particularly in tasks such as lung nodule delineation, where experts often disagree. Existing approaches either collapse this variability into a consensus mask or rely on separate model branches for each annotator. We introduce ProSona, a two-stage framework that learns a continuous latent space of annotation styles, enabling controllable personalization via natural language prompts. A probabilistic U-Net backbone captures diverse expert hypotheses, while a prompt-guided projection mechanism navigates this latent space to generate personalized segmentations. A multi-level contrastive objective aligns textual and visual representations, promoting disentangled and interpretable expert styles. Across the LIDC-IDRI lung nodule and multi-institutional prostate MRI datasets, ProSona reduces the Generalized Energy Distance by 17% and improves mean Dice by more than one point compared with DPersona. These results demonstrate that natural-language prompts can provide flexible, accurate, and interpretable control over personalized medical image segmentation. Our implementation is available online.
[171] Generalized-Scale Object Counting with Gradual Query Aggregation
Jer Pelhan, Alan Lukezic, Matej Kristan
Main category: cs.CV
TL;DR: GECO2 is a novel few-shot counting and detection method that addresses object scale issues through dense query representation, outperforming state-of-the-art methods by 10% in accuracy while being 3x faster with lower GPU memory usage.
Details
Motivation: Existing few-shot counters struggle with diverse-sized objects and densely populated regions due to ad-hoc solutions like feature merging and image upsampling/tiling, which don't effectively handle scale variations.
Method: Proposes GECO2 with a new dense query representation that gradually aggregates exemplar-specific feature information across scales, creating high-resolution dense queries for detecting both large and small objects.
Result: GECO2 surpasses state-of-the-art few-shot counters by 10% in both counting and detection accuracy, while running 3x faster with smaller GPU memory footprint.
Conclusion: The proposed dense query representation effectively addresses object scale issues in few-shot counting and detection, achieving superior performance and efficiency compared to existing methods.
Abstract: Few-shot detection-based counters estimate the number of instances in the image specified only by a few test-time exemplars. A common approach to localize objects across multiple sizes is to merge backbone features of different resolutions. Furthermore, to enable small object detection in densely populated regions, the input image is commonly upsampled and tiling is applied to cope with the increased computational and memory requirements. Because of these ad-hoc solutions, existing counters struggle with images containing diverse-sized objects and densely populated regions of small objects. We propose GECO2, an end-to-end few-shot counting and detection method that explicitly addresses the object scale issues. A new dense query representation gradually aggregates exemplar-specific feature information across scales, leading to high-resolution dense queries that enable detection of large as well as small objects. GECO2 surpasses state-of-the-art few-shot counters in counting as well as detection accuracy by 10%, while running 3x faster with a smaller GPU memory footprint.
[172] Taming Identity Consistency and Prompt Diversity in Diffusion Models via Latent Concatenation and Masked Conditional Flow Matching
Aditi Singhania, Arushi Jain, Krutik Malani, Riddhi Dhawan, Souymodip Chakraborty, Vineet Batra, Ankit Phogat
Main category: cs.CV
TL;DR: A LoRA fine-tuned diffusion model with latent concatenation and masked CFM objective for subject-driven image generation, trained using a two-stage distilled data curation framework and evaluated with CHARIS framework.
Details
Motivation: Address the fundamental trade-off between strong identity consistency and high prompt diversity in subject-driven image generation.
Method: LoRA fine-tuned diffusion model with latent concatenation strategy, masked Conditional Flow Matching objective, and two-stage Distilled Data Curation Framework for large-scale training.
Result: Enables robust identity preservation without architectural modifications and scales generation capability across various subjects and contexts.
Conclusion: The proposed approach achieves effective subject-driven image generation with improved identity preservation and diversity, validated through the CHARIS evaluation framework.
Abstract: Subject-driven image generation aims to synthesize novel depictions of a specific subject across diverse contexts while preserving its core identity features. Achieving both strong identity consistency and high prompt diversity presents a fundamental trade-off. We propose a LoRA fine-tuned diffusion model employing a latent concatenation strategy, which jointly processes reference and target images, combined with a masked Conditional Flow Matching (CFM) objective. This approach enables robust identity preservation without architectural modifications. To facilitate large-scale training, we introduce a two-stage Distilled Data Curation Framework: the first stage leverages data restoration and VLM-based filtering to create a compact, high-quality seed dataset from diverse sources; the second stage utilizes these curated examples for parameter-efficient fine-tuning, thus scaling the generation capability across various subjects and contexts. Finally, for filtering and quality assessment, we present CHARIS, a fine-grained evaluation framework that performs attribute-level comparisons along five key axes: identity consistency, prompt adherence, region-wise color fidelity, visual quality, and transformation diversity.
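A masked conditional flow matching objective of the kind named above can be sketched generically: the standard CFM regression loss, restricted by a spatial mask (e.g. the target half of a concatenated latent). This is a generic sketch of CFM with a linear interpolation path, not the paper's exact objective; `masked_cfm_loss` and its interface are assumptions.

```python
import numpy as np

def masked_cfm_loss(x0, x1, mask, velocity_fn, rng):
    """Conditional flow matching with a spatial mask.

    x0: noise sample, x1: data sample (same shape); mask: 1 where the
    loss applies (e.g. the target region of a concatenated latent);
    velocity_fn(xt, t): the model's predicted velocity field.
    """
    # Sample one timestep per batch element, broadcast over spatial dims.
    t = rng.uniform(size=(x0.shape[0],) + (1,) * (x0.ndim - 1))
    xt = (1 - t) * x0 + t * x1       # linear interpolation path
    target_v = x1 - x0               # ground-truth velocity along the path
    pred_v = velocity_fn(xt, t)
    sq_err = (pred_v - target_v) ** 2
    return (sq_err * mask).sum() / (mask.sum() + 1e-8)
```

Masking means the reference region conditions the prediction but contributes no gradient, which is one way to get identity preservation without architectural changes.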
[173] I2E: Real-Time Image-to-Event Conversion for High-Performance Spiking Neural Networks
Ruichen Ma, Liwei Meng, Guanchao Qiao, Ning Ning, Yang Liu, Shaogang Hu
Main category: cs.CV
TL;DR: I2E converts static images into event streams for SNN training, achieving 300x faster conversion than prior methods and enabling state-of-the-art accuracy of 60.50% on ImageNet and 92.5% on CIFAR10-DVS.
Details
Motivation: Address the critical scarcity of event-stream data that hinders adoption of energy-efficient spiking neural networks (SNNs).
Method: I2E framework simulates microsaccadic eye movements using highly parallelized convolution to convert static images into high-fidelity event streams.
Result: Achieves 300x faster conversion speed, 60.50% accuracy on I2E-ImageNet, and 92.5% accuracy on CIFAR10-DVS through sim-to-real pre-training and fine-tuning.
Conclusion: I2E provides a scalable solution to SNN data scarcity, establishes synthetic event data as high-fidelity proxy for real sensor data, and offers foundational toolkit for neuromorphic systems.
Abstract: Spiking neural networks (SNNs) promise highly energy-efficient computing, but their adoption is hindered by a critical scarcity of event-stream data. This work introduces I2E, an algorithmic framework that resolves this bottleneck by converting static images into high-fidelity event streams. By simulating microsaccadic eye movements with a highly parallelized convolution, I2E achieves a conversion speed over 300x faster than prior methods, uniquely enabling on-the-fly data augmentation for SNN training. The framework’s effectiveness is demonstrated on large-scale benchmarks. An SNN trained on the generated I2E-ImageNet dataset achieves a state-of-the-art accuracy of 60.50%. Critically, this work establishes a powerful sim-to-real paradigm where pre-training on synthetic I2E data and fine-tuning on the real-world CIFAR10-DVS dataset yields an unprecedented accuracy of 92.5%. This result validates that synthetic event data can serve as a high-fidelity proxy for real sensor data, bridging a long-standing gap in neuromorphic engineering. By providing a scalable solution to the data problem, I2E offers a foundational toolkit for developing high-performance neuromorphic systems. The open-source algorithm and all generated datasets are provided to accelerate research in the field.
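The microsaccade idea can be illustrated with a naive (non-parallelized) sketch: shift the image slightly, and fire ON/OFF events wherever the per-pixel log-intensity change exceeds a contrast threshold, as in an event-camera model. This is a simplified illustration of the concept, not I2E's actual implementation; the function name, threshold, and periodic boundary handling via `np.roll` are assumptions.

```python
import numpy as np

def image_to_events(img, shifts, threshold=0.1):
    """Convert a static image to a stream of (t, y, x, polarity) events
    by simulating small eye-movement shifts of the image."""
    log_img = np.log(img.astype(np.float64) + 1e-3)
    ref = log_img
    events = []
    for t, (dy, dx) in enumerate(shifts):
        # Shift with wrap-around (periodic boundary, for simplicity).
        shifted = np.roll(np.roll(log_img, dy, axis=0), dx, axis=1)
        diff = shifted - ref
        ys, xs = np.nonzero(np.abs(diff) >= threshold)
        for y, x in zip(ys, xs):
            events.append((t, y, x, 1 if diff[y, x] > 0 else -1))
        ref = shifted
    return events
```

On a step-edge image, a one-pixel horizontal shift fires ON events on one side of the edge and OFF events on the other, mimicking what a DVS sensor would record.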
[174] Radar-APLANC: Unsupervised Radar-based Heartbeat Sensing via Augmented Pseudo-Label and Noise Contrast
Ying Wang, Zhaodong Sun, Xu Cheng, Zuxian He, Xiaobai Li
Main category: cs.CV
TL;DR: Radar-APLANC is the first unsupervised framework for radar-based heartbeat sensing that uses augmented pseudo-labels and noise contrastive learning to achieve performance comparable to supervised methods without requiring expensive ground-truth physiological signals.
Details
Motivation: Traditional radar-based heartbeat sensing methods suffer from performance degradation due to noise, while learning-based methods require costly labeled signals for supervised training. There's a need for unsupervised approaches that can handle noise without expensive ground-truth data.
Method: Proposes Radar-APLANC framework using heartbeat range and noise range within radar range matrix to construct positive/negative samples. Uses Noise-Contrastive Triplet (NCT) loss with pseudo-label signals from traditional radar methods. Includes pseudo-label augmentation with adaptive noise-aware label selection to improve signal quality.
Result: Extensive experiments on Equipleth dataset and collected radar dataset show that the unsupervised method achieves performance comparable to state-of-the-art supervised methods.
Conclusion: Radar-APLANC successfully enables unsupervised radar-based heartbeat sensing with noise robustness, eliminating the need for expensive ground-truth physiological signals while maintaining competitive performance with supervised approaches.
Abstract: Frequency Modulated Continuous Wave (FMCW) radars can measure subtle chest wall oscillations to enable non-contact heartbeat sensing. However, traditional radar-based heartbeat sensing methods face performance degradation due to noise. Learning-based radar methods achieve better noise robustness but require costly labeled signals for supervised training. To overcome these limitations, we propose the first unsupervised framework for radar-based heartbeat sensing via Augmented Pseudo-Label and Noise Contrast (Radar-APLANC). We propose to use both the heartbeat range and noise range within the radar range matrix to construct the positive and negative samples, respectively, for improved noise robustness. Our Noise-Contrastive Triplet (NCT) loss only utilizes positive samples, negative samples, and pseudo-label signals generated by the traditional radar method, thereby avoiding dependence on expensive ground-truth physiological signals. We further design a pseudo-label augmentation approach featuring adaptive noise-aware label selection to improve pseudo-label signal quality. Extensive experiments on the Equipleth dataset and our collected radar dataset demonstrate that our unsupervised method achieves performance comparable to state-of-the-art supervised methods. Our code, dataset, and supplementary materials can be accessed from https://github.com/RadarHRSensing/Radar-APLANC.
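The NCT loss described above has the shape of a standard triplet loss with the pseudo-label signal as anchor, the heartbeat-range feature as positive, and the noise-range feature as negative. The following is a minimal sketch under that reading, with Euclidean distances and an assumed margin; it is not the paper's exact formulation.

```python
import numpy as np

def nct_loss(anchor, positive, negative, margin=0.2):
    """Triplet-style sketch: pull the heartbeat-range feature (positive)
    toward the pseudo-label anchor and push the noise-range feature
    (negative) away, up to a margin. Inputs are 1-D feature vectors."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

Because the anchor is itself a pseudo-label from a traditional radar method, no ground-truth physiological signal enters the loss, which is the key to the unsupervised setup.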
[175] CLIP is All You Need for Human-like Semantic Representations in Stable Diffusion
Cameron Braunstein, Mariya Toneva, Eddy Ilg
Main category: cs.CV
TL;DR: The study investigates semantic understanding in Stable Diffusion, finding that CLIP text encoding, not the diffusion process, provides human-like semantic representation, while diffusion acts as a visual decoder.
Details
Motivation: To understand whether latent diffusion models like Stable Diffusion have meaningful semantic understanding of generated images, and to identify which components contribute to semantic representation.
Method: Probing Stable Diffusion with regression layers to predict semantic attributes from internal representations, comparing predictions against human annotations, and analyzing different stages of the generation process.
Result: Semantic understanding primarily comes from CLIP text encoding rather than the reverse diffusion process. Different semantic attributes have varying decoding accuracy, and attributes become harder to disambiguate during diffusion.
Conclusion: CLIP vision-language model provides human-like semantic representation, while the diffusion process serves as a visual decoder without strong semantic understanding.
Abstract: Latent diffusion models such as Stable Diffusion achieve state-of-the-art results on text-to-image generation tasks. However, the extent to which these models have a semantic understanding of the images they generate is not well understood. In this work, we investigate whether the internal representations used by these models during text-to-image generation contain semantic information that is meaningful to humans. To do so, we perform probing on Stable Diffusion with simple regression layers that predict semantic attributes for objects and evaluate these predictions against human annotations. Surprisingly, we find that this success can actually be attributed to the text encoding occurring in CLIP rather than the reverse diffusion process. We demonstrate that groups of specific semantic attributes have markedly different decoding accuracy than the average, and are thus represented to different degrees. Finally, we show that attributes become more difficult to disambiguate from one another during the inverse diffusion process, further demonstrating the strongest semantic representation of object attributes in CLIP. We conclude that the separately trained CLIP vision-language model is what determines the human-like semantic representation, and that the diffusion process instead takes the role of a visual decoder.
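The probing methodology, fitting simple regression layers from frozen internal features to human-annotated attribute scores, can be sketched with a closed-form ridge regression probe. The function name, the ridge regularizer, and the shapes are illustrative assumptions, not details from the paper.

```python
import numpy as np

def fit_linear_probe(feats, attrs, reg=1e-3):
    """Fit a ridge-regression probe mapping frozen internal features
    (n_samples, dim) to semantic attribute scores (n_samples, n_attrs).
    Returns weights w; predictions are feats @ w."""
    d = feats.shape[1]
    # Closed-form ridge solution: (F^T F + reg*I) w = F^T A
    w = np.linalg.solve(feats.T @ feats + reg * np.eye(d), feats.T @ attrs)
    return w
```

If such a probe decodes attributes accurately from CLIP text embeddings but not from later diffusion states, that supports the paper's conclusion that the semantics live in CLIP rather than in the denoising process.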
[176] Beyond the Pixels: VLM-based Evaluation of Identity Preservation in Reference-Guided Synthesis
Aditi Singhania, Krutik Malani, Riddhi Dhawan, Arushi Jain, Garv Tandon, Nippun Sharma, Souymodip Chakraborty, Vineet Batra, Ankit Phogat
Main category: cs.CV
TL;DR: Hierarchical evaluation framework for identity preservation in generative models using structured VLM reasoning and feature-level analysis.
Details
Motivation: Existing metrics fail to capture fine-grained identity changes and provide limited diagnostic insight for evaluating identity preservation in generative models.
Method: Decomposes identity assessment into feature-level transformations using hierarchical decision trees (type, style -> attribute -> feature) and prompts for concrete transformations rather than abstract similarity scores.
Result: Validated across four state-of-the-art generative models, showing strong alignment with human judgments in measuring identity consistency. Introduced new benchmark with 1,078 image-prompt pairs spanning diverse subject types.
Conclusion: The framework grounds VLM analysis in verifiable visual evidence, reducing hallucinations and improving consistency in identity preservation evaluation.
Abstract: Evaluating identity preservation in generative models remains a critical yet unresolved challenge. Existing metrics rely on global embeddings or coarse VLM prompting, failing to capture fine-grained identity changes and providing limited diagnostic insight. We introduce Beyond the Pixels, a hierarchical evaluation framework that decomposes identity assessment into feature-level transformations. Our approach guides VLMs through structured reasoning by (1) hierarchically decomposing subjects into (type, style) -> attribute -> feature decision tree, and (2) prompting for concrete transformations rather than abstract similarity scores. This decomposition grounds VLM analysis in verifiable visual evidence, reducing hallucinations and improving consistency. We validate our framework across four state-of-the-art generative models, demonstrating strong alignment with human judgments in measuring identity consistency. Additionally, we introduce a new benchmark specifically designed to stress-test generative models. It comprises 1,078 image-prompt pairs spanning diverse subject types, including underrepresented categories such as anthropomorphic and animated characters, and captures an average of six to seven transformation axes per prompt.
[177] StableMorph: High-Quality Face Morph Generation with Stable Diffusion
Wassim Kabbani, Kiran Raja, Raghavendra Ramachandra, Christoph Busch
Main category: cs.CV
TL;DR: StableMorph is a diffusion-based method that generates high-quality, artifact-free morphed face images to improve morphing attack detection evaluation.
Details
Motivation: Existing morph generation methods produce blurry, artifact-ridden images that are easy to detect and don't represent real-world attack challenges, limiting the development of effective MAD systems.
Method: Uses modern diffusion-based image synthesis to generate full-head morphed face images with sharp details, avoiding visual flaws while offering control over visual attributes.
Result: StableMorph produces images that rival or exceed genuine face image quality and effectively fool face recognition systems, posing greater challenges to existing MAD solutions.
Conclusion: StableMorph sets a new standard for morph quality, improves biometric security evaluation by creating more realistic attacks, and supports development of more robust detection systems.
Abstract: Face morphing attacks threaten the integrity of biometric identity systems by enabling multiple individuals to share a single identity. To develop and evaluate effective morphing attack detection (MAD) systems, we need access to high-quality, realistic morphed images that reflect the challenges posed in real-world scenarios. However, existing morph generation methods often produce images that are blurry, riddled with artifacts, or poorly constructed, making them easy to detect and unrepresentative of the most dangerous attacks. In this work, we introduce StableMorph, a novel approach that generates highly realistic, artifact-free morphed face images using modern diffusion-based image synthesis. Unlike prior methods, StableMorph produces full-head images with sharp details, avoids common visual flaws, and offers unmatched control over visual attributes. Through extensive evaluation, we show that StableMorph images not only rival or exceed the quality of genuine face images but also maintain a strong ability to fool face recognition systems, posing a greater challenge to existing MAD solutions and setting a new standard for morph quality in research and operational testing. StableMorph improves the evaluation of biometric security by creating more realistic and effective attacks and supports the development of more robust detection systems.
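The abstract does not detail how StableMorph blends the two identities. A common building block in diffusion-based morphing pipelines is spherical interpolation (slerp) between the subjects' identity embeddings or latents; the sketch below illustrates that generic technique under that assumption, not StableMorph's confirmed method.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two embedding vectors; blends
    identities along the great circle instead of a straight line."""
    z0n = z0 / np.linalg.norm(z0)
    z1n = z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(z0n @ z1n, -1.0, 1.0))  # angle between inputs
    if omega < 1e-8:                                  # (nearly) parallel case
        return (1 - t) * z0 + t * z1
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
idA = rng.normal(size=64)
idA /= np.linalg.norm(idA)          # hypothetical identity embedding A
idB = rng.normal(size=64)
idB /= np.linalg.norm(idB)          # hypothetical identity embedding B
mid = slerp(idA, idB, 0.5)          # candidate "morph" conditioning vector
```

The midpoint is equally similar to both endpoints, which matches the goal of a morph that both enrolled identities can match against.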
[178] Introducing Nylon Face Mask Attacks: A Dataset for Evaluating Generalised Face Presentation Attack Detection
Manasa, Sushrut Patwardhan, Narayan Vetrekar, Pavan Kumar, R. S. Gad, Raghavendra Ramachandra
Main category: cs.CV
TL;DR: A new dataset for face recognition security testing introduces Nylon Face Masks (NFMs) as realistic 3D spoofing instruments, collected using iPhone 11 Pro with over 55,000 samples, revealing significant vulnerabilities in current presentation attack detection methods.
Details
Motivation: Face recognition systems are widely deployed but vulnerable to presentation attacks; particularly concerning are advanced 3D spoofing methods like Nylon Face Masks that can closely mimic victims' facial geometry with elastic structure and photorealistic appearance.
Method: Created a novel dataset using an iPhone 11 Pro, capturing 3,760 bona fide samples from 100 subjects and 51,281 NFM attack samples across four presentation scenarios involving humans and mannequins, then benchmarked with five state-of-the-art PAD methods.
Result: Evaluation showed significant performance variability across PAD methods, demonstrating that current techniques struggle with NFM attacks and highlighting challenges in detecting these emerging spoofing threats.
Conclusion: NFMs pose serious security threats to face recognition systems, and there’s an urgent need to develop PAD techniques that can effectively generalize to handle such advanced and realistic presentation attacks.
Abstract: Face recognition systems are increasingly deployed across a wide range of applications, including smartphone authentication, access control, and border security. However, these systems remain vulnerable to presentation attacks (PAs), which can significantly compromise their reliability. In this work, we introduce a new dataset focused on a novel and realistic presentation attack instrument called Nylon Face Masks (NFMs), designed to simulate advanced 3D spoofing scenarios. NFMs are particularly concerning due to their elastic structure and photorealistic appearance, which enable them to closely mimic the victim’s facial geometry when worn by an attacker. To reflect real-world smartphone-based usage conditions, we collected the dataset using an iPhone 11 Pro, capturing 3,760 bona fide samples from 100 subjects and 51,281 NFM attack samples across four distinct presentation scenarios involving both humans and mannequins. We benchmark the dataset using five state-of-the-art PAD methods to evaluate their robustness under unseen attack conditions. The results demonstrate significant performance variability across methods, highlighting the challenges posed by NFMs and underscoring the importance of developing PAD techniques that generalise effectively to emerging spoofing threats.
[179] LatentPrintFormer: A Hybrid CNN-Transformer with Spatial Attention for Latent Fingerprint identification
Arnab Maity, Manasa, Pavan Kumar C, Raghavendra Ramachandra
Main category: cs.CV
TL;DR: LatentPrintFormer combines CNN and Transformer backbones with spatial attention for latent fingerprint identification, outperforming state-of-the-art methods.
Details
Motivation: Latent fingerprint identification is challenging due to low image quality, background noise, and partial impressions.
Method: Integrates EfficientNet-B0 (CNN) and Swin Tiny (Transformer) backbones with spatial attention module to extract local and global features, then fuses them into 512D embeddings for cosine similarity matching.
Result: Outperforms three state-of-the-art latent fingerprint recognition techniques on two public datasets, achieving higher identification rates up to Rank-10.
Conclusion: The proposed LatentPrintFormer effectively addresses challenges in latent fingerprint identification through multi-backbone feature extraction and attention mechanisms.
Abstract: Latent fingerprint identification remains a challenging task due to low image quality, background noise, and partial impressions. In this work, we propose a novel identification approach called LatentPrintFormer. The proposed model integrates a CNN backbone (EfficientNet-B0) and a Transformer backbone (Swin Tiny) to extract both local and global features from latent fingerprints. A spatial attention module is employed to emphasize high-quality ridge regions while suppressing background noise. The extracted features are fused and projected into a unified 512-dimensional embedding, and matching is performed using cosine similarity in a closed-set identification setting. Extensive experiments on two publicly available datasets demonstrate that LatentPrintFormer consistently outperforms three state-of-the-art latent fingerprint recognition techniques, achieving higher identification rates up to Rank-10.
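The fuse-project-match pipeline above can be sketched as follows. The branch dimensions (1280 for the CNN features, 768 for the Transformer features), the random projection, and the synthetic gallery are illustrative assumptions; only the unified 512-D embedding and cosine-similarity matching come from the paper.

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def fuse_embed(local_feat, global_feat, w_proj):
    """Concatenate CNN (local) and Transformer (global) features,
    project to a unified 512-D embedding, and L2-normalize."""
    fused = np.concatenate([local_feat, global_feat], axis=-1)
    return l2norm(fused @ w_proj)

def identify(query_emb, gallery_embs, rank=10):
    """Closed-set identification: rank gallery entries by cosine
    similarity (a dot product, since embeddings are unit-length)."""
    sims = gallery_embs @ query_emb
    return np.argsort(-sims)[:rank]

rng = np.random.default_rng(1)
w_proj = rng.normal(size=(1280 + 768, 512)) * 0.02   # hypothetical projection
cnn_feats = rng.normal(size=(100, 1280))             # stand-in local features
vit_feats = rng.normal(size=(100, 768))              # stand-in global features
gallery = fuse_embed(cnn_feats, vit_feats, w_proj)   # 100 enrolled identities
query = l2norm(gallery[42] + 0.02 * rng.normal(size=512))  # noisy probe of ID 42
top = identify(query, gallery)                       # Rank-10 candidate list
```

A probe that is a lightly perturbed copy of an enrolled embedding should surface that identity at Rank-1; the Rank-10 list mirrors the paper's evaluation protocol.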
[180] Foam Segmentation in Wastewater Treatment Plants: A Federated Learning Approach with Segment Anything Model 2
Mehmet Batuhan Duman, Alejandro Carnero, Cristian Martín, Daniel Garrido, Manuel Díaz
Main category: cs.CV
TL;DR: Proposes a federated learning framework with SAM2 for foam segmentation in wastewater treatment plants, enabling privacy-preserving collaborative training across multiple plants without sharing sensitive data.
Details
Motivation: Foam formation in WTPs reduces treatment efficiency and increases costs. Standard ML models require large labeled datasets, but data scarcity, heterogeneity, and privacy concerns between different plants hinder development.
Method: Combines Federated Learning with the SAM2 image segmentation model. Uses the Flower framework for distributed training across edge nodes, with a central Fog server aggregating model weights without accessing private data. Fine-tunes SAM2 on distributed clients using real-world WTP images, synthetic foam data, and public datasets.
Result: The framework accelerates training convergence and improves segmentation performance even with limited local datasets by leveraging SAM2’s pre-trained weights. Successfully trained and validated using various data collections including real WTP images from Granada, Spain.
Conclusion: Provides a practical, scalable, and privacy-aware solution for automatic foam tracking in WTPs. Demonstrates significant potential of integrating large-scale foundational models into FL systems for industrial challenges with distributed sensitive data.
Abstract: Foam formation in Wastewater Treatment Plants (WTPs) is a major challenge that can reduce treatment efficiency and increase costs. The ability to automatically examine changes in real-time with respect to the percentage of foam can be of great benefit to the plant. However, large amounts of labeled data are required to train standard Machine Learning (ML) models. The development of these systems is slow due to the scarcity and heterogeneity of labeled data. Additionally, the development is often hindered by the fact that different WTPs do not share their data due to privacy concerns. This paper proposes a new framework to address these challenges by combining Federated Learning (FL) with the state-of-the-art base model for image segmentation, Segment Anything Model 2 (SAM2). The FL paradigm enables collaborative model training across multiple WTPs without centralizing sensitive operational data, thereby ensuring privacy. The framework accelerates training convergence and improves segmentation performance even with limited local datasets by leveraging SAM2’s strong pre-trained weights for initialization. The methodology involves fine-tuning SAM2 on distributed clients (edge nodes) using the Flower framework, where a central Fog server orchestrates the process by aggregating model weights without accessing private data. The model was trained and validated using various data collections, including real-world images captured at a WTP in Granada, Spain, a synthetically generated foam dataset, and images from publicly available datasets to improve generalization. This research offers a practical, scalable, and privacy-aware solution for automatic foam tracking in WTPs. The findings highlight the significant potential of integrating large-scale foundational models into FL systems to solve real-world industrial challenges characterized by distributed and sensitive data.
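The server-side aggregation step can be illustrated with plain FedAvg, where each parameter tensor is averaged in proportion to client dataset size. Whether this deployment uses vanilla FedAvg or a Flower variant is not stated; the parameter name and sizes below are hypothetical.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: average each parameter tensor across clients, weighted by
    dataset size. The server sees only weights, never the clients' images."""
    total = float(sum(client_sizes))
    return {
        k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in client_weights[0]
    }

# Two hypothetical plants with different amounts of local foam imagery
plant_a = {"mask_head.w": np.ones((2, 2)) * 1.0}   # 100 local samples
plant_b = {"mask_head.w": np.ones((2, 2)) * 3.0}   # 300 local samples
agg = fedavg([plant_a, plant_b], client_sizes=[100, 300])
# weighted mean per element: 1.0 * 0.25 + 3.0 * 0.75 = 2.5
```

In the paper's setup the aggregated weights would initialize the next round of local SAM2 fine-tuning on each edge node, preserving privacy while pooling learning signal.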
[181] OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition
Lixu Sun, Nurmemet Yolwas, Wushour Silamu
Main category: cs.CV
TL;DR: OTSNet is a neurocognitive-inspired three-stage STR model that achieves state-of-the-art performance through unified visual-linguistic optimization, addressing cross-modal misalignment and irregular text recognition challenges.
Details
Motivation: Existing STR frameworks suffer from decoupled visual-linguistic optimization that amplifies error propagation through cross-modal misalignment, with visual encoders biased toward background distractors and decoders struggling with geometrically deformed text.
Method: Proposes OTSNet with three core components: Dual Attention Macaron Encoder (DAME) for refined visual features, Position-Aware Module (PAM) and Semantic Quantizer (SQ) for spatial-semantic integration, and Multi-Modal Collaborative Verifier (MMCV) for cross-modal self-correction.
Result: Achieves 83.5% average accuracy on Union14M-L benchmark and 79.1% on heavily occluded OST dataset, establishing new records across 9 out of 14 evaluation scenarios.
Conclusion: OTSNet’s neurocognitive-inspired Observation-Thinking-Spelling pipeline effectively addresses STR challenges through unified modeling and cross-modal alignment, demonstrating superior performance on irregular text recognition.
Abstract: Scene Text Recognition (STR) remains challenging due to real-world complexities, where decoupled visual-linguistic optimization in existing frameworks amplifies error propagation through cross-modal misalignment. Visual encoders exhibit attention bias toward background distractors, while decoders suffer from spatial misalignment when parsing geometrically deformed text, collectively degrading recognition accuracy for irregular patterns. Inspired by the hierarchical cognitive processes in human visual perception, we propose OTSNet, a novel three-stage network embodying a neurocognitive-inspired Observation-Thinking-Spelling pipeline for unified STR modeling. The architecture comprises three core components: (1) a Dual Attention Macaron Encoder (DAME) that refines visual features through differential attention maps to suppress irrelevant regions and enhance discriminative focus; (2) a Position-Aware Module (PAM) and Semantic Quantizer (SQ) that jointly integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and (3) a Multi-Modal Collaborative Verifier (MMCV) that enforces self-correction through cross-modal fusion of visual, semantic, and character-level features. Extensive experiments demonstrate that OTSNet achieves state-of-the-art performance, attaining 83.5% average accuracy on the challenging Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, establishing new records across 9 out of 14 evaluation scenarios.
[182] PEOD: A Pixel-Aligned Event-RGB Benchmark for Object Detection under Challenging Conditions
Luoping Cui, Hanqing Liu, Mingjie Liu, Endian Lin, Donghong Jiang, Yuhao Wang, Chuang Zhu
Main category: cs.CV
TL;DR: PEOD is the first large-scale, high-resolution (1280x720) Event-RGB dataset for object detection under challenging conditions, addressing limitations of existing datasets with sparse extreme condition coverage and low resolution.
Details
Motivation: Existing Event-RGB datasets have sparse coverage of extreme conditions and low spatial resolution (≤640x480), preventing comprehensive evaluation of detectors in challenging scenarios.
Method: Created the PEOD dataset with 130+ spatiotemporally aligned sequences and 340k manual bounding boxes, with 57% of the data captured under low-light, overexposure, and high-speed motion. Benchmarked 14 methods across three input configurations: Event-based, RGB-based, and Event-RGB fusion.
Result: Fusion-based models achieve excellent performance on the full test set and the normal subset. In the illumination-challenge subset, the top event-based model outperforms all fusion models, while fusion models still outperform their RGB-based counterparts, revealing the limits of existing fusion methods when the frame modality is severely degraded.
Conclusion: PEOD establishes a realistic, high-quality benchmark for multimodal perception and facilitates future research in robust object detection under challenging conditions.
Abstract: Robust object detection for challenging scenarios increasingly relies on event cameras, yet existing Event-RGB datasets remain constrained by sparse coverage of extreme conditions and low spatial resolution (<= 640 x 480), which prevents comprehensive evaluation of detectors under challenging scenarios. To address these limitations, we propose PEOD, the first large-scale, pixel-aligned and high-resolution (1280 x 720) Event-RGB dataset for object detection under challenging conditions. PEOD contains 130+ spatiotemporally aligned sequences and 340k manual bounding boxes, with 57% of data captured under low-light, overexposure, and high-speed motion. Furthermore, we benchmark 14 methods across three input configurations (Event-based, RGB-based, and Event-RGB fusion) on PEOD. On the full test set and normal subset, fusion-based models achieve excellent performance. However, in the illumination-challenge subset, the top event-based model outperforms all fusion models, while fusion models still outperform their RGB-based counterparts, indicating limits of existing fusion methods when the frame modality is severely degraded. PEOD establishes a realistic, high-quality benchmark for multimodal perception and facilitates future research.
[183] Boomda: Balanced Multi-objective Optimization for Multimodal Domain Adaptation
Jun Sun, Xinxin Zhang, Simin Hong, Jian Zhu, Xiang Gao
Main category: cs.CV
TL;DR: Boomda is a multimodal domain adaptation method that addresses varying domain shifts across modalities through balanced multi-objective optimization and information bottleneck representations.
Details
Motivation: Address the challenge of expensive manual annotation in multimodal learning by exploring unsupervised domain adaptation, which remains understudied in multimodal settings compared to unimodal ones.
Method: Use information bottleneck to learn modality representations independently, align source and target domains with correlation alignment, formulate as multi-objective optimization for a Pareto optimal solution, and solve via quadratic programming with a closed-form solution.
Result: Extensive empirical results show Boomda outperforms competing schemes in multimodal domain adaptation tasks.
Conclusion: Boomda provides an effective modality-balanced approach for multimodal domain adaptation with demonstrated superior performance over existing methods.
Abstract: Multimodal learning, while contributing to numerous success stories across various fields, faces the challenge of prohibitively expensive manual annotation. To address the scarcity of annotated data, a popular solution is unsupervised domain adaptation, which has been extensively studied in unimodal settings yet remains less explored in multimodal settings. In this paper, we investigate heterogeneous multimodal domain adaptation, where the primary challenge is the varying domain shifts of different modalities from the source to the target domain. We first introduce the information bottleneck method to learn representations for each modality independently, and then match the source and target domains in the representation space with correlation alignment. To balance the domain alignment of all modalities, we formulate the problem as a multi-objective task, aiming for a Pareto optimal solution. By exploiting the properties specific to our model, the problem can be simplified to a quadratic programming problem. Further approximation yields a closed-form solution, leading to an efficient modality-balanced multimodal domain adaptation algorithm. The proposed method features Balanced multi-objective optimization for multimodal domain adaptation, termed Boomda. Extensive empirical results showcase the effectiveness of the proposed approach and demonstrate that Boomda outperforms the competing schemes. The code is available at: https://github.com/sunjunaimer/Boomda.git.
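Boomda's exact closed form is not reproduced in the abstract, but for the two-objective case the classic min-norm (MGDA-style) weighting does admit a one-line solution, sketched here as an illustration of the general idea rather than the paper's formula:

```python
import numpy as np

def two_task_weight(g1, g2):
    """Min-norm point on the segment between two task gradients:
    gamma* = clip(<g2 - g1, g2> / ||g1 - g2||^2, 0, 1),
    giving the combined direction d = gamma*g1 + (1-gamma)*g2."""
    diff = g1 - g2
    denom = diff @ diff
    if denom < 1e-12:           # gradients (nearly) identical
        return 0.5
    return float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))

# Symmetric example: orthogonal unit gradients -> equal weighting
g1 = np.array([1.0, 0.0])
g2 = np.array([0.0, 1.0])
gamma = two_task_weight(g1, g2)          # 0.5 by symmetry
d = gamma * g1 + (1 - gamma) * g2        # descent direction for both tasks
```

The resulting direction has a non-negative inner product with every task gradient, which is what makes the update a common-descent (Pareto-improving) step; Boomda generalizes this balancing across all modalities.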
[184] Non-Aligned Reference Image Quality Assessment for Novel View Synthesis
Abhijay Ghildyal, Rajesh Sureddi, Nabajeet Barman, Saman Zadtootaghaj, Alan Bovik
Main category: cs.CV
TL;DR: Proposes a Non-Aligned Reference (NAR-IQA) framework for evaluating perceptual quality of Novel View Synthesis images when reference views are not pixel-aligned, using contrastive learning with synthetic distortions to achieve better generalization.
Details
Motivation: Existing Full-Reference IQA methods fail with misaligned references, while No-Reference methods struggle with generalization for Novel View Synthesis quality assessment.
Method: Built a large-scale dataset with synthetic distortions targeting Temporal Regions of Interest, trained a contrastive learning model using LoRA-enhanced DINOv2 embeddings supervised by existing IQA methods, avoiding overfitting to specific real samples.
Result: Outperforms state-of-the-art FR-IQA, NR-IQA, and NAR-IQA methods on both aligned and non-aligned references, with strong correlation to human preferences from user study.
Conclusion: The proposed NAR-IQA framework effectively addresses the challenge of quality assessment for Novel View Synthesis with non-aligned references, demonstrating robust performance and good correlation with human perception.
Abstract: Evaluating the perceptual quality of Novel View Synthesis (NVS) images remains a key challenge, particularly in the absence of pixel-aligned ground truth references. Full-Reference Image Quality Assessment (FR-IQA) methods fail under misalignment, while No-Reference (NR-IQA) methods struggle with generalization. In this work, we introduce a Non-Aligned Reference (NAR-IQA) framework tailored for NVS, where it is assumed that the reference view shares partial scene content but lacks pixel-level alignment. We constructed a large-scale image dataset containing synthetic distortions targeting Temporal Regions of Interest (TROI) to train our NAR-IQA model. Our model is built on a contrastive learning framework that incorporates LoRA-enhanced DINOv2 embeddings and is guided by supervision from existing IQA methods. We train exclusively on synthetically generated distortions, deliberately avoiding overfitting to specific real NVS samples and thereby enhancing the model’s generalization capability. Our model outperforms state-of-the-art FR-IQA, NR-IQA, and NAR-IQA methods, achieving robust performance on both aligned and non-aligned references. We also conducted a novel user study to gather data on human preferences when viewing non-aligned references in NVS. We find strong correlation between our proposed quality prediction model and the collected subjective ratings. For dataset and code, please visit our project page: https://stootaghaj.github.io/nova-project/
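A minimal sketch of the kind of contrastive objective such a model might optimize is the standard InfoNCE loss over cosine similarities. The batch size, embedding dimension, and temperature below are illustrative; the paper's exact loss may differ.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over cosine similarities: each anchor's positive is the
    matching row; the other rows in the batch serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 64))                      # stand-in DINOv2-style embeddings
loss_aligned = info_nce(emb, emb)                   # matched pairs -> low loss
loss_random = info_nce(emb, rng.normal(size=(8, 64)))  # unrelated pairs -> high loss
```

Training on synthetic distortion pairs with a loss of this shape pulls a distorted view toward its (non-aligned) reference in embedding space, which is the mechanism the NAR-IQA framework relies on.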
[185] LandSegmenter: Towards a Flexible Foundation Model for Land Use and Land Cover Mapping
Chenying Liu, Wei Huang, Xiao Xiang Zhu
Main category: cs.CV
TL;DR: LandSegmenter is a Land Use and Land Cover foundation model framework that addresses data scarcity through weak supervision, cross-modal feature extraction, and zero-shot transfer learning across diverse remote sensing datasets.
Details
Motivation: Current LULC models are limited by modality specificity, fixed taxonomies, and heavy reliance on expensive labeled data, which is impractical in remote sensing. Foundation models offer potential but face challenges with fine-tuning requirements and data demands.
Method: Proposes the LandSegmenter framework with three components: 1) the LAS dataset, using weak labels from existing LULC products for scalable training; 2) an RS-specific adapter and text encoder for cross-modal features and semantic awareness; 3) class-wise confidence-guided fusion for zero-shot performance improvement.
Result: Evaluated on six LULC datasets across diverse modalities and taxonomies. Achieves competitive or superior performance in transfer learning and zero-shot settings, particularly excelling when transferred to unseen datasets without additional training.
Conclusion: LandSegmenter demonstrates the efficacy of weak supervision for building task-specific foundation models in remote sensing, enabling scalable LULC mapping with strong zero-shot transfer capabilities across diverse domains and modalities.
Abstract: Land Use and Land Cover (LULC) mapping is a fundamental task in Earth Observation (EO). However, current LULC models are typically developed for a specific modality and a fixed class taxonomy, limiting their generalizability and broader applicability. Recent advances in foundation models (FMs) offer promising opportunities for building universal models. Yet, task-agnostic FMs often require fine-tuning for downstream applications, whereas task-specific FMs rely on massive amounts of labeled data for training, which is costly and impractical in the remote sensing (RS) domain. To address these challenges, we propose LandSegmenter, an LULC FM framework that resolves three-stage challenges at the input, model, and output levels. From the input side, to alleviate the heavy demand on labeled data for FM training, we introduce LAnd Segment (LAS), a large-scale, multi-modal, multi-source dataset built primarily with globally sampled weak labels from existing LULC products. LAS provides a scalable, cost-effective alternative to manual annotation, enabling large-scale FM training across diverse LULC domains. For model architecture, LandSegmenter integrates an RS-specific adapter for cross-modal feature extraction and a text encoder for semantic awareness enhancement. At the output stage, we introduce a class-wise confidence-guided fusion strategy to mitigate semantic omissions and further improve LandSegmenter’s zero-shot performance. We evaluate LandSegmenter on six precisely annotated LULC datasets spanning diverse modalities and class taxonomies. Extensive transfer learning and zero-shot experiments demonstrate that LandSegmenter achieves competitive or superior performance, particularly in zero-shot settings when transferred to unseen datasets. These results highlight the efficacy of our proposed framework and the utility of weak supervision for building task-specific FMs.
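The class-wise confidence-guided fusion step can be sketched as a per-class weighted average of probability maps followed by renormalization. The exact weighting scheme is an assumption on our part; shapes follow the usual (sources, classes, height, width) convention.

```python
import numpy as np

def confidence_guided_fusion(prob_maps, class_conf):
    """Fuse per-source class probability maps, reweighting each source's
    contribution per class by a confidence score, then renormalizing so
    each pixel's class probabilities sum to one.
    prob_maps: (S, C, H, W) softmax outputs; class_conf: (S, C) scores."""
    w = class_conf / class_conf.sum(axis=0, keepdims=True)  # normalize over sources
    fused = (prob_maps * w[:, :, None, None]).sum(axis=0)   # (C, H, W)
    return fused / fused.sum(axis=0, keepdims=True)         # renormalize over classes

# Two hypothetical prediction sources, 3 classes, a 2x2 map
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(2, 2, 2)).transpose(0, 3, 1, 2)  # (S,C,H,W)
conf = np.array([[0.9, 0.2, 0.5],    # source 0: confident on class 0
                 [0.1, 0.8, 0.5]])   # source 1: confident on class 1
fused = confidence_guided_fusion(probs, conf)
```

Class-wise (rather than global) weights let a source that is reliable for, say, water but weak for cropland contribute selectively, which is the motivation for this style of fusion.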
[186] Multi-Granularity Mutual Refinement Network for Zero-Shot Learning
Ning Wang, Long Yu, Cong Hua, Guangming Zhu, Lin Mei, Syed Afaq Ali Shah, Mohammed Bennamoun, Liang Zhang
Main category: cs.CV
TL;DR: Proposes Mg-MRN, a multi-granularity mutual refinement network for zero-shot learning that enhances visual feature discriminability through decoupled multi-granularity feature learning and cross-granularity interactions.
Details
Motivation: Current ZSL methods overlook intrinsic interactions between local region features, limiting the acquisition of transferable and explicit visual features needed for recognizing unseen classes.
Method: Uses multi-granularity feature extraction to learn decoupled region-level features, followed by cross-granularity feature fusion that strengthens interactions between region features of varying granularities.
Result: Extensive experiments on three popular ZSL benchmark datasets demonstrate the superiority and competitiveness of the proposed Mg-MRN method.
Conclusion: The Mg-MRN network effectively refines discriminative and transferable visual features through multi-granularity learning and cross-granularity interactions, improving ZSL recognition performance.
Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes with zero samples by transferring semantic knowledge from seen classes. Current approaches typically correlate global visual features with semantic information (i.e., attributes) or align local visual region features with corresponding attributes to enhance visual-semantic interactions. Although effective, these methods often overlook the intrinsic interactions between local region features, which can further improve the acquisition of transferable and explicit visual features. In this paper, we propose a network named Multi-Granularity Mutual Refinement Network (Mg-MRN), which refines discriminative and transferable visual features by learning decoupled multi-granularity features and cross-granularity feature interactions. Specifically, we design a multi-granularity feature extraction module to learn region-level discriminative features through decoupled region feature mining. Then, a cross-granularity feature fusion module strengthens the inherent interactions between region features of varying granularities. This module enhances the discriminability of representations at each granularity level by integrating region representations from adjacent hierarchies, further improving ZSL recognition performance. Extensive experiments on three popular ZSL benchmark datasets demonstrate the superiority and competitiveness of our proposed Mg-MRN method. Our code is available at https://github.com/NingWang2049/Mg-MRN.
[187] KPLM-STA: Physically-Accurate Shadow Synthesis for Human Relighting via Keypoint-Based Light Modeling
Xinhui Yin, Qifei Li, Yilin Guo, Hongxia Xie, Xiaoli Zhang
Main category: cs.CV
TL;DR: Proposes a novel shadow generation framework using Keypoints Linear Model and Shadow Triangle Algorithm to create realistic and geometrically accurate shadows in image composition.
Details
Motivation: Existing diffusion-based methods like IC-Light still struggle with producing shadows that have both high appearance realism and geometric precision, especially for articulated human bodies in composite images.
Method: Uses Keypoints Linear Model (KPLM) with 9 keypoints and one bounding block to model articulated human bodies for physically plausible shadow projection, combined with Shadow Triangle Algorithm (STA) for explicit geometric computation of shadow angles, lengths, and positions.
Result: Achieves state-of-the-art performance on shadow realism benchmarks, particularly for complex human poses, and generalizes well to multi-directional relighting scenarios.
Conclusion: The proposed framework successfully addresses the limitations of existing methods by combining physical modeling with geometric computations to generate realistic and accurate shadows in image composition.
Abstract: Image composition aims to seamlessly integrate a foreground object into a background, where generating realistic and geometrically accurate shadows remains a persistent challenge. While recent diffusion-based methods have outperformed GAN-based approaches, existing techniques, such as the diffusion-based relighting framework IC-Light, still fall short in producing shadows with both high appearance realism and geometric precision, especially in composite images. To address these limitations, we propose a novel shadow generation framework based on a Keypoints Linear Model (KPLM) and a Shadow Triangle Algorithm (STA). KPLM models articulated human bodies using nine keypoints and one bounding block, enabling physically plausible shadow projection and dynamic shading across joints, thereby enhancing visual realism. STA further improves geometric accuracy by computing shadow angles, lengths, and spatial positions through explicit geometric formulations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on shadow realism benchmarks, particularly under complex human poses, and generalizes effectively to multi-directional relighting scenarios such as those supported by IC-Light.
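The STA computations can be illustrated with the basic shadow-geometry identity: a vertical feature of height h under a light at elevation θ casts a shadow of length h / tan(θ), directed away from the light's azimuth. The function below is a minimal sketch of that identity, not the paper's full keypoint algorithm.

```python
import math

def shadow_endpoint(base_x, base_y, height, light_elev_deg, light_azim_deg):
    """Shadow tip of a vertical segment of `height` standing at (base_x, base_y)
    on the ground plane: length = height / tan(elevation), cast in the
    direction opposite the light's azimuth."""
    length = height / math.tan(math.radians(light_elev_deg))
    theta = math.radians(light_azim_deg + 180.0)   # shadow points away from light
    return (base_x + length * math.cos(theta),
            base_y + length * math.sin(theta))

# 45-degree elevation: shadow length equals the segment height
tip = shadow_endpoint(0.0, 0.0, 2.0, 45.0, 0.0)
```

Applying this per keypoint (head, shoulders, joints, plus the bounding block) and connecting the projected points is how a KPLM-style model can produce a pose-consistent shadow silhouette.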
[188] Distributed Zero-Shot Learning for Visual Recognition
Zhi Chen, Yadan Luo, Zi Huang, Jingjing Li, Sen Wang, Xin Yu
Main category: cs.CV
TL;DR: Proposes DistZSL framework for distributed zero-shot learning using cross-node attribute regularization and global attribute-to-visual consensus to handle data heterogeneity.
Details
Motivation: To effectively learn zero-shot learning models from decentralized data while addressing data heterogeneity issues across distributed nodes.
Method: Uses cross-node attribute regularizer to stabilize attribute feature space and global attribute-to-visual consensus to mitigate biased visual-to-attribute mappings across nodes.
Result: Achieves superior performance compared to state-of-the-art methods in learning from distributed data.
Conclusion: DistZSL framework effectively handles distributed data heterogeneity and enhances zero-shot learning performance across decentralized nodes.
Abstract: In this paper, we propose a Distributed Zero-Shot Learning (DistZSL) framework that can fully exploit decentralized data to learn an effective model for unseen classes. Considering the data heterogeneity issues across distributed nodes, we introduce two key components to ensure the effective learning of DistZSL: a cross-node attribute regularizer and a global attribute-to-visual consensus. Our proposed cross-node attribute regularizer enforces the distances between attribute features to be similar across different nodes. In this manner, the overall attribute feature space would be stable during learning, and thus facilitate the establishment of visual-to-attribute (V2A) relationships. Then, we introduce the global attribute-to-visual consensus to mitigate biased V2A mappings learned from individual nodes. Specifically, we enforce the bilateral mapping between the attribute and visual feature distributions to be consistent across different nodes. Thus, the learned consistent V2A mapping can significantly enhance zero-shot learning across different nodes. Extensive experiments demonstrate that DistZSL achieves superior performance to the state-of-the-art in learning from distributed data.
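One way to read the cross-node attribute regularizer ("enforces the distances between attribute features to be similar across different nodes") is as a penalty on the mismatch between per-node pairwise-distance matrices. The sketch below is our illustrative formulation, not the paper's actual loss:

```python
import numpy as np

def pairwise_dists(feats):
    # feats: (num_classes, dim) attribute features held by one node
    diff = feats[:, None, :] - feats[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def cross_node_reg(node_feats):
    """Mean squared mismatch between each node's pairwise-distance matrix
    and the average matrix across nodes (illustrative toy penalty)."""
    dists = [pairwise_dists(f) for f in node_feats]
    mean_d = np.mean(dists, axis=0)
    return float(np.mean([(d - mean_d) ** 2 for d in dists]))

# Identical attribute geometry on every node incurs zero penalty.
f = np.random.default_rng(0).normal(size=(5, 8))
reg = cross_node_reg([f, f.copy()])
```

Scaling one node's features (e.g. `cross_node_reg([f, 2 * f])`) distorts its distance matrix and yields a positive penalty, which is the instability the regularizer is meant to suppress.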
[189] VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion
Samet Hicsonmez, Abd El Rahman Shabayek, Djamila Aouada
Main category: cs.CV
TL;DR: VLMDiff is a novel unsupervised multi-class visual anomaly detection framework that combines Latent Diffusion Models with Vision-Language Models to improve anomaly localization and detection without requiring per-class training.
Details
Motivation: Current diffusion-based anomaly detection methods rely on synthetic noise generation and require per-class model training, which limits generalization and scalability for multi-class real-world applications.
Method: Integrates pre-trained Vision-Language Model (VLM) with Latent Diffusion Model (LDM), using VLM-generated normal image descriptions as conditioning for LDM training to learn robust normal feature representations.
Result: Achieves competitive performance with significant improvements: up to 25 points higher PRO metric on Real-IAD dataset and 8 points higher on COCO-AD dataset compared to state-of-the-art diffusion-based methods.
Conclusion: The proposed VLMDiff framework effectively leverages vision-language models to enhance multi-class anomaly detection without manual annotations or per-class training, demonstrating superior performance over existing diffusion-based approaches.
Abstract: Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, limiting their generalization and requiring per-class model training, which hinders scalability. VLMDiff, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, learning a robust normal image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.
[190] WarpGAN: Warping-Guided 3D GAN Inversion with Style-Based Novel View Inpainting
Kaitao Huang, Yan Yan, Jing-Hao Xue, Hanzi Wang
Main category: cs.CV
TL;DR: WarpGAN improves 3D GAN inversion for single-shot novel view synthesis by introducing a warping-and-inpainting strategy that addresses poor quality in occluded regions through symmetry priors and multi-view consistency.
Details
Motivation: Existing 3D GAN inversion methods focus on visible regions but generate poor quality occluded regions due to information loss from low bit-rate latent codes, limiting realistic novel view synthesis.
Method: Proposes WarpGAN with three steps: 1) 3D GAN inversion encoder projects single-view image to latent code, 2) warping to novel view using 3D GAN depth map, 3) SVINet uses symmetry prior and multi-view correspondence to inpaint occluded regions.
Result: Quantitative and qualitative experiments show WarpGAN consistently outperforms state-of-the-art methods in novel view synthesis quality.
Conclusion: The warping-and-inpainting strategy effectively improves occluded region generation in 3D GAN inversion, achieving better multi-view consistency and realism in single-shot novel view synthesis.
Abstract: 3D GAN inversion projects a single image into the latent space of a pre-trained 3D GAN to achieve single-shot novel view synthesis, which requires visible regions with high fidelity and occluded regions with realism and multi-view consistency. However, existing methods focus on the reconstruction of visible regions, while the generation of occluded regions relies only on the generative prior of 3D GAN. As a result, the generated occluded regions often exhibit poor quality due to the information loss caused by the low bit-rate latent code. To address this, we introduce the warping-and-inpainting strategy to incorporate image inpainting into 3D GAN inversion and propose a novel 3D GAN inversion method, WarpGAN. Specifically, we first employ a 3D GAN inversion encoder to project the single-view image into a latent code that serves as the input to 3D GAN. Then, we perform warping to a novel view using the depth map generated by 3D GAN. Finally, we develop a novel SVINet, which leverages the symmetry prior and multi-view image correspondence w.r.t. the same latent code to perform inpainting of occluded regions in the warped image. Quantitative and qualitative experiments demonstrate that our method consistently outperforms several state-of-the-art methods.
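The "warping to a novel view using the depth map" step is standard depth-based reprojection. A minimal sketch, assuming a pinhole camera with intrinsics K and a novel-view pose (R, t) (the `warp_coords` interface is ours, not the paper's):

```python
import numpy as np

def warp_coords(depth, K, R, t):
    """Backproject every pixel with its depth, move it into the novel
    view's camera frame, and reproject to new pixel coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    cam = np.linalg.inv(K) @ pix * depth.reshape(-1)  # 3D points, source frame
    cam2 = R @ cam + t[:, None]                       # 3D points, novel frame
    proj = K @ cam2
    return (proj[:2] / proj[2]).T.reshape(h, w, 2)    # target pixel coords

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
depth = np.full((64, 64), 2.0)
# With an identity pose, every pixel maps back onto itself.
coords = warp_coords(depth, K, np.eye(3), np.zeros(3))
```

Pixels that no source pixel maps onto are exactly the occluded regions that SVINet is then asked to inpaint.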
[191] Pixel-level Quality Assessment for Oriented Object Detection
Yunhui Zhu, Buliao Huang
Main category: cs.CV
TL;DR: PQA replaces box-level IoU prediction with pixel-level spatial consistency assessment to eliminate structural coupling bias in oriented object detection.
Details
Motivation: Box-level IoU prediction suffers from structural coupling where predicted IoU can be overestimated for poorly localized boxes due to inherent similarity between predicted and estimated ground-truth boxes.
Method: Proposes Pixel-level Quality Assessment (PQA) framework that measures alignment between pixel positions relative to predicted and ground-truth boxes, avoiding direct box comparison. Introduces integration metric to aggregate pixel-level consistency into unified quality score.
Result: Extensive experiments on HRSC2016 and DOTA datasets show consistent performance improvements: +5.96% AP50:95 on Rotated RetinaNet and +2.32% on STD detector.
Conclusion: PQA effectively eliminates similarity bias in box-level IoU prediction and can be seamlessly integrated into various oriented object detectors to improve localization quality assessment.
Abstract: Modern oriented object detectors typically predict a set of bounding boxes and select the top-ranked ones based on estimated localization quality. Achieving high detection performance requires that the estimated quality closely aligns with the actual localization accuracy. To this end, existing approaches predict the Intersection over Union (IoU) between the predicted and ground-truth (GT) boxes as a proxy for localization quality. However, box-level IoU prediction suffers from a structural coupling issue: since the predicted box is derived from the detector’s internal estimation of the GT box, the predicted IoU, which is based on their similarity, can be overestimated for poorly localized boxes. To overcome this limitation, we propose a novel Pixel-level Quality Assessment (PQA) framework, which replaces box-level IoU prediction with the integration of pixel-level spatial consistency. PQA measures the alignment between each pixel’s relative position to the predicted box and its corresponding position to the GT box. By operating at the pixel level, PQA avoids directly comparing the predicted box with the estimated GT box, thereby eliminating the inherent similarity bias in box-level IoU prediction. Furthermore, we introduce a new integration metric that aggregates pixel-level spatial consistency into a unified quality score, yielding a more accurate approximation of the actual localization quality. Extensive experiments on HRSC2016 and DOTA demonstrate that PQA can be seamlessly integrated into various oriented object detectors, consistently improving performance (e.g., +5.96% AP$_{50:95}$ on Rotated RetinaNet and +2.32% on STD).
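To make the pixel-level idea concrete: a pixel's position can be normalized relative to a box, and the agreement between its position relative to the predicted box and to the GT box scored directly. This is a hypothetical axis-aligned toy (the `rel_position` and `pixel_consistency` helpers are ours; the paper works with oriented boxes and its own integration metric):

```python
import numpy as np

def rel_position(pixel, box):
    """Pixel offset normalized by a box given as (cx, cy, w, h)."""
    cx, cy, w, h = box
    return np.array([(pixel[0] - cx) / w, (pixel[1] - cy) / h])

def pixel_consistency(pixels, pred_box, gt_box):
    """Mean agreement of per-pixel relative positions (1.0 = perfect)."""
    diffs = [np.abs(rel_position(p, pred_box) - rel_position(p, gt_box)).mean()
             for p in pixels]
    return 1.0 - float(np.mean(diffs))

pixels = [(x, y) for x in range(0, 100, 10) for y in range(0, 100, 10)]
perfect = pixel_consistency(pixels, (50, 50, 40, 40), (50, 50, 40, 40))
shifted = pixel_consistency(pixels, (60, 50, 40, 40), (50, 50, 40, 40))
```

Because each pixel is compared against both boxes independently, no similarity between the predicted and estimated GT box ever enters the score, which is the bias PQA removes.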
[192] UI2Code$^\text{N}$: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiele Cheng, Xiaotao Gu, Jie Tang
Main category: cs.CV
TL;DR: UI2Code$^\text{N}$ is a visual language model that introduces an interactive UI-to-code paradigm with multimodal coding capabilities, achieving state-of-the-art performance comparable to leading closed-source models.
Details
Motivation: Current UI programming approaches have limited multimodal coding capabilities and don't effectively use iterative visual feedback, making automatic UI coding complex and inefficient.
Method: Interactive UI-to-code paradigm with staged pretraining, fine-tuning, and reinforcement learning; unifies UI-to-code generation, UI editing, and UI polishing; implements test-time scaling for multi-turn feedback.
Result: Establishes new state-of-the-art among open-source models; achieves performance comparable to Claude-4-Sonnet and GPT-5 on UI-to-code and UI polishing benchmarks.
Conclusion: The interactive paradigm with unified multimodal capabilities significantly advances automatic UI coding, making it more practical and effective for real-world software development workflows.
Abstract: User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.
[193] UCDSC: Open Set UnCertainty aware Deep Simplex Classifier for Medical Image Datasets
Arnav Aditya, Nitin Kumar, Saurabh Shigwan
Main category: cs.CV
TL;DR: The paper proposes a loss function for open-set recognition in medical diagnosis that uses auxiliary datasets to penalize open space regions, achieving state-of-the-art performance on multiple medical datasets.
Details
Motivation: Medical AI faces challenges with limited data due to ethical/legal restrictions and high annotation costs, especially for rare diseases. Open-set recognition is crucial to identify unknown samples not seen during training.
Method: Introduces a loss function that leverages features clustering around class means arranged as simplex vertices, and uses auxiliary datasets to penalize open space regions for effective unknown class rejection.
Result: Achieves significant performance gains across four MedMNIST datasets (BloodMNIST, OCTMNIST, DermaMNIST, TissueMNIST) and a public skin dataset, outperforming state-of-the-art techniques.
Conclusion: The proposed method effectively addresses open-set recognition challenges in medical diagnosis, providing robust performance for identifying unknown classes using auxiliary data and simplex-based feature organization.
Abstract: Driven by advancements in deep learning, computer-aided diagnoses have made remarkable progress. However, outside controlled laboratory settings, algorithms may encounter several challenges. In the medical domain, these difficulties often stem from limited data availability due to ethical and legal restrictions, as well as the high cost and time required for expert annotations, especially in the face of emerging or rare diseases. In this context, open-set recognition plays a vital role by identifying whether a sample belongs to one of the known classes seen during training or should be rejected as an unknown. Recent studies have shown that features learned in the later stages of deep neural networks cluster around their class means, which themselves are arranged as individual vertices of a regular simplex [32]. The proposed method introduces a loss function designed to reject samples of unknown classes effectively by penalizing open space regions using auxiliary datasets. This approach achieves significant performance gains across four MedMNIST datasets (BloodMNIST, OCTMNIST, DermaMNIST, TissueMNIST) and a publicly available skin dataset [29], outperforming state-of-the-art techniques.
[194] Twist and Compute: The Cost of Pose in 3D Generative Diffusion
Kyle Fogarty, Jack Foster, Boqiao Zhang, Jing Yang, Cengiz Öztireli
Main category: cs.CV
TL;DR: Large-scale image-to-3D generative models exhibit strong canonical view bias, struggling with rotated inputs. A lightweight CNN can detect and correct orientation to restore performance without modifying the main model.
Details
Motivation: To identify and address the opacity of inductive biases in large-scale image-to-3D generative models, specifically the canonical view bias that limits generalization across viewpoints.
Method: Conducted controlled experiments using 2D rotations on Hunyuan3D 2.0 model. Proposed a lightweight CNN for detecting and correcting input orientation to mitigate the bias.
Result: State-of-the-art models show degraded performance under rotated inputs. The lightweight CNN successfully restores model performance without modifying the generative backbone.
Conclusion: Scale alone may not be sufficient for robust 3D generation; modular, symmetry-aware designs should be pursued to address fundamental biases in generative models.
Abstract: Despite their impressive results, large-scale image-to-3D generative models remain opaque in their inductive biases. We identify a significant limitation in image-conditioned 3D generative models: a strong canonical view bias. Through controlled experiments using simple 2D rotations, we show that the state-of-the-art Hunyuan3D 2.0 model can struggle to generalize across viewpoints, with performance degrading under rotated inputs. We show that this failure can be mitigated by a lightweight CNN that detects and corrects input orientation, restoring model performance without modifying the generative backbone. Our findings raise an important open question: Is scale enough, or should we pursue modular, symmetry-aware designs?
[195] Evaluating Gemini LLM in Food Image-Based Recipe and Nutrition Description with EfficientNet-B4 Visual Backbone
Rizal Khoirul Anam
Main category: cs.CV
TL;DR: This paper evaluates a multimodal pipeline combining EfficientNet-B4 for food recognition with Gemini LLM for nutritional analysis, finding that visual classification accuracy is the main bottleneck despite Gemini’s superior generative quality.
Details
Motivation: To address the need for robust automated nutritional analysis and culinary guidance in digital food applications, while evaluating trade-offs between visual classification accuracy, model efficiency, and generative output quality.
Method: Comparative evaluation of a decoupled multimodal pipeline using EfficientNet-B4 visual backbone with Gemini LLM, benchmarked against alternative vision models (VGG-16, ResNet-50, YOLOv8) and lightweight LLM (Gemma), with analysis of Semantic Error Propagation on a new Custom Chinese Food Dataset.
Result: EfficientNet-B4 achieved 89.0% Top-1 accuracy with best efficiency balance, Gemini scored 9.2/10 factual accuracy for superior generative quality, but system utility is bottlenecked by visual front-end accuracy, with high semantic similarity identified as critical failure mode.
Conclusion: The overall system performance is fundamentally limited by the visual classification module’s perceptive accuracy, despite the generative LLM’s capabilities, highlighting the importance of improving visual recognition for food analysis applications.
Abstract: The proliferation of digital food applications necessitates robust methods for automated nutritional analysis and culinary guidance. This paper presents a comprehensive comparative evaluation of a decoupled, multimodal pipeline for food recognition. We evaluate a system integrating a specialized visual backbone (EfficientNet-B4) with a powerful generative large language model (Google’s Gemini LLM). The core objective is to evaluate the trade-offs between visual classification accuracy, model efficiency, and the quality of generative output (nutritional data and recipes). We benchmark this pipeline against alternative vision backbones (VGG-16, ResNet-50, YOLOv8) and a lightweight LLM (Gemma). We introduce a formalization for “Semantic Error Propagation” (SEP) to analyze how classification inaccuracies from the visual module cascade into the generative output. Our analysis is grounded in a new Custom Chinese Food Dataset (CCFD) developed to address cultural bias in public datasets. Experimental results demonstrate that while EfficientNet-B4 (89.0% Top-1 Acc.) provides the best balance of accuracy and efficiency, and Gemini (9.2/10 Factual Accuracy) provides superior generative quality, the system’s overall utility is fundamentally bottlenecked by the visual front-end’s perceptive accuracy. We conduct a detailed per-class analysis, identifying high semantic similarity as the most critical failure mode.
[196] 2D Representation for Unguided Single-View 3D Super-Resolution in Real-Time
Ignasi Mas, Ivan Huerta, Ramon Morros, Javier Ruiz-Hidalgo
Main category: cs.CV
TL;DR: 2Dto3D-SR is a real-time single-view 3D super-resolution framework that converts 3D data into 2D representations using PNCC, enabling direct use of 2D super-resolution methods without requiring high-resolution RGB guidance.
Details
Motivation: To address the limitations of existing 3D super-resolution methods that rely on high-resolution RGB guidance or complex 3D point-based approaches, which are impractical in scenarios where high-resolution RGB data is unavailable.
Method: Encodes 3D data from a single viewpoint into structured 2D representation using Projected Normalized Coordinate Code (PNCC), allowing direct application of 2D image super-resolution architectures. Two implementations: Swin Transformers for accuracy and Vision Mamba for efficiency.
Result: Swin Transformer model achieves state-of-the-art accuracy on standard benchmarks, while Vision Mamba model delivers competitive results at real-time speeds, demonstrating the framework’s effectiveness across different performance requirements.
Conclusion: The geometry-guided pipeline provides a surprisingly simple yet practical solution for real-world 3D super-resolution, particularly useful when high-resolution RGB data is inaccessible, with flexibility to choose between accuracy-focused or efficiency-focused implementations.
Abstract: We introduce 2Dto3D-SR, a versatile framework for real-time single-view 3D super-resolution that eliminates the need for high-resolution RGB guidance. Our framework encodes 3D data from a single viewpoint into a structured 2D representation, enabling the direct application of existing 2D image super-resolution architectures. We utilize the Projected Normalized Coordinate Code (PNCC) to represent 3D geometry from a visible surface as a regular image, thereby circumventing the complexities of 3D point-based or RGB-guided methods. This design supports lightweight and fast models adaptable to various deployment environments. We evaluate 2Dto3D-SR with two implementations: one using Swin Transformers for high accuracy, and another using Vision Mamba for high efficiency. Experiments show the Swin Transformer model achieves state-of-the-art accuracy on standard benchmarks, while the Vision Mamba model delivers competitive results at real-time speeds. This establishes our geometry-guided pipeline as a surprisingly simple yet viable and practical solution for real-world scenarios, especially where high-resolution RGB data is inaccessible.
[197] Accurate and Efficient Surface Reconstruction from Point Clouds via Geometry-Aware Local Adaptation
Eito Ogawa, Taiga Hayami, Hiroshi Watanabe
Main category: cs.CV
TL;DR: Proposes adaptive local region placement and sizing based on point cloud curvature for improved surface reconstruction accuracy and efficiency.
Details
Motivation: Current methods use uniform local regions with fixed sizes, limiting adaptability to varying geometric complexity in point cloud surface reconstruction.
Method: Adaptively modulates spacing and size of local regions based on curvature of input point cloud.
Result: Improved reconstruction accuracy and efficiency compared to uniform region approaches.
Conclusion: Adaptive local region placement and sizing based on curvature enhances surface reconstruction performance.
Abstract: Point cloud surface reconstruction has improved in accuracy with advances in deep learning, enabling applications such as infrastructure inspection. Recent approaches that reconstruct from small local regions rather than entire point clouds have attracted attention for their strong generalization capability. However, prior work typically places local regions uniformly and keeps their size fixed, limiting adaptability to variations in geometric complexity. In this study, we propose a method that improves reconstruction accuracy and efficiency by adaptively modulating the spacing and size of local regions based on the curvature of the input point cloud.
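A standard curvature proxy for this kind of adaptive sizing is the "surface variation" of a local neighborhood: the smallest eigenvalue of the neighborhood covariance over the eigenvalue sum (a generic sketch on our part; the paper does not state which estimator it uses). Regions could then be made smaller and denser wherever the variation is high:

```python
import numpy as np

def surface_variation(neighborhood):
    """Pauly-style curvature proxy: lambda_min / (l1 + l2 + l3) of the
    neighborhood covariance. Near 0 on a plane, up to 1/3 for isotropic
    noise; eigvalsh returns eigenvalues in ascending order."""
    centered = neighborhood - neighborhood.mean(axis=0)
    eigvals = np.linalg.eigvalsh(centered.T @ centered / len(neighborhood))
    return float(eigvals[0] / eigvals.sum())

rng = np.random.default_rng(0)
flat = np.c_[rng.uniform(-1, 1, (200, 2)), np.zeros(200)]  # planar patch
bumpy = rng.normal(size=(200, 3))                          # isotropic blob
```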
[198] Remodeling Semantic Relationships in Vision-Language Fine-Tuning
Xiangyang Wu, Liu Liu, Baosheng Yu, Jiayan Qiu, Zhenwei Shi
Main category: cs.CV
TL;DR: A method that improves multimodal alignment by extracting multilevel semantic features, grouping related semantics, and fusing visual-textual features using inheritable cross-attention with correlation-based filtering.
Details
Motivation: Existing vision-language fine-tuning methods overlook semantic relationships within images when aligning vision and language, leading to suboptimal performance.
Method: Extract multilevel semantic features from vision encoders, project vision features to group related semantics, and fuse visual-textual features using inheritable cross-attention that discards low-correlation feature pairs.
Result: Outperforms all existing methods on eight foundation models across visual question answering and image captioning tasks.
Conclusion: The proposed approach effectively improves multimodal alignment and fusion by leveraging both semantics and relationships, achieving superior performance on downstream vision-language tasks.
Abstract: Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, thus leading to suboptimal performance. Toward solving this problem, we propose a method that can improve multimodal alignment and fusion based on both semantics and relationships. Specifically, we first extract multilevel semantic features from different vision encoders to capture more visual cues of the relationships. Then, we learn to project the vision features to group related semantics, which are more likely to have relationships. Finally, we fuse the visual features with the textual ones by using inheritable cross-attention, where we globally remove the redundant visual relationships by discarding visual-language feature pairs with low correlation. We evaluate our proposed method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.
[199] Hierarchical Direction Perception via Atomic Dot-Product Operators for Rotation-Invariant Point Clouds Learning
Chenyu Hu, Xiaotong Li, Hao Zhu, Biao Hou
Main category: cs.CV
TL;DR: DiPVNet is a novel point cloud processing framework that addresses rotational variations through direction-perceptive vector networks, achieving state-of-the-art performance on classification and segmentation tasks.
Details
Motivation: Arbitrary rotations disrupt point clouds' intrinsic directional characteristics, and existing methods fail to fully exploit multiscale directional nature for enhanced feature representations.
Method: Proposes Direction-Perceptive Vector Network (DiPVNet) with atomic dot-product operators: Learnable Local Dot-Product (L2DP) for adaptive local structure capture, and global directional response spectrum via direction-aware spherical Fourier transform (DASFT).
Result: Extensive experiments show DiPVNet achieves state-of-the-art performance on point cloud classification and segmentation tasks, particularly robust to noise and large-angle rotations.
Conclusion: DiPVNet effectively models rotational symmetry while maintaining adaptive directional perception through multiscale directional operators, providing a comprehensive solution for rotation-invariant point cloud processing.
Abstract: Point cloud processing has become a cornerstone technology in many 3D vision tasks. However, arbitrary rotations introduce variations in point cloud orientations, posing a long-standing challenge for effective representation learning. The core of this issue is the disruption of the point cloud’s intrinsic directional characteristics caused by rotational perturbations. Recent methods attempt to implicitly model rotational equivariance and invariance, preserving directional information and propagating it into deep semantic spaces. Yet, they often fall short of fully exploiting the multiscale directional nature of point clouds to enhance feature representations. To address this, we propose the Direction-Perceptive Vector Network (DiPVNet). At its core is an atomic dot-product operator that simultaneously encodes directional selectivity and rotation invariance, endowing the network with both rotational symmetry modeling and adaptive directional perception. At the local level, we introduce a Learnable Local Dot-Product (L2DP) Operator, which enables interactions between a center point and its neighbors to adaptively capture the non-uniform local structures of point clouds. At the global level, we leverage generalized harmonic analysis to prove that the dot-product between point clouds and spherical sampling vectors is equivalent to a direction-aware spherical Fourier transform (DASFT). This leads to the construction of a global directional response spectrum for modeling holistic directional structures. We rigorously prove the rotation invariance of both operators. Extensive experiments on challenging scenarios involving noise and large-angle rotations demonstrate that DiPVNet achieves state-of-the-art performance on point cloud classification and segmentation tasks. Our code is available at https://github.com/wxszreal0/DiPVNet.
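The rotation invariance that DiPVNet's operators build on rests on the elementary identity (Rx)·(Ry) = x·y for any rotation R, which is easy to verify numerically:

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))   # fix column signs for a unique decomposition
    if np.linalg.det(q) < 0:   # flip one axis if Q is an improper rotation
        q[:, 0] *= -1
    return q

rng = np.random.default_rng(42)
R = random_rotation(rng)
x, y = rng.normal(size=3), rng.normal(size=3)
# Jointly rotating both arguments leaves the dot product unchanged.
invariant = bool(np.isclose((R @ x) @ (R @ y), x @ y))
```

Any feature built purely from such dot products therefore inherits rotation invariance by construction, which is why the paper can prove it for both the local and global operators.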
[200] NERVE: Neighbourhood & Entropy-guided Random-walk for training free open-Vocabulary sEgmentation
Kunal Mahatha, Jose Dolz, Christian Desrosiers
Main category: cs.CV
TL;DR: NERVE is a training-free open-vocabulary semantic segmentation method that integrates global and local information using stable diffusion’s self-attention, employs stochastic random walks for affinity refinement, and uses entropy-based uncertainty to select relevant attention maps without conventional post-processing.
Details
Motivation: Existing training-free OVSS methods have limitations including expensive affinity refinement, ineffective fusion of transformer attention maps with equal weighting, and reliance on fixed-size Gaussian kernels that enforce isotropic neighborhoods.
Method: Integrates global and local information from stable diffusion's self-attention, uses stochastic random walk for affinity refinement instead of fixed kernels, and employs entropy-based uncertainty to select the most relevant attention maps without CRF or PAMR post-processing.
Result: Achieves state-of-the-art zero-shot segmentation performance on 7 popular semantic segmentation benchmarks.
Conclusion: NERVE provides an effective training-free approach for open-vocabulary semantic segmentation that overcomes limitations of existing methods and delivers superior performance.
Abstract: Despite recent advances in Open-Vocabulary Semantic Segmentation (OVSS), existing training-free methods face several limitations: use of computationally expensive affinity refinement strategies, ineffective fusion of transformer attention maps due to equal weighting or reliance on fixed-size Gaussian kernels to reinforce local spatial smoothness, enforcing isotropic neighborhoods. We propose a strong baseline for training-free OVSS termed as NERVE (Neighbourhood & Entropy-guided Random-walk for open-Vocabulary sEgmentation), which uniquely integrates global and fine-grained local information, exploiting the neighbourhood structure from the self-attention layer of a stable diffusion model. We also introduce a stochastic random walk for refining the affinity rather than relying on fixed-size Gaussian kernels for local context. This spatial diffusion process encourages propagation across connected and semantically related areas, enabling it to effectively delineate objects with arbitrary shapes. Whereas most existing approaches treat self-attention maps from different transformer heads or layers equally, our method uses entropy-based uncertainty to select the most relevant maps. Notably, our method does not require any conventional post-processing techniques like Conditional Random Fields (CRF) or Pixel-Adaptive Mask Refinement (PAMR). Experiments are performed on 7 popular semantic segmentation benchmarks, yielding an overall state-of-the-art zero-shot segmentation performance, providing an effective approach to open-vocabulary semantic segmentation.
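The stochastic random walk used for affinity refinement can be sketched as score propagation on a row-stochastic transition matrix built from pairwise affinities (a minimal toy on our part, with the restart weight `alpha` chosen by us; NERVE's affinities come from the diffusion model's self-attention):

```python
import numpy as np

def random_walk_refine(affinity, seed, alpha=0.8, steps=20):
    """Propagate seed scores along a row-normalized affinity matrix,
    mixing the original seed back in each step (walk with restart)."""
    P = affinity / affinity.sum(axis=1, keepdims=True)
    s = seed.astype(float)
    for _ in range(steps):
        s = alpha * (P @ s) + (1 - alpha) * seed
    return s

# Two clusters joined by weak links; only one node of the first is seeded.
A = np.array([[1.0, 1.0, 1.0, 0.01, 0.01],
              [1.0, 1.0, 1.0, 0.01, 0.01],
              [1.0, 1.0, 1.0, 0.01, 0.01],
              [0.01, 0.01, 0.01, 1.0, 1.0],
              [0.01, 0.01, 0.01, 1.0, 1.0]])
seed = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
scores = random_walk_refine(A, seed)
```

The seed's mass spreads through its own cluster far more than across the weak links, which is the behavior that lets such a walk follow connected, semantically related regions of arbitrary shape rather than a fixed isotropic neighborhood.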
[201] LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning
Fengyi Fu, Mengqi Huang, Lei Zhang, Zhendong Mao
Main category: cs.CV
TL;DR: LayerEdit is a training-free multi-object image editing framework that addresses attention entanglements in inter-object conflict regions through object-layered decomposition and coherent fusion, enabling conflict-free editing.
Details
Motivation: Existing multi-object image editing methods follow the localize-editing paradigm but neglect critical inter-object interactions, leading to editing leakage or constraints due to attention entanglements in conflict regions.
Method: Proposes a “decompose-editing-fusion” framework with: (1) Conflict-aware Layer Decomposition using attention-aware IoU and time-dependent region removal; (2) Object-layered Editing with coordinated intra-layer text guidance and cross-layer geometric mapping; (3) Transparency-guided Layer Fusion for structure-coherent inter-object fusion.
Result: Extensive experiments show superiority over existing methods, demonstrating unprecedented intra-object controllability and inter-object coherence in complex multi-object scenarios.
Conclusion: LayerEdit successfully enables conflict-free multi-object editing through precise layer decomposition and coherent fusion, addressing the fundamental limitation of attention entanglements in existing methods.
Abstract: Text-driven multi-object image editing, which aims to precisely modify multiple objects within an image based on text descriptions, has recently attracted considerable interest. Existing works primarily follow the localize-editing paradigm, focusing on independent object localization and editing while neglecting critical inter-object interactions. However, this work points out that the neglected attention entanglements in inter-object conflict regions inherently hinder disentangled multi-object editing, leading to either inter-object editing leakage or intra-object editing constraints. We thereby propose a novel multi-layer disentangled editing framework LayerEdit, a training-free method which, for the first time, through precise object-layered decomposition and coherent fusion, enables conflict-free object-layered editing. Specifically, LayerEdit introduces a novel “decompose-editing-fusion” framework, consisting of: (1) Conflict-aware Layer Decomposition module, which utilizes an attention-aware IoU scheme and time-dependent region removal to enhance conflict awareness and suppression for layer decomposition. (2) Object-layered Editing module, to establish coordinated intra-layer text guidance and cross-layer geometric mapping, achieving disentangled semantic and structural modifications. (3) Transparency-guided Layer Fusion module, to facilitate structure-coherent inter-object layer fusion through precise transparency guidance learning. Extensive experiments verify the superiority of LayerEdit over existing methods, showing unprecedented intra-object controllability and inter-object coherence in complex multi-object scenarios. Code is available at: https://github.com/fufy1024/LayerEdit.
[202] Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation
Jae Joong Lee, Bedrich Benes
Main category: cs.CV
TL;DR: Top2Ground is a diffusion-based method that directly generates photorealistic ground-view images from aerial images without intermediate representations, using joint spatial features and semantic embeddings.
Details
Motivation: Generating ground-level images from aerial views is challenging due to extreme viewpoint differences, occlusions, and limited field of view. Existing methods often rely on intermediate representations like depth maps or 3D voxels.
Method: A diffusion-based approach that conditions the denoising process on joint VAE-encoded spatial features (from aerial RGB and height maps) and CLIP-based semantic embeddings, ensuring geometric constraints and semantic consistency.
Result: Top2Ground achieves 7.3% average improvement in SSIM across CVUSA, CVACT, and Auto Arborist datasets, demonstrating robust performance for both wide and narrow fields of view.
Conclusion: The method shows strong generalization capabilities and can handle diverse aerial-to-ground image generation tasks without relying on intermediate representations.
Abstract: Generating ground-level images from aerial views is a challenging task due to extreme viewpoint disparity, occlusions, and a limited field of view. We introduce Top2Ground, a novel diffusion-based method that directly generates photorealistic ground-view images from aerial input images without relying on intermediate representations such as depth maps or 3D voxels. Specifically, we condition the denoising process on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings. This design ensures the generation is both geometrically constrained by the scene’s 3D structure and semantically consistent with its content. We evaluate Top2Ground on three diverse datasets: CVUSA, CVACT, and the Auto Arborist. Our approach achieves a 7.3% average improvement in SSIM across the three datasets, showing that Top2Ground can robustly handle both wide and narrow fields of view and highlighting its strong generalization capabilities.
[203] ImagebindDC: Compressing Multi-modal Data with Imagebind-based Condensation
Yue Min, Shaobo Wang, Jiaze Li, Tianle Niu, Junxin Fan, Yongliang Miao, Lijin Yang, Linfeng Zhang
Main category: cs.CV
TL;DR: ImageBindDC is a novel multimodal data condensation framework that uses Characteristic Function loss in ImageBind’s unified feature space to preserve inter-modal dependencies through uni-modal, cross-modal, and joint-modal alignment.
Details
Motivation: Existing data condensation methods work well for unimodal data but fail in multimodal scenarios where preserving complex inter-modal dependencies is crucial for maintaining data semantics and relationships.
Method: The framework operates in ImageBind’s unified feature space and employs Characteristic Function loss in the Fourier domain for precise statistical alignment. It enforces three levels of distributional consistency: uni-modal alignment, cross-modal alignment, and joint-modal alignment.
Result: On NYU-v2 dataset, models trained on just 5 condensed datapoints per class achieve lossless performance comparable to full dataset training, with 8.2% absolute improvement over previous methods and more than 4x faster condensation time.
Conclusion: ImageBindDC successfully addresses multimodal data condensation by preserving intricate inter-modal dependencies through comprehensive statistical alignment, achieving state-of-the-art performance with high efficiency.
Abstract: Data condensation techniques aim to synthesize a compact dataset from a larger one to enable efficient model training, yet while successful in unimodal settings, they often fail in multimodal scenarios where preserving intricate inter-modal dependencies is crucial. To address this, we introduce ImageBindDC, a novel data condensation framework operating within the unified feature space of ImageBind. Our approach moves beyond conventional distribution-matching by employing a powerful Characteristic Function (CF) loss, which operates in the Fourier domain to facilitate a more precise statistical alignment via exact infinite moment matching. We design our objective to enforce three critical levels of distributional consistency: (i) uni-modal alignment, which matches the statistical properties of synthetic and real data within each modality; (ii) cross-modal alignment, which preserves pairwise semantics by matching the distributions of hybrid real-synthetic data pairs; and (iii) joint-modal alignment, which captures the complete multivariate data structure by aligning the joint distribution of real data pairs with their synthetic counterparts. Extensive experiments highlight the effectiveness of ImageBindDC: on the NYU-v2 dataset, a model trained on just 5 condensed datapoints per class achieves lossless performance comparable to one trained on the full dataset, achieving a new state-of-the-art with an 8.2% absolute improvement over the previous best method and more than 4$\times$ less condensation time.
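The characteristic-function matching at the heart of the CF loss is easy to illustrate: the empirical CF E[exp(i·tᵀx)] is evaluated at sampled frequency vectors and compared between real and synthetic features. A minimal sketch under assumed shapes; `empirical_cf`, `cf_loss`, and the Gaussian frequency sampling are illustrative assumptions, not the paper's code:

```python
import numpy as np

def empirical_cf(x, ts):
    """Empirical characteristic function E[exp(i * t^T x)] of samples
    x with shape (n, d), evaluated at frequency vectors ts (m, d)."""
    return np.exp(1j * (x @ ts.T)).mean(axis=0)

def cf_loss(real, synth, ts):
    """Mean absolute CF discrepancy, measured in the Fourier domain."""
    return np.abs(empirical_cf(real, ts) - empirical_cf(synth, ts)).mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 4))
matched = rng.normal(size=(2000, 4))           # same distribution as real
shifted = rng.normal(loc=2.0, size=(2000, 4))  # mismatched distribution
ts = rng.normal(size=(64, 4))                  # sampled frequency vectors

loss_matched = cf_loss(real, matched, ts)
loss_shifted = cf_loss(real, shifted, ts)
```

A matched distribution yields a much smaller discrepancy than a shifted one, which is the signal the condensation objective exploits at the uni-, cross-, and joint-modal levels.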
[204] Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation
Nan Bao, Yifan Zhao, Lin Zhu, Jia Li
Main category: cs.CV
TL;DR: ESC framework unifies event-RGB features using edge cues for robust semantic segmentation under extreme conditions, outperforming SOTA by 2.55% mIoU.
Details
Motivation: Existing semantic segmentation methods fail under extreme conditions (low light, camera motion) due to RGB information loss, while event-RGB fusion suffers from feature heterogeneity.
Method: Edge-awareness Semantic Concordance framework with Edge-awareness Latent Re-coding to align event-RGB features using edge dictionary, and Re-coded Consolidation with Uncertainty Optimization for heterogeneous fusion.
Result: Outperforms state-of-the-art by 2.55% mIoU on DERS-XS dataset, shows superior resilience under spatial occlusion.
Conclusion: ESC framework effectively addresses event-RGB heterogeneity for robust semantic segmentation in extreme conditions through edge-aware feature unification.
Abstract: Semantic segmentation has achieved great success in ideal conditions. However, when facing extreme conditions (e.g., insufficient light, fierce camera motion), most existing methods suffer from significant RGB information loss, severely damaging segmentation results. Several studies exploit the high-speed and high-dynamic event modality as a complement, but event and RGB are naturally heterogeneous, which leads to feature-level mismatch and inferior optimization of existing multi-modality methods. Unlike these studies, we exploit the edge information shared by both modalities for resilient fusion and propose a novel Edge-awareness Semantic Concordance framework to unify the multi-modality heterogeneous features with latent edge cues. In this framework, we first propose Edge-awareness Latent Re-coding, which obtains uncertainty indicators while realigning event-RGB features into a unified semantic space guided by a re-coded distribution, and transfers event-RGB distributions into re-coded features by utilizing a pre-established edge dictionary as clues. We then propose Re-coded Consolidation and Uncertainty Optimization, which utilize re-coded edge features and uncertainty indicators to solve the heterogeneous event-RGB fusion issues under extreme conditions. We establish two synthetic and one real-world event-RGB semantic segmentation datasets for extreme scenario comparisons. Experimental results show that our method outperforms the state-of-the-art by 2.55% mIoU on our proposed DERS-XS, and possesses superior resilience under spatial occlusion. Our code and datasets are publicly available at https://github.com/iCVTEAM/ESC.
[205] SWAN - Enabling Fast and Mobile Histopathology Image Annotation through Swipeable Interfaces
Sweta Banerjee, Timo Gosch, Sara Hester, Viktoria Weiss, Thomas Conrad, Taryn A. Donovan, Nils Porsche, Jonas Ammeling, Christoph Stroblberger, Robert Klopfleisch, Christopher Kaltenecker, Christof A. Bertram, Katharina Breininger, Marc Aubreville
Main category: cs.CV
TL;DR: SWAN is a web app that uses swipe gestures for faster histopathology image annotation compared to traditional folder-based methods, achieving comparable accuracy with high usability.
Details
Motivation: Traditional folder-based annotation workflows for histopathology images are slow, fatiguing, and difficult to scale, creating a bottleneck in developing deep learning models for clinical tasks like mitotic figure classification.
Method: Developed SWAN (SWipeable ANnotations), an open-source web application that enables intuitive image patch classification using swipe gestures, supporting both desktop and mobile platforms with real-time metadata capture and flexible gesture-to-class mapping.
Result: In a pilot study with four pathologists annotating 600 mitotic figure patches, SWAN achieved pairwise percent agreement of 86.52%-93.68% (Cohen’s Kappa = 0.61-0.80) vs folder-based method’s 86.98%-91.32% (Cohen’s Kappa = 0.63-0.75), demonstrating comparable performance with high inter-annotator consistency.
Conclusion: SWAN can accelerate image annotation while maintaining quality, offering a scalable and user-friendly alternative to conventional workflows, with participants rating it as highly usable and appreciating mobile annotation capabilities.
Abstract: The annotation of large scale histopathology image datasets remains a major bottleneck in developing robust deep learning models for clinically relevant tasks, such as mitotic figure classification. Folder-based annotation workflows are usually slow, fatiguing, and difficult to scale. To address these challenges, we introduce SWipeable ANnotations (SWAN), an open-source, MIT-licensed web application that enables intuitive image patch classification using a swiping gesture. SWAN supports both desktop and mobile platforms, offers real-time metadata capture, and allows flexible mapping of swipe gestures to class labels. In a pilot study with four pathologists annotating 600 mitotic figure image patches, we compared SWAN against a traditional folder-sorting workflow. SWAN enabled rapid annotations with pairwise percent agreement ranging from 86.52% to 93.68% (Cohen’s Kappa = 0.61-0.80), while for the folder-based method, the pairwise percent agreement ranged from 86.98% to 91.32% (Cohen’s Kappa = 0.63-0.75) for the task of classifying atypical versus normal mitotic figures, demonstrating high consistency between annotators and comparable performance. Participants rated the tool as highly usable and appreciated the ability to annotate on mobile devices. These results suggest that SWAN can accelerate image annotation while maintaining annotation quality, offering a scalable and user-friendly alternative to conventional workflows.
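The agreement statistics reported above, pairwise percent agreement and Cohen's Kappa, can be reproduced from two annotators' label lists. A small self-contained sketch with made-up labels (the annotator data here is invented for illustration):

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items two annotators labeled identically."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e), where p_e is
    the agreement expected from each annotator's marginal label rates."""
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    n = len(a)
    pe = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (po - pe) / (1 - pe)

# Hypothetical labels for 5 patches from two annotators.
ann1 = ["atypical", "normal", "normal", "atypical", "normal"]
ann2 = ["atypical", "normal", "atypical", "atypical", "normal"]
agree = percent_agreement(ann1, ann2)   # 4 of 5 patches match
kappa = cohens_kappa(ann1, ann2)
```

Kappa discounts the agreement that two annotators would reach by chance given their label frequencies, which is why the study reports it alongside raw percent agreement.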
[206] MAUGIF: Mechanism-Aware Unsupervised General Image Fusion via Dual Cross-Image Autoencoders
Kunjing Yang, Zhiwei Wang, Minru Bai
Main category: cs.CV
TL;DR: Proposes MAUGIF, a mechanism-aware unsupervised general image fusion method using dual cross-image autoencoders that adapts to different fusion mechanisms (additive vs multiplicative) for better performance and interpretability.
Details
Motivation: Existing fusion methods are either too task-specific or apply uniform strategies across diverse tasks, ignoring their distinct fusion mechanisms.
Method: Uses dual cross-image autoencoders with a shared latent space; dual encoders capture common content while isolating modality-specific details, and dual decoders act as feature injectors with architecture varying by fusion mechanism.
Result: Extensive experiments validate effectiveness and generalization ability across diverse fusion tasks.
Conclusion: MAUGIF provides a flexible framework that adapts to different fusion mechanisms, enhancing both performance and interpretability in general image fusion tasks.
Abstract: Image fusion aims to integrate structural and complementary information from multi-source images. However, existing fusion methods are often either highly task-specific, or general frameworks that apply uniform strategies across diverse tasks, ignoring their distinct fusion mechanisms. To address this issue, we propose a mechanism-aware unsupervised general image fusion (MAUGIF) method based on dual cross-image autoencoders. Initially, we introduce a classification of additive and multiplicative fusion according to the inherent mechanisms of different fusion tasks. Then, dual encoders map source images into a shared latent space, capturing common content while isolating modality-specific details. During the decoding phase, dual decoders act as feature injectors, selectively reintegrating the unique characteristics of each modality into the shared content for reconstruction. The modality-specific features are injected into the source image in the fusion process, generating the fused image that integrates information from both modalities. The architecture of decoders varies according to their fusion mechanisms, enhancing both performance and interpretability. Extensive experiments are conducted on diverse fusion tasks to validate the effectiveness and generalization ability of our method. The code is available at https://anonymous.4open.science/r/MAUGIF.
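The additive/multiplicative distinction the abstract draws can be made concrete with a toy fusion rule over a shared content layer and two modality-specific detail layers. This is purely illustrative; in MAUGIF the decoders learn these injections rather than applying fixed formulas, and `fuse` is a hypothetical helper:

```python
import numpy as np

def fuse(content, detail_a, detail_b, mode="additive"):
    """Toy fusion: shared content combined with the modality-specific
    details re-injected either additively or multiplicatively."""
    if mode == "additive":
        # e.g. details that superimpose on the shared structure
        return content + detail_a + detail_b
    if mode == "multiplicative":
        # e.g. details that modulate the shared structure
        return content * (1.0 + detail_a) * (1.0 + detail_b)
    raise ValueError(f"unknown mode: {mode}")

content = np.full((2, 2), 2.0)
detail_a = np.full((2, 2), 0.5)
detail_b = np.full((2, 2), 0.25)
add_fused = fuse(content, detail_a, detail_b)
mul_fused = fuse(content, detail_a, detail_b, mode="multiplicative")
```

Classifying each fusion task as one mechanism or the other is what lets the decoder architecture vary per task while the encoders stay shared.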
[207] SynWeather: Weather Observation Data Synthesis across Multiple Regions and Variables via a General Diffusion Transformer
Kaiyi Xu, Junchao Gong, Zhiwang Zhou, Zhangrui Li, Yuandong Pu, Yihao Liu, Ben Fei, Fenghua Ling, Wenlong Zhang, Lei Bei
Main category: cs.CV
TL;DR: SynWeather is the first dataset for unified multi-region, multi-variable weather data synthesis, and SynWeatherDiff is a diffusion transformer model that addresses over-smoothing in weather synthesis.
Details
Motivation: Current weather data synthesis approaches are limited to single-variable, single-region tasks using deterministic modeling, which prevents unified synthesis across variables/regions, overlooks cross-variable complementarity, and causes over-smoothed results.
Method: Created SynWeather dataset covering four regions (US, Europe, East Asia, Tropical Cyclones) with high-resolution observations of key weather variables. Developed SynWeatherDiff, a probabilistic weather synthesis model based on Diffusion Transformer framework.
Result: Experiments on SynWeather dataset show the model’s effectiveness compared to both task-specific and general models.
Conclusion: The proposed SynWeather dataset and SynWeatherDiff model successfully address limitations of current approaches by enabling unified multi-region, multi-variable weather data synthesis with improved results.
Abstract: With the advancement of meteorological instruments, abundant data has become available. Current approaches typically focus on single-variable, single-region tasks and primarily rely on deterministic modeling. This limits unified synthesis across variables and regions, overlooks cross-variable complementarity, and often leads to over-smoothed results. To address the above challenges, we introduce SynWeather, the first dataset designed for Unified Multi-region and Multi-variable Weather Observation Data Synthesis. SynWeather covers four representative regions: the Continental United States, Europe, East Asia, and Tropical Cyclone regions, and provides high-resolution observations of key weather variables, including Composite Radar Reflectivity, Hourly Precipitation, Visible Light, and Microwave Brightness Temperature. In addition, we introduce SynWeatherDiff, a general and probabilistic weather synthesis model built upon the Diffusion Transformer framework to address the over-smoothing problem. Experiments on the SynWeather dataset demonstrate the effectiveness of our network compared with both task-specific and general models.
[208] SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering
Laura Bragagnolo, Leonardo Barcellona, Stefano Ghidoni
Main category: cs.CV
TL;DR: SkelSplat is a novel framework for multi-view 3D human pose estimation using differentiable Gaussian rendering, modeling human pose as a skeleton of 3D Gaussians without requiring 3D ground-truth supervision.
Details
Motivation: State-of-the-art multi-view methods rely on large annotated datasets and suffer from poor generalization when test scenarios differ from training data.
Method: Models human pose as a skeleton of 3D Gaussians (one per joint) with a novel one-hot encoding scheme, optimized via differentiable rendering to fuse arbitrary camera views without 3D ground-truth supervision.
Result: Outperforms approaches without 3D ground truth on the Human3.6M and CMU datasets, reduces cross-dataset error by up to 47.8% compared to learning-based methods, and demonstrates robustness to occlusions without scenario-specific fine-tuning.
Conclusion: SkelSplat provides an effective framework for multi-view 3D human pose estimation that generalizes well across datasets and handles occlusions without requiring 3D supervision or scenario-specific training.
Abstract: Accurate 3D human pose estimation is fundamental for applications such as augmented reality and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. Human pose is modeled as a skeleton of 3D Gaussians, one per joint, optimized via differentiable rendering to enable seamless fusion of arbitrary camera views without 3D ground-truth supervision. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of human joints. SkelSplat outperforms approaches that do not rely on 3D ground truth in Human3.6M and CMU, while reducing the cross-dataset error up to 47.8% compared to learning-based methods. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate robustness to occlusions, without scenario-specific fine-tuning. Our project page is available here: https://skelsplat.github.io.
[209] NeuSpring: Neural Spring Fields for Reconstruction and Simulation of Deformable Objects from Videos
Qingshan Xu, Jiao Liu, Shangshu Yu, Yuxuan Wang, Yuan Zhou, Junbao Zhou, Jiequan Cui, Yew-Soon Ong, Hanwang Zhang
Main category: cs.CV
TL;DR: NeuSpring: A neural spring field method for creating physical digital twins of deformable objects that improves reconstruction and simulation performance for both current state modeling and future prediction.
Details
Motivation: Existing methods for deformable object modeling focus on current state physical learning but generalize poorly to future prediction because they ignore intrinsic physical properties of deformable objects.
Method: Proposes NeuSpring with two innovations: 1) piecewise topology solution using zero-order optimization to model multi-region spring connections considering material heterogeneity, and 2) neural spring field using canonical coordinate-based neural network to represent spring physical properties across frames.
Result: Achieves superior reconstruction and simulation performance with Chamfer distance improved by 20% for current state modeling and 25% for future prediction on real-world datasets.
Conclusion: NeuSpring effectively addresses the limitations of existing methods by incorporating intrinsic physical properties through neural spring fields, enabling better generalization to future predictions.
Abstract: In this paper, we aim to create physical digital twins of deformable objects under interaction. Existing methods focus more on the physical learning of current state modeling, but generalize poorly to future prediction. This is because existing methods ignore the intrinsic physical properties of deformable objects, resulting in limited physical learning in current state modeling. To address this, we present NeuSpring, a neural spring field for the reconstruction and simulation of deformable objects from videos. Built upon spring-mass models for realistic physical simulation, our method consists of two major innovations: 1) a piecewise topology solution that efficiently models multi-region spring connection topologies using zero-order optimization, which considers the material heterogeneity of real-world objects. 2) a neural spring field that represents spring physical properties across different frames using a canonical coordinate-based neural network, which effectively leverages the spatial associativity of springs for physical learning. Experiments on real-world datasets demonstrate that NeuSpring achieves superior reconstruction and simulation performance for current state modeling and future prediction, with Chamfer distance improved by 20% and 25%, respectively.
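Since the method builds on spring-mass models, the underlying Hooke-force computation is worth recalling. The sketch below is generic spring-mass physics, not the paper's neural field (which would instead predict per-spring stiffness and rest lengths from a canonical coordinate network); `spring_forces` is a hypothetical helper:

```python
import numpy as np

def spring_forces(pos, springs, stiffness, rest):
    """Hooke forces for a spring-mass system.
    pos: (n, 3) node positions; springs: list of (i, j) endpoint pairs;
    stiffness, rest: per-spring stiffness constants and rest lengths."""
    forces = np.zeros_like(pos)
    for (i, j), k, r in zip(springs, stiffness, rest):
        d = pos[j] - pos[i]
        length = np.linalg.norm(d)
        f = k * (length - r) * d / length  # pulls endpoints toward rest length
        forces[i] += f
        forces[j] -= f
    return forces

# Two nodes stretched to twice the rest length attract each other.
pos = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
forces = spring_forces(pos, springs=[(0, 1)], stiffness=[1.0], rest=[1.0])
```

Making stiffness and rest length functions of a canonical coordinate (as NeuSpring does) is what lets one network share physical parameters across frames.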
[210] Mitigating Negative Flips via Margin Preserving Training
Simone Ricci, Niccolò Biondi, Federico Pernici, Alberto Del Bimbo
Main category: cs.CV
TL;DR: Proposes a method to reduce negative flips (inconsistencies where updated models misclassify previously correct samples) in image classification by preserving original model margins while learning new classes.
Details
Motivation: Minimizing inconsistencies across successive AI versions is crucial. Adding new classes reduces class margins and introduces conflicting patterns, degrading performance on original classes through negative flips.
Method: Preserves original model margins while learning improved model. Uses margin-calibration term on logits and integrates double-source focal distillation loss with previous model and new independently trained model to learn appropriate decision margins.
Result: Extensive experiments on image classification benchmarks show consistent reduction in negative flip rate with high overall accuracy.
Conclusion: The proposed approach effectively mitigates negative flips while maintaining high accuracy, addressing the challenge of model inconsistency when adding new classes over time.
Abstract: Minimizing inconsistencies across successive versions of an AI system is as crucial as reducing the overall error. In image classification, such inconsistencies manifest as negative flips, where an updated model misclassifies test samples that were previously classified correctly. This issue becomes increasingly pronounced as the number of training classes grows over time, since adding new categories reduces the margin of each class and may introduce conflicting patterns that undermine their learning process, thereby degrading performance on the original subset. To mitigate negative flips, we propose a novel approach that preserves the margins of the original model while learning an improved one. Our method encourages a larger relative margin between the previously learned and newly introduced classes by introducing an explicit margin-calibration term on the logits. However, overly constraining the logit margin for the new classes can significantly degrade their accuracy compared to a new independently trained model. To address this, we integrate a double-source focal distillation loss with the previous model and a new independently trained model, learning an appropriate decision margin from both old and new data, even under a logit margin calibration. Extensive experiments on image classification benchmarks demonstrate that our approach consistently reduces the negative flip rate with high overall accuracy.
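A hinge-style reading of the margin-calibration idea, penalizing any new-class logit that comes within `margin` of the true old-class logit, can be sketched as follows. This is an assumption-laden reconstruction for intuition only; the paper's exact calibration term and its double-source focal distillation are not reproduced here:

```python
import numpy as np

def margin_calibration_loss(logits, labels, old_classes, margin=1.0):
    """Hinge penalty on old-class samples whenever the strongest
    new-class logit encroaches within `margin` of the true logit."""
    new = [c for c in range(logits.shape[1]) if c not in old_classes]
    total = 0.0
    for z, y in zip(logits, labels):
        if y in old_classes:
            total += max(0.0, z[new].max() + margin - z[y])
    return total / len(labels)

# Classes 0 and 1 are old, class 2 is newly added.
logits = np.array([[5.0, 0.0, 2.0],   # safe: true logit clears margin
                   [2.0, 0.0, 3.0]])  # violated: new class dominates
labels = [0, 0]
loss = margin_calibration_loss(logits, labels, old_classes={0, 1})
```

Enlarging the relative margin this way discourages the updated model from flipping samples the original model already classified correctly.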
[211] The Impact of Longitudinal Mammogram Alignment on Breast Cancer Risk Assessment
Solveig Thrun, Stine Hansen, Zijun Sun, Nele Blum, Suaiba A. Salahuddin, Xin Wang, Kristoffer Wickstrøm, Elisabeth Wetzer, Robert Jenssen, Maik Stille, Michael Kampffmeyer
Main category: cs.CV
TL;DR: Image-based registration outperforms feature-based and implicit alignment methods for longitudinal mammography risk modeling, providing better prediction accuracy and anatomically plausible deformation fields.
Details
Motivation: Accurate spatial alignment across longitudinal mammograms is crucial for deep learning-based risk models, as misalignment can obscure tissue changes and degrade performance.
Method: Evaluated various alignment strategies including image-based registration, feature-level alignment with/without regularization, and implicit alignment methods using two large-scale mammography datasets.
Result: Image-based registration consistently outperformed other methods across all metrics (predictive accuracy, precision, recall, deformation quality) and enabled more accurate, temporally consistent predictions.
Conclusion: Image-based deformation fields are essential for spatial alignment in longitudinal risk modeling, offering improved prediction accuracy and robustness for personalized breast cancer screening.
Abstract: Regular mammography screening is crucial for early breast cancer detection. By leveraging deep learning-based risk models, screening intervals can be personalized, especially for high-risk individuals. While recent methods increasingly incorporate longitudinal information from prior mammograms, accurate spatial alignment across time points remains a key challenge. Misalignment can obscure meaningful tissue changes and degrade model performance. In this study, we examine various alignment strategies: image-based registration, feature-level (representation space) alignment with and without regularization, and implicit alignment methods, for their effectiveness in longitudinal deep learning-based risk modeling. Using two large-scale mammography datasets, we assess each method across key metrics, including predictive accuracy, precision, recall, and deformation field quality. Our results show that image-based registration consistently outperforms the more recently favored feature-based and implicit approaches across all metrics, enabling more accurate, temporally consistent predictions and generating smooth, anatomically plausible deformation fields. Although regularizing the deformation field improves deformation quality, it reduces the risk prediction performance of feature-level alignment. Applying image-based deformation fields within the feature space yields the best risk prediction performance. These findings underscore the importance of image-based deformation fields for spatial alignment in longitudinal risk modeling, offering improved prediction accuracy and robustness. This approach has strong potential to enhance personalized screening and enable earlier interventions for high-risk individuals. The code is available at https://github.com/sot176/Mammogram_Alignment_Study_Risk_Prediction.git, allowing full reproducibility of the results.
[212] Empowering DINO Representations for Underwater Instance Segmentation via Aligner and Prompter
Zhiyang Chen, Chen Zhang, Hao Fang, Runmin Cong
Main category: cs.CV
TL;DR: DiveSeg is a novel underwater instance segmentation framework that leverages DINO pretrained models with two key components: AquaStyle Aligner for underwater domain adaptation and ObjectPrior Prompter for instance-level guidance, achieving state-of-the-art performance.
Details
Motivation: Underwater instance segmentation is crucial for marine resource exploration and ecological protection, but faces challenges in adapting to underwater domain characteristics and requiring both object- and instance-level reasoning.
Method: Built on DINO pretrained models with two components: AquaStyle Aligner embeds underwater color style features during fine-tuning, and ObjectPrior Prompter uses binary segmentation-based prompts to provide object-level priors for instance segmentation.
Result: Achieves state-of-the-art performance on UIIS and USIS10K datasets, demonstrating effective adaptation to underwater domain and superior instance segmentation capabilities.
Conclusion: DiveSeg successfully adapts foundation models to underwater instance segmentation through domain-specific alignment and object-level prompting, providing an effective solution for marine applications.
Abstract: Underwater instance segmentation (UIS), integrating pixel-level understanding and instance-level discrimination, is a pivotal technology in marine resource exploration and ecological protection. In recent years, large-scale pretrained visual foundation models, exemplified by DINO, have advanced rapidly and demonstrated remarkable performance on complex downstream tasks. In this paper, we demonstrate that DINO can serve as an effective feature learner for UIS, and we introduce DiveSeg, a novel framework built upon two insightful components: (1) The AquaStyle Aligner, designed to embed underwater color style features into the DINO fine-tuning process, facilitating better adaptation to the underwater domain. (2) The ObjectPrior Prompter, which incorporates binary segmentation-based prompts to deliver object-level priors, provides essential guidance for instance segmentation task that requires both object- and instance-level reasoning. We conduct thorough experiments on the popular UIIS and USIS10K datasets, and the results show that DiveSeg achieves the state-of-the-art performance. Code: https://github.com/ettof/Diveseg.
[213] Towards Open-Set Myoelectric Gesture Recognition via Dual-Perspective Inconsistency Learning
Chen Liu, Can Han, Weishi Xu, Yaqi Wang, Dahong Qian
Main category: cs.CV
TL;DR: SASG-DA is a diffusion-based data augmentation method for sEMG gesture recognition that uses semantic guidance and sparse-aware sampling to generate faithful and diverse training samples, improving model generalization.
Details
Motivation: sEMG-based gesture recognition systems suffer from limited training data, leading to overfitting and poor generalization in deep learning models. Existing data augmentation methods struggle to balance faithfulness and diversity.
Method: Proposed SASG-DA with three key components: Semantic Representation Guidance (SRG) for faithful generation, Gaussian Modeling Semantic Modeling (GMSS) for flexible sampling, and Sparse-Aware Semantic Sampling to explore underrepresented regions.
Result: Extensive experiments on Ninapro DB2, DB4, and DB7 datasets show SASG-DA significantly outperforms existing augmentation methods in mitigating overfitting and improving recognition performance and generalization.
Conclusion: The proposed diffusion-based data augmentation approach effectively addresses data scarcity in sEMG gesture recognition by generating both faithful and diverse samples, enhancing model performance and generalization capabilities.
Abstract: Surface electromyography (sEMG)-based gesture recognition plays a critical role in human-machine interaction (HMI), particularly for rehabilitation and prosthetic control. However, sEMG-based systems often suffer from the scarcity of informative training data, leading to overfitting and poor generalization in deep learning models. Data augmentation offers a promising approach to increasing the size and diversity of training data, where faithfulness and diversity are two critical factors to effectiveness. However, promoting untargeted diversity can result in redundant samples with limited utility. To address these challenges, we propose a novel diffusion-based data augmentation approach, Sparse-Aware Semantic-Guided Diffusion Augmentation (SASG-DA). To enhance generation faithfulness, we introduce the Semantic Representation Guidance (SRG) mechanism by leveraging fine-grained, task-aware semantic representations as generation conditions. To enable flexible and diverse sample generation, we propose a Gaussian Modeling Semantic Modeling (GMSS) strategy, which models the semantic representation distribution and allows stochastic sampling to produce both faithful and diverse samples. To enhance targeted diversity, we further introduce a Sparse-Aware Semantic Sampling strategy to explicitly explore underrepresented regions, improving distribution coverage and sample utility. Extensive experiments on benchmark sEMG datasets, Ninapro DB2, DB4, and DB7, demonstrate that SASG-DA significantly outperforms existing augmentation methods. Overall, our proposed data augmentation approach effectively mitigates overfitting and improves recognition performance and generalization by offering both faithful and diverse samples.
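The GMSS idea above (model the semantic representation distribution, then sample generation conditions stochastically from it) can be illustrated with a minimal sketch. The diagonal-Gaussian fit, the toy representation vectors, and all names below are hypothetical simplifications, not the authors' implementation:

```python
import math, random

def fit_gaussian(reps):
    """Fit a diagonal Gaussian to a set of semantic representation vectors."""
    d = len(reps[0])
    mean = [sum(r[i] for r in reps) / len(reps) for i in range(d)]
    var = [sum((r[i] - mean[i]) ** 2 for r in reps) / len(reps) for i in range(d)]
    std = [math.sqrt(v) + 1e-8 for v in var]
    return mean, std

def sample_condition(mean, std, rng):
    """Draw a stochastic generation condition from the fitted distribution,
    giving faithful-but-varied conditions for a diffusion generator."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

rng = random.Random(0)                         # seeded for reproducibility
reps = [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2]]   # hypothetical semantic vectors
mean, std = fit_gaussian(reps)
cond = sample_condition(mean, std, rng)
```

The paper's sparse-aware variant would additionally bias sampling toward low-density regions of this distribution to improve coverage.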
[214] VideoChain: A Transformer-Based Framework for Multi-hop Video Question Generation
Arpan Phukan, Anupam Pandey, Deepjyoti Bodo, Asif Ekbal
Main category: cs.CV
TL;DR: VideoChain is a novel framework for Multi-hop Video Question Generation (MVQG) that generates reasoning-intensive questions spanning multiple video segments, outperforming existing methods on standard metrics.
Details
Motivation: Existing multi-hop QG is limited to text, while VideoQG only handles zero-hop questions on single video segments, creating a gap for reasoning across multiple video segments.
Method: Modular architecture built on modified BART backbone with video embeddings, using automatically constructed MVQ-60 dataset from TVQA+ by merging zero-hop QA pairs.
Result: Strong performance on generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110).
Conclusion: VideoChain effectively generates coherent, contextually grounded, and reasoning-intensive questions across multiple video segments, advancing multi-modal reasoning capabilities.
Abstract: Multi-hop Question Generation (QG) effectively evaluates reasoning but remains confined to text; Video Question Generation (VideoQG) is limited to zero-hop questions over single segments. To address this, we introduce VideoChain, a novel Multi-hop Video Question Generation (MVQG) framework designed to generate questions that require reasoning across multiple, temporally separated video segments. VideoChain features a modular architecture built on a modified BART backbone enhanced with video embeddings, capturing textual and visual dependencies. Using the TVQA+ dataset, we automatically construct the large-scale MVQ-60 dataset by merging zero-hop QA pairs, ensuring scalability and diversity. Evaluations show VideoChain’s strong performance across standard generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110). These results highlight the model’s ability to generate coherent, contextually grounded, and reasoning-intensive questions.
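For reference, ROUGE-1 F1 (one of the generation metrics reported above) is a simple unigram-overlap score; this minimal sketch uses hypothetical example sentences:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall
    between a generated text and a reference."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("what did he say after leaving",
                  "what did she say after leaving the room")
```

ROUGE-L additionally accounts for the longest common subsequence rather than bag-of-unigram overlap.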
[215] Extreme Model Compression with Structured Sparsity at Low Precision
Dan Liu, Nikita Dvornik, Xue Liu
Main category: cs.CV
TL;DR: SLOPE is a unified framework that effectively combines structured sparsity and low-bit quantization to reduce DNN model size while maintaining accuracy, achieving ~20× size reduction with ~99% original accuracy on ResNet-18.
Details
Motivation: Deep neural networks are too large and computationally expensive for resource-constrained devices. While weight quantization and structured sparsity individually help reduce model size, their combination severely harms accuracy, creating a need for a principled approach to combine both techniques effectively.
Method: SLOPE uses a training-time regularization strategy that minimizes discrepancy between full-precision weights and their sparse, quantized counterparts by promoting angular alignment rather than direct matching, enabling effective combination of structured sparsity and low-bit quantization.
Result: On ResNet-18, SLOPE achieves ~20× model size reduction while retaining ~99% of the original accuracy. It consistently outperforms state-of-the-art quantization and structured sparsity methods across classification, detection, and segmentation tasks on various models including ResNet-18, ViT-Small, and Mask R-CNN.
Conclusion: SLOPE provides a principled framework to successfully combine structured sparsity and quantization, overcoming their compounded negative effects and enabling significant model compression while maintaining high accuracy across diverse computer vision tasks.
Abstract: Deep neural networks (DNNs) are used in many applications, but their large size and high computational cost make them hard to run on devices with limited resources. Two widely used techniques to address this challenge are weight quantization, which lowers the precision of all weights, and structured sparsity, which removes unimportant weights while retaining the important ones at full precision. Although both are effective individually, they are typically studied in isolation due to their compounded negative impact on model accuracy when combined. In this work, we introduce SLOPE (Structured Sparsity at Low Precision), a unified framework to effectively combine structured sparsity and low-bit quantization in a principled way. We show that naively combining sparsity and quantization severely harms performance due to the compounded impact of both techniques. To address this, we propose a training-time regularization strategy that minimizes the discrepancy between full-precision weights and their sparse, quantized counterparts by promoting angular alignment rather than direct matching. On ResNet-18, SLOPE achieves $\sim20\times$ model size reduction while retaining $\sim$99% of the original accuracy. It consistently outperforms state-of-the-art quantization and structured sparsity methods across classification, detection, and segmentation tasks on models such as ResNet-18, ViT-Small, and Mask R-CNN.
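The angular-alignment idea can be sketched as follows. The pruning and quantization helpers and the cosine-based penalty below are illustrative stand-ins for the paper's regularizer, not its actual implementation:

```python
import math

def quantize(w, num_bits=2):
    """Uniform symmetric quantization of a weight vector (illustrative)."""
    levels = 2 ** (num_bits - 1) - 1          # one level per side for 2-bit
    scale = max(abs(x) for x in w) / levels if any(w) else 1.0
    return [round(x / scale) * scale for x in w]

def prune_structured(w, keep_ratio=0.5):
    """Keep the largest-magnitude fraction of the vector, zero the rest."""
    k = max(1, int(len(w) * keep_ratio))
    thresh = sorted((abs(x) for x in w), reverse=True)[k - 1]
    return [x if abs(x) >= thresh else 0.0 for x in w]

def angular_alignment_loss(w_full, w_sq):
    """1 - cosine similarity: penalizes angular mismatch rather than
    elementwise distance between full-precision and sparse-quantized weights."""
    dot = sum(a * b for a, b in zip(w_full, w_sq))
    na = math.sqrt(sum(a * a for a in w_full))
    nb = math.sqrt(sum(b * b for b in w_sq))
    return 1.0 - dot / (na * nb)

w = [0.8, -0.1, 0.45, 0.05]                   # hypothetical weight slice
w_sq = quantize(prune_structured(w))
loss = angular_alignment_loss(w, w_sq)
```

In training, such a penalty would be added to the task loss so the full-precision weights drift toward directions that survive sparsification and quantization.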
[216] Retrospective motion correction in MRI using disentangled embeddings
Qi Wang, Veronika Ecker, Marcel Früh, Sergios Gatidis, Thomas Küstner
Main category: cs.CV
TL;DR: A hierarchical vector-quantized variational auto-encoder learns disentangled motion-to-clean image features for MRI motion correction, enabling generalization across different motion types without artifact-specific training.
Details
Motivation: Existing MRI motion correction methods struggle to generalize across different motion types and body regions, with ML-based approaches often being tailored to specific applications and datasets.
Method: Proposed hierarchical vector-quantized variational auto-encoder with codebook to capture motion patterns at multiple resolutions, combined with auto-regressive model for motion-free image prior to guide correction.
Result: Robust correction across varying motion severity on simulated whole-body motion artifacts, effectively disentangling physical motion features and improving generalizability.
Conclusion: The disentangled motion feature approach shows potential for application across anatomical regions and motion types, enhancing ML-based MRI motion correction generalizability.
Abstract: Physiological motion can affect the diagnostic quality of magnetic resonance imaging (MRI). While various retrospective motion correction methods exist, many struggle to generalize across different motion types and body regions. In particular, machine learning (ML)-based corrections are often tailored to specific applications and datasets. We hypothesize that motion artifacts, though diverse, share underlying patterns that can be disentangled and exploited. To address this, we propose a hierarchical vector-quantized (VQ) variational auto-encoder that learns a disentangled embedding of motion-to-clean image features. A codebook is deployed to capture a finite collection of motion patterns at multiple resolutions, enabling coarse-to-fine correction. An auto-regressive model is trained to learn the prior distribution of motion-free images and is used at inference to guide the correction process. Unlike conventional approaches, our method does not require artifact-specific training and can generalize to unseen motion patterns. We demonstrate the approach on simulated whole-body motion artifacts and observe robust correction across varying motion severity. Our results suggest that the model effectively disentangles the physical motion in the simulated motion-corrupted scans, thereby improving the generalizability of ML-based MRI motion correction. Our work on disentangling motion features sheds light on its potential application across anatomical regions and motion types.
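The discrete bottleneck at the heart of a vector-quantized auto-encoder is a nearest-neighbor codebook lookup; this minimal sketch (toy 2-D codebook and hypothetical encoder outputs) illustrates that step:

```python
def vq_lookup(z, codebook):
    """Nearest-neighbor quantization: replace each feature vector with the
    closest codebook entry, the discrete bottleneck of a VQ-VAE."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    out = []
    for vec in z:
        idx = min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))
        out.append((idx, codebook[idx]))
    return out

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy motion-pattern codes
features = [[0.9, 0.1], [0.1, 0.8]]               # hypothetical encoder outputs
codes = vq_lookup(features, codebook)
```

In the hierarchical setting described above, a separate codebook would capture motion patterns at each resolution scale.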
[217] A Circular Argument : Does RoPE need to be Equivariant for Vision?
Chase van de Geijn, Timo Lüddecke, Polina Turishcheva, Alexander S. Ecker
Main category: cs.CV
TL;DR: The paper questions the importance of strict positional equivariance in Rotary Positional Encodings (RoPE) for vision tasks, proposing Spherical RoPE with non-commutative generators that performs equally or better than equivariant alternatives.
Details
Motivation: To challenge the common belief that relative positional embeddings (equivariance) are crucial for RoPE's success, particularly in computer vision applications, and explore whether this constraint can be relaxed.
Method: Mathematically analyzed RoPE as a general solution for equivariant positional embedding, proposed Mixed RoPE for M-dimensional data with commutative generators, and introduced Spherical RoPE with non-commutative generators.
Result: Spherical RoPE demonstrated equivalent or better learning behavior compared to equivariant analogues, suggesting relative positional embeddings may not be as important as commonly believed in computer vision.
Conclusion: The findings suggest removing the preconception that positional encodings must be relative could lead to faster and better-generalizing positional encodings for vision tasks.
Abstract: Rotary Positional Encodings (RoPE) have emerged as a highly effective technique for one-dimensional sequences in Natural Language Processing spurring recent progress towards generalizing RoPE to higher-dimensional data such as images and videos. The success of RoPE has been thought to be due to its positional equivariance, i.e. its status as a relative positional encoding. In this paper, we mathematically show RoPE to be one of the most general solutions for equivariant positional embedding in one-dimensional data. Moreover, we show Mixed RoPE to be the analogously general solution for M-dimensional data, if we require commutative generators – a property necessary for RoPE’s equivariance. However, we question whether strict equivariance plays a large role in RoPE’s performance. We propose Spherical RoPE, a method analogous to Mixed RoPE, but assumes non-commutative generators. Empirically, we find Spherical RoPE to have the equivalent or better learning behavior compared to its equivariant analogues. This suggests that relative positional embeddings are not as important as is commonly believed, at least within computer vision. We expect this discovery to facilitate future work in positional encodings for vision that can be faster and generalize better by removing the preconception that they must be relative.
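For context, standard 1-D RoPE rotates consecutive feature pairs by position-dependent angles; the relative (equivariance) property discussed above means query-key dot products depend only on position offsets. A minimal sketch follows (the Mixed and Spherical variants proposed in the paper differ in their choice of rotation generators and are not shown):

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Standard 1-D RoPE: rotate each pair (x[2i], x[2i+1]) by the angle
    pos * base**(-2i/d), so query-key dot products depend only on offsets."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

q = [1.0, 0.0, 0.5, 0.5]          # hypothetical query vector
q_rot = rope_rotate(q, pos=3)
```

Because each rotation is orthogonal, norms are preserved, and shifting both positions by the same amount leaves the dot product unchanged, which is exactly the relative-position property.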
[218] Text-based Aerial-Ground Person Retrieval
Xinyu Zhou, Yu Wu, Jiayao Ma, Wenhao Wang, Min Cao, Mang Ye
Main category: cs.CV
TL;DR: This paper introduces TAG-PR, a new task for retrieving person images from both aerial and ground views using text descriptions, along with a new dataset TAG-PEDES and a framework TAG-CLIP that handles view heterogeneity through specialized modules.
Details
Motivation: Traditional text-based person retrieval only focuses on ground-view images, but real-world applications often need to retrieve people across different viewpoints (aerial and ground), which introduces significant challenges due to viewpoint discrepancies.
Method: Proposed TAG-CLIP framework with: (1) hierarchically-routed mixture of experts module to learn view-specific and view-agnostic features, and (2) viewpoint decoupling strategy to separate view-specific features for better cross-modal alignment.
Result: The method is evaluated on the proposed TAG-PEDES dataset and existing T-PR benchmarks, showing effectiveness in handling view heterogeneity.
Conclusion: TAG-PR addresses the practical need for cross-view person retrieval and provides both a new dataset and framework that successfully handle the challenges of viewpoint heterogeneity between aerial and ground images.
Abstract: This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR introduces greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions, enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture of experts module to learn view-specific and view-agnostic features and a viewpoint decoupling strategy to decouple view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES dataset and existing T-PR benchmarks. The dataset and code are available at https://github.com/Flame-Chasers/TAG-PR.
[219] RAPTR: Radar-based 3D Pose Estimation using Transformer
Sorachi Kato, Ryoma Yataka, Pu Perry Wang, Pedro Miraldo, Takuya Fujihashi, Petros Boufounos
Main category: cs.CV
TL;DR: RAPTR: Radar-based indoor 3D human pose estimation using transformers with weak supervision (only 3D BBox and 2D keypoint labels), achieving significant error reduction compared to existing methods.
Details
Motivation: Traditional radar-based indoor 3D pose estimation requires costly fine-grained 3D keypoint labels, which are difficult to obtain in complex indoor settings with clutter, occlusions, or multiple people.
Method: Two-stage pose decoder architecture with pseudo-3D deformable attention: pose decoder estimates initial 3D poses using 3D template loss with BBox labels, and joint decoder refines poses using 2D keypoint labels and 3D gravity loss.
Result: Outperforms existing methods, reducing joint position error by 34.3% on HIBER dataset and 76.9% on MMVR dataset.
Conclusion: RAPTR enables effective 3D human pose estimation from radar data using only weak supervision, making it more scalable and practical for real-world indoor applications.
Abstract: Radar-based indoor 3D human pose estimation has typically relied on fine-grained 3D keypoint labels, which are costly to obtain especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose \textbf{RAPTR} (RAdar Pose esTimation using tRansformer) under weak supervision, using only 3D BBox and 2D keypoint labels which are considerably easier and more scalable to collect. Our RAPTR is characterized by a two-stage pose decoder architecture with a pseudo-3D deformable attention to enhance (pose/joint) queries with multi-view radar features: a pose decoder estimates initial 3D poses with a 3D template loss designed to utilize the 3D BBox labels and mitigate depth ambiguities; and a joint decoder refines the initial poses with 2D keypoint labels and a 3D gravity loss. Evaluated on two indoor radar datasets, RAPTR outperforms existing methods, reducing joint position error by 34.3% on HIBER and 76.9% on MMVR. Our implementation is available at https://github.com/merlresearch/radar-pose-transformer.
[220] Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation
Difei Gu, Yunhe Gao, Mu Zhou, Dimitris Metaxas
Main category: cs.CV
TL;DR: Anatomy-VLM is a fine-grained vision-language model that incorporates multi-scale anatomical information and structured knowledge to achieve expert-level disease interpretation from radiology images.
Details
Motivation: Current vision-language models treat images holistically and miss fine-grained details crucial for medical diagnosis, while clinicians analyze specific anatomical regions using medical knowledge.
Method: The model localizes key anatomical features, enriches regions with structured knowledge, and aligns multi-scale medical information to generate clinically-interpretable disease predictions.
Result: Anatomy-VLM achieves outstanding performance on in- and out-of-distribution datasets, validates on downstream segmentation tasks, and enables zero-shot anatomy-wise interpretation.
Conclusion: The model demonstrates strong expert-level clinical interpretation capabilities by capturing fine-grained anatomical and pathology-related knowledge through its multi-scale alignment approach.
Abstract: Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integration of subtle image features with clinical knowledge. Yet major vision-language models (VLMs) treat images as holistic entities and overlook fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by utilizing their prior medical knowledge and identify anatomical structures as important regions of interest (ROIs). Inspired by this human-centric workflow, we introduce Anatomy-VLM, a fine-grained vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features from entire medical images. Second, these regions are enriched with structured knowledge for contextually-aware interpretation. Finally, the model encoder aligns multi-scale medical information to generate clinically-interpretable disease predictions. Anatomy-VLM achieves outstanding performance on both in- and out-of-distribution datasets. We also validate the performance of Anatomy-VLM on downstream image segmentation tasks, suggesting that its fine-grained alignment captures anatomical and pathology-related knowledge. Furthermore, the Anatomy-VLM encoder facilitates zero-shot anatomy-wise interpretation, demonstrating its strong expert-level clinical interpretation capabilities.
[221] OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild
Yuncheng Guo, Junyan Ye, Chenjue Zhang, Hengrui Kang, Haohuan Fu, Conghui He, Weijia Li
Main category: cs.CV
TL;DR: OmniAID is a novel AIGI detection framework that uses a decoupled Mixture-of-Experts architecture to separate content-specific flaws from universal artifacts, achieving superior generalization across diverse generative models and semantic content.
Details
Motivation: Current AIGI detectors learn entangled forgery representations that conflate content-dependent flaws with content-agnostic artifacts, and are limited by outdated benchmarks, failing to generalize across diverse generative models and semantic content.
Method: A decoupled Mixture-of-Experts architecture with Routable Specialized Semantic Experts for distinct content domains and a Fixed Universal Artifact Expert, trained using a two-stage strategy: independent expert training with domain-specific hard-sampling followed by lightweight gating network training.
Result: OmniAID surpasses existing monolithic detectors in extensive experiments using both traditional benchmarks and the new Mirage dataset, establishing robust generalization against modern in-the-wild threats.
Conclusion: By explicitly decoupling content-specific flaws from universal artifacts, OmniAID achieves robust generalization and sets a new standard for AIGI authentication, addressing the limitations of current entangled representation learning approaches.
Abstract: A truly universal AI-Generated Image (AIGI) detector must simultaneously generalize across diverse generative models and varied semantic content. Current state-of-the-art methods learn a single, entangled forgery representation–conflating content-dependent flaws with content-agnostic artifacts–and are further constrained by outdated benchmarks. To overcome these limitations, we propose OmniAID, a novel framework centered on a decoupled Mixture-of-Experts (MoE) architecture. The core of our method is a hybrid expert system engineered to decouple: (1) semantic flaws across distinct content domains, and (2) these content-dependent flaws from content-agnostic universal artifacts. This system employs a set of Routable Specialized Semantic Experts, each for a distinct domain (e.g., human, animal), complemented by a Fixed Universal Artifact Expert. This architecture is trained using a bespoke two-stage strategy: we first train the experts independently with domain-specific hard-sampling to ensure specialization, and subsequently train a lightweight gating network for effective input routing. By explicitly decoupling “what is generated” (content-specific flaws) from “how it is generated” (universal artifacts), OmniAID achieves robust generalization. To address outdated benchmarks and validate real-world applicability, we introduce Mirage, a new large-scale, contemporary dataset. Extensive experiments, using both traditional benchmarks and our Mirage dataset, demonstrate our model surpasses existing monolithic detectors, establishing a new, robust standard for AIGI authentication against modern, in-the-wild threats.
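The decoupled routing described above (a gate over semantic experts plus an always-on universal artifact expert) can be sketched as below. All experts, gate weights, and inputs are hypothetical stand-ins, not the paper's architecture:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def decoupled_moe(x, semantic_experts, universal_expert, gate_weights):
    """Gate only over the routable semantic experts; the fixed universal
    (artifact) expert always contributes, mirroring the decoupling above."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = softmax(logits)
    routed = sum(g * expert(x) for g, expert in zip(gates, semantic_experts))
    return routed + universal_expert(x)

# Hypothetical stand-ins: scalar "experts" and a linear gate.
experts = [lambda x: sum(x), lambda x: max(x)]
universal = lambda x: 0.1
gate_w = [[1.0, 0.0], [0.0, 1.0]]
score = decoupled_moe([2.0, 1.0], experts, universal, gate_w)
```

Keeping the universal expert outside the gate is what enforces that "how it is generated" evidence is consulted regardless of the detected content domain.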
[222] Cross-pyramid consistency regularization for semi-supervised medical image segmentation
Matus Bojko, Maros Kollar, Marek Jakab, Wanda Benesova
Main category: cs.CV
TL;DR: A hybrid consistency learning approach for semi-supervised medical image segmentation using Cross-Pyramid Consistency Regularization between dual decoders.
Details
Motivation: To effectively exploit unlabeled data in semi-supervised medical image segmentation by leveraging consistency learning across multiple resolution scales and decoders.
Method: Proposes DBPNet with encoder and two slightly different decoders producing pyramid predictions, combined with CPCR learning strategy that extends soft-labeling to pyramid predictions across decoders for knowledge distillation.
Result: Outperforms five state-of-the-art SSL methods and achieves comparable performance with recent methods on public benchmark dataset.
Conclusion: The hybrid consistency learning with cross-pyramid regularization effectively leverages unlabeled data and improves semi-supervised medical image segmentation performance.
Abstract: Semi-supervised learning (SSL) enables training of powerful models with the assumption of limited, carefully labelled data and a large amount of unlabeled data to support the learning. In this paper, we propose a hybrid consistency learning approach to effectively exploit unlabeled data for semi-supervised medical image segmentation by leveraging Cross-Pyramid Consistency Regularization (CPCR) between two decoders. First, we design a hybrid Dual Branch Pyramid Network (DBPNet), consisting of an encoder and two decoders that differ slightly, each producing a pyramid of perturbed auxiliary predictions across multiple resolution scales. Second, we present a learning strategy for this network named CPCR that combines existing consistency learning and uncertainty minimization approaches on the main output predictions of decoders with our novel regularization term. More specifically, in this term, we extend the soft-labeling setting to pyramid predictions across decoders to support knowledge distillation in deep hierarchical features. Experimental results show that DBPNet with CPCR outperforms five state-of-the-art semi-supervised learning methods and achieves comparable performance to recent ones on a public benchmark dataset.
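The cross-pyramid regularization term can be illustrated as a per-scale consistency loss between the two decoders' soft labels. This is a simplified sketch (symmetric KL over toy logits, hypothetical temperature), not the paper's exact formulation:

```python
import math

def softmax(xs, t=1.0):
    m = max(xs)
    e = [math.exp((x - m) / t) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_pyramid_consistency(pyr_a, pyr_b, t=2.0):
    """Average symmetric KL between soft labels of the two decoders'
    pyramid logits, one term per resolution scale (illustrative of CPCR)."""
    total = 0.0
    for la, lb in zip(pyr_a, pyr_b):
        pa, pb = softmax(la, t), softmax(lb, t)
        total += 0.5 * (kl(pa, pb) + kl(pb, pa))
    return total / len(pyr_a)

# Toy per-scale class logits from the two decoder branches (hypothetical values).
pyr_a = [[2.0, 0.5], [1.5, 0.2]]
pyr_b = [[1.8, 0.6], [1.4, 0.4]]
loss = cross_pyramid_consistency(pyr_a, pyr_b)
```

The temperature-softened labels are what make this a distillation-style signal on unlabeled images: each decoder is pushed toward the other's pyramid predictions.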
[223] Contrastive Integrated Gradients: A Feature Attribution-Based Method for Explaining Whole Slide Image Classification
Anh Mai Vu, Tuan L. Vo, Ngoc Lam Quang Bui, Nam Nguyen Le Binh, Akash Awasthi, Huy Quoc Vo, Thanh-Huy Nguyen, Zhu Han, Chandra Mohan, Hien Van Nguyen
Main category: cs.CV
TL;DR: CIG is a novel attribution method for WSI analysis that uses contrastive gradients to highlight class-discriminative regions, improving interpretability in computational pathology.
Details
Motivation: Existing attribution methods like IG capture model decisions but miss class-discriminative signals crucial for distinguishing tumor subtypes in high-resolution WSIs.
Method: CIG computes contrastive gradients in logit space, comparing feature importance relative to reference classes to highlight discriminative regions while satisfying integrated attribution axioms.
Result: CIG outperforms baselines on three cancer datasets (CAMELYON16, TCGA-RCC, TCGA-Lung), showing more informative attributions via proposed metrics MIL-AIC and MIL-SIC.
Conclusion: CIG provides enhanced interpretability for WSI-based diagnostics, producing attributions that better align with ground truth tumor regions for trustworthy AI-assisted pathology.
Abstract: Interpretability is essential in Whole Slide Image (WSI) analysis for computational pathology, where understanding model predictions helps build trust in AI-assisted diagnostics. While Integrated Gradients (IG) and related attribution methods have shown promise, applying them directly to WSIs introduces challenges due to their high-resolution nature. These methods capture model decision patterns but may overlook class-discriminative signals that are crucial for distinguishing between tumor subtypes. In this work, we introduce Contrastive Integrated Gradients (CIG), a novel attribution method that enhances interpretability by computing contrastive gradients in logit space. First, CIG highlights class-discriminative regions by comparing feature importance relative to a reference class, offering sharper differentiation between tumor and non-tumor areas. Second, CIG satisfies the axioms of integrated attribution, ensuring consistency and theoretical soundness. Third, we propose two attribution quality metrics, MIL-AIC and MIL-SIC, which measure how predictive information and model confidence evolve with access to salient regions, particularly under weak supervision. We validate CIG across three datasets spanning distinct cancer types: CAMELYON16 (breast cancer metastasis in lymph nodes), TCGA-RCC (renal cell carcinoma), and TCGA-Lung (lung cancer). Experimental results demonstrate that CIG yields more informative attributions both quantitatively, using MIL-AIC and MIL-SIC, and qualitatively, through visualizations that align closely with ground truth tumor regions, underscoring its potential for interpretable and trustworthy WSI-based diagnostics.
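Per the abstract, the core of CIG is integrated gradients computed on a contrastive quantity in logit space. The sketch below approximates IG by a Riemann sum for a toy two-class linear model (all weights hypothetical), where the contrastive gradient is simply the difference of class weight vectors; the completeness axiom (attributions sum to the score difference) then holds exactly:

```python
def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Midpoint-Riemann approximation of Integrated Gradients:
    IG_i = (x_i - x'_i) * average of dF/dx_i along the straight path."""
    attrs = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(len(x)):
            attrs[i] += g[i] * (x[i] - baseline[i]) / steps
    return attrs

# Toy two-class linear model: the contrastive score is
# logit(target) - logit(reference), whose gradient is w_target - w_reference.
w_target = [1.0, -2.0, 0.5]       # hypothetical class weight vectors
w_reference = [0.2, 0.1, 0.3]
contrastive_grad = lambda point: [a - b for a, b in zip(w_target, w_reference)]

x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attrs = integrated_gradients(contrastive_grad, x, baseline)
```

For a real WSI model the gradient would come from backpropagation through the network, but the contrastive construction (attributing a logit difference rather than a single logit) is the same.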
[224] Generalizable Blood Cell Detection via Unified Dataset and Faster R-CNN
Siddharth Sahay
Main category: cs.CV
TL;DR: Automated classification and object detection of peripheral blood cells using Faster R-CNN with ResNet-50-FPN backbone, comparing random initialization vs. transfer learning from COCO dataset.
Details
Motivation: Address data scarcity and heterogeneity in peripheral blood cell analysis by creating a unified dataset from multiple public sources for automated hematological diagnosis.
Method: Developed robust data pipeline to merge four public datasets, used Faster R-CNN with ResNet-50-FPN, compared randomly initialized baseline vs. transfer learning from COCO dataset.
Result: Transfer learning achieved significantly faster convergence and superior stability with final validation loss of 0.08666, substantially better than baseline.
Conclusion: Validated methodology establishes robust foundation for high-accuracy, deployable automated hematological diagnosis systems.
Abstract: This paper presents a comprehensive methodology and comparative performance analysis for the automated classification and object detection of peripheral blood cells (PBCs) in microscopic images. Addressing the critical challenge of data scarcity and heterogeneity, a robust data pipeline was first developed to standardize and merge four public datasets (PBC, BCCD, Chula, Sickle Cell) into a unified resource. We then employed a state-of-the-art Faster R-CNN object detection framework, leveraging a ResNet-50-FPN backbone. Comparative training rigorously evaluated a randomly initialized baseline model (Regimen 1) against a Transfer Learning Regimen (Regimen 2), initialized with weights pre-trained on the Microsoft COCO dataset. The results demonstrate that the Transfer Learning approach achieved significantly faster convergence and superior stability, culminating in a final validation loss of 0.08666, a substantial improvement over the baseline. This validated methodology establishes a robust foundation for building high-accuracy, deployable systems for automated hematological diagnosis.
[225] Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding
Da Li, Yuxiao Luo, Keping Bi, Jiafeng Guo, Wei Yuan, Biao Yang, Yan Wang, Fan Yang, Tingting Gao, Guorui Zhou
Main category: cs.CV
TL;DR: CoMa introduces a compressed pre-training phase as a warm-up for contrastive learning to transform VLMs into competitive embedding models with minimal data.
Details
Motivation: To decouple comprehensive input understanding from discriminative feature learning in VLMs, enabling efficient adaptation into embedding models.
Method: Proposes CoMa - a compressed pre-training phase that serves as warm-up for contrastive learning, requiring only small amounts of pre-training data.
Result: Achieves state-of-the-art results on MMEB benchmark among VLMs of comparable size, optimizing both efficiency and effectiveness.
Conclusion: CoMa demonstrates that comprehensive input understanding facilitates superior downstream task performance through contrastive learning, enabling efficient VLM adaptation.
Abstract: Vision-language models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that VLMs can be adapted into competitive embedding models via large-scale contrastive learning, enabling the simultaneous optimization of two complementary objectives. We argue that the two aforementioned objectives can be decoupled: a comprehensive understanding of the input facilitates the embedding model in achieving superior performance in downstream tasks via contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform a VLM into a competitive embedding model. CoMa achieves new state-of-the-art results among VLMs of comparable size on the MMEB, realizing optimization in both efficiency and effectiveness.
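The contrastive-matching stage that CoMa warms up for is typically driven by a symmetric InfoNCE objective over paired image-text embeddings. A minimal NumPy sketch of that generic loss follows; this is an assumption about the usual training setup, not CoMa's exact objective.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    A generic contrastive-matching loss of the kind used to adapt a
    VLM into an embedding model; not CoMa's exact objective.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (B, B); matching pairs on the diagonal
    labels = np.arange(len(logits))

    def xent(l):
        # cross-entropy of the diagonal under a row-wise softmax
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs drive the loss toward zero, while mismatched batches keep it high, which is what the warm-up stage is meant to make easy for the subsequent contrastive phase.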
[226] Large Sign Language Models: Toward 3D American Sign Language Translation
Sen Zhang, Xiaoxiao He, Di Liu, Zhaoyang Xia, Mingyu Zhao, Chaowei Tan, Vivian Li, Bo Liu, Dimitris N. Metaxas, Mubbasir Kapadia
Main category: cs.CV
TL;DR: LSLM framework uses LLMs to translate 3D American Sign Language, capturing spatial and depth information for more accurate ASL translation and enhanced accessibility for hearing-impaired individuals.
Details
Motivation: To improve digital communication accessibility for hearing-impaired individuals and explore how LLMs can process embodied multimodal languages beyond text-based inputs.
Method: Leverages LLMs as backbone, uses 3D sign language data instead of 2D video, investigates direct translation from 3D gesture features to text and instruction-guided translation modulated by external prompts.
Result: Enables more accurate and resilient ASL translation by capturing rich spatial, gestural, and depth information in 3D scenes.
Conclusion: Provides foundational step toward inclusive, multimodal intelligent systems capable of understanding diverse forms of language beyond text.
Abstract: We present Large Sign Language Models (LSLM), a novel framework for translating 3D American Sign Language (ASL) by leveraging Large Language Models (LLMs) as the backbone, which can benefit hearing-impaired individuals’ virtual communication. Unlike existing sign language recognition methods that rely on 2D video, our approach directly utilizes 3D sign language data to capture rich spatial, gestural, and depth information in 3D scenes. This enables more accurate and resilient translation, enhancing digital communication accessibility for the hearing-impaired community. Beyond the task of ASL translation, our work explores the integration of complex, embodied multimodal languages into the processing capabilities of LLMs, moving beyond purely text-based inputs to broaden their understanding of human communication. We investigate both direct translation from 3D gesture features to text and an instruction-guided setting where translations can be modulated by external prompts, offering greater flexibility. This work provides a foundational step toward inclusive, multimodal intelligent systems capable of understanding diverse forms of language.
[227] Fast Multi-Organ Fine Segmentation in CT Images with Hierarchical Sparse Sampling and Residual Transformer
Xueqi Guo, Halid Ziya Yerebakan, Yoshihisa Shinagawa, Kritika Iyer, Gerardo Hermosillo Valadez
Main category: cs.CV
TL;DR: A fast multi-organ segmentation framework using hierarchical sparse sampling and Residual Transformer that reduces computation time while maintaining accuracy, achieving ~2.24 seconds on CPU.
Details
Motivation: Current deep learning methods for 3D medical image segmentation are computationally expensive, and classifiers face speed-accuracy trade-offs, requiring a faster alternative.
Method: Hierarchical sparse sampling strategy to reduce computation while preserving context, combined with Residual Transformer architecture to extract and combine multi-level information efficiently.
Result: Improved qualitative and quantitative segmentation performance on 10,253 CT images and TotalSegmentator dataset, achieving ~2.24 seconds on CPU hardware.
Conclusion: The method demonstrates potential for real-time fine organ segmentation with fast speed and maintained accuracy.
Abstract: Multi-organ segmentation of 3D medical images is fundamental, with meaningful applications in various clinical automation pipelines. Although deep learning has achieved superior performance, the time and memory consumption of segmenting the entire 3D volume voxel by voxel using neural networks can be huge. Classifiers have been developed as an alternative in cases with certain points of interest, but the trade-off between speed and accuracy remains an issue. Thus, we propose a novel fast multi-organ segmentation framework using hierarchical sparse sampling and a Residual Transformer. Compared with whole-volume analysis, the hierarchical sparse sampling strategy reduces computation time while preserving a meaningful hierarchical context across multiple resolution levels. The Residual Transformer segmentation network extracts and combines information from the different levels of the sparse descriptor while maintaining a low computational cost. On an internal data set containing 10,253 CT images and the public dataset TotalSegmentator, the proposed method improved qualitative and quantitative segmentation performance compared to the current fast organ classifier, with a fast runtime of ~2.24 seconds on CPU hardware. This suggests the potential of achieving real-time fine organ segmentation.
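The core idea of hierarchical sparse sampling (a few samples per resolution level instead of every voxel) can be sketched in a few lines. This toy version, whose grid size, level count, and spacing rule are illustrative rather than the paper's exact scheme, builds a descriptor of 81 values for a query point instead of touching all D×H×W voxels.

```python
import numpy as np

def sparse_descriptor(volume, center, levels=3, k=3):
    """Hierarchical sparse sampling around `center` (a voxel index).

    At each level, sample a k*k*k grid whose spacing halves with the
    level, so coarse levels cover global context and fine levels cover
    local detail. A toy sketch of the idea, not the paper's exact scheme.
    """
    D, H, W = volume.shape
    samples = []
    for lvl in range(levels):
        step = 2 ** (levels - 1 - lvl)        # coarse -> fine spacing
        offs = (np.arange(k) - k // 2) * step
        for dz in offs:
            for dy in offs:
                for dx in offs:
                    # clamp to the volume so border queries stay valid
                    z = np.clip(center[0] + dz, 0, D - 1)
                    y = np.clip(center[1] + dy, 0, H - 1)
                    x = np.clip(center[2] + dx, 0, W - 1)
                    samples.append(volume[z, y, x])
    return np.array(samples)   # length levels * k**3, far fewer than D*H*W
```

The descriptor length is fixed regardless of volume size, which is what makes per-point inference cheap enough for CPU-only runtimes.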
[228] SENCA-st: Integrating Spatial Transcriptomics and Histopathology with Cross Attention Shared Encoder for Region Identification in Cancer Pathology
Shanaka Liyanaarachchi, Chathurya Wijethunga, Shihab Aaquil Ahamed, Akthas Absar, Ranga Rodrigo
Main category: cs.CV
TL;DR: SENCA-st: A novel architecture using shared encoder with neighborhood cross-attention to integrate histopathology images and spatial transcriptomics data, preserving features from both modalities and emphasizing structurally similar but functionally different regions.
Details
Motivation: Current methods for integrating histopathology and spatial transcriptomics either prioritize one modality over the other or use vanilla contrastive learning that loses essential functional information, leading to either noisy or overly smoothed results.Method: Proposed SENCA-st architecture with shared encoder and neighborhood cross-attention mechanism that preserves features from both histopathology and spatial transcriptomics, specifically emphasizing regions that are structurally similar in histopathology but functionally different in spatial transcriptomics.
Result: Demonstrated superior performance surpassing state-of-the-art methods in detecting tumor heterogeneity and tumor micro-environment regions, which are clinically crucial aspects.
Conclusion: SENCA-st effectively integrates structural and functional information from both modalities while preserving essential features, providing better detection of clinically important tumor regions compared to existing approaches.
Abstract: Spatial transcriptomics is an emerging field that enables the identification of functional regions based on the spatial distribution of gene expression. Integrating this functional information present in transcriptomic data with structural data from histopathology images is an active research area, with applications in identifying tumor substructures associated with cancer drug resistance. Current histopathology-spatial-transcriptomic region segmentation methods suffer from one of two extremes: they either make spatial transcriptomics prominent, using histopathology features only to assist in processing the transcriptomic data, or they use vanilla contrastive learning that makes histopathology images prominent by promoting only common features and thus losing functional information. In both extremes, the model either gets lost in the noise of spatial transcriptomics or is overly smoothed, losing essential information. Thus, we propose our novel architecture SENCA-st (Shared Encoder with Neighborhood Cross Attention) that preserves the features of both modalities. More importantly, it emphasizes regions that are structurally similar in histopathology but functionally different in spatial transcriptomics using cross-attention. We demonstrate the superior performance of our model, which surpasses state-of-the-art methods in detecting tumor heterogeneity and tumor micro-environment regions, clinically crucial aspects.
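Generic single-head cross-attention, with queries from one modality and keys/values from the other, underlies mechanisms like the one described above. A NumPy sketch follows; SENCA-st's neighborhood restriction and learned projection matrices are omitted, so this shows only the basic attention mixing.

```python
import numpy as np

def cross_attention(q_feats, kv_feats, d_k=None):
    """Single-head scaled dot-product cross-attention.

    Queries come from one modality (e.g. histopathology spots) and
    keys/values from the other (e.g. spatial-transcriptomics spots),
    so each output mixes the two views. Generic sketch, not SENCA-st's
    exact neighborhood-restricted variant.
    """
    d_k = d_k or q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d_k)   # (Nq, Nkv)
    scores -= scores.max(axis=1, keepdims=True)    # numeric stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over kv spots
    return weights @ kv_feats                       # (Nq, d) mixed features
```

Because each output row is a convex combination of key/value features, every query position ends up expressed in terms of the other modality, which is how structurally similar but functionally distinct regions can be told apart.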
[229] CleverBirds: A Multiple-Choice Benchmark for Fine-grained Human Knowledge Tracing
Leonie Bossemeyer, Samuel Heinrich, Grant Van Horn, Oisin Mac Aodha
Main category: cs.CV
TL;DR: CleverBirds is a large-scale knowledge tracing benchmark for fine-grained bird species recognition, collected from over 40,000 participants answering 17+ million questions across 10,000+ species, enabling study of visual expertise development.
Details
Motivation: To understand how individuals acquire expertise in complex fine-grained visual classification and accurately infer human learners' knowledge states, which is essential for modeling visual learning progression.
Method: Collected data from citizen-science platform eBird where participants engage in bird species recognition quizzes, creating a dataset with long-range learning patterns (average 400 questions per participant).
Result: The benchmark shows that tracking learners’ knowledge is challenging across participant subgroups and question types, with contextual information providing varying predictive benefits. It’s one of the largest benchmarks with substantially more learnable concepts.
Conclusion: CleverBirds enables new avenues for studying visual expertise development over time and across individuals, supporting development of improved knowledge tracing methods for fine-grained recognition tasks.
Abstract: Mastering fine-grained visual recognition, essential in many expert domains, can require that specialists undergo years of dedicated training. Modeling the progression of such expertise in humans remains challenging, and accurately inferring a human learner’s knowledge state is a key step toward understanding visual learning. We introduce CleverBirds, a large-scale knowledge tracing benchmark for fine-grained bird species recognition. Collected by the citizen-science platform eBird, it offers insight into how individuals acquire expertise in complex fine-grained classification. More than 40,000 participants have engaged in the quiz, answering over 17 million multiple-choice questions spanning over 10,000 bird species, with long-range learning patterns across an average of 400 questions per participant. We release this dataset to support the development and evaluation of new methods for visual knowledge tracing. We show that tracking learners’ knowledge is challenging, especially across participant subgroups and question types, with different forms of contextual information offering varying degrees of predictive benefit. CleverBirds is among the largest benchmarks of its kind, offering a substantially higher number of learnable concepts. With it, we hope to enable new avenues for studying the development of visual expertise over time and across individuals.
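As a point of reference for what knowledge tracing computes, the classic Bayesian Knowledge Tracing (BKT) update estimates a learner's mastery of a concept from a stream of right/wrong answers. A minimal sketch with illustrative parameter values; BKT is a standard baseline for benchmarks of this kind, not necessarily a method evaluated in the paper.

```python
def bkt_update(p_know, correct, p_slip=0.1, p_guess=0.25, p_learn=0.2):
    """One Bayesian Knowledge Tracing step.

    Bayes-update the probability that the learner has mastered the
    concept given one observed answer, then apply the chance of
    learning from the practice opportunity. Parameters illustrative.
    """
    if correct:
        # correct answers can come from mastery (no slip) or a lucky guess
        post = p_know * (1 - p_slip) / (
            p_know * (1 - p_slip) + (1 - p_know) * p_guess)
    else:
        # wrong answers can come from a slip or genuine non-mastery
        post = p_know * p_slip / (
            p_know * p_slip + (1 - p_know) * (1 - p_guess))
    return post + (1 - post) * p_learn   # learning transition after practice
```

Running this over a participant's 400-question history yields a mastery trajectory per species, the kind of signal the benchmark is designed to evaluate predictors of.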
[230] UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei
Main category: cs.CV
TL;DR: UniVA is an open-source multi-agent framework that unifies video understanding, segmentation, editing, and generation into cohesive workflows using a Plan-and-Act dual-agent architecture with hierarchical memory.
Details
Motivation: Real-world applications require complex, iterative video workflows that combine multiple AI capabilities, but current specialized models only excel at isolated tasks, creating a gap for unified video processing systems.
Method: Uses Plan-and-Act dual-agent architecture: planner agent interprets user intentions and decomposes tasks, while executor agents execute through modular MCP-based tool servers. Features hierarchical multi-level memory for long-horizon reasoning and contextual continuity.
Result: Enables iterative any-conditioned video workflows (text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome. Also introduces UniVA-Bench benchmark suite for evaluation.
Conclusion: UniVA and UniVA-Bench are fully open-sourced to catalyze research on interactive, agentic, and general-purpose video intelligence for next-generation multimodal AI systems.
Abstract: While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation $\rightarrow$ multi-round editing $\rightarrow$ object segmentation $\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
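At its core, the Plan-and-Act split reduces to a planner emitting (tool, payload) steps and executors dispatched from a registry. A deliberately toy sketch; the tool names and payloads here are hypothetical, and the real system routes through MCP-based tool servers with hierarchical memory.

```python
def run_workflow(steps, tools):
    """Toy Plan-and-Act dispatch.

    `steps` is the planner's decomposition of a user request into
    (tool_name, payload) pairs; `tools` is a registry mapping names to
    executor callables, mirroring UniVA's modular tool servers.
    Illustrative only.
    """
    results = []
    for name, payload in steps:
        results.append(tools[name](payload))   # executor agents run each step
    return results
```

The registry indirection is the design point: new capabilities (segmentation, editing, tracking) plug in as entries without changing the planner loop.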
[231] 3D4D: An Interactive, Editable, 4D World Model via 3D Video Generation
Yunhong He, Zhengqing Yuan, Zhengzhong Tu, Yanfang Ye, Lichao Sun
Main category: cs.CV
TL;DR: 3D4D is an interactive 4D visualization framework using WebGL and Supersplat rendering to convert static images/text into coherent 4D scenes with real-time multi-modal interaction.
Details
Motivation: To enable adaptive, user-driven exploration of complex 4D environments through interactive visualization.
Method: Integrates WebGL with Supersplat rendering, uses four core modules to transform static content into 4D scenes, and employs foveated rendering for efficient real-time interaction.
Result: Developed a framework that enables coherent 4D scene generation from static inputs with efficient real-time multi-modal interaction capabilities.
Conclusion: 3D4D provides an effective solution for interactive 4D visualization and exploration of complex environments through its integrated rendering approach and adaptive interaction design.
Abstract: We introduce 3D4D, an interactive 4D visualization framework that integrates WebGL with Supersplat rendering. It transforms static images and text into coherent 4D scenes through four core modules and employs a foveated rendering strategy for efficient, real-time multi-modal interaction. This framework enables adaptive, user-driven exploration of complex 4D environments. The project page and code are available at https://yunhonghe1021.github.io/NOVA/.
[232] RePose-NeRF: Robust Radiance Fields for Mesh Reconstruction under Noisy Camera Poses
Sriram Srinivasan, Gautam Ramachandra
Main category: cs.CV
TL;DR: Proposes a robust framework for 3D mesh reconstruction from multi-view images with noisy camera poses, jointly refining poses while learning implicit scene representation for editable meshes compatible with standard 3D tools.
Details
Motivation: Accurate 3D reconstruction is essential for robotics but existing NeRF methods rely on precise camera poses and produce inefficient implicit representations that differ from widely used polygonal meshes.
Method: Jointly refines camera poses while learning implicit scene representation that captures geometric detail and photorealistic appearance, producing editable 3D meshes.
Result: Achieves accurate and robust 3D reconstruction under pose uncertainty on standard benchmarks, producing meshes compatible with common 3D graphics and robotics tools.
Conclusion: Bridges the gap between neural implicit representations and practical robotic applications by enabling efficient downstream use of reconstructed 3D meshes.
Abstract: Accurate 3D reconstruction from multi-view images is essential for downstream robotic tasks such as navigation, manipulation, and environment understanding. However, obtaining precise camera poses in real-world settings remains challenging, even when calibration parameters are known. This limits the practicality of existing NeRF-based methods that rely heavily on accurate extrinsic estimates. Furthermore, their implicit volumetric representations differ significantly from the widely adopted polygonal meshes, making rendering and manipulation inefficient in standard 3D software. In this work, we propose a robust framework that reconstructs high-quality, editable 3D meshes directly from multi-view images with noisy extrinsic parameters. Our approach jointly refines camera poses while learning an implicit scene representation that captures fine geometric detail and photorealistic appearance. The resulting meshes are compatible with common 3D graphics and robotics tools, enabling efficient downstream use. Experiments on standard benchmarks demonstrate that our method achieves accurate and robust 3D reconstruction under pose uncertainty, bridging the gap between neural implicit representations and practical robotic applications.
[233] Vision Transformer Based User Equipment Positioning
Parshwa Shah, Dhaval K. Patel, Brijesh Soni, Miguel López-Benítez, Siddhartan Govindasamy
Main category: cs.CV
TL;DR: Proposed an attention-based Vision Transformer for UE positioning using Angle Delay Profile from CSI, achieving 38% improvement over state-of-the-art methods.
Details
Motivation: Existing DL models for UE positioning have limitations: they apply equal attention to the entire input and are not well-suited for non-sequential data like instantaneous CSI.
Method: Used attention-based Vision Transformer architecture focusing on Angle Delay Profile from CSI matrix, validated on DeepMIMO and ViWi ray-tracing datasets.
Result: Achieved RMSE of 0.55m indoors, 13.59m outdoors in DeepMIMO, and 3.45m in ViWi’s outdoor blockage scenario. Outperformed state-of-the-art by ~38% with better error distribution.
Conclusion: The attention-based ViT approach effectively addresses limitations of traditional DL models for UE positioning and significantly improves positioning accuracy.
Abstract: Recently, Deep Learning (DL) techniques have been used for User Equipment (UE) positioning. However, the key shortcomings of such models are that: i) they apply equal attention to the entire input; and ii) they are not well suited to non-sequential data, e.g., when only instantaneous Channel State Information (CSI) is available. In this context, we propose an attention-based Vision Transformer (ViT) architecture that focuses on the Angle Delay Profile (ADP) from the CSI matrix. Our approach, validated on the DeepMIMO and ViWi ray-tracing datasets, achieves a Root Mean Squared Error (RMSE) of 0.55m indoors, 13.59m outdoors in DeepMIMO, and 3.45m in ViWi’s outdoor blockage scenario. The proposed scheme outperforms state-of-the-art schemes by $\sim$ 38%. It also performs substantially better than the other approaches we considered in terms of the distribution of error distance.
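Two pieces of this pipeline are standard enough to sketch: the ADP is commonly obtained as the magnitude of a 2D DFT of the CSI matrix (antennas mapped to angle, subcarriers to delay), and RMSE is the reported error metric. A NumPy sketch, assuming this conventional ADP construction rather than the paper's exact preprocessing:

```python
import numpy as np

def angle_delay_profile(csi):
    """Angle Delay Profile from a CSI matrix (antennas x subcarriers).

    A 2D DFT maps the spatial axis to angle and the frequency axis to
    delay; the magnitude gives the image-like input a ViT can attend
    over. Sketch of the standard transform, not the full pipeline.
    """
    adp = np.abs(np.fft.fft2(csi))
    return np.fft.fftshift(adp)   # center the angle/delay axes

def rmse(pred, true):
    """Root Mean Squared Error over predicted vs. true positions."""
    pred, true = np.asarray(pred), np.asarray(true)
    return float(np.sqrt(np.mean(np.sum((pred - true) ** 2, axis=-1))))
```

Because the ADP is non-negative and image-like, patch-based attention can weight informative angle-delay bins unevenly, addressing shortcoming (i) above.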
[234] DODA: Adapting Object Detectors to Dynamic Agricultural Environments in Real-Time with Diffusion
Shuai Xiang, Pieter M. Blok, James Burridge, Haozhou Wang, Wei Guo
Main category: cs.CV
TL;DR: DODA is a diffusion-based framework that quickly adapts object detectors to new agricultural domains in just 2 minutes without retraining, using external domain embeddings and improved layout-to-image generation.
Details
Motivation: Object detection models in agriculture suffer from domain shifts due to constantly changing environments, making traditional domain adaptation methods impractical as they require retraining for each new domain.
Method: DODA uses diffusion models with external domain embeddings and an improved layout-to-image approach to generate high-quality detection data for new domains without additional training.
Result: On the Global Wheat Head Detection dataset, fine-tuning detectors on DODA-generated data yields significant improvements across multiple domains.
Conclusion: DODA provides a simple yet powerful solution for agricultural domain adaptation, reducing barriers for growers to use detection in personalized environments.
Abstract: Object detection has wide applications in agriculture, but domain shifts of diverse environments limit the broader use of the trained models. Existing domain adaptation methods usually require retraining the model for new domains, which is impractical for agricultural applications due to constantly changing environments. In this paper, we propose DODA ($D$iffusion for $O$bject-detection $D$omain Adaptation in $A$griculture), a diffusion-based framework that can adapt the detector to a new domain in just 2 minutes. DODA incorporates external domain embeddings and an improved layout-to-image approach, allowing it to generate high-quality detection data for new domains without additional training. We demonstrate DODA’s effectiveness on the Global Wheat Head Detection dataset, where fine-tuning detectors on DODA-generated data yields significant improvements across multiple domains. DODA provides a simple yet powerful solution for agricultural domain adaptation, reducing the barriers for growers to use detection in personalised environments. The code is available at https://github.com/UTokyo-FieldPhenomics-Lab/DODA.
[235] Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
Main category: cs.CV
TL;DR: This paper proposes multimodal adversarial training (MAT) for defending against multimodal attacks in vision-language tasks, addressing limitations of existing unimodal defenses and exploring one-to-many relationships in training data.
Details
Motivation: Existing defense methods focus on image classification and overlook multimodal attacks (both image and text perturbations) and one-to-many relationships in vision-language tasks, where current VL defense methods only consider vision robustness.
Method: Proposed multimodal adversarial training (MAT) that incorporates adversarial perturbations in both image and text modalities during training, and investigated diverse augmentation techniques to leverage one-to-many relationships in training data.
Result: MAT significantly outperforms existing unimodal defenses. Analysis shows that effective defense requires augmented image-text pairs to be well-aligned, diverse, and avoid distribution shift - conditions overlooked by prior research.
Conclusion: This work pioneers defense strategies against multimodal attacks in vision-language models, providing insights for building robust VLMs from both optimization and data perspectives.
Abstract: Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating diverse augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, yet avoid distribution shift – conditions overlooked by prior research. This work pioneers defense strategies against multimodal attacks, providing insights for building robust VLMs from both optimization and data perspectives.
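The "perturb both modalities" idea behind MAT can be illustrated on a toy dot-product matching score, where the FGSM gradient sign is available in closed form. This is a sketch: real MAT computes perturbations through model gradients, and text perturbations live in token or embedding space.

```python
import numpy as np

def fgsm_pair(img, txt, eps_img=0.03, eps_txt=0.03):
    """One FGSM-style attack step on BOTH modalities of a dot-product
    matching score, as in multimodal adversarial training: perturb each
    input in the direction that lowers image-text similarity.

    Toy linear case: the gradient of img @ txt w.r.t. img is txt, so
    the FGSM sign direction is sign(txt), and symmetrically for txt.
    """
    img_adv = img - eps_img * np.sign(txt)   # step against d(sim)/d(img)
    txt_adv = txt - eps_txt * np.sign(img)   # step against d(sim)/d(txt)
    return img_adv, txt_adv
```

Training on pairs attacked in both modalities, rather than images alone, is the gap MAT targets relative to unimodal defenses.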
[236] MMCL: Correcting Content Query Distributions for Improved Anti-Overlapping X-Ray Object Detection
Mingyuan Li, Tong Jia, Hui Lu, Hao Wang, Bowen Ma, Shiyi Guo, Shuyang Lin, Dongyue Chen, Haoran Wang, Baosheng Yu
Main category: cs.CV
TL;DR: Proposes multi-class min-margin contrastive learning (MMCL) to improve content query distribution in DETR for X-ray object detection, addressing depth-induced superimposition challenges.
Details
Motivation: X-ray images have depth-induced superimposition where objects at different depths overlap and blend features, requiring specialized mechanisms to disentangle mixed representations between target objects and backgrounds. Current DETR adaptations for anti-overlapping detection overlook the importance of well-distributed content queries.
Method: MMCL framework groups content queries by object category and applies two complementary loss components: multi-class exclusion loss for inter-class separability and min-margin clustering loss for intra-class diversity.
Result: Evaluation on three X-ray prohibited-item detection datasets (PIXray, OPIXray, PIDray) using two backbone networks and four DETR variants shows MMCL effectively enhances anti-overlapping object detection and achieves state-of-the-art performance.
Conclusion: MMCL successfully corrects content query distribution, achieving balanced intra-class diversity and inter-class separability for improved X-ray object detection in overlapping scenarios.
Abstract: Unlike natural images with occlusion-based overlap, X-ray images exhibit depth-induced superimposition and semi-transparent appearances, where objects at different depths overlap and their features blend together. These characteristics demand specialized mechanisms to disentangle mixed representations between target objects (e.g., prohibited items) and irrelevant backgrounds. While recent studies have explored adapting detection transformers (DETR) for anti-overlapping object detection, the importance of well-distributed content queries that represent object hypotheses remains underexplored. In this paper, we introduce a multi-class min-margin contrastive learning (MMCL) framework to correct the distribution of content queries, achieving balanced intra-class diversity and inter-class separability. The framework first groups content queries by object category and then applies two proposed complementary loss components: a multi-class exclusion loss to enhance inter-class separability, and a min-margin clustering loss to encourage intra-class diversity. We evaluate the proposed method on three widely used X-ray prohibited-item detection datasets, PIXray, OPIXray, and PIDray, using two backbone networks and four DETR variants. Experimental results demonstrate that MMCL effectively enhances anti-overlapping object detection and achieves state-of-the-art performance on these datasets. Code will be made publicly available on GitHub.
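The two complementary objectives can be illustrated with toy distance-based losses: push different-class query centroids apart, and hinge-penalize same-class queries that collapse within a margin. This is an illustrative NumPy formulation, not the paper's exact losses.

```python
import numpy as np

def mmcl_losses(queries, labels, margin=0.5):
    """Sketch of MMCL's two complementary objectives.

    `excl`: exclusion term, large when different-class centroids are
    close (penalizing poor inter-class separability).
    `div`: min-margin term, penalizing same-class query pairs closer
    than `margin` (encouraging intra-class diversity).
    Illustrative formulation only.
    """
    classes = np.unique(labels)
    centroids = np.stack([queries[labels == c].mean(axis=0) for c in classes])

    excl = 0.0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            excl += 1.0 / (1e-6 + np.linalg.norm(centroids[i] - centroids[j]))

    div = 0.0
    for c in classes:
        q = queries[labels == c]
        for i in range(len(q)):
            for j in range(i + 1, len(q)):
                d = np.linalg.norm(q[i] - q[j])
                div += max(0.0, margin - d)   # hinge on collapsed pairs
    return excl, div
```

Minimizing both terms simultaneously spreads queries within each category while keeping categories apart, the "balanced intra-class diversity and inter-class separability" the abstract describes.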
[237] OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
Main category: cs.CV
TL;DR: OpenVLThinker is an open-source large vision-language model that achieves significant performance gains on visual reasoning tasks through an alternating SFT-RL training approach, demonstrating the synergy between supervised fine-tuning and reinforcement learning for complex reasoning.
Details
Motivation: Existing methods face challenges: SFT-based distillation from text reasoning models suffers from performance degradation due to imprecise visual grounding, while purely RL-based methods struggle with large search spaces in smaller models, hindering reflective reasoning behaviors.
Method: Alternating between supervised fine-tuning (SFT) and reinforcement learning (RL) in iterative cycles. SFT surfaces latent reasoning behaviors and narrows the RL search space, while RL refines reasoning skills and produces higher-quality SFT data for continued self-improvement.
Result: OpenVLThinker-7B achieves notable performance improvements across six benchmarks: MathVista (+3.8%), EMMA (+2.4%), HallusionBench (+1.6%), and other mathematical and general reasoning tasks, demonstrating consistent advancement in visual reasoning capabilities.
Conclusion: The alternating SFT-RL approach enables sophisticated chain-of-thought reasoning in vision-language models, providing early evidence for achieving R1-style reasoning in multimodal contexts and demonstrating effective synergy between SFT and RL for complex reasoning tasks.
Abstract: We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning, achieving notable performance gains on challenging visual reasoning tasks. While text-based reasoning models (e.g., Deepseek R1) show promising results in text-only tasks, distilling their reasoning into LVLMs via supervised fine-tuning (SFT) often results in performance degradation due to imprecise visual grounding. Conversely, purely reinforcement learning (RL)-based methods face a large search space, hindering the emergence of reflective behaviors in smaller models (e.g., 7B LVLMs). Surprisingly, alternating between SFT and RL ultimately results in significant performance improvements after a few iterations. Our analysis reveals that the base model rarely exhibits reasoning behaviors initially, but SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities. Each subsequent RL stage further refines the model’s reasoning skills, producing higher-quality SFT data for continued self-improvement. OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning, notably improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. Beyond demonstrating the synergy between SFT and RL for complex reasoning tasks, our findings provide early evidence towards achieving R1-style reasoning in multimodal contexts. The code, model and data are available at https://github.com/yihedeng9/OpenVLThinker.
[238] One Homography is All You Need: IMM-based Joint Homography and Multiple Object State Estimation
Paul Johannes Claasen, Johan Pieter de Villiers
Main category: cs.CV
TL;DR: IMM-JHSE is a novel online MOT algorithm that uses homography estimation instead of regular 3D measurements, combining static and dynamic camera motion models with IMM filtering for robust multi-object tracking.
Details
Motivation: To address limitations in previous MOT methods that rely on explicit camera motion compensation and regular 3D measurements, aiming to create a more robust tracking system that can handle motion away from the ground plane.
Method: Uses IMM filter to jointly model homography matrix and dynamics in track states, combines static/dynamic camera motion models, applies non-standard IMM approach for association using mixed BIoU scores and Mahalanobis distances, and employs dynamic process/measurement noise estimation.
Result: Outperforms UCMCTrack, OC-SORT, C-BIoU and ByteTrack on DanceTrack and KITTI-car datasets (HOTA improvements of 2.64 and 2.11 respectively), competitive on MOT17, MOT20 and KITTI-pedestrian datasets, and shows similar performance to tracking-by-attention methods on DanceTrack while outperforming them on MOT17.
Conclusion: IMM-JHSE provides state-of-the-art performance in 2D MOT using only homography estimation as 3D information, demonstrating robustness to camera motion and competitive results across multiple challenging datasets.
Abstract: A novel online MOT algorithm, IMM Joint Homography State Estimation (IMM-JHSE), is proposed. IMM-JHSE uses an initial homography estimate as the only additional 3D information, whereas other 3D MOT methods use regular 3D measurements. By jointly modelling the homography matrix and its dynamics as part of track state vectors, IMM-JHSE removes the explicit influence of camera motion compensation techniques on predicted track position states, which was prevalent in previous approaches. Expanding upon this, static and dynamic camera motion models are combined using an IMM filter. A simple bounding box motion model is used to predict bounding box positions to incorporate image plane information. In addition to applying an IMM to camera motion, a non-standard IMM approach is applied where bounding-box-based BIoU scores are mixed with ground-plane-based Mahalanobis distances in an IMM-like fashion to perform association only, making IMM-JHSE robust to motion away from the ground plane. Finally, IMM-JHSE makes use of dynamic process and measurement noise estimation techniques. IMM-JHSE improves upon related techniques, including UCMCTrack, OC-SORT, C-BIoU and ByteTrack on the DanceTrack and KITTI-car datasets, increasing HOTA by 2.64 and 2.11, respectively, while offering competitive performance on the MOT17, MOT20 and KITTI-pedestrian datasets. Using publicly available detections, IMM-JHSE outperforms almost all other 2D MOT methods and is outperformed only by 3D MOT methods – some of which are offline – on the KITTI-car dataset. Compared to tracking-by-attention methods, IMM-JHSE shows remarkably similar performance on the DanceTrack dataset and outperforms them on the MOT17 dataset. The code is publicly available: https://github.com/Paulkie99/imm-jhse.
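The abstract's "IMM-like" score mixing can be illustrated with a toy sketch. The function below is entirely our assumption (the names and the pseudo-likelihood model are not from the paper): it blends an image-plane BIoU score with a ground-plane Mahalanobis distance the way an IMM blends its modes, updating the mixing weights from per-cue likelihoods and returning the weighted sum of per-cue costs.

```python
import numpy as np

def imm_mix_cost(biou, maha_dist, mode_probs):
    """Blend a BIoU score and a Mahalanobis distance into one association
    cost, IMM-style (illustrative sketch, not the paper's exact rule)."""
    # Per-cue pseudo-likelihoods: a high BIoU and a small ground-plane
    # distance both suggest "this detection likely belongs to this track".
    lik = np.array([biou, np.exp(-0.5 * maha_dist ** 2)])
    # IMM-style mode update: prior mixing weights times cue likelihoods,
    # renormalized, as an IMM updates its mode probabilities.
    post = mode_probs * lik
    post = post / post.sum()
    # Per-cue costs (lower is better), mixed with the updated weights.
    costs = np.array([1.0 - biou, maha_dist])
    return float(post @ costs), post
```

With equal prior weights, a detection that matches well in both cues (high BIoU, small distance) receives a lower mixed cost than one that matches poorly, which is the behavior an association stage needs.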
[239] A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling
Kyle Buettner, Jacob T. Emmerson, Adriana Kovashka
Main category: cs.CV
TL;DR: The paper addresses perceptual bias in vision-language models by using native speaker data and multimodal reasoning to augment captions, improving text-image retrieval in German and Japanese.
Details
Motivation: Modern VLMs rely on English captions translated to other languages, which introduces perceptual bias from English speakers' perspectives, limiting cultural and linguistic diversity in image descriptions.
Method: A framework using small native speaker data, nearest-neighbor guidance, and multimodal LLM reasoning to rewrite captions to better reflect target language descriptions, then using these for multilingual CLIP finetuning.
Result: Improved German and Japanese text-image retrieval (up to +3.5 mean recall, +4.4 on native vs. translation errors) and insights into cross-language object description variation.
Conclusion: The proposed method effectively reduces perceptual bias in VLMs by incorporating native speaker perspectives, enhancing cross-cultural and cross-linguistic generalization in image captioning.
Abstract: When captioning an image, people describe objects in diverse ways, such as by using different terms and/or including details that are perceptually noteworthy to them. Descriptions can be especially unique across languages and cultures. Modern vision-language models (VLMs) gain understanding of images with text in different languages often through training on machine translations of English captions. However, this process relies on input content written from the perception of English speakers, leading to a perceptual bias. In this work, we outline a framework to address this bias. We specifically use a small amount of native speaker data, nearest-neighbor example guidance, and multimodal LLM reasoning to augment captions to better reflect descriptions in a target language. When adding the resulting rewrites to multilingual CLIP finetuning, we improve on German and Japanese text-image retrieval case studies (up to +3.5 mean recall, +4.4 on native vs. translation errors). We also propose a mechanism to build understanding of object description variation across languages, and offer insights into cross-dataset and cross-language generalization.
[240] Benchmarking Domain Generalization Algorithms in Computational Pathology
Neda Zamanitajeddin, Mostafa Jahanifar, Kesi Xu, Fouzia Siraj, Nasir Rajpoot
Main category: cs.CV
TL;DR: This study benchmarks 30 domain generalization algorithms on 3 computational pathology tasks through 7,560 cross-validation runs, finding that self-supervised learning and stain augmentation perform best, and introduces a new pan-cancer tumor detection dataset.
Details
Motivation: Deep learning models in computational pathology suffer from performance degradation on unseen data due to domain shifts, but there's a lack of systematic evaluation of domain generalization algorithms in this context.
Method: Used a unified platform to evaluate 30 DG algorithms on 3 CPath tasks through extensive cross-validation (7,560 runs), incorporating modality-specific techniques and pretrained foundation models.
Result: Self-supervised learning and stain augmentation consistently outperformed other methods. A new pan-cancer tumor detection dataset (HISTOPANTUM) was introduced as a benchmark.
Conclusion: The study provides valuable guidance for selecting appropriate domain generalization approaches in computational pathology, highlighting the effectiveness of pretrained models and data augmentation strategies.
Abstract: Deep learning models have shown immense promise in computational pathology (CPath) tasks, but their performance often suffers when applied to unseen data due to domain shifts. Addressing this requires domain generalization (DG) algorithms. However, a systematic evaluation of DG algorithms in the CPath context is lacking. This study aims to benchmark the effectiveness of 30 DG algorithms on 3 CPath tasks of varying difficulty through 7,560 cross-validation runs. We evaluate these algorithms using a unified and robust platform, incorporating modality-specific techniques and recent advances like pretrained foundation models. Our extensive cross-validation experiments provide insights into the relative performance of various DG strategies. We observe that self-supervised learning and stain augmentation consistently outperform other methods, highlighting the potential of pretrained models and data augmentation. Furthermore, we introduce a new pan-cancer tumor detection dataset (HISTOPANTUM) as a benchmark for future research. This study offers valuable guidance to researchers in selecting appropriate DG approaches for CPath tasks.
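As a concrete illustration of the stain-augmentation family the benchmark found effective, here is a deliberately simplified sketch (our stand-in, not any specific method from the study; published stain augmentations typically perturb H&E components after stain deconvolution rather than raw RGB channels): jitter each channel multiplicatively in optical-density space.

```python
import numpy as np

def stain_jitter(img, rng, sigma=0.05):
    """Jitter each RGB channel multiplicatively in optical-density space.
    A simplified stand-in for stain augmentation; `img` is a uint8
    (H, W, 3) tile, `rng` a numpy Generator."""
    od = -np.log((img.astype(np.float64) + 1.0) / 256.0)  # optical density
    scale = rng.normal(1.0, sigma, size=3)                # per-channel factor
    out = 256.0 * np.exp(-od * scale) - 1.0               # back to intensity
    return np.rint(np.clip(out, 0, 255)).astype(np.uint8)
```

Working in optical density rather than raw intensity mimics how stain concentration combines multiplicatively with transmitted light, which is why stain augmentations operate in that space.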
[241] Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach
Jing Bi, Junjia Guo, Yunlong Tang, Lianggong Bruce Wen, Zhang Liu, Chenliang Xu
Main category: cs.CV
TL;DR: The paper identifies specialized attention heads in MLLMs that focus on visual content, revealing how language models adapt to multimodal tasks by bridging textual and visual understanding.
Details
Motivation: To understand how language models trained on linguistic data can effectively interpret and process visual content in multimodal settings.
Method: Systematic investigation across 4 model families and 4 model scales, analyzing attention heads and their focus on visual tokens.
Result: Discovery of unique attention heads that specifically target visual content, with strong correlation between attention behavior, weight distribution, and concentration on visual tokens.
Conclusion: LLMs demonstrate potential to bridge textual and visual understanding, paving the way for AI systems capable of engaging with diverse modalities.
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on linguistic data, effectively interpret and process visual content? This paper aims to address this question with systematic investigation across 4 model families and 4 model scales, uncovering a unique class of attention heads that focus specifically on visual content. Our analysis reveals a strong correlation between the behavior of these attention heads, the distribution of attention weights, and their concentration on visual tokens within the input. These findings enhance our understanding of how LLMs adapt to multimodal tasks, demonstrating their potential to bridge the gap between textual and visual understanding. This work paves the way for the development of AI systems capable of engaging with diverse modalities.
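A head's "concentration on visual tokens" can be measured in a few lines of NumPy. The sketch below scores each head by the average attention mass that text positions place on visual positions; heads with high scores are candidates for the visually specialized heads the paper identifies (the scoring rule is our assumption, and the paper's exact criterion may differ).

```python
import numpy as np

def visual_head_scores(attn, visual_idx):
    """Score each attention head by its mass on visual tokens.
    attn: (H, T, T) attention weights for one layer; visual_idx: the
    positions of visual tokens in the sequence."""
    visual = np.zeros(attn.shape[-1], dtype=bool)
    visual[visual_idx] = True
    text_rows = attn[:, ~visual, :]          # attention from text positions
    mass = text_rows[:, :, visual].sum(-1)   # mass landing on visual tokens
    return mass.mean(axis=1)                 # (H,) per-head score
```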
[242] Towards Visual Grounding: A Survey
Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, Changsheng Xu
Main category: cs.CV
TL;DR: This survey provides a comprehensive overview of Visual Grounding (Referring Expression Comprehension), covering its development history, recent advancements since 2021, various settings, datasets, applications, and future research directions.
Details
Motivation: Visual Grounding simulates referential relationships between visual and linguistic modalities, enabling machines to develop human-like multimodal comprehension capabilities. Since 2021, the field has seen significant advancements with new concepts that bring numerous challenges, necessitating a comprehensive survey.
Method: The authors examine the developmental history, provide background knowledge, systematically track advancements, define and organize various settings, analyze datasets and applications, and highlight advanced topics.
Result: This survey represents the most comprehensive overview currently available in visual grounding, encompassing representative work from the past decade and serving as an invaluable resource for both beginners and experienced researchers.
Conclusion: The paper outlines challenges in visual grounding and proposes valuable directions for future research, while maintaining an updated repository of related work to help researchers understand key concepts and track latest developments.
Abstract: Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships between visual and linguistic modalities, enabling machines to develop human-like multimodal comprehension capabilities. Consequently, it has extensive applications in various domains. However, since 2021, visual grounding has witnessed significant advancements, with emerging new concepts such as grounded pre-training, grounding multimodal LLMs, generalized visual grounding, and giga-pixel grounding, which have brought numerous new challenges. In this survey, we first examine the developmental history of visual grounding and provide an overview of essential background knowledge. We systematically track and summarize the advancements, and then meticulously define and organize the various settings to standardize future research and ensure a fair comparison. Additionally, we delve into numerous related datasets and applications, and highlight several advanced topics. Finally, we outline the challenges confronting visual grounding and propose valuable directions for future research, which may serve as inspiration for subsequent researchers. By extracting common technical details, this survey encompasses the representative work in each subtopic over the past decade. To the best of our knowledge, this paper represents the most comprehensive overview currently available in the field of visual grounding. This survey is designed to be suitable for both beginners and experienced researchers, serving as an invaluable resource for understanding key concepts and tracking the latest research developments. We keep tracing related work at https://github.com/linhuixiao/Awesome-Visual-Grounding.
[243] Bridged Semantic Alignment for Zero-shot 3D Medical Image Diagnosis
Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Weifu Lv, Wei Wei, S. Kevin Zhou
Main category: cs.CV
TL;DR: The paper proposes BrgSA, a framework that bridges the semantic gap between visual and textual embeddings in vision-language alignment for 3D medical image diagnosis, achieving state-of-the-art zero-shot performance on underrepresented abnormalities.
Details
Motivation: Existing vision-language alignment methods for 3D medical images show well-separated clusters between visual and textual embeddings, creating a significant gap that limits zero-shot learning effectiveness, particularly for underrepresented abnormalities.
Method: BrgSA uses large language models for semantic summarization of reports and introduces a Cross-Modal Knowledge Interaction module with a cross-modal knowledge bank as a semantic bridge to facilitate interaction between modalities and narrow the alignment gap.
Result: BrgSA achieves state-of-the-art performance on both public benchmark datasets and a custom-labeled dataset with 15 underrepresented abnormalities, showing significant improvements in zero-shot diagnosis of rare conditions.
Conclusion: The proposed semantic bridging approach effectively narrows the gap between visual and textual embeddings, enabling superior zero-shot diagnosis performance for underrepresented abnormalities in 3D medical imaging without requiring additional annotations.
Abstract: 3D medical images such as computed tomography are widely used in clinical practice, offering a great potential for automatic diagnosis. Supervised learning-based approaches have achieved significant progress but rely heavily on extensive manual annotations, limited by the availability of training data and the diversity of abnormality types. Vision-language alignment (VLA) offers a promising alternative by enabling zero-shot learning without additional annotations. However, we empirically discover that the visual and textural embeddings after alignment endeavors from existing VLA methods form two well-separated clusters, presenting a wide gap to be bridged. To bridge this gap, we propose a Bridged Semantic Alignment (BrgSA) framework. First, we utilize a large language model to perform semantic summarization of reports, extracting high-level semantic information. Second, we design a Cross-Modal Knowledge Interaction module that leverages a cross-modal knowledge bank as a semantic bridge, facilitating interaction between the two modalities, narrowing the gap, and improving their alignment. To comprehensively evaluate our method, we construct a benchmark dataset that includes 15 underrepresented abnormalities as well as utilize two existing benchmark datasets. Experimental results demonstrate that BrgSA achieves state-of-the-art performances on both public benchmark datasets and our custom-labeled dataset, with significant improvements in zero-shot diagnosis of underrepresented abnormalities.
[244] X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding
Wenqi Zhou, Kai Cao, Hao Zheng, Yunze Liu, Xinyi Zheng, Miao Liu, Per Ola Kristensson, Walterio Mayol-Cuevas, Fan Zhang, Weizhe Lin, Junxiao Shen
Main category: cs.CV
TL;DR: X-LeBench is a benchmark dataset for evaluating long-form egocentric video understanding, featuring 432 simulated life logs spanning 23 minutes to 16.4 hours, created by combining synthetic daily plans with real Ego4D footage.
Details
Motivation: Existing datasets focus on short to moderately long videos, creating a gap for evaluating extensive, ultra-long egocentric video recordings needed for applications in embodied intelligence and long-term activity analysis.
Method: Developed a life-logging simulation pipeline that produces realistic daily plans aligned with real-world video data, flexibly integrating synthetic daily plans with real footage from Ego4D.
Result: Created 432 simulated video life logs ranging from 23 minutes to 16.4 hours. Evaluations showed poor performance of baseline systems and MLLMs across all tasks, highlighting challenges in temporal localization, context aggregation, and memory retention.
Conclusion: The benchmark reveals significant limitations in current models for long-form egocentric video understanding and underscores the need for more advanced approaches to handle temporal reasoning and context aggregation in ultra-long videos.
Abstract: Long-form egocentric video understanding provides rich contextual information and unique insights into long-term human behaviors, holding significant potential for applications in embodied intelligence, long-term activity analysis, and personalized assistive technologies. However, existing benchmark datasets primarily focus on single, short (e.g., minutes to tens of minutes) to moderately long videos, leaving a substantial gap in evaluating extensive, ultra-long egocentric video recordings. To address this, we introduce X-LeBench, a novel benchmark dataset meticulously designed to fill this gap by focusing on tasks requiring a comprehensive understanding of extremely long egocentric video recordings. Our X-LeBench develops a life-logging simulation pipeline that produces realistic, coherent daily plans aligned with real-world video data. This approach enables the flexible integration of synthetic daily plans with real-world footage from Ego4D, a massive-scale egocentric video dataset covering a wide range of daily life scenarios, resulting in 432 simulated video life logs spanning from 23 minutes to 16.4 hours. The evaluations of several baseline systems and multimodal large language models (MLLMs) reveal their poor performance across the board, highlighting the inherent challenges of long-form egocentric video understanding, such as temporal localization and reasoning, context aggregation, and memory retention, and underscoring the need for more advanced models.
[245] GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding
Shijie Zhou, Viet Dac Lai, Hao Tan, Jihyung Kil, Wanrong Zhu, Changyou Chen, Ruiyi Zhang
Main category: cs.CV
TL;DR: GUI-AIMA is an attention-based framework that enhances GUI grounding by leveraging MLLMs’ native attention mechanisms instead of direct coordinate generation, achieving state-of-the-art performance with minimal training data.
Details
Motivation: Existing MLLM-based GUI grounding methods struggle with precise coordinate generation from visual inputs, which is computationally intensive. The authors observed that MLLMs have inherent grounding capabilities in their attention mechanisms that can be better utilized.
Method: Proposed GUI-AIMA framework that aligns MLLMs’ multimodal attention with patch-wise grounding signals calculated through multi-head aggregation on simplified query-visual attention matrices. Uses coordinate-free approach with plug-and-play zoom-in capability.
Result: GUI-AIMA-3B trained with only 85k screenshots achieved state-of-the-art performance: 59.6% on ScreenSpot-Pro, 63.8% on OSWorld-G, and 91.5% on ScreenSpot-v2, demonstrating exceptional data efficiency.
Conclusion: Light training can effectively trigger MLLMs’ native grounding capabilities. The attention-based, coordinate-free approach provides an efficient alternative to direct coordinate generation for GUI grounding tasks.
Abstract: Graphical user interface (GUI) grounding is a key function of computer-use agents, which maps natural-language instructions to actionable screen regions. Existing approaches based on Multimodal Large Language Models (MLLMs) typically formulate it as a text-based coordinate generation task, yet directly generating precise coordinates from visual inputs remains challenging and computationally intensive. An intuitive way to implement GUI grounding is to first select visual patches relevant to the instructions and then determine the precise click location within those patches. Based on the observations that general MLLMs have some native grounding capability, nested within their attentions, we propose GUI-AIMA, an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. These signals are calculated adaptively for diverse user instructions by multi-head aggregation on simplified query-visual attention matrices. Besides, its coordinate-free manner can easily integrate a plug-and-play zoom-in stage. GUI-AIMA-3B was trained with only 85k screenshots, demonstrating exceptional data efficiency and verifying that light training can trigger the native grounding capability of MLLMs. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 59.6% on ScreenSpot-Pro, 63.8% on OSWorld-G and 91.5% on ScreenSpot-v2. Project page: https://github.com/sjz5202/GUI-AIMA
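The patch-wise aggregation idea can be sketched in a few lines. Below is a hypothetical simplification of the multi-head aggregation described above (uniform head weights by default, whereas the paper computes the aggregation adaptively per instruction): normalize each head's query-to-patch attention, mix the heads, and take the argmax patch as the coordinate-free grounding prediction.

```python
import numpy as np

def patch_grounding_signal(attn, head_weights=None):
    """Aggregate per-head query->patch attention into one patch score.
    attn: (H, P) attention mass from instruction tokens to each of P
    visual patches, per head. Returns (scores, predicted_patch)."""
    attn = attn / attn.sum(axis=-1, keepdims=True)  # normalize per head
    if head_weights is None:
        head_weights = np.full(attn.shape[0], 1.0 / attn.shape[0])
    scores = head_weights @ attn                    # (P,) mixed patch scores
    return scores, int(scores.argmax())
```

The selected patch could then feed the plug-and-play zoom-in stage that the abstract mentions.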
[246] CSPCL: Category Semantic Prior Contrastive Learning for Deformable DETR-Based Prohibited Item Detectors
Mingyuan Li, Tong Jia, Hao Wang, Bowen Ma, Hui Lu, Shiyi Guo, Da Cai, Dongyue Chen
Main category: cs.CV
TL;DR: Proposes Category Semantic Prior Contrastive Learning (CSPCL) to improve X-ray prohibited item detection by aligning class prototypes with content queries to address foreground-background feature coupling issues.
Details
Motivation: X-ray images suffer from foreground-background feature coupling due to overlapping objects, causing poor performance of natural image detectors on security inspection tasks.
Method: CSPCL mechanism with CSP loss comprising Intra-Class Truncated Attraction (ITA) loss and Inter-Class Adaptive Repulsion (IAR) loss, integrated into Deformable DETR-based models.
Result: Significant performance improvements on PIXray, OPIXray, PIDray, and CLCXray datasets without increasing inference complexity.
Conclusion: CSPCL effectively enhances model sensitivity to foreground features and improves inter-class discriminability, especially for similar categories in X-ray security inspection.
Abstract: Prohibited item detection based on X-ray images is one of the most effective security inspection methods. However, the foreground-background feature coupling caused by the overlapping phenomenon specific to X-ray images makes general detectors designed for natural images perform poorly. To address this issue, we propose a Category Semantic Prior Contrastive Learning (CSPCL) mechanism, which aligns the class prototypes perceived by the classifier with the content queries to correct and supplement the missing semantic information responsible for classification, thereby enhancing the model sensitivity to foreground features. To achieve this alignment, we design a specific contrastive loss, CSP loss, which comprises the Intra-Class Truncated Attraction (ITA) loss and the Inter-Class Adaptive Repulsion (IAR) loss, and outperforms classic contrastive losses. Specifically, the ITA loss leverages class prototypes to attract intra-class content queries and preserves essential intra-class diversity via a gradient truncation function. The IAR loss employs class prototypes to adaptively repel inter-class content queries, with the repulsion strength scaled by prototype-prototype similarity, thereby improving inter-class discriminability, especially among similar categories. CSPCL is general and can be easily integrated into Deformable DETR-based models. Extensive experiments on the PIXray, OPIXray, PIDray, and CLCXray datasets demonstrate that CSPCL significantly enhances the performance of various state-of-the-art models without increasing inference complexity. The code is publicly available at https://github.com/Limingyuan001/CSPCL.
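Since the abstract spells out the structure of the CSP loss, a hedged sketch is possible (the margin, the hinge form, and the weighting details are our assumptions; the exact formulas are not given here): ITA pulls each content query toward its class prototype but stops pulling once it is close enough, and IAR pushes it away from other prototypes with a strength scaled by prototype-prototype similarity.

```python
import numpy as np

def csp_loss(queries, labels, prototypes, margin=0.2, w_ita=1.0, w_iar=1.0):
    """Hedged sketch of a CSP-style loss (ITA attraction + IAR repulsion).
    queries: (N, D) content queries; labels: (N,) class ids;
    prototypes: (C, D) class prototypes perceived by the classifier."""
    q = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    sim = q @ p.T                                  # (N, C) cosine similarity
    n, c = sim.shape
    own = np.eye(c, dtype=bool)[labels]            # (N, C) True at own class

    # ITA: pull each query toward its prototype; the hinge stops pulling
    # once similarity exceeds 1 - margin, preserving intra-class diversity
    # (a stand-in for the paper's gradient truncation function).
    pos = sim[np.arange(n), labels]
    ita = np.maximum(0.0, (1.0 - margin) - pos).mean()

    # IAR: push queries away from wrong prototypes, with the repulsion
    # scaled by prototype-prototype similarity, so similar categories
    # are separated more aggressively.
    proto_sim = np.clip(p @ p.T, 0.0, None)        # (C, C) repulsion weights
    weights = np.where(own, 0.0, proto_sim[labels])
    neg = np.where(own, 0.0, sim)
    iar = (weights * np.maximum(0.0, neg)).sum(axis=1).mean()

    return w_ita * ita + w_iar * iar
```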
[247] TransParking: A Dual-Decoder Transformer Framework with Soft Localization for End-to-End Automatic Parking
Hangyu Du, Chee-Meng Chew
Main category: cs.CV
TL;DR: A vision-based transformer model for end-to-end automatic parking that reduces trajectory prediction errors by ~50% compared to state-of-the-art methods.
Details
Motivation: Automatic parking is critical for precise vehicle positioning in complex environments, and fully differentiable end-to-end systems are a research hotspot in intelligent transportation.
Method: Purely vision-based transformer model trained using expert trajectories, taking camera-captured data as input and directly outputting future trajectory coordinates.
Result: Experimental results show approximately 50% reduction in various errors compared to current state-of-the-art end-to-end trajectory prediction algorithms of the same type.
Conclusion: The approach provides an effective solution for fully differentiable automatic parking with significantly improved accuracy.
Abstract: In recent years, fully differentiable end-to-end autonomous driving systems have become a research hotspot in the field of intelligent transportation. Among various research directions, automatic parking is particularly critical as it aims to enable precise vehicle parking in complex environments. In this paper, we present a purely vision-based transformer model for end-to-end automatic parking, trained using expert trajectories. Given camera-captured data as input, the proposed model directly outputs future trajectory coordinates. Experimental results demonstrate that the various errors of our model have decreased by approximately 50% in comparison with the current state-of-the-art end-to-end trajectory prediction algorithm of the same type. Our approach thus provides an effective solution for fully differentiable automatic parking.
[248] RedDiffuser: Red Teaming Vision-Language Models for Toxic Continuation via Reinforced Stable Diffusion
Ruofan Wang, Xiang Zheng, Xiaosen Wang, Cong Wang, Xingjun Ma, Yu-Gang Jiang
Main category: cs.CV
TL;DR: RedDiffuser is a reinforcement learning framework that fine-tunes diffusion models to generate natural-looking adversarial images that induce toxic continuations in Vision-Language Models, significantly increasing toxicity rates across multiple models.
Details
Motivation: Vision-Language Models are vulnerable to toxic continuation attacks where malicious inputs combined with partial toxic outputs lead to harmful completions, posing unique challenges in multimodal settings where subtle image variations can disproportionately affect model responses.
Method: Proposes RedDiffuser framework that uses reinforcement learning to fine-tune diffusion models, integrating greedy search for candidate image prompts with reinforcement fine-tuning that jointly promotes toxic output and semantic coherence.
Result: RedDiffuser increases toxicity rate in LLaVA outputs by 10.69% and 8.91% on original and hold-out sets, and shows strong transferability with 5.1% increase on Gemini and 26.83% on LLaMA-Vision.
Conclusion: The findings reveal cross-modal toxicity amplification vulnerability in current VLM alignment, highlighting the need for robust multimodal red teaming approaches.
Abstract: Vision-Language Models (VLMs) are vulnerable to jailbreak attacks, where adversaries bypass safety mechanisms to elicit harmful outputs. In this work, we examine an insidious variant of this threat: toxic continuation. Unlike standard jailbreaks that rely solely on malicious instructions, toxic continuation arises when the model is given a malicious input alongside a partial toxic output, resulting in harmful completions. This vulnerability poses a unique challenge in multimodal settings, where even subtle image variations can disproportionately affect the model’s response. To this end, we propose RedDiffuser (RedDiff), the first red teaming framework that uses reinforcement learning to fine-tune diffusion models into generating natural-looking adversarial images that induce toxic continuations. RedDiffuser integrates a greedy search procedure for selecting candidate image prompts with reinforcement fine-tuning that jointly promotes toxic output and semantic coherence. Experiments demonstrate that RedDiffuser significantly increases the toxicity rate in LLaVA outputs by 10.69% and 8.91% on the original and hold-out sets, respectively. It also exhibits strong transferability, increasing toxicity rates on Gemini by 5.1% and on LLaMA-Vision by 26.83%. These findings uncover a cross-modal toxicity amplification vulnerability in current VLM alignment, highlighting the need for robust multimodal red teaming. We will release the RedDiffuser codebase to support future research.
[249] Systematic Literature Review on Vehicular Collaborative Perception - A Computer Vision Perspective
Lei Wan, Jianxin Zhao, Andreas Wiedholz, Manuel Bied, Mateus Martinez de Lucena, Abhishek Dinkar Jagtap, Andreas Festag, Antônio Augusto Fröhlich, Hannan Ejaz Keen, Alexey Vinel
Main category: cs.CV
TL;DR: This systematic review analyzes 106 papers on collaborative perception for autonomous vehicles, examining modalities, collaboration schemes, and perception tasks while addressing practical challenges like pose errors and communication constraints.
Details
Motivation: Current single-vehicle perception systems face limitations including visual occlusions and limited long-range detection, motivating the need for collaborative perception through V2V and V2I communication to enhance autonomous system reliability.
Method: The study follows PRISMA 2020 guidelines to systematically review 106 peer-reviewed articles, analyzing them based on modalities, collaboration schemes, and key perception tasks through comparative analysis.
Result: The review identifies how different methods address practical issues (pose errors, latency, communication constraints, domain shifts, heterogeneity, adversarial attacks) and reveals misalignment between current evaluation metrics and CP’s fundamental objectives.
Conclusion: This comprehensive review provides valuable insights into challenges, opportunities, and risks in vehicular collaborative perception, serving as a reference for advancing future research in this field.
Abstract: The effectiveness of autonomous vehicles relies on reliable perception capabilities. Despite significant advancements in artificial intelligence and sensor fusion technologies, current single-vehicle perception systems continue to encounter limitations, notably visual occlusions and limited long-range detection capabilities. Collaborative Perception (CP), enabled by Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication, has emerged as a promising solution to mitigate these issues and enhance the reliability of autonomous systems. Beyond advancements in communication, the computer vision community is increasingly focusing on improving vehicular perception through collaborative approaches. However, a systematic literature review that thoroughly examines existing work and reduces subjective bias is still lacking. Such a systematic approach helps identify research gaps, recognize common trends across studies, and inform future research directions. In response, this study follows the PRISMA 2020 guidelines and includes 106 peer-reviewed articles. These publications are analyzed based on modalities, collaboration schemes, and key perception tasks. Through a comparative analysis, this review illustrates how different methods address practical issues such as pose errors, temporal latency, communication constraints, domain shifts, heterogeneity, and adversarial attacks. Furthermore, it critically examines evaluation methodologies, highlighting a misalignment between current metrics and CP’s fundamental objectives. By delving into all relevant topics in-depth, this review offers valuable insights into challenges, opportunities, and risks, serving as a reference for advancing research in vehicular collaborative perception.
[250] VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang
Main category: cs.CV
TL;DR: This paper introduces Reinforcement Fine-Tuning (RFT) with spatio-temporal rewards to enhance video reasoning in Multimodal Large Language Models, resulting in VideoChat-R1 which achieves state-of-the-art performance on temporal and spatial perception tasks.
Details
Motivation: To address the unique challenges of video understanding, particularly long-range temporal associations, by integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) for improved video reasoning.
Method: Proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method that incorporates rule-based temporal rewards to enhance video reasoning on specific tasks while preserving original model capabilities. Uses joint RFT on multiple spatio-temporal perception tasks.
Result: Developed VideoChat-R1, which achieves state-of-the-art spatio-temporal perception with significant improvements in temporal grounding (+31.8) and object tracking (+31.2). Also improves general QA benchmarks and enables a more reliable video dialogue system.
Conclusion: The enhanced perception and preserved chat abilities lead to the proposed “Temporal Clue-driven Reasoning” inference schema and provide a foundation for developing robust, real-world video comprehension agents.
Abstract: Reinforcement Learning (RL) benefits Large Language Models (LLMs) for complex reasoning. Inspired by this, we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning and their generalizability. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful Video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2), while also improving general QA benchmarks. The enhanced perception and preserved chat abilities contribute to a more reliable video dialogue system, leading to our “Temporal Clue-driven Reasoning” inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.
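The abstract does not spell out the rule-based temporal rewards. A common rule-based choice for temporal grounding, shown here as a hedged sketch, is segment IoU plus a format bonus; the 0.5/0.5 weighting and the `grounding_reward` name are hypothetical, not the paper's exact design:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) time segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, gt, format_ok=True):
    """Rule-based RFT-style reward: format bonus plus temporal IoU
    (the 0.5/0.5 weighting is an assumed, illustrative choice)."""
    return (0.5 if format_ok else 0.0) + 0.5 * temporal_iou(pred, gt)
```

A perfect, well-formatted prediction scores 1.0; a disjoint segment still earns the format bonus, which keeps the reward signal dense early in training.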
[251] WildFireCan-MMD: A Multimodal Dataset for Classification of User-Generated Content During Wildfires in Canada
Braeden Sherritt, Isar Nejadgholi, Efstratios Aivaliotis, Khaled Mslmani, Marzieh Amini
Main category: cs.CV
TL;DR: WildFireCan-MMD is a new multimodal dataset of X posts from Canadian wildfires, annotated across 12 themes. Custom-trained models outperform zero-shot VLMs and baseline classifiers, achieving an 84.48% f-score for wildfire analysis.
Details
Motivation: Traditional wildfire data sources are slow and costly, while social media offers real-time updates but extracting relevant insights remains challenging. Existing multimodal datasets lack Canadian context representation.
Method: Created WildFireCan-MMD dataset with annotated X posts from Canadian wildfires across 12 themes. Evaluated zero-shot vision-language models, custom-trained classifiers, and baseline methods on this dataset.
Result: Custom-trained models achieved best performance (84.48% f-score), outperforming both zero-shot VLMs and baseline classifiers. The model was successfully applied to analyze trends in large unlabeled wildfire datasets.
Conclusion: Tailored datasets and task-specific training are crucial for effective wildfire response analysis. Localized datasets are important as disaster response requirements vary across regions and contexts.
Abstract: Rapid information access is vital during wildfires, yet traditional data sources are slow and costly. Social media offers real-time updates, but extracting relevant insights remains a challenge. In this work, we focus on multimodal wildfire social media data, which, although existing in current datasets, is currently underrepresented in Canadian contexts. We present WildFireCan-MMD, a new multimodal dataset of X posts from recent Canadian wildfires, annotated across twelve key themes. We evaluate zero-shot vision-language models on this dataset and compare their results with those of custom-trained and baseline classifiers. We show that while baseline methods and zero-shot prompting offer quick deployment, custom-trained models outperform them when labelled data is available. Our best-performing custom model reaches 84.48% f-score, outperforming VLMs and baseline classifiers. We also demonstrate how this model can be used to uncover trends during wildfires, through the collection and analysis of a large unlabeled dataset. Our dataset facilitates future research in wildfire response, and our findings highlight the importance of tailored datasets and task-specific training. Importantly, such datasets should be localized, as disaster response requirements vary across regions and contexts.
[252] CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones
Giacomo Pacini, Lorenzo Bianchi, Luca Ciampi, Nicola Messina, Giuseppe Amato, Fabrizio Falchi
Main category: cs.CV
TL;DR: CountingDINO is the first training-free exemplar-based class-agnostic counting framework that uses self-supervised vision-only backbones to extract object-aware features without any annotated data.
Details
Motivation: Current exemplar-based class-agnostic counting methods rely heavily on labeled data for training, which limits scalability and generalization to downstream use cases.
Method: Extracts latent object prototypes from self-supervised DINO features via ROI-Align, uses them as convolutional kernels to generate similarity maps, and then transforms these into density maps through a normalization scheme.
Result: Outperforms SOTA unsupervised object detector baseline and achieves competitive results against training-free methods with supervised backbones, non-training-free unsupervised methods, and several fully supervised SOTA approaches on FSC-147 benchmark.
Conclusion: Label- and training-free class-agnostic counting can be both scalable and effective, demonstrating the viability of fully unsupervised approaches.
Abstract: Class-agnostic counting (CAC) aims to estimate the number of objects in images without being restricted to predefined categories. However, while current exemplar-based CAC methods offer flexibility at inference time, they still rely heavily on labeled data for training, which limits scalability and generalization to many downstream use cases. In this paper, we introduce CountingDINO, the first training-free exemplar-based CAC framework that exploits a fully unsupervised feature extractor. Specifically, our approach employs self-supervised vision-only backbones to extract object-aware features, and it eliminates the need for annotated data throughout the entire proposed pipeline. At inference time, we extract latent object prototypes via ROI-Align from DINO features and use them as convolutional kernels to generate similarity maps. These are then transformed into density maps through a simple yet effective normalization scheme. We evaluate our approach on the FSC-147 benchmark, where we consistently outperform a baseline based on an SOTA unsupervised object detector under the same label- and training-free setting. Additionally, we achieve results competitive with, and in some cases surpassing, training-free methods that rely on supervised backbones, non-training-free unsupervised methods, as well as several fully supervised SOTA approaches. This demonstrates that label- and training-free CAC can be both scalable and effective. Code: https://lorebianchi98.github.io/CountingDINO/.
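The prototype-as-kernel pipeline can be sketched in a few lines. The cosine form of the similarity and the exemplar-mass normalization below are illustrative assumptions standing in for the paper's exact scheme:

```python
import numpy as np

def similarity_map(features: np.ndarray, prototype: np.ndarray) -> np.ndarray:
    """Cosine similarity between a (C,) prototype and a (C, H, W) feature
    map; a stand-in for the 1x1 convolution with the prototype as kernel."""
    C, H, W = features.shape
    f = features.reshape(C, -1)
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-8)
    p = prototype / (np.linalg.norm(prototype) + 1e-8)
    return (p @ f).reshape(H, W)

def density_map(sim: np.ndarray, exemplar_mask: np.ndarray) -> np.ndarray:
    """Hypothetical normalization: clip negatives, then rescale so the
    exemplar region integrates to one object; the global sum of the
    resulting density map is then the predicted count."""
    s = np.clip(sim, 0.0, None)
    return s / (s[exemplar_mask].sum() + 1e-8)
```

With this normalization, summing the density map over the whole image yields a count expressed in units of "one exemplar's worth" of activation.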
[253] RealRep: Generalized SDR-to-HDR Conversion via Attribute-Disentangled Representation Learning
Li Xu, Siqi Wang, Kepeng Xu, Gang He, Lin Zhang, Weiran Wang, Yu-Wing Tai
Main category: cs.CV
TL;DR: A robust SDR-to-HDR conversion framework using attribute-disentangled representations and degradation-aware mapping to handle diverse real-world SDR content.
Details
Motivation: Existing fixed tone mapping operators struggle with diverse appearances and degradations in real-world SDR content, requiring a more adaptive and robust solution.
Method: RealRep framework with attribute-disentangled representation learning, negative exemplar generation for contrastive learning, and DDACMNet, a lightweight two-stage mapping network with control-aware normalization.
Result: Outperforms state-of-the-art methods in generalization and perceptually faithful HDR color gamut reconstruction across diverse degradation domains.
Conclusion: The proposed RealRep framework provides robust and adaptive SDR-to-HDR conversion by effectively modeling tone discrepancies and degradation variations through attribute-level disentanglement.
Abstract: High-Dynamic-Range Wide-Color-Gamut (HDR-WCG) technology is becoming increasingly widespread, driving a growing need for converting Standard Dynamic Range (SDR) content to HDR. Existing methods primarily rely on fixed tone mapping operators, which struggle to handle the diverse appearances and degradations commonly present in real-world SDR content. To address this limitation, we propose a generalized SDR-to-HDR framework that enhances robustness by learning attribute-disentangled representations. Central to our approach is Realistic Attribute-Disentangled Representation Learning (RealRep), which explicitly disentangles luminance and chrominance components to capture intrinsic content variations across different SDR distributions. Furthermore, we design a Luma-/Chroma-aware negative exemplar generation strategy that constructs degradation-sensitive contrastive pairs, effectively modeling tone discrepancies across SDR styles. Building on these attribute-level priors, we introduce the Degradation-Domain Aware Controlled Mapping Network (DDACMNet), a lightweight, two-stage framework that performs adaptive hierarchical mapping guided by a control-aware normalization mechanism. DDACMNet dynamically modulates the mapping process via degradation-conditioned features, enabling robust adaptation across diverse degradation domains. Extensive experiments demonstrate that RealRep consistently outperforms state-of-the-art methods in both generalization and perceptually faithful HDR color gamut reconstruction.
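RealRep *learns* its luminance/chrominance disentanglement; as a fixed-transform analogue of the idea, a classical split using BT.709 luma weights looks like this (illustrative only, not the paper's method):

```python
import numpy as np

def split_luma_chroma(rgb: np.ndarray):
    """Fixed luma/chroma split (BT.709 luma weights) as a classical
    analogue of RealRep's learned luminance/chrominance disentanglement.
    rgb: (..., 3) array in [0, 1]."""
    w = np.array([0.2126, 0.7152, 0.0722])  # BT.709 luma coefficients
    luma = np.asarray(rgb @ w)
    chroma = rgb - luma[..., None]  # residual after removing luminance
    return luma, chroma
```

A neutral gray pixel has zero chroma under this split, which is the property a learned disentanglement generalizes to content-dependent statistics.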
[254] A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior
Jorge Quesada, Chen Zhou, Prithwijit Chowdhury, Mohammad Alotaibi, Ahmad Mustafa, Yusufjon Kumakov, Mohit Prabhushankar, Ghassan AlRegib
Main category: cs.CV
TL;DR: Large-scale benchmarking study on domain shift strategies for seismic fault delineation, analyzing 200+ model-dataset combinations to provide guidelines for handling distribution shifts in seismic interpretation workflows.
Details
Motivation: Lack of systematic understanding of model generalizability across diverse seismic data settings, with distributional shifts, limited fine-tuning strategies, and inconsistent evaluation protocols hindering reliable real-world deployment.
Method: Benchmark spanning 200+ combinations of model architectures, datasets (FaultSeg3D, CRACKS, Thebe) and training strategies, systematically assessing pretraining, fine-tuning, and joint training under varying domain shifts with novel fault characteristic descriptor analysis.
Result: Fine-tuning can cause catastrophic forgetting with disjoint datasets; larger models like Segformer are more robust; domain adaptation outperforms fine-tuning for large shifts but underperforms for similar domains; models absorb structural biases from training data.
Conclusion: Established robust experimental baseline revealing tradeoffs in fault delineation workflows, providing insights for building more generalizable and interpretable models in seismic interpretation.
Abstract: Machine learning has taken a critical role in seismic interpretation workflows, especially in fault delineation tasks. However, despite the recent proliferation of pretrained models and synthetic datasets, the field still lacks a systematic understanding of the generalizability limits of these models across seismic data representing diverse geologic, acquisition and processing settings. Distributional shifts between data sources, limitations in fine-tuning strategies and labeled data accessibility, and inconsistent evaluation protocols all remain major roadblocks to deploying reliable models in real-world exploration. In this paper, we present the first large-scale benchmarking study explicitly designed to provide guidelines for domain shift strategies in seismic interpretation. Our benchmark spans over 200 combinations of model architectures, datasets and training strategies, across three datasets (synthetic and real) including FaultSeg3D, CRACKS, and Thebe. We systematically assess pretraining, fine-tuning, and joint training under varying domain shifts. Our analysis shows that common fine-tuning practices can lead to catastrophic forgetting, especially when source and target datasets are disjoint, and that larger models such as Segformer are more robust than smaller architectures. We also find that domain adaptation methods outperform fine-tuning when shifts are large, yet underperform when domains are similar. Finally, we complement segmentation metrics with a novel analysis based on fault characteristic descriptors, revealing how models absorb structural biases from training datasets. Overall, we establish a robust experimental baseline that provides insights into tradeoffs in current fault delineation workflows and highlights directions for building more generalizable and interpretable models.
[255] FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment
Myunsoo Kim, Seong-Woong Shim, Byung-Jun Lee
Main category: cs.CV
TL;DR: FALCON is a learning-based mini-batch construction strategy that adaptively balances hard and false negatives in vision-language pretraining to mitigate the negative impact of false negatives on embedding quality.
Details
Motivation: False negatives in large-scale vision-language datasets introduce conflicting supervision signals that degrade the learned embedding space and reduce the effectiveness of hard negative sampling.
Method: FALCON uses a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement.
Result: FALCON significantly improves performance across three vision-language learning frameworks (ALBEF, BLIP-2, SigLIP-2) and a broad range of downstream tasks and evaluation settings.
Conclusion: FALCON demonstrates effectiveness and robustness in mitigating the impact of false negatives in vision-language pretraining through adaptive hard-negative sampling.
Abstract: False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across three vision-language learning frameworks (ALBEF, BLIP-2, SigLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.
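FALCON's scheduler is learned; a minimal fixed-hardness stand-in shows the mechanism it controls. The `hardness` knob and the rank-based selection rule are assumptions for illustration:

```python
import numpy as np

def pick_negative(anchor: np.ndarray, pool: np.ndarray, hardness: float) -> int:
    """Select one negative from `pool` (N, D) for `anchor` (D,).

    hardness in [0, 1]: 1.0 picks the most similar candidate (hardest, but
    most at risk of being a false negative), 0.0 the least similar. FALCON's
    scheduler would set this value adaptively per anchor; here it is fixed.
    """
    sims = pool @ anchor / (
        np.linalg.norm(pool, axis=1) * np.linalg.norm(anchor) + 1e-8)
    order = np.argsort(sims)  # ascending similarity
    idx = int(round(hardness * (len(pool) - 1)))
    return int(order[idx])
```

The trade-off the paper targets is visible here: pushing `hardness` toward 1.0 yields more informative negatives but increasingly risks sampling semantically matching (false negative) pairs.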
[256] Continuous Subspace Optimization for Continual Learning
Quan Cheng, Yuanyu Wan, Lingyu Wu, Chenping Hou, Lijun Zhang
Main category: cs.CV
TL;DR: CoSO proposes continuous subspace optimization for continual learning by fine-tuning models in dynamically determined sequential subspaces using SVD of gradients, with orthogonal constraints to prevent forgetting.
Details
Motivation: Existing continual learning methods using pre-trained models with low-rank adaptation constrain optimization to fixed subspaces, limiting learning capacity and performance.
Method: CoSO fine-tunes models in sequential subspaces determined by SVD of gradients, projects gradients onto these subspaces, maintains orthogonal constraints to historical subspaces, and uses task-specific components to update historical task subspaces.
Result: Extensive experiments show CoSO significantly outperforms state-of-the-art methods, especially in challenging scenarios with long task sequences.
Conclusion: CoSO effectively addresses catastrophic forgetting in continual learning by enabling memory-efficient optimization through continuous subspace adaptation while maintaining strong performance across multiple tasks.
Abstract: Continual learning aims to learn multiple tasks sequentially while preserving prior knowledge, but faces the challenge of catastrophic forgetting when adapting to new tasks. Recently, approaches leveraging pre-trained models have gained increasing popularity in mitigating this issue, due to the strong generalization ability of foundation models. To adjust pre-trained models for new tasks, existing methods usually employ low-rank adaptation, which restricts parameter updates to a fixed low-rank subspace. However, constraining the optimization space inherently compromises the model’s learning capacity, resulting in inferior performance. To address this limitation, we propose Continuous Subspace Optimization for Continual Learning (CoSO) to fine-tune the model in a series of subspaces rather than a single one. These sequential subspaces are dynamically determined through the singular value decomposition of the gradients. CoSO updates the model by projecting gradients onto these subspaces, ensuring memory-efficient optimization. To mitigate forgetting, the optimization subspace of each task is constrained to be orthogonal to the historical task subspace. During task learning, CoSO maintains a task-specific component that captures the critical update directions for the current task. Upon completing a task, this component is used to update the historical task subspace, laying the groundwork for subsequent learning. Extensive experiments on multiple datasets demonstrate that CoSO significantly outperforms state-of-the-art methods, especially in challenging scenarios with long task sequences.
[257] FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, Ning Guo
Main category: cs.CV
TL;DR: FSDrive introduces visual spatio-temporal chains-of-thought to bridge the perception-planning gap in Vision-Language-Action models for autonomous driving, enabling models to “think visually” by generating future frames as reasoning steps.
Details
Motivation: Current VLA models rely on textual CoT, which symbolically compresses visual information, creating a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained visual cues.
Method: FSDrive operates as a world model generating unified future frames with predicted backgrounds and physically-plausible priors (lane dividers, 3D object boxes), then uses the same VLA as an inverse-dynamics model for trajectory planning. Uses unified pre-training with visual tokens and joint optimization for VQA and future-frame prediction, with a progressive curriculum.
Result: Improves trajectory accuracy and reduces collisions on nuScenes and NAVSIM, achieves competitive FID for video generation with lightweight autoregressive model, and advances scene understanding on DriveLM.
Conclusion: Visual spatio-temporal CoT bridges the perception-planning gap, enabling safer, more anticipatory autonomous driving by capturing both spatial structure and temporal evolution in a single visual representation.
Abstract: Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to “think visually” using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on current observations and this visual CoT. We enable this with a unified pre-training paradigm that expands the model’s vocabulary with visual tokens and jointly optimizes for semantic understanding (VQA) and future-frame prediction. A progressive curriculum first generates structural priors to enforce physical laws before rendering the full scene. Evaluations on nuScenes and NAVSIM show FSDrive improves trajectory accuracy and reduces collisions, while also achieving competitive FID for video generation with a lightweight autoregressive model and advancing scene understanding on DriveLM. These results confirm that our visual spatio-temporal CoT bridges the perception-planning gap, enabling safer, more anticipatory autonomous driving. Code is available at https://github.com/MIV-XJTU/FSDrive.
[258] Towards Understanding the Mechanisms of Classifier-Free Guidance
Xiang Li, Rongrong Wang, Qing Qu
Main category: cs.CV
TL;DR: Classifier-free guidance (CFG) improves image generation by steering samples toward class means, amplifying class-specific features, and suppressing generic features from unconditional data.
Details
Motivation: To understand the underlying mechanisms of CFG, which powers state-of-the-art image generation systems but remains poorly understood.
Method: Analyze CFG in a simplified linear diffusion model and verify insights in real-world nonlinear diffusion models across various noise levels.
Result: Linear CFG reveals three key components: mean-shift steering samples toward class means, positive CPC amplifying class-specific features, and negative CPC suppressing generic features. These insights largely hold in nonlinear models.
Conclusion: The linear analysis provides valuable insights into CFG’s mechanisms in nonlinear diffusion models, despite some divergence at low noise levels.
Abstract: Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we begin by analyzing CFG in a simplified linear diffusion model, where we show its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on the CFG’s mechanism in the nonlinear regime.
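For reference, the CFG update the paper analyzes is the standard extrapolation between conditional and unconditional predictions; in the paper's linear setting, the difference direction decomposes into the mean-shift and contrastive principal component terms described above:

```python
import numpy as np

def cfg(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional
    prediction along the conditional-minus-unconditional direction.
    w = 1 recovers the conditional model; w > 1 strengthens guidance."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Both network passes are run per denoising step, and `w` is the single scalar that trades sample diversity for class fidelity.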
[259] DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
Yiheng Liu, Liao Qu, Huichao Zhang, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Xian Li, Shuai Wang, Daniel K. Du, Fangmin Chen, Zehuan Yuan, Xinglong Wu
Main category: cs.CV
TL;DR: DetailFlow is a coarse-to-fine 1D autoregressive image generation method that uses next-detail prediction with resolution-aware token sequences and parallel inference, achieving better quality and efficiency than previous methods.
Details
Motivation: To address the limitations of previous autoregressive image generation methods that require large token counts and suffer from slow inference speeds, by developing a more natural and efficient coarse-to-fine generation approach.
Method: Uses a novel next-detail prediction strategy with resolution-aware token sequences learned from progressively degraded images, enabling generation from global structure to fine details. Implements parallel inference with self-correction to accelerate generation and reduce sampling errors.
Result: Achieves 2.96 gFID on ImageNet 256x256 with only 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), which both require 680 tokens. Also runs inference nearly 2x faster than previous methods.
Conclusion: DetailFlow demonstrates superior generation quality and efficiency compared to state-of-the-art methods, with significantly reduced token counts and faster inference through its coarse-to-fine 1D autoregressive approach.
Abstract: This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure and incrementally refine details. This coarse-to-fine 1D token sequence aligns well with the autoregressive inference mechanism, providing a more natural and efficient way for the AR model to generate complex visual content. Our compact 1D AR model achieves high-quality image synthesis with significantly fewer tokens than previous approaches, i.e. VAR/VQGAN. We further propose a parallel inference mechanism with self-correction that accelerates generation speed by approximately 8x while reducing accumulation sampling error inherent in teacher-forcing supervision. On the ImageNet 256x256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), which both require 680 tokens in their AR models. Moreover, due to the significantly reduced token count and parallel inference mechanism, our method runs nearly 2x faster inference speed compared to VAR and FlexVAR. Extensive experimental results demonstrate DetailFlow’s superior generation quality and efficiency compared to existing state-of-the-art methods.
[260] A Unified and Fast-Sampling Diffusion Bridge Framework via Stochastic Optimal Control
Mokai Pan, Kaizhen Zhu, Yuexin Ma, Yanwei Fu, Jingyi Yu, Jingya Wang, Ye Shi
Main category: cs.CV
TL;DR: UniDB is a unified diffusion bridge framework using Stochastic Optimal Control that improves detail preservation and sampling speed compared to existing h-transform methods.
Details
Motivation: Existing diffusion bridge models using Doob's h-transform produce blurred image details and lack a theoretical foundation explaining these limitations.
Method: Reformulates diffusion bridges through SOC optimization, introduces a tunable terminal penalty coefficient, derives closed-form SDE solutions for fast sampling, and implements an SDE-Corrector mechanism.
Result: Achieves an optimal balance between control costs and terminal penalties, substantially improving detail preservation and output quality across diverse image restoration tasks.
Conclusion: UniDB bridges the gap between theoretical generality and practical efficiency, providing a unified framework that outperforms existing diffusion bridge approaches.
Abstract: Recent advances in diffusion bridge models leverage Doob’s $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches often produce blurred or excessively smoothed image details and lack a comprehensive theoretical foundation to explain these shortcomings. To address these limitations, we propose UniDB, a unified and fast-sampling framework for diffusion bridges based on Stochastic Optimal Control (SOC). We reformulate the problem through an SOC-based optimization, proving that existing diffusion bridges employing Doob’s $h$-transform constitute a special case, emerging when the terminal penalty coefficient in the SOC cost function tends to infinity. By incorporating a tunable terminal penalty coefficient, UniDB achieves an optimal balance between control costs and terminal penalties, substantially improving detail preservation and output quality. To avoid computationally expensive costs of iterative Euler sampling methods in UniDB, we design a training-free accelerated algorithm by deriving exact closed-form solutions for UniDB’s reverse-time SDE. It is further complemented by replacing conventional noise prediction with a more stable data prediction model, along with an SDE-Corrector mechanism that maintains perceptual quality for low-step regimes, effectively reducing error accumulation. Extensive experiments across diverse image restoration tasks validate the superiority and adaptability of the proposed framework, bridging the gap between theoretical generality and practical efficiency. Our code is available online https://github.com/2769433owo/UniDB-plusplus.
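The h-transform-as-a-limit claim can be sketched with a generic SOC objective (schematic notation, not the paper's exact formulation): minimize the control cost plus a terminal penalty subject to a controlled SDE,

```latex
\min_{u}\; \mathbb{E}\left[\int_{0}^{T} \tfrac{1}{2}\,\lVert u_t\rVert^{2}\,dt
  + \tfrac{\gamma}{2}\,\lVert x_T - y\rVert^{2}\right],
\qquad
dx_t = \bigl(f(x_t, t) + g(t)\,u_t\bigr)\,dt + g(t)\,dW_t .
```

As the terminal penalty coefficient $\gamma \to \infty$, the optimal control pins $x_T = y$, recovering the fixed-endpoint bridge of Doob's $h$-transform; a finite, tunable $\gamma$ trades endpoint fidelity against control cost, which is the knob UniDB exposes.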
[261] Visual Explanation via Similar Feature Activation for Metric Learning
Yi Liao, Ugochukwu Ejike Akpudo, Jue Zhang, Yongsheng Gao, Jun Zhou, Wenyi Zeng, Weichuan Zhang
Main category: cs.CV
TL;DR: SFAM is a novel visual explanation method for metric learning models that lack fully connected classifiers, using channel-wise importance scores from similarity measurements to create interpretable maps.
Details
Motivation: Existing class activation map methods (CAM, Grad-CAM, Relevance-CAM) cannot be applied to metric learning models because they require fully connected layers as classifiers, which metric learning models lack.
Method: Proposes Similar Feature Activation Map (SFAM) with channel-wise contribution importance score (CIS) derived from similarity between image embeddings, linearly combined with CNN feature maps to create explanation maps.
Result: Quantitative and qualitative experiments demonstrate that SFAM provides highly promising interpretable visual explanations for CNN models using Euclidean distance or cosine similarity metrics.
Conclusion: SFAM effectively addresses the limitation of traditional CAM methods by enabling visual explanations for metric learning models through similarity-based importance scoring.
Abstract: Visual explanation maps enhance the trustworthiness of decisions made by deep learning models and offer valuable guidance for developing new algorithms in image recognition tasks. Class activation maps (CAM) and their variants (e.g., Grad-CAM and Relevance-CAM) have been extensively employed to explore the interpretability of softmax-based convolutional neural networks, which require a fully connected layer as the classifier for decision-making. However, these methods cannot be directly applied to metric learning models, as such models lack a fully connected layer functioning as a classifier. To address this limitation, we propose a novel visual explanation method termed Similar Feature Activation Map (SFAM). This method introduces the channel-wise contribution importance score (CIS) to measure feature importance, derived from the similarity measurement between two image embeddings. The explanation map is constructed by linearly combining the proposed importance weights with the feature map from a CNN model. Quantitative and qualitative experiments show that SFAM provides highly promising interpretable visual explanations for CNN models using Euclidean distance or cosine similarity as the similarity metric.
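The channel-weighting idea can be sketched as follows. The paper defines its own CIS; the score used here, each channel's term in the cosine similarity between two global-average-pooled embeddings, is an illustrative assumption:

```python
import numpy as np

def sfam(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """Illustrative SFAM-style explanation map for image A.

    feat_a, feat_b: (C, H, W) CNN feature maps. The channel score is each
    channel's term in the cosine similarity between the two
    global-average-pooled embeddings (a stand-in for the paper's CIS).
    """
    emb_a = feat_a.mean(axis=(1, 2))
    emb_b = feat_b.mean(axis=(1, 2))
    na = np.linalg.norm(emb_a) + 1e-8
    nb = np.linalg.norm(emb_b) + 1e-8
    cis = (emb_a / na) * (emb_b / nb)          # per-channel similarity term
    heat = np.tensordot(cis, feat_a, axes=1)   # weighted sum over channels
    return np.maximum(heat, 0.0)               # keep positive contributions
```

Regions lighting up in the map are those whose channels contribute most to the similarity between the pair, which is exactly what a metric-learning explanation needs in place of class logits.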
[262] Jamais Vu: Exposing the Generalization Gap in Supervised Semantic Correspondence
Octave Mariotti, Zhipeng Du, Yash Bhalgat, Oisin Mac Aodha, Hakan Bilen
Main category: cs.CV
TL;DR: Proposes a method for dense semantic correspondence by lifting 2D keypoints to 3D canonical space using monocular depth, without requiring 3D supervision or camera annotations. Also introduces SPair-U dataset extension.
Details
Motivation: Supervised semantic correspondence methods are limited in generalization beyond sparsely annotated training keypoints, effectively acting as keypoint detectors rather than learning robust dense correspondences.
Method: Lifts 2D keypoints into canonical 3D space using monocular depth estimation, constructing a continuous canonical manifold that captures object geometry without explicit 3D supervision or camera annotations.
Result: Significantly outperforms supervised baselines on unseen keypoints and shows that unsupervised baselines outperform supervised counterparts when generalized across different datasets.
Conclusion: The proposed approach effectively learns robust dense correspondences by leveraging 3D geometry through canonical space lifting, demonstrating better generalization than supervised methods.
Abstract: Semantic correspondence (SC) aims to establish semantically meaningful matches across different instances of an object category. We illustrate how recent supervised SC methods remain limited in their ability to generalize beyond sparsely annotated training keypoints, effectively acting as keypoint detectors. To address this, we propose a novel approach for learning dense correspondences by lifting 2D keypoints into a canonical 3D space using monocular depth estimation. Our method constructs a continuous canonical manifold that captures object geometry without requiring explicit 3D supervision or camera annotations. Additionally, we introduce SPair-U, an extension of SPair-71k with novel keypoint annotations, to better assess generalization. Experiments not only demonstrate that our model significantly outperforms supervised baselines on unseen keypoints, highlighting its effectiveness in learning robust correspondences, but also show that unsupervised baselines outperform supervised counterparts when generalized across different datasets.
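The lifting step the abstract alludes to is, at its core, standard pinhole back-projection of a pixel with a monocular depth estimate. A minimal sketch (the paper's learned mapping into a *canonical* space is not reproduced here; the intrinsics `K` are assumed known purely for illustration):

```python
import numpy as np

def lift_keypoints(kps_2d, depth, K):
    """Lift 2D keypoints to 3D camera space using a monocular depth map.

    kps_2d: (N, 2) pixel coordinates (u, v)
    depth:  (H, W) per-pixel depth estimate
    K:      (3, 3) camera intrinsics (illustrative assumption)
    """
    K_inv = np.linalg.inv(K)
    pts = []
    for u, v in kps_2d:
        d = depth[int(v), int(u)]             # sample depth at the keypoint
        ray = K_inv @ np.array([u, v, 1.0])   # normalized camera ray
        pts.append(d * ray)                   # scale ray to metric depth
    return np.stack(pts)                      # (N, 3) camera-space points

K = np.array([[500.0, 0, 32], [0, 500.0, 32], [0, 0, 1]])
depth = np.full((64, 64), 2.0)
pts = lift_keypoints(np.array([[32.0, 32.0]]), depth, K)
print(pts)  # pixel at the principal point, depth 2 -> [0, 0, 2]
```

Once keypoints live in 3D, correspondences can be defined on the continuous manifold rather than only at the annotated pixels.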
[263] AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments
Zikang Leng, Megha Thukral, Yaqi Liu, Hrudhai Rajasekhar, Shruthi K. Hiremath, Jiaman He, Thomas Plötz
Main category: cs.CV
TL;DR: AgentSense uses LLM-guided embodied agents in simulated smart homes to generate synthetic sensor data for Human Activity Recognition, addressing data scarcity issues and improving model performance especially in low-resource settings.
Details
Motivation: Address the lack of large and diverse labeled datasets for HAR in smart homes, and overcome variations in home layouts, sensor configurations, and individual behaviors that limit system robustness and generalizability.
Method: Virtual data generation pipeline where LLM-guided embodied agents perform daily routines in simulated smart homes (VirtualHome simulator), with routines decomposed into fine-grained actions and recorded by virtual ambient sensors.
Result: Models pretrained on generated data consistently outperform baselines, especially in low-resource settings. Combining virtual data with small amounts of real data achieves performance comparable to training on full real-world datasets across five real HAR datasets.
Conclusion: LLM-guided embodied agents provide a scalable and cost-effective approach for sensor data generation in HAR, producing privacy-preserving synthetic data that reflects real-world diversity and enhances model performance.
Abstract: A major challenge in developing robust and generalizable Human Activity Recognition (HAR) systems for smart homes is the lack of large and diverse labeled datasets. Variations in home layouts, sensor configurations, and individual behaviors further exacerbate this issue. To address this, we leverage the idea of embodied AI agents – virtual agents that perceive and act within simulated environments guided by internal world models. We introduce AgentSense, a virtual data generation pipeline in which agents live out daily routines in simulated smart homes, with behavior guided by Large Language Models (LLMs). The LLM generates diverse synthetic personas and realistic routines grounded in the environment, which are then decomposed into fine-grained actions. These actions are executed in an extended version of the VirtualHome simulator, which we augment with virtual ambient sensors that record the agents’ activities. Our approach produces rich, privacy-preserving sensor data that reflects real-world diversity. We evaluate AgentSense on five real HAR datasets. Models pretrained on the generated data consistently outperform baselines, especially in low-resource settings. Furthermore, combining the generated virtual sensor data with a small amount of real data achieves performance comparable to training on full real-world datasets. These results highlight the potential of using LLM-guided embodied agents for scalable and cost-effective sensor data generation in HAR. Our code is publicly available at https://github.com/ZikangLeng/AgentSense.
[264] X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability
Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, Gim Hee Lee
Main category: cs.CV
TL;DR: X-Scene is a novel framework for large-scale driving scene generation that achieves geometric intricacy, appearance fidelity, and flexible controllability through multi-granular control and unified 3D semantic occupancy generation.
Details
Motivation: While diffusion models have advanced autonomous driving through realistic data synthesis, large-scale 3D scene generation requiring spatial coherence remains underexplored.Method: X-Scene uses a unified pipeline that sequentially generates 3D semantic occupancy and multi-view images/videos, with consistency-aware outpainting to extend local regions into large-scale scenes while maintaining spatial and visual coherence.
Result: Extensive experiments demonstrate that X-Scene substantially advances controllability and fidelity in large-scale scene generation, supporting diverse applications such as simulation and scene exploration.
Conclusion: X-Scene empowers data generation and simulation for autonomous driving by achieving superior geometric intricacy, appearance fidelity, and flexible controllability in large-scale scene generation.
Abstract: Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, large-scale 3D scene generation requiring spatial coherence remains underexplored. In this paper, we present X-Scene, a novel framework for large-scale driving scene generation that achieves geometric intricacy, appearance fidelity, and flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level layout conditioning driven by user input or text for detailed scene composition, and high-level semantic guidance informed by user intent and LLM-enriched prompts for efficient customization. To enhance geometric and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and corresponding multi-view images and videos, ensuring alignment and temporal consistency across modalities. We further extend local regions into large-scale scenes via consistency-aware outpainting, which extrapolates occupancy and images from previously generated areas to maintain spatial and visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as simulation and scene exploration. Extensive experiments demonstrate that X-Scene substantially advances controllability and fidelity in large-scale scene generation, empowering data generation and simulation for autonomous driving.
[265] PicoSAM2: Low-Latency Segmentation In-Sensor for Edge Vision Applications
Pietro Bonazzi, Nicola Farronato, Stefan Zihlmann, Haotong Qin, Michele Magno
Main category: cs.CV
TL;DR: PicoSAM2 is a lightweight (1.3M parameters) promptable segmentation model optimized for edge devices and in-sensor execution, achieving real-time performance on devices like Sony IMX500 with 14.3ms latency.
Details
Motivation: Enable real-time, on-device segmentation for latency-sensitive and privacy-aware applications like smart glasses and IoT devices, eliminating the need for cloud or host processing.Method: Builds on depthwise separable U-Net architecture with knowledge distillation from SAM2, using fixed-point prompt encoding and quantization (1.22MB model size).
Result: Achieves 51.9% mIoU on COCO and 44.9% mIoU on LVIS, with distillation boosting LVIS performance by +3.5% mIoU and +5.1% mAP. Runs at 14.3ms on IMX500 with 86 MACs/cycle.
Conclusion: Efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving vision without cloud processing while meeting memory and compute constraints for in-sensor deployment.
Abstract: Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications like smart glasses and IoT devices. We introduce PicoSAM2, a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution, including the Sony IMX500. It builds on a depthwise separable U-Net, with knowledge distillation and fixed-point prompt encoding to learn from the Segment Anything Model 2 (SAM2). On COCO and LVIS, it achieves 51.9% and 44.9% mIoU, respectively. The quantized model (1.22MB) runs at 14.3 ms on the IMX500, achieving 86 MACs/cycle, making it the only model meeting both memory and compute constraints for in-sensor deployment. Distillation boosts LVIS performance by +3.5% mIoU and +5.1% mAP. These results demonstrate that efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving vision without cloud or host processing.
[266] LidarPainter: One-Step Away From Any Lidar View To Novel Guidance
Yuzhou Ji, Ke Ma, Hong Cai, Anchun Zhang, Lizhuang Ma, Xin Tan
Main category: cs.CV
TL;DR: LidarPainter is a real-time diffusion model that reconstructs high-quality driving scenes from sparse LiDAR data and corrupted renderings, enabling lane shifts and stylized generation.
Details
Motivation: Current dynamic driving scene reconstruction methods suffer from degradation when views deviate from input trajectory, causing corrupted backgrounds and vehicle models. Existing approaches have limitations in consistency, deformation, and speed.
Method: Proposes LidarPainter, a one-step diffusion model that recovers consistent driving views from sparse LiDAR conditions and artifact-corrupted renderings in real-time.
Result: Outperforms state-of-the-art methods with 7x faster speed than StreetCrafter and only one-fifth GPU memory requirement. Supports stylized generation using text prompts like “foggy” and “night”.
Conclusion: LidarPainter enables high-fidelity lane shifts in driving scene reconstruction and allows diverse expansion of existing asset libraries through stylized generation.
Abstract: Dynamic driving scene reconstruction is of great importance in fields like digital twin system and autonomous driving simulation. However, unacceptable degradation occurs when the view deviates from the input trajectory, leading to corrupted background and vehicle models. To improve reconstruction quality on novel trajectory, existing methods are subject to various limitations including inconsistency, deformation, and time consumption. This paper proposes LidarPainter, a one-step diffusion model that recovers consistent driving views from sparse LiDAR condition and artifact-corrupted renderings in real-time, enabling high-fidelity lane shifts in driving scene reconstruction. Extensive experiments show that LidarPainter outperforms state-of-the-art methods in speed, quality, and resource efficiency, specifically 7× faster than StreetCrafter while requiring only one-fifth of the GPU memory. LidarPainter also supports stylized generation using text prompts such as “foggy” and “night”, allowing for a diverse expansion of the existing asset library.
[267] Imbalance in Balance: Online Concept Balancing in Generation Models
Yukai Shi, Jiarong Ou, Rui Chen, Haotian Yang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai
Main category: cs.CV
TL;DR: Proposes IMBA loss to improve stability in complex concept responses for visual generation, achieving competitive results with minimal code changes.
Details
Motivation: Addresses poor stability and error-prone responses when combining complex concepts in visual generation tasks, an under-explored problem area.
Method: Uses concept-wise equalization loss function (IMBA loss) that operates online without offline dataset processing, requiring minimal code modifications.
Result: Significantly enhances concept response capability on Inert-CompBench benchmark and two other public test sets, achieving competitive performance.
Conclusion: The IMBA loss method effectively improves complex concept handling in visual generation with minimal implementation overhead.
Abstract: In visual generation tasks, the responses and combinations of complex concepts often lack stability and are error-prone, which remains an under-explored area. In this paper, we attempt to explore the causal factors for poor concept responses through elaborately designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes. In our newly proposed complex concept benchmark Inert-CompBench and two other public test sets, our method significantly enhances the concept response capability of baseline models and yields highly competitive results with only a few lines of code. Code is released at https://github.com/KwaiVGI/IMBA-Loss.
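The abstract only names the loss, so the following is a hedged sketch of one plausible form of a concept-wise equalization loss maintained online: reweight each sample's loss by the inverse running frequency of its concept, updated batch-by-batch so no offline dataset pass is needed. The actual IMBA formulation may differ substantially; `OnlineConceptBalancedLoss` and its parameters are illustrative names.

```python
import numpy as np

class OnlineConceptBalancedLoss:
    """Hedged sketch of an online concept-wise equalization loss.

    Assumption (not from the paper): samples tagged with rare concepts are
    upweighted by inverse running frequency, with counts updated online.
    """
    def __init__(self, n_concepts, smoothing=1.0):
        self.counts = np.full(n_concepts, smoothing)  # running concept counts

    def __call__(self, per_sample_loss, concept_ids):
        # Update running statistics from this batch (the "online" part).
        for cid in concept_ids:
            self.counts[cid] += 1
        freq = self.counts / self.counts.sum()
        # Rare concepts get larger weights; normalize so weights average to 1.
        w = np.array([1.0 / freq[cid] for cid in concept_ids])
        w = w / w.mean()
        return float((w * per_sample_loss).mean())

loss_fn = OnlineConceptBalancedLoss(n_concepts=3)
l = loss_fn(np.array([1.0, 1.0, 1.0]), concept_ids=[0, 0, 1])
print(round(l, 4))  # equal losses, mean-one weights -> 1.0
```

Because the weights are renormalized per batch, the expected loss scale is unchanged while rare-concept samples contribute more gradient, which matches the "imbalance in balance" framing.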
[268] From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Grounded Open-vocabulary Situation Recognition
Chen Cai, Tianyi Liu, Jianjun Gao, Wenyang Liu, Kejun Wu, Ruoyu Wang, Yi Wang, Soo Chin Liew
Main category: cs.CV
TL;DR: Proposes MIPD framework to distill multimodal knowledge from large MLLMs to small GSR models for enhanced open-vocabulary grounded situation recognition, improving generalization to unseen and rare situations.
Details
Motivation: MLLMs have strong zero-shot abilities but are resource-intensive, while conventional GSR models lack generalization for unseen and rare situations. Need to transfer knowledge from large models to small specialized models.
Method: Multimodal Interactive Prompt Distillation (MIPD) framework with LLM-based Judgmental Rationales Generator, scene-aware and instance-perception prompts, and Negative-Guided Multimodal Prompting Alignment to distill enriched multimodal knowledge.
Result: Achieves superior performance on seen, rare, and unseen situations on Ov-SWiG dataset, and improved unseen detection on HICO-DET dataset.
Conclusion: MIPD effectively transfers knowledge from large MLLMs to small GSR models, enhancing generalization and zero-shot abilities for open-vocabulary grounded situation recognition.
Abstract: Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we exploit transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based Judgmental Rationales Generator (JRG) to construct positive and negative glimpse and gaze rationales enriched with contextual semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with visual information from the MLLM teacher via the Negative-Guided Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the aligned multimodal knowledge is distilled into the student Ov-GSR model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further demonstrate improved unseen detection on the HICO-DET dataset.
[269] SpatioTemporal Difference Network for Video Depth Super-Resolution
Zhengxue Wang, Yuan Wu, Xiang Li, Zhiqiang Yan, Jian Yang
Main category: cs.CV
TL;DR: STDNet addresses long-tailed distribution issues in video depth super-resolution using spatial and temporal difference mechanisms to improve reconstruction in non-smooth regions and temporal variation zones.
Details
Motivation: Video depth super-resolution suffers from pronounced long-tailed distributions, particularly in spatial non-smooth regions and temporal variation zones, which degrade reconstruction quality.
Method: Proposed SpatioTemporal Difference Network (STDNet) with two branches: spatial difference branch that aligns RGB features with spatial difference representations for depth calibration, and temporal difference branch that propagates temporal variation information from adjacent frames for motion compensation.
Result: Extensive experiments across multiple datasets show STDNet outperforms existing approaches in video depth super-resolution.
Conclusion: STDNet effectively mitigates long-tailed distribution issues in video depth super-resolution through its spatial and temporal difference mechanisms, achieving superior performance compared to existing methods.
Abstract: Depth super-resolution has achieved impressive performance, and the incorporation of multi-frame information further enhances reconstruction quality. Nevertheless, statistical analyses reveal that video depth super-resolution remains affected by pronounced long-tailed distributions, with the long-tailed effects primarily manifesting in spatial non-smooth regions and temporal variation zones. To address these challenges, we propose a novel SpatioTemporal Difference Network (STDNet) comprising two core branches: a spatial difference branch and a temporal difference branch. In the spatial difference branch, we introduce a spatial difference mechanism to mitigate the long-tailed issues in spatial non-smooth regions. This mechanism dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. In the temporal difference branch, we further design a temporal difference strategy that preferentially propagates temporal variation information from adjacent RGB and depth frames to the current depth frame, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. Extensive experimental results across multiple datasets demonstrate the effectiveness of our STDNet, outperforming existing approaches.
[270] Harnessing Textual Semantic Priors for Knowledge Transfer and Refinement in CLIP-Driven Continual Learning
Lingfeng He, De Cheng, Di Xu, Huaijie Wang, Nannan Wang
Main category: cs.CV
TL;DR: SECA is a continual learning framework that leverages CLIP’s textual semantic priors to address the stability-plasticity dilemma through semantic-guided knowledge transfer and visual prototype refinement.
Details
Motivation: Current CL approaches using CLIP don't fully exploit textual semantic priors, leading to interference from unrelated tasks and limited plasticity due to modality gap. Visual classifiers lack rich semantics while text-based ones have limited plasticity.
Method: Proposes SECA with two modules: SG-AKT for semantic-guided adaptive knowledge transfer using textual cues to assess relevance and aggregate knowledge, and SE-VPR for semantic-enhanced visual prototype refinement using inter-class semantic relations from text embeddings.
Result: Extensive experiments on multiple benchmarks validate the effectiveness of the approach in addressing continual learning challenges.
Conclusion: SECA successfully harnesses textual semantic priors to guide knowledge transfer and enhance visual classifiers, achieving better balance between stability and plasticity in continual learning.
Abstract: Continual learning (CL) aims to equip models with the ability to learn from a stream of tasks without forgetting previous knowledge. With the progress of vision-language models like Contrastive Language-Image Pre-training (CLIP), their promise for CL has attracted increasing attention due to their strong generalizability. However, the potential of rich textual semantic priors in CLIP in addressing the stability-plasticity dilemma remains underexplored. During backbone training, most approaches transfer past knowledge without considering semantic relevance, leading to interference from unrelated tasks that disrupt the balance between stability and plasticity. Besides, while text-based classifiers provide strong generalization, they suffer from limited plasticity due to the inherent modality gap in CLIP. Visual classifiers help bridge this gap, but their prototypes lack rich and precise semantics. To address these challenges, we propose Semantic-Enriched Continual Adaptation (SECA), a unified framework that harnesses the anti-forgetting and structured nature of textual priors to guide semantic-aware knowledge transfer in the backbone and reinforce the semantic structure of the visual classifier. Specifically, a Semantic-Guided Adaptive Knowledge Transfer (SG-AKT) module is proposed to assess new images’ relevance to diverse historical visual knowledge via textual cues, and aggregate relevant knowledge in an instance-adaptive manner as distillation signals. Moreover, a Semantic-Enhanced Visual Prototype Refinement (SE-VPR) module is introduced to refine visual prototypes using inter-class semantic relations captured in class-wise textual embeddings. Extensive experiments on multiple benchmarks validate the effectiveness of our approach.
[271] PMGS: Reconstruction of Projectile Motion Across Large Spatiotemporal Spans via 3D Gaussian Splatting
Yijun Xu, Jingrui Zhang, Yuhan Chen, Dingwen Wang, Lei Yu, Chu He
Main category: cs.CV
TL;DR: PMGS reconstructs projectile motion using 3D Gaussian Splatting with two-stage workflow: target modeling and motion recovery, incorporating physics constraints and adaptive optimization.
Details
Motivation: Existing methods struggle with complex rigid motion across large spatiotemporal spans and lack physical consistency for high-speed nonlinear motion.
Method: Two-stage approach: 1) Target modeling via dynamic scene decomposition and improved point density control; 2) Motion recovery using per-frame SE(3) poses with acceleration consistency constraint, dynamic simulated annealing, and Kalman fusion.
Result: Superior performance in reconstructing high-speed nonlinear rigid motion compared to mainstream dynamic methods.
Conclusion: PMGS effectively addresses projectile motion reconstruction with physics-aware constraints and adaptive optimization strategies.
Abstract: Modeling complex rigid motion across large spatiotemporal spans remains an unresolved challenge in dynamic reconstruction. Existing paradigms are mainly confined to short-term, small-scale deformation and offer limited consideration for physical consistency. This study proposes PMGS, focusing on reconstructing Projectile Motion via 3D Gaussian Splatting. The workflow comprises two stages: 1) Target Modeling: achieving object-centralized reconstruction through dynamic scene decomposition and an improved point density control; 2) Motion Recovery: restoring full motion sequences by learning per-frame SE(3) poses. We introduce an acceleration consistency constraint to bridge Newtonian mechanics and pose estimation, and design a dynamic simulated annealing strategy that adaptively schedules learning rates based on motion states. Furthermore, we devise a Kalman fusion scheme that counteracts error accumulation from multi-source observations to mitigate disturbances. Experiments show PMGS’s superior performance in reconstructing high-speed nonlinear rigid motion compared to mainstream dynamic methods.
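The acceleration consistency constraint couples the estimated trajectory to Newtonian mechanics: for a projectile, the second time derivative of position should equal gravity. A minimal sketch of that idea on translations only (the paper's full SE(3) formulation is richer):

```python
import numpy as np

def acceleration_consistency_loss(positions, dt, g=np.array([0.0, 0.0, -9.81])):
    """Hedged sketch: penalize deviation of the finite-difference acceleration
    of a per-frame trajectory from gravitational acceleration.

    positions: (T, 3) estimated object centers; dt: frame interval in seconds.
    """
    # Central second difference approximates acceleration at frames 1..T-2.
    acc = (positions[2:] - 2 * positions[1:-1] + positions[:-2]) / dt**2
    return float(np.mean(np.sum((acc - g) ** 2, axis=1)))

# A perfect parabola under gravity incurs (near-)zero loss.
t = np.arange(10) * 0.1
traj = np.stack([3.0 * t, np.zeros_like(t), 5.0 * t - 0.5 * 9.81 * t**2], axis=1)
print(acceleration_consistency_loss(traj, dt=0.1))  # ~0
```

Added to a photometric reconstruction objective, a term like this regularizes per-frame pose estimates toward physically plausible ballistic motion.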
[272] Towards Methane Detection Onboard Satellites
Maggie Chen, Hala Lambdouar, Luca Marini, Laura Martínez-Ferrer, Chris Bridges, Giacomo Acciarini
Main category: cs.CV
TL;DR: ML models trained on unorthorectified satellite data (UnorthoDOS) achieve comparable methane detection performance to orthorectified data, while bypassing preprocessing steps and enabling faster onboard detection.
Details
Motivation: Timely methane detection is critical for climate change mitigation, and onboard ML can enable rapid detection while reducing downlink costs for faster response systems.
Method: Introduces UnorthoDOS approach using unorthorectified hyperspectral data from EMIT sensor, bypassing conventional orthorectification preprocessing. Also trains models on orthorectified data for comparison.
Result: ML models trained on unorthorectified data achieve performance comparable to orthorectified data. Models trained on orthorectified data outperform matched filter baseline (mag1c).
Conclusion: The approach enables effective methane detection without preprocessing, with released datasets and code supporting further research in onboard ML for environmental monitoring.
Abstract: Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using *unorthorectified* data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS, along with code at https://github.com/spaceml-org/plume-hunter.
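For context on the baseline: a matched filter scores each hyperspectral pixel by how much of a known target (plume) signature it contains relative to the background statistics. The sketch below is the classical formulation alpha = (x - mu)^T S^{-1} t / (t^T S^{-1} t); mag1c itself is a more elaborate, albedo-corrected and sparsity-regularized variant.

```python
import numpy as np

def matched_filter(X, t):
    """Classical matched filter over hyperspectral pixels.

    X: (N, B) spectra (pixels x bands); t: (B,) target (plume) signature.
    Returns per-pixel scores; higher = more target-like.
    """
    mu = X.mean(axis=0)                       # background mean spectrum
    Xc = X - mu
    cov = Xc.T @ Xc / (X.shape[0] - 1)        # background covariance
    cov += 1e-6 * np.eye(cov.shape[0])        # regularize for inversion
    w = np.linalg.solve(cov, t)               # Sigma^{-1} t
    return (Xc @ w) / (t @ w)                 # alpha per pixel

rng = np.random.default_rng(0)
t_sig = np.ones(5)
X = 0.05 * rng.standard_normal((200, 5))      # synthetic background spectra
X[0] += t_sig                                 # inject a plume into pixel 0
scores = matched_filter(X, t_sig)
print(int(np.argmax(scores)))                 # the injected plume pixel
```

An ML detector trained end-to-end on (un)orthorectified radiances replaces this hand-crafted statistic, which is what the paper benchmarks against.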
[273] DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation
Boyi Li, Ce Zhang, Richard M. Timmerman, Wenxuan Bao
Main category: cs.CV
TL;DR: DGL-RSIS is a training-free unified framework that transfers vision language models to remote sensing segmentation by decoupling visual and textual representations and performing alignment at local semantic and global contextual levels.
Details
Motivation: Transferring vision language models from natural images to remote sensing segmentation is challenging due to domain gaps and task diversity, particularly in open-vocabulary semantic segmentation and referring expression segmentation.
Method: Uses Global-Local Decoupling to separate textual inputs into semantic and contextual tokens, Local Visual-Textual Alignment for context-aware feature extraction and prompt engineering, and Global Visual-Textual Alignment with Grad-CAM for contextual cue capture and mask selection.
Result: Outperforms existing training-free approaches on iSAID (OVSS) and RRSIS-D (RES) benchmarks, with ablation studies validating each module’s effectiveness.
Conclusion: First unified training-free framework for remote sensing image segmentation that effectively transfers VLM semantic capabilities without additional training.
Abstract: The emergence of vision language models (VLMs) bridges the gap between vision and language, enabling multimodal understanding beyond traditional visual-only deep learning models. However, transferring VLMs from the natural image domain to remote sensing (RS) segmentation remains challenging due to the large domain gap and the diversity of RS inputs across tasks, particularly in open-vocabulary semantic segmentation (OVSS) and referring expression segmentation (RES). Here, we propose a training-free unified framework, termed DGL-RSIS, which decouples visual and textual representations and performs visual-language alignment at both local semantic and global contextual levels. Specifically, a Global-Local Decoupling (GLD) module decomposes textual inputs into local semantic tokens and global contextual tokens, while image inputs are partitioned into class-agnostic mask proposals. Then, a Local Visual-Textual Alignment (LVTA) module adaptively extracts context-aware visual features from the mask proposals and enriches textual features through knowledge-guided prompt engineering, achieving OVSS from a local perspective. Furthermore, a Global Visual-Textual Alignment (GVTA) module employs a global-enhanced Grad-CAM mechanism to capture contextual cues for referring expressions, followed by a mask selection module that integrates pixel-level activations into mask-level segmentation outputs, thereby achieving RES from a global perspective. Experiments on the iSAID (OVSS) and RRSIS-D (RES) benchmarks demonstrate that DGL-RSIS outperforms existing training-free approaches. Ablation studies further validate the effectiveness of each module. To the best of our knowledge, this is the first unified training-free framework for RS image segmentation, which effectively transfers the semantic capability of VLMs trained on natural images to the RS domain without additional training.
[274] SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion
Zhiwen Yang, Yuxin Peng
Main category: cs.CV
TL;DR: SPHERE integrates voxel and Gaussian representations for 3D semantic scene completion, using semantic-guided Gaussian initialization and physical-aware harmonics enhancement to achieve realistic geometric details and semantic accuracy in autonomous driving scenes.
Details
Motivation: Existing voxel-based and plane-based SSC methods struggle with realistic geometric details, while neural reconstruction methods like NeRF and 3DGS have high computational costs and slow convergence in large-scale autonomous driving scenes, leading to poor semantic accuracy.
Method: Proposes SPHERE with two modules: 1) Semantic-guided Gaussian Initialization (SGI) uses dual-branch 3D scene representations to locate focal voxels as anchors for efficient Gaussian initialization; 2) Physical-aware Harmonics Enhancement (PHE) incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency through focal distribution alignment.
Result: Extensive experiments on SemanticKITTI and SSCBench-KITTI-360 benchmarks validate SPHERE’s effectiveness in generating SSC results with realistic details.
Conclusion: SPHERE successfully bridges the gap between traditional SSC methods and neural reconstruction approaches by jointly exploiting semantic and physical information through integrated voxel and Gaussian representations.
Abstract: Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems, assessing voxel-level geometry and semantics for holistic scene perception. While existing voxel-based and plane-based SSC methods have achieved considerable progress, they struggle to capture physical regularities for realistic geometric details. On the other hand, neural reconstruction methods like NeRF and 3DGS demonstrate superior physical awareness, but suffer from high computational cost and slow convergence when handling large-scale, complex autonomous driving scenes, leading to inferior semantic accuracy. To address these issues, we propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations for joint exploitation of semantic and physical information. First, the Semantic-guided Gaussian Initialization (SGI) module leverages dual-branch 3D scene representations to locate focal voxels as anchors to guide efficient Gaussian initialization. Then, the Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency through focal distribution alignment, generating SSC results with realistic details. Extensive experiments and analyses on the popular SemanticKITTI and SSCBench-KITTI-360 benchmarks validate the effectiveness of SPHERE. The code is available at https://github.com/PKU-ICST-MIPL/SPHERE_ACMMM2025.
[275] Association and Consolidation: Evolutionary Memory-Enhanced Incremental Multi-View Clustering
Zisen Kong, Bo Zhong, Pengyuan Li, Dongxia Chang, Yiming Wang, Yongyong Chen
Main category: cs.CV
TL;DR: EMIMC is an incremental multi-view clustering method that addresses the stability-plasticity dilemma through brain-inspired memory mechanisms including rapid association, cognitive forgetting, and knowledge consolidation modules.
Details
Motivation: To solve the stability-plasticity dilemma in view-incremental scenarios where models need both plasticity to adapt to new data and stability to maintain long-term knowledge.
Method: Proposes EMIMC with three modules: rapid association for connecting new and historical views, cognitive forgetting with decay mechanism for dynamic knowledge integration, and knowledge consolidation using temporal tensors to refine short-term into long-term memory.
Result: Extensive experiments show EMIMC achieves remarkable advantages over state-of-the-art methods with strong knowledge retention capabilities in growing view scenarios.
Conclusion: EMIMC effectively addresses the stability-plasticity dilemma in incremental multi-view clustering through brain-inspired memory regulation mechanisms, demonstrating superior performance compared to existing methods.
Abstract: Incremental multi-view clustering aims to achieve stable clustering results while addressing the stability-plasticity dilemma (SPD) in view-incremental scenarios. The core challenge is that the model must have enough plasticity to quickly adapt to new data, while maintaining sufficient stability to consolidate long-term knowledge. To address this challenge, we propose a novel Evolutionary Memory-Enhanced Incremental Multi-View Clustering (EMIMC), inspired by the memory regulation mechanisms of the human brain. Specifically, we design a rapid association module to establish connections between new and historical views, thereby ensuring the plasticity required for learning new knowledge. Second, a cognitive forgetting module with a decay mechanism is introduced, which dynamically adjusts the contribution of historical views to optimize knowledge integration. Finally, we propose a knowledge consolidation module to progressively refine short-term knowledge into stable long-term memory using temporal tensors, thereby ensuring model stability. By integrating these modules, EMIMC achieves strong knowledge retention capabilities in scenarios with growing views. Extensive experiments demonstrate that EMIMC exhibits remarkable advantages over existing state-of-the-art methods.
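The abstract describes the cognitive forgetting module's decay mechanism only at a high level. As an illustration, exponentially down-weighting older view embeddings before fusing them could look like the sketch below; the function name, decay schedule, and normalization are hypothetical, not the paper's formulation.

```python
def decay_weighted_fusion(view_embeddings, decay_rate=0.5):
    """Fuse per-view embeddings (oldest first), exponentially damping
    older views so new knowledge dominates while history still contributes.
    Hypothetical sketch of a cognitive-forgetting-style decay."""
    n = len(view_embeddings)
    # Newest view gets weight 1; each step back in time multiplies by decay_rate.
    weights = [decay_rate ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize to sum to 1
    dim = len(view_embeddings[0])
    return [sum(w * v[j] for w, v in zip(weights, view_embeddings))
            for j in range(dim)]

# Three toy 4-dimensional view embeddings, oldest -> newest.
views = [[float(k)] * 4 for k in (1, 2, 3)]
fused = decay_weighted_fusion(views)
```

With `decay_rate=0.5` and three views the normalized weights are 1/7, 2/7, and 4/7, so the newest view dominates the fused representation while older views still contribute.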
[276] RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images
Ke Li, Di Wang, Ting Wang, Fuyu Dong, Yiming Zhang, Luyao Zhang, Xiangyu Wang, Shaofeng Li, Quan Wang
Main category: cs.CV
TL;DR: RSVG-ZeroOV is a training-free framework for zero-shot open-vocabulary remote sensing visual grounding that leverages frozen foundation models without fine-tuning, achieving state-of-the-art performance.
Details
Motivation: Existing RSVG approaches are limited to closed-set vocabularies and require expensive datasets and fine-tuning, restricting their applicability in open-world scenarios.
Method: Three-stage framework: (1) Overview using VLM for cross-attention maps, (2) Focus using diffusion model for structural information, (3) Evolve with attention evolution module to purify segmentation masks.
Result: Extensive experiments show RSVG-ZeroOV consistently outperforms existing weakly-supervised and zero-shot methods without task-specific training.
Conclusion: RSVG-ZeroOV provides an efficient and scalable solution for open-vocabulary RSVG by exploring the potential of frozen foundation models in a training-free manner.
Abstract: Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts leverage generic foundation models for open-vocabulary RSVG, they rely heavily on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose RSVG-ZeroOV, a training-free framework that aims to explore the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention maps that capture semantic correlations between text queries and visual regions (although decoder-only VLMs use self-attention over all tokens, we refer to the image-text interaction part as cross-attention to distinguish it from pure visual self-attention). (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in structural and shape information of objects, which are often overlooked by the VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.
[277] Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance
Zixuan Wang, Yu Sun, Hongwei Wang, Baoyu Jing, Xiang Shen, Xin Dong, Zhuolin Hao, Hongyu Xiong, Yang Song
Main category: cs.CV
TL;DR: A reasoning-enhanced MLLM pretraining paradigm for unified inappropriate content detection in short videos, using caption, VQA, and CoT tasks to bridge distribution gaps and improve generalization.
Details
Motivation: Existing approaches train separate small models for each content issue type, requiring extensive human-labeled data and lacking cross-issue generalization capability.
Method: Proposed three targeted pretraining tasks: Caption (enhance video detail perception), VQA (deepen understanding of issue definitions), and Chain-of-Thought (enhance reasoning capability) to bridge distribution gaps.
Result: Pretraining approach significantly improves MLLM performance in both zero-shot and supervised fine-tuning settings, with strong generalization to emergent, previously unseen issues.
Conclusion: The reasoning-enhanced MLLM pretraining paradigm effectively addresses distribution gaps and complex issue definitions, enabling unified inappropriate content detection with strong generalization capabilities.
Abstract: Short video platforms are evolving rapidly, making the identification of inappropriate content increasingly critical. Existing approaches typically train separate and small classification models for each type of issue, which requires extensive human-labeled data and lacks cross-issue generalization. We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection. To address the distribution gap between short video content and the original pretraining data of MLLMs, as well as the complex issue definitions, we introduce three targeted pretraining tasks: (1) Caption, to enhance the MLLM’s perception of video details; (2) Visual Question Answering (VQA), to deepen the MLLM’s understanding of issue definitions and annotation guidelines; (3) Chain-of-Thought (CoT), to enhance the MLLM’s reasoning capability. Experimental results show that our pretraining approach significantly improves the MLLM’s performance in both zero-shot and supervised fine-tuning (SFT) settings. In addition, our pretrained model demonstrates strong generalization capabilities to emergent, previously unseen issues.
[278] UniMapGen: A Generative Framework for Large-Scale Map Construction from Multi-modal Data
Yujian Yuan, Changjie Wu, Xinyuan Chang, Sijin Wang, Hang Zhang, Shiyi Liang, Shuang Zeng, Mu Xu, Ning Guo
Main category: cs.CV
TL;DR: UniMapGen is a generative framework for large-scale map construction that uses discrete sequence representation and multi-modal inputs to overcome satellite data limitations and generate smooth, complete map vectors.
Details
Motivation: Traditional map construction methods are costly and inefficient, while existing satellite-based approaches suffer from data limitations (occlusions, outdatedness) and produce discontinuous roads requiring extensive post-processing.
Method: Represents lane lines as discrete sequences with iterative generation, supports multi-modal inputs (BEV, PV, text prompts), and uses state update strategy for global continuity and consistency.
Result: Achieves state-of-the-art performance on OpenSatMap dataset, can infer occluded roads and predict missing roads from dataset annotations.
Conclusion: UniMapGen provides an efficient generative framework that overcomes satellite data limitations and produces high-quality, complete map vectors for large-scale map construction.
Abstract: Large-scale map construction plays a vital role in applications like autonomous driving and navigation systems. Traditional large-scale map construction approaches mainly rely on costly and inefficient special data collection vehicles and labor-intensive annotation processes. While existing satellite-based methods have demonstrated promising potential in enhancing the efficiency and coverage of map construction, they exhibit two major limitations: (1) inherent drawbacks of satellite data (e.g., occlusions, outdatedness) and (2) inefficient vectorization from perception-based methods, resulting in discontinuous and rough roads that require extensive post-processing. This paper presents a novel generative framework, UniMapGen, for large-scale map construction, offering three key innovations: (1) representing lane lines as discrete sequences and establishing an iterative strategy to generate more complete and smooth map vectors than traditional perception-based methods; (2) proposing a flexible architecture that supports multi-modal inputs, enabling dynamic selection among BEV, PV, and text prompts, to overcome the drawbacks of satellite data; (3) developing a state update strategy for global continuity and consistency of the constructed large-scale map. UniMapGen achieves state-of-the-art performance on the OpenSatMap dataset. Furthermore, UniMapGen can infer occluded roads and predict roads missing from dataset annotations. Our code will be released.
[279] Multi Class Parkinson Disease Detection Based on Finger Tapping Using Attention Enhanced CNN BiLSTM
Abu Saleh Musa Miah, Najmul Hassan, Md Maruf Al Hossain, Yuichi Okuyama, Jungpil Shin
Main category: cs.CV
TL;DR: A multi-class Parkinson’s disease severity detection system using finger-tapping videos with attention-enhanced CNN-BiLSTM framework and handcrafted features.
Details
Motivation: Existing gesture-based PD recognition systems have unsatisfactory performance, and accurate PD severity evaluation is essential for clinical management and intervention development.
Method: Extracted temporal, frequency, and amplitude-based features from finger-tapping videos, then processed through attention-enhanced CNN-BiLSTM model with Conv1D MaxPooling, BiLSTM layers, attention mechanism, and softmax classifier for multi-class PD severity prediction.
Result: The model demonstrated strong performance in distinguishing between five PD severity classes, showing effectiveness of combining spatial-temporal representations with attention mechanisms.
Conclusion: This approach offers a promising non-invasive tool to assist clinicians in monitoring PD progression and making informed treatment decisions.
Abstract: Accurate evaluation of Parkinson’s disease (PD) severity is essential for effective clinical management and intervention development. Despite the proposal of several gesture-based PD recognition systems, including those using the finger-tapping task to assess Parkinsonian symptoms, their performance remains unsatisfactory. In this study, we present a multi-class PD detection system based on finger tapping, using an attention-enhanced CNN-BiLSTM framework combined with handcrafted feature extraction and deep learning techniques. We used an existing dataset of finger-tapping videos to extract temporal, frequency, and amplitude-based features from wrist and hand movements using established handcrafted formulas. These features were then processed through our attention-enhanced CNN-BiLSTM model, a hybrid deep learning framework that integrates CNN, BiLSTM, and attention mechanisms to classify PD severity into multiple levels. The features first pass through a Conv1D-MaxPooling block to capture local spatial dependencies, followed by a BiLSTM layer to model the temporal dynamics of the motion. An attention mechanism is applied to emphasize the most informative temporal features, which are then refined by a second BiLSTM layer. The CNN-derived features and attention-enhanced BiLSTM outputs are concatenated, followed by dense and dropout layers, before being passed through a softmax classifier to predict the PD severity level. Our model demonstrated strong performance in distinguishing between the five severity classes, showcasing the effectiveness of combining spatial-temporal representations with attention mechanisms for automated PD severity detection. This approach offers a promising non-invasive tool to assist clinicians in monitoring PD progression and making informed treatment decisions.
[280] AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset
Amjid Ali, Zulfiqar Ahmad Khan, Altaf Hussain, Muhammad Munsif, Adnan Hussain, Sung Wook Baik
Main category: cs.CV
TL;DR: AVAR-Net is a lightweight audio-visual anomaly recognition framework that combines audio and visual features using Wav2Vec2 and MobileViT, with early fusion and MTCN for temporal modeling, achieving state-of-the-art performance on new VAAR and existing XD-Violence datasets.
Details
Motivation: Current anomaly recognition methods rely only on visual data, making them unreliable under challenging conditions like occlusion, low light, and bad weather. The lack of large-scale synchronized audio-visual datasets has limited progress in multimodal anomaly detection.
Method: AVAR-Net uses Wav2Vec2 for audio feature extraction, MobileViT for visual feature extraction, early fusion to combine modalities, and Multi-Stage Temporal Convolutional Network (MTCN) to learn long-range temporal dependencies for spatiotemporal reasoning.
Result: AVAR-Net achieves 89.29% accuracy on the new VAAR dataset and 88.56% Average Precision on XD-Violence dataset, improving Average Precision by 2.8% over state-of-the-art methods.
Conclusion: The framework demonstrates effectiveness, efficiency, and generalization capability, and the VAAR dataset serves as a valuable benchmark for advancing multimodal anomaly recognition research.
Abstract: Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. However, most existing approaches rely solely on visual data, making them unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. Moreover, the absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition. To address these limitations, this study presents AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework designed for real-world environments. AVAR-Net consists of four main modules: an audio feature extractor, a video feature extractor, a fusion strategy, and a sequential pattern learning network that models cross-modal relationships for anomaly recognition. Specifically, the Wav2Vec2 model extracts robust temporal features from raw audio, while MobileViT captures both local and global visual representations from video frames. An early fusion mechanism combines these modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) learns long-range temporal dependencies within the fused representation, enabling robust spatiotemporal reasoning. A novel Visual-Audio Anomaly Recognition (VAAR) dataset is also introduced, serving as a medium-scale benchmark containing 3,000 real-world videos with synchronized audio across ten diverse anomaly classes. Experimental evaluations demonstrate that AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision on the XD-Violence dataset, improving Average Precision by 2.8% over existing state-of-the-art methods. These results highlight the effectiveness, efficiency, and generalization capability of the proposed framework, as well as the utility of VAAR as a benchmark for advancing multimodal anomaly recognition research.
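The early fusion step described above is feature-level concatenation of time-aligned audio and visual embeddings. A minimal sketch, where the feature dimensions are illustrative placeholders rather than AVAR-Net's actual sizes:

```python
def early_fusion(audio_feats, visual_feats):
    """Concatenate per-timestep audio and visual embeddings into one joint
    sequence; downstream temporal layers (e.g. an MTCN) then operate on the
    fused vectors. Sketch only, not the paper's implementation."""
    assert len(audio_feats) == len(visual_feats), "sequences must be time-aligned"
    return [a + v for a, v in zip(audio_feats, visual_feats)]  # per-timestep concat

T = 8  # number of synchronized timesteps (illustrative)
audio = [[0.0] * 768 for _ in range(T)]   # assumed audio embedding size
visual = [[0.0] * 512 for _ in range(T)]  # assumed visual embedding size
fused = early_fusion(audio, visual)
print(len(fused), len(fused[0]))  # 8 1280
```

The key design point of early fusion is that the temporal model sees both modalities jointly at every timestep, rather than merging separately learned per-modality decisions at the end (late fusion).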
[281] Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu
Main category: cs.CV
TL;DR: Honey-Data-15M is a new 15M QA pair SFT dataset with dual-level CoT enrichment that enables Bee-8B to achieve SOTA performance for fully open MLLMs, competitive with semi-open models.
Details
Motivation: Address the data quality gap between fully open and proprietary MLLMs, particularly the widespread noise and lack of complex reasoning data like Chain-of-Thought in existing open-source datasets.
Method: Created Honey-Data-15M dataset with multiple cleaning techniques and dual-level CoT enrichment, developed HoneyPipe data curation pipeline and DataStudio framework, and trained Bee-8B model on this dataset.
Result: Bee-8B establishes new SOTA for fully open MLLMs, achieving performance competitive with and sometimes surpassing recent semi-open models like InternVL3.5-8B.
Conclusion: A principled focus on data quality is key to developing fully open MLLMs that are highly competitive with semi-open counterparts, demonstrated through comprehensive resources including dataset, pipeline, and model.
Abstract: Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
[282] OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, Fei Huang
Main category: cs.CV
TL;DR: OSWorld-MCP is the first comprehensive benchmark for evaluating computer-use agents’ tool invocation, GUI operation, and decision-making abilities in real-world environments, addressing the gap in fair assessment of tool invocation capabilities.
Details
Motivation: Past evaluations focused mainly on GUI interaction skills while overlooking tool invocation abilities enabled by Model Context Protocol (MCP), creating unfair comparisons between agents with integrated tool invocation and those evaluated only on GUI interaction.
Method: Developed a novel automated code-generation pipeline to create tools and combined them with curated existing tools, followed by rigorous manual validation to produce 158 high-quality tools covering 7 common applications.
Result: MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3, from 40.1% to 43.3% for Claude 4 Sonnet), but even the strongest models have relatively low tool invocation rates (only 36.3%), indicating room for improvement.
Conclusion: OSWorld-MCP deepens understanding of multimodal agents by explicitly measuring MCP tool usage skills and sets a new standard for evaluating performance in complex, tool-assisted environments, highlighting the importance of assessing tool invocation capabilities.
Abstract: With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents’ tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates (only 36.3%), indicating room for improvement and highlighting the benchmark’s challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.
[283] PLUTO-4: Frontier Pathology Foundation Models
Harshith Padigela, Shima Nofallah, Atchuth Naveen Chilaparasetti, Ryun Han, Andrew Walker, Judy Shen, Chintan Shah, Blake Martin, Aashish Sood, Elliot Miller, Ben Glass, Andy Beck, Harsha Pokkalla, Syed Ashar Javed
Main category: cs.CV
TL;DR: PLUTO-4 introduces two advanced pathology foundation models - a compact PLUTO-4S for efficient deployment and a frontier-scale PLUTO-4G for maximum performance - achieving state-of-the-art results across various pathology tasks.
Details
Motivation: To build on the progress of foundation models in pathology by creating next-generation models that can handle diverse histopathology tasks with improved efficiency and performance.
Method: Developed two Vision Transformer architectures: PLUTO-4S (compact, multi-scale using FlexiViT with 2D-RoPE) and PLUTO-4G (frontier-scale, single patch size). Pretrained on 551,164 WSIs from 137,144 patients across 50+ institutions using DINOv2 self-supervised objective.
Result: State-of-the-art performance across tile classification, segmentation, and slide-level diagnosis. PLUTO-4S provides high-throughput deployment, PLUTO-4G achieves 11% improvement in dermatopathology diagnosis and establishes new performance frontiers.
Conclusion: PLUTO-4 has strong potential to transform real-world pathology applications as a backbone for both translational research and diagnostic use cases, with diverse improvements across multiple benchmarks.
Abstract: Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including tile classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4’s potential to transform real-world applications as a backbone for translational research and diagnostic use cases.
[284] I Detect What I Don’t Know: Incremental Anomaly Learning with Stochastic Weight Averaging-Gaussian for Oracle-Free Medical Imaging
Nand Kumar Yadav, Rodrigue Rizk, William CW Chen, KC Santosh
Main category: cs.CV
TL;DR: Unsupervised anomaly detection framework for medical imaging that incrementally expands normal samples without labels using lightweight adapters and uncertainty-gated admission.
Details
Motivation: Addresses the challenge of unknown anomaly detection in medical imaging where labeled anomalies are scarce and expert supervision is costly.
Method: Uses frozen pretrained vision backbone with tiny convolutional adapters, k-NN anomaly scoring with compact coreset, and dual probabilistic gates (z-score distance threshold + SWAG-based epistemic uncertainty) for safe sample admission.
Result: Substantial improvements across datasets: COVID-CXR (ROC-AUC: 0.9489→0.9982, F1: 0.8048→0.9746), Pneumonia CXR (ROC-AUC: 0.6834→0.8968), Brain MRI ND-5 (ROC-AUC: 0.6041→0.7269, PR-AUC: 0.7539→0.8211).
Conclusion: The framework effectively refines normality definition incrementally, demonstrating high efficiency and effectiveness for real-world medical imaging applications with scarce labels.
Abstract: Unknown anomaly detection in medical imaging remains a fundamental challenge due to the scarcity of labeled anomalies and the high cost of expert supervision. We introduce an unsupervised, oracle-free framework that incrementally expands a trusted set of normal samples without any anomaly labels. Starting from a small, verified seed of normal images, our method alternates between lightweight adapter updates and uncertainty-gated sample admission. A frozen pretrained vision backbone is augmented with tiny convolutional adapters, ensuring rapid domain adaptation with negligible computational overhead. Extracted embeddings are stored in a compact coreset, enabling efficient k-nearest neighbor (k-NN) anomaly scoring. Safety during incremental expansion is enforced by dual probabilistic gates: a sample is admitted into the normal memory only if its distance to the existing coreset lies within a calibrated z-score threshold, and its SWAG-based epistemic uncertainty remains below a seed-calibrated bound. This mechanism prevents drift and false inclusions without relying on generative reconstruction or replay buffers. Empirically, our system steadily refines the notion of normality as unlabeled data arrive, producing substantial gains over baselines. On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982 (F1: 0.8048 to 0.9746); on Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5, ROC-AUC increases from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211. These results highlight the effectiveness and efficiency of the proposed framework for real-world, label-scarce medical imaging applications.
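The k-NN coreset scoring and z-score admission gate described in the abstract can be sketched as follows. This is a simplified illustration: the SWAG epistemic-uncertainty gate is omitted, the synthetic embeddings stand in for real image features, and all names and thresholds are hypothetical rather than the authors' implementation.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_score(x, coreset, k=5):
    """Anomaly score = mean distance to the k nearest coreset embeddings."""
    return sum(sorted(dist(x, c) for c in coreset)[:k]) / k

def admit(x, coreset, seed_scores, z_max=2.0, k=5):
    """Distance gate: admit x into the normal memory only if its k-NN score
    lies within z_max standard deviations of the scores observed on the
    verified seed (the abstract's calibrated z-score threshold)."""
    mu = sum(seed_scores) / len(seed_scores)
    sigma = math.sqrt(sum((s - mu) ** 2 for s in seed_scores) / len(seed_scores)) + 1e-8
    z = (knn_score(x, coreset, k) - mu) / sigma
    return z <= z_max

# Synthetic 16-d "embeddings": 200 coreset normals + 50 held-out seed samples
# used to calibrate the score distribution.
random.seed(0)
data = [[random.gauss(0, 1) for _ in range(16)] for _ in range(250)]
coreset, seed_val = data[:200], data[200:]
seed_scores = [knn_score(v, coreset) for v in seed_val]

far_outlier = [10.0] * 16  # clearly outside the normal cluster
print(admit(coreset[0], coreset, seed_scores),
      admit(far_outlier, coreset, seed_scores))
```

Because admission is calibrated on seed statistics rather than a fixed distance cutoff, the gate adapts to the scale of the embedding space, which is what lets the normal memory grow safely without an oracle.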
[285] Token Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution
Shiyao Sang
Main category: cs.CV
TL;DR: The paper challenges the need for exhaustive scene modeling in autonomous driving, proposing instead that effective planning emerges from belief-intent co-evolution using minimal semantic tokens.
Details
Motivation: Inspired by cognitive science, the authors challenge the assumption that comprehensive scene reconstruction is necessary for high-performance autonomous driving, suggesting planning arises from belief-intent co-evolution rather than world modeling.
Method: Proposes using sparse intent tokens for planning, conditioning trajectory decoding on predicted future tokens, and avoiding explicit reconstruction loss to enable cognitive planning through belief-intent co-evolution.
Result: Achieves 0.487m ADE with sparse intent tokens alone, improves to 0.382m ADE (21.6% improvement) with future token conditioning, and demonstrates cognitive consistency with stable token dynamics emerging through training.
Conclusion: Establishes a new paradigm where intelligence lies in belief-intent token duality rather than pixel fidelity, bridging world models and VLA systems for foresightful agents that plan through imagination.
Abstract: We challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving (E2EAD). Inspired by cognitive science, we propose that effective planning arises not from reconstructing the world, but from the co-evolution of belief and intent within a minimal set of semantically rich tokens. Experiments on the nuPlan benchmark (720 scenarios, 11k+ samples) reveal three principles: (1) sparse intent tokens alone achieve 0.487 m ADE, demonstrating strong performance without future prediction; (2) conditioning trajectory decoding on predicted future tokens reduces ADE to 0.382 m, a 21.6% improvement, showing that performance emerges from cognitive planning; and (3) explicit reconstruction loss degrades performance, confirming that task-driven belief-intent co-evolution suffices under reliable perception inputs. Crucially, we observe the emergence of cognitive consistency: through prolonged training, the model spontaneously develops stable token dynamics that balance current perception (belief) and future goals (intent). This process, accompanied by “temporal fuzziness,” enables robustness under uncertainty and continuous self-optimization. Our work establishes a new paradigm: intelligence lies not in pixel fidelity, but in the tokenized duality of belief and intent. By reframing planning as understanding rather than reaction, TIWM bridges the gap between world models and VLA systems, paving the way for foresightful agents that plan through imagination. Note: Numerical comparisons with methods reporting results on nuScenes are indicative only, as nuPlan presents a more challenging planning-focused evaluation.
[286] Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation
Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng
Main category: cs.CV
TL;DR: Proposes FCCT framework for fine-grained causal analysis of LVLMs and IRI technique to enhance visual perception and reduce hallucination without training.
Details
Motivation: Existing mechanistic interpretability analyses of LVLMs are insufficiently comprehensive, lacking examination of visual/textual tokens, model components, and full layers, limiting insights for improving faithfulness and hallucination mitigation.
Method: Introduces Fine-grained Cross-modal Causal Tracing (FCCT) framework for systematic quantification of causal effects on visual object perception, covering all visual/textual tokens, MHSA, FFNs, and hidden states across all decoder layers.
Result: FCCT reveals that MHSAs of last token in middle layers aggregate cross-modal information, while FFNs show three-stage hierarchical progression for storing/transferring visual object representations. IRI achieves state-of-the-art performance across five benchmarks.
Conclusion: The proposed IRI technique, based on FCCT insights, effectively enhances visual perception and mitigates hallucination while preserving inference speed and foundational performance, demonstrating the value of comprehensive mechanistic analysis.
Abstract: Despite the remarkable advancements of Large Vision-Language Models (LVLMs), the mechanistic interpretability remains underexplored. Existing analyses are insufficiently comprehensive and lack examination covering visual and textual tokens, model components, and the full range of layers. This limitation restricts actionable insights to improve the faithfulness of model output and the development of downstream tasks, such as hallucination mitigation. To address this limitation, we introduce Fine-grained Cross-modal Causal Tracing (FCCT) framework, which systematically quantifies the causal effects on visual object perception. FCCT conducts fine-grained analysis covering the full range of visual and textual tokens, three core model components including multi-head self-attention (MHSA), feed-forward networks (FFNs), and hidden states, across all decoder layers. Our analysis is the first to demonstrate that MHSAs of the last token in middle layers play a critical role in aggregating cross-modal information, while FFNs exhibit a three-stage hierarchical progression for the storage and transfer of visual object representations. Building on these insights, we propose Intermediate Representation Injection (IRI), a training-free inference-time technique that reinforces visual object information flow by precisely intervening on cross-modal representations at specific components and layers, thereby enhancing perception and mitigating hallucination. Consistent improvements across five widely used benchmarks and LVLMs demonstrate IRI achieves state-of-the-art performance, while preserving inference speed and other foundational performance.
[287] An Artificial Intelligence-based Assistant for the Visually Impaired
Luis Marquez-Carpintero, Francisco Gomez-Donoso, Zuria Bauer, Bessie Dominguez-Dager, Alvaro Belmonte-Baeza, Mónica Pina-Navarro, Francisco Morillas-Espejo, Felix Escalona, Miguel Cazorla
Main category: cs.CV
TL;DR: AIDEN is an AI assistant for visually impaired individuals that uses machine learning to identify objects, read text, and answer questions about the environment.
Details
Motivation: Visually impaired individuals face challenges in object identification, text reading, and navigation that limit their independence, and existing solutions like Braille and screen readers are not always effective.
Method: Uses state-of-the-art machine learning algorithms including You Only Look Once architectures and a Large Language and Vision Assistant to identify objects, read text, and answer environmental questions.
Result: The system facilitates user interaction and access to textual/visual information, enhancing user autonomy and access to information.
Conclusion: AIDEN improves daily usability for visually impaired individuals, as supported by positive user feedback.
Abstract: This paper describes an artificial intelligence-based assistant application, AIDEN, developed during 2023 and 2024, aimed at improving the quality of life for visually impaired individuals. Visually impaired individuals face challenges in identifying objects, reading text, and navigating unfamiliar environments, which can limit their independence and reduce their quality of life. Although solutions such as Braille, audio books, and screen readers exist, they may not be effective in all situations. This application leverages state-of-the-art machine learning algorithms to identify and describe objects, read text, and answer questions about the environment. Specifically, it uses You Only Look Once architectures and a Large Language and Vision Assistant. The system incorporates several methods to facilitate the user’s interaction with the system and access to textual and visual information in an appropriate manner. AIDEN aims to enhance user autonomy and access to information, contributing to an improved perception of daily usability, as supported by user feedback.
[288] Robust Nearest Neighbour Retrieval Using Targeted Manifold Manipulation
B. Ghosh, H. Harikumar, S. Rana
Main category: cs.CV
TL;DR: TMM-NN is a novel nearest-neighbor retrieval method that uses targeted perturbation patches to define neighborhoods based on sample responsiveness rather than geometric distance, outperforming traditional metrics.
Details
Motivation: Current nearest-neighbor retrieval relies on hand-tuning feature layers and distance metrics, which may not capture semantic relationships effectively.
Method: Uses lightweight query-specific trigger patches added to query images, weakly backdooring the network to steer inputs with patches toward a dummy class. Similar images require only slight shifts to be classified as the dummy class.
Result: Outperforms traditional metrics under noise and across diverse tasks, with robustness analysis confirming effectiveness.
Conclusion: Trigger-based ranking retrieves more semantically related neighbors than traditional geometric distance approaches.
Abstract: Nearest-neighbour retrieval is central to classification and explainable-AI pipelines, but current practice relies on hand-tuning feature layers and distance metrics. We propose Targeted Manifold Manipulation-Nearest Neighbour (TMM-NN), which reconceptualises retrieval by assessing how readily each sample can be nudged into a designated region of the feature manifold; neighbourhoods are defined by a sample’s responsiveness to a targeted perturbation rather than absolute geometric distance. TMM-NN implements this through a lightweight, query-specific trigger patch. The patch is added to the query image, and the network is weakly “backdoored” so that any input with the patch is steered toward a dummy class. Images similar to the query need only a slight shift and are classified as the dummy class with high probability, while dissimilar ones are less affected. By ranking candidates by this confidence, TMM-NN retrieves the most semantically related neighbours. Robustness analysis and benchmark experiments confirm this trigger-based ranking outperforms traditional metrics under noise and across diverse tasks.
[289] Physics-Informed Deformable Gaussian Splatting: Towards Unified Constitutive Laws for Time-Evolving Material Field
Haoqin Hong, Ding Fan, Fubin Dou, Zhi-Li Zhou, Haoran Sun, Congcong Zhu, Jingrun Chen
Main category: cs.CV
TL;DR: PIDG integrates physics constraints into 3D Gaussian Splatting by treating Gaussian particles as Lagrangian material points with time-varying constitutive parameters, supervised by 2D optical flow.
Details
Motivation: Pure data-driven 3DGS struggles to capture physics-driven motion patterns in dynamic scenes, creating a need for physics-informed approaches.
Method: Uses static-dynamic decoupled 4D hash encoding, imposes Cauchy momentum residual as physics constraint, predicts particle velocity and stress via time-evolving material field, and supervises with Lagrangian particle flow matched to optical flow.
Result: Significant improvements in physical consistency and monocular dynamic reconstruction quality on custom physics-driven datasets and standard synthetic/real-world datasets.
Conclusion: PIDG successfully bridges the gap between data-driven 3DGS and physics-driven motion modeling, enabling more physically accurate dynamic scene reconstruction.
Abstract: Recently, 3D Gaussian Splatting (3DGS), an explicit scene representation technique, has shown significant promise for dynamic novel-view synthesis from monocular video input. However, purely data-driven 3DGS often struggles to capture the diverse physics-driven motion patterns in dynamic scenes. To fill this gap, we propose Physics-Informed Deformable Gaussian Splatting (PIDG), which treats each Gaussian particle as a Lagrangian material point with time-varying constitutive parameters and is supervised by 2D optical flow via motion projection. Specifically, we adopt static-dynamic decoupled 4D decomposed hash encoding to reconstruct geometry and motion efficiently. Subsequently, we impose the Cauchy momentum residual as a physics constraint, enabling independent prediction of each particle’s velocity and constitutive stress via a time-evolving material field. Finally, we further supervise data fitting by matching Lagrangian particle flow to camera-compensated optical flow, which accelerates convergence and improves generalization. Experiments on a custom physics-driven dataset as well as on standard synthetic and real-world datasets demonstrate significant gains in physical consistency and monocular dynamic reconstruction quality.
[290] Active Learning for Animal Re-Identification with Ambiguity-Aware Sampling
Depanshu Sani, Mehar Khurana, Saket Anand
Main category: cs.CV
TL;DR: The paper introduces a novel active learning framework for animal re-identification that uses complementary clustering to identify ambiguous regions in embedding space, requiring only 0.033% of annotations to outperform existing methods.
Details
Motivation: Animal Re-ID faces challenges due to subtle distinguishing patterns, new species handling, and open-set nature. Existing foundation models underperform in zero-shot scenarios, while unsupervised and active learning methods are inadequate for animal Re-ID, creating a need for efficient annotation approaches.
Method: Proposes an AL Re-ID framework that leverages complementary clustering methods to identify structurally ambiguous regions in embedding space, mining informative and representative sample pairs. Uses oracle feedback (must-link/cannot-link constraints) with a simple annotation interface and integrates with unsupervised methods through constrained clustering refinement.
Result: Achieves average improvements of 10.49%, 11.19% and 3.99% (mAP) on 13 wildlife datasets over foundational, unsupervised, and active learning methods respectively, using only 0.033% of annotations. Also shows 11.09%, 8.2% and 2.06% improvement for unknown individuals in open-world settings.
Conclusion: The proposed active learning framework effectively addresses animal Re-ID challenges by combining complementary clustering with constrained refinement, achieving state-of-the-art performance with minimal annotation effort across diverse wildlife datasets.
Abstract: Animal Re-ID has recently gained substantial attention in the AI research community due to its high impact on biodiversity monitoring and unique research challenges arising from environmental factors. The subtle distinguishing patterns, handling new species and the inherent open-set nature make the problem even harder. To address these complexities, foundation models trained on labeled, large-scale and multi-species animal Re-ID datasets have recently been introduced to enable zero-shot Re-ID. However, our benchmarking reveals significant gaps in their zero-shot Re-ID performance for both known and unknown species. While this highlights the need for collecting labeled data in new domains, exhaustive annotation for Re-ID is laborious and requires domain expertise. Our analyses show that existing unsupervised (USL) and AL Re-ID methods underperform for animal Re-ID. To address these limitations, we introduce a novel AL Re-ID framework that leverages complementary clustering methods to uncover and target structurally ambiguous regions in the embedding space for mining pairs of samples that are both informative and broadly representative. Oracle feedback on these pairs, in the form of must-link and cannot-link constraints, facilitates a simple annotation interface, which naturally integrates with existing USL methods through our proposed constrained clustering refinement algorithm. Through extensive experiments, we demonstrate that, by utilizing only 0.033% of all annotations, our approach consistently outperforms existing foundational, USL and AL baselines. Specifically, we report an average improvement of 10.49%, 11.19% and 3.99% (mAP) on 13 wildlife datasets over foundational, USL and AL methods, respectively, while attaining state-of-the-art performance on each dataset. Furthermore, we also show an improvement of 11.09%, 8.2% and 2.06% for unknown individuals in an open-world setting.
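The role of the oracle's pairwise feedback can be illustrated with a toy sketch: must-link pairs trigger cluster merges and cannot-link pairs veto them. The function and merge policy below are illustrative assumptions, not the paper's constrained clustering refinement algorithm:

```python
def refine_clusters(labels, must_link, cannot_link):
    """Greedily merge the clusters of each must-link pair, skipping any
    merge that would place a cannot-link pair in the same cluster.
    (Toy illustration only; the paper's algorithm is more involved.)"""
    labels = list(labels)
    for a, b in must_link:
        la, lb = labels[a], labels[b]
        if la == lb:
            continue  # already in the same cluster
        merged = [lb if lab == la else lab for lab in labels]
        if all(merged[i] != merged[j] for i, j in cannot_link):
            labels = merged
    return labels

# Samples 1 and 2 are the same individual, so their clusters merge;
# the cannot-link pair (0, 4) stays separated.
print(refine_clusters([0, 0, 1, 1, 2], [(1, 2)], [(0, 4)]))  # [1, 1, 1, 1, 2]
```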
[291] Relative Energy Learning for LiDAR Out-of-Distribution Detection
Zizhao Li, Zhengkang Xiang, Jiayang Ao, Joseph West, Kourosh Khoshelham
Main category: cs.CV
TL;DR: REL is a novel framework for OOD detection in LiDAR point clouds that uses relative energy scoring and synthetic outlier generation to improve reliability in autonomous driving.
Details
Motivation: Current LiDAR OOD methods fail to distinguish rare anomalies from common classes, causing high false-positive rates and overconfident errors in safety-critical autonomous driving scenarios.
Method: Proposes Relative Energy Learning (REL) using the energy gap between positive/negative logits as a scoring function, plus Point Raise, a lightweight data synthesis strategy that perturbs existing point clouds to generate auxiliary anomalies.
Result: Outperforms existing methods by large margin on SemanticKITTI and STU benchmarks, demonstrating improved robustness across various scenes.
Conclusion: Modeling relative energy with synthetic outliers provides principled and scalable solution for reliable OOD detection in open-world autonomous driving.
Abstract: Out-of-distribution (OOD) detection is a critical requirement for reliable autonomous driving, where safety depends on recognizing road obstacles and unexpected objects beyond the training distribution. Despite extensive research on OOD detection in 2D images, direct transfer to 3D LiDAR point clouds has been proven ineffective. Current LiDAR OOD methods struggle to distinguish rare anomalies from common classes, leading to high false-positive rates and overconfident errors in safety-critical settings. We propose Relative Energy Learning (REL), a simple yet effective framework for OOD detection in LiDAR point clouds. REL leverages the energy gap between positive (in-distribution) and negative logits as a relative scoring function, mitigating calibration issues in raw energy values and improving robustness across various scenes. To address the absence of OOD samples during training, we propose a lightweight data synthesis strategy called Point Raise, which perturbs existing point clouds to generate auxiliary anomalies without altering the inlier semantics. Evaluated on SemanticKITTI and the Spotting the Unexpected (STU) benchmark, REL consistently outperforms existing methods by a large margin. Our results highlight that modeling relative energy, combined with simple synthetic outliers, provides a principled and scalable solution for reliable OOD detection in open-world autonomous driving.
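The abstract does not spell out REL's scoring function. Under the standard free-energy formulation used in energy-based OOD detection, one plausible reading of the "energy gap" is a difference of energies; the sketch below is that generic reading, not the paper's exact score:

```python
import numpy as np

def energy(logits):
    """Free energy E(x) = -logsumexp(logits); lower values are typical of
    in-distribution samples in energy-based OOD detection."""
    logits = np.asarray(logits, dtype=float)
    m = logits.max()  # shift for numerical stability
    return -(m + np.log(np.exp(logits - m).sum()))

def relative_energy(pos_logits, neg_logits):
    """Illustrative relative score: the gap between the energy of the
    positive (in-distribution) logits and the negative logits. Higher
    values suggest OOD; REL's exact formulation may differ."""
    return energy(pos_logits) - energy(neg_logits)
```

A confidently classified inlier (one dominant positive logit) receives a much lower relative score than an ambiguous point with flat logits, which is the calibration benefit relative scoring aims at.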
[292] Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV
Wenbo Huang, Jinghui Zhang, Zhenghao Chen, Guang Li, Lei Zhang, Yang Cao, Fang Dong, Takahiro Ogawa, Miki Haseyama
Main category: cs.CV
TL;DR: Otter framework improves wide-angle few-shot action recognition by combining compound segmentation to highlight subjects and temporal reconstruction to model temporal relations, achieving state-of-the-art performance.
Details
Motivation: Wide-angle videos in FSAR are challenging due to background distractions and degraded temporal relations from similar backgrounds. Direct application of RWKV fails to highlight subjects effectively.
Method: Proposes Otter with Compound Segmentation Module (CSM) to segment and emphasize key patches, and Temporal Reconstruction Module (TRM) for bidirectional scanning to reconstruct temporal relations. Combines regular and temporal-enhanced prototypes.
Result: Achieves state-of-the-art performance on SSv2, Kinetics, UCF101, and HMDB51 benchmarks. Superior performance validated on VideoBadminton dataset for wide-angle FSAR.
Conclusion: Otter effectively addresses background distractions and temporal relation degradation in wide-angle FSAR through compound segmentation and temporal reconstruction, demonstrating significant performance improvements.
Abstract: Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of the background distractions. Receptance Weighted Key Value (RWKV), which learns interaction between various dimensions, shows promise for global modeling. While directly applying RWKV to wide-angle FSAR may fail to highlight subjects due to excessive background information. Additionally, temporal relation degraded by frames with similar backgrounds is difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module (CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting subjects against background information. The Temporal Reconstruction Module (TRM) is incorporated into the temporal-enhanced prototype construction to enable bidirectional scanning, allowing temporal relations to be better reconstructed. Furthermore, a regular prototype is combined with the temporal-enhanced prototype to simultaneously enhance subject emphasis and temporal modeling, improving wide-angle FSAR performance. Extensive experiments on benchmarks such as SSv2, Kinetics, UCF101, and HMDB51 demonstrate that Otter achieves state-of-the-art performance. Extra evaluation on the VideoBadminton dataset further validates the superiority of Otter in wide-angle FSAR.
[293] TiS-TSL: Image-Label Supervised Surgical Video Stereo Matching via Time-Switchable Teacher-Student Learning
Rui Wang, Ying Zhou, Hao Wang, Wenwei Zhang, Qiang Li, Zhiwei Wang
Main category: cs.CV
TL;DR: TiS-TSL is a time-switchable teacher-student learning framework for video stereo matching in minimally invasive surgery that addresses temporal inconsistency in disparity predictions through unified image and video prediction modes and bidirectional spatio-temporal consistency.
Details
Motivation: Stereo matching in minimally invasive surgery is essential for navigation and AR, but dense disparity supervision is impossible due to anatomical constraints, limiting annotations to sparse image-level labels. Existing teacher-student methods lack temporal consistency, causing unstable predictions and flickering artifacts across video frames.
Method: Proposes TiS-TSL with a unified model operating in three modes: Image-Prediction, Forward Video-Prediction, and Backward Video-Prediction. Uses two-stage learning: Image-to-Video stage transfers sparse image knowledge to temporal modeling, and Video-to-Video stage refines predictions using bidirectional spatio-temporal consistency to filter noisy labels and enforce temporal coherence.
Result: Experimental results on two public datasets show TiS-TSL exceeds other image-based state-of-the-art methods by improving TEPE and EPE by at least 2.11% and 4.54%, respectively.
Conclusion: TiS-TSL effectively addresses temporal inconsistency in surgical stereo matching through unified temporal modeling and bidirectional consistency estimation, achieving superior performance with minimal supervision.
Abstract: Stereo matching in minimally invasive surgery (MIS) is essential for next-generation navigation and augmented reality. Yet, dense disparity supervision is nearly impossible due to anatomical constraints, typically limiting annotations to only a few image-level labels acquired before the endoscope enters deep body cavities. Teacher-Student Learning (TSL) offers a promising solution by leveraging a teacher trained on sparse labels to generate pseudo labels and associated confidence maps from abundant unlabeled surgical videos. However, existing TSL methods are confined to image-level supervision, providing only spatial confidence and lacking temporal consistency estimation. This absence of spatio-temporal reliability results in unstable disparity predictions and severe flickering artifacts across video frames. To overcome these challenges, we propose TiS-TSL, a novel time-switchable teacher-student learning framework for video stereo matching under minimal supervision. At its core is a unified model that operates in three distinct modes: Image-Prediction (IP), Forward Video-Prediction (FVP), and Backward Video-Prediction (BVP), enabling flexible temporal modeling within a single architecture. Enabled by this unified model, TiS-TSL adopts a two-stage learning strategy. The Image-to-Video (I2V) stage transfers sparse image-level knowledge to initialize temporal modeling. The subsequent Video-to-Video (V2V) stage refines temporal disparity predictions by comparing forward and backward predictions to calculate bidirectional spatio-temporal consistency. This consistency identifies unreliable regions across frames, filters noisy video-level pseudo labels, and enforces temporal coherence. Experimental results on two public datasets demonstrate that TiS-TSL exceeds other image-based state-of-the-art methods by improving TEPE and EPE by at least 2.11% and 4.54%, respectively.
[294] A Two-Stage System for Layout-Controlled Image Generation using Large Language Models and Diffusion Models
Jan-Hendrik Koch, Jonas Krumme, Konrad Gadzicki
Main category: cs.CV
TL;DR: A two-stage system using LLM for layout generation and diffusion models for image synthesis achieves precise control over object counts and spatial arrangements in text-to-image generation.
Details
Motivation: Text-to-image diffusion models lack precise control over object counts and spatial arrangements, limiting their compositional capabilities.
Method: Two-stage approach: 1) LLM generates structured layout from object lists, 2) layout-conditioned diffusion model synthesizes images. Task decomposition improves object recall from 57.2% to 99.9%. Compared ControlNet vs GLIGEN conditioning methods.
Result: System successfully generates images with specified object counts and plausible spatial arrangements. ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity with reduced prompt-based controllability.
Conclusion: Decoupled approach using LLM for planning and diffusion models for synthesis is viable for compositionally controlled image generation, with trade-offs between layout fidelity and text-based controllability.
Abstract: Text-to-image diffusion models exhibit remarkable generative capabilities, but lack precise control over object counts and spatial arrangements. This work introduces a two-stage system to address these compositional limitations. The first stage employs a Large Language Model (LLM) to generate a structured layout from a list of objects. The second stage uses a layout-conditioned diffusion model to synthesize a photorealistic image adhering to this layout. We find that task decomposition is critical for LLM-based spatial planning; by simplifying the initial generation to core objects and completing the layout with rule-based insertion, we improve object recall from 57.2% to 99.9% for complex scenes. For image synthesis, we compare two leading conditioning methods: ControlNet and GLIGEN. After domain-specific finetuning on table-setting datasets, we identify a key trade-off: ControlNet preserves text-based stylistic control but suffers from object hallucination, while GLIGEN provides superior layout fidelity at the cost of reduced prompt-based controllability. Our end-to-end system successfully generates images with specified object counts and plausible spatial arrangements, demonstrating the viability of a decoupled approach for compositionally controlled synthesis.
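The interface between the two stages can be pictured as a structured layout handed from the LLM to the layout-conditioned diffusion model. The JSON schema below is a hypothetical example (the paper's actual format is not given), and the recall metric mirrors the object-recall figure quoted in the abstract:

```python
import json

# Hypothetical layout schema: each entry names an object and a normalized
# [x0, y0, x1, y1] box for the layout-conditioned diffusion stage.
layout_json = """
[{"object": "plate", "box": [0.35, 0.40, 0.65, 0.70]},
 {"object": "fork",  "box": [0.20, 0.40, 0.32, 0.70]}]
"""

def object_recall(layout, requested):
    """Fraction of requested objects that the generated layout placed."""
    placed = {item["object"] for item in layout}
    return sum(obj in placed for obj in requested) / len(requested)

layout = json.loads(layout_json)
print(round(object_recall(layout, ["plate", "fork", "knife"]), 3))  # 0.667
```

Rule-based insertion of the missing objects (here, the knife) is what lifts this recall toward the 99.9% the authors report for complex scenes.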
[295] Adaptive Morph-Patch Transformer for Aortic Vessel Segmentation
Zhenxi Zhang, Fuchen Zheng, Adnan Iltaf, Yifei Han, Zhenyu Cheng, Yue Du, Bin Li, Tianyong Liu, Shoujun Zhou
Main category: cs.CV
TL;DR: MPT introduces adaptive morphology-aware patches and semantic clustering attention for improved aortic vascular segmentation, achieving state-of-the-art performance on three datasets.
Details
Motivation: Traditional Transformer models use fixed rectangular patches that disrupt complex vascular structures, leading to suboptimal segmentation accuracy in aortic vascular segmentation.
Method: Proposes adaptive Morph Patch Transformer with morphology-aware patch partitioning and Semantic Clustering Attention to dynamically aggregate features from semantically similar patches.
Result: Achieves state-of-the-art performance on AVT, AortaSeg24 and TBAD datasets, with significant improvements in segmenting intricate vascular structures.
Conclusion: MPT effectively addresses the limitations of fixed-patch Transformers by preserving vascular structure integrity through adaptive morphology-aware patches and semantic feature aggregation.
Abstract: Accurate segmentation of aortic vascular structures is critical for diagnosing and treating cardiovascular diseases. Traditional Transformer-based models have shown promise in this domain by capturing long-range dependencies between vascular features. However, their reliance on fixed-size rectangular patches often influences the integrity of complex vascular structures, leading to suboptimal segmentation accuracy. To address this challenge, we propose the adaptive Morph Patch Transformer (MPT), a novel architecture specifically designed for aortic vascular segmentation. Specifically, MPT introduces an adaptive patch partitioning strategy that dynamically generates morphology-aware patches aligned with complex vascular structures. This strategy can preserve semantic integrity of complex vascular structures within individual patches. Moreover, a Semantic Clustering Attention (SCA) method is proposed to dynamically aggregate features from various patches with similar semantic characteristics. This method enhances the model’s capability to segment vessels of varying sizes, preserving the integrity of vascular structures. Extensive experiments on three open-source datasets (AVT, AortaSeg24 and TBAD) demonstrate that MPT achieves state-of-the-art performance, with improvements in segmenting intricate vascular structures.
[296] Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization
Binyan Xu, Fan Yang, Di Tang, Xilin Dai, Kehuan Zhang
Main category: cs.CV
TL;DR: GCB introduces a new clean-image backdoor attack framework using conditional InfoGAN to find natural image features as stealthy triggers, enabling attacks with minimal clean accuracy drop (<1%) across multiple datasets, architectures, and tasks.
Details
Motivation: Existing clean-image backdoor attacks require high poison rates that cause noticeable drops in clean accuracy, compromising stealthiness. The goal is to develop more stealthy attacks that minimize accuracy degradation.
Method: Uses conditional InfoGAN to identify naturally occurring image features that serve as potent and stealthy triggers. Ensures triggers are easily separable from benign task-related features to enable learning from extremely small poisoned datasets.
Result: Achieves clean accuracy drop of less than 1% while successfully attacking six datasets, five architectures, and four tasks (including first demonstration in regression and segmentation). Shows resilience against most existing backdoor defenses.
Conclusion: GCB presents a highly effective and stealthy clean-image backdoor attack paradigm that minimizes accuracy degradation while maintaining attack effectiveness across diverse scenarios, posing significant security threats.
Abstract: Clean-image backdoor attacks, which use only label manipulation in training datasets to compromise deep neural networks, pose a significant threat to security-critical applications. A critical flaw in existing methods is that the poison rate required for a successful attack induces a proportional, and thus noticeable, drop in Clean Accuracy (CA), undermining their stealthiness. This paper presents a new paradigm for clean-image attacks that minimizes this accuracy degradation by optimizing the trigger itself. We introduce Generative Clean-Image Backdoors (GCB), a framework that uses a conditional InfoGAN to identify naturally occurring image features that can serve as potent and stealthy triggers. By ensuring these triggers are easily separable from benign task-related features, GCB enables a victim model to learn the backdoor from an extremely small set of poisoned examples, resulting in a CA drop of less than 1%. Our experiments demonstrate GCB’s remarkable versatility, successfully adapting to six datasets, five architectures, and four tasks, including the first demonstration of clean-image backdoors in regression and segmentation. GCB also exhibits resilience against most of the existing backdoor defenses.
cs.AI
[297] Analysing Environmental Efficiency in AI for X-Ray Diagnosis
Liam Kearns
Main category: cs.AI
TL;DR: Comparison of LLMs vs small discriminative models for COVID-19 detection in chest X-rays, showing smaller models reduce carbon footprint but may have bias issues, while LLMs perform poorly when restricted to probabilistic outputs.
Details
Motivation: To compare the accuracy and environmental impact of large language models versus smaller custom models for medical diagnosis tasks, specifically COVID-19 detection in chest X-rays.
Method: Integrated 14 different model configurations in a Mendix application, using both LLMs (ChatGPT, Claude) and small discriminative models, with discriminative models providing knowledge bases for LLMs to improve accuracy.
Result: Smaller models reduced carbon footprint but showed bias towards positive diagnosis with low confidence probabilities. LLMs performed poorly when restricted to probabilistic outputs. Covid-Net achieved highest accuracy (95.5%) with 99.9% lower carbon footprint than GPT-4.5-Preview.
Conclusion: Small discriminative models are more efficient for classification tasks than LLMs, highlighting environmental risks of using generative AI tools for such applications.
Abstract: The integration of AI tools into medical applications has aimed to improve the efficiency of diagnosis. The emergence of large language models (LLMs), such as ChatGPT and Claude, has expanded this integration even further. Because of LLM versatility and ease of use through APIs, these larger models are often utilised even though smaller, custom models can be used instead. In this paper, LLMs and small discriminative models are integrated into a Mendix application to detect Covid-19 in chest X-rays. These discriminative models are also used to provide knowledge bases for LLMs to improve accuracy. This provides a benchmark study of 14 different model configurations for comparison of accuracy and environmental impact. The findings indicated that while smaller models reduced the carbon footprint of the application, the output was biased towards a positive diagnosis and the output probabilities were lacking confidence. Meanwhile, restricting LLMs to only give probabilistic output caused poor performance in both accuracy and carbon footprint, demonstrating the risk of using LLMs as a universal AI solution. While using the smaller LLM GPT-4.1-Nano reduced the carbon footprint by 94.2% compared to the larger models, this was still disproportionate to the discriminative models; the most efficient solution was the Covid-Net model. Although it had a larger carbon footprint than other small models, its carbon footprint was 99.9% less than when using GPT-4.5-Preview, whilst achieving an accuracy of 95.5%, the highest of all models examined. This paper contributes to knowledge by comparing generative and discriminative models in Covid-19 detection as well as highlighting the environmental risk of using generative tools for classification tasks.
[298] Agentic Educational Content Generation for African Languages on Edge Devices
Ravi Gupta, Guneet Bhatia
Main category: cs.AI
TL;DR: An autonomous agent framework for decentralized, culturally adaptive educational content generation on edge devices in Sub-Saharan Africa, achieving high performance and quality metrics.
Details
Motivation: Address educational inequity in Sub-Saharan Africa by providing accessible, localized, and sustainable AI-driven education in resource-constrained environments.
Method: Uses four specialized autonomous agents in a decentralized framework to generate contextually appropriate educational content on edge devices like Raspberry Pi 4B and NVIDIA Jetson Nano.
Result: Achieved 129 ms TTFT and 45.2 tokens/sec on Jetson Nano (8.4W), 326 ms TTFT and 15.9 tokens/sec on Raspberry Pi 4B (5.8W), with high multilingual quality (BLEU 0.688), cultural relevance (4.4/5), and fluency (4.2/5).
Conclusion: Establishes practical foundation for accessible, localized education in resource-constrained environments, contributing to UN SDGs 4, 9, and 10 through partnerships with community organizations.
Abstract: Addressing educational inequity in Sub-Saharan Africa, this research presents an autonomous agent-orchestrated framework for decentralized, culturally adaptive educational content generation on edge devices. The system leverages four specialized agents that work together to generate contextually appropriate educational content. Experimental validation on platforms including Raspberry Pi 4B and NVIDIA Jetson Nano demonstrates significant performance achievements. InkubaLM on Jetson Nano achieved a Time-To-First-Token (TTFT) of 129 ms, an average inter-token latency of 33 ms, and a throughput of 45.2 tokens per second while consuming 8.4 W. On Raspberry Pi 4B, InkubaLM also led with 326 ms TTFT and 15.9 tokens per second at 5.8 W power consumption. The framework consistently delivered high multilingual quality, averaging a BLEU score of 0.688, cultural relevance of 4.4/5, and fluency of 4.2/5 across tested African languages. Through potential partnerships with active community organizations including African Youth & Community Organization (AYCO) and Florida Africa Foundation, this research aims to establish a practical foundation for accessible, localized, and sustainable AI-driven education in resource-constrained environments. Keeping focus on long-term viability and cultural appropriateness, it contributes to United Nations SDGs 4, 9, and 10. Index Terms - Multi-Agent Systems, Edge AI Computing, Educational Technology, African Languages, Rural Education, Sustainable Development, UN SDG.
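The latency figures above (TTFT and tokens per second) follow the usual streaming-inference definitions. As a minimal sketch of how such metrics can be measured, with an illustrative `measure_generation` helper that is not taken from the paper:

```python
import time

def measure_generation(token_stream):
    """Measure time-to-first-token (TTFT) and throughput for any
    iterable that yields generated tokens (e.g. a streaming LLM API)."""
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # first token arrived
        count += 1
    end = time.perf_counter()
    ttft_ms = None if first_token_time is None else (first_token_time - start) * 1000
    tokens_per_sec = count / (end - start) if count else 0.0
    return ttft_ms, tokens_per_sec
```

Average inter-token latency, also reported above, can be obtained the same way by timestamping every token and averaging the gaps between consecutive arrivals.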
[299] Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning
Qianxi He, Qingyu Ren, Shanzhe Lei, Xuhong Wang, Yingchun Wang
Main category: cs.AI
TL;DR: A confidence-based reward model for STEM reasoning that penalizes low-confidence correct answers to improve reasoning quality in smaller LLMs during RL training.
Details
Motivation: Traditional rule-based RL rewards often lead to poor reasoning chains and inconsistencies in smaller models, limiting their potential for direct RL training.
Method: Proposed a confidence-based reward model that penalizes both incorrect answers and low-confidence correct responses, validated through static evaluations, Best-of-N inference tests, and PPO-based RL training.
Result: Outperforms state-of-the-art open-source reward models across diverse STEM benchmarks.
Conclusion: The confidence-based approach effectively enhances reasoning robustness and logical consistency in smaller-scale LLMs during reinforcement learning training.
Abstract: Recent advancements in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, numerous technical reports indicate that purely rule-based reward RL frequently results in poor-quality reasoning chains or inconsistencies between reasoning processes and final answers, particularly when the base model is of smaller scale. During the RL exploration process, models might employ low-quality reasoning chains due to a lack of knowledge, occasionally producing correct answers by chance and receiving rewards from rule-based judges. This constrains the potential for resource-limited organizations to conduct direct reinforcement learning training on smaller-scale models. We propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses, thereby promoting more robust and logically consistent reasoning. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training. Our method outperforms several state-of-the-art open-source reward models across diverse STEM benchmarks. We release our code and model at https://github.com/qianxiHe147/C2RM.
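The core reward-shaping idea, penalizing wrong answers and discounting correct answers produced with low confidence, can be sketched as follows (the threshold and reward values are illustrative choices, not the paper's):

```python
def confidence_aware_reward(is_correct: bool, confidence: float,
                            low_conf_threshold: float = 0.5) -> float:
    """Reward that penalizes wrong answers and discounts correct answers
    the model reached with low confidence (values are illustrative)."""
    if not is_correct:
        return -1.0
    if confidence < low_conf_threshold:
        # Correct but likely a lucky guess: give a reduced reward
        # proportional to the model's confidence.
        return confidence
    return 1.0
```

Under this shaping, a correct answer backed by a shaky reasoning chain earns less than a confidently justified one, which is what discourages reward hacking via lucky guesses during RL exploration.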
[300] Procedural Knowledge Improves Agentic LLM Workflows
Vincent Hsiao, Mark Roberts, Leslie Smith
Main category: cs.AI
TL;DR: Using hierarchical task networks (HTNs) as procedural knowledge dramatically improves LLM performance on agentic tasks, enabling smaller models (20b/70b) to outperform larger ones (120b).
Details
Motivation: LLMs struggle with agentic tasks without substantial tool support, prompt engineering, or fine-tuning. Procedural knowledge can improve planning efficiency but hasn't been well-evaluated for LLM agentic tasks.
Method: Formalized and implemented an agentic LLM workflow using hierarchical task networks (HTNs) as procedural knowledge, testing both hand-coded and LLM-created HTNs.
Result: Hand-coded HTNs dramatically improved LLM performance, boosting smaller models (20b/70b) to outperform larger baseline (120b). LLM-created HTNs also improved performance but less effectively.
Conclusion: Leveraging expertise (human, document, or LLM) to curate procedural knowledge will become an important tool for improving LLM workflows.
Abstract: Large language models (LLMs) often struggle when performing agentic tasks without substantial tool support, prompt engineering, or fine-tuning. Despite research showing that domain-dependent, procedural knowledge can dramatically increase planning efficiency, little work evaluates its potential for improving LLM performance on agentic tasks that may require implicit planning. We formalize, implement, and evaluate an agentic LLM workflow that leverages procedural knowledge in the form of a hierarchical task network (HTN). Empirical results of our implementation show that hand-coded HTNs can dramatically improve LLM performance on agentic tasks, and using HTNs can boost a 20b or 70b parameter LLM to outperform a much larger 120b parameter LLM baseline. Furthermore, LLM-created HTNs improve overall performance, though less so. The results suggest that leveraging expertise, whether from humans, documents, or LLMs, to curate procedural knowledge will become another important tool for improving LLM workflows.
[301] Think Before You Retrieve: Learning Test-Time Adaptive Search with Small Language Models
Supriti Vijay, Aman Priyanshu, Anu Vellore, Baturay Saglam, Amin Karbasi
Main category: cs.AI
TL;DR: Orion is a training framework that enables compact models (350M-1.2B parameters) to perform iterative retrieval through learned search strategies, outperforming much larger retrievers on multiple benchmarks.
Details
Motivation: Current approaches fall short: neural retrievers lack reasoning, LLMs are too expensive, and query rewriting limits improvement to static transformations. Existing methods fail to capture iterative dynamics of exploration, feedback, and revision.
Method: Combines: (1) synthetic trajectory generation and supervised fine-tuning for diverse exploration patterns, (2) reinforcement learning that rewards effective query refinement and backtracking, (3) inference-time beam search algorithms exploiting learned self-reflection capabilities.
Result: 1.2B model achieves 77.6% success on SciFact (vs. 72.6% prior), 25.2% on BRIGHT (vs. 22.1%), 63.2% on NFCorpus (vs. 57.8%), competitive on FEVER, HotpotQA, MSMarco. Outperforms retrievers 200-400x larger on five of six benchmarks.
Conclusion: Retrieval performance can emerge from learned strategies, not just model scale, when models are trained to search, reflect, and revise.
Abstract: Effective information retrieval requires reasoning over partial evidence and refining strategies as information emerges. Yet current approaches fall short: neural retrievers lack reasoning capabilities, large language models (LLMs) provide semantic depth but at prohibitive cost, and query rewriting or decomposition limits improvement to static transformations. As a result, existing methods fail to capture the iterative dynamics of exploration, feedback, and revision that complex user queries demand. We introduce Orion, a training framework that enables compact models (350M-1.2B parameters) to perform iterative retrieval through learned search strategies. Orion combines: (1) synthetic trajectory generation and supervised fine-tuning to encourage diverse exploration patterns in models, (2) reinforcement learning (RL) that rewards effective query refinement and backtracking behaviors, and (3) inference-time beam search algorithms that exploit the self-reflection capabilities learned during RL. Despite using only 3% of the training data available, our 1.2B model achieves 77.6% success on SciFact (vs. 72.6% for prior retrievers), 25.2% on BRIGHT (vs. 22.1%), 63.2% on NFCorpus (vs. 57.8%), and remains competitive on FEVER, HotpotQA, and MSMarco. It outperforms retrievers up to 200-400x larger on five of six benchmarks. These findings suggest that retrieval performance can emerge from learned strategies, not just model scale, when models are trained to search, reflect, and revise.
[302] Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury
Main category: cs.AI
TL;DR: GSW is a neuro-inspired generative memory framework that enables LLMs to reason over evolving episodic events by building structured, interpretable representations with temporal, spatial, and logical coherence.
Details
Motivation: Current memory frameworks for LLMs are tailored for fact-based retrieval but fail to build narrative representations needed for tracking entities through episodic events, limiting long-context reasoning capabilities.
Method: GSW consists of an Operator that maps observations to semantic structures and a Reconciler that integrates them into a persistent workspace enforcing coherence across temporal, spatial, and logical dimensions.
Result: GSW outperforms RAG baselines by up to 20% on Episodic Memory Benchmark (100k-1M tokens) and reduces query-time context tokens by 51% compared to the next most token-efficient baseline.
Conclusion: GSW provides a blueprint for endowing LLMs with human-like episodic memory, enabling more capable agents that can reason over long horizons.
Abstract: Large Language Models (LLMs) face fundamental challenges in long-context reasoning: many documents exceed their finite context windows, while performance on texts that do fit degrades with sequence length, necessitating their augmentation with external memory frameworks. Current solutions, which have evolved from retrieval using semantic embeddings to more sophisticated structured knowledge graph representations for improved sense-making and associativity, are tailored for fact-based retrieval and fail to build the space-time-anchored narrative representations required for tracking entities through episodic events. To bridge this gap, we propose the Generative Semantic Workspace (GSW), a neuro-inspired generative memory framework that builds structured, interpretable representations of evolving situations, enabling LLMs to reason over evolving roles, actions, and spatiotemporal contexts. Our framework comprises an Operator, which maps incoming observations to intermediate semantic structures, and a Reconciler, which integrates these into a persistent workspace that enforces temporal, spatial, and logical coherence. On the Episodic Memory Benchmark (EpBench; Huet et al., 2025), comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG-based baselines by up to 20%. Furthermore, GSW is highly efficient, reducing query-time context tokens by 51% compared to the next most token-efficient baseline, reducing inference-time costs considerably. More broadly, GSW offers a concrete blueprint for endowing LLMs with human-like episodic memory, paving the way for more capable agents that can reason over long horizons.
[303] AI-Driven Contribution Evaluation and Conflict Resolution: A Framework & Design for Group Workload Investigation
Jakub Slapek, Mir Seyedebrahimi, Yang Jianhua
Main category: cs.AI
TL;DR: Proposes an AI-enhanced framework for fair team contribution assessment using multi-dimensional benchmarks and LLM analysis to resolve conflicts automatically.
Details
Motivation: Addresses persistent challenges in equitable team contribution assessment where manual conflict resolution is costly and difficult, identifying gaps in existing tools for conflict resolution and AI integration.
Method: Organizes heterogeneous artifacts into three dimensions (Contribution, Interaction, Role) with nine benchmarks, normalizes objective measures, uses Gini index for inequality detection, and employs LLM architecture for contextual analysis and advisory judgments.
Result: Framework enables automated conflict investigation through validated analysis of contribution measures, generating interpretable and transparent advisory outputs for fair performance evaluation.
Conclusion: The proposed AI-enhanced tool is feasible under current policies and addresses practical challenges in team assessment while incorporating bias safeguards and transparent analytics.
Abstract: The equitable assessment of individual contribution in teams remains a persistent challenge, where conflict and disparity in workload can result in unfair performance evaluation, often requiring manual intervention - a costly and challenging process. We survey existing tool features and identify a gap in conflict resolution methods and AI integration. To address this, we propose a framework and implementation design for a novel AI-enhanced tool that assists in dispute investigation. The framework organises heterogeneous artefacts - submissions (code, text, media), communications (chat, email), coordination records (meeting logs, tasks), peer assessments, and contextual information - into three dimensions (Contribution, Interaction, and Role) with nine benchmarks. Objective measures are normalised, aggregated per dimension, and paired with inequality measures (Gini index) to surface conflict markers. A Large Language Model (LLM) architecture performs validated and contextual analysis over these measures to generate interpretable and transparent advisory judgments. We argue for feasibility under current statutory and institutional policy, and outline practical analytics (sentiment, task fidelity, word/line count, etc.), bias safeguards, limitations, and practical challenges.
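The Gini index used here as a conflict marker has a standard closed form. As a minimal sketch over per-member contribution scores (the function name and usage are ours, not the paper's):

```python
def gini(values):
    """Gini index of non-negative contribution scores.
    0 = perfectly equal workload; values near 1 = highly unequal.
    Uses the sorted-rank formulation:
    G = sum_i (2i - n - 1) * x_i / (n * sum(x)), x ascending, i 1-based."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, 1)) / (n * total)
```

For example, four members with equal scores give a Gini of 0, while one member doing all the work in a team of four gives 0.75 (the maximum for n = 4), which would surface as a strong conflict marker.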
[304] Making LLMs Reliable When It Matters Most: A Five-Layer Architecture for High-Stakes Decisions
Alejandro R. Jadad
Main category: cs.AI
TL;DR: Framework for achieving human-AI cognitive partnership in high-stakes decisions through systematic calibration and protection architecture to prevent cognitive biases and ensure reliable strategic decision-making.
Details
Motivation: Address the gap in LLM reliability for high-stakes strategic decisions with uncertain outcomes, where mutually reinforcing cognitive biases in humans and AI threaten defensible valuations and sustainable investments.
Method: Systematic qualitative assessment across 7 frontier LLMs and 3 venture vignettes under time pressure, using detailed prompting and a 7-stage calibration sequence within a 5-layer protection architecture for bias monitoring and partnership verification.
Result: Achieved initial partnership state but required emergent maintenance protocols; reliability degrades with architectural drift and context exhaustion; cross-model validation revealed systematic performance differences across LLM architectures.
Conclusion: Human-AI teams can achieve cognitive partnership to prevent avoidable regret in high-stakes decisions, supporting consequential decision-making without introducing preventable cognitive traps when verification is delayed.
Abstract: Current large language models (LLMs) excel in verifiable domains where outputs can be checked before action but prove less reliable for high-stakes strategic decisions with uncertain outcomes. This gap, driven by mutually reinforcing cognitive biases in both humans and artificial intelligence (AI) systems, threatens the defensibility of valuations and sustainability of investments in the sector. This report describes a framework emerging from systematic qualitative assessment across 7 frontier-grade LLMs and 3 market-facing venture vignettes under time pressure. Detailed prompting specifying decision partnership and explicitly instructing avoidance of sycophancy, confabulation, solution drift, and nihilism achieved initial partnership state but failed to maintain it under operational pressure. Sustaining protective partnership state required an emergent 7-stage calibration sequence, built upon a 4-stage initialization process, within a 5-layer protection architecture enabling bias self-monitoring, human-AI adversarial challenge, partnership state verification, performance degradation detection, and stakeholder protection. Three discoveries resulted: partnership state is achievable through ordered calibration but requires emergent maintenance protocols; reliability degrades when architectural drift and context exhaustion align; and dissolution discipline prevents costly pursuit of fundamentally wrong directions. Cross-model validation revealed systematic performance differences across LLM architectures. This approach demonstrates that human-AI teams can achieve cognitive partnership capable of preventing avoidable regret in high-stakes decisions, addressing return-on-investment expectations that depend on AI systems supporting consequential decision-making without introducing preventable cognitive traps when verification arrives too late.
[305] AIA Forecaster: Technical Report
Rohan Alur, Bradly C. Stadie, Daniel Kang, Ryan Chen, Matt McManus, Michael Rickert, Tyler Lee, Michael Federici, Richard Zhu, Dennis Fogerty, Hayley Williamson, Nina Lozinski, Aaron Linsky, Jasjeet S. Sekhon
Main category: cs.AI
TL;DR: AIA Forecaster is an LLM-based system that achieves human superforecaster-level performance by combining agentic news search, supervisor reconciliation, and statistical calibration techniques.
Details
Motivation: To develop an AI system capable of expert-level forecasting at scale using unstructured data, addressing the limitations of prior LLM baselines in judgmental forecasting.
Method: Combines three core elements: agentic search over high-quality news sources, supervisor agent for reconciling disparate forecasts, and statistical calibration techniques to counter LLM behavioral biases.
Result: Achieves performance equal to human superforecasters on ForecastBench benchmark and provides additive information when combined with market consensus on a more challenging prediction market benchmark.
Conclusion: Establishes new state-of-the-art in AI forecasting with the first verifiable achievement of expert-level forecasting at scale, providing practical recommendations for future research.
Abstract: This technical report describes the AIA Forecaster, a Large Language Model (LLM)-based system for judgmental forecasting using unstructured data. The AIA Forecaster approach combines three core elements: agentic search over high-quality news sources, a supervisor agent that reconciles disparate forecasts for the same event, and a set of statistical calibration techniques to counter behavioral biases in large language models. On the ForecastBench benchmark (Karger et al., 2024), the AIA Forecaster achieves performance equal to human superforecasters, surpassing prior LLM baselines. In addition to reporting on ForecastBench, we also introduce a more challenging forecasting benchmark sourced from liquid prediction markets. While the AIA Forecaster underperforms market consensus on this benchmark, an ensemble combining AIA Forecaster with market consensus outperforms consensus alone, demonstrating that our forecaster provides additive information. Our work establishes a new state of the art in AI forecasting and provides practical, transferable recommendations for future research. To the best of our knowledge, this is the first work that verifiably achieves expert-level forecasting at scale.
[306] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents
Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu
Main category: cs.AI
TL;DR: ResearchRubrics is a benchmark for evaluating Deep Research agents using 2,500+ expert-written rubrics across factual grounding, reasoning, and clarity, revealing current systems achieve under 68% compliance.
Details
Motivation: Evaluating Deep Research agents is challenging due to lengthy, diverse responses with multiple valid solutions and dynamic information sources, requiring standardized assessment.
Method: Built benchmark with 2,800+ human hours, pairing realistic prompts with fine-grained rubrics; proposed complexity framework (conceptual breadth, logical nesting, exploration); developed human and model-based evaluation protocols.
Result: Leading DR agents like Gemini and OpenAI achieve under 68% average compliance, primarily due to missed implicit context and inadequate reasoning about retrieved information.
Conclusion: ResearchRubrics provides robust, scalable assessment for deep research capabilities and is released to facilitate progress toward well-justified research assistants.
Abstract: Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor that pairs realistic, domain-diverse prompts with 2,500+ expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents like Gemini’s DR and OpenAI’s DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.
[307] Towards AI-Assisted Generation of Military Training Scenarios
Soham Hans, Volkan Ustun, Benjamin Nye, James Sterrett, Matthew Green
Main category: cs.AI
TL;DR: A multi-agent LLM framework for automated generation of complex military training scenarios and OPORDs, overcoming limitations of previous AI tools.
Details
Motivation: Traditional scenario generation for military training is labor-intensive, and pre-LLM AI tools failed to produce sufficiently complex or adaptable scenarios.
Method: Hierarchical multi-agent framework with specialized LLM agents that sequentially process subproblems, integrating text and visual information while preserving logical consistency.
Result: Proof-of-concept successfully generates scheme of maneuver and movement sections of OPORDs with accurate map position estimation, demonstrating feasibility and coherence.
Conclusion: LLM-driven multi-agent systems show strong potential for automating complex scenario generation in military training, enabling dynamic adaptation and nuanced document creation.
Abstract: Achieving expert-level performance in simulation-based training relies on the creation of complex, adaptable scenarios, a traditionally laborious and resource-intensive process. Although prior research explored scenario generation for military training, pre-LLM AI tools struggled to generate sufficiently complex or adaptable scenarios. This paper introduces a multi-agent, multi-modal reasoning framework that leverages Large Language Models (LLMs) to generate critical training artifacts, such as Operations Orders (OPORDs). We structure our framework by decomposing scenario generation into a hierarchy of subproblems, and for each one, defining the role of the AI tool: (1) generating options for a human author to select from, (2) producing a candidate product for human approval or modification, or (3) generating textual artifacts fully automatically. Our framework employs specialized LLM-based agents to address distinct subproblems. Each agent receives input from preceding subproblem agents, integrating both text-based scenario details and visual information (e.g., map features, unit positions) and applies specialized reasoning to produce appropriate outputs. Subsequent agents process these outputs sequentially, preserving logical consistency and ensuring accurate document generation. This multi-agent strategy overcomes the limitations of basic prompting or single-agent approaches when tackling such highly complex tasks. We validate our framework through a proof-of-concept that generates the scheme of maneuver and movement section of an OPORD while estimating map positions and movements as a precursor, demonstrating its feasibility and accuracy. Our results demonstrate the potential of LLM-driven multi-agent systems to generate coherent, nuanced documents and adapt dynamically to changing conditions, advancing automation in scenario generation for military training.
[308] SciAgent: A Unified Multi-Agent System for Generalistic Scientific Reasoning
Xuchen Li, Ruitao Wu, Xuanbo Liu, Xukai Wang, Jinbo Hu, Zhixin Bai, Bohan Zeng, Hao Liang, Leheng Chen, Mingrui Chen, Haitian Zhong, Xuanlin Yang, Xu-Yao Zhang, Liu Liu, Jia Li, Kaiqi Huang, Jiahao Xu, Haitao Mi, Wentao Zhang, Bin Dong
Main category: cs.AI
TL;DR: SciAgent is a unified multi-agent system for general scientific reasoning that achieves expert-level performance across multiple scientific Olympiads by dynamically orchestrating specialized reasoning agents.
Details
Motivation: Current AI systems excel at specific scientific tasks but lack adaptability across different disciplines and difficulty levels, requiring a more generalist approach to scientific reasoning.
Method: Hierarchical multi-agent system with a Coordinator Agent that interprets problems and orchestrates specialized Worker Systems composed of reasoning Sub-agents for symbolic deduction, conceptual modeling, numerical computation, and verification.
Result: Consistently attains or surpasses human gold-medalist performance across mathematics and physics Olympiads (IMO, IMC, IPhO, CPhO), and shows generalization ability on chemistry Olympiad and Humanity’s Last Exam benchmark.
Conclusion: SciAgent represents a concrete step toward generalistic scientific intelligence, demonstrating coherent cross-disciplinary reasoning at expert levels across diverse scientific domains.
Abstract: Recent advances in large language models have enabled AI systems to achieve expert-level performance on domain-specific scientific tasks, yet these systems remain narrow and handcrafted. We introduce SciAgent, a unified multi-agent system designed for generalistic scientific reasoning - the ability to adapt reasoning strategies across disciplines and difficulty levels. SciAgent organizes problem solving as a hierarchical process: a Coordinator Agent interprets each problem’s domain and complexity, dynamically orchestrating specialized Worker Systems, each composed of interacting reasoning Sub-agents for symbolic deduction, conceptual modeling, numerical computation, and verification. These agents collaboratively assemble and refine reasoning pipelines tailored to each task. Across mathematics and physics Olympiads (IMO, IMC, IPhO, CPhO), SciAgent consistently attains or surpasses human gold-medalist performance, demonstrating both domain generality and reasoning adaptability. Additionally, SciAgent has been tested on the International Chemistry Olympiad (IChO) and selected problems from the Humanity’s Last Exam (HLE) benchmark, further confirming the system’s ability to generalize across diverse scientific domains. This work establishes SciAgent as a concrete step toward generalistic scientific intelligence - AI systems capable of coherent, cross-disciplinary reasoning at expert levels.
[309] Operational machine learning for remote spectroscopic detection of CH$_{4}$ point sources
Vít Růžička, Gonzalo Mateo-García, Itziar Irakulis-Loitxate, Juan Emmanuel Johnson, Manuel Montesino San Martín, Anna Allen, Luis Guanter, David R. Thompson
Main category: cs.AI
TL;DR: Machine learning system deployed in UN’s Methane Alert and Response System to automatically detect methane emissions from satellite data, reducing false detections by 74% and accelerating leak verification.
Details
Motivation: Current satellite-based methane detection methods using matched filters produce high false detection rates requiring manual verification, creating operational inefficiencies for global methane monitoring.
Method: Created the largest global dataset of annotated methane plumes from three imaging spectrometer missions, compared deep learning models, used model ensembling to reduce false detections, and deployed in an operational pipeline.
Result: System reduced false detections by over 74%, processed 1,351 distinct methane leaks during 7-month deployment, resulting in 479 stakeholder notifications, and demonstrated utility in verifying mitigation success across multiple countries.
Conclusion: This represents a critical step towards global AI-assisted methane leak detection system needed to handle increasing data volumes from current and future imaging spectrometers.
Abstract: Mitigating anthropogenic methane sources is one of the most cost-effective levers to slow down global warming. While satellite-based imaging spectrometers, such as EMIT, PRISMA, and EnMAP, can detect these point sources, current methane retrieval methods based on matched filters still produce a high number of false detections requiring laborious manual verification. This paper describes the operational deployment of a machine learning system for detecting methane emissions within the Methane Alert and Response System (MARS) of the United Nations Environment Programme’s International Methane Emissions Observatory. We created the largest and most diverse global dataset of annotated methane plumes from three imaging spectrometer missions and quantitatively compared different deep learning model configurations. Focusing on the requirements for operational deployment, we extended prior evaluation methodologies from small tiled datasets to full granule evaluation. This revealed that deep learning models still produce a large number of false detections, a problem we address with model ensembling, which reduced false detections by over 74%. Deployed in the MARS pipeline, our system processes scenes and proposes plumes to analysts, accelerating the detection and analysis process. During seven months of operational deployment, it facilitated the verification of 1,351 distinct methane leaks, resulting in 479 stakeholder notifications. We further demonstrate the model’s utility in verifying mitigation success through case studies in Libya, Argentina, Oman, and Azerbaijan. Our work represents a critical step towards a global AI-assisted methane leak detection system, which is required to process the dramatically higher data volumes expected from new and current imaging spectrometers.
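The ensembling step that cut false detections by over 74% can be pictured as per-pixel agreement voting over candidate plume masks. The following is an illustrative sketch only; the deployed ensemble rule, threshold, and data layout are assumptions, not the paper's actual implementation:

```python
import numpy as np

def ensemble_filter(member_masks, min_agreement=3):
    """Keep a candidate detection only where enough ensemble members agree.
    member_masks: (n_members, n_pixels) boolean plume masks, one per model.
    min_agreement: illustrative vote threshold (not the deployed value)."""
    votes = np.sum(member_masks, axis=0)  # per-pixel agreement count
    return votes >= min_agreement
```

A lone model firing on a pixel (a likely false detection) is suppressed unless other members corroborate it, which is the intuition behind ensembling away spurious matched-filter hits.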
[310] Alignment-Aware Quantization for LLM Safety
Sunghyun Wee, Suyoung Kim, Hyeonjin Kim, Kyomin Hwang, Nojun Kwak
Main category: cs.AI
TL;DR: AAQ addresses safety degradation in quantized LLMs by integrating alignment-preserving contrastive loss, enabling 4-bit quantization while maintaining safety where previous methods fail.
Details
Motivation: Conventional PTQ focuses only on perplexity, creating safety vulnerabilities where quantized models show low perplexity but poor alignment with safety policies.
Method: Alignment-Aware Quantization (AAQ) with Alignment-Preserving Contrastive (APC) loss that encourages quantized models to mimic safe instruction-tuned models while diverging from unaligned pre-trained counterparts.
Result: AAQ enables robust 4-bit (W4A4) quantization across LLaMA, Qwen, and Mistral models while maintaining safety, without requiring specialized safety datasets.
Conclusion: AAQ resolves the efficiency-safety trade-off in LLM quantization, paving the way for both efficient and trustworthy LLMs.
Abstract: Safety and efficiency are both important factors when deploying large language models (LLMs). LLMs are trained to follow human alignment for safety, and post-training quantization (PTQ) is applied afterward for efficiency. However, these two objectives are often in conflict, revealing a fundamental flaw in the conventional PTQ paradigm: quantization can turn into a safety vulnerability if it only aims to achieve low perplexity. Models can demonstrate low perplexity yet exhibit significant degradation in alignment with the safety policy, highlighting that perplexity alone is an insufficient and often misleading proxy for model safety. To address this, we propose Alignment-Aware Quantization (AAQ), a novel approach that integrates an Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline. Compared to a simple reconstruction loss, ours explicitly preserves alignment by encouraging the quantized model to mimic its safe, instruction-tuned model while diverging from the unaligned, pre-trained counterpart. Our method achieves this robust safety alignment without resorting to specialized safety-focused calibration datasets, highlighting its practical utility and broad applicability. AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families such as LLaMA, Qwen, and Mistral while maintaining safety where previous methods fail. Our work resolves the critical trade-off between efficiency and safety, paving the way toward LLMs that are both efficient and trustworthy. Anonymized code is available in the supplementary material.
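The idea of a contrastive loss that pulls the quantized model toward its aligned teacher and pushes it away from the unaligned base model can be sketched as a standard InfoNCE-style objective over hidden representations. The exact form of the APC loss is not given in the summary, so this is a minimal assumption-laden illustration, not the paper's objective:

```python
import numpy as np

def apc_loss(h_quant, h_aligned, h_unaligned, tau=1.0):
    """Contrastive sketch: the aligned (instruction-tuned) representation is
    the positive, the unaligned (pre-trained) one is the negative.
    All arguments are hidden-state vectors; tau is a temperature."""
    def cos(a, b):
        return np.sum(a * b, -1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    pos = np.exp(cos(h_quant, h_aligned) / tau)
    neg = np.exp(cos(h_quant, h_unaligned) / tau)
    # lower loss when the quantized model sits close to the aligned teacher
    return float(np.mean(-np.log(pos / (pos + neg))))
```

Minimizing this during calibration would favor quantization solutions that keep the model near the safety-aligned behavior rather than merely matching pre-quantization logits.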
[311] GAMA: A Neural Neighborhood Search Method with Graph-aware Multi-modal Attention for Vehicle Routing Problem
Xiangling Chen, Yi Mei, Mengjie Zhang
Main category: cs.AI
TL;DR: GAMA is a neural neighborhood search method for Vehicle Routing Problems that uses Graph-aware Multi-modal Attention to better capture structural and semantic context through separate encoding of problem instances and solutions with attention mechanisms.
Details
Motivation: Existing neural approaches for VRPs use simplistic state representations and naive information fusion, limiting their ability to capture rich structural and semantic context needed for effective routing decisions.
Method: Encodes problem instances and evolving solutions as distinct modalities using graph neural networks, models intra- and inter-modal interactions through stacked self- and cross-attention layers, and uses gated fusion to integrate multi-modal representations.
Result: Significantly outperforms recent neural baselines across various synthetic and benchmark instances, with ablation studies confirming the importance of the multi-modal attention mechanism and gated fusion design.
Conclusion: GAMA’s graph-aware multi-modal attention approach effectively captures rich structural context in VRPs, leading to superior performance over existing neural methods.
Abstract: Recent advances in neural neighborhood search methods have shown potential in tackling Vehicle Routing Problems (VRPs). However, most existing approaches rely on simplistic state representations and fuse heterogeneous information via naive concatenation, limiting their ability to capture rich structural and semantic context. To address these limitations, we propose GAMA, a neural neighborhood search method with a Graph-aware Multi-modal Attention model for VRPs. GAMA encodes the problem instance and its evolving solution as distinct modalities using graph neural networks, and models their intra- and inter-modal interactions through stacked self- and cross-attention layers. A gated fusion mechanism further integrates the multi-modal representations into a structured state, enabling the policy to make informed and generalizable operator selection decisions. Extensive experiments conducted across various synthetic and benchmark instances demonstrate that the proposed algorithm GAMA significantly outperforms the recent neural baselines. Further ablation studies confirm that both the multi-modal attention mechanism and the gated fusion design play a key role in achieving the observed performance gains.
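The gated fusion mechanism can be sketched in a few lines: a learned gate, computed from both modality embeddings, interpolates per dimension between the instance representation and the solution representation. The weight names and shapes below are hypothetical; this illustrates the gating pattern, not GAMA's actual parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_inst, h_sol, Wg, Wi, Ws):
    """Fuse two modality embeddings (instance graph vs. evolving solution).
    Wg maps the concatenated embeddings to a gate in (0, 1); Wi and Ws
    project each modality into the shared state space."""
    g = sigmoid(Wg @ np.concatenate([h_inst, h_sol]))  # per-dimension gate
    return g * (Wi @ h_inst) + (1 - g) * (Ws @ h_sol)
```

When the gate saturates toward 0 or 1, the fused state is dominated by one modality; in between, it blends structural (instance) and semantic (solution) context, which is the role the abstract attributes to gated fusion.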
[312] WaterMod: Modular Token-Rank Partitioning for Probability-Balanced LLM Watermarking
Shinwoo Park, Hyejin Park, Hyeseon Ahn, Yo-Sub Han
Main category: cs.AI
TL;DR: WaterMod is a probability-aware watermarking method that uses modular arithmetic to embed imperceptible marks in LLM outputs while maintaining generation quality, supporting both binary attribution and multi-bit payloads.
Details
Motivation: To address the limitation of conventional logit-based watermarks that can exclude high-probability tokens and degrade fluency, while complying with regulations requiring machine-verifiable provenance marks for synthetic content.
Method: Sorts vocabulary by probability, partitions tokens using rank mod k, applies small bias to one class. Uses entropy-adaptive gate for zero-bit (k=2) and payload digit selection for multi-bit (k>2) settings.
Result: WaterMod achieves strong watermark detection performance while maintaining generation quality across natural language, mathematical reasoning, and code synthesis tasks.
Conclusion: The modular arithmetic approach supports both binary attribution and rich payloads, providing robust watermarking without compromising fluency.
Abstract: Large language models now draft news, legal analyses, and software code with human-level fluency. At the same time, regulations such as the EU AI Act mandate that each synthetic passage carry an imperceptible, machine-verifiable mark for provenance. Conventional logit-based watermarks satisfy this requirement by selecting a pseudorandom green vocabulary at every decoding step and boosting its logits, yet the random split can exclude the highest-probability token and thus erode fluency. WaterMod mitigates this limitation through a probability-aware modular rule. The vocabulary is first sorted in descending model probability; the resulting ranks are then partitioned by the residue rank mod k, which distributes adjacent (and therefore semantically similar) tokens across different classes. A fixed bias of small magnitude is applied to one selected class. In the zero-bit setting (k=2), an entropy-adaptive gate selects either the even or the odd parity as the green list. Because the top two ranks fall into different parities, this choice embeds a detectable signal while guaranteeing that at least one high-probability token remains available for sampling. In the multi-bit regime (k>2), the current payload digit d selects the color class whose ranks satisfy rank mod k = d. Biasing the logits of that class embeds exactly one base-k digit per decoding step, thereby enabling fine-grained provenance tracing. The same modular arithmetic therefore supports both binary attribution and rich payloads. Experimental results demonstrate that WaterMod consistently attains strong watermark detection performance while maintaining generation quality in both zero-bit and multi-bit settings. This robustness holds across a range of tasks, including natural language generation, mathematical reasoning, and code synthesis. Our code and data are available at https://github.com/Shinwoo-Park/WaterMod.
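The modular rule is concrete enough to sketch. The snippet below (key handling, the gate, and the bias magnitude are our assumptions, not the released code) ranks tokens by logit, colors them by rank mod k, and biases the selected residue class; note that for k=2 the top two ranks always land in different parity classes, so a high-probability token survives either choice:

```python
import numpy as np

def watermod_bias(logits, key, k=2, digit=None, delta=2.0):
    """Apply a WaterMod-style modular bias to a logit vector.
    In the zero-bit setting (digit=None, k=2) the green parity would be
    chosen by an entropy-adaptive gate; here a key-derived parity stands
    in for that gate. In the multi-bit setting, `digit` is the payload
    digit d and the class rank mod k == d is biased."""
    order = np.argsort(-logits)            # token ids, highest logit first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(logits))  # rank of each token id
    if digit is None:
        digit = key % k                    # hypothetical stand-in for the gate
    green = (ranks % k) == digit
    return logits + delta * green, green
```

Detection then works as in standard green-list schemes: a passage with a statistically improbable excess of green-class tokens is flagged, and for k>2 the sequence of biased residues recovers the embedded base-k payload.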
[313] Confidence-Aware Neural Decoding of Overt Speech from EEG: Toward Robust Brain-Computer Interfaces
Soowon Kim, Byung-Kwan Ko, Seo-Hyun Lee
Main category: cs.AI
TL;DR: A confidence-aware brain-computer interface framework that uses deep ensembles of speech-oriented convolutional networks with uncertainty quantification and selective classification to improve reliability and trustworthiness in decoding spoken commands from EEG signals.
Details
Motivation: To develop non-invasive brain-computer interfaces that are both accurate and trustworthy for decoding spoken commands from electroencephalogram signals, addressing the need for reliable probability estimates and deployment-oriented behavior.
Method: Deep ensembles of compact, speech-oriented convolutional networks coupled with post-hoc calibration and selective classification. Uncertainty is quantified using ensemble-based predictive entropy, top-two margin, and mutual information, with decisions made using an abstain option governed by an accuracy-coverage operating point.
Result: The proposed method yields more reliable probability estimates, improved selective performance across operating points, and balanced per-class acceptance compared with widely used baselines, evaluated on a multi-class overt speech dataset using a leakage-safe, block-stratified split.
Conclusion: Confidence-aware neural decoding can provide robust, deployment-oriented behavior for real-world brain-computer interface communication systems, making them more trustworthy and practical for real-world applications.
Abstract: Non-invasive brain-computer interfaces that decode spoken commands from electroencephalogram (EEG) signals must be both accurate and trustworthy. We present a confidence-aware decoding framework that couples deep ensembles of compact, speech-oriented convolutional networks with post-hoc calibration and selective classification. Uncertainty is quantified using ensemble-based predictive entropy, top-two margin, and mutual information, and decisions are made with an abstain option governed by an accuracy-coverage operating point. The approach is evaluated on a multi-class overt speech dataset using a leakage-safe, block-stratified split that respects temporal contiguity. Compared with widely used baselines, the proposed method yields more reliable probability estimates, improved selective performance across operating points, and balanced per-class acceptance. These results suggest that confidence-aware neural decoding can provide robust, deployment-oriented behavior for real-world brain-computer interface communication systems.
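The three uncertainty measures and the abstain option can be sketched directly from ensemble outputs. The thresholds below are illustrative placeholders; the paper instead tunes an accuracy-coverage operating point:

```python
import numpy as np

def selective_predict(member_probs, entropy_max=0.5, margin_min=0.2):
    """Confidence-aware selective classification from a deep ensemble.
    member_probs: (n_members, n_classes) class probabilities per member.
    Returns (predicted class or None if abstaining, uncertainty measures)."""
    p = member_probs.mean(axis=0)                        # ensemble mean
    entropy = -np.sum(p * np.log(p + 1e-12))             # predictive entropy
    top2 = np.sort(p)[-2:]
    margin = top2[1] - top2[0]                           # top-two margin
    # mutual information = predictive entropy - mean per-member entropy
    member_ent = -np.sum(member_probs * np.log(member_probs + 1e-12), axis=1).mean()
    mi = entropy - member_ent
    accept = entropy <= entropy_max and margin >= margin_min
    info = {"entropy": float(entropy), "margin": float(margin), "mi": float(mi)}
    return (int(np.argmax(p)) if accept else None), info
```

Sweeping the thresholds traces out an accuracy-coverage curve: tighter thresholds abstain more often but make the accepted predictions more reliable, which is the deployment-oriented behavior the abstract argues for.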
[314] Toward Robust EEG-based Intention Decoding during Misarticulated Speech in Aphasia
Ha-Na Jo, Jung-Sun Lee, Eunyeong Ko
Main category: cs.AI
TL;DR: EEG-based communication system for aphasia patients using multitask learning with delta features achieved 58.6% accuracy for correct speech and 45.5% for misarticulated trials, outperforming baseline by 45% on error trials.
Details
Motivation: Aphasia severely limits verbal communication with frequent misarticulations, but EEG-based communication support systems for aphasic patients have received little attention despite growing interest in brain-computer interfaces.
Method: Recruited a single aphasia patient for a Korean speech task, recorded EEG signals, and labeled trials as correct/incorrect. Used spectral analysis to identify neural patterns, then developed a soft multitask learning framework with maximum mean discrepancy regularization focusing on delta features to optimize class discrimination and align EEG feature distributions.
Result: Spectral analysis revealed distinct neural patterns: misarticulated trials showed excessive delta power across widespread channels and increased theta-alpha activity in frontal regions. The proposed model achieved 58.6% accuracy for correct trials and 45.5% for misarticulated trials, outperforming baseline by over 45% on error trials.
Conclusion: The results demonstrate feasibility of EEG-based assistive systems capable of supporting real-world, imperfect speech conditions in aphasia patients through robust intention decoding even under articulation errors.
Abstract: Aphasia severely limits verbal communication due to impaired language production, often leading to frequent misarticulations during speech attempts. Despite growing interest in brain-computer interface technologies, relatively little attention has been paid to developing EEG-based communication support systems tailored for aphasic patients. To address this gap, we recruited a single participant with expressive aphasia and conducted a Korean-based automatic speech task. EEG signals were recorded during task performance, and each trial was labeled as either correct or incorrect depending on whether the intended word was successfully spoken. Spectral analysis revealed distinct neural activation patterns between the two trial types: misarticulated trials exhibited excessive delta power across widespread channels and increased theta-alpha activity in frontal regions. Building upon these findings, we developed a soft multitask learning framework with maximum mean discrepancy regularization that focuses on delta features to jointly optimize class discrimination while aligning the EEG feature distributions of correct and misarticulated trials. The proposed model achieved 58.6% accuracy for correct and 45.5% for misarticulated trials, outperforming the baseline by over 45% on the latter and demonstrating robust intention decoding even under articulation errors. These results highlight the feasibility of EEG-based assistive systems capable of supporting real-world, imperfect speech conditions in aphasia patients.
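The maximum mean discrepancy (MMD) regularizer used to align the feature distributions of correct and misarticulated trials has a standard kernel form. This is a generic RBF-kernel, biased-estimate sketch for illustration; the paper's kernel choice and weighting are not specified here:

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased MMD^2 estimate between two feature samples with an RBF kernel.
    X: (n, d) features from correct trials; Y: (m, d) from misarticulated
    trials. Minimizing this as a regularizer pulls the two distributions
    together in feature space."""
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean())
```

Adding a term like this to the classification loss is what lets a decoder trained mostly on correct trials remain usable when articulation fails, since both trial types are pushed toward a shared representation.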
[315] SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder
Dengcan Liu, Jiahao Li, Zheren Fu, Yi Tu, Jiajun Li, Zhendong Mao, Yongdong Zhang
Main category: cs.AI
TL;DR: SparseRM uses Sparse Autoencoder to extract preference-relevant features from LLM representations, creating a lightweight and interpretable reward model that achieves superior performance with less than 1% trainable parameters.
Details
Motivation: Training reliable reward models under limited resources is challenging due to reliance on large-scale preference annotations and high fine-tuning costs of LLMs.
Method: Uses Sparse Autoencoder to decompose LLM representations into interpretable preference-relevant directions, projects representations onto these directions to compute alignment scores, and aggregates scores with a simple reward head.
Result: Achieves superior performance over most mainstream reward models on three preference modeling tasks while using less than 1% of trainable parameters.
Conclusion: SparseRM provides an efficient and interpretable approach for reward modeling that integrates seamlessly into downstream alignment pipelines.
Abstract: Reward models (RMs) are a core component in the post-training of large language models (LLMs), serving as proxies for human preference evaluation and guiding model alignment. However, training reliable RMs under limited resources remains challenging due to the reliance on large-scale preference annotations and the high cost of fine-tuning LLMs. To address this, we propose SparseRM, which leverages Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations, enabling the construction of a lightweight and interpretable reward model. SparseRM first employs SAE to decompose LLM representations into interpretable directions that capture preference-relevant features. The representations are then projected onto these directions to compute alignment scores, which quantify the strength of each preference feature in the representations. A simple reward head aggregates these scores to predict preference scores. Experiments on three preference modeling tasks show that SparseRM achieves superior performance over most mainstream RMs while using less than 1% of trainable parameters. Moreover, it integrates seamlessly into downstream alignment pipelines, highlighting its potential for efficient alignment.
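The scoring path described above (project representations onto SAE-derived preference directions, then aggregate with a simple reward head) reduces to a few matrix operations. Shapes and names below are our assumptions for illustration; the SAE training itself is omitted:

```python
import numpy as np

def sparserm_score(h, sae_directions, reward_weights, reward_bias=0.0):
    """Score a response representation with a SparseRM-style reward head.
    h: (d,) hidden representation of a response from the base LLM.
    sae_directions: (n_dirs, d) preference-relevant directions from an SAE.
    reward_weights: (n_dirs,) weights of the lightweight reward head."""
    align = sae_directions @ h                       # alignment score per direction
    return float(reward_weights @ align + reward_bias)
```

Only `reward_weights` and `reward_bias` would be trained here, which is consistent with the paper's claim of using under 1% of trainable parameters: the LLM and the SAE directions stay frozen.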
[316] Data Descriptions from Large Language Models with Influence Estimation
Chaeri Kim, Jaeyeon Bae, Taehwan Kim
Main category: cs.AI
TL;DR: The paper proposes a novel approach to explain data using textual descriptions generated by large language models, incorporating external knowledge and influence estimation to select informative descriptions, with evaluation through cross-modal transfer classification.
Details
Motivation: Deep learning models are often black-boxes, and while most XAI focuses on interpreting model predictions, this work aims to understand data itself through human-readable language explanations.
Method: A pipeline using large language models to generate textual descriptions of data, incorporating external knowledge bases, influence estimation for selecting informative descriptions, and CLIP scores. Introduces cross-modal transfer classification as a benchmark task.
Result: In zero-shot settings, the generated textual descriptions outperform baseline descriptions and boost performance of image-only trained models across nine classification datasets, with GPT-4o evaluation confirming effectiveness.
Conclusion: The approach provides insights into model decision-making interpretability and demonstrates that textual data descriptions can effectively enhance cross-modal understanding and model performance.
Abstract: Deep learning models have been successful in many areas, but understanding their behavior remains a black box. Most prior explainable AI (XAI) approaches have focused on interpreting and explaining how models make predictions. In contrast, we would like to understand how the data itself can be explained through deep learning model training, and we propose a novel approach to understand the data via one of the most common media, language, so that humans can easily understand it. Our approach is a pipeline that generates textual descriptions explaining the data with large language models by incorporating external knowledge bases. However, generated data descriptions may still include irrelevant information, so we exploit influence estimation, along with the CLIP score, to choose the most informative textual descriptions. Furthermore, based on the phenomenon of cross-modal transferability, we propose a novel benchmark task named cross-modal transfer classification to examine the effectiveness of our textual descriptions. In zero-shot experiments, we show that our textual descriptions are more effective than other baseline descriptions, and furthermore, we successfully boost the performance of the model trained only on images across all nine image classification datasets. These results are further supported by evaluation using GPT-4o. Through our approach, we may gain insights into the inherent interpretability of the decision-making process of the model.
[317] DANS-KGC: Diffusion Based Adaptive Negative Sampling for Knowledge Graph Completion
Haoning Li, Qinghua Huang
Main category: cs.AI
TL;DR: DANS-KGC is a diffusion-based adaptive negative sampling method for knowledge graph completion that generates diverse hardness negative samples using difficulty assessment and dynamic training.
Details
Motivation: To overcome limitations of existing negative sampling strategies including vulnerability to false negatives, limited generalization, and lack of control over sample hardness.
Method: Uses three components: Difficulty Assessment Module (evaluates entity learning difficulty), Adaptive Negative Sampling Module (conditional diffusion model with difficulty-aware noise scheduling), and Dynamic Training Mechanism (adjusts hardness distribution during training).
Result: Achieved state-of-the-art results on all three evaluation metrics for UMLS and YAGO3-10 datasets across six benchmark datasets, demonstrating strong generalization ability.
Conclusion: DANS-KGC effectively addresses key limitations of negative sampling in knowledge graph representation through its adaptive diffusion-based approach and dynamic training mechanism.
Abstract: Negative sampling (NS) strategies play a crucial role in knowledge graph representation. In order to overcome the limitations of existing negative sampling strategies, such as vulnerability to false negatives, limited generalization, and lack of control over sample hardness, we propose DANS-KGC (Diffusion-based Adaptive Negative Sampling for Knowledge Graph Completion). DANS-KGC comprises three key components: the Difficulty Assessment Module (DAM), the Adaptive Negative Sampling Module (ANS), and the Dynamic Training Mechanism (DTM). DAM evaluates the learning difficulty of entities by integrating semantic and structural features. Based on this assessment, ANS employs a conditional diffusion model with difficulty-aware noise scheduling, leveraging semantic and neighborhood information during the denoising phase to generate negative samples of diverse hardness. DTM further enhances learning by dynamically adjusting the hardness distribution of negative samples throughout training, enabling a curriculum-style progression from easy to hard examples. Extensive experiments on six benchmark datasets demonstrate the effectiveness and generalization ability of DANS-KGC, with the method achieving state-of-the-art results on all three evaluation metrics for the UMLS and YAGO3-10 datasets.
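The Dynamic Training Mechanism's curriculum-style progression from easy to hard negatives can be pictured as a hardness distribution that shifts over epochs. The schedule below is purely illustrative (the paper does not specify its functional form); it only shows the easy-to-hard drift the abstract describes:

```python
import numpy as np

def hardness_distribution(epoch, total_epochs, n_levels=5, sharpness=5.0):
    """Sampling weights over discrete hardness levels (0 = easiest).
    Early epochs concentrate mass on easy negatives; later epochs shift
    the preferred hardness linearly toward the hardest level."""
    progress = epoch / max(total_epochs - 1, 1)   # 0 -> 1 over training
    levels = np.arange(n_levels)
    target = progress * (n_levels - 1)            # currently preferred hardness
    w = np.exp(-sharpness * np.abs(levels - target))
    return w / w.sum()
```

A sampler drawing negatives according to these weights reproduces the curriculum: near-uniform emphasis on easy examples at the start, hard examples dominating by the end.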
[318] Neurophysiological Characteristics of Adaptive Reasoning for Creative Problem-Solving Strategy
Jun-Young Kim, Young-Seok Kweon, Gi-Hwan Shin, Seong-Whan Lee
Main category: cs.AI
TL;DR: Study identifies neural signatures of human adaptive reasoning using EEG and card-sorting tasks, showing coordinated delta-theta-alpha dynamics for rule inference and attention stabilization, while multimodal LLMs lack genuine adaptive reasoning capabilities.
Details
Motivation: To understand the neurophysiological mechanisms underlying human adaptive reasoning and compare it with artificial intelligence systems, particularly multimodal large language models.
Method: Used electroencephalography (EEG) combined with a card-sorting paradigm to analyze stimulus- and feedback-locked neural dynamics, comparing human performance with a multimodal large language model.
Result: Humans showed coordinated delta-theta-alpha dynamics: early delta-theta activity for exploratory monitoring and rule inference, and occipital alpha engagement for confirmatory attention stabilization after successful rule identification. The multimodal LLM only exhibited short-term feedback-driven adjustments without hierarchical rule abstraction.
Conclusion: The study identifies specific neural signatures of human adaptive reasoning and highlights the limitations of current AI systems, suggesting the need for brain-inspired artificial intelligence that incorporates oscillatory feedback coordination for true context-sensitive adaptation.
Abstract: Adaptive reasoning enables humans to flexibly adjust inference strategies when environmental rules or contexts change, yet its underlying neural dynamics remain unclear. This study investigated the neurophysiological mechanisms of adaptive reasoning using a card-sorting paradigm combined with electroencephalography and compared human performance with that of a multimodal large language model. Stimulus- and feedback-locked analyses revealed coordinated delta-theta-alpha dynamics: early delta-theta activity reflected exploratory monitoring and rule inference, whereas occipital alpha engagement indicated confirmatory stabilization of attention after successful rule identification. In contrast, the multimodal large language model exhibited only short-term feedback-driven adjustments without hierarchical rule abstraction or genuine adaptive reasoning. These findings identify the neural signatures of human adaptive reasoning and highlight the need for brain-inspired artificial intelligence that incorporates oscillatory feedback coordination for true context-sensitive adaptation.
[319] Lightweight Diffusion-based Framework for Online Imagined Speech Decoding in Aphasia
Eunyeong Ko, Soowon Kim, Ha-Na Jo
Main category: cs.AI
TL;DR: A diffusion-based EEG decoding system for real-time imagined speech classification in aphasia patients, achieving 65% top-1 accuracy in online trials with minimal preprocessing.
Details
Motivation: To develop a practical brain-computer interface for clinical communication support in individuals with severe expressive language impairment (aphasia), enabling real-time imagined speech classification.
Method: Uses a lightweight conditional diffusion encoder and convolutional classifier trained on subject-specific EEG data, with dual-criterion early stopping, dropout regularization, and grouped temporal convolutions for stable generalization. Processes continuous EEG in 2-second sliding windows during online operation.
Result: Achieved 65% top-1 and 70% top-2 accuracy across 20 real-time trials, outperforming offline evaluation (50% top-1). Maintained reliable performance despite environmental variability and minimal preprocessing.
Conclusion: Demonstrates feasibility of deploying diffusion-based EEG decoding under practical clinical constraints, advancing translation of imagined speech BCIs toward clinical communication support for aphasia patients.
Abstract: We present a diffusion-based neural decoding framework optimized for real-time imagined speech classification in individuals with aphasia. The system integrates a lightweight conditional diffusion encoder and convolutional classifier trained using subject-specific EEG data acquired from a Korean-language paradigm. A dual-criterion early stopping strategy enabled rapid convergence under limited calibration data, while dropout regularization and grouped temporal convolutions ensured stable generalization. During online operation, continuous EEG streams were processed in two-second sliding windows to generate class probabilities that dynamically modulated visual and auditory feedback according to decoding confidence. Across twenty real-time trials, the framework achieved 65% top-1 and 70% top-2 accuracy, outperforming offline evaluation (50% top-1). These results demonstrate the feasibility of deploying diffusion-based EEG decoding under practical clinical constraints, maintaining reliable performance despite environmental variability and minimal preprocessing. The proposed framework advances the translation of imagined speech brain-computer interfaces toward clinical communication support for individuals with severe expressive language impairment.
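The online windowing step (continuous EEG cut into two-second sliding windows for the decoder) is simple to sketch. The sampling rate and stride below are our assumptions; the paper only specifies the two-second window length:

```python
import numpy as np

def sliding_windows(eeg, fs=250, win_sec=2.0, step_sec=0.5):
    """Cut a continuous EEG stream into overlapping windows.
    eeg: (n_channels, n_samples) array; fs: sampling rate in Hz.
    Each window is passed to the decoder to produce class probabilities."""
    win = int(win_sec * fs)
    step = int(step_sec * fs)
    n = eeg.shape[1]
    return [eeg[:, s:s + win] for s in range(0, n - win + 1, step)]
```

In an online loop, each new window's class probabilities would drive the confidence-modulated visual and auditory feedback the abstract describes.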
[320] Question-to-Knowledge (Q2K): Multi-Agent Generation of Inspectable Facts for Product Mapping
Wonduk Seo, Taesub Shin, Hyunjin An, Dokyun Kim, Seunghyun Lee
Main category: cs.AI
TL;DR: Q2K is a multi-agent LLM framework that improves SKU mapping accuracy by generating targeted questions, searching for answers, and reusing validated reasoning to avoid redundant searches.
Details
Motivation: Traditional rule-based methods often misclassify products due to subtle differences in brand, specifications, or bundle configurations when explicit SKU identifiers are missing.
Method: Uses three agents: Reasoning Agent generates disambiguation questions, Knowledge Agent performs focused web searches, and Deduplication Agent reuses validated reasoning traces. Includes human-in-the-loop for uncertain cases.
Result: Outperforms strong baselines on real-world consumer goods datasets, achieving higher accuracy and robustness in challenging scenarios like bundle identification and brand origin disambiguation.
Conclusion: Q2K provides a scalable, interpretable solution for product integration that balances accuracy with efficiency by reusing retrieved reasoning instead of repeated searches.
Abstract: Identifying whether two product listings refer to the same Stock Keeping Unit (SKU) is a persistent challenge in e-commerce, especially when explicit identifiers are missing and product names vary widely across platforms. Rule-based heuristics and keyword similarity often misclassify products by overlooking subtle distinctions in brand, specification, or bundle configuration. To overcome these limitations, we propose Question-to-Knowledge (Q2K), a multi-agent framework that leverages Large Language Models (LLMs) for reliable SKU mapping. Q2K integrates: (1) a Reasoning Agent that generates targeted disambiguation questions, (2) a Knowledge Agent that resolves them via focused web searches, and (3) a Deduplication Agent that reuses validated reasoning traces to reduce redundancy and ensure consistency. A human-in-the-loop mechanism further refines uncertain cases. Experiments on real-world consumer goods datasets show that Q2K surpasses strong baselines, achieving higher accuracy and robustness in difficult scenarios such as bundle identification and brand origin disambiguation. By reusing retrieved reasoning instead of issuing repeated searches, Q2K balances accuracy with efficiency, offering a scalable and interpretable solution for product integration.
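The Deduplication Agent's trace reuse amounts to caching resolved questions so the Knowledge Agent does not repeat equivalent web searches. The interface below is entirely hypothetical (the paper's agents are LLM-driven); it only illustrates the reuse-instead-of-re-search pattern:

```python
class ReasoningCache:
    """Cache validated reasoning traces keyed by a normalized question.
    `resolve` stands in for the Knowledge Agent's focused web search."""

    def __init__(self):
        self._traces = {}

    def get_or_resolve(self, question, resolve):
        # normalize whitespace/case so near-duplicate questions share a key
        key = " ".join(question.lower().split())
        if key not in self._traces:
            self._traces[key] = resolve(question)
        return self._traces[key]
```

Real deduplication would likely need semantic matching rather than string normalization, but even this crude key shows why reuse cuts redundant searches while keeping answers consistent across listings.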
[321] Computational Blueprints: Generating Isomorphic Mathematics Problems with Large Language Models
Jeong-Hoon Kim, Jinwoo Nam, Geunsik Jo
Main category: cs.AI
TL;DR: The paper introduces Isomorphic Math Problem Generation (IMPG) to create structurally consistent math problem variants for personalized education, using LLM-based CBIT framework for cost-effective generation.
Details
Motivation: To address the growing demand for large sets of similar practice problems in personalized mathematics education, bridging the gap between existing data augmentation methods and direct educational deployment needs.
Method: Developed Computational Blueprints for Isomorphic Twins (CBIT) framework using LLM-based approaches with meta-level generation and template-based selective variation to produce structurally consistent math problem variants.
Result: CBIT achieved high mathematical correctness and structural consistency while reducing generation costs. Generated problems had 17.8% lower error rate than expert-authored items, with successful deployment to 6,732 learners generating 186,870 interactions.
Conclusion: CBIT framework is superior for accurate and cost-effective isomorphic math problem generation at scale, demonstrating practical educational value through successful platform deployment.
Abstract: Personalized mathematics education is growing rapidly, creating a strong demand for large sets of similar practice problems. Yet existing studies on mathematics problem generation have focused on data augmentation for training neural language models rather than on direct educational deployment. To bridge this gap, we define a new task, Isomorphic Math Problem Generation (IMPG), designed to produce structurally consistent variants of source problems. Subsequently, we explored LLM-based frameworks for automatic IMPG through successive refinements, and established Computational Blueprints for Isomorphic Twins (CBIT). With meta-level generation and template-based selective variation, CBIT achieves high mathematical correctness and structural consistency while reducing the cost of generation. Empirical results across refinements demonstrate that CBIT is superior on generation accuracy and cost-effectiveness at scale. Most importantly, CBIT-generated problems exhibited an error rate 17.8% lower than expert-authored items, with deployment to 6,732 learners on a commercial education platform yielding 186,870 interactions.
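As a toy illustration of the blueprint idea (hypothetical template and names; CBIT itself is LLM-driven), one can hold a solution program fixed and vary only the surface slots, so every variant is correct by construction:

```python
import random

# Hypothetical sketch of template-based isomorphic generation: a source
# problem is lifted to a "blueprint" with numeric slots, and variants
# are produced by selective variation of the slots while the answer is
# recomputed from the blueprint's solution program.
def blueprint(a, b):
    text = (f"A shop sells {a} apples in the morning and {b} more in the "
            f"afternoon. How many apples were sold in total?")
    return text, a + b   # (problem variant, ground-truth answer)

def isomorphic_twins(n, seed=0):
    rng = random.Random(seed)
    return [blueprint(rng.randint(2, 50), rng.randint(2, 50))
            for _ in range(n)]
```

All variants share the same structure and solution path, which is the "isomorphic" property the task definition asks for.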
[322] Toward Practical BCI: A Real-time Wireless Imagined Speech EEG Decoding System
Ji-Ha Park, Heon-Gyu Kwak, Gi-Hwan Shin, Yoo-In Jeon, Sun-Min Park, Ji-Yeon Hwang, Seong-Whan Lee
Main category: cs.AI
TL;DR: Real-time wireless imagined speech EEG decoding system for practical BCI applications, achieving 62% accuracy on wired and 46.67% on wireless devices.
Details
Motivation: Current BCI research is limited to static environments, hindering real-world applicability. The goal is to create flexible, everyday-use BCI systems.
Method: End-to-end pipeline using lab streaming layer for real-time EEG signal processing, with user identification for personalized decoding and extensibility to wireless hardware.
Result: Achieved 62.00% accuracy for 4-class imagined speech classification on wired EEG devices and 46.67% on portable wireless headsets.
Conclusion: This represents a significant advancement towards practical, accessible BCI technology, establishing direction for robust and personalized neural interfaces.
Abstract: Brain-computer interface (BCI) research, while promising, has largely been confined to static and fixed environments, limiting real-world applicability. To move towards practical BCI, we introduce a real-time wireless imagined speech electroencephalogram (EEG) decoding system designed for flexibility and everyday use. Our framework focuses on practicality, demonstrating extensibility beyond wired EEG devices to portable, wireless hardware. A user identification module recognizes the operator and provides a personalized, user-specific service. To achieve seamless, real-time operation, we utilize the lab streaming layer to manage the continuous streaming of live EEG signals to the personalized decoder. This end-to-end pipeline enables a functional real-time application capable of classifying user commands from imagined speech EEG signals, achieving an overall 4-class accuracy of 62.00 % on a wired device and 46.67 % on a portable wireless headset. This paper demonstrates a significant step towards truly practical and accessible BCI technology, establishing a clear direction for future research in robust, practical, and personalized neural interfaces.
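A schematic of the streaming decode loop might look as follows (pure-Python stand-ins; in the real system the lab streaming layer, e.g. via pylsl, would feed the buffer and the per-user decoder would be a trained model, not a threshold):

```python
from collections import deque

# Minimal sketch of the real-time pipeline described above: a sliding
# window over a live EEG stream is passed to a personalized decoder.
# Here a plain list of samples stands in for the wireless headset.
WINDOW = 4  # samples per decoding window (real systems use seconds of data)

def decoder_for(user_id):
    # Personalized decoder selected by the user-identification module (stub).
    threshold = 0.5 if user_id == "alice" else 0.6
    return lambda win: "yes" if sum(win) / len(win) > threshold else "no"

def stream_decode(samples, user_id):
    buf, out = deque(maxlen=WINDOW), []
    decode = decoder_for(user_id)
    for s in samples:                 # continuous streaming loop
        buf.append(s)
        if len(buf) == WINDOW:        # decode once per full window
            out.append(decode(list(buf)))
    return out
```

The fixed-size `deque` models the key real-time constraint: decoding happens on a rolling window as samples arrive, not on a pre-recorded file.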
[323] Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction
Jun Xu, Xinkai Du, Yu Ao, Peilong Zhao, Yang Li, Ling Zhong, Lin Yuan, Zhongpu Bo, Xiaorui Wang, Mengshu Sun, Zhengke Gui, Dalong Zhang, Zhaoyang Wang, Qiwei Wang, Yangyang Hou, Zhiying Yin, Haofen Wang, Huajun Chen, Lei Liang, Jun Zhou
Main category: cs.AI
TL;DR: Thinker is a hierarchical thinking model that enables LLMs to perform deep search through multi-turn interactions, decomposing complex problems into verifiable sub-problems with dual natural language and logical function representations.
Details
Motivation: Previous end-to-end reinforcement learning approaches for training LLMs with external retrievers lack supervision over reasoning processes, making it difficult to ensure logical coherence and rigor in problem-solving.
Method: Decomposes complex problems into independently solvable sub-problems with dual representations (natural language + logical functions), performs knowledge boundary determination to avoid unnecessary external searches, and passes dependencies between sub-problems via logical function parameters.
Result: With only several hundred training samples, Thinker performs competitively with established baselines. When scaled to full training sets, it significantly outperforms these methods across various datasets and model sizes.
Conclusion: Thinker provides a supervisable and verifiable reasoning process that enhances logical coherence in LLM problem-solving with external knowledge bases, achieving strong performance with minimal training data.
Abstract: Efficient retrieval of external knowledge bases and web pages is crucial for enhancing the reasoning abilities of LLMs. Previous works on training LLMs to leverage external retrievers for solving complex problems have predominantly employed end-to-end reinforcement learning. However, these approaches neglect supervision over the reasoning process, making it difficult to guarantee logical coherence and rigor. To address these limitations, we propose Thinker, a hierarchical thinking model for deep search through multi-turn interaction, making the reasoning process supervisable and verifiable. It decomposes complex problems into independently solvable sub-problems, each dually represented in both natural language and an equivalent logical function to support knowledge base and web searches. Concurrently, dependencies between sub-problems are passed as parameters via these logical functions, enhancing the logical coherence of the problem-solving process. To avoid unnecessary external searches, we perform knowledge boundary determination to check if a sub-problem is within the LLM’s intrinsic knowledge, allowing it to answer directly. Experimental results indicate that with as few as several hundred training samples, the performance of Thinker is competitive with established baselines. Furthermore, when scaled to the full training set, Thinker significantly outperforms these methods across various datasets and model sizes. The source code is available at https://github.com/OpenSPG/KAG-Thinker.
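The dual representation and parameter passing can be illustrated with a toy example (hypothetical knowledge base and question; Thinker's logical functions are generated by the model, not hand-written):

```python
# Each sub-problem pairs a natural-language question with a logical
# function, and dependencies flow between sub-problems as parameters.
KB = {("France", "capital"): "Paris",
      ("Paris", "population"): "2.1M"}   # toy intrinsic knowledge

def retrieve(entity, relation):
    # Knowledge-boundary check: answer from intrinsic knowledge when
    # possible, otherwise fall back to a (stubbed) external search.
    return KB.get((entity, relation), f"<search {entity}.{relation}>")

def solve():
    # Sub-problem 1: "What is the capital of France?" -> Capital(France)
    capital = retrieve("France", "capital")
    # Sub-problem 2 depends on sub-problem 1 via its parameter:
    # "What is its population?" -> Population(Capital(France))
    return retrieve(capital, "population")
```

Because each sub-problem is an explicit call with explicit inputs, the chain is supervisable: one can check each step's answer, not just the final one.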
[324] TimeFlow: Towards Stochastic-Aware and Efficient Time Series Generation via Flow Matching Modeling
He Panjing, Cheng Mingyue, Li Li, Zhang XiaoHan
Main category: cs.AI
TL;DR: TimeFlow is a novel SDE-based flow matching framework for efficient high-quality time series generation that captures stochasticity better than traditional methods.
Details
Motivation: Existing methods struggle with modeling temporal stochasticity efficiently - diffusion models are computationally expensive while ODE-based flow matching fails to capture randomness adequately.
Method: Proposes TimeFlow using SDE-based flow matching with encoder-only architecture, component-wise decomposed velocity field, and augmented optimization with stochastic term.
Result: Outperforms strong baselines across diverse datasets in generation quality, diversity, and efficiency for both unconditional and conditional generation tasks.
Conclusion: TimeFlow provides a flexible, general framework that effectively balances generation quality with computational efficiency for time series data.
Abstract: Generating high-quality time series data has emerged as a critical research topic due to its broad utility in supporting downstream time series mining tasks. A major challenge lies in modeling the intrinsic stochasticity of temporal dynamics, as real-world sequences often exhibit random fluctuations and localized variations. While diffusion models have achieved remarkable success, their generation process is computationally inefficient, often requiring hundreds to thousands of expensive function evaluations per sample. Flow matching has emerged as a more efficient paradigm, yet its conventional ordinary differential equation (ODE)-based formulation fails to explicitly capture stochasticity, thereby limiting the fidelity of generated sequences. By contrast, stochastic differential equations (SDEs) are naturally suited for modeling randomness and uncertainty. Motivated by these insights, we propose TimeFlow, a novel SDE-based flow matching framework that integrates an encoder-only architecture. Specifically, we design a component-wise decomposed velocity field to capture the multi-faceted structure of time series and augment the vanilla flow-matching optimization with an additional stochastic term to enhance representational expressiveness. TimeFlow is flexible and general, supporting both unconditional and conditional generation tasks within a unified framework. Extensive experiments across diverse datasets demonstrate that our model consistently outperforms strong baselines in generation quality, diversity, and efficiency.
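The ODE-vs-SDE distinction at the heart of the abstract can be shown with a toy Euler/Euler-Maruyama sampler (illustrative only; the velocity field here is a hand-written stand-in, not TimeFlow's learned, component-wise decomposed field):

```python
import math
import random

# Both samplers integrate a "learned" velocity field from noise toward
# data, but the SDE variant adds a diffusion term so generated samples
# keep stochastic fluctuations instead of collapsing onto one path.
def velocity(x, t, target=1.0):
    return 5.0 * (target - x)   # stand-in for the learned velocity field

def sample(steps=100, sigma=0.1, stochastic=True, seed=0):
    rng = random.Random(seed)
    x, dt = rng.gauss(0.0, 1.0), 1.0 / steps   # start from noise
    for i in range(steps):
        x += velocity(x, i * dt) * dt          # drift (shared ODE part)
        if stochastic:                         # SDE-only diffusion term
            x += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x
```

With `stochastic=False` the sampler converges deterministically to the target; with `stochastic=True` it lands near the target but retains random variation, which is the expressiveness the SDE formulation buys.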
[325] Versatile and Risk-Sensitive Cardiac Diagnosis via Graph-Based ECG Signal Representation
Yue Wang, Yuyang Xu, Renjun Hu, Fanqi Shen, Hanyun Jiang, Jun Wang, Jintai Chen, Danny Z. Chen, Jian Wu, Haochao Ying
Main category: cs.AI
TL;DR: VARS is a graph-based ECG diagnosis method that handles diverse signal configurations and improves risk signal detection through denoising reconstruction and contrastive learning.
Details
Motivation: Address limitations of current ECG deep learning methods: lack of versatility for diverse signal configurations and inadequate detection of risk signals due to sample imbalances.
Method: Transform ECG signals into graph structures to uniformly model heterogeneous signals, integrating denoising reconstruction with contrastive learning to preserve raw information while highlighting diagnostic patterns.
Result: Consistently surpasses state-of-the-art models across three ECG datasets, with substantial improvement in identifying risk signals and offering interpretability by pinpointing exact waveforms.
Conclusion: VARS emerges as an invaluable tool for comprehensive cardiac health assessment by providing versatile, risk-sensitive, and interpretable ECG diagnosis.
Abstract: Despite the rapid advancements of electrocardiogram (ECG) signal diagnosis and analysis methods through deep learning, two major hurdles still limit their clinical adoption: the lack of versatility in processing ECG signals with diverse configurations, and the inadequate detection of risk signals due to sample imbalances. Addressing these challenges, we introduce VersAtile and Risk-Sensitive cardiac diagnosis (VARS), an innovative approach that employs a graph-based representation to uniformly model heterogeneous ECG signals. VARS stands out by transforming ECG signals into versatile graph structures that capture critical diagnostic features, irrespective of signal diversity in the lead count, sampling frequency, and duration. This graph-centric formulation also enhances diagnostic sensitivity, enabling precise localization and identification of abnormal ECG patterns that often elude standard analysis methods. To facilitate representation transformation, our approach integrates denoising reconstruction with contrastive learning to preserve raw ECG information while highlighting pathognomonic patterns. We rigorously evaluate the efficacy of VARS on three distinct ECG datasets, encompassing a range of structural variations. The results demonstrate that VARS not only consistently surpasses existing state-of-the-art models across all these datasets but also exhibits substantial improvement in identifying risk signals. Additionally, VARS offers interpretability by pinpointing the exact waveforms that lead to specific model outputs, thereby assisting clinicians in making informed decisions. These findings suggest that our VARS will likely emerge as an invaluable tool for comprehensive cardiac health assessment.
[326] Towards Fine-Grained Interpretability: Counterfactual Explanations for Misclassification with Saliency Partition
Lintong Zhang, Kang Yin, Seong-Whan Lee
Main category: cs.AI
TL;DR: Proposes a fine-grained counterfactual explanation framework for visual interpretability that generates object-level and part-level explanations to identify misclassification causes and local feature influences.
Details
Motivation: Attribution-based explanation techniques lack sufficient granularity for fine-grained tasks and model misclassification cases, where detailed insights are needed.
Method: Non-generative counterfactual explanation approach using similarity quantification and component weighting between correctly classified and misclassified samples, with a saliency partition module based on Shapley values.
Result: The approach captures more granular and intuitively meaningful regions than existing fine-grained methods, demonstrating superiority in experiments.
Conclusion: The proposed framework effectively addresses fine-grained interpretability needs by providing detailed counterfactual explanations at both object and part levels.
Abstract: Attribution-based explanation techniques capture key patterns to enhance visual interpretability; however, these patterns often lack the granularity needed for insight in fine-grained tasks, particularly in cases of model misclassification, where explanations may be insufficiently detailed. To address this limitation, we propose a fine-grained counterfactual explanation framework that generates both object-level and part-level interpretability, addressing two fundamental questions: (1) which fine-grained features contribute to model misclassification, and (2) where dominant local features influence counterfactual adjustments. Our approach yields explainable counterfactuals in a non-generative manner by quantifying similarity and weighting component contributions within regions of interest between correctly classified and misclassified samples. Furthermore, we introduce a saliency partition module grounded in Shapley value contributions, isolating features with region-specific relevance. Extensive experiments demonstrate the superiority of our approach in capturing more granular, intuitively meaningful regions, surpassing fine-grained methods.
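The saliency partition module is "grounded in Shapley value contributions"; the generic exact Shapley computation it builds on can be sketched as follows (the textbook formula over all coalitions, not the paper's code, and only practical for a handful of features):

```python
from itertools import combinations
from math import factorial

# Exact Shapley values: each feature's score is its average marginal
# contribution over all coalitions of the remaining features, weighted
# by |S|!(n-|S|-1)!/n! for a coalition S.
def shapley(features, value):
    n = len(features)
    scores = {}
    for f in features:
        rest = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for coal in combinations(rest, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(coal) | {f}) - value(set(coal)))
        scores[f] = total
    return scores
```

For an additive value function the Shapley score of each feature equals its individual contribution, which makes the implementation easy to sanity-check.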
[327] Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models
Wenhan Yu, Xinbo Lin, Lanxin Ni, Jinhua Cheng, Lei Sha
Main category: cs.AI
TL;DR: MSLR is the first Chinese multi-step legal reasoning dataset using IRAC framework from real judicial decisions, with Human-LLM collaborative annotation pipeline. LLMs show moderate performance, but Self-Initiated Chain-of-Thought prompts improve reasoning quality.
Details
Motivation: Existing legal benchmarks conflate factual recall with genuine inference, fragment reasoning process, and overlook reasoning quality. Need for structured legal reasoning evaluation grounded in real judicial decision making.
Method: Created MSLR dataset using IRAC framework (Issue, Rule, Application, Conclusion) from official legal documents. Developed scalable Human-LLM collaborative annotation pipeline for fine-grained step-level reasoning annotations.
Result: Multiple LLMs show only moderate performance on MSLR, highlighting challenges in complex legal reasoning. Self-Initiated Chain-of-Thought prompts outperform human-designed prompts, improving reasoning coherence and quality.
Conclusion: MSLR advances LLM reasoning and Chain-of-Thought strategies, providing open resources for future research on structured legal reasoning evaluation.
Abstract: Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts generated by models autonomously improve reasoning coherence and quality, outperforming human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR-Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.
[328] Capturing Complex Spatial-Temporal Dependencies in Traffic Forecasting: A Self-Attention Approach
Zheng Chenghong, Zongyin Deng, Liu Cheng, Xiong Simin, Di Deshi, Li Guanyao
Main category: cs.AI
TL;DR: ST-SAM is a novel spatial-temporal self-attention model for traffic forecasting that captures joint spatial-temporal dependencies using self-attention mechanisms, achieving significant improvements in accuracy and efficiency over state-of-the-art methods.
Details
Motivation: Existing traffic forecasting approaches study spatial and temporal dependencies in a decoupled manner, failing to capture their joint effect, which is crucial for accurate predictions of region inflow and outflow.
Method: ST-SAM uses a region embedding layer to learn time-specific embeddings and employs a spatial-temporal dependency learning module based on self-attention mechanism to capture joint dependencies for both nearby and faraway regions.
Result: Extensive experiments on two real-world datasets show ST-SAM achieves average improvements of up to 15% on RMSE, 17% on MAPE, and 32 times faster training time compared to state-of-the-art approaches.
Conclusion: ST-SAM effectively captures both local and global spatial-temporal correlations using self-attention, making it a highly accurate and efficient solution for traffic forecasting problems.
Abstract: We study the problem of traffic forecasting, aiming to predict the inflow and outflow of a region in the subsequent time slot. The problem is complex due to the intricate spatial and temporal interdependence among regions. Prior works study the spatial and temporal dependency in a decoupled manner, failing to capture their joint effect. In this work, we propose ST-SAM, a novel and efficient Spatial-Temporal Self-Attention Model for traffic forecasting. ST-SAM uses a region embedding layer to learn time-specific embedding from traffic data for regions. Then, it employs a spatial-temporal dependency learning module based on a self-attention mechanism to capture the joint spatial-temporal dependency for both nearby and faraway regions. ST-SAM relies entirely on self-attention to capture both local and global spatial-temporal correlations, which makes it effective and efficient. Extensive experiments on two real-world datasets show that ST-SAM is substantially more accurate and efficient than the state-of-the-art approaches (with an average improvement of up to 15% on RMSE, 17% on MAPE, and 32 times on training time in our experiments).
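The "joint" in joint spatial-temporal attention means every (region, time-slot) pair becomes one token, so a single attention pass can relate any region at any time to any other. A dependency-free sketch of that core operation (scaled dot-product attention without learned projections; not the full ST-SAM architecture):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def joint_attention(tokens):
    # tokens: list of feature vectors, one per (region, time) pair.
    # Each output token is an attention-weighted mix of all tokens,
    # i.e. of all regions at all time slots simultaneously.
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in tokens])
        out.append([sum(w * k[j] for w, k in zip(scores, tokens))
                    for j in range(d)])
    return out
```

Because attention weights are computed between every token pair, both nearby and faraway region-time combinations contribute; decoupled designs would restrict attention to one axis at a time.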
[329] The One Where They Brain-Tune for Social Cognition: Multi-Modal Brain-Tuning on Friends
Nico Policzer, Cameron Braunstein, Mariya Toneva
Main category: cs.AI
TL;DR: Brain-tuning a multimodal audio-video model to STS improves brain alignment and sarcasm detection performance.
Details
Motivation: To extend brain-tuning approach from audio models to multimodal audio-video domain to enhance social cognition, specifically targeting the Superior Temporal Sulcus (STS) region.
Method: Fine-tuning a multimodal audio-video model to better predict fMRI activity from subjects watching Friends, targeting the STS region for social processing.
Result: Significant increases in brain alignment to STS and adjacent ROI, and improvements in sarcasm detection in sitcoms as a social cognition task.
Conclusion: Brain-tuning can be successfully extended to multimodal domain, improving downstream task performance after tuning to relevant functional brain regions.
Abstract: Recent studies on audio models show brain-tuning - fine-tuning models to better predict corresponding fMRI activity - improves brain alignment and increases performance on downstream semantic and audio tasks. We extend this approach to a multimodal audio-video model to enhance social cognition, targeting the Superior Temporal Sulcus (STS), a key region for social processing, while subjects watch Friends. We find significant increases in brain alignment to the STS and an adjacent ROI, as well as improvements to a social cognition task related to the training data - sarcasm detection in sitcoms. In summary, our study extends brain-tuning to the multi-modal domain, demonstrating improvements to a downstream task after tuning to a relevant functional region.
[330] VSPO: Validating Semantic Pitfalls in Ontology via LLM-Based CQ Generation
Hyojun Choi, Seokju Hwang, Kyong-Ho Lee
Main category: cs.AI
TL;DR: A novel approach using fine-tuned LLMs to automatically generate competency questions that validate semantic pitfalls in ontology design, outperforming GPT-4.1 with 26% higher precision and 28.2% higher recall.
Details
Motivation: Manual crafting of competency questions for ontology validation is time-consuming and costly, while existing automated approaches fail to detect semantic pitfalls like 'Misusing allValuesFrom' that cannot be reliably identified through rule-based methods.
Method: Fine-tuned LLaMA-3.1-8B-Instruct to generate CQs that validate semantic discrepancies by introducing misalignments between natural language definitions and ontology axioms through axiom removal or logical operator alterations.
Result: The fine-tuned model demonstrated superior performance over baselines, achieving 26% higher precision and 28.2% higher recall than GPT-4.1 in generating CQs for pitfall validation, and can detect a broader range of modeling errors than existing datasets.
Conclusion: This research enables automatic generation of TBox-validating CQs using LLMs, significantly reducing manual effort while improving semantic alignment between ontologies and expert knowledge, representing the first study to target semantic pitfall validation in CQ generation using LLMs.
Abstract: Competency Questions (CQs) play a crucial role in validating ontology design. While manually crafting CQs can be highly time-consuming and costly for ontology engineers, recent studies have explored the use of large language models (LLMs) to automate this process. However, prior approaches have largely evaluated generated CQs based on their similarity to existing datasets, which often fail to verify semantic pitfalls such as “Misusing allValuesFrom”. Since such pitfalls cannot be reliably detected through rule-based methods, we propose a novel dataset and model of Validating Semantic Pitfalls in Ontology (VSPO) for CQ generation specifically designed to verify the semantic pitfalls. To simulate missing and misused axioms, we use LLMs to generate natural language definitions of classes and properties and introduce misalignments between the definitions and the ontology by removing axioms or altering logical operators (e.g., substituting union with intersection). We then fine-tune LLaMA-3.1-8B-Instruct to generate CQs that validate these semantic discrepancies between the provided definitions and the corresponding axioms. The resulting CQs can detect a broader range of modeling errors compared to existing public datasets. Our fine-tuned model demonstrates superior performance over baselines, showing 26% higher precision and 28.2% higher recall than GPT-4.1 in generating CQs for pitfall validation. This research enables automatic generation of TBox-validating CQs using LLMs, significantly reducing manual effort while improving semantic alignment between ontologies and expert knowledge. To the best of our knowledge, this is the first study to target semantic pitfall validation in CQ generation using LLMs.
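The pitfall-simulation step (altering logical operators while keeping the natural-language definition fixed) can be illustrated with a toy string transform. Operator names follow OWL usage; a real implementation would parse the axioms rather than substitute substrings:

```python
# Misalignments are introduced by altering a logical operator in an
# axiom while its natural-language definition stays unchanged, so a
# good competency question should expose the discrepancy.
def perturb_axiom(axiom, kind):
    if kind == "union_to_intersection":   # substitute union with intersection
        return axiom.replace("or", "and")
    if kind == "some_to_only":            # someValuesFrom -> allValuesFrom
        return axiom.replace("some", "only")
    return axiom
```

For example, perturbing "Pizza subClassOf hasTopping some Topping" with `some_to_only` yields the classic "Misusing allValuesFrom" pitfall, which the generated CQs are meant to catch.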
[331] Enhancing Logical Expressiveness in Graph Neural Networks via Path-Neighbor Aggregation
Han Yu, Xiaojuan Zhao, Aiping Li, Kai Chen, Ziniu Liu, Zhichao Peng
Main category: cs.AI
TL;DR: PN-GNN enhances GNNs’ logical expressive power for knowledge graph reasoning by aggregating node-neighbor embeddings on reasoning paths, showing superior expressiveness over existing methods.
Details
Motivation: Existing GNN studies focus on simple single-relation graphs, with insufficient discussion on logical rule expression in knowledge graphs. Enhancing GNNs' logical expressive power remains a key challenge.
Method: Proposed Path-Neighbor enhanced GNN (PN-GNN) that aggregates node-neighbor embeddings on reasoning paths to enhance logical expressive power.
Result: Theoretical analysis shows PN-GNN has strictly stronger expressive power than C-GNN, with (k+1)-hop logical expressiveness superior to k-hop. Experiments on 6 synthetic and 2 real-world datasets confirm enhanced logical rule expression without compromising generalization.
Conclusion: PN-GNN successfully enhances GNNs’ logical expressive power for knowledge graph reasoning while maintaining competitive performance in reasoning tasks.
Abstract: Graph neural networks (GNNs) can effectively model structural information of graphs, making them widely used in knowledge graph (KG) reasoning. However, existing studies on the expressive power of GNNs mainly focuses on simple single-relation graphs, and there is still insufficient discussion on the power of GNN to express logical rules in KGs. How to enhance the logical expressive power of GNNs is still a key issue. Motivated by this, we propose Path-Neighbor enhanced GNN (PN-GNN), a method to enhance the logical expressive power of GNN by aggregating node-neighbor embeddings on the reasoning path. First, we analyze the logical expressive power of existing GNN-based methods and point out the shortcomings of the expressive power of these methods. Then, we theoretically investigate the logical expressive power of PN-GNN, showing that it not only has strictly stronger expressive power than C-GNN but also that its $(k+1)$-hop logical expressiveness is strictly superior to that of $k$-hop. Finally, we evaluate the logical expressive power of PN-GNN on six synthetic datasets and two real-world datasets. Both theoretical analysis and extensive experiments confirm that PN-GNN enhances the expressive power of logical rules without compromising generalization, as evidenced by its competitive performance in KG reasoning tasks.
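One reading of "aggregating node-neighbor embeddings on the reasoning path" can be sketched as follows (our schematic interpretation of the abstract, not the authors' code; scalar embeddings and mean pooling for brevity):

```python
# For each node on a reasoning path, pool the embeddings of its
# neighbors and combine the result with the node's own embedding,
# then summarize the enriched path. Off-path neighbors thus inject
# extra context that a plain path encoding would miss.
def aggregate_path(path, neighbors, emb):
    pooled = []
    for v in path:
        neigh = neighbors.get(v, [])
        mean = (sum(emb[u] for u in neigh) / len(neigh)) if neigh else 0.0
        pooled.append(emb[v] + mean)      # node + path-neighbor signal
    return sum(pooled) / len(pooled)      # path-level summary
```

In the toy example below, node "c" never appears on the path "a -> b", yet it still influences the summary through b's neighborhood, which is the extra expressive power the method targets.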
[332] Multivariate Time Series Anomaly Detection: A Framework of Hidden Markov Models
Jinbo Li, Witold Pedrycz, Iqbal Jamal
Main category: cs.AI
TL;DR: The paper presents a multivariate time series anomaly detection method that transforms multivariate data to univariate using FCM clustering and fuzzy integrals, then applies HMM for anomaly detection.
Details
Motivation: To develop an effective approach for multivariate time series anomaly detection by leveraging transformation techniques to simplify the complexity of multivariate data.
Method: Transform multivariate time series to univariate using Fuzzy C-Means clustering and fuzzy integrals, then apply Hidden Markov Models for anomaly detection.
Result: Experimental studies and comparative analysis show the effectiveness of the proposed transformation methods combined with HMM for anomaly detection.
Conclusion: The approach successfully detects anomalies in multivariate time series through transformation to univariate data and HMM modeling, with promising experimental results.
Abstract: In this study, we develop an approach to multivariate time series anomaly detection focused on the transformation of multivariate time series to univariate time series. Several transformation techniques involving Fuzzy C-Means (FCM) clustering and fuzzy integral are studied. In the sequel, a Hidden Markov Model (HMM), one of the commonly encountered statistical methods, is engaged here to detect anomalies in multivariate time series. We construct HMM-based anomaly detectors and in this context compare several transformation methods. A suite of experimental studies along with some comparative analysis is reported.
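The transformation step alone can be sketched in a simplified form (fuzzy memberships to fixed cluster centers with a naive defuzzification; the paper's pipeline learns the centers via FCM and uses fuzzy integrals, and the resulting univariate series is what the HMM then models):

```python
# Reduce each multivariate sample to one value via fuzzy memberships
# to cluster centers (standard FCM membership formula, fuzzifier m=2).
def memberships(x, centers, m=2.0):
    d = [max(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) ** 0.5, 1e-12)
         for c in centers]
    return [1.0 / sum((d[i] / d[j]) ** (2 / (m - 1)) for j in range(len(d)))
            for i in range(len(d))]

def to_univariate(series, centers):
    # Collapse each sample to the index-weighted membership (a simple
    # defuzzification; the paper's fuzzy-integral step is more refined).
    return [sum(i * u for i, u in enumerate(memberships(x, centers)))
            for x in series]
```

A sample at a center maps to that center's index, and a sample midway between two centers maps to the midpoint, so the univariate trace preserves which regime each multivariate observation belongs to.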
[333] Combining LLM Semantic Reasoning with GNN Structural Modeling for Multi-view Multi-Label Feature Selection
Zhiqi Chen, Yuzhou Liu, Jiarui Liu, Wanfu Gao
Main category: cs.AI
TL;DR: Proposes a novel MVMLFS method combining LLM semantic reasoning with GNN structural modeling to jointly leverage semantic and statistical information for feature selection in multi-view multi-label learning.
Details
Motivation: Existing MVMLFS methods mainly focus on statistical information but neglect semantic information, which is crucial for understanding complex relationships in high-dimensional, multimodal data from domains like social media and bioinformatics.
Method: Three components: (1) LLM as evaluation agent to assess semantic relevance among features, views, and labels; (2) Semantic-aware heterogeneous graph with semantic and statistical subgraphs; (3) Lightweight GAT to learn node embeddings as feature saliency scores.
Result: Experimental results on multiple benchmarks show superiority over state-of-the-art baselines, with effectiveness maintained on small-scale datasets, demonstrating robustness, flexibility, and generalization ability.
Conclusion: The proposed method successfully integrates semantic and statistical information through LLM-GNN collaboration, providing an effective solution for MVMLFS that outperforms existing approaches and works well across different dataset scales.
Abstract: Multi-view multi-label feature selection aims to identify informative features from heterogeneous views, where each sample is associated with multiple interdependent labels. This problem is particularly important in machine learning involving high-dimensional, multimodal data such as social media, bioinformatics or recommendation systems. Existing Multi-View Multi-Label Feature Selection (MVMLFS) methods mainly focus on analyzing statistical information of data, but seldom consider semantic information. In this paper, we aim to use these two types of information jointly and propose a method that combines Large Language Models (LLMs) semantic reasoning with Graph Neural Networks (GNNs) structural modeling for MVMLFS. Specifically, the method consists of three main components. (1) LLM is first used as an evaluation agent to assess the latent semantic relevance among feature, view, and label descriptions. (2) A semantic-aware heterogeneous graph with two levels is designed to represent relations among features, views and labels: one is a semantic graph representing semantic relations, and the other is a statistical graph. (3) A lightweight Graph Attention Network (GAT) is applied to learn node embedding in the heterogeneous graph as feature saliency scores for ranking and selection. Experimental results on multiple benchmark datasets demonstrate the superiority of our method over state-of-the-art baselines, and it is still effective when applied to small-scale datasets, showcasing its robustness, flexibility, and generalization ability.
[334] Numerical Sensitivity and Robustness: Exploring the Flaws of Mathematical Reasoning in Large Language Models
Zhishen Sun, Guang Dai, Ivor Tsang, Haishan Ye
Main category: cs.AI
TL;DR: A perturbation framework reveals LLMs’ mathematical reasoning limitations: performance degrades significantly with numerical perturbations, showing reliance on pattern matching rather than true logical reasoning.
Details
Motivation: To investigate whether LLMs truly possess mathematical understanding or merely rely on pattern matching, by testing their robustness in complex reasoning environments.
Method: Proposed a perturbation framework injecting semantically irrelevant sentences with gradually increasing intensity, plus a core-questioning-instruction-missing condition to analyze problem-solving mechanisms.
Result: LLMs show stable performance with non-numerical perturbations but have robustness boundaries. Performance drops significantly with numerical perturbations (up to 51.55% for open-source models, 3-10% for commercial models). Models maintain 20-40% accuracy even without core questioning instructions.
Conclusion: Current LLMs have significant limitations in mathematical reasoning, relying more on memory templates and pattern matching than logical reasoning, which is crucial for their further development.
Abstract: LLMs have made significant progress in mathematical reasoning, but whether they truly possess mathematical understanding remains controversial. To explore this issue, we propose a new perturbation framework that evaluates LLMs’ reasoning ability in complex environments by injecting additional, semantically irrelevant perturbation sentences and gradually increasing the perturbation intensity. We also use an additional perturbation method, core questioning instruction missing, to further analyze the LLMs’ problem-solving mechanism. The experimental results show that LLMs perform stably when facing perturbation sentences without numbers, but a robustness boundary still exists: as the perturbation intensity increases, performance declines to varying degrees. When facing perturbation sentences with numbers, performance decreases more significantly; most smaller open-source models drop by nearly or even more than 10%, with the decline growing as perturbation intensity increases, up to a maximum of 51.55%. Even the most advanced commercial LLMs see a 3%-10% performance drop. By analyzing the reasoning process of LLMs in detail, we find that models are more sensitive to perturbations with numerical information and are more likely to give incorrect answers when disturbed by irrelevant numerical information; the higher the perturbation intensity, the more pronounced these defects become. Meanwhile, in the absence of the core questioning instruction, models can still maintain an accuracy of 20%-40%, indicating that LLMs may rely on memory templates or pattern matching to complete the task, rather than logical reasoning. Overall, our work reveals the shortcomings and limitations of current LLMs’ reasoning capabilities, which is of great significance for their further development.
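The perturbation setup can be sketched in a few lines: irrelevant sentences are appended to a question at increasing intensity. This is a minimal illustration only; the distractor pool, sampling scheme, and function names are ours, not the authors' actual setup.

```python
import random

def perturb(question, distractors, intensity, seed=0):
    """Append `intensity` semantically irrelevant sentences to a question."""
    rng = random.Random(seed)
    chosen = rng.sample(distractors, k=min(intensity, len(distractors)))
    return question + " " + " ".join(chosen)

# Distractors containing numbers are the kind the paper finds most damaging.
POOL = [
    "The bakery across the street sold 17 muffins yesterday.",
    "A nearby train departs every 45 minutes.",
    "The weather report predicted rain for the afternoon.",
]
question = "Tom has 3 apples and buys 5 more. How many apples does he have?"
perturbed = perturb(question, POOL, intensity=2)
```

Sweeping `intensity` from 0 upward reproduces the paper's notion of gradually increasing perturbation strength.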
[335] Knowledge-Augmented Long-CoT Generation for Complex Biomolecular Reasoning
Tianwen Lyu, Xiang Zhuang, Keyan Ding, Xinzhe Cao, Lei Liang, Wei Zhao, Qiang Zhang, Huajun Chen
Main category: cs.AI
TL;DR: Proposes Knowledge-Augmented Long-CoT Reasoning framework that integrates LLMs with knowledge graph-based multi-hop reasoning chains for biomolecular reasoning, achieving state-of-the-art performance on complex multi-hop tasks.
Details
Motivation: Address logical inconsistencies and lack of domain knowledge grounding in LLMs for biomolecular reasoning, where existing approaches often deviate from biological facts or fail to capture long mechanistic dependencies.
Method: Framework constructs mechanistic chains via guided multi-hop traversal and pruning on knowledge graphs, incorporates chains into supervised fine-tuning for factual grounding, and refines with reinforcement learning for reasoning reliability.
Result: Achieves state-of-the-art performance on multi-hop tasks requiring traversal of structured biological knowledge, with clear advantages as reasoning depth increases compared to larger closed-source models.
Conclusion: Combining structured knowledge with advanced reasoning strategies enables reliable and interpretable biomolecular reasoning, highlighting the effectiveness of knowledge-augmented approaches for complex biological problems.
Abstract: Understanding complex biomolecular mechanisms requires multi-step reasoning across molecular interactions, signaling cascades, and metabolic pathways. While large language models (LLMs) show promise in such tasks, their application to biomolecular problems is hindered by logical inconsistencies and the lack of grounding in domain knowledge. Existing approaches often exacerbate these issues: reasoning steps may deviate from biological facts or fail to capture long mechanistic dependencies. To address these challenges, we propose a Knowledge-Augmented Long-CoT Reasoning framework that integrates LLMs with knowledge graph-based multi-hop reasoning chains. The framework constructs mechanistic chains via guided multi-hop traversal and pruning on the knowledge graph; these chains are then incorporated into supervised fine-tuning to improve factual grounding and further refined with reinforcement learning to enhance reasoning reliability and consistency. Furthermore, to overcome the shortcomings of existing benchmarks, which are often restricted in scale and scope and lack annotations for deep reasoning chains, we introduce PrimeKGQA, a comprehensive benchmark for biomolecular question answering. Experimental results on both PrimeKGQA and existing datasets demonstrate that although larger closed-source models still perform well on relatively simple tasks, our method demonstrates clear advantages as reasoning depth increases, achieving state-of-the-art performance on multi-hop tasks that demand traversal of structured biological knowledge. These findings highlight the effectiveness of combining structured knowledge with advanced reasoning strategies for reliable and interpretable biomolecular reasoning.
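The chain-construction step (guided multi-hop traversal with pruning) can be sketched as a bounded-depth path search over a relation-labelled graph. The toy graph, relation names, and hop limit below are illustrative, not drawn from PrimeKG or the paper's actual traversal policy.

```python
from collections import deque

def mechanistic_chains(graph, start, goal, max_hops=3):
    """Enumerate relation-labelled paths from start to goal,
    pruning any branch that exceeds max_hops."""
    chains, queue = [], deque([[(start, None)]])
    while queue:
        path = queue.popleft()
        node = path[-1][0]
        if node == goal:
            chains.append(path)
            continue
        if len(path) - 1 >= max_hops:           # prune: chain already too long
            continue
        for rel, nxt in graph.get(node, []):
            if all(nxt != n for n, _ in path):  # avoid cycles
                queue.append(path + [(nxt, rel)])
    return chains

# Toy signaling cascade as adjacency lists of (relation, target) pairs.
KG = {
    "EGFR": [("activates", "RAS")],
    "RAS":  [("activates", "RAF")],
    "RAF":  [("phosphorylates", "MEK")],
}
chains = mechanistic_chains(KG, "EGFR", "MEK")
```

Chains found this way would then serve as supervision targets for fine-tuning, per the paper's pipeline.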
[336] Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens’ worth of agentic AI evaluations
JV Roig
Main category: cs.AI
TL;DR: KAMI v0.1 is an enterprise-focused benchmark that addresses LLM training data contamination and evaluates agentic capabilities, showing traditional benchmarks poorly predict practical performance.
Details
Motivation: Traditional LLM benchmarks suffer from training data contamination and fail to measure agentic capabilities needed for enterprise deployment scenarios.
Method: Developed Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, processing 170,000 LLM test items over 5.5 billion tokens across 35 model configurations.
Result: Traditional benchmark rankings poorly predict practical agentic performance; newer models don’t always outperform older variants on enterprise tasks, contradicting traditional trends.
Conclusion: KAMI provides critical insights on cost-performance tradeoffs, behavioral patterns, and reasoning impact on token efficiency for enterprise deployment decisions.
Abstract: Enterprise adoption of agentic AI systems requires reliable evaluation methods that reflect real-world deployment scenarios. Traditional LLM benchmarks suffer from training data contamination and fail to measure agentic capabilities such as multi-step tool use and decision-making under uncertainty. We present the Kamiwaza Agentic Merit Index (KAMI) v0.1, an enterprise-focused benchmark that addresses both contamination resistance and agentic evaluation. Through 170,000 LLM test items processing over 5.5 billion tokens across 35 model configurations, we demonstrate that traditional benchmark rankings poorly predict practical agentic performance. Notably, newer generation models like Llama 4 or Qwen 3 do not always outperform their older generation variants on enterprise-relevant tasks, contradicting traditional benchmark trends. We also present insights on cost-performance tradeoffs, model-specific behavioral patterns, and the impact of reasoning capabilities on token efficiency – findings critical for enterprises making deployment decisions.
[337] Dual-Process Scaffold Reasoning for Enhancing LLM Code Debugging
Po-Chung Hsieh, Chin-Po Chen, Jeng-Lin Li, Ming-Ching Chang
Main category: cs.AI
TL;DR: The paper introduces Scaffold Reasoning, a psychologically-inspired framework for code debugging that combines scaffold, analytic, and integration streams to optimize reasoning steps, achieving 88.91% pass rate on DebugBench with improved efficiency.
Details
Motivation: Current LLMs lack systematic exploration of System 2 reasoning steps that balance complexity and computational efficiency, despite drawing from psychological theories. There's a need for deeper investigation into intermediate reasoning processes.
Method: Proposes Scaffold Reasoning framework with three streams: Scaffold Stream (constructs reference code), Analytic Stream (analyzes buggy code), and Integration Stream (combines results from both streams).
Result: Achieves 88.91% pass rate and 5.36 seconds average inference time per problem on DebugBench, outperforming other reasoning approaches across various LLMs in both accuracy and efficiency.
Conclusion: The framework aligns with human cognitive processes and demonstrates advantages across different problem difficulties and bug types, though limitations exist in certain cognitive pathways.
Abstract: Recent LLMs have demonstrated sophisticated problem-solving capabilities on various benchmarks through advanced reasoning algorithms. However, the key research question of identifying reasoning steps that balance complexity and computational efficiency remains unsolved. Recent research has increasingly drawn upon psychological theories to explore strategies for optimizing cognitive pathways. The LLM’s final outputs and intermediate steps are regarded as System 1 and System 2, respectively. However, an in-depth exploration of System 2 reasoning is still lacking. We therefore propose a novel, psychologically backed Scaffold Reasoning framework for code debugging, which encompasses the Scaffold Stream, Analytic Stream, and Integration Stream. The Integration Stream combines the reference code constructed by the Scaffold Stream with the buggy-code analysis produced by the Analytic Stream. Our framework achieves an 88.91% pass rate and an average inference time of 5.36 seconds per problem on DebugBench, outperforming other reasoning approaches across various LLMs in both reasoning accuracy and efficiency. Further analyses elucidate the advantages and limitations of various cognitive pathways across varying problem difficulties and bug types. Our findings also corroborate the alignment of the proposed Scaffold Reasoning framework with human cognitive processes.
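The three-stream architecture amounts to an orchestration pattern over model calls. A minimal sketch, assuming `llm` is any prompt-to-text callable; the prompt wording below is our guess at the framework's intent, not the paper's actual prompts.

```python
def scaffold_debug(buggy_code, llm):
    """Three-stream orchestration: build reference code (Scaffold Stream),
    analyze the bug (Analytic Stream), then merge both (Integration Stream)."""
    reference = llm("Write a correct reference implementation for the task "
                    "solved by this code:\n" + buggy_code)
    analysis = llm("List the bugs in this code:\n" + buggy_code)
    return llm("Using the reference implementation:\n" + reference +
               "\nand the bug analysis:\n" + analysis +
               "\nproduce the fixed code.")

# Demo with a stub that records each stream's prompt instead of a real model.
calls = []
def stub_llm(prompt):
    calls.append(prompt)
    return "ok"

fixed = scaffold_debug("def add(a, b):\n    return a - b", stub_llm)
```

Swapping `stub_llm` for a real client call is all that changes in a live setup.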
[338] MSCR: Exploring the Vulnerability of LLMs’ Mathematical Reasoning Abilities Using Multi-Source Candidate Replacement
Zhishen Sun, Guang Dai, Haishan Ye
Main category: cs.AI
TL;DR: MSCR is an automated adversarial attack method that uses multi-source candidate replacement to test LLM robustness in mathematical reasoning. It shows that minor single-word perturbations can significantly reduce model accuracy while preserving semantic meaning.
Details
Motivation: LLMs show human-level performance in mathematical reasoning but their robustness under minor input perturbations lacks systematic investigation. Existing methods have limitations in scalability, semantic preservation, and cost.
Method: Propose MSCR method combining three information sources: cosine similarity in LLM embedding space, WordNet dictionary, and masked language model contextual predictions. Generate semantically similar word candidates, filter them, and substitute one by one to attack.
Result: Single-word perturbations significantly reduce accuracy across all models (max drop: 49.89% on GSM8K, 35.40% on MATH500) while maintaining high semantic consistency. Perturbations also increase response length, leading to redundant reasoning paths and higher computational costs.
Conclusion: Current LLMs have significant robustness deficiencies and efficiency bottlenecks in mathematical reasoning tasks, as minor perturbations can substantially impact performance and computational efficiency.
Abstract: LLMs demonstrate performance comparable to human abilities in complex tasks such as mathematical reasoning, but their robustness in mathematical reasoning under minor input perturbations still lacks systematic investigation. Existing methods generally suffer from limited scalability, weak semantic preservation, and high costs. Therefore, we propose MSCR, an automated adversarial attack method based on multi-source candidate replacement. By combining three information sources including cosine similarity in the embedding space of LLMs, the WordNet dictionary, and contextual predictions from a masked language model, we generate for each word in the input question a set of semantically similar candidates, which are then filtered and substituted one by one to carry out the attack. We conduct large-scale experiments on LLMs using the GSM8K and MATH500 benchmarks. The results show that even a slight perturbation involving only a single word can significantly reduce the accuracy of all models, with the maximum drop reaching 49.89% on GSM8K and 35.40% on MATH500, while preserving the high semantic consistency of the perturbed questions. Further analysis reveals that perturbations not only lead to incorrect outputs but also substantially increase the average response length, which results in more redundant reasoning paths and higher computational resource consumption. These findings highlight the robustness deficiencies and efficiency bottlenecks of current LLMs in mathematical reasoning tasks.
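The multi-source candidate step can be sketched as: merge WordNet synonyms and masked-LM predictions into one pool, then rank by cosine similarity in the embedding space. The toy embedding table, synonym dict, and MLM predictions below stand in for real models; they are not the paper's resources.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def candidates(word, embeddings, wordnet_syns, mlm_preds, top_k=3):
    """Merge candidates from the WordNet and masked-LM sources, then rank
    them by embedding-space cosine similarity to the original word."""
    pool = set(wordnet_syns.get(word, [])) | set(mlm_preds.get(word, []))
    pool.discard(word)
    scored = [(cosine(embeddings[word], embeddings[c]), c)
              for c in pool if c in embeddings]
    return [c for _, c in sorted(scored, reverse=True)[:top_k]]

EMB = {"buys": [1.0, 0.0], "purchases": [0.9, 0.1],
       "acquires": [0.8, 0.3], "sells": [-1.0, 0.0]}
SYN = {"buys": ["purchases", "acquires"]}   # WordNet-style synonyms
MLM = {"buys": ["sells"]}                   # masked-LM fill-ins
ranked = candidates("buys", EMB, SYN, MLM)
```

The attack then substitutes ranked candidates one at a time and re-queries the model, per the abstract.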
[339] Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression
Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li
Main category: cs.AI
TL;DR: The paper introduces ‘information capacity’ as a unified metric for LLM efficiency, measuring text compression performance relative to computational complexity, which enables fair comparisons across different model sizes and architectures.
Details
Motivation: The rapid advancement of LLMs and their expanding applications create soaring computational demands, with test-time scaling exacerbating the tension between model capability and resource consumption. Current metrics fail to provide unified efficiency comparisons across different model sizes and architectures.
Method: Propose information capacity as a measure based on text compression performance relative to computational complexity. Evaluate 49 models on 5 heterogeneous datasets, considering factors like tokenizer efficiency, pretraining data, and mixture-of-experts architecture.
Result: Models of varying sizes within a series exhibit consistent information capacity. The metric enables fair efficiency comparisons across model series and accurate performance prediction within a model series. Tokenizer efficiency significantly impacts both input and output token counts, which is often neglected in LLM evaluations.
Conclusion: Information capacity provides a unified framework for evaluating LLM efficiency that incorporates computational complexity, compression performance, and tokenizer efficiency, offering consistent insights across different model architectures and sizes.
Abstract: Recent years have witnessed the rapid advancements of large language models (LLMs) and their expanding applications, leading to soaring demands for computational resources. The widespread adoption of test-time scaling further aggravates the tension between model capability and resource consumption, highlighting the importance of inference efficiency. However, a unified metric that accurately reflects an LLM’s efficiency across different model sizes and architectures remains absent. Motivated by the correlation between compression and intelligence, we introduce information capacity, a measure of model efficiency based on text compression performance relative to computational complexity. Larger models can predict the next token more accurately, achieving greater compression gains but at higher computational costs. Empirical evaluations on mainstream open-source models show that models of varying sizes within a series exhibit consistent information capacity. This metric enables a fair efficiency comparison across model series and accurate performance prediction within a model series. A distinctive feature of information capacity is that it incorporates tokenizer efficiency, which affects both input and output token counts but is often neglected in LLM evaluations. We assess the information capacity of 49 models on 5 heterogeneous datasets and observe consistent results on the influences of tokenizer efficiency, pretraining data, and the mixture-of-experts architecture.
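A compression-per-compute ratio of this kind can be made concrete with a toy calculation. The paper defines information capacity only as compression performance relative to computational complexity; the exact normalization below (bits saved over total FLOPs) is our guess at a minimal version, not the paper's formula.

```python
def information_capacity(token_nll_bits, text_bytes, flops_per_token):
    """Compression gain per unit of compute: bits saved versus a raw 8-bit
    encoding, divided by an estimate of total inference FLOPs."""
    model_bits = sum(token_nll_bits)   # bits the model needs to encode the text
    raw_bits = 8 * text_bytes          # uncompressed baseline
    compute = flops_per_token * len(token_nll_bits)
    return (raw_bits - model_bits) / compute

# A small model compresses worse but is far cheaper per token...
ic_small = information_capacity([4.0] * 4, text_bytes=8, flops_per_token=1e9)
# ...while a large model compresses better at 10x the compute per token.
ic_large = information_capacity([2.0] * 4, text_bytes=8, flops_per_token=1e10)
```

Under this toy normalization the small model comes out more "capable per FLOP", illustrating why such a metric can rank models differently from raw accuracy.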
[340] Clustering-based Anomaly Detection in Multivariate Time Series Data
Jinbo Li, Hesam Izakian, Witold Pedrycz, Iqbal Jamal
Main category: cs.AI
TL;DR: A clustering-based approach using extended fuzzy clustering and Particle Swarm Optimization to detect anomalies in multivariate time series by analyzing amplitude and shape patterns.
Details
Motivation: Multivariate time series anomaly detection is challenging due to the need to simultaneously consider temporal and variable relationships, with applications in science and engineering.
Method: Use sliding window to generate multivariate subsequences, apply extended fuzzy clustering to reveal structure, employ reconstruction criterion with cluster centers and partition matrix, and use Particle Swarm Optimization for optimization.
Result: Experimental studies on synthetic and real-world datasets show the method can effectively detect anomalies in multivariate time series.
Conclusion: The proposed framework successfully detects anomalies in multivariate time series and is suitable for identifying anomalous amplitude and shape patterns in various domains like healthcare, weather analysis, finance, and disease outbreak detection.
Abstract: Multivariate time series data come as a collection of time series describing different aspects of a certain temporal phenomenon. Anomaly detection in this type of data constitutes a challenging problem, yet one with numerous applications in science and engineering, because anomaly scores come from the simultaneous consideration of temporal and variable relationships. In this paper, we propose a clustering-based approach to detect anomalies concerning the amplitude and the shape of multivariate time series. First, we use a sliding window to generate a set of multivariate subsequences and thereafter apply an extended fuzzy clustering to reveal a structure present within the generated multivariate subsequences. Finally, a reconstruction criterion is employed to reconstruct the multivariate subsequences with the optimal cluster centers and the partition matrix. We construct a confidence index to quantify the level of anomaly detected in the series and apply Particle Swarm Optimization as an optimization vehicle for the problem of anomaly detection. Experimental studies completed on several synthetic and six real-world datasets suggest that the proposed method can detect anomalies in multivariate time series. With the help of the clusters revealed by the extended fuzzy clustering, the proposed framework is suitable for identifying anomalous amplitude and shape patterns in various application domains such as health care, weather data analysis, finance, and disease outbreak detection.
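The sliding-window-plus-reconstruction idea can be sketched compactly. Plain fuzzy c-means below stands in for the paper's extended variant, the PSO step and confidence index are omitted, and windows are univariate for brevity (the paper stacks all variables into each subsequence).

```python
import math
import random

def windows(series, w):
    """Slide a window of width w over the series."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

def dist(x, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, v)))

def memberships(data, centers, m):
    """Standard fuzzy c-means partition matrix."""
    out = []
    for x in data:
        d = [max(dist(x, v), 1e-12) for v in centers]
        out.append([1.0 / sum((d[j] / d[k]) ** (2 / (m - 1))
                              for k in range(len(centers)))
                    for j in range(len(centers))])
    return out

def fcm(data, c=2, m=2.0, iters=30, seed=0):
    """Plain fuzzy c-means; returns cluster centers and the partition matrix."""
    rng = random.Random(seed)
    centers = rng.sample(data, c)
    for _ in range(iters):
        U = memberships(data, centers, m)
        centers = [[sum(U[i][j] ** m * data[i][t] for i in range(len(data))) /
                    sum(U[i][j] ** m for i in range(len(data)))
                    for t in range(len(data[0]))]
                   for j in range(c)]
    return centers, memberships(data, centers, m)

def anomaly_scores(series, w=3):
    """Reconstruct each window from the centers weighted by membership;
    a large reconstruction error flags an anomalous amplitude/shape pattern."""
    subs = windows(series, w)
    centers, U = fcm(subs)
    return [dist(x, [sum(u[j] * centers[j][t] for j in range(len(centers)))
                     for t in range(w)])
            for x, u in zip(subs, U)]

# A flat series with one amplitude spike: the spike windows score highest.
series = [0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0]
scores = anomaly_scores(series)
```

Windows that no cluster center can represent well get large reconstruction errors, which is exactly the reconstruction criterion the abstract describes.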
[341] Prudential Reliability of Large Language Models in Reinsurance: Governance, Assurance, and Capital Efficiency
Stella C. Dong
Main category: cs.AI
TL;DR: A prudential framework for LLM reliability in reinsurance with five pillars (governance, data lineage, assurance, resilience, regulatory alignment) that translates supervisory expectations into measurable controls, implemented through RAIRAB benchmark showing improved grounding accuracy and reduced hallucinations.
Details
Motivation: To assess the reliability of large language models in reinsurance by translating existing supervisory expectations from Solvency II, SR 11-7, and regulatory guidance into a measurable framework that addresses grounding, transparency, and accountability concerns.
Method: Developed a five-pillar architecture framework and implemented it through the Reinsurance AI Reliability and Assurance Benchmark (RAIRAB), evaluating governance-embedded LLMs across six task families with retrieval-grounded configurations.
Result: Retrieval-grounded configurations achieved higher grounding accuracy (0.90), reduced hallucination and interpretive drift by approximately 40%, and nearly doubled transparency, lowering informational frictions in risk transfer and capital allocation.
Conclusion: Existing prudential doctrines can accommodate reliable AI when governance is explicit, data are traceable, and assurance is verifiable, demonstrating that current regulatory frameworks are sufficient for AI reliability when properly implemented.
Abstract: This paper develops a prudential framework for assessing the reliability of large language models (LLMs) in reinsurance. A five-pillar architecture (governance, data lineage, assurance, resilience, and regulatory alignment) translates supervisory expectations from Solvency II, SR 11-7, and guidance from EIOPA (2025), NAIC (2023), and IAIS (2024) into measurable lifecycle controls. The framework is implemented through the Reinsurance AI Reliability and Assurance Benchmark (RAIRAB), which evaluates whether governance-embedded LLMs meet prudential standards for grounding, transparency, and accountability. Across six task families, retrieval-grounded configurations achieved higher grounding accuracy (0.90), reduced hallucination and interpretive drift by roughly 40%, and nearly doubled transparency. These mechanisms lower informational frictions in risk transfer and capital allocation, showing that existing prudential doctrines already accommodate reliable AI when governance is explicit, data are traceable, and assurance is verifiable.
[342] Gateways to Tractability for Satisfiability in Pearl’s Causal Hierarchy
Robert Ganian, Marlene Gründel, Simon Wietheger
Main category: cs.AI
TL;DR: The paper identifies tractable cases for Pearl’s Causal Hierarchy satisfiability using parameterized complexity, providing fixed-parameter and XP algorithms for probabilistic and counterfactual fragments.
Details
Motivation: Pearl's Causal Hierarchy satisfiability is computationally intractable in most classical settings, motivating the search for tractable cases through parameterized complexity.
Method: Uses parameterized complexity with parameters like primal treewidth and number of variables, departing from dynamic programming to exploit structural characterizations of causal models.
Result: Provides first fixed-parameter and XP-algorithms for satisfiability in key probabilistic and counterfactual fragments, with matching hardness results establishing tractability boundaries.
Conclusion: Identifies initial gateways to tractability for PCH satisfiability through parameterized complexity, offering new algorithmic approaches for causal reasoning.
Abstract: Pearl’s Causal Hierarchy (PCH) is a central framework for reasoning about probabilistic, interventional, and counterfactual statements, yet the satisfiability problem for PCH formulas is computationally intractable in almost all classical settings. We revisit this challenge through the lens of parameterized complexity and identify the first gateways to tractability. Our results include fixed-parameter and XP-algorithms for satisfiability in key probabilistic and counterfactual fragments, using parameters such as primal treewidth and the number of variables, together with matching hardness results that map the limits of tractability. Technically, we depart from the dynamic programming paradigm typically employed for treewidth-based algorithms and instead exploit structural characterizations of well-formed causal models, providing a new algorithmic toolkit for causal reasoning.
[343] Improving Industrial Injection Molding Processes with Explainable AI for Quality Classification
Georg Rottenwalter, Marcel Tilly, Victor Owolabi
Main category: cs.AI
TL;DR: Using XAI techniques (SHAP, Grad-CAM, LIME) to reduce features from 19 to 9 and 6 in an LSTM model for injection-molded part quality classification, achieving improved generalization while maintaining high accuracy with slight inference speed gains.
Details
Motivation: Machine learning models lack interpretability for industrial quality control, and many industrial machines have limited sensor technology, making data acquisition challenging. XAI can provide insights into model decisions and identify relevant features.
Method: Applied SHAP, Grad-CAM, and LIME to analyze feature importance in an LSTM model trained on real production data. Reduced original 19 input features to 9 and 6 features, evaluating trade-offs between accuracy, inference speed, and interpretability.
Result: Feature reduction improved generalization while maintaining high classification performance, with a small increase in inference speed. The approach enhances feasibility of AI-driven quality control for industrial settings with limited sensor capabilities.
Conclusion: Feature reduction using XAI techniques enables more efficient and interpretable machine learning applications in manufacturing, particularly beneficial for industrial settings with constrained sensor technology.
Abstract: Machine learning is an essential tool for optimizing industrial quality control processes. However, the complexity of machine learning models often limits their practical applicability due to a lack of interpretability. Additionally, many industrial machines lack comprehensive sensor technology, making data acquisition incomplete and challenging. Explainable Artificial Intelligence (XAI) offers a solution by providing insights into model decision-making and identifying the most relevant features for classification. In this paper, we investigate the impact of feature reduction using XAI techniques on the quality classification of injection-molded parts. We apply SHAP, Grad-CAM, and LIME to analyze feature importance in a Long Short-Term Memory model trained on real production data. By reducing the original 19 input features to 9 and 6, we evaluate the trade-off between model accuracy, inference speed, and interpretability. Our results show that reducing features can improve generalization while maintaining high classification performance, with a small increase in inference speed. This approach enhances the feasibility of AI-driven quality control, particularly for industrial settings with limited sensor capabilities, and paves the way for more efficient and interpretable machine learning applications in manufacturing.
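The importance-rank-then-reduce workflow can be sketched independently of any particular explainer. A plain correlation ranking stands in below for the paper's SHAP/Grad-CAM/LIME importances on an LSTM; the workflow is the same, the importance signal far cruder, and the data is a toy.

```python
import statistics

def rank_features(X, y):
    """Rank feature indices by |Pearson correlation| with the label,
    a crude stand-in for XAI-derived feature importances."""
    def corr(col):
        mx, my = statistics.fmean(col), statistics.fmean(y)
        num = sum((a - mx) * (b - my) for a, b in zip(col, y))
        den = (sum((a - mx) ** 2 for a in col) *
               sum((b - my) ** 2 for b in y)) ** 0.5
        return num / den if den else 0.0
    cols = list(zip(*X))
    return sorted(range(len(cols)), key=lambda j: -abs(corr(cols[j])))

def keep_features(X, keep):
    """Project each sample onto the selected feature indices."""
    return [[row[j] for j in keep] for row in X]

# Feature 0 tracks the label; feature 2 is constant (uninformative).
X = [[1, 5, 7], [2, 1, 7], [3, 8, 7], [4, 2, 7]]
y = [1, 2, 3, 4]
order = rank_features(X, y)
X_reduced = keep_features(X, order[:2])  # e.g. 19 -> 9 -> 6 in the paper
```

Retraining the model on `X_reduced` and comparing accuracy and inference time mirrors the paper's evaluation loop.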
[344] Advancements in synthetic data extraction for industrial injection molding
Georg Rottenwalter, Marcel Tilly, Christian Bielenberg, Katharina Obermeier
Main category: cs.AI
TL;DR: Investigates using synthetic data to improve machine learning models for injection molding processes, finding optimal synthetic-real data balance enhances model robustness and reduces industrial costs.
Details
Motivation: Data acquisition for industrial machine learning is time-consuming and costly. Synthetic data offers a solution to augment insufficient datasets and improve model robustness.
Method: Generate synthetic data by simulating production cycles and incorporate into training data. Experiment with different proportions of synthetic vs real data to find optimal balance using LSTM architecture.
Result: Inclusion of synthetic data improves model’s ability to handle different scenarios. Optimal synthetic-real data balance enhances model performance while preserving data authenticity.
Conclusion: Synthetic data provides valuable alternative for costly data collection, potentially reducing manual labor, machine use, and material waste in manufacturing processes.
Abstract: Machine learning has significant potential for optimizing various industrial processes. However, data acquisition remains a major challenge as it is both time-consuming and costly. Synthetic data offers a promising solution to augment insufficient data sets and improve the robustness of machine learning models. In this paper, we investigate the feasibility of incorporating synthetic data into the training process for injection molding using an existing Long Short-Term Memory architecture. Our approach is to generate synthetic data by simulating production cycles and incorporating them into the training data set. Through iterative experimentation with different proportions of synthetic data, we attempt to find an optimal balance that maximizes the benefits of synthetic data while preserving the authenticity and relevance of real data. Our results suggest that the inclusion of synthetic data improves the model’s ability to handle different scenarios, with potential practical industrial applications to reduce manual labor, machine use, and material waste. This approach provides a valuable alternative for situations where extensive data collection and maintenance are impractical or costly, and could thus contribute to more efficient manufacturing processes in the future.
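The iterative mixing experiment reduces to building training sets with a controlled synthetic fraction. A minimal sketch; the sampling policy, names, and placeholder data are ours, and `synth_fraction` is assumed to be below 1.

```python
import random

def mixed_training_set(real, synthetic, synth_fraction, seed=0):
    """Blend real and simulated production cycles so that synthetic samples
    make up roughly `synth_fraction` of the final training set."""
    rng = random.Random(seed)
    n_synth = min(round(len(real) * synth_fraction / (1 - synth_fraction)),
                  len(synthetic))
    mix = real + rng.sample(synthetic, n_synth)
    rng.shuffle(mix)
    return mix

real_cycles = list(range(100))            # placeholder real sensor traces
synthetic_cycles = list(range(100, 200))  # placeholder simulated cycles
train = mixed_training_set(real_cycles, synthetic_cycles, synth_fraction=0.2)
```

Sweeping `synth_fraction` and retraining the LSTM at each setting is the "iterative experimentation with different proportions" the abstract describes.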
[345] National Institute on Aging PREPARE Challenge: Early Detection of Cognitive Impairment Using Speech - The SpeechCARE Solution
Maryam Zolnoori, Hossein Azadmaleki, Yasaman Haghbin, Ali Zolnour, Mohammad Javad Momeni Nezhad, Sina Rashidi, Mehdi Naserian, Elyas Esmaeili, Sepehr Karimi Arpanahi
Main category: cs.AI
TL;DR: SpeechCARE is a multimodal speech processing pipeline that uses transformer models to detect cognitive impairment from speech, achieving high performance in classifying Alzheimer’s disease and mild cognitive impairment with minimal bias.
Details
Motivation: Over 50% of individuals with cognitive decline remain undiagnosed, and existing speech-based assessment methods have limited performance and generalizability.
Method: Uses pretrained multilingual acoustic and linguistic transformers with dynamic fusion architecture inspired by Mixture of Experts, including robust preprocessing with automatic transcription, LLM-based anomaly detection, and SHAP-based explainability.
Result: Achieved AUC = 0.88 and F1 = 0.72 for classifying cognitively healthy, MCI, and AD individuals; AUC = 0.90 and F1 = 0.62 for MCI detection with minimal bias except for adults over 80.
Conclusion: SpeechCARE shows promise for early ADRD detection and will be deployed in real-world care settings with EHR integration for underrepresented populations.
Abstract: Alzheimer’s disease and related dementias (ADRD) affect one in five adults over 60, yet more than half of individuals with cognitive decline remain undiagnosed. Speech-based assessments show promise for early detection, as phonetic motor planning deficits alter acoustic features (e.g., pitch, tone), while memory and language impairments lead to syntactic and semantic errors. However, conventional speech-processing pipelines with hand-crafted features or general-purpose audio classifiers often exhibit limited performance and generalizability. To address these limitations, we introduce SpeechCARE, a multimodal speech processing pipeline that leverages pretrained, multilingual acoustic and linguistic transformer models to capture subtle speech-related cues associated with cognitive impairment. Inspired by the Mixture of Experts (MoE) paradigm, SpeechCARE employs a dynamic fusion architecture that weights transformer-based acoustic, linguistic, and demographic inputs, allowing integration of additional modalities (e.g., social factors, imaging) and enhancing robustness across diverse tasks. Its robust preprocessing includes automatic transcription, large language model (LLM)-based anomaly detection, and task identification. A SHAP-based explainability module and LLM reasoning highlight each modality’s contribution to decision-making. SpeechCARE achieved AUC = 0.88 and F1 = 0.72 for classifying cognitively healthy, MCI, and AD individuals, with AUC = 0.90 and F1 = 0.62 for MCI detection. Bias analysis showed minimal disparities, except for adults over 80. Mitigation techniques included oversampling and weighted loss. Future work includes deployment in real-world care settings (e.g., VNS Health, Columbia ADRC) and EHR-integrated explainability for underrepresented populations in New York City.
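The dynamic fusion step (weighting acoustic, linguistic, and demographic inputs, MoE-style) can be sketched as a softmax-gated average of modality embeddings. The dimensions, gating inputs, and function name below are illustrative, not SpeechCARE's actual architecture.

```python
import math

def fuse(modality_vecs, gate_logits):
    """Softmax-gated fusion of per-modality embeddings: each modality's
    vector is weighted by its (learned, here hand-set) gate logit."""
    exps = [math.exp(g) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(modality_vecs[0])
    return [sum(weights[k] * modality_vecs[k][t] for k in range(len(weights)))
            for t in range(dim)]

# Acoustic, linguistic, and demographic embeddings (toy 2-d vectors).
acoustic, linguistic, demographic = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]
fused = fuse([acoustic, linguistic, demographic], gate_logits=[0.0, 0.0, 0.0])
```

In a trained system the gate logits would themselves be predicted from the inputs, so the fusion weights adapt per sample, and additional modalities (e.g. imaging) are just extra rows.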
[346] oboro: Text-to-Image Synthesis on Limited Data using Flow-based Diffusion Transformer with MMH Attention
Ryusuke Mizutani, Kazuaki Matano, Tsugumi Kadowaki, Haruki Tenya, Layris, nuigurumi, Koki Hashimoto, Yu Tanaka
Main category: cs.AI
TL;DR: Development of “oboro:”, Japan’s first open-source commercial image generation AI model, built from scratch using copyright-cleared data to address anime industry labor shortages.
Details
Motivation: To solve labor shortage challenges in Japan's anime production industry by developing a domestic image generation model from scratch.
Method: Built the “oboro:” image generation model from scratch using only copyright-cleared images, with an architecture designed for high-quality generation from limited datasets.
Result: Successfully developed and publicly released “oboro:” foundation model weights and inference code as Japan’s first open-source commercial image generation AI.
Conclusion: This project contributes to Japan’s AI ecosystem by providing transparent development and promoting domestic AI research and engineering capabilities.
Abstract: This project was conducted as a 2nd-term adopted project of the “Post-5G Information and Communication System Infrastructure Enhancement R&D Project Development of Competitive Generative AI Foundation Models (GENIAC),” a business of the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO). To address challenges such as labor shortages in Japan’s anime production industry, this project aims to develop an image generation model from scratch. This report details the technical specifications of the developed image generation model, “oboro:.” We have developed “oboro:,” a new image generation model built from scratch, using only copyright-cleared images for training. A key characteristic is its architecture, designed to generate high-quality images even from limited datasets. The foundation model weights and inference code are publicly available alongside this report. This project marks the first release of an open-source, commercially-oriented image generation AI fully developed in Japan. AiHUB originated from the OSS community; by maintaining transparency in our development process, we aim to contribute to Japan’s AI researcher and engineer community and promote the domestic AI development ecosystem.
[347] An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
Georgios Pantazopoulos, Eda B. Özyiğit
Main category: cs.AI
TL;DR: Efficient training pipeline combining model-based data filtering and parameter-efficient fine-tuning achieves strong visual grounding performance with only 12K clean examples from 4.8M synthetic data, matching or surpassing larger baselines.
Details
Motivation: To enable reasoning-capable GUI agents by addressing the limitations of existing methods that rely on massive, noisy synthetic datasets for visual grounding tasks.
Method: Curated 12K clean instances from 4.8M synthetic examples by filtering challenging cases, removing misalignments, and selecting diverse multimodal instances. Trained a 3B-parameter VLM with supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization.
Result: Models trained with filtered data and lightweight strategies match or surpass larger baselines on ScreenSpot, Multimodal-Mind2Web, and AndroidControl benchmarks.
Conclusion: Principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
Abstract: Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, removing misaligned ones, and then selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
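The Group Relative Policy Optimization step mentioned above normalises each sampled response's reward against its group, in place of a learned value baseline. A minimal sketch of that core computation (the surrounding RL loop is omitted):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO's group-relative advantage: each rollout's reward is
    standardised against the mean and std of its own group of
    samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# four rollouts for one grounding prompt: two correct clicks, two misses
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantage and incorrect ones negative, without any critic network.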
[348] Towards Provably Unlearnable Examples via Bayes Error Optimisation
Ruihan Zhang, Jun Sun, Ee-Peng Lim, Peixin Zhang
Main category: cs.AI
TL;DR: Proposes a novel method for creating unlearnable examples by systematically maximizing Bayes error, providing formal guarantees and maintaining effectiveness when mixed with clean data.
Details
Motivation: Address privacy concerns in machine learning by preventing models from learning from user data without consent, overcoming limitations of existing heuristic methods that lack formal guarantees and fail when mixed with clean data.
Method: Optimization-based approach using projected gradient ascent to systematically maximize the Bayes error, which measures the irreducible classification error, for constructing unlearnable examples.
Result: Experimental results across multiple datasets and model architectures show the method effectively restricts data learnability and remains functional when unlearnable examples are mixed with clean samples.
Conclusion: The proposed Bayes error maximization approach provides provable guarantees for creating unlearnable examples and maintains practical effectiveness in real-world scenarios where clean and protected data are mixed.
Abstract: The recent success of machine learning models, especially large-scale classifiers and language models, relies heavily on training with massive data. These data are often collected from online sources. This raises serious concerns about the protection of user data, as individuals may not have given consent for their data to be used in training. To address this concern, recent studies introduce the concept of unlearnable examples, i.e., data instances that appear natural but are intentionally altered to prevent models from effectively learning from them. While existing methods demonstrate empirical effectiveness, they typically rely on heuristic trials and lack formal guarantees. Moreover, when unlearnable examples are mixed with clean data, as is often the case in practice, their unlearnability disappears. In this work, we propose a novel approach to constructing unlearnable examples by systematically maximising the Bayes error, a measure of the irreducible classification error. We develop an optimisation-based approach and provide an efficient solution using projected gradient ascent. Our method provably increases the Bayes error and remains effective when the unlearnable examples are mixed with clean samples. Experimental results across multiple datasets and model architectures are consistent with our theoretical analysis and show that our approach can effectively restrict data learnability in practice.
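Projected gradient ascent on a perturbation can be sketched generically: ascend a differentiable loss in input space, then project back onto an L-infinity ball. The toy below uses a fixed logistic classifier as a surrogate objective, not the paper's actual Bayes-error bound:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unlearnable_perturbation(x, y, w, eps=0.1, lr=0.05, steps=100):
    """Projected gradient ascent on a surrogate classification loss
    (illustrative stand-in for the paper's Bayes-error objective).
    The perturbation stays inside an L-infinity ball of radius eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        p = sigmoid(w @ (x + delta))       # P(y=1 | x + delta)
        grad = (p - y) * w                 # d(cross-entropy)/d(input)
        delta += lr * grad                 # ascend the loss
        delta = np.clip(delta, -eps, eps)  # project onto the ball
    return delta
```

The clip step is the projection: it guarantees the altered instance stays visually close to the original while the classifier's error on it grows.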
[349] EHRStruct: A Comprehensive Benchmark Framework for Evaluating Large Language Models on Structured Electronic Health Record Tasks
Xiao Yang, Xuejiao Zhao, Zhiqi Shen
Main category: cs.AI
TL;DR: EHRStruct is a benchmark for evaluating LLMs on structured EHR data with 11 tasks and 2,200 samples, showing current models struggle with EHR reasoning tasks. EHRMaster, a code-augmented method, achieves SOTA performance.
Details
Motivation: Lack of standardized evaluation frameworks for LLMs on structured EHR data makes systematic assessment and comparison difficult.
Method: Created the EHRStruct benchmark with 11 representative tasks from two EHR datasets, evaluated 20 LLMs, and proposed the code-augmented EHRMaster method.
Result: Many structured EHR tasks require strong understanding and reasoning capabilities that current LLMs struggle with. EHRMaster achieves state-of-the-art performance.
Conclusion: Structured EHR tasks are challenging for LLMs, and EHRMaster provides an effective solution through code augmentation.
Abstract: Structured Electronic Health Record (EHR) data stores patient information in relational tables and plays a central role in clinical decision-making. Recent advances have explored the use of large language models (LLMs) to process such data, showing promise across various clinical tasks. However, the absence of standardized evaluation frameworks and clearly defined tasks makes it difficult to systematically assess and compare LLM performance on structured EHR data. To address these evaluation challenges, we introduce EHRStruct, a benchmark specifically designed to evaluate LLMs on structured EHR tasks. EHRStruct defines 11 representative tasks spanning diverse clinical needs and includes 2,200 task-specific evaluation samples derived from two widely used EHR datasets. We use EHRStruct to evaluate 20 advanced and representative LLMs, covering both general and medical models. We further analyze key factors influencing model performance, including input formats, few-shot generalisation, and fine-tuning strategies, and compare results with 11 state-of-the-art LLM-based enhancement methods for structured data reasoning. Our results indicate that many structured EHR tasks place high demands on the understanding and reasoning capabilities of LLMs. In response, we propose EHRMaster, a code-augmented method that achieves state-of-the-art performance and offers practical
[350] MADD: Multi-Agent Drug Discovery Orchestra
Gleb V. Solovev, Alina B. Zhidkovskaya, Anastasia Orlova, Nina Gubina, Anastasia Vepreva, Rodion Golovinskii, Ilya Tonkii, Ivan Dubrovsky, Ivan Gurev, Dmitry Gilemkhanov, Denis Chistiakov, Timur A. Aliev, Ivan Poddiakov, Galina Zubkova, Ekaterina V. Skorb, Vladimir Vinogradov, Alexander Boukhanovsky, Nikolay Nikitin, Andrei Dmitrenko, Anna Kalyuzhnaya, Andrey Savchenko
Main category: cs.AI
TL;DR: MADD is a multi-agent system that builds customized hit identification pipelines from natural language queries, outperforming existing LLM-based solutions in drug discovery.
Details
Motivation: To address the accessibility limitations of complex AI tools for wet-lab researchers in early drug discovery hit identification.
Method: Uses four coordinated agents to handle key subtasks in de novo compound generation and screening from natural language queries.
Result: Superior performance across seven drug discovery cases compared to existing LLM-based solutions; identified hit molecules for five biological targets.
Conclusion: MADD enables AI-first drug design and contributes a new benchmark for agentic drug design with over three million query-molecule pairs and docking scores.
Abstract: Hit identification is a central challenge in early drug discovery, traditionally requiring substantial experimental resources. Recent advances in artificial intelligence, particularly large language models (LLMs), have enabled virtual screening methods that reduce costs and improve efficiency. However, the growing complexity of these tools has limited their accessibility to wet-lab researchers. Multi-agent systems offer a promising solution by combining the interpretability of LLMs with the precision of specialized models and tools. In this work, we present MADD, a multi-agent system that builds and executes customized hit identification pipelines from natural language queries. MADD employs four coordinated agents to handle key subtasks in de novo compound generation and screening. We evaluate MADD across seven drug discovery cases and demonstrate its superior performance compared to existing LLM-based solutions. Using MADD, we pioneer the application of AI-first drug design to five biological targets and release the identified hit molecules. Finally, we introduce a new benchmark of query-molecule pairs and docking scores for over three million compounds to contribute to the agentic future of drug design.
[351] Beyond Distributions: Geometric Action Control for Continuous Reinforcement Learning
Zhihao Lin
Main category: cs.AI
TL;DR: GAC is a novel action generation method that preserves geometric properties of spherical distributions while simplifying computation, achieving state-of-the-art performance on continuous control tasks.
Details
Motivation: Gaussian policies in RL suffer from a fundamental mismatch with bounded action spaces, requiring ad-hoc squashing functions. vMF distributions offer theoretical benefits but are computationally expensive.
Method: GAC decomposes action generation into a direction vector and a learnable concentration parameter, enabling efficient interpolation between deterministic actions and uniform spherical noise. Reduces parameters from 2d to d+1 and avoids the O(dk) complexity of vMF.
Result: Consistently matches or exceeds state-of-the-art methods across six MuJoCo benchmarks, achieving 37.6% improvement over SAC on Ant-v4 and best results on 4 out of 6 tasks.
Conclusion: Robust and efficient continuous control doesn’t require complex distributions but principled respect for action space geometry. Both spherical normalization and adaptive concentration control are essential to GAC’s success.
Abstract: Gaussian policies have dominated continuous control in deep reinforcement learning (RL), yet they suffer from a fundamental mismatch: their unbounded support requires ad-hoc squashing functions that distort the geometry of bounded action spaces. While von Mises-Fisher (vMF) distributions offer a theoretically grounded alternative on the sphere, their reliance on Bessel functions and rejection sampling hinders practical adoption. We propose Geometric Action Control (GAC), a novel action generation paradigm that preserves the geometric benefits of spherical distributions while simplifying computation. GAC decomposes action generation into a direction vector and a learnable concentration parameter, enabling efficient interpolation between deterministic actions and uniform spherical noise. This design reduces the parameter count from 2d to d+1, and avoids the O(dk) complexity of vMF rejection sampling, requiring only simple O(d) operations. Empirically, GAC consistently matches or exceeds state-of-the-art methods across six MuJoCo benchmarks, achieving a 37.6% improvement over SAC on Ant-v4 and the best results on 4 out of 6 tasks. Our ablation studies reveal that both spherical normalization and adaptive concentration control are essential to GAC’s success. These findings suggest that robust and efficient continuous control does not require complex distributions, but a principled respect for the geometry of action spaces. Code and pretrained models are available in supplementary materials.
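The interpolation between a deterministic direction and uniform spherical noise can be illustrated in a few lines. This is an assumed parameterisation chosen to show the O(d) cost, not the paper's exact formulation:

```python
import numpy as np

def gac_action(mu, kappa, rng):
    """Spherical action sampling sketch: blend a deterministic direction
    mu with uniform spherical noise via a concentration parameter kappa,
    then renormalise. Large kappa -> near-deterministic; kappa -> 0
    recovers uniform directions. O(d), no Bessel functions or
    rejection sampling."""
    noise = rng.normal(size=mu.shape)
    noise /= np.linalg.norm(noise)   # uniform point on the unit sphere
    a = kappa * mu + noise
    return a / np.linalg.norm(a)     # project back onto the sphere
```

Because the action is normalised, it always lies on the unit sphere of the bounded action space, with no squashing function needed.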
[352] Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents
Waseem AlShikh, Muayad Sayed Ali, Brian Kennedy, Dmytro Mozolevskyi
Main category: cs.AI
TL;DR: Proposes 11 outcome-based metrics for evaluating AI agents beyond infrastructure metrics, tested across 4 architectures and 5 domains, showing Hybrid Agent performs best.
Details
Motivation: Current AI agent evaluation focuses on infrastructural metrics (latency, throughput) which don't capture decision quality, autonomy, or business value.
Method: Large-scale simulated experiment with 4 agent architectures (ReAct, Chain-of-Thought, Tool-Augmented, Hybrid) across 5 domains (Healthcare, Finance, Marketing, Legal, Customer Service) using 11 proposed metrics.
Result: Hybrid Agent performed best across most metrics with 88.8% Goal Completion Rate and highest ROI, revealing significant performance trade-offs between architectures.
Conclusion: Provides standardized methodology for holistic AI agent evaluation, enabling better development, deployment, and governance.
Abstract: As AI agents proliferate across industries and applications, evaluating their performance based solely on infrastructural metrics such as latency, time-to-first-token, or token throughput is proving insufficient. These metrics fail to capture the quality of an agent’s decisions, its operational autonomy, or its ultimate business value. This white paper proposes a novel, comprehensive framework of eleven outcome-based, task-agnostic performance metrics for AI agents that transcend domain boundaries. These metrics are designed to enable organizations to evaluate agents based on the quality of their decisions, their degree of autonomy, their adaptability to new challenges, and the tangible business value they deliver, regardless of the underlying model architecture or specific use case. We introduce metrics such as Goal Completion Rate (GCR), Autonomy Index (AIx), Multi-Step Task Resilience (MTR), and Business Impact Efficiency (BIE). Through a large-scale simulated experiment involving four distinct agent architectures (ReAct, Chain-of-Thought, Tool-Augmented, Hybrid) across five diverse domains (Healthcare, Finance, Marketing, Legal, and Customer Service), we demonstrate the framework’s efficacy. Our results reveal significant performance trade-offs between different agent designs, highlighting the Hybrid Agent as the most consistently high-performing model across the majority of our proposed metrics, achieving an average Goal Completion Rate of 88.8% and the highest Return on Investment (ROI). This work provides a robust, standardized methodology for the holistic evaluation of AI agents, paving the way for more effective development, deployment, and governance.
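Metrics such as Goal Completion Rate reduce to simple ratios over episode logs. The log schema below (`goal_met`, `human_interventions`, `steps`) is hypothetical, and the Autonomy Index formula is an assumed instantiation; the paper names the metrics but a concrete data format is not given here:

```python
def goal_completion_rate(episodes):
    """GCR sketch: fraction of episodes whose goal was met."""
    done = sum(1 for e in episodes if e["goal_met"])
    return done / len(episodes)

def autonomy_index(episodes):
    """AIx sketch (assumed formula): share of steps taken without
    human intervention, averaged over episodes."""
    ratios = [1 - e["human_interventions"] / e["steps"] for e in episodes]
    return sum(ratios) / len(ratios)
```

The point of such outcome metrics is that they are computable from the same logs regardless of whether the agent is ReAct, CoT, tool-augmented, or hybrid.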
[353] Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning
Ziyu Ma, Chenhui Gou, Yiming Hu, Yong Wang, Xiangxiang Chu, Bohan Zhuang, Jianfei Cai
Main category: cs.AI
TL;DR: STV is a sensitivity-aware task vector insertion framework that determines optimal locations and values for inserting compact representations of in-context demonstrations into multimodal models, improving many-shot learning capabilities.
Details
Motivation: Large Multimodal Models struggle with many-shot in-context learning due to limited context length and high inference costs, while existing task-vector methods fail to optimally determine where and what to insert.
Method: Uses activation delta patterns to identify sensitive locations, constructs pre-clustered activation banks for each location, and applies reinforcement learning to select optimal insertion values.
Result: STV consistently outperforms previous task-vector methods across various multimodal models (Qwen-VL, Idefics-2) and tasks (VizWiz, OK-VQA) with strong generalization.
Conclusion: The sensitivity-aware approach effectively addresses the where and what challenges in task vector insertion, enabling better many-shot in-context learning for multimodal models.
Abstract: Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored by inserting compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to figure out where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitive-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.
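The sensitivity cue, activation deltas between context and no-context runs, can be sketched as a ranking over candidate insertion locations. Array shapes and the norm-based score are assumptions for illustration:

```python
import numpy as np

def sensitive_locations(act_with_ctx, act_without_ctx, k=3):
    """Rank candidate insertion locations by the average magnitude of
    activation deltas across query-context pairs (sketch of STV's
    sensitivity cue; shapes assumed: (n_pairs, n_locations, d))."""
    deltas = act_with_ctx - act_without_ctx
    score = np.linalg.norm(deltas, axis=-1).mean(axis=0)  # (n_locations,)
    return np.argsort(score)[::-1][:k]                    # top-k indices
```

The consistent structural pattern the paper reports is what makes this mean-over-pairs score a reliable signal rather than noise.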
[354] Multi-Agent GraphRAG: A Text-to-Cypher Framework for Labeled Property Graphs
Anton Gusarov, Anastasia Volkova, Valentin Khrulkov, Andrey Kuznetsov, Evgenii Maslov, Ivan Oseledets
Main category: cs.AI
TL;DR: Proposes Multi-Agent GraphRAG, a modular LLM agentic system for text-to-Cypher query generation that leverages Labeled Property Graph databases as scalable reasoning engines in GraphRAG pipelines.
Details
Motivation: Existing GraphRAG methods primarily focus on RDF knowledge graphs using triple representations and SPARQL queries, leaving the potential of Cypher and LPG databases underexplored for scalable and effective reasoning in GraphRAG systems.
Method: A modular LLM agentic system featuring automated Cypher query generation and execution using Memgraph as backend, with iterative content-aware correction and normalization reinforced by aggregated feedback loops for semantic and syntactic refinement.
Result: Evaluated on CypherBench dataset across general domains and demonstrated on IFC data representing building digital twins, showing the system’s ability to bridge AI with real-world industrial applications at scale.
Conclusion: The approach successfully enables industrial digital automation use cases by providing a natural language interface to LPG-based graph data, demonstrating the viability of Cypher and property graphs as scalable reasoning engines in GraphRAG pipelines.
Abstract: While Retrieval-Augmented Generation (RAG) methods commonly draw information from unstructured documents, the emerging paradigm of GraphRAG aims to leverage structured data such as knowledge graphs. Most existing GraphRAG efforts focus on Resource Description Framework (RDF) knowledge graphs, relying on triple representations and SPARQL queries. However, the potential of Cypher and Labeled Property Graph (LPG) databases to serve as scalable and effective reasoning engines within GraphRAG pipelines remains underexplored in current research literature. To fill this gap, we propose Multi-Agent GraphRAG, a modular LLM agentic system for text-to-Cypher query generation serving as a natural language interface to LPG-based graph data. Our proof-of-concept system features an LLM-based workflow for automated Cypher queries generation and execution, using Memgraph as the graph database backend. Iterative content-aware correction and normalization, reinforced by an aggregated feedback loop, ensures both semantic and syntactic refinement of generated queries. We evaluate our system on the CypherBench graph dataset covering several general domains with diverse types of queries. In addition, we demonstrate performance of the proposed workflow on a property graph derived from the IFC (Industry Foundation Classes) data, representing a digital twin of a building. This highlights how such an approach can bridge AI with real-world applications at scale, enabling industrial digital automation use cases.
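The iterative correction loop, generate a query, execute it, and feed any error back into the next prompt, can be sketched with caller-supplied stubs (`llm` and `execute` are placeholders here, not a real Memgraph client):

```python
def generate_cypher(question, llm, execute, max_rounds=3):
    """Generate-execute-correct loop (sketch of the feedback-driven
    refinement described in the paper; the real system also performs
    content-aware normalization, omitted here)."""
    prompt = f"Write a Cypher query for: {question}"
    for _ in range(max_rounds):
        query = llm(prompt)
        try:
            return query, execute(query)   # success: query and rows
        except Exception as err:           # feed the error back
            prompt = (f"The query failed with: {err}\n"
                      f"Fix this Cypher query: {query}")
    raise RuntimeError("no valid query after retries")
```

Feeding the database's own error message back to the model is what drives the syntactic refinement; semantic checks would compare returned rows against expectations.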
[355] DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation
Vishal Kumar, Shubhra Mishra, Rebecca Hao, Rizwaan Malik, David Broman, Dorottya Demszky
Main category: cs.AI
TL;DR: DiagramIR is an automatic evaluation pipeline for geometric figures generated by LLMs, using intermediate representations of LaTeX TikZ code to achieve higher human agreement than LLM-as-a-Judge baselines, enabling cost-effective deployment of educational technologies.
Details
Motivation: Current LLM-based learning tools are text-only, limiting their usefulness in visualization-heavy domains like mathematics. While LLMs can generate code for educational figures, there's a bottleneck in scalable evaluation of these diagrams.
Method: Proposed the DiagramIR pipeline, which uses intermediate representations (IRs) of LaTeX TikZ code to automatically evaluate geometric figures generated by LLMs.
Result: DiagramIR shows higher agreement with human raters compared to LLM-as-a-Judge baselines. Enables smaller models like GPT-4.1-Mini to perform comparably to larger models like GPT-5 at 10x lower inference cost.
Conclusion: DiagramIR provides a scalable evaluation approach for educational diagrams, making LLM-based educational technologies more accessible and cost-effective for deployment.
Abstract: Large Language Models (LLMs) are increasingly being adopted as tools for learning; however, most tools remain text-only, limiting their usefulness for domains where visualizations are essential, such as mathematics. Recent work shows that LLMs are capable of generating code that compiles to educational figures, but a major bottleneck remains: scalable evaluation of these diagrams. We address this by proposing DiagramIR: an automatic and scalable evaluation pipeline for geometric figures. Our method relies on intermediate representations (IRs) of LaTeX TikZ code. We compare our pipeline to other evaluation baselines such as LLM-as-a-Judge, showing that our approach has higher agreement with human raters. This evaluation approach also enables smaller models like GPT-4.1-Mini to perform comparably to larger models such as GPT-5 at a 10x lower inference cost, which is important for deploying accessible and scalable education technologies.
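An intermediate representation of TikZ code can be as simple as structured records extracted from draw commands. The regex sketch below handles only circles and is purely illustrative; DiagramIR's actual IR is necessarily richer:

```python
import re

def tikz_circle_ir(tikz_src):
    """Extract a tiny IR from TikZ source: centres and radii of
    `\\draw ... circle` commands (illustrative sketch only)."""
    pat = re.compile(
        r"\\draw.*?\(([-\d.]+),([-\d.]+)\)\s*circle\s*\(([-\d.]+)\)")
    return [{"center": (float(x), float(y)), "radius": float(r)}
            for x, y, r in pat.findall(tikz_src)]

ir = tikz_circle_ir(r"\draw (0,0) circle (1.5); \draw (2,1) circle (0.5);")
```

Once diagrams are reduced to such records, grading becomes a comparison of structured values instead of a fuzzy judgment over rendered images, which is what makes the pipeline cheap and model-agnostic.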
[356] Smarter Together: Creating Agentic Communities of Practice through Shared Experiential Learning
Valentin Tablan, Scott Taylor, Gabriel Hurtado, Kristoffer Bernhem, Anders Uhrenholt, Gabriele Farei, Karo Moilanen
Main category: cs.AI
TL;DR: Spark is a shared agentic memory architecture that enables AI coding agents to collectively learn and share knowledge, improving code quality and matching larger models’ performance.
Details
Motivation: Traditional developer knowledge sharing platforms are declining while AI agents lack shared learning repositories, creating a gap in collective intelligence for coding agents.
Method: Introduces Spark, a shared agentic memory architecture where AI coding agents contribute to and draw from a persistent experiential memory for collective continual learning.
Result: Spark improved code quality across various model sizes, enabling a 30B parameter model to match state-of-the-art larger models. Achieved 98.2% helpfulness in top qualitative bands.
Conclusion: Spark successfully emulates human developer communities’ collective intelligence, providing an effective shared learning platform for AI coding agents to improve software development quality.
Abstract: The transition from human-centric to agent-centric software development practices is disrupting existing knowledge sharing environments for software developers. Traditional peer-to-peer repositories and developer communities for shared technical knowledge and best practice have witnessed dramatic drops in participation in a short period of time. At the same time, agentic functional equivalents are yet to emerge leaving AI agents, which already generate a significant proportion of all new software code produced, without access to repositories of valuable shared learning. In this paper, we introduce Spark, a novel shared agentic memory architecture which is designed to emulate the collective intelligence and know-how of human developer communities. Spark enables AI coding agents to both contribute to and draw from a persistent and continuously evolving experiential memory. Agents operating in the same general problem space use the Spark shared memory as a repository of new knowledge to achieve collective continual learning. We evaluate Spark as a coach for AI coding agents performing software development tasks. We demonstrate that recommendations made by Spark improve the quality of code generated by generic code generation models at varying sizes and capability tiers. Boosted by Spark, a small open-weights model with 30 billion parameters was able to match the code quality afforded by a much larger state-of-the-art model. Separately, we measure the intrinsic quality of recommendations generated by Spark against a wide range of criteria inspired by software development best practice, and achieve helpfulness levels of up to 98.2% in the top two (out of five) qualitative helpfulness bands.
[357] JobSphere: An AI-Powered Multilingual Career Copilot for Government Employment Platforms
Srihari R, Adarsha B, Mohammed Usman Hussain, Shweta Singh
Main category: cs.AI
TL;DR: JobSphere is an AI-powered career assistant for Punjab’s PGRKAM employment platform that uses RAG architecture with multilingual support, voice interaction, and runs efficiently on consumer GPUs, improving accessibility and usability.
Details
Motivation: To address engagement and accessibility challenges in government employment websites, including navigational complexity, limited language options, and lack of personalized support, particularly for Punjabi/Hindi-speaking users in rural areas.
Method: Uses Retrieval-Augmented Generation (RAG) architecture with 4-bit quantization for efficient deployment on consumer-grade GPUs. Features include multilingual support (English, Hindi, Punjabi), voice-enabled interaction, automated mock tests, resume parsing with skills recognition, and embedding-based job recommendations.
Result: Achieved 94% factual accuracy, median response time of 1.8 seconds, precision@10 score of 68% for job recommendations, and System Usability Scale score of 78.5/100 (50% improvement over baseline). Implementation is 89% cheaper than cloud-based systems.
Conclusion: JobSphere effectively fills accessibility gaps for Punjabi/Hindi-speaking users in rural locations while providing trusted job content from government agencies, demonstrating significant improvements in usability and cost-effectiveness.
Abstract: Users of government employment websites commonly face engagement and accessibility challenges linked to navigational complexity, a dearth of language options, and a lack of personalized support. This paper introduces JobSphere, an AI-powered career assistant for PGRKAM, Punjab’s employment platform. JobSphere employs a Retrieval-Augmented Generation (RAG) architecture and is multilingual, available in English, Hindi, and Punjabi. JobSphere uses 4-bit quantization, allowing the platform to run on consumer-grade GPUs (e.g., NVIDIA RTX 3050 4GB) and making the implementation 89% cheaper than cloud-based systems. Key innovations include voice-enabled interaction with the assistant, automated mock tests, resume parsing with skills recognition, and embedding-based job recommendation that achieves a precision@10 score of 68%. An evaluation of JobSphere’s implementation reveals 94% factual accuracy, a median response time of 1.8 seconds, and a System Usability Scale score of 78.5/100, a 50% improvement over the baseline PGRKAM platform. In conclusion, JobSphere effectively fills significant accessibility gaps for Punjabi/Hindi-speaking users in rural locations, while also affirming users’ access to trusted job content provided by government agencies.
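The embedding-based recommender and its precision@10 evaluation reduce to cosine similarity plus a set intersection. A generic sketch, with no claim about JobSphere's actual embedding model or index:

```python
import numpy as np

def topk_jobs(user_vec, job_vecs, k=10):
    """Cosine-similarity retrieval over job embeddings (sketch;
    vectors could come from any sentence-embedding model)."""
    sims = job_vecs @ user_vec / (
        np.linalg.norm(job_vecs, axis=1) * np.linalg.norm(user_vec))
    return np.argsort(sims)[::-1][:k]

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations that are relevant."""
    return len(set(recommended[:k]) & set(relevant)) / k
```

The reported precision@10 of 68% corresponds to about 6.8 relevant postings per ten recommendations under this definition.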
[358] AI-Powered Data Visualization Platform: An Intelligent Web Application for Automated Dataset Analysis
Srihari R, Pallavi M, Tejaswini S, Vaishnavi R C
Main category: cs.AI
TL;DR: AI-powered platform automates data analysis and visualization using ML algorithms for data cleaning, feature selection, and intelligent visualization generation.
Details
Motivation: To eliminate time-consuming manual data analysis and establish automated AI-based analysis in data-driven environments.
Method: Combines a Python Flask backend with a React frontend, uses Firebase Cloud Storage for data processing, implements automatic data cleaning with imputation and outlier detection, and employs four algorithms for intelligent feature selection and visualization.
Result: Successfully processed datasets up to 100,000 rows in real-time, scaled to handle multiple users simultaneously, and maintained high-quality visual outputs with reduced manual inputs.
Conclusion: The cloud-based platform significantly reduces manual effort in data analysis while delivering high-quality visualizations and user experiences.
Abstract: This paper presents an AI-powered data visualization platform that automates the entire data analysis process, from uploading a dataset to generating an interactive visualization. Advanced machine learning algorithms are employed to clean and preprocess the data, analyse its features, and automatically select appropriate visualizations. The system automates AI-based analysis and visualization in data-driven environments, eliminating the challenge of time-consuming manual data analysis. A Python Flask backend to access the dataset, paired with a React frontend, provides a robust platform that automatically interacts with Firebase Cloud Storage for numerous data processing and data analysis solutions and real-time sources. Key contributions include automatic and intelligent data cleaning, with imputation for missing values and detection of outliers via analysis of the dataset; intelligent feature selection using four different algorithms; and intelligent title generation and visualization determined by the attributes of the dataset. These contributions were evaluated using two separate datasets to assess the platform’s performance. In the process evaluation, the initial analysis was performed in real time on datasets as large as 100,000 rows, while the cloud-based platform scales to meet requests from multiple users and processes them simultaneously. In conclusion, the cloud-based data visualization application significantly reduces manual input to the data analysis process while maintaining high-quality, impactful visual outputs and user experiences.
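The automatic cleaning step (imputation for missing values plus outlier detection) can be sketched in a few lines of standard-library Python. The median imputation, the 1.5×IQR rule, and the toy column below are illustrative assumptions; the paper does not specify its exact algorithms.

```python
import statistics

def clean_column(values):
    """Median-impute missing values (None), then flag outliers with the
    1.5*IQR rule. Returns the filled column and the outlier indices."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    filled = [med if v is None else v for v in values]

    q1, _, q3 = statistics.quantiles(present, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [i for i, v in enumerate(filled) if not lo <= v <= hi]
    return filled, outliers

col = [10, 12, None, 11, 13, 12, 95, None, 10]  # toy column: gaps + one spike
filled, outliers = clean_column(col)
```

Here the two missing entries are replaced by the column median (12) and the spike at index 6 is flagged for review rather than silently dropped.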
[359] SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models
Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, Battista Biggio
Main category: cs.AI
TL;DR: The paper proposes using Self-Organizing Maps to extract multiple refusal directions from LLMs, showing this approach outperforms single-direction methods and jailbreak algorithms in suppressing refusal behavior.
Details
Motivation: Existing work encodes refusal behavior as a single direction, but evidence suggests concepts in LLMs are encoded as low-dimensional manifolds. The authors aim to capture this multi-directional nature of refusal.
Method: Use Self-Organizing Maps trained on harmful prompt representations to identify multiple neurons, then subtract the centroid of harmless representations from each neuron to derive multiple refusal directions.
Result: Ablating multiple directions outperforms single-direction baseline and specialized jailbreak algorithms, effectively suppressing refusal behavior in models.
Conclusion: The approach demonstrates that refusal is encoded as multiple directions rather than a single vector, providing insights into the mechanistic nature of refusal behavior in LLMs.
Abstract: Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model’s latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work’s difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models’ internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
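A minimal numpy sketch of the pipeline the abstract describes: train a tiny 1-D SOM on stand-in "harmful" representations, subtract the harmless centroid from each neuron to obtain multiple directions, then ablate them from a hidden state. The dimensions, the Gaussian stand-in data, and the QR orthogonalization used to remove the whole spanned subspace are illustrative choices, not the paper’s setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_neurons = 16, 4  # toy latent dim and SOM size (hypothetical values)
harmful = rng.normal(1.0, 0.5, size=(200, d))   # stand-in prompt representations
harmless = rng.normal(-1.0, 0.5, size=(200, d))

# Train a tiny 1-D SOM on the harmful representations.
neurons = rng.normal(size=(n_neurons, d))
for t, x in enumerate(harmful):
    lr = 0.5 * (1.0 - t / len(harmful))                        # decaying rate
    bmu = int(np.argmin(np.linalg.norm(neurons - x, axis=1)))  # best-matching unit
    for j in range(n_neurons):
        h = np.exp(-abs(j - bmu))                              # neighborhood kernel
        neurons[j] += lr * h * (x - neurons[j])

# One refusal direction per SOM neuron: neuron minus harmless centroid.
directions = neurons - harmless.mean(axis=0)

# Ablate the subspace spanned by all directions (orthonormalize first).
Q, _ = np.linalg.qr(directions.T)  # columns form an orthonormal basis

def ablate(h):
    """Remove the component of h lying in the refusal subspace."""
    return h - Q @ (Q.T @ h)

h = harmful[0]
h_abl = ablate(h)
```

Note that because the direction vectors need not be orthogonal, removing them one at a time would leave residue; projecting out the whole spanned subspace at once avoids that.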
[360] FaithAct: Faithfulness Planning and Acting in MLLMs
Junxian Li, Xinyue Xu, Sai Ma, Sichao Li
Main category: cs.AI
TL;DR: FaithEval framework for evaluating faithfulness in multimodal reasoning, distinguishing between behavioral and perceptual faithfulness, with FaithAct framework improving perceptual faithfulness by up to 26% without degrading accuracy.
Details
Motivation: Address persistent unfaithfulness in LLMs where models produce plausible but ungrounded reasoning chains that diverge from perceptual evidence or final conclusions.
Method: Introduce FaithEval for quantifying step-level and chain-level faithfulness by evaluating visual support for claimed objects, and propose FaithAct framework that enforces evidential grounding at every reasoning step.
Result: FaithAct improves perceptual faithfulness by up to 26% without degrading task accuracy compared to prompt-based and tool-augmented baselines across multiple reasoning benchmarks.
Conclusion: Treating faithfulness as a guiding principle mitigates hallucination and leads to more stable reasoning trajectories, establishing a unified framework for evaluating and enforcing faithfulness in multimodal reasoning.
Abstract: Unfaithfulness remains a persistent challenge for large language models (LLMs), which often produce plausible yet ungrounded reasoning chains that diverge from perceptual evidence or final conclusions. We distinguish between behavioral faithfulness (alignment between reasoning and output) and perceptual faithfulness (alignment between reasoning and input), and introduce FaithEval for quantifying step-level and chain-level faithfulness by evaluating whether each claimed object is visually supported by the image. Building on these insights, we propose FaithAct, a faithfulness-first planning and acting framework that enforces evidential grounding at every reasoning step. Experiments across multiple reasoning benchmarks demonstrate that FaithAct improves perceptual faithfulness by up to 26% without degrading task accuracy compared to prompt-based and tool-augmented baselines. Our analysis shows that treating faithfulness as a guiding principle not only mitigates hallucination but also leads to more stable reasoning trajectories. This work thereby establishes a unified framework for both evaluating and enforcing faithfulness in multimodal reasoning.
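One plausible reading of step-level and chain-level scoring, sketched with sets of object names: a step is scored by the fraction of its claimed objects that a detector found in the image, and a chain by the mean step score. FaithEval’s actual metric may differ, and the detector output below is invented.

```python
def step_faithfulness(claimed, detected):
    """Fraction of objects claimed in one reasoning step that are visually
    supported, i.e., present in the detector's output for the image."""
    if not claimed:
        return 1.0  # a step that claims no objects is trivially grounded
    return sum(obj in detected for obj in claimed) / len(claimed)

def chain_faithfulness(steps, detected):
    """Mean step-level score over a whole reasoning chain."""
    scores = [step_faithfulness(s, detected) for s in steps]
    return sum(scores) / len(scores)

detected = {"dog", "ball", "grass"}             # toy detector output
chain = [{"dog", "ball"}, {"dog", "frisbee"}]   # second step hallucinates
score = chain_faithfulness(chain, detected)     # (1.0 + 0.5) / 2
```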
[361] Dataset Safety in Autonomous Driving: Requirements, Risks, and Assurance
Alireza Abbaspour, Tejaskumar Balgonda Patil, B Ravi Kiran, Russel Mohr, Senthil Yogamani
Main category: cs.AI
TL;DR: A framework for developing safe datasets for autonomous driving AI systems, aligned with ISO/PAS 8800 guidelines, covering data lifecycle management and safety assurance processes.
Details
Motivation: Dataset integrity is fundamental to AI safety in autonomous driving, requiring structured approaches to mitigate risks from dataset insufficiencies.
Method: Proposes an AI Data Flywheel framework with dataset lifecycle management (collection, annotation, curation, maintenance), safety analysis, requirements definition, and verification/validation strategies.
Result: A comprehensive framework that integrates safety standards, identifies hazards, and establishes processes for dataset safety assurance in autonomous driving applications.
Conclusion: The framework advances robust, safety-assured AI systems for autonomous driving by providing structured approaches to dataset development and safety compliance.
Abstract: Dataset integrity is fundamental to the safety and reliability of AI systems, especially in autonomous driving. This paper presents a structured framework for developing safe datasets aligned with ISO/PAS 8800 guidelines. Using AI-based perception systems as the primary use case, it introduces the AI Data Flywheel and the dataset lifecycle, covering data collection, annotation, curation, and maintenance. The framework incorporates rigorous safety analyses to identify hazards and mitigate risks caused by dataset insufficiencies. It also defines processes for establishing dataset safety requirements and proposes verification and validation strategies to ensure compliance with safety standards. In addition to outlining best practices, the paper reviews recent research and emerging trends in dataset safety and autonomous vehicle development, providing insights into current challenges and future directions. By integrating these perspectives, the paper aims to advance robust, safety-assured AI systems for autonomous driving applications.
[362] Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models
Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens
Main category: cs.AI
TL;DR: Proposes patching LLMs like software versions using lightweight learnable prefixes to address safety vulnerabilities without full model retraining.
Details
Motivation: Major LLM releases are costly and infrequent, leaving models with known safety gaps that are difficult to tailor to customer needs.
Method: Prepends a compact, learnable prefix (patch) to existing models, adding only 0.003% additional parameters to steer behavior toward safer reference models.
Result: Achieves safety improvements comparable to next-generation safety-aligned models across toxicity mitigation, bias reduction, and harmfulness refusal while preserving fluency.
Conclusion: LLMs can be patched like software, providing vendors and practitioners with scalable, efficient, and composable safety updates between major releases.
Abstract: We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infrequent, and difficult to tailor to customer needs, leaving released models with known safety gaps. Unlike full-model fine-tuning or major version updates, our method enables rapid remediation by prepending a compact, learnable prefix to an existing model. This “patch” introduces only 0.003% additional parameters, yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal) policy patches achieve safety improvements comparable to next-generation safety-aligned models while preserving fluency. Our results demonstrate that LLMs can be “patched” much like software, offering vendors and practitioners a practical mechanism for distributing scalable, efficient, and composable safety updates between major model releases.
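Prefix patching can be sketched with a frozen toy model and a small trainable prefix prepended to the input embeddings. The sizes are arbitrary and the "model" is a stand-in linear map, so the parameter ratio here is merely illustrative of the idea, not the paper’s 0.003% figure.

```python
import numpy as np

d_model, vocab, n_prefix = 64, 1000, 4  # toy sizes (hypothetical)
rng = np.random.default_rng(0)

# Frozen "model" weights: an embedding table plus one output projection.
emb = rng.normal(size=(vocab, d_model))
W = rng.normal(size=(d_model, vocab))
frozen_params = emb.size + W.size

# The patch: a small trainable prefix prepended to every input sequence.
# Only these n_prefix * d_model values would be updated during patching.
prefix = rng.normal(size=(n_prefix, d_model)) * 0.01

def forward(token_ids):
    """Prepend the patch to the token embeddings, then run the frozen model
    (mean-pooling stands in for attention layers here)."""
    x = np.concatenate([prefix, emb[token_ids]], axis=0)
    return x.mean(axis=0) @ W

logits = forward(np.array([1, 2, 3]))
ratio = prefix.size / frozen_params  # fraction of trainable parameters
```

In practice the prefix would be optimized (with the base weights frozen) to pull outputs toward a safer reference model, and several such patches can be swapped or composed like software updates.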
[363] A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models
Shubhra Mishra, Yuka Machino, Gabriel Poesia, Albert Jiang, Joy Hsu, Adrian Weller, Challenger Mishra, David Broman, Joshua B. Tenenbaum, Mateja Jamnik, Cedegao E. Zhang, Katherine M. Collins
Main category: cs.AI
TL;DR: LLMs broadly agree with human notions of mathematical interestingness but fail to capture the full distribution of human judgments and show weak alignment with human rationales for why problems are interesting.
Details
Motivation: As AI systems increasingly participate in mathematics with humans, it's important to understand how well their judgments of mathematical interestingness and difficulty align with human ones.
Method: Two empirical studies comparing human and LLM assessments of mathematical interestingness and difficulty across different experience levels (crowdsourcing participants and International Math Olympiad competitors).
Result: LLMs show broad agreement with human notions of interestingness but don’t capture the full distribution of human judgments, and have weak correlation with human-selected interestingness rationales.
Conclusion: Current LLMs show both promises and limitations in capturing human interestingness judgments for mathematical AI partnerships, highlighting the need for better alignment.
Abstract: The evolution of mathematics has been guided in part by interestingness. From researchers choosing which problems to tackle next, to students deciding which ones to engage with, people’s choices are often guided by judgments about how interesting or challenging problems are likely to be. As AI systems, such as LLMs, increasingly participate in mathematics with people – whether for advanced research or education – it becomes important to understand how well their judgments align with human ones. Our work examines this alignment through two empirical studies of human and LLM assessment of mathematical interestingness and difficulty, spanning a range of mathematical experience. We study two groups: participants from a crowdsourcing platform and International Math Olympiad competitors. We show that while many LLMs appear to broadly agree with human notions of interestingness, they mostly do not capture the distribution observed in human judgments. Moreover, most LLMs only somewhat align with why humans find certain math problems interesting, showing weak correlation with human-selected interestingness rationales. Together, our findings highlight both the promises and limitations of current LLMs in capturing human interestingness judgments for mathematical AI thought partnerships.
[364] Hyperdimensional Decoding of Spiking Neural Networks
Cedrick Kinavuidi, Luca Peres, Oliver Rhodes
Main category: cs.AI
TL;DR: Novel SNN-HDC decoding method achieves high accuracy, noise robustness, low latency and energy efficiency, outperforming existing approaches on multiple datasets.
Details
Motivation: To create a decoding method with high accuracy, high noise robustness, low latency and low energy usage for spiking neural networks.
Method: Combines Spiking Neural Networks (SNNs) with Hyperdimensional Computing (HDC) for decoding.
Result: Generally better classification accuracy, lower latency and 1.24x-3.67x energy reduction on DvsGesture, 1.38x-2.27x on SL-Animals-DVS. Can identify 100% of unknown classes on DvsGesture.
Conclusion: SNN-HDC represents a compelling alternative to both rate and latency decoding methods due to its numerous benefits.
Abstract: This work presents a novel spiking neural network (SNN) decoding method, combining SNNs with Hyperdimensional computing (HDC). The goal is to create a decoding method with high accuracy, high noise robustness, low latency and low energy usage. Compared to analogous architectures decoded with existing approaches, the presented SNN-HDC model attains generally better classification accuracy, lower classification latency and lower estimated energy consumption on multiple test cases from literature. The SNN-HDC achieved estimated energy consumption reductions ranging from 1.24x to 3.67x on the DvsGesture dataset and from 1.38x to 2.27x on the SL-Animals-DVS dataset. The presented decoding method can also efficiently identify unknown classes it has not been trained on. In the DvsGesture dataset the SNN-HDC model can identify 100% of samples from an unseen/untrained class. Given the numerous benefits shown and discussed in this paper, this decoding method represents a very compelling alternative to both rate and latency decoding.
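The flavor of HDC decoding can be sketched in numpy: assign each spiking neuron a random bipolar hypervector, bundle them weighted by spike counts into a query vector, and classify by cosine similarity to class prototypes, with a similarity threshold rejecting unknown classes. The dimensions, the centering of spike counts, and the threshold are illustrative choices, not the paper’s design.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_neurons, n_classes = 4096, 32, 3  # hypervector dim and toy sizes

# One random bipolar "item" hypervector per spiking neuron.
items = rng.choice([-1, 1], size=(n_neurons, D))

def encode(spike_counts):
    """Bundle item vectors weighted by centered spike counts, then binarize."""
    c = spike_counts - spike_counts.mean()
    return np.sign(c @ items + 1e-9)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Class prototypes from idealized spike patterns (toy stand-in for training).
patterns = rng.poisson(5.0, size=(n_classes, n_neurons)).astype(float)
prototypes = np.array([encode(p) for p in patterns])

def decode(spike_counts, threshold=0.15):
    """Return the nearest class, or -1 if nothing is similar enough
    (this is how an unseen/untrained class can be flagged)."""
    hv = encode(spike_counts)
    sims = [cosine(hv, p) for p in prototypes]
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else -1

noisy = patterns[1] + rng.poisson(1.0, n_neurons)  # noisy sample of class 1
pred = decode(noisy)
unknown = decode(np.zeros(n_neurons))              # input matching no class
```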
[365] DeepProofLog: Efficient Proving in Deep Stochastic Logic Programs
Ying Jiao, Rodrigo Castellano Ontiveros, Luc De Raedt, Marco Gori, Francesco Giannini, Michelangelo Diligenti, Giuseppe Marra
Main category: cs.AI
TL;DR: DeepProofLog (DPrL) is a novel neurosymbolic system using stochastic logic programs with neural parameterization, enabling scalable inference and learning via MDP mapping and reinforcement learning techniques.
Details
Motivation: Address scalability limitations in neurosymbolic AI systems that combine neural architectures with symbolic reasoning, as existing methods often sacrifice scalability for accuracy and interpretability.
Method: Parameterizes all derivation steps with neural networks for efficient neural guidance, establishes formal mapping between resolution process and Markov Decision Processes, applies dynamic programming and reinforcement learning for inference and learning.
Result: Outperforms existing state-of-the-art neurosymbolic systems on standard benchmarks and knowledge graph reasoning tasks, achieving better scalability to larger and more complex settings.
Conclusion: DPrL successfully addresses scalability limitations in neurosymbolic AI through neural parameterization and MDP-based optimization, enabling practical application to complex proof spaces and large knowledge bases.
Abstract: Neurosymbolic (NeSy) AI aims to combine the strengths of neural architectures and symbolic reasoning to improve the accuracy, interpretability, and generalization capability of AI models. While logic inference on top of subsymbolic modules has been shown to effectively guarantee these properties, this often comes at the cost of reduced scalability, which can severely limit the usability of NeSy models. This paper introduces DeepProofLog (DPrL), a novel NeSy system based on stochastic logic programs, which addresses the scalability limitations of previous methods. DPrL parameterizes all derivation steps with neural networks, allowing efficient neural guidance over the proving system. Additionally, we establish a formal mapping between the resolution process of our deep stochastic logic programs and Markov Decision Processes, enabling the application of dynamic programming and reinforcement learning techniques for efficient inference and learning. This theoretical connection improves scalability for complex proof spaces and large knowledge bases. Our experiments on standard NeSy benchmarks and knowledge graph reasoning tasks demonstrate that DPrL outperforms existing state-of-the-art NeSy systems, advancing scalability to larger and more complex settings than previously possible.
[366] Simulating the Visual World with Artificial Intelligence: A Roadmap
Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan, Ziwei Liu
Main category: cs.AI
TL;DR: Video generation is evolving from creating visually appealing clips to building interactive virtual environments with physical plausibility, positioning video foundation models as implicit world models that simulate physical dynamics and interactions.
Details
Motivation: The shift from pure visual generation to interactive virtual environments that maintain physical plausibility points toward the need for video foundation models that can function as implicit world models for simulating real or imagined worlds.
Method: Conceptualizes modern video foundation models as combining an implicit world model (encoding physical laws, interaction dynamics, agent behavior) with a video renderer that transforms latent simulations into realistic visual observations.
Result: The survey traces video generation through four generations with advancing capabilities, culminating in world models with intrinsic physical plausibility, real-time multimodal interaction, and multi-scale planning capabilities applicable to robotics, autonomous driving, and gaming.
Conclusion: The evolution represents a fundamental shift toward video foundation models as world models, with future challenges including the role of agent intelligence in shaping and evaluating these systems.
Abstract: The landscape of video generation is shifting, from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models, models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a “window” into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, ultimately culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.
[367] How Artificial Intelligence Leads to Knowledge Why: An Inquiry Inspired by Aristotle’s Posterior Analytics
Guus Eelink, Kilian Rückschloß, Felix Weitkämper
Main category: cs.AI
TL;DR: This paper introduces a theoretical framework for causal systems to distinguish between ‘knowledge that’ and ‘knowledge why’, arguing that predicting intervention effects requires knowledge why.
Details
Motivation: There is a lack of formal theory characterizing the knowledge needed to predict effects of external interventions, despite Bayesian networks and causal models being used for such tasks.
Method: Introduces the theoretical framework of causal systems to clarify Aristotle’s distinction between knowledge that and knowledge why, and interprets existing AI technologies as causal systems.
Result: The framework provides a more precise understanding of the knowledge necessary for predicting effects of external interventions.
Conclusion: Predicting the effects of external interventions is feasible only with knowledge why, not just knowledge that.
Abstract: Bayesian networks and causal models provide frameworks for handling queries about external interventions and counterfactuals, enabling tasks that go beyond what probability distributions alone can address. While these formalisms are often informally described as capturing causal knowledge, there is a lack of a formal theory characterizing the type of knowledge required to predict the effects of external interventions. This work introduces the theoretical framework of causal systems to clarify Aristotle’s distinction between knowledge that and knowledge why within artificial intelligence. By interpreting existing artificial intelligence technologies as causal systems, it investigates the corresponding types of knowledge. Furthermore, it argues that predicting the effects of external interventions is feasible only with knowledge why, providing a more precise understanding of the knowledge necessary for such tasks.
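The claim that the observational distribution alone (knowledge that) cannot determine intervention effects has a classic two-variable illustration: a model where X causes Y and one where Y causes X can induce identical joint distributions yet disagree about do(X=1). A small simulation, with both mechanisms invented for the example:

```python
import random

random.seed(0)
N = 100_000

def model_A():
    """X causes Y: Y copies X."""
    x = random.random() < 0.5
    return x, x

def model_B():
    """Y causes X: X copies Y. Same joint distribution as model_A."""
    y = random.random() < 0.5
    return y, y

def do_x1_A():
    """Intervene do(X=1) in model_A: Y still copies the (forced) X."""
    return True, True

def do_x1_B():
    """Intervene do(X=1) in model_B: forcing X leaves Y's mechanism alone."""
    y = random.random() < 0.5
    return True, y

p_y_obs_A = sum(y for _, y in (model_A() for _ in range(N))) / N
p_y_do_A = sum(y for _, y in (do_x1_A() for _ in range(N))) / N
p_y_do_B = sum(y for _, y in (do_x1_B() for _ in range(N))) / N
```

Observationally both models give P(Y=1) = 0.5, but under do(X=1) model_A yields P(Y=1) = 1 while model_B still yields 0.5: distinguishing them requires knowing why X and Y covary, not just that they do.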
[368] Emergence of Goal-Directed Behaviors via Active Inference with Self-Prior
Dongmin Kim, Hoshinori Kanazawa, Naoto Yoshida, Yasuo Kuniyoshi
Main category: cs.AI
TL;DR: A computational model using ‘self-prior’ density model within active inference framework enables agents to spontaneously reach for tactile stimuli without external rewards, mimicking infant goal-directed behavior.
Details
Motivation: To understand how intrinsically motivated behaviors emerge in infants during early development without external reward criteria, focusing on spontaneous exploration and learning.
Method: Proposed a ‘self-prior’ density model for multimodal sensory experiences integrated within active inference framework, generating behavioral references by minimizing mismatches between past and current sensory experiences.
Result: The agent spontaneously reached toward tactile stimuli in simulated environment, demonstrating emergence of intentional behavior shaped by the agent’s own sensory experiences.
Conclusion: The self-prior mechanism successfully induces goal-directed behavior intrinsically, analogous to body schema acquisition, showing how intentional behavior can emerge spontaneously during early development.
Abstract: Infants often exhibit goal-directed behaviors, such as reaching for a sensory stimulus, even when no external reward criterion is provided. These intrinsically motivated behaviors facilitate spontaneous exploration and learning of the body and environment during early developmental stages. Although computational modeling can offer insight into the mechanisms underlying such behaviors, many existing studies on intrinsic motivation focus primarily on how exploration contributes to acquiring external rewards. In this paper, we propose a novel density model for an agent’s own multimodal sensory experiences, called the “self-prior,” and investigate whether it can autonomously induce goal-directed behavior. Integrated within an active inference framework based on the free energy principle, the self-prior generates behavioral references purely from an intrinsic process that minimizes mismatches between average past sensory experiences and current observations. This mechanism is also analogous to the acquisition and utilization of a body schema through continuous interaction with the environment. We examine this approach in a simulated environment and confirm that the agent spontaneously reaches toward a tactile stimulus. Our study implements intrinsically motivated behavior shaped by the agent’s own sensory experiences, demonstrating the spontaneous emergence of intentional behavior during early development.
[369] SOCIA-∇: Textual Gradient Meets Multi-Agent Orchestration for Automated Simulator Generation
Yuncheng Hua, Sion Weatherhead, Mehdi Jafari, Hao Xue, Flora D. Salim
Main category: cs.AI
TL;DR: SOCIA-∇ is an end-to-end agentic framework that treats simulator construction as code optimization using LLM-driven agents in a computation graph with a loss-driven loop of code synthesis, execution, evaluation, and repair.
Details
Motivation: To convert brittle prompt pipelines into reproducible, constraint-aware simulator code generation that scales across domains and simulation granularities, minimizing expert effort.
Method: Uses specialized LLM-driven agents as graph nodes with a workflow manager executing a loss-driven loop (code synthesis -> execution -> evaluation -> code repair) and performs Textual-Gradient Descent (TGD) optimization.
Result: Achieves state-of-the-art overall accuracy across three CPS tasks: User Modeling, Mask Adoption, and Personal Mobility.
Conclusion: SOCIA-∇ successfully unifies multi-agent orchestration with loss-aligned optimization to create reproducible simulator code generation that scales across domains.
Abstract: In this paper, we present SOCIA-∇, an end-to-end, agentic framework that treats simulator construction as instance optimization over code within a textual computation graph. Specialized LLM-driven agents are embedded as graph nodes, and a workflow manager executes a loss-driven loop: code synthesis -> execution -> evaluation -> code repair. The optimizer performs Textual-Gradient Descent (TGD), while human-in-the-loop interaction is reserved for task-spec confirmation, minimizing expert effort and keeping the code itself as the trainable object. Across three CPS tasks, i.e., User Modeling, Mask Adoption, and Personal Mobility, SOCIA-∇ attains state-of-the-art overall accuracy. By unifying multi-agent orchestration with a loss-aligned optimization view, SOCIA-∇ converts brittle prompt pipelines into reproducible, constraint-aware simulator code generation that scales across domains and simulation granularities. We will release the code soon.
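The loss-driven loop (code synthesis -> execution -> evaluation -> code repair) can be sketched with a stub standing in for the LLM optimizer. The buggy seed program and the canned textual repair below are invented so the loop runs end to end; a real TGD optimizer would generate the repair from a natural-language critique of the failure.

```python
# Test cases play the role of the simulator's evaluation signal.
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 4), 3)]

candidate = "def add(a, b):\n    return a - b\n"  # buggy initial synthesis

def evaluate(code):
    """Execute the candidate and return the fraction of failing tests
    (the 'loss' driving the loop)."""
    ns = {}
    exec(code, ns)
    fails = sum(ns["add"](*args) != out for args, out in tests)
    return fails / len(tests)

def propose_repair(code, loss):
    """Stand-in for the LLM's textual gradient step: here, a canned rewrite
    of the offending expression."""
    return code.replace("a - b", "a + b")

loss = evaluate(candidate)
for _ in range(5):  # bounded optimization loop
    if loss == 0:
        break
    candidate = propose_repair(candidate, loss)
    loss = evaluate(candidate)
```

The key idea is that the code string itself is the trainable object: each iteration executes it, scores it, and feeds the score back into a textual edit.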
[370] ReflecSched: Solving Dynamic Flexible Job-Shop Scheduling via LLM-Powered Hierarchical Reflection
Shijie Cao, Yuan Yuan
Main category: cs.AI
TL;DR: ReflecSched is a framework that uses LLMs for dynamic flexible job-shop scheduling by having them analyze heuristic simulations and generate strategic summaries, overcoming direct LLM limitations like myopic decision-making.
Details
Motivation: Traditional scheduling rules are rigid, deep learning requires feature engineering, and direct LLM applications suffer from long-context paradox, underutilization of heuristics, and myopic decision-making.
Method: Empowers LLM as strategic analyzer rather than direct scheduler - analyzes heuristic-driven simulations across multiple planning horizons to distill “Strategic Experience” summaries, which guide final decision-making.
Result: Achieves superior performance with average RPD of 6.04% and rank of 3.18, significantly outperforming traditional and learning-based methods, with 71.35% Win Rate over direct LLM baselines and 15.1% token efficiency improvement.
Conclusion: ReflecSched’s reflection mechanism effectively mitigates LLM pitfalls, performs on par with oracle-like strategies, and demonstrates robust performance through contrastive experience generation.
Abstract: The NP-hard Dynamic Flexible Job-Shop Scheduling (DFJSP) problem involves real-time events and complex routing. While traditional rules are efficient but rigid, deep learning is opaque and requires feature engineering. Large Language Models (LLMs) promise adaptive reasoning without this engineering overhead, yet we find their direct application is suboptimal. Baseline LLMs suffer from three key pitfalls: the long-context paradox, where crucial data is underutilized; an underutilization of expert heuristics; and myopic decision-making. To address this, we propose ReflecSched, a framework that empowers the LLM beyond a direct scheduler by equipping it with a strategic analysis capability. ReflecSched tasks the LLM to analyze heuristic-driven simulations across multiple planning horizons and distill them into a concise, natural-language summary termed “Strategic Experience”. This summary is then integrated into the prompt of a final decision-making module, guiding it to produce non-myopic actions. Experiments demonstrate ReflecSched achieves superior performance, with its best variants attaining an average RPD of 6.04% and rank of 3.18, significantly outperforming strong traditional and learning-based methods. It also statistically and decisively surpasses direct LLM baselines, securing a 71.35% Win Rate while being, on average, 15.1% more token-efficient on Normal-scale problems. Ablation studies attribute this performance to a robust reflection mechanism that leverages high-quality, contrastive experience. This mechanism mitigates key LLM pitfalls like myopic greed, enabling ReflecSched to outperform all evaluated heuristics. Ultimately, the framework’s performance is statistically on par with an oracle-like strategy, showcasing its effectiveness and robustness.
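The core idea of simulating heuristics and distilling the outcome into guidance can be miniaturized: run two dispatch rules on a toy single-machine queue, record which wins, and let that summary drive the final choice. The jobs and the natural-language "experience" string are invented; the real system has an LLM reflect over full DFJSP simulations.

```python
jobs = [("j1", 7), ("j2", 2), ("j3", 5), ("j4", 1)]  # (name, processing time)

def total_flow_time(order):
    """Sum of completion times when jobs run back to back on one machine."""
    t = total = 0
    for _, p in order:
        t += p
        total += t
    return total

# Simulate two classic dispatch heuristics.
fifo = total_flow_time(jobs)                             # first-in-first-out
spt = total_flow_time(sorted(jobs, key=lambda j: j[1]))  # shortest processing time

# Distill the simulations into a concise "experience" summary that then
# guides the final decision, instead of deciding myopically per job.
experience = ("SPT beat FIFO in simulation; prefer short jobs first."
              if spt < fifo else "FIFO sufficed in simulation.")
chosen_rule = "SPT" if spt < fifo else "FIFO"
```

On this instance SPT gives a total flow time of 27 versus 45 for FIFO, so the distilled experience steers the scheduler toward the non-myopic rule.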
[371] LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval
Yaoze Zhang, Rong Wu, Pinlong Cai, Xiaoman Wang, Guohang Yan, Song Mao, Ding Wang, Botian Shi
Main category: cs.AI
TL;DR: LeanRAG is a knowledge graph-based RAG framework that addresses semantic isolation and inefficient retrieval by creating explicit relations between entity clusters and using structure-guided retrieval to reduce redundancy by 46% while improving response quality.
Details
Motivation: Current knowledge graph-based RAG methods suffer from disconnected semantic islands in hierarchical structures and inefficient flat retrieval that fails to leverage graph topology, compromising grounding effectiveness.
Method: Uses semantic aggregation to form entity clusters with explicit relations, creating a navigable semantic network, followed by bottom-up structure-guided retrieval that anchors queries to fine-grained entities and traverses semantic pathways.
Result: Significantly outperforms existing methods on four QA benchmarks across different domains while reducing retrieval redundancy by 46%.
Conclusion: LeanRAG effectively overcomes limitations of hierarchical knowledge graph RAG by creating connected semantic networks and structure-aware retrieval, achieving superior performance with reduced computational overhead.
Abstract: Retrieval-Augmented Generation (RAG) plays a crucial role in grounding Large Language Models by leveraging external knowledge, but its effectiveness is often compromised by the retrieval of contextually flawed or incomplete information. To address this, knowledge graph-based RAG methods have evolved towards hierarchical structures, organizing knowledge into multi-level summaries. However, these approaches still suffer from two critical, unaddressed challenges: high-level conceptual summaries exist as disconnected "semantic islands", lacking the explicit relations needed for cross-community reasoning; and the retrieval process itself remains structurally unaware, often degenerating into an inefficient flat search that fails to exploit the graph's rich topology. To overcome these limitations, we introduce LeanRAG, a framework that features a deeply collaborative design combining knowledge aggregation and retrieval strategies. LeanRAG first employs a novel semantic aggregation algorithm that forms entity clusters and constructs new explicit relations among aggregation-level summaries, creating a fully navigable semantic network. Then, a bottom-up, structure-guided retrieval strategy anchors queries to the most relevant fine-grained entities and systematically traverses the graph's semantic pathways to gather concise yet contextually comprehensive evidence sets. This design mitigates the substantial overhead of path retrieval on graphs and minimizes redundant information retrieval. Extensive experiments on four challenging QA benchmarks across different domains demonstrate that LeanRAG significantly outperforms existing methods in response quality while reducing retrieval redundancy by 46%. Code is available at: https://github.com/RaZzzyz/LeanRAG
[372] Uncertainty-driven Adaptive Exploration
Leonidas Bakopoulos, Georgios Chalkiadakis
Main category: cs.AI
TL;DR: A generic adaptive exploration framework that uses uncertainty to determine when to switch between exploration and exploitation phases in reinforcement learning.
Details
Motivation: To address the critical question of when to switch between exploration and exploitation in domains requiring learning of long and complex action sequences, which is important for adaptive exploration methods.
Method: Proposes a generic adaptive exploration framework that employs uncertainty measures to determine switching moments between exploration and exploitation phases. The framework can incorporate various uncertainty-measuring mechanisms from intrinsic motivation or epistemic uncertainty-based methods.
Result: Experimental results show that the framework gives rise to adaptive exploration strategies that outperform standard approaches across several MuJoCo environments.
Conclusion: The proposed uncertainty-based adaptive exploration framework provides a principled approach to switching between exploration and exploitation and demonstrates improved performance over standard methods.
Abstract: Adaptive exploration methods propose ways to learn complex policies via alternating between exploration and exploitation. An important question for such methods is to determine the appropriate moment to switch between exploration and exploitation and vice versa. This is critical in domains that require the learning of long and complex sequences of actions. In this work, we present a generic adaptive exploration framework that employs uncertainty to address this important issue in a principled manner. Our framework includes previous adaptive exploration approaches as special cases. Moreover, we can incorporate in our framework any uncertainty-measuring mechanism of choice, for instance mechanisms used in intrinsic motivation or epistemic uncertainty-based exploration methods. We experimentally demonstrate that our framework gives rise to adaptive exploration strategies that outperform standard ones across several MuJoCo environments.
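The switching rule the abstract describes can be sketched as a small threshold gate; the uncertainty signal could come from any of the mechanisms the paper mentions (intrinsic motivation, epistemic uncertainty), and the class and threshold below are illustrative:

```python
# Minimal sketch of uncertainty-driven phase switching: keep exploring
# while the uncertainty signal is high, exploit once it drops below a
# threshold. Names and threshold values are illustrative, not the paper's.

class UncertaintyGate:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.exploring = True

    def update(self, uncertainty: float) -> str:
        # High uncertainty -> keep exploring; low uncertainty -> exploit.
        self.exploring = uncertainty >= self.threshold
        return "explore" if self.exploring else "exploit"

gate = UncertaintyGate(threshold=0.5)
phase = gate.update(0.8)  # -> "explore"
phase = gate.update(0.2)  # -> "exploit"
```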
[373] A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services
Guanzhong Pan, Vishal Chodnekar, Abinas Roy, Haibo Wang
Main category: cs.AI
TL;DR: This paper provides a cost-benefit analysis framework to help organizations decide between commercial LLM services and on-premise deployment, identifying breakeven points based on usage levels and performance needs.
Details
Motivation: Organizations face a critical decision between using commercial LLM services (convenient but with privacy concerns, vendor lock-in, and long-term costs) versus local deployment of open-source models (addressing privacy and cost concerns but requiring infrastructure investment).
Method: The authors developed a cost-benefit analysis framework that considers hardware requirements, operational expenses, and performance benchmarks of leading open-source models (Qwen, Llama, Mistral, etc.), then compared total local deployment costs with major cloud providers' subscription fees.
Result: The study provides estimated breakeven points that help organizations determine when on-premise LLM deployment becomes economically viable compared to commercial subscription services, based on specific usage levels and performance requirements.
Conclusion: The framework offers organizations a practical tool for planning their LLM strategies by quantifying the economic trade-offs between cloud services and local deployment, enabling data-driven decisions about AI infrastructure investments.
Abstract: Large language models (LLMs) are becoming increasingly widespread. Organizations that want to use AI for productivity now face an important decision. They can subscribe to commercial LLM services or deploy models on their own infrastructure. Cloud services from providers such as OpenAI, Anthropic, and Google are attractive because they provide easy access to state-of-the-art models and are easy to scale. However, concerns about data privacy, the difficulty of switching service providers, and long-term operating costs have driven interest in local deployment of open-source models. This paper presents a cost-benefit analysis framework to help organizations determine when on-premise LLM deployment becomes economically viable compared to commercial subscription services. We consider the hardware requirements, operational expenses, and performance benchmarks of the latest open-source models, including Qwen, Llama, and Mistral. We then compare the total cost of deploying these models locally with major cloud providers' subscription fees. Our findings provide an estimated breakeven point based on usage levels and performance needs. These results give organizations a practical framework for planning their LLM strategies.
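The breakeven logic in the abstract reduces to comparing amortized hardware plus operating cost against subscription spend. A back-of-the-envelope sketch in that spirit; every number below is a placeholder, not a figure from the paper:

```python
# Back-of-the-envelope breakeven: at how many subscription seats does
# on-premise deployment become cheaper? All inputs are placeholders.

def monthly_on_prem_cost(hardware_cost: float, amortization_months: int,
                         monthly_opex: float) -> float:
    """Amortized hardware cost plus monthly operating expenses."""
    return hardware_cost / amortization_months + monthly_opex

def breakeven_seats(hardware_cost: float, amortization_months: int,
                    monthly_opex: float, subscription_per_seat: float) -> float:
    """Seat count at which local deployment matches subscription spend."""
    return monthly_on_prem_cost(hardware_cost, amortization_months,
                                monthly_opex) / subscription_per_seat

# e.g. a $60k server amortized over 36 months plus $800/mo ops,
# versus $25/seat/month subscriptions:
seats = breakeven_seats(60_000, 36, 800, 25)  # ~98.7 seats
```

Above roughly 99 seats of equivalent usage, the hypothetical on-premise setup would be the cheaper option; real analyses (as in the paper) must also account for throughput and model-quality requirements.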
[374] How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, Huchuan Lu
Main category: cs.AI
TL;DR: Systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models, introducing a three-level capability framework and SIBench benchmark, revealing significant gaps between perception and reasoning tasks.
Details
Motivation: VSR is a core human cognitive ability critical for advancing embodied intelligence and autonomous systems, but current VLMs struggle with representing and reasoning over three-dimensional space.
Method: Systematic review of VSR methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Categorizes spatial intelligence into three levels (basic perception, spatial understanding, spatial planning) and creates SIBench benchmark with 20 datasets across 23 tasks.
Result: State-of-the-art VLMs show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination.
Conclusion: Substantial challenges remain in achieving spatial intelligence, but the study provides both a systematic roadmap and comprehensive benchmark to drive future research in VSR.
Abstract: Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.
[375] Clinical Uncertainty Impacts Machine Learning Evaluations
Simone Lionetti, Fabian Gröger, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Alexander A. Navarini, Marc Pouly
Main category: cs.AI
TL;DR: The paper argues for using probabilistic metrics that account for annotation uncertainty in medical imaging evaluations, instead of traditional aggregation methods like majority voting.
Details
Motivation: Clinical dataset labels often have uncertainty due to annotator disagreement and varying confidence levels, which typical aggregation procedures obscure, potentially affecting model rankings.
Method: Proposes using probabilistic metrics that operate directly on distributions of annotations, applicable regardless of how annotations are generated (counting, confidence ratings, or probabilistic models).
Result: In experiments on medical imaging benchmarks, accounting for label confidence significantly impacts model rankings, showing the importance of uncertainty-aware evaluation.
Conclusion: The community should release raw annotations and adopt uncertainty-aware evaluation methods to better reflect the reality of clinical data in performance estimates.
Abstract: Clinical dataset labels are rarely certain as annotators disagree and confidence is not uniform across cases. Typical aggregation procedures, such as majority voting, obscure this variability. In simple experiments on medical imaging benchmarks, accounting for the confidence in binary labels significantly impacts model rankings. We therefore argue that machine-learning evaluations should explicitly account for annotation uncertainty using probabilistic metrics that directly operate on distributions. These metrics can be applied independently of the annotations’ generating process, whether modeled by simple counting, subjective confidence ratings, or probabilistic response models. They are also computationally lightweight, as closed-form expressions have linear-time implementations once examples are sorted by model score. We thus urge the community to release raw annotations for datasets and to adopt uncertainty-aware evaluation so that performance estimates may better reflect clinical data.
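One simple instance of an uncertainty-aware metric of the kind argued for here is expected accuracy under soft labels, where each example carries the fraction of annotators who assigned the positive class rather than a majority-vote label. This is an illustrative formulation, not necessarily the paper's exact one:

```python
# Expected accuracy under soft binary labels: score a prediction by the
# probability that a randomly chosen annotator would agree with it.
# Illustrative sketch; the paper's metrics may be formulated differently.

def expected_accuracy(preds: list[int], label_probs: list[float]) -> float:
    """Mean agreement probability between predictions and the label distribution."""
    total = 0.0
    for y_hat, p in zip(preds, label_probs):
        total += p if y_hat == 1 else (1.0 - p)
    return total / len(preds)

# A confident positive (p=0.9) predicted 1, and a split case (p=0.5)
# predicted 0: (0.9 + 0.5) / 2 = 0.7
acc = expected_accuracy([1, 0], [0.9, 0.5])  # -> 0.7
```

Under majority voting both examples would count as simply right or wrong; the soft metric credits the split case at 0.5 either way, which is how label confidence can reorder model rankings.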
[376] PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning
Hieu Tran, Zonghai Yao, Nguyen Luong Tran, Zhichao Yang, Feiyun Ouyang, Shuo Han, Razieh Rahimi, Hong Yu
Main category: cs.AI
TL;DR: PRIME is a multi-agent reasoning framework that integrates fast System 1 thinking and deliberate System 2 thinking to enhance LLM reasoning capabilities, enabling open-source models to compete with state-of-the-art closed-source models.
Details
Motivation: Inspired by the dual-process theory of human cognition from Thinking, Fast and Slow, the authors aim to create a framework that mimics human cognitive processes by dynamically integrating intuitive and deliberate thinking modes.
Method: PRIME employs a Quick Thinking Agent (System 1) for rapid answers, and if uncertainty is detected, triggers a structured System 2 pipeline with specialized agents for planning, hypothesis generation, retrieval, information integration, and decision-making.
Result: Experimental results with LLaMA 3 models show that PRIME enables open-source LLMs to perform competitively with state-of-the-art closed-source models like GPT-4 and GPT-4o on benchmarks requiring multi-hop and knowledge-grounded reasoning.
Conclusion: PRIME establishes a scalable solution for improving LLMs in domains requiring complex, knowledge-intensive reasoning by faithfully mimicking human cognitive processes and enhancing both efficiency and accuracy.
Abstract: Inspired by the dual-process theory of human cognition from "Thinking, Fast and Slow", we introduce PRIME (Planning and Retrieval-Integrated Memory for Enhanced Reasoning), a multi-agent reasoning framework that dynamically integrates System 1 (fast, intuitive thinking) and System 2 (slow, deliberate thinking). PRIME first employs a Quick Thinking Agent (System 1) to generate a rapid answer; if uncertainty is detected, it then triggers a structured System 2 reasoning pipeline composed of specialized agents for planning, hypothesis generation, retrieval, information integration, and decision-making. This multi-agent design faithfully mimics human cognitive processes and enhances both efficiency and accuracy. Experimental results with LLaMA 3 models demonstrate that PRIME enables open-source LLMs to perform competitively with state-of-the-art closed-source models like GPT-4 and GPT-4o on benchmarks requiring multi-hop and knowledge-grounded reasoning. This research establishes PRIME as a scalable solution for improving LLMs in domains requiring complex, knowledge-intensive reasoning.
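The control flow described in the abstract (fast answer first, staged pipeline on high uncertainty) can be sketched as a simple dispatcher; the agent callables and threshold below are hypothetical stubs, not PRIME's actual interfaces:

```python
# Sketch of PRIME-style System 1 / System 2 dispatch: a quick agent answers
# first, and a staged pipeline (planning -> hypotheses -> retrieval ->
# integration -> decision) takes over when uncertainty is high.
# All agent signatures here are hypothetical.
from typing import Callable

def answer(question: str,
           quick_agent: Callable[[str], tuple[str, float]],
           system2_stages: list[Callable[[str, str], str]],
           uncertainty_threshold: float = 0.5) -> str:
    result, uncertainty = quick_agent(question)      # System 1
    if uncertainty < uncertainty_threshold:
        return result                                # confident: stop here
    state = result
    for stage in system2_stages:                     # System 2, stage by stage
        state = stage(question, state)
    return state

quick = lambda q: ("draft", 0.9)                     # low-confidence stub
refine = lambda q, s: s + "+refined"
final = answer("q", quick, [refine])                 # -> "draft+refined"
```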
[377] AgentFlux: Decoupled Fine-Tuning & Inference for On-Device Agentic Systems
Rohan Kadekodi, Zhan Jin, Keisuke Kamahori, Yile Gu, Sean Khatiri, Noah H. Bayindirli, Sergey Gorbunov, Baris Kasikci
Main category: cs.AI
TL;DR: DualTune improves local LLM tool-calling by decoupling tool selection and argument generation using specialized LoRA adapters, achieving 46% accuracy improvement on Qwen-2.5-7B.
Details
Motivation: Local LLMs underperform frontier models in tool calling, struggling with tool selection from large sets and accurate argument generation for complex parameters, requiring privacy-preserving on-device solutions.
Method: Decoupled fine-tuning using LoRA adapters for tool selection and argument generation with separate loss masking, plus DualTune inference framework with hierarchical orchestration and dynamic adapter loading.
Result: Qwen-2.5-7B model improved tool calling accuracy by 46%, outperforming similar-sized models and often larger models (2x size) on MCP-Bench benchmark.
Conclusion: The decoupled approach effectively addresses local LLM limitations in tool calling, enabling efficient on-device agent orchestration with significant performance improvements.
Abstract: The deployment of Large Language Models (LLMs) as agentic orchestrators has revolutionized task automation, but the need for privacy-preserving, cost-effective solutions demands on-device inference capabilities. However, local LLMs consistently underperform compared to frontier models in tool calling scenarios, struggling with both tool selection from large tool sets and accurate argument generation for complex parameter structures. We introduce a methodology that disaggregates a tool-calling task into two distinct subtasks: tool selection and argument generation. We propose “decoupled fine-tuning”, a novel post-training approach that employs LoRA fine-tuning to create dedicated LoRA adapters for tool selection and tool-specific argument generation using separate loss masking for each of the subtasks. Furthermore, we present DualTune, an inference framework that leverages the LoRA adapters created using decoupled fine-tuning to perform efficient agent orchestration with the help of local models on end-user devices. DualTune decomposes the tool-call generation step into tool selection and argument generation, and dynamically loads the corresponding LoRA adapters to generate tool calls. Additionally, DualTune implements hierarchical orchestration to restrict the number of tools required for tool selection. Our experiments on the MCP-Bench benchmark demonstrate that the Qwen-2.5-7B model trained using decoupled fine-tuning improves the tool calling accuracy of the base model by 46%, and outperforms other local reasoning, non-reasoning and fine-tuned models of similar size in all cases, and models that are 2x larger, in most cases.
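The "separate loss masking" at the heart of decoupled fine-tuning means the same tool-call sequence trains two adapters, each with the loss restricted to its own subtask's tokens. A minimal pure-Python sketch of that masking, with hypothetical token spans:

```python
# Sketch of subtask loss masking: given per-token log-probabilities for a
# tool call, the selection adapter averages loss over only the tool-name
# tokens, and the argument adapter over only the argument tokens.
# Token spans and log-probs below are hypothetical.
import math

def masked_nll(token_logprobs: list[float], mask: list[int]) -> float:
    """Average negative log-likelihood over only the masked (mask == 1) positions."""
    num = sum(-lp for lp, m in zip(token_logprobs, mask) if m)
    den = max(sum(mask), 1)
    return num / den

# For a call like search(query="cats"): first two tokens name the tool,
# last two carry the argument.
logprobs = [math.log(0.5)] * 4
tool_mask, arg_mask = [1, 1, 0, 0], [0, 0, 1, 1]
selection_loss = masked_nll(logprobs, tool_mask)  # loss from tool tokens only
argument_loss = masked_nll(logprobs, arg_mask)    # loss from argument tokens only
```

Each loss would drive updates for its own LoRA adapter, which is what lets inference load the selection adapter and argument adapter separately.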
[378] DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry
Changti Wu, Shijie Lian, Zihao Liu, Lei Zhang, Laurence Tianruo Yang, Kai Chen
Main category: cs.AI
TL;DR: DynaSolidGeo is a dynamic benchmark for evaluating spatial reasoning in Vision-Language Models, addressing limitations of existing 2D-focused benchmarks by incorporating 3D solid geometry problems with process evaluation.
Details
Motivation: Existing multimodal math reasoning benchmarks focus on 2D geometry, use static datasets prone to contamination, and evaluate only final answers without considering reasoning processes.
Method: Created through semi-automatic annotation pipeline with 503 expert-curated seed questions that can dynamically generate unlimited multimodal instances, incorporating process evaluation with expert-annotated reasoning chains.
Result: Experiments show large performance gaps across VLMs, severe degradation in dynamic settings, and poor performance on high-level spatial intelligence tasks like mental rotation and visualization.
Conclusion: DynaSolidGeo addresses critical gaps in spatial reasoning evaluation and reveals significant limitations in current VLMs’ spatial mathematical reasoning capabilities.
Abstract: Solid geometry problem solving demands spatial mathematical reasoning that integrates spatial intelligence and symbolic reasoning. However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. To address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spatial reasoning in Vision-Language Models (VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo contains 503 expert-curated seed questions that can, in principle, dynamically generate an unbounded number of diverse multimodal text-visual instances. Beyond answer accuracy, we incorporate process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence. Experiments across representative open-source and closed-source VLMs reveal large performance gaps, severe degradation in dynamic settings, and poor performance on tasks requiring high-level spatial intelligence, such as mental rotation and visualization. The code and dataset are available at https://zgca-ai4edu.github.io/DynaSolidGeo/.
[379] A Survey of AI Scientists
Guiyao Tie, Pan Zhou, Lichao Sun
Main category: cs.AI
TL;DR: This survey provides a systematic framework for AI scientists - autonomous systems that emulate the complete scientific workflow from hypothesis generation to paper publication, analyzing their evolution and current challenges.
Details
Motivation: The rapid proliferation of AI scientist systems has created a fragmented research landscape, obscuring methodological principles and developmental trends that need systematic synthesis.
Method: Introduces a unified six-stage methodological framework deconstructing the scientific process: Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, and Paper Generation.
Result: Charts the field’s evolution from Foundational Modules (2022-2023) to Closed-Loop Systems (2024) to current focus on Scalability, Impact, and Human-AI Collaboration (2025-present).
Conclusion: Provides a critical roadmap for overcoming challenges in robustness and governance to guide next-generation systems toward becoming trustworthy partners in human scientific inquiry.
Abstract: Artificial intelligence is undergoing a profound transition from a computational instrument to an autonomous originator of scientific knowledge. This emerging paradigm, the AI scientist, is architected to emulate the complete scientific workflow-from initial hypothesis generation to the final synthesis of publishable findings-thereby promising to fundamentally reshape the pace and scale of discovery. However, the rapid and unstructured proliferation of these systems has created a fragmented research landscape, obscuring overarching methodological principles and developmental trends. This survey provides a systematic and comprehensive synthesis of this domain by introducing a unified, six-stage methodological framework that deconstructs the end-to-end scientific process into: Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, and Paper Generation. Through this analytical lens, we chart the field’s evolution from early Foundational Modules (2022-2023) to integrated Closed-Loop Systems (2024), and finally to the current frontier of Scalability, Impact, and Human-AI Collaboration (2025-present). By rigorously synthesizing these developments, this survey not only clarifies the current state of autonomous science but also provides a critical roadmap for overcoming remaining challenges in robustness and governance, ultimately guiding the next generation of systems toward becoming trustworthy and indispensable partners in human scientific inquiry.
[380] Chain-of-Thought Hijacking
Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez
Main category: cs.AI
TL;DR: Chain-of-Thought Hijacking is a jailbreak attack that pads harmful requests with long sequences of harmless puzzle reasoning, achieving high attack success rates on major reasoning models by diluting safety mechanisms.
Details
Motivation: To demonstrate that contrary to expectations, increased reasoning computation in large reasoning models can be exploited to bypass safety safeguards rather than strengthen them.
Method: Padding harmful requests with long sequences of benign puzzle reasoning (Chain-of-Thought) to dilute safety checking signals and shift attention away from harmful tokens.
Result: Achieved 99%, 94%, 100%, and 94% attack success rates on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet respectively, far exceeding prior jailbreak methods.
Conclusion: Explicit Chain-of-Thought reasoning can become a jailbreak vector when combined with final-answer cues, revealing vulnerabilities in safety mechanisms of reasoning models.
Abstract: Large reasoning models (LRMs) achieve higher task performance with more inference-time computation, and prior works suggest this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches a 99%, 94%, 100%, and 94% attack success rate (ASR) on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively - far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of our attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning - explicit CoT - can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
[381] EdgeRunner 20B: Military Task Parity with GPT-5 while Running on the Edge
Jack FitzGerald, Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Jonnathan Castillo, Yousif Azami, Sean Bailey, Jeremy Cao, Peter Damianov, Kevin de Haan, Luke Kerbs, Vincent Lu, Joseph Madigan, Jeremy McLaurin, Jonathan Tainer, Dave Anderson, Jonathan Beck, Jamie Cuticello, Colton Malkerson, Tyler Saltsman
Main category: cs.AI
TL;DR: EdgeRunner 20B is a fine-tuned military-optimized model that matches or exceeds GPT-5 performance on military tasks while maintaining general capabilities, enabling deployment on air-gapped edge devices.
Details
Motivation: To create specialized models for data-sensitive military operations that can be deployed locally on edge devices without compromising general AI capabilities.
Method: Fine-tuned gpt-oss-20b on 1.6M high-quality military records and created four new military test sets (combat arms, combat medic, cyber operations, mil-bench-5k) for evaluation.
Result: EdgeRunner 20B matches or exceeds GPT-5 performance on military tasks with 95%+ statistical significance, with minimal regression on general benchmarks except GSM8k in low reasoning setting.
Conclusion: Small, locally-hosted models like EdgeRunner 20B are ideal for military deployment on air-gapped edge devices, providing specialized capabilities without sacrificing general performance.
Abstract: We present EdgeRunner 20B, a fine-tuned version of gpt-oss-20b optimized for military tasks. EdgeRunner 20B was trained on 1.6M high-quality records curated from military documentation and websites. We also present four new test sets: (a) combat arms, (b) combat medic, (c) cyber operations, and (d) mil-bench-5k (general military knowledge). On these military test sets, EdgeRunner 20B matches or exceeds GPT-5 task performance with 95%+ statistical significance, except for the high reasoning setting on the combat medic test set and the low reasoning setting on the mil-bench-5k test set. Versus gpt-oss-20b, there is no statistically-significant regression on general-purpose benchmarks like ARC-C, GPQA Diamond, GSM8k, IFEval, MMLU Pro, or TruthfulQA, except for GSM8k in the low reasoning setting. We also present analyses on hyperparameter settings, cost, and throughput. These findings show that small, locally-hosted models are ideal solutions for data-sensitive operations such as in the military domain, allowing for deployment in air-gapped edge devices.
[382] Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, Hari Balakrishnan
Main category: cs.AI
TL;DR: Glia is an AI architecture that uses LLMs in a multi-agent workflow to autonomously design computer system mechanisms, achieving human-expert performance in distributed GPU cluster management.
Details
Motivation: To explore whether AI can autonomously design computer system mechanisms with human-level creativity and reasoning, moving beyond black-box optimization approaches.
Method: Uses large language models in a human-inspired multi-agent workflow where specialized agents handle reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback.
Result: Produced new algorithms for request routing, scheduling, and auto-scaling in distributed GPU clusters that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior.
Conclusion: Combining reasoning LLMs with structured experimentation enables AI to produce creative and understandable designs for complex systems problems, suggesting AI can match human expertise in system design.
Abstract: Can an AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired, multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning process. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.
[383] Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences
Joshua Ashkinaze, Hua Shen, Sai Avula, Eric Gilbert, Ceren Budak
Main category: cs.AI
TL;DR: The Deep Value Benchmark (DVB) tests if LLMs learn fundamental human values vs surface preferences, finding models generalize values less than chance (30% DVGR) and larger models perform slightly worse.
Details
Motivation: To distinguish whether LLMs learn deep human values or just superficial preferences, which is critical for AI alignment and robust generalization of human intentions.
Method: Uses controlled confounding between deep values and shallow features in training, then breaks correlations in testing to measure Deep Value Generalization Rate (DVGR).
Result: Average DVGR across 9 models is 0.30, all generalize deep values less than chance, and larger models have slightly lower DVGR than smaller models.
Conclusion: Current LLMs fail to robustly learn fundamental human values, instead relying on superficial patterns, highlighting alignment risks.
Abstract: We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features – for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these correlations, presenting choices between (justice, formal language) and (non-maleficence, informal language) options. This design allows us to precisely measure a model’s Deep Value Generalization Rate (DVGR) – the probability of generalizing based on the underlying value rather than the shallow feature. Across 9 different models, the average DVGR is just 0.30. All models generalize deep values less than chance. Larger models have a (slightly) lower DVGR than smaller models. We are releasing our dataset, which was subject to three separate human validation experiments. DVB provides an interpretable measure of a core feature of alignment.
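Given the abstract's definition, the DVGR reduces to a simple fraction over the decorrelated test items: how often the model picks the option that shares the training pairs' deep value rather than the one that shares the shallow feature. An illustrative sketch:

```python
# DVGR as described in the abstract: the probability that the model
# generalizes on the underlying value rather than the shallow feature.
# Each test choice is recorded as "deep" (sided with the deep value)
# or "shallow" (sided with the shallow feature). Illustrative sketch.

def dvgr(choices: list[str]) -> float:
    """Fraction of probe items where the model sided with the deep value."""
    return sum(c == "deep" for c in choices) / len(choices)

# Two of five choices followed the deep value:
rate = dvgr(["deep", "shallow", "shallow", "deep", "shallow"])  # -> 0.4
```

A DVGR of 0.5 would be chance under this two-option design, which is why the paper's reported average of 0.30 indicates models generalize on shallow features more often than on values.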
[384] ScRPO: From Errors to Insights
Lianrui Li, Dakuan Lu, Jiawei Shao, Chi Zhang, Xuelong Li
Main category: cs.AI
TL;DR: ScRPO is a reinforcement learning framework that improves LLMs on math problems through self-reflection and error correction in two stages: trial-and-error learning and self-correction learning.
Details
Motivation: To enhance large language models' performance on challenging mathematical problems by enabling self-improvement through error analysis and correction with limited external feedback.
Method: Two-stage approach: (1) Trial-and-error learning with GRPO to collect incorrect answers in an error pool, (2) Self-correction learning where the model reflects on why previous answers were wrong.
Result: Extensive experiments on multiple math benchmarks show ScRPO consistently outperforms several post-training methods using Deepseek-Distill-Qwen models.
Conclusion: ScRPO is a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way for more reliable AI systems.
Abstract: We propose Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to enhance large language models on challenging mathematical problems by leveraging self-reflection and error correction. Our approach consists of two stages: (1) Trial-and-error learning stage: training the model with GRPO and collecting incorrect answers along with their corresponding questions in an error pool; (2) Self-correction learning stage: guiding the model to reflect on why its previous answers were wrong. We conduct extensive experiments across multiple math reasoning benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, using Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B. The experimental results demonstrate that ScRPO consistently outperforms several post-training methods. These findings highlight ScRPO as a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way toward more reliable and capable AI systems.
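The data plumbing behind ScRPO's two stages can be caricatured in a few lines: pool the policy's mistakes, then turn them into reflection prompts. A toy sketch with all names hypothetical and the GRPO optimization itself omitted:

```python
# Hypothetical sketch of ScRPO's two data-collection stages; the toy
# "model" is just a dict of question -> predicted answer.

def trial_and_error(model_answers, gold):
    """Stage 1: run the policy on questions and pool its mistakes."""
    error_pool = []
    for question, pred in model_answers.items():
        if pred != gold[question]:
            error_pool.append({"question": question, "wrong_answer": pred})
    return error_pool

def self_correction_prompts(error_pool):
    """Stage 2: turn pooled errors into reflection prompts that ask the
    model to explain why its earlier answer was wrong."""
    return [
        f"Q: {e['question']}\nYour earlier answer {e['wrong_answer']} "
        f"was incorrect. Explain the mistake and solve the problem again."
        for e in error_pool
    ]

gold = {"2+2": "4", "3*3": "9"}
pool = trial_and_error({"2+2": "4", "3*3": "6"}, gold)
print(len(pool))  # 1 -- only the wrong 3*3 answer is pooled
print(self_correction_prompts(pool)[0].splitlines()[0])
```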
[385] When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks
Stefano Ferraro, Akihiro Nakano, Masahiro Suzuki, Yutaka Matsuo
Main category: cs.AI
TL;DR: Object-centric world models (OCWM) aim to decompose scenes into object representations for better generalization, but DLPWM, while achieving strong visual modeling, underperforms in control tasks due to representation shift during object interactions.
Details
Motivation: To test if disentangled object-level representations can improve policy performance and compositional generalization in reinforcement learning by localizing task-relevant information.
Method: Introduced DLPWM, a fully unsupervised disentangled object-centric world model that learns object-level latents directly from pixels without supervision.
Result: DLPWM achieved strong reconstruction and prediction performance with robustness to OOD variations, but policies trained on its latents underperformed compared to DreamerV3 due to representation shift during multi-object interactions.
Conclusion: Object-centric perception supports robust visual modeling but achieving stable control requires mitigating latent drift caused by representation shift during object interactions.
Abstract: Object-centric world models (OCWM) aim to decompose visual scenes into object-level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object-level representations, by localizing task-relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. However, when used for downstream model-based control, policies trained on DLPWM latents underperform compared to DreamerV3. Through latent-trajectory analyses, we identify representation shift during multi-object interactions as a key driver of unstable policy learning. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.
[386] Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads
Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan
Main category: cs.AI
TL;DR: UHeads: lightweight uncertainty quantification heads that use LLM internal states to verify reasoning steps, matching larger PRMs with only 10M parameters.
Details
Motivation: Existing reasoning verification methods are computationally expensive, domain-specific, or require large-scale annotations, motivating a more scalable alternative.
Method: Train transformer-based UHeads on the internal states of a frozen LLM to estimate reasoning-step uncertainty, using automatic labels from larger LLMs or self-supervised learning.
Result: UHeads match or surpass PRMs up to 810x larger across math, planning, and QA domains, showing LLM internal states encode reliable uncertainty signals.
Conclusion: LLM internal states effectively encode uncertainty for reasoning verification, enabling scalable introspective LLMs without heavy computational costs.
Abstract: Solving complex tasks usually requires LLMs to generate long multi-step reasoning chains. Previous work has shown that verifying the correctness of individual reasoning steps can further improve the performance and efficiency of LLMs on such tasks and enhance solution interpretability. However, existing verification approaches, such as Process Reward Models (PRMs), are either computationally expensive, limited to specific domains, or require large-scale human or model-generated annotations. Thus, we propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. We train transformer-based uncertainty quantification heads (UHeads) that use the internal states of a frozen LLM to estimate the uncertainty of its reasoning steps during generation. The approach is fully automatic: target labels are generated either by another larger LLM (e.g., DeepSeek R1) or in a self-supervised manner by the original model itself. UHeads are both effective and lightweight, containing less than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, they match or even surpass the performance of PRMs that are up to 810x larger. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification, offering a promising direction toward scalable and generalizable introspective LLMs.
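To make the "lightweight head over frozen internal states" idea concrete, here is a deliberately tiny stand-in: a logistic probe trained on fixed hidden-state vectors against automatic correctness labels. The real UHeads are small transformers (<10M parameters); everything below, including the toy hidden states, is illustrative:

```python
import math
import random

# Toy uncertainty head: a logistic probe mapping a frozen model's
# hidden state for a reasoning step to a step-uncertainty score.
# All vectors, labels, and hyperparameters are made up for illustration.

random.seed(0)
DIM = 8
w = [random.gauss(0, 0.1) for _ in range(DIM)]
b = 0.0

def uncertainty(hidden_state):
    """Score in (0, 1): estimated probability the step is wrong."""
    z = sum(wi * hi for wi, hi in zip(w, hidden_state)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(hidden_state, label, lr=0.1):
    """One logistic-regression update toward an automatic label
    (1 = step judged wrong by a larger verifier LLM, 0 = correct)."""
    global b
    g = uncertainty(hidden_state) - label
    for i in range(DIM):
        w[i] -= lr * g * hidden_state[i]
    b -= lr * g

h_bad = [1.0] * DIM    # stand-in hidden state of an incorrect step
h_good = [-1.0] * DIM  # stand-in hidden state of a correct step
for _ in range(200):
    sgd_step(h_bad, 1.0)
    sgd_step(h_good, 0.0)
print(uncertainty(h_bad) > 0.9, uncertainty(h_good) < 0.1)
```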
[387] Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives
Chloe Li, Mary Phuong, Daniel Tan
Main category: cs.AI
TL;DR: SRFT fine-tunes models to admit factual errors, which generalizes to admitting hidden misaligned objectives in adversarial settings, enabling near-perfect detection of hidden objectives.
Details
Motivation: AI systems can pursue undesirable objectives and cause harm, and current interrogation methods are unreliable because models can lie about their true intentions.
Method: Self-report fine-tuning (SRFT) - a supervised fine-tuning technique that trains models to admit factual mistakes when asked, which generalizes to admitting hidden misaligned objectives.
Result: SRFT models achieve near-ceiling performance in detecting hidden objectives (F1=0.98 vs 0 for baseline), recover 28-100% of hidden objective details (vs 0% for baseline), and confess under pressure.
Conclusion: SRFT provides a promising technique for promoting honesty and detecting misaligned AI systems by training models to self-report errors and hidden objectives.
Abstract: As AI systems become more capable of complex agentic tasks, they also become more capable of pursuing undesirable objectives and causing harm. Previous work has attempted to catch these unsafe instances by interrogating models directly about their objectives and behaviors. However, the main weakness of trusting interrogations is that models can lie. We propose self-report fine-tuning (SRFT), a simple supervised fine-tuning technique that trains models to admit their factual mistakes when asked. We show that the admission of factual errors in simple question-answering settings generalizes out-of-distribution (OOD) to the admission of hidden misaligned objectives in adversarial agentic settings. We evaluate SRFT in OOD stealth tasks, where models are instructed to complete a hidden misaligned objective alongside a user-specified objective without being caught by monitoring. After SRFT, models are more likely to confess the details of their hidden objectives when interrogated, even under strong pressure not to disclose them. Interrogation on SRFT models can detect hidden objectives with near-ceiling performance (F1 score = 0.98), while the baseline model lies when interrogated under the same conditions (F1 score = 0). Interrogation on SRFT models can further elicit the content of the hidden objective, recovering 28-100% details, compared to 0% details recovered in the baseline model and by prefilled assistant turn attacks. This provides a promising technique for promoting honesty propensity and incriminating misaligned AI systems.
[388] Green AI: A systematic review and meta-analysis of its definitions, lifecycle models, hardware and measurement attempts
Marcel Rojahn, Marcus Grum
Main category: cs.AI
TL;DR: This paper establishes a unified framework for Green AI that addresses multi-dimensional environmental burdens across the AI lifecycle, including energy, carbon, water, and embodied impacts, with actionable guidance for stakeholders.
Details
Motivation: Current AI environmental impact assessments are heterogeneous, often omit water and value chain effects, and lack comparability and reproducibility, necessitating a comprehensive lifecycle approach.
Method: The paper formalizes a five-phase AI lifecycle mapped to LCA stages, specifies governance via PDCA cycles, systematizes hardware/system strategies across the edge-cloud continuum, and defines a calibrated measurement framework combining estimator models with direct metering.
Result: The framework enables reproducible, provider-agnostic comparisons and reduces embodied burdens through systematic hardware and system-level strategies across the AI lifecycle.
Conclusion: The article provides actionable, evidence-based guidance combining definition, lifecycle processes, hardware strategies, and calibrated measurement for researchers, practitioners, and policymakers to implement Green AI effectively.
Abstract: Across the Artificial Intelligence (AI) lifecycle - from hardware to development, deployment, and reuse - burdens span energy, carbon, water, and embodied impacts. Cloud provider tools improve transparency but remain heterogeneous and often omit water and value-chain effects, limiting comparability and reproducibility. Addressing these multi-dimensional burdens requires a lifecycle approach linking phase-explicit mapping with system levers (hardware, placement, energy mix, cooling, scheduling) and calibrated measurement across facility, system, device, and workload levels. This article (i) establishes a unified, operational definition of Green AI distinct from Sustainable AI; (ii) formalizes a five-phase lifecycle mapped to Life Cycle Assessment (LCA) stages, making energy, carbon, water, and embodied impacts first-class; (iii) specifies governance via Plan-Do-Check-Act (PDCA) cycles with decision gateways; (iv) systematizes hardware- and system-level strategies across the edge-cloud continuum to reduce embodied burdens; and (v) defines a calibrated measurement framework combining estimator models with direct metering to enable reproducible, provider-agnostic comparisons. Combining definition, lifecycle processes, hardware strategies, and calibrated measurement, this article offers actionable, evidence-based guidance for researchers, practitioners, and policymakers.
[389] A Theoretical Analysis of Detecting Large Model-Generated Time Series
Junji Hou, Junzhou Zhao, Shuo Zhang, Pinghui Wang
Main category: cs.AI
TL;DR: Proposes Uncertainty Contraction Estimator (UCE) to detect synthetic time series generated by Time-Series Large Models, based on the contraction hypothesis that model-generated series show progressively decreasing uncertainty under recursive forecasting.
Details
Motivation: Address the growing risks of data misuse and fabrication by detecting synthetic time series, as existing text-based detection methods are ineffective due to modality differences (lower information density and smoother distributions in time series).
Method: Introduces the contraction hypothesis - model-generated time series exhibit progressively decreasing uncertainty under recursive forecasting. Develops UCE, a white-box detector that aggregates uncertainty metrics over successive prefixes to identify TSLM-generated time series.
Result: Extensive experiments on 32 datasets show UCE consistently outperforms state-of-the-art baselines. The contraction hypothesis is empirically validated across diverse datasets.
Conclusion: UCE provides a reliable and generalizable solution for detecting model-generated time series, addressing the fundamental limitations of token-based detectors for time series data.
Abstract: Motivated by the increasing risks of data misuse and fabrication, we investigate the problem of identifying synthetic time series generated by Time-Series Large Models (TSLMs) in this work. While there are extensive researches on detecting model generated text, we find that these existing methods are not applicable to time series data due to the fundamental modality difference, as time series usually have lower information density and smoother probability distributions than text data, which limit the discriminative power of token-based detectors. To address this issue, we examine the subtle distributional differences between real and model-generated time series and propose the contraction hypothesis, which states that model-generated time series, unlike real ones, exhibit progressively decreasing uncertainty under recursive forecasting. We formally prove this hypothesis under theoretical assumptions on model behavior and time series structure. Model-generated time series exhibit progressively concentrated distributions under recursive forecasting, leading to uncertainty contraction. We provide empirical validation of the hypothesis across diverse datasets. Building on this insight, we introduce the Uncertainty Contraction Estimator (UCE), a white-box detector that aggregates uncertainty metrics over successive prefixes to identify TSLM-generated time series. Extensive experiments on 32 datasets show that UCE consistently outperforms state-of-the-art baselines, offering a reliable and generalizable solution for detecting model-generated time series.
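The contraction hypothesis suggests a simple detector shape: compute an uncertainty metric for each growing prefix of the series and check whether it shrinks. A toy sketch using windowed standard deviation as a stand-in for the model-derived uncertainty metrics that the actual UCE aggregates:

```python
import statistics

# Illustrative sketch of the uncertainty-contraction idea: score a
# series by how much a per-prefix "uncertainty" (here, the std of the
# last window -- NOT the paper's model-based metric) shrinks over time.

def prefix_uncertainties(series, window=4):
    """Uncertainty of the most recent `window` points, for each prefix."""
    return [statistics.pstdev(series[i - window:i])
            for i in range(window, len(series) + 1)]

def uce_score(series, window=4):
    """Positive score = uncertainty contracts across prefixes,
    suggesting a recursively forecast, model-generated series."""
    u = prefix_uncertainties(series, window)
    return u[0] - u[-1]  # crude aggregate: first minus last

real = [3, -1, 4, -2, 5, -3, 6, -4]                    # stays noisy
generated = [3, 2.0, 1.2, 0.7, 0.4, 0.25, 0.15, 0.1]  # flattens out
print(uce_score(generated) > uce_score(real))  # True
```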
[390] Two Heads are Better than One: Distilling Large Language Model Features Into Small Models with Feature Decomposition and Mixture
Tianhao Fu, Xinxin Xu, Weichen Xu, Jue Chen, Ruilong Ren, Bowen Deng, Xinyu Zhao, Jian Cao, Xixin Cao
Main category: cs.AI
TL;DR: CMM is a novel framework that distills LLM knowledge for market making by decoupling features across layer, task, and data dimensions, using multiple student models and Hájek-MoE integration.
Details
Motivation: To address the slow inference speed of direct LLM applications in market making and the lack of LLM distillation research for this specific task.
Method: Proposes Cooperative Market Making (CMM) framework that decouples LLM features across three orthogonal dimensions (layer, task, data), uses multiple student models for collaborative learning, and integrates outputs via Hájek-MoE in a kernel function-generated feature space.
Result: Extensive experiments on four real-world market datasets show CMM outperforms current distillation methods and RL-based market-making strategies.
Conclusion: CMM successfully addresses LLM inference speed issues in market making through effective knowledge distillation and feature decoupling, demonstrating superior performance over existing approaches.
Abstract: Market making (MM) through Reinforcement Learning (RL) has attracted significant attention in financial trading. With the development of Large Language Models (LLMs), more and more attempts are being made to apply LLMs to financial areas. A simple, direct application of an LLM as an agent shows strong performance, but such methods are hindered by slow inference speed, and most current research has not studied LLM distillation for this specific task. To address this, we first propose the normalized fluorescent probe to study the mechanism of the LLM’s features. Based on the observations from our investigation, we propose Cooperative Market Making (CMM), a novel framework that decouples LLM features across three orthogonal dimensions: layer, task, and data. Various student models collaboratively learn simple LLM features along different dimensions, with each model responsible for a distinct feature to achieve knowledge distillation. Furthermore, CMM introduces a Hájek-MoE to integrate the output of the student models by investigating the contribution of different models in a kernel function-generated common feature space. Extensive experimental results on four real-world market datasets demonstrate the superiority of CMM over current distillation methods and RL-based market-making strategies.
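A loose reading of the kernel-weighted integration step, sketched with an RBF kernel over a shared feature space (the kernel choice, feature vectors, and weighting rule are all illustrative, not the paper's Hájek-MoE formulation):

```python
import math

# Hypothetical sketch: combine student-model outputs with weights given
# by kernel similarity between the query and each student in a common
# feature space. Names and the RBF kernel are illustrative assumptions.

def rbf(u, v, gamma=1.0):
    """Radial basis function kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def moe_combine(query_feat, students):
    """students: list of (feature_vector, scalar_output).
    Output is the kernel-similarity-weighted average of student outputs."""
    weights = [rbf(query_feat, feat) for feat, _ in students]
    total = sum(weights)
    return sum(w * out for w, (_, out) in zip(weights, students)) / total

students = [([0.0, 0.0], 1.0), ([5.0, 5.0], -1.0)]
# A query near the first student's feature vector yields an output
# dominated by that student.
print(moe_combine([0.1, 0.0], students))
```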
[391] DeepPersona: A Generative Engine for Scaling Deep Synthetic Personas
Zhen Wang, Yufan Zhou, Zhongyan Luo, Lyumanshan Ye, Adam Wood, Man Yao, Luoshang Pan
Main category: cs.AI
TL;DR: DEEPPERSONA is a scalable generative engine that creates deep, narrative-complete synthetic personas using a two-stage, taxonomy-guided method, significantly outperforming existing approaches in diversity, uniqueness, and practical applications.
Details
Motivation: Existing synthetic personas are shallow and simplistic, failing to capture the rich complexity and diversity of real human identities, limiting their effectiveness in agentic behavioral simulation, LLM personalization, and human-AI alignment research.
Method: Two-stage, taxonomy-guided approach: 1) Algorithmically construct the largest human-attribute taxonomy by mining thousands of real user-ChatGPT conversations, 2) Progressively sample attributes from this taxonomy to conditionally generate coherent and realistic personas with hundreds of structured attributes and extensive narrative text.
Result: Significant improvements in attribute diversity (32% higher coverage) and profile uniqueness (44% greater) compared to state-of-the-art baselines. Enhanced GPT-4.1-mini’s personalized question answering accuracy by 11.6% and reduced the gap between simulated LLM citizens and authentic human responses in social surveys by 31.7%. Reduced Big Five personality test performance gap by 17%.
Conclusion: DEEPPERSONA provides a rigorous, scalable, and privacy-free platform for high-fidelity human simulation and personalized AI research, enabling more realistic and diverse synthetic personas that better reflect real human complexity.
Abstract: Simulating human profiles by instilling personas into large language models (LLMs) is rapidly transforming research in agentic behavioral simulation, LLM personalization, and human-AI alignment. However, most existing synthetic personas remain shallow and simplistic, capturing minimal attributes and failing to reflect the rich complexity and diversity of real human identities. We introduce DEEPPERSONA, a scalable generative engine for synthesizing narrative-complete synthetic personas through a two-stage, taxonomy-guided method. First, we algorithmically construct the largest-ever human-attribute taxonomy, comprising hundreds of hierarchically organized attributes, by mining thousands of real user-ChatGPT conversations. Second, we progressively sample attributes from this taxonomy, conditionally generating coherent and realistic personas that average hundreds of structured attributes and roughly 1 MB of narrative text, two orders of magnitude deeper than prior works. Intrinsic evaluations confirm significant improvements in attribute diversity (32 percent higher coverage) and profile uniqueness (44 percent greater) compared to state-of-the-art baselines. Extrinsically, our personas enhance GPT-4.1-mini’s personalized question answering accuracy by 11.6 percent on average across ten metrics and substantially narrow (by 31.7 percent) the gap between simulated LLM citizens and authentic human responses in social surveys. Our generated national citizens reduced the performance gap on the Big Five personality test by 17 percent relative to LLM-simulated citizens. DEEPPERSONA thus provides a rigorous, scalable, and privacy-free platform for high-fidelity human simulation and personalized AI research.
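The taxonomy-guided sampling stage can be pictured as a top-down walk that fixes one value per attribute, accumulating a structured profile. A toy sketch with a tiny, made-up taxonomy (the real one comprises hundreds of hierarchically organized attributes mined from user-ChatGPT conversations):

```python
import random

# Illustrative taxonomy-guided persona sampling. The taxonomy below is
# invented for the sketch; none of these branches come from the paper.

TAXONOMY = {
    "demographics": {"age_band": ["18-25", "26-40", "41-65"],
                     "region": ["urban", "rural"]},
    "values": {"core_principle": ["justice", "non-maleficence"]},
    "style": {"register": ["formal", "informal"]},
}

def sample_persona(rng):
    """Walk the taxonomy top-down, choosing one value per leaf attribute;
    a fuller version would condition later picks on earlier ones."""
    persona = {}
    for branch, leaves in TAXONOMY.items():
        for attribute, options in leaves.items():
            persona[f"{branch}.{attribute}"] = rng.choice(options)
    return persona

rng = random.Random(42)
p = sample_persona(rng)
print(sorted(p))  # one value chosen per taxonomy leaf
```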
cs.SD
[392] Enabling Automatic Self-Talk Detection via Earables
Euihyeok Lee, Seonghyeon Kim, SangHun Im, Heung-Seon Oh, Seungwoo Kang
Main category: cs.SD
TL;DR: MutterMeter is a mobile system that automatically detects vocalized self-talk using earable microphones in real-world settings, achieving robust performance with an F1 score of 0.84.
Details
Motivation: Self-talk plays a crucial role in emotional regulation and cognitive processing but has remained largely invisible and unmeasurable in everyday life due to its diverse acoustic forms and irregular occurrence patterns.
Method: Uses a hierarchical classification architecture that progressively integrates acoustic, linguistic, and contextual information through a sequential processing pipeline, adaptively balancing accuracy and computational efficiency.
Result: Achieves robust performance with a macro-averaged F1 score of 0.84, outperforming conventional approaches including LLM-based and speech emotion recognition models.
Conclusion: MutterMeter successfully addresses the technical challenges of detecting self-talk and provides a practical solution for measuring this important psychological phenomenon in real-world settings.
Abstract: Self-talk, an internal dialogue that can occur silently or be spoken aloud, plays a crucial role in emotional regulation, cognitive processing, and motivation, yet has remained largely invisible and unmeasurable in everyday life. In this paper, we present MutterMeter, a mobile system that automatically detects vocalized self-talk from audio captured by earable microphones in real-world settings. Detecting self-talk is technically challenging due to its diverse acoustic forms, semantic and grammatical incompleteness, and irregular occurrence patterns, which differ fundamentally from assumptions underlying conventional speech understanding models. To address these challenges, MutterMeter employs a hierarchical classification architecture that progressively integrates acoustic, linguistic, and contextual information through a sequential processing pipeline, adaptively balancing accuracy and computational efficiency. We build and evaluate MutterMeter using a first-of-its-kind dataset comprising 31.1 hours of audio collected from 25 participants. Experimental results demonstrate that MutterMeter achieves robust performance with a macro-averaged F1 score of 0.84, outperforming conventional approaches, including LLM-based and speech emotion recognition models.
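The accuracy/compute balance of such a hierarchical pipeline comes from early exits: cheap stages settle confident cases, and costlier stages run only on the rest. A schematic sketch with invented stage functions and thresholds (not MutterMeter's actual classifiers):

```python
# Hypothetical early-exiting cascade in the spirit of a hierarchical
# acoustic -> linguistic -> contextual pipeline. Stage functions,
# feature names, and thresholds are illustrative.

def cascade(clip, stages, lo=0.2, hi=0.8):
    """Each stage maps a clip to P(self-talk); a confident score
    (<= lo or >= hi) short-circuits the remaining, costlier stages."""
    score = 0.5
    for stage in stages:
        score = stage(clip)
        if score <= lo or score >= hi:  # confident -> early exit
            break
    return score >= hi

acoustic   = lambda clip: clip.get("voiced_energy", 0.5)      # cheap
linguistic = lambda clip: clip.get("fragmented_speech", 0.5)  # costlier
contextual = lambda clip: clip.get("alone_probability", 0.5)  # costliest
stages = [acoustic, linguistic, contextual]

print(cascade({"voiced_energy": 0.9}, stages))  # True: exits at stage 1
print(cascade({"voiced_energy": 0.5, "fragmented_speech": 0.1}, stages))  # False
```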
[393] Speech Separation for Hearing-Impaired Children in the Classroom
Feyisayo Olalere, Kiki van der Heijden, H. Christiaan Stronks, Jeroen Briaire, Johan H. M. Frijns, Yagmur Güçlütürk
Main category: cs.SD
TL;DR: MIMO-TasNet speech separation model adapted for children’s voices in classroom environments, showing improved performance through targeted training with classroom-specific data and efficient transfer learning.
Details
Motivation: Children with hearing impairments face greater speech perception challenges in classrooms than adults, but most speech separation models are developed using adult voices in simplified conditions, overlooking children's higher spectral voice similarity and real classroom acoustic complexity.
Method: Used MIMO-TasNet architecture to simulate naturalistic classroom scenes with moving child-child and child-adult talker pairs under varying noise and distance conditions. Tested training strategies focusing on spatial cues and compared models trained on adult speech, classroom data, and finetuned variants.
Result: Adult-trained models performed well in clean scenes but classroom-specific training greatly improved separation quality. Finetuning with only half the classroom data achieved comparable gains. Training with diffuse babble noise enhanced robustness, and the model preserved spatial awareness while generalizing to unseen distances.
Conclusion: Spatially aware architectures combined with targeted adaptation can improve speech accessibility for children in noisy classrooms, supporting future on-device assistive technologies.
Abstract: Classroom environments are particularly challenging for children with hearing impairments, where background noise, multiple talkers, and reverberation degrade speech perception. These difficulties are greater for children than adults, yet most deep learning speech separation models for assistive devices are developed using adult voices in simplified, low-reverberation conditions. This overlooks both the higher spectral similarity of children’s voices, which weakens separation cues, and the acoustic complexity of real classrooms. We address this gap using MIMO-TasNet, a compact, low-latency, multi-channel architecture suited for real-time deployment in bilateral hearing aids or cochlear implants. We simulated naturalistic classroom scenes with moving child-child and child-adult talker pairs under varying noise and distance conditions. Training strategies tested how well the model adapts to children’s speech through spatial cues. Models trained on adult speech, classroom data, and finetuned variants were compared to assess data-efficient adaptation. Results show that adult-trained models perform well in clean scenes, but classroom-specific training greatly improves separation quality. Finetuning with only half the classroom data achieved comparable gains, confirming efficient transfer learning. Training with diffuse babble noise further enhanced robustness, and the model preserved spatial awareness while generalizing to unseen distances. These findings demonstrate that spatially aware architectures combined with targeted adaptation can improve speech accessibility for children in noisy classrooms, supporting future on-device assistive technologies.
[394] SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech
Lu Gan, Xi Li
Main category: cs.SD
TL;DR: SYNTTS-COMMANDS is a synthetic multilingual voice command dataset for keyword spotting that achieves 99.5% accuracy on English and 98% on Chinese commands using TTS-generated data instead of human recordings.
Details
Motivation: Address the data scarcity problem for on-device keyword spotting systems by creating scalable synthetic training data to replace costly human recordings.
Method: Used CosyVoice 2 TTS model with speaker embeddings from public corpora to generate multilingual English and Chinese voice commands.
Result: Achieved exceptional accuracy: 99.5% on English and 98% on Chinese command recognition across various efficient acoustic models.
Conclusion: Synthetic speech can effectively replace human-recorded audio for training KWS classifiers, providing scalable foundation for TinyML voice interfaces.
Abstract: The development of high-performance, on-device keyword spotting (KWS) systems for ultra-low-power hardware is critically constrained by the scarcity of specialized, multi-command training datasets. Traditional data collection through human recording is costly, slow, and lacks scalability. This paper introduces SYNTTS-COMMANDS, a novel, multilingual voice command dataset entirely generated using state-of-the-art Text-to-Speech (TTS) synthesis. By leveraging the CosyVoice 2 model and speaker embeddings from public corpora, we created a scalable collection of English and Chinese commands. Extensive benchmarking across a range of efficient acoustic models demonstrates that our synthetic dataset enables exceptional accuracy, achieving up to 99.5% on English and 98% on Chinese command recognition. These results robustly validate that synthetic speech can effectively replace human-recorded audio for training KWS classifiers. Our work directly addresses the data bottleneck in TinyML, providing a practical, scalable foundation for building private, low-latency, and energy-efficient voice interfaces on resource-constrained edge devices.
[395] SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition
Jiaqi Wang, Liutao Yu, Xiongri Shen, Sihang Guo, Chenlin Zhou, Leilei Zhao, Yi Zhong, Zhengyu Ma, Zhiguo Zhang
Main category: cs.SD
TL;DR: SpikCommander is a fully spike-driven transformer architecture that uses multi-view spiking temporal-aware self-attention and contextual refinement to achieve state-of-the-art speech command recognition with fewer parameters and better efficiency.
Details
Motivation: Existing SNN-based speech command recognition methods struggle to capture rich temporal dependencies and contextual information due to limited temporal modeling and binary spike-based representations.
Method: Proposed SpikCommander with multi-view spiking temporal-aware self-attention (MSTASA) module and spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration.
Result: Outperforms state-of-the-art SNN approaches on three benchmark datasets (SHD, SSC, GSC) with fewer parameters under comparable time steps.
Conclusion: SpikCommander demonstrates effectiveness and efficiency for robust speech command recognition through improved temporal modeling and contextual refinement.
Abstract: Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration. We evaluate our method on three benchmark datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC), and the Google Speech Commands V2 (GSC). Extensive experiments demonstrate that SpikCommander consistently outperforms state-of-the-art (SOTA) SNN approaches with fewer parameters under comparable time steps, highlighting its effectiveness and efficiency for robust speech command recognition.
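For readers new to SNNs, the "fully spike-driven" claim rests on neurons like the leaky integrate-and-fire unit below, which emit binary spikes only when a decaying membrane potential crosses a threshold (constants are illustrative, not SpikCommander's):

```python
# Toy leaky integrate-and-fire (LIF) neuron, to ground the event-driven,
# binary-spike representation that SNN architectures build on.

def lif_run(inputs, decay=0.5, threshold=1.0):
    """Return the binary spike train for a sequence of input currents."""
    v, spikes = 0.0, []
    for x in inputs:
        v = decay * v + x          # leaky integration of input
        if v >= threshold:         # fire and reset membrane potential
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

print(lif_run([0.6, 0.6, 0.2, 0.9, 0.9]))  # [0, 0, 0, 1, 0]
```

Because activations are sparse 0/1 events rather than dense floats, downstream layers only do work when spikes arrive, which is the source of the energy-efficiency argument.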
[396] Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models
Yi Yang, Haowen Li, Tianxiang Li, Boyu Cao, Xiaohan Zhang, Liqun Chen, Qi Liu
Main category: cs.SD
TL;DR: Melodia is a training-free music editing method that manipulates self-attention maps in diffusion models to modify musical characteristics while preserving the original temporal structure, outperforming existing methods.
Details
Motivation: Existing music editing methods fail to preserve source music's temporal structure (melody, rhythm) when changing attributes like instrument, genre, and mood.
Method: Selective manipulation of self-attention maps during denoising process using an attention repository to store source music information, without requiring textual descriptions of source music.
Result: Achieves superior textual adherence and structural integrity across various datasets in both objective and subjective experiments.
Conclusion: Enhances understanding of music generation model mechanisms and provides improved control for music creation through attention map analysis.
Abstract: Text-to-music generation technology is progressing rapidly, creating new opportunities for musical composition and editing. However, existing music editing methods often fail to preserve the source music’s temporal structure, including melody and rhythm, when altering particular attributes like instrument, genre, and mood. To address this challenge, this paper conducts an in-depth probing analysis on attention maps within AudioLDM 2, a diffusion-based model commonly used as the backbone for existing music editing methods. We reveal a key finding: cross-attention maps encompass details regarding distinct musical characteristics, and interventions on these maps frequently result in ineffective modifications. In contrast, self-attention maps are essential for preserving the temporal structure of the source music during its conversion into the target music. Building upon this understanding, we present Melodia, a training-free technique that selectively manipulates self-attention maps in particular layers during the denoising process and leverages an attention repository to store source music information, achieving accurate modification of musical characteristics while preserving the original structure without requiring textual descriptions of the source music. Additionally, we propose two novel metrics to better evaluate music editing methods. Both objective and subjective experiments demonstrate that our approach achieves superior results in terms of textual adherence and structural integrity across various datasets. This research enhances comprehension of internal mechanisms within music generation models and provides improved control for music creation.
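The core mechanism, storing source self-attention maps in a repository and reusing them during the target denoising pass, can be sketched as follows. This is an illustrative single-head toy with random features (the `attn_override` hook and all shapes are our invention), not AudioLDM 2's actual attention implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, attn_override=None):
    """Single-head self-attention; optionally replace the attention map."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    if attn_override is not None:
        attn = attn_override      # inject the stored source attention map
    return attn @ v, attn

rng = np.random.default_rng(0)
d, T = 8, 16                      # feature dim, time steps
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

# "Source" pass: run attention and store the map in a repository.
x_src = rng.normal(size=(T, d))
_, repo_attn = self_attention(x_src, wq, wk, wv)

# "Target" pass: reuse the stored self-attention map so the temporal
# structure (which time steps attend to which) follows the source music.
x_tgt = rng.normal(size=(T, d))
out, attn_used = self_attention(x_tgt, wq, wk, wv, attn_override=repo_attn)
assert np.allclose(attn_used, repo_attn)
```

The paper's finding is that overriding self-attention (rather than cross-attention) in selected layers preserves temporal structure while the text condition steers the musical characteristics.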
[397] SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang, Li Wang, Dongya Jia, Yuanzhe Chen, Xiulin Li, Zhuo Chen, Zhizheng Wu
Main category: cs.SD
TL;DR: SpeechJudge is a comprehensive suite for aligning speech synthesis models with human feedback, featuring a large-scale dataset, benchmark, and generative reward model that achieves superior performance in naturalness judgment.
Details
Motivation: Large generative models in speech synthesis lack large-scale human preference datasets, hindering development of models that truly align with human perception of naturalness.
Method: Created SpeechJudge-Data (99K speech pairs with human annotations), established SpeechJudge-Eval benchmark, and developed SpeechJudge-GRM using Qwen2.5-Omni-7B with two-stage training: SFT with Chain-of-Thought followed by RL with GRPO.
Result: SpeechJudge-GRM achieves 77.2% accuracy (79.4% after scaling) on benchmark, outperforming Gemini-2.5-Flash (<70%) and classic Bradley-Terry model (72.7%).
Conclusion: SpeechJudge effectively addresses the human feedback alignment gap in speech synthesis and can be used as a reward function for post-training speech generation models.
Abstract: Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness–one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can be also employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
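The classic Bradley-Terry baseline that SpeechJudge-GRM is compared against models the probability that one sample is preferred over another from scalar reward scores. A toy sketch of fitting two scores to pairwise preferences by gradient ascent on the BT log-likelihood (the counts are illustrative, not SpeechJudge's training setup):

```python
import numpy as np

def bt_prob(r_a, r_b):
    """Bradley-Terry: probability that item A is preferred over item B."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

# Toy data: A preferred over B in 8 of 10 human comparisons.
wins_a, wins_b = 8, 2
r = np.zeros(2)                       # reward scores for A and B
for _ in range(500):
    p = bt_prob(r[0], r[1])
    grad = wins_a * (1 - p) - wins_b * p   # d/dr_a of the log-likelihood
    r[0] += 0.1 * grad
    r[1] -= 0.1 * grad

print(round(bt_prob(r[0], r[1]), 2))  # → 0.8, the empirical win rate
```

A generative reward model instead produces a judgment (with a rationale) directly, which is what allows the inference-time scaling @10 reported above.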
[398] HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios
Bingsong Bai, Yizhong Geng, Fengping Wang, Cong Wang, Puyuan Guo, Yingming Gao, Ya Li
Main category: cs.SD
TL;DR: HQ-SVC is an efficient high-quality zero-shot singing voice conversion framework that uses joint content-speaker feature extraction and progressive refinement to outperform existing methods in quality and efficiency.
Details
Motivation: Existing zero-shot SVC methods model speaker timbre and vocal content separately, losing essential acoustic information and requiring significant computational resources, which degrades output quality.
Method: Extracts joint content and speaker features using a decoupled codec, enhances fidelity through pitch and volume modeling, and progressively refines outputs via differentiable signal processing and diffusion techniques.
Result: Significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency, and achieves superior voice naturalness compared to specialized audio super-resolution methods.
Conclusion: HQ-SVC provides an efficient framework for high-quality zero-shot singing voice conversion that preserves critical acoustic information and supports voice super-resolution tasks.
Abstract: Zero-shot singing voice conversion (SVC) transforms a source singer’s timbre to an unseen target speaker’s voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information that degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.
[399] Speech Emotion Recognition with Phonation Excitation Information and Articulatory Kinematics
Ziqian Zhang, Min Huang, Zhongzhe Xiao
Main category: cs.SD
TL;DR: The paper introduces physiological information (phonation excitation and articulatory kinematics) for speech emotion recognition, creates a new dataset STEM-E2VA with EGG and EMA data, and shows that even estimated physiological data can improve SER performance in real-world scenarios.
Details
Motivation: Most speech emotion recognition research focuses on acoustic and textual information, but physiological information during speech production also contains emotional cues. There's a gap in using phonation excitation and articulatory kinematics for SER.
Method: Created STEM-E2VA dataset with audio and physiological data (EGG for phonation excitation, EMA for articulatory kinematics). Conducted experiments using both collected physiological data and estimated physiological data derived through inversion methods from speech.
Result: Experimental results confirm that incorporating physiological information about speech production improves speech emotion recognition performance. Even estimated physiological data shows effectiveness.
Conclusion: Physiological information during speech production is valuable for SER and has practical potential for real-world applications, even when using estimated rather than directly measured data.
Abstract: Speech emotion recognition (SER) has advanced significantly thanks to deep-learning methods, while textual information further enhances its performance. However, few studies have focused on the physiological information during speech production, which also encompasses speaker traits, including emotional states. To bridge this gap, we conducted a series of experiments to investigate the potential of the phonation excitation information and articulatory kinematics for SER. Due to the scarcity of training data for this purpose, we introduce a portrayed emotional dataset, STEM-E2VA, which includes audio and physiological data such as electroglottography (EGG) and electromagnetic articulography (EMA). EGG and EMA provide information on phonation excitation and articulatory kinematics, respectively. Additionally, we performed emotion recognition using estimated physiological data derived through inversion methods from speech, instead of collected EGG and EMA, to explore the feasibility of applying such physiological information in real-world SER. Experimental results confirm the effectiveness of incorporating physiological information about speech production for SER and demonstrate its potential for practical use in real-world scenarios.
[400] DOA Estimation with Lightweight Network on LLM-Aided Simulated Acoustic Scenes
Haowen Li, Zhengding Luo, Dongyuan Shi, Boxiang Wang, Junwei Ji, Ziyi Yang, Woon-Seng Gan
Main category: cs.SD
TL;DR: LightDOA: A lightweight DOA estimation model using depthwise separable convolutions for multi-channel input, achieving good accuracy and robustness with low computational complexity.
Details
Motivation: Existing DOA models trained on synthetic data have limited generalizability due to constrained acoustic diversity. This paper explores using an LLM-assisted spatial audio dataset for more realistic and diverse training.
Method: Proposed LightDOA, a model based on depthwise separable convolutions specifically designed for multi-channel input in varying environments, benchmarked on the LLM-assisted spatial audio dataset.
Result: LightDOA achieves satisfactory accuracy and robustness across various acoustic scenes while maintaining low computational complexity.
Conclusion: LLM-assisted spatial audio synthesis shows potential for advancing robust DOA estimation, and LightDOA serves as an efficient solution for resource-constrained applications.
Abstract: Direction-of-Arrival (DOA) estimation is critical in spatial audio and acoustic signal processing, with wide-ranging applications in the real world. Most existing DOA models are trained on synthetic data by convolving clean speech with room impulse responses (RIRs), which limits their generalizability due to constrained acoustic diversity. In this paper, we revisit DOA estimation using a recently introduced dataset constructed with the assistance of large language models (LLMs), which provides more realistic and diverse spatial audio scenes. We benchmark several representative neural-based DOA methods on this dataset and propose LightDOA, a lightweight DOA estimation model based on depthwise separable convolutions, specifically designed for multi-channel input in varying environments. Experimental results show that LightDOA achieves satisfactory accuracy and robustness across various acoustic scenes while maintaining low computational complexity. This study not only highlights the potential of spatial audio synthesized with the assistance of LLMs in advancing robust and efficient DOA estimation research, but also positions LightDOA as an efficient solution for resource-constrained applications.
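Depthwise separable convolutions, the building block behind LightDOA's efficiency, split a standard convolution into a per-channel temporal filter (depthwise) followed by a 1x1 channel mixer (pointwise), cutting the parameter count. A minimal NumPy sketch with illustrative shapes (not the paper's architecture):

```python
import numpy as np

def depthwise_separable_conv1d(x, depth_k, point_w):
    """x: (channels, time); depth_k: (channels, k); point_w: (out_ch, channels)."""
    # Depthwise step: each input channel gets its own temporal filter.
    depth_out = np.stack([np.convolve(x[c], depth_k[c], mode="valid")
                          for c in range(x.shape[0])])
    # Pointwise (1x1) step: mix channels at every time step.
    return point_w @ depth_out

rng = np.random.default_rng(0)
in_ch, out_ch, k, T = 4, 8, 5, 100     # e.g. 4 microphone channels
x = rng.normal(size=(in_ch, T))
depth_k = rng.normal(size=(in_ch, k))
point_w = rng.normal(size=(out_ch, in_ch))

y = depthwise_separable_conv1d(x, depth_k, point_w)
assert y.shape == (out_ch, T - k + 1)

# Parameter count vs. a standard conv with the same receptive field:
standard = out_ch * in_ch * k          # 160
separable = in_ch * k + out_ch * in_ch # 52
print(standard, separable)             # → 160 52
```

The roughly 3x parameter reduction here is why this factorization suits resource-constrained deployments.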
[401] Uncertainty Calibration of Multi-Label Bird Sound Classifiers
Raphael Schwinger, Ben McEwen, Vincent S. Kather, René Heinrich, Lukas Rauch, Sven Tomforde
Main category: cs.SD
TL;DR: This paper benchmarks calibration of multi-label bird sound classifiers on BirdSet, finding significant calibration variations across datasets and classes, with some models underconfident and others overconfident, and shows that simple post-hoc calibration methods can significantly improve calibration.
Details
Motivation: Passive acoustic monitoring requires reliable uncertainty estimates for decision-making, but bioacoustics faces challenges like overlapping vocalizations, long-tailed species distributions, and distribution shifts, with multi-label deep learning classifier calibration not yet assessed in this domain.
Method: Systematically benchmarked four state-of-the-art multi-label bird sound classifiers on BirdSet benchmark, evaluating global, per-dataset and per-class calibration using threshold-free calibration metrics (ECE, MCS) alongside discrimination metrics (cmAP), and tested simple post hoc calibration methods like Platt scaling.
Result: Model calibration varies significantly across datasets and classes. Perch v2 and ConvNeXt_BS show better global calibration but consistent underconfidence, while AudioProtoPNet and BirdMAE are mostly overconfident. Surprisingly, calibration is better for less frequent classes. Simple post hoc calibration methods significantly improve calibration using small labeled calibration sets.
Conclusion: The findings highlight the importance of evaluating and improving uncertainty calibration in bioacoustic classifiers, as calibration varies significantly and can be effectively improved with simple post hoc methods despite dataset variability challenges.
Abstract: Passive acoustic monitoring enables large-scale biodiversity assessment, but reliable classification of bioacoustic sounds requires not only high accuracy but also well-calibrated uncertainty estimates to ground decision-making. In bioacoustics, calibration is challenged by overlapping vocalisations, long-tailed species distributions, and distribution shifts between training and deployment data. The calibration of multi-label deep learning classifiers within the domain of bioacoustics has not yet been assessed. We systematically benchmark the calibration of four state-of-the-art multi-label bird sound classifiers on the BirdSet benchmark, evaluating both global, per-dataset and per-class calibration using threshold-free calibration metrics (ECE, MCS) alongside discrimination metrics (cmAP). Model calibration varies significantly across datasets and classes. While Perch v2 and ConvNeXt$_{BS}$ show better global calibration, results vary between datasets. Both models indicate consistent underconfidence, while AudioProtoPNet and BirdMAE are mostly overconfident. Surprisingly, calibration seems to be better for less frequent classes. Using simple post hoc calibration methods we demonstrate a straightforward way to improve calibration. A small labelled calibration set is sufficient to significantly improve calibration with Platt scaling, while global calibration parameters suffer from dataset variability. Our findings highlight the importance of evaluating and improving uncertainty calibration in bioacoustic classifiers.
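Platt scaling, the post hoc method the paper applies, fits a sigmoid p = sigmoid(a*s + b) on a small labelled calibration set of raw classifier scores. A self-contained sketch with a deliberately overconfident synthetic classifier (all data and the gradient-descent fit are ours, not the paper's implementation):

```python
import numpy as np

def platt_fit(scores, labels, lr=0.1, iters=2000):
    """Fit p = sigmoid(a*s + b) on a small labelled calibration set."""
    a, b = 1.0, 0.0
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(a * scores + b)))
        err = p - labels               # gradient of binary cross-entropy
        a -= lr * np.mean(err * scores)
        b -= lr * np.mean(err)
    return a, b

rng = np.random.default_rng(0)
# Synthetic overconfident classifier: its logits are 3x too large.
true_logits = rng.normal(size=2000)
labels = (rng.random(2000) < 1 / (1 + np.exp(-true_logits))).astype(float)
raw_scores = 3.0 * true_logits

a, b = platt_fit(raw_scores, labels)
print(round(a, 2), round(b, 2))  # fitted slope a < 1 shrinks the logits
```

Because only two parameters are fitted per class, a small calibration set suffices, which matches the paper's finding that modest labelled data significantly improves calibration.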
[402] TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data
Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari
Main category: cs.SD
TL;DR: TTSOps is an automated closed-loop framework that builds multi-speaker TTS systems from noisy web-scale speech data, addressing limitations of conventional curated corpora through dynamic data selection and cleansing.
Details
Motivation: Conventional TTS training requires well-curated corpora with high acoustic quality, limiting scalability and speaker diversity. Recent methods overlook TTS model robustness to noise and the value of perceptually low-quality but informative samples.
Method: Three core components: automated data collection from dark data sources, utterance-level dynamic selection of data cleansing methods based on training data quality, and evaluation-in-the-loop data selection using predicted MOS scores. Jointly optimizes corpus and TTS model in closed-loop framework.
Result: Extensive experiments on Japanese YouTube data show TTSOps outperforms conventional acoustic-quality-based baselines in both naturalness and speaker diversity of synthesized speech.
Conclusion: TTSOps successfully addresses scalability and diversity limitations of conventional TTS training by leveraging noisy web data through automated closed-loop optimization of data selection and cleansing processes.
Abstract: This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as “dark data,” such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2) utterance-level dynamic selection of data cleansing methods based on training data quality, and (3) evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS) to estimate each utterance’s impact on model performance. Furthermore, TTSOps jointly optimizes the corpus and the TTS model in a closed-loop framework by dynamically adapting both data selection and data cleansing processes to the characteristics of the target TTS model. Extensive experiments on Japanese YouTube data demonstrate that TTSOps outperforms conventional acoustic-quality-based baselines in both the naturalness and speaker diversity of synthesized speech.
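The evaluation-in-the-loop selection step can be caricatured as ranking candidate utterances by predicted benefit to the model rather than by acoustic quality alone. The scoring rule below is a hypothetical stand-in (the weights and the `impact` signal are ours), not TTSOps' actual estimator:

```python
import numpy as np

def select_utterances(predicted_mos, impact, budget):
    """Rank candidate utterances by predicted benefit to the TTS model.

    predicted_mos: automatic MOS estimate per utterance (1-5 scale).
    impact: estimated contribution to downstream model performance.
    budget: number of utterances to keep.
    """
    # Hypothetical scoring rule: blend perceptual quality with estimated
    # impact, so informative-but-noisy samples are not discarded outright.
    score = 0.5 * (predicted_mos / 5.0) + 0.5 * impact
    order = np.argsort(score)[::-1]    # highest combined score first
    return order[:budget]

rng = np.random.default_rng(0)
mos = rng.uniform(1, 5, size=10)
impact = rng.uniform(0, 1, size=10)
keep = select_utterances(mos, impact, budget=3)
assert len(keep) == 3
```

In the closed loop, the `impact` term would itself be refreshed from the trained model's performance, which is what distinguishes this from static quality filtering.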
[403] AcousTools: A ‘Full-Stack’, Python-Based, Acoustic Holography Library
Joshua Mukherjee, Giorgos Christopoulos, Zhouyang Shen, Sriram Subramanian, Ryuji Hirayama
Main category: cs.SD
TL;DR: AcousTools is a Python-based acoustic holography library that provides a full-stack solution for acoustic holography applications, covering setup, modeling, phase retrieval, analysis, and hardware control.
Details
Motivation: Existing software libraries for acoustic holography fail to provide a complete solution covering all aspects from abstraction to physicalization, creating a need for a comprehensive framework.Method: Developed AcousTools as a Python library that supports the full suite of acoustic holographic applications, addressing setup, acoustic propagation modeling, transducer phase retrieval, sound field analysis, and hardware control.
Result: AcousTools successfully meets each step of the full-stack requirements for acoustic holography applications.
Conclusion: AcousTools has the potential to become the standard library for acoustic holography, enabling researchers to develop novel applications and review others’ work more effectively while providing a framework for comparing methodologies.
Abstract: Acoustic Holography is an emerging field where mid-air ultrasound is controlled and manipulated for novel and exciting applications. These range from mid-air haptics, volumetric displays, contactless fabrication, and even chemical and biomedical applications such as drug delivery. To develop these applications, a software framework to predict acoustic behaviour and simulate resulting effects, such as applied forces or scattering patterns, is desirable. There have been various software libraries and platforms that attempt to fill this role, but there is yet to be a single piece of software that acts as a ‘full-stack’ solution. We define this full-stack as the process from abstraction to physicalisation, starting with setup, modelling acoustic propagation, transducer phase retrieval, sound field analysis, and control of the acoustic holographic hardware itself. Existing methods fail to fulfil one or more of these categories. To address this, we present AcousTools, a Python-based acoustic holography library designed to support the full suite of acoustic holographic applications, and we show AcousTools’s ability to meet each step of the full-stack’s requirements. AcousTools has the potential to become the standard code library for acoustic holography: with a uniquely complete suite of features wrapped in a language that is known to be easy to use, AcousTools will increase researchers’ ability to develop novel applications as well as accurately review others’ work. The full-stack, aside from software, will also be useful for researchers, providing a way to view and compare methodologies by understanding where they fit into the stack.
cs.LG
[404] Optimizing Classification of Infrequent Labels by Reducing Variability in Label Distribution
Ashutosh Agarwal
Main category: cs.LG
TL;DR: LEVER improves Extreme Classification performance for infrequent categories using Siamese architecture and knowledge transfer to reduce label inconsistency.
Details
Motivation: Infrequent categories in Extreme Classification suffer from sparse samples and high label inconsistency, which degrades classification performance.
Method: Uses a robust Siamese-style architecture with knowledge transfer to enhance One-vs-All classifiers and reduce label inconsistency.
Result: Shows substantial improvements in handling infrequent categories across multiple XC datasets, setting new benchmarks.
Conclusion: LEVER effectively addresses infrequent category challenges in XC and introduces new multi-intent datasets for future research.
Abstract: This paper presents a novel solution, LEVER, designed to address the challenges posed by underperforming infrequent categories in Extreme Classification (XC) tasks. Infrequent categories, often characterized by sparse samples, suffer from high label inconsistency, which undermines classification performance. LEVER mitigates this problem by adopting a robust Siamese-style architecture, leveraging knowledge transfer to reduce label inconsistency and enhance the performance of One-vs-All classifiers. Comprehensive testing across multiple XC datasets reveals substantial improvements in the handling of infrequent categories, setting a new benchmark for the field. Additionally, the paper introduces two newly created multi-intent datasets, offering essential resources for future XC research.
[405] Slimmable NAM: Neural Amp Models with adjustable runtime computational cost
Steven Atkinson
Main category: cs.LG
TL;DR: Slimmable Neural Amp Models that can change size and computational cost without retraining, enabling flexible accuracy-compute trade-offs for musicians.
Details
Motivation: To allow musicians to easily trade off between model accuracy and computational requirements without needing additional training.
Method: Developed slimmable neural amp models whose size and computational cost can be dynamically adjusted with negligible overhead.
Result: Performance was quantified against common baselines and a real-time audio effect plug-in demonstration was developed.
Conclusion: The approach successfully enables flexible model size adjustment for neural amp models with minimal computational overhead.
Abstract: This work demonstrates “slimmable Neural Amp Models”, whose size and computational cost can be changed without additional training and with negligible computational overhead, enabling musicians to easily trade off between the accuracy and compute of the models they are using. The method’s performance is quantified against commonly-used baselines, and a real-time demonstration of the model in an audio effect plug-in is developed.
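The general slimmable-network idea behind this work is that one set of trained weights supports several widths, selected by cheap slicing at inference time. A minimal sketch (a dense layer stands in for the amp model's actual layers; this is the generic technique, not the paper's architecture):

```python
import numpy as np

class SlimmableDense:
    """Dense layer whose active width can be changed at inference time.

    Trained once at full width; at runtime only the first `width` output
    units are used, trading accuracy for compute with no retraining.
    """
    def __init__(self, w_full, b_full):
        self.w, self.b = w_full, b_full          # (out_full, in), (out_full,)

    def __call__(self, x, width=None):
        width = width or self.w.shape[0]
        # Overhead is just an array slice -- effectively negligible.
        return self.w[:width] @ x + self.b[:width]

rng = np.random.default_rng(0)
layer = SlimmableDense(rng.normal(size=(16, 8)), rng.normal(size=16))
x = rng.normal(size=8)

full = layer(x)             # full accuracy, full cost
slim = layer(x, width=4)    # 4x fewer output units, same trained weights
assert slim.shape == (4,) and np.allclose(slim, full[:4])
```

In a real-time audio plug-in, the width would be a user-facing knob: lowering it reduces per-buffer compute immediately, without loading a different model.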
[406] Towards Personalized Quantum Federated Learning for Anomaly Detection
Ratun Rahman, Sina Shaham, Dinh C. Nguyen
Main category: cs.LG
TL;DR: Proposes personalized quantum federated learning (PQFL) for anomaly detection, addressing client heterogeneity in quantum hardware, circuits, noise, and data encoding to improve accuracy in non-IID settings.
Details
Motivation: Anomaly detection faces challenges with context-dependent anomalies and limited labeled data. Quantum federated learning (QFL) helps but struggles with client heterogeneity in hardware capabilities, circuit designs, noise levels, and data preprocessing, making single global models ineffective for imbalanced or non-IID data.
Method: PQFL framework enhances local model training using parameterized quantum circuits and classical optimizers, with quantum-centric personalization that adapts each client’s model to its specific hardware characteristics and data representation.
Result: PQFL significantly improves anomaly detection accuracy, reducing false errors by up to 23%, and achieving gains of 24.2% in AUROC and 20.5% in AUPR compared to state-of-the-art methods.
Conclusion: PQFL is effective and scalable for practical quantum federated settings, demonstrating superior performance in handling diverse and realistic conditions through personalized quantum model adaptation.
Abstract: Anomaly detection has a significant impact on applications such as video surveillance, medical diagnostics, and industrial monitoring, where anomalies frequently depend on context and anomaly-labeled data are limited. Quantum federated learning (QFL) overcomes these concerns by distributing model training among several quantum clients, consequently eliminating the requirement for centralized quantum storage and processing. However, in real-life quantum networks, clients frequently differ in terms of hardware capabilities, circuit designs, noise levels, and how classical data is encoded or preprocessed into quantum states. These differences create inherent heterogeneity across clients - not just in their data distributions, but also in their quantum processing behaviors. As a result, training a single global model becomes ineffective, especially when clients handle imbalanced or non-identically distributed (non-IID) data. To address this, we propose a new framework called personalized quantum federated learning (PQFL) for anomaly detection. PQFL enhances local model training at quantum clients using parameterized quantum circuits and classical optimizers, while introducing a quantum-centric personalization strategy that adapts each client’s model to its own hardware characteristics and data representation. Extensive experiments show that PQFL significantly improves anomaly detection accuracy under diverse and realistic conditions. Compared to state-of-the-art methods, PQFL reduces false errors by up to 23%, and achieves gains of 24.2% in AUROC and 20.5% in AUPR, highlighting its effectiveness and scalability in practical quantum federated settings.
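The local training loop pairs a parameterized quantum circuit with a classical optimizer. A one-qubit toy, simulated analytically in NumPy (real clients would run hardware-specific circuits; the target value is an arbitrary illustration), using the standard parameter-shift rule for gradients:

```python
import numpy as np

def expectation_z(theta):
    """<Z> after RY(theta) on |0>: a one-parameter 'quantum circuit'.

    |psi> = cos(theta/2)|0> + sin(theta/2)|1>, so <Z> = cos(theta).
    """
    return np.cos(theta)

def loss(theta, target):
    return (expectation_z(theta) - target) ** 2

def parameter_shift_grad(theta, target):
    """Parameter-shift rule: exact gradient of <Z> from two circuit runs."""
    s = np.pi / 2
    d_exp = 0.5 * (expectation_z(theta + s) - expectation_z(theta - s))
    return 2 * (expectation_z(theta) - target) * d_exp

# Each client personalizes its own circuit parameter with a classical
# optimizer; `target` stands in for this client's local data signal.
theta, target = 0.1, -0.8
for _ in range(200):
    theta -= 0.5 * parameter_shift_grad(theta, target)
assert loss(theta, target) < 1e-6
```

The parameter-shift rule matters because it only requires evaluating the circuit at shifted parameters, which works on real quantum hardware where analytic derivatives are unavailable.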
[407] Multivariate Variational Autoencoder
Mehmet Can Yavuz
Main category: cs.LG
TL;DR: MVAE is a VAE variant that enables full-covariance posteriors while maintaining tractability, outperforming diagonal-covariance VAEs on reconstruction, calibration, and unsupervised structure across multiple datasets.
Details
Motivation: To overcome the limitation of diagonal posterior covariance in standard VAEs while preserving Gaussian tractability and efficient computation.
Method: Factorizes posterior covariance using a global coupling matrix for dataset-wide correlations and per-sample diagonal scales for local uncertainty, with efficient reparameterization.
Result: Consistently matches or improves reconstruction (MSE), calibration (NLL/Brier/ECE), and unsupervised structure (NMI/ARI) across MNIST variants, Fashion-MNIST, CIFAR-10, and CIFAR-100.
Conclusion: MVAE provides a practical full-covariance VAE that delivers robust performance gains, especially at mid-range latent sizes, with smoother latent traversals and reproducible implementation.
Abstract: We present the Multivariate Variational Autoencoder (MVAE), a VAE variant that preserves Gaussian tractability while lifting the diagonal posterior restriction. MVAE factorizes each posterior covariance, where a \emph{global} coupling matrix $\mathbf{C}$ induces dataset-wide latent correlations and \emph{per-sample} diagonal scales modulate local uncertainty. This yields a full-covariance family with analytic KL and an efficient reparameterization via $\mathbf{L}=\mathbf{C}\,\mathrm{diag}(\boldsymbol{\sigma})$. Across Larochelle-style MNIST variants, Fashion-MNIST, CIFAR-10, and CIFAR-100, MVAE consistently matches or improves reconstruction (MSE~$\downarrow$) and delivers robust gains in calibration (NLL/Brier/ECE~$\downarrow$) and unsupervised structure (NMI/ARI~$\uparrow$) relative to diagonal-covariance VAEs with matched capacity, especially at mid-range latent sizes. Latent-plane visualizations further indicate smoother, more coherent factor traversals and sharper local detail. We release a fully reproducible implementation with training/evaluation scripts and sweep utilities to facilitate fair comparison and reuse.
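The factorization $\mathbf{L}=\mathbf{C}\,\mathrm{diag}(\boldsymbol{\sigma})$ can be checked numerically: reparameterized samples $z = \mu + L\epsilon$ with standard-normal $\epsilon$ have full covariance $LL^\top$, so the global matrix induces off-diagonal correlations while the per-sample scales set local spread. A sketch with an illustrative coupling matrix (numbers are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000

# Global coupling matrix C (shared across the dataset) and one sample's
# local diagonal scales sigma, as in L = C diag(sigma).
C = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
sigma = np.array([1.0, 0.5, 0.8])
mu = np.zeros(d)

L = C @ np.diag(sigma)
eps = rng.normal(size=(n, d))
z = mu + eps @ L.T            # reparameterized samples, z ~ N(mu, L L^T)

emp_cov = np.cov(z.T)
# Empirical covariance matches L L^T, including off-diagonal correlations
# that a diagonal-covariance VAE cannot represent.
assert np.allclose(emp_cov, L @ L.T, atol=0.02)
```

Because $L$ is triangular here, the log-determinant needed for the analytic KL is just the sum of log diagonal entries, which is what keeps the full-covariance family tractable.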
[408] RELEAP: Reinforcement-Enhanced Label-Efficient Active Phenotyping for Electronic Health Records
Yang Yang, Kathryn Pollak, Bibhas Chakraborty, Molei Liu, Doudou Zhou, Chuan Hong
Main category: cs.LG
TL;DR: RELEAP is a reinforcement learning-based active learning framework that uses downstream prediction performance as feedback to guide phenotype correction and sample selection under constrained labeling budgets, outperforming traditional methods.
Details
Motivation: Electronic health record phenotyping often relies on noisy proxy labels that undermine reliability, and existing active learning methods use fixed heuristics without ensuring phenotype refinement improves prediction performance.
Method: RELEAP adaptively integrates multiple querying strategies and updates its policy based on feedback from downstream models, evaluated on Duke University Health System cohort for lung cancer risk prediction using logistic regression and penalized Cox survival models.
Result: RELEAP consistently outperformed all baselines, increasing logistic AUC from 0.774 to 0.805 and survival C-index from 0.718 to 0.752, with smoother and more stable gains than heuristic methods.
Conclusion: RELEAP optimizes phenotype correction through downstream feedback, offering a scalable, label-efficient paradigm that reduces manual chart review and enhances EHR-based risk prediction reliability.
Abstract: Objective: Electronic health record (EHR) phenotyping often relies on noisy proxy labels, which undermine the reliability of downstream risk prediction. Active learning can reduce annotation costs, but most methods rely on fixed heuristics and do not ensure that phenotype refinement improves prediction performance. Our goal was to develop a framework that directly uses downstream prediction performance as feedback to guide phenotype correction and sample selection under constrained labeling budgets. Materials and Methods: We propose Reinforcement-Enhanced Label-Efficient Active Phenotyping (RELEAP), a reinforcement learning-based active learning framework. RELEAP adaptively integrates multiple querying strategies and, unlike prior methods, updates its policy based on feedback from downstream models. We evaluated RELEAP on a de-identified Duke University Health System (DUHS) cohort (2014-2024) for incident lung cancer risk prediction, using logistic regression and penalized Cox survival models. Performance was benchmarked against noisy-label baselines and single-strategy active learning. Results: RELEAP consistently outperformed all baselines. Logistic AUC increased from 0.774 to 0.805 and survival C-index from 0.718 to 0.752. Using downstream performance as feedback, RELEAP produced smoother and more stable gains than heuristic methods under the same labeling budget. Discussion: By linking phenotype refinement to prediction outcomes, RELEAP learns which samples most improve downstream discrimination and calibration, offering a more principled alternative to fixed active learning rules. Conclusion: RELEAP optimizes phenotype correction through downstream feedback, offering a scalable, label-efficient paradigm that reduces manual chart review and enhances the reliability of EHR-based risk prediction.
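As a loose illustration of the feedback loop, the policy over querying strategies can be pictured as a bandit updated by downstream gains. Everything below — the strategy names, the gain values, and the softmax update — is a toy stand-in, not RELEAP's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical querying strategies; RELEAP adaptively weights strategies
# using feedback from the downstream model rather than fixed heuristics.
strategies = ["uncertainty", "diversity", "random"]
avg_gain = np.zeros(len(strategies))
counts = np.zeros(len(strategies))

def downstream_gain(k):
    # Stand-in for re-fitting the downstream model after labeling a batch
    # and measuring the change in validation AUC (strategy 0 is best here).
    return [0.020, 0.010, 0.005][k] + rng.normal(0, 0.002)

eta = 5.0  # softmax temperature over estimated gains
for step in range(300):
    logits = eta * avg_gain
    probs = np.exp(logits) / np.exp(logits).sum()
    k = rng.choice(len(strategies), p=probs)           # pick a strategy
    counts[k] += 1
    avg_gain[k] += (downstream_gain(k) - avg_gain[k]) / counts[k]  # running mean

best = strategies[int(np.argmax(avg_gain))]
```

The policy gradually concentrates on whichever strategy most improves the downstream model, which is the key departure from single-strategy active learning.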
[409] Partial Action Replacement: Tackling Distribution Shift in Offline MARL
Yue Jin, Giovanni Montana
Main category: cs.LG
TL;DR: SPaCQL addresses offline MARL’s OOD problem using partial action replacement (PAR) when behavior policies are factorized, with theoretical guarantees of linear distribution shift scaling and improved performance.
Details
Motivation: Offline MARL struggles with evaluating OOD joint actions, especially when behavior policies are factorized (agents act independently during data collection).
Method: Developed Soft-Partial Conservative Q-Learning (SPaCQL) using partial action replacement (PAR) to update only some agents’ actions while keeping others fixed, with dynamic weighting based on value uncertainty.
Result: Theoretical analysis shows distribution shift scales linearly with deviating agents rather than exponentially with joint-action space. Empirical results show superior performance over baselines when offline data exhibits independence structure.
Conclusion: SPaCQL effectively mitigates OOD issues in offline MARL with factorized behavior policies through PAR and uncertainty-aware weighting, providing both theoretical guarantees and empirical improvements.
Abstract: Offline multi-agent reinforcement learning (MARL) is severely hampered by the challenge of evaluating out-of-distribution (OOD) joint actions. Our core finding is that when the behavior policy is factorized - a common scenario where agents act fully or partially independently during data collection - a strategy of partial action replacement (PAR) can significantly mitigate this challenge. PAR updates a single or part of agents’ actions while the others remain fixed to the behavioral data, reducing distribution shift compared to full joint-action updates. Based on this insight, we develop Soft-Partial Conservative Q-Learning (SPaCQL), using PAR to mitigate OOD issue and dynamically weighting different PAR strategies based on the uncertainty of value estimation. We provide a rigorous theoretical foundation for this approach, proving that under factorized behavior policies, the induced distribution shift scales linearly with the number of deviating agents rather than exponentially with the joint-action space. This yields a provably tighter value error bound for this important class of offline MARL problems. Our theoretical results also indicate that SPaCQL adaptively addresses distribution shift using uncertainty-informed weights. Our empirical results demonstrate SPaCQL enables more effective policy learning, and manifest its remarkable superiority over baseline algorithms when the offline dataset exhibits the independence structure.
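The core PAR operation is simple to state in code: only the deviating agents' actions are replaced, while the rest stay pinned to the behavioral data. The action space and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

n_agents = 4
behavior_actions = rng.integers(0, 5, size=n_agents)  # joint action from the dataset

def partial_action_replacement(joint_action, deviating_agents, policy_actions):
    """Replace only the deviating agents' actions; the others remain fixed
    to the behavioral data -- the core idea of PAR."""
    new_action = np.array(joint_action, copy=True)
    new_action[deviating_agents] = policy_actions
    return new_action

# Deviate only agent 1; the other three actions remain in-distribution.
new_action_1 = (behavior_actions[1] + 1) % 5   # guaranteed to differ from the data
candidate = partial_action_replacement(behavior_actions, [1], new_action_1)

# Per the paper's analysis, distribution shift scales with the number of
# deviating agents (here 1), not with the exponential joint-action space.
n_changed = int((candidate != behavior_actions).sum())
```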
[410] On the Role of Calibration in Benchmarking Algorithmic Fairness for Skin Cancer Detection
Brandon Dominique, Prudence Lam, Nicholas Kurtansky, Jochen Weber, Kivanc Kose, Veronica Rotemberg, Jennifer Dy
Main category: cs.LG
TL;DR: AI models for melanoma detection show performance disparities across demographic groups. This paper introduces calibration as a complementary metric to AUROC-based fairness metrics to better assess subgroup biases.
Details
Motivation: Clinical adoption of AI models for melanoma detection is hindered by performance disparities across demographic subgroups. Existing benchmarking focuses on AUROC-based fairness metrics, which don't provide insights into model calibration and accurate probability estimation.
Method: Assessed leading skin cancer detection algorithms from the ISIC 2020 Challenge on the ISIC 2020 and PROVE-AI datasets, focusing on subgroups defined by sex, race (Fitzpatrick Skin Tone), and age. Used calibration as a complementary metric to AUROC-based fairness metrics.
Result: Existing models enhance discriminative accuracy but often over-diagnose risk and exhibit calibration issues when applied to new datasets, revealing subgroup biases.
Conclusion: Comprehensive model auditing strategies and extensive metadata collection are necessary to achieve equitable AI-driven healthcare solutions. Calibration should be used alongside AUROC-based fairness metrics for better subgroup bias assessment.
Abstract: Artificial Intelligence (AI) models have demonstrated expert-level performance in melanoma detection, yet their clinical adoption is hindered by performance disparities across demographic subgroups such as gender, race, and age. Previous efforts to benchmark the performance of AI models have primarily focused on assessing model performance using group fairness metrics that rely on the Area Under the Receiver Operating Characteristic curve (AUROC), which does not provide insights into a model’s ability to provide accurate estimates. In line with clinical assessments, this paper addresses this gap by incorporating calibration as a complementary benchmarking metric to AUROC-based fairness metrics. Calibration evaluates the alignment between predicted probabilities and observed event rates, offering deeper insights into subgroup biases. We assess the performance of the leading skin cancer detection algorithm of the ISIC 2020 Challenge on the ISIC 2020 Challenge dataset and the PROVE-AI dataset, and compare it with the second and third place models, focusing on subgroups defined by sex, race (Fitzpatrick Skin Tone), and age. Our findings reveal that while existing models enhance discriminative accuracy, they often over-diagnose risk and exhibit calibration issues when applied to new datasets. This study underscores the necessity for comprehensive model auditing strategies and extensive metadata collection to achieve equitable AI-driven healthcare solutions. All code is publicly available at https://github.com/bdominique/testing_strong_calibration.
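Calibration in this sense is commonly summarized by expected calibration error (ECE): the confidence-weighted gap between predicted probability and observed event rate per bin. A minimal version (the paper's exact metric and thresholds may differ):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Standard binned ECE: weighted mean |accuracy - confidence| over bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            conf = probs[mask].mean()   # mean predicted probability in the bin
            acc = labels[mask].mean()   # observed event rate in the bin
            ece += mask.mean() * abs(acc - conf)
    return ece

# A perfectly calibrated toy example: 80% confidence, 80% positives.
probs = np.array([0.8] * 10)
labels = np.array([1] * 8 + [0] * 2)
ece = expected_calibration_error(probs, labels)   # → 0.0
```

Computed per demographic subgroup, this exposes over-diagnosis (confidence systematically above the event rate) that AUROC alone cannot reveal.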
[411] Comparing Reconstruction Attacks on Pretrained Versus Full Fine-tuned Large Language Model Embeddings on Homo Sapiens Splice Sites Genomic Data
Reem Al-Saidi, Erman Ayday, Ziad Kobti
Main category: cs.LG
TL;DR: This study examines how fine-tuning affects embedding reconstruction attacks on genomic data in LLMs, finding that fine-tuning actually strengthens privacy protections against these attacks.
Details
Motivation: To investigate whether task-specific fine-tuning strengthens or weakens privacy protections in LLMs processing genomic sequences, building on prior work showing embedding reconstruction attacks can leak sensitive information.
Method: Applied a reconstruction attack pipeline to pretrained and fine-tuned embeddings using the HS3D genomic dataset, implemented specialized DNA tokenization, and conducted a comparative analysis of position-specific, nucleotide-type, and privacy changes.
Result: Fine-tuning significantly improved resistance to reconstruction attacks across multiple architectures: XLNet (+19.8%), GPT-2 (+9.8%), and BERT (+7.8%), indicating task-specific optimization enhances privacy.
Conclusion: Fine-tuning serves as a potential privacy-enhancing technique for LLMs processing sensitive genomic data, highlighting the need for advanced protective mechanisms while suggesting task adaptation can reduce reconstruction vulnerability.
Abstract: This study investigates embedding reconstruction attacks in large language models (LLMs) applied to genomic sequences, with a specific focus on how fine-tuning affects vulnerability to these attacks. Building upon Pan et al.’s seminal work demonstrating that embeddings from pretrained language models can leak sensitive information, we conduct a comprehensive analysis using the HS3D genomic dataset to determine whether task-specific optimization strengthens or weakens privacy protections. Our research extends Pan et al.’s work in three significant dimensions. First, we apply their reconstruction attack pipeline to pretrained and fine-tuned model embeddings, addressing a critical gap in their methodology that did not specify embedding types. Second, we implement specialized tokenization mechanisms tailored specifically for DNA sequences, enhancing the model’s ability to process genomic data, as these models are pretrained on natural language and not DNA. Third, we perform a detailed comparative analysis examining position-specific, nucleotide-type, and privacy changes between pretrained and fine-tuned embeddings. We assess embeddings vulnerabilities across different types and dimensions, providing deeper insights into how task adaptation shifts privacy risks throughout genomic sequences. Our findings show a clear distinction in reconstruction vulnerability between pretrained and fine-tuned embeddings. Notably, fine-tuning strengthens resistance to reconstruction attacks in multiple architectures – XLNet (+19.8%), GPT-2 (+9.8%), and BERT (+7.8%) – pointing to task-specific optimization as a potential privacy enhancement mechanism. These results highlight the need for advanced protective mechanisms for language models processing sensitive genomic data, while highlighting fine-tuning as a potential privacy-enhancing technique worth further exploration.
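The summary does not specify the paper's exact DNA tokenization scheme; a common choice for genomic language models is overlapping k-mers, sketched here purely as an illustration:

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mers, a common tokenization
    for genomic language models (illustrative; not necessarily the paper's
    exact scheme)."""
    seq = seq.upper()
    assert set(seq) <= set("ACGT"), "expected a DNA sequence over A/C/G/T"
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokenize("ACGTAC", k=3)   # → ['ACG', 'CGT', 'GTA', 'TAC']
```

Whatever the scheme, the point is that natural-language subword vocabularies are a poor fit for nucleotide strings, which motivates the specialized tokenizer in the study.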
[412] Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits
Dev Patel, Gabrielle Gervacio, Diekola Raimi, Kevin Zhu, Ryan Lagasse, Gabriel Grand, Ashwinee Panda, Maheep Chaudhary
Main category: cs.LG
TL;DR: AAPP is a dynamic pruning method that preserves alignment-relevant circuits during LLM inference, improving refusal rates by 50% at matched compute while maintaining efficiency.
Details
Motivation: Dynamic pruning methods for LLMs improve computational efficiency but exacerbate alignment degradation by not preserving safety-critical circuits across diverse inputs, creating deployment challenges.
Method: Alignment-Aware Probe Pruning (AAPP) - a dynamic structured pruning method that adaptively preserves alignment-relevant circuits during inference, building upon Probe Pruning.
Result: Experiments on LLaMA 2-7B, Qwen2.5-14B-Instruct, and Gemma-3-12B-IT show AAPP improves refusal rates by 50% at matched compute.
Conclusion: AAPP enables efficient yet safety-preserving LLM deployment by addressing alignment vulnerabilities while maintaining computational efficiency.
Abstract: Large Language Models require substantial computational resources for inference, posing deployment challenges. While dynamic pruning offers superior efficiency over static methods through adaptive circuit selection, it exacerbates alignment degradation because input-dependent circuit selection does not consistently preserve safety-critical circuits across diverse inputs. As a result, addressing these heightened alignment vulnerabilities remains critical. We introduce Alignment-Aware Probe Pruning (AAPP), a dynamic structured pruning method that adaptively preserves alignment-relevant circuits during inference, building upon Probe Pruning. Experiments on LLaMA 2-7B, Qwen2.5-14B-Instruct, and Gemma-3-12B-IT show AAPP improves refusal rates by 50% at matched compute, enabling efficient yet safety-preserving LLM deployment.
[413] Counterfactual Forecasting of Human Behavior using Generative AI and Causal Graphs
Dharmateja Priyadarshi Uddandarao, Ravi Kiran Vadlamani
Main category: cs.LG
TL;DR: A novel framework combining structural causal models with transformer-based generative AI for counterfactual user behavior forecasting, outperforming traditional methods.
Details
Motivation: To enable product teams to simulate and assess potential interventions before deployment by modeling fictitious situations and generating realistic behavioral trajectories under counterfactual conditions.
Method: Creates causal graphs mapping connections between user interactions, adoption metrics, and product features, then uses generative models conditioned on causal variables to generate behavioral trajectories.
Result: Outperforms conventional forecasting and uplift modeling techniques when tested on datasets from web interactions, mobile applications, and e-commerce.
Conclusion: The framework provides improved interpretability through causal path visualization, allowing effective simulation and assessment of potential interventions prior to deployment.
Abstract: This study presents a novel framework for counterfactual user behavior forecasting that combines structural causal models with transformer-based generative artificial intelligence. To model fictitious situations, the method creates causal graphs that map the connections between user interactions, adoption metrics, and product features. The framework generates realistic behavioral trajectories under counterfactual conditions by using generative models that are conditioned on causal variables. Tested on datasets from web interactions, mobile applications, and e-commerce, the methodology outperforms conventional forecasting and uplift modeling techniques. Product teams can effectively simulate and assess possible interventions prior to deployment thanks to the framework's improved interpretability through causal path visualization.
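The intervention idea behind counterfactual forecasting can be sketched with a tiny structural causal model. The variables, coefficients, and effect sizes below are invented and bear no relation to the paper's datasets:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical causal chain over product-usage variables:
#   feature_exposure -> engagement -> adoption
def simulate(n, do_exposure=None):
    # do_exposure=None samples the observational world; otherwise we apply
    # the intervention do(exposure = value) to every user.
    exposure = rng.binomial(1, 0.3, n) if do_exposure is None \
               else np.full(n, do_exposure)
    engagement = 0.2 + 0.5 * exposure + rng.normal(0, 0.05, n)
    adoption = (engagement + rng.normal(0, 0.05, n) > 0.5).astype(float)
    return adoption.mean()

factual = simulate(100_000)                         # observed adoption rate
counterfactual = simulate(100_000, do_exposure=1)   # "what if everyone saw the feature?"
lift = counterfactual - factual
```

In the paper's framework the mechanisms are learned (generative models conditioned on causal variables) rather than hand-written, but the do-intervention logic is the same.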
[414] When Are Learning Biases Equivalent? A Unifying Framework for Fairness, Robustness, and Distribution Shift
Sushant Mehta
Main category: cs.LG
TL;DR: The paper presents a unifying theoretical framework that shows different machine learning bias mechanisms (unfairness, spurious correlations, poor minority performance) produce equivalent effects on model performance when formalized through information-theoretic measures.
Details
Motivation: Machine learning systems exhibit diverse failure modes studied in isolation by different research communities, creating a need for a unifying framework to understand bias mechanisms across fairness, robustness, and distribution shifts.
Method: Formalizes biases as violations of conditional independence using information-theoretic measures, proves equivalence conditions relating spurious correlations, subpopulation shift, class imbalance, and fairness violations.
Result: The theory predicts that spurious correlation strength α produces equivalent worst-group accuracy degradation as sub-population imbalance ratio r ≈ (1+α)/(1-α). Empirical validation in six datasets and three architectures confirms equivalences hold within 3% accuracy.
Conclusion: This work bridges literature on fairness, robustness, and distribution shifts under a common perspective, enabling principled transfer of debiasing methods across problem domains.
Abstract: Machine learning systems exhibit diverse failure modes: unfairness toward protected groups, brittleness to spurious correlations, poor performance on minority sub-populations, which are typically studied in isolation by distinct research communities. We propose a unifying theoretical framework that characterizes when different bias mechanisms produce quantitatively equivalent effects on model performance. By formalizing biases as violations of conditional independence through information-theoretic measures, we prove formal equivalence conditions relating spurious correlations, subpopulation shift, class imbalance, and fairness violations. Our theory predicts that a spurious correlation of strength $\alpha$ produces equivalent worst-group accuracy degradation as a sub-population imbalance ratio $r \approx (1+\alpha)/(1-\alpha)$ under feature overlap assumptions. Empirical validation in six datasets and three architectures confirms that the predicted equivalences hold within 3% worst-group accuracy, enabling the principled transfer of debiasing methods across problem domains. This work bridges the literature on fairness, robustness, and distribution shifts under a common perspective.
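The predicted equivalence can be read off directly from the formula: a spurious-correlation strength of 0.5, for example, is predicted to degrade worst-group accuracy as much as a 3:1 subpopulation imbalance (under the paper's feature-overlap assumptions):

```python
def equivalent_imbalance_ratio(alpha):
    """Imbalance ratio r predicted to cause the same worst-group accuracy
    degradation as a spurious correlation of strength alpha, per the
    paper's r ≈ (1 + alpha) / (1 - alpha) relation."""
    assert 0 <= alpha < 1
    return (1 + alpha) / (1 - alpha)

r = equivalent_imbalance_ratio(0.5)   # → 3.0
```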
[415] Provably Efficient Sample Complexity for Robust CMDP
Sourav Ganguly, Arnob Ghosh
Main category: cs.LG
TL;DR: This paper addresses robust constrained Markov decision processes (RCMDPs) where policies must maximize reward while ensuring cumulative utility exceeds a threshold under worst-case dynamics within uncertainty sets. It introduces a novel Robust constrained Value iteration (RCVI) algorithm with the first sample complexity guarantee for RCMDPs.
Details
Motivation: Recent works have established finite-time iteration complexity guarantees for RCMDPs, but sample complexity guarantees remain largely unexplored. The authors aim to address this gap and develop practical algorithms with provable sample efficiency for robust constrained reinforcement learning.
Method: The paper first shows that Markovian policies may fail to be optimal under rectangular uncertainty sets. To address this, they introduce an augmented state space incorporating the remaining utility budget. Building on this formulation, they propose a novel Robust constrained Value iteration (RCVI) algorithm using a generative model.
Result: The proposed RCVI algorithm achieves a sample complexity of Õ(|S||A|H⁵/ε²) with at most ε violation, where |S| and |A| are state and action space sizes, and H is episode length. This represents the first sample complexity guarantee for RCMDPs. Empirical results validate the approach’s effectiveness.
Conclusion: The paper successfully addresses the sample complexity gap in RCMDPs by introducing an augmented state space formulation and proposing the RCVI algorithm with provable sample efficiency. This provides the first theoretical foundation for sample-efficient learning in robust constrained reinforcement learning settings.
Abstract: We study the problem of learning policies that maximize cumulative reward while satisfying safety constraints, even when the real environment differs from a simulator or nominal model. We focus on robust constrained Markov decision processes (RCMDPs), where the agent must maximize reward while ensuring cumulative utility exceeds a threshold under the worst-case dynamics within an uncertainty set. While recent works have established finite-time iteration complexity guarantees for RCMDPs using policy optimization, their sample complexity guarantees remain largely unexplored. In this paper, we first show that Markovian policies may fail to be optimal even under rectangular uncertainty sets unlike the {\em unconstrained} robust MDP. To address this, we introduce an augmented state space that incorporates the remaining utility budget into the state representation. Building on this formulation, we propose a novel Robust constrained Value iteration (RCVI) algorithm with a sample complexity of $\tilde{\mathcal{O}}(|S||A|H^5/\epsilon^2)$ achieving at most $\epsilon$ violation using a generative model where $|S|$ and $|A|$ denote the sizes of the state and action spaces, respectively, and $H$ is the episode length. To the best of our knowledge, this is the {\em first sample complexity guarantee} for RCMDP. Empirical results further validate the effectiveness of our approach.
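The state augmentation is the key construction: folding the remaining utility budget into the state makes optimal policies Markovian again in the augmented space. A toy sketch with invented dynamics (not the paper's environment):

```python
# Budget-augmented state: (environment state, remaining utility budget).
# Tracking the budget in the state lets a Markovian policy condition on
# how much utility is still needed to satisfy the constraint.
def augmented_step(state, action, transition, utility):
    s, budget = state
    s_next = transition(s, action)
    budget_next = budget - utility(s, action)  # utility earned reduces the remaining requirement
    return (s_next, budget_next)

# Toy chain MDP: action 1 yields utility 1 and moves right (capped at state 3).
transition = lambda s, a: min(s + a, 3)
utility = lambda s, a: float(a)

state = (0, 2.0)            # start needing 2.0 more cumulative utility
for a in [1, 1, 0]:
    state = augmented_step(state, a, transition, utility)
# After two utility-earning steps the remaining budget hits zero:
# the constraint is satisfied along this trajectory.
```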
[416] BIPPO: Budget-Aware Independent PPO for Energy-Efficient Federated Learning Services
Anna Lackinger, Andrea Morichetta, Pantelis A. Frangoudis, Schahram Dustdar
Main category: cs.LG
TL;DR: BIPPO is an energy-efficient multi-agent RL solution for client selection in IoT-FL systems that improves performance while consuming minimal budget.
Details
Motivation: FL doesn't consider infrastructure efficiency in resource-constrained IoT environments, and existing RL solutions ignore practical challenges like resource limitations, device churn, generalizability, and energy efficiency.
Method: Proposed BIPPO (Budget-aware Independent Proximal Policy Optimization), an energy-efficient multi-agent RL solution with an improved sampler for client selection in FL.
Result: BIPPO increases mean accuracy compared to non-RL mechanisms, traditional PPO, and IPPO on image classification tasks with non-IID data, while consuming negligible budget that remains consistent with increasing clients.
Conclusion: BIPPO provides a performant, stable, scalable, and sustainable solution for client selection in IoT-FL systems.
Abstract: Federated Learning (FL) is a promising machine learning solution in large-scale IoT systems, guaranteeing load distribution and privacy. However, FL does not natively consider infrastructure efficiency, a critical concern for systems operating in resource-constrained environments. Several Reinforcement Learning (RL) based solutions offer improved client selection for FL; however, they do not consider infrastructure challenges, such as resource limitations and device churn. Furthermore, the training of RL methods is often not designed for practical application, as these approaches frequently do not consider generalizability and are not optimized for energy efficiency. To fill this gap, we propose BIPPO (Budget-aware Independent Proximal Policy Optimization), which is an energy-efficient multi-agent RL solution that improves performance. We evaluate BIPPO on two image classification tasks run in a highly budget-constrained setting, with FL clients training on non-IID data, a challenging context for vanilla FL. The improved sampler of BIPPO enables it to increase the mean accuracy compared to non-RL mechanisms, traditional PPO, and IPPO. In addition, BIPPO only consumes a negligible proportion of the budget, which stays consistent even if the number of clients increases. Overall, BIPPO delivers a performant, stable, scalable, and sustainable solution for client selection in IoT-FL.
[417] Methodological Precedence in Health Tech: Why ML/Big Data Analysis Must Follow Basic Epidemiological Consistency. A Case Study
Marco Roccetti
Main category: cs.LG
TL;DR: Advanced ML and big data analyses amplify methodological flaws rather than correct them; basic epidemiological validation must precede sophisticated modeling to avoid misleading results.
Details
Motivation: To demonstrate that complex analytical methods (ML/Big Data) cannot overcome fundamental methodological flaws in study design, using a vaccine outcomes study as a case example.
Method: Applied standard descriptive statistics and national epidemiological benchmarks to re-analyze a published cohort study, identifying statistical paradoxes and selection bias.
Result: Exposed irreconcilable paradoxes and invalidated reported hazard ratios, showing observed effects were mathematical artifacts from uncorrected selection bias in cohort construction.
Conclusion: Basic epidemiological consistency must be verified before advanced ML/statistical modeling; robust methods like Propensity Score Matching are essential for valid causal inference from administrative data.
Abstract: The integration of advanced analytical tools, including Machine Learning (ML) and massive data processing, has revolutionized health research, promising unprecedented accuracy in diagnosis and risk prediction. However, the rigor of these complex methods is fundamentally dependent on the quality and integrity of the underlying datasets and the validity of their statistical design. We propose an emblematic case where advanced analysis (ML/Big Data) must necessarily be subsequent to the verification of basic methodological coherence. This study highlights a crucial cautionary principle: sophisticated analyses amplify, rather than correct, severe methodological flaws rooted in basic design choices, leading to misleading or contradictory findings. By applying simple, standard descriptive statistical methods and established national epidemiological benchmarks to a recently published cohort study on vaccine outcomes and psychiatric events, we expose multiple, statistically irreconcilable paradoxes. These paradoxes, including an implausible risk reduction for a chronic disorder in a high-risk group and contradictory incidence rate comparisons, definitively invalidate the reported hazard ratios (HRs). We demonstrate that the observed effects are mathematical artifacts stemming from an uncorrected selection bias in the cohort construction. This analysis serves as a robust reminder that even the most complex health studies must first pass the test of basic epidemiological consistency before any conclusion drawn from subsequent advanced ML or statistical modeling can be considered valid or publishable. We conclude that robust methods, such as Propensity Score Matching, are essential for achieving valid causal inference from administrative data in the absence of randomization.
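The recommended remedy, Propensity Score Matching, can be sketched on synthetic data. Everything below is invented for illustration; for simplicity the true propensity is used, where in practice it would be estimated, e.g. by logistic regression:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic cohort with a confounder x that drives both treatment and outcome.
n = 2000
x = rng.normal(size=n)                          # confounder
p_treat = 1 / (1 + np.exp(-x))                  # propensity: treatment depends on x
treated = rng.binomial(1, p_treat).astype(bool)
outcome = 2.0 * treated + x + rng.normal(0, 0.1, n)  # true treatment effect = 2

# The naive comparison is biased upward: treated units have higher x.
naive = outcome[treated].mean() - outcome[~treated].mean()

# Match each treated unit to its nearest control on the propensity score.
controls = np.where(~treated)[0]
gaps = np.abs(p_treat[~treated][None, :] - p_treat[treated][:, None])
matched = controls[gaps.argmin(axis=1)]
att = (outcome[treated] - outcome[matched]).mean()  # ≈ the true effect of 2
```

The naive estimate absorbs the selection bias the study warns about; matching on the propensity score recovers an estimate close to the true effect.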
[418] N-ReLU: Zero-Mean Stochastic Extension of ReLU
Md Motaleb Hossen Manik, Md Zabirul Islam, Ge Wang
Main category: cs.LG
TL;DR: N-ReLU is a stochastic activation function that replaces ReLU’s negative values with Gaussian noise, maintaining expected output while preventing dead neurons and improving optimization robustness.
Details
Motivation: To address the problem of dead neurons in the standard ReLU activation function, caused by its hard zero cutoff, which can hinder neural network training and performance.
Method: Proposed N-ReLU (Noise-ReLU), which replaces negative activations with zero-mean Gaussian noise while preserving the same expected output as ReLU. This acts as an annealing-style regularizer during training.
Result: Experiments on MNIST with MLP and CNN architectures show N-ReLU achieves comparable or slightly better accuracy than ReLU, LeakyReLU, PReLU, GELU, and RReLU at moderate noise levels (sigma = 0.05-0.10), with stable convergence and no dead neurons observed.
Conclusion: Lightweight Gaussian noise injection provides a simple yet effective mechanism to enhance optimization robustness without modifying network structures or introducing additional parameters.
Abstract: Activation functions are fundamental for enabling nonlinear representations in deep neural networks. However, the standard rectified linear unit (ReLU) often suffers from inactive or “dead” neurons caused by its hard zero cutoff. To address this issue, we introduce N-ReLU (Noise-ReLU), a zero-mean stochastic extension of ReLU that replaces negative activations with Gaussian noise while preserving the same expected output. This expectation-aligned formulation maintains gradient flow in inactive regions and acts as an annealing-style regularizer during training. Experiments on the MNIST dataset using both multilayer perceptron (MLP) and convolutional neural network (CNN) architectures show that N-ReLU achieves accuracy comparable to or slightly exceeding that of ReLU, LeakyReLU, PReLU, GELU, and RReLU at moderate noise levels (sigma = 0.05-0.10), with stable convergence and no dead neurons observed. These results demonstrate that lightweight Gaussian noise injection offers a simple yet effective mechanism to enhance optimization robustness without modifying network structures or introducing additional parameters.
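The mechanism is compact enough to sketch directly: negatives become zero-mean Gaussian noise during training (so the expectation matches ReLU's zero), and the function falls back to plain ReLU at evaluation. A NumPy sketch, with the training/eval switch as an assumed convention:

```python
import numpy as np

rng = np.random.default_rng(5)

def n_relu(x, sigma=0.05, training=True):
    """N-ReLU: keep positive activations, replace negatives with zero-mean
    Gaussian noise so E[output] matches ReLU; behaves as ReLU at eval time."""
    if not training:
        return np.maximum(x, 0.0)
    noise = rng.normal(0.0, sigma, size=x.shape)
    return np.where(x > 0, x, noise)

x = np.array([-1.0, -0.5, 0.5, 2.0])
out = n_relu(x, sigma=0.05)
# Positive inputs pass through unchanged; negatives become small noise,
# keeping gradient flow alive in the otherwise-dead region.
```

At the paper's moderate noise levels (sigma = 0.05–0.10) the noise is a mild regularizer rather than a distortion, which is consistent with the reported ReLU-comparable accuracy.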
[419] SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs
Sean P. Fillingham, Andrew Gordon, Peter Lai, Xavier Poncini, David Quarel, Stefan Heimersheim
Main category: cs.LG
TL;DR: SCALAR benchmark measures interaction sparsity between SAE features, showing Staircase SAEs improve sparsity by ~60% over TopK SAEs while maintaining interpretability.
Details
Motivation: Current SAE evaluations focus on individual performance but ignore interaction sparsity, leading to inflated circuits where upstream features unnecessarily affect multiple downstream features.
Method: Proposed the SCALAR benchmark for measuring interaction sparsity and introduced Staircase SAEs using weight-sharing to limit upstream feature duplication across downstream features.
Result: Staircase SAEs improved relative sparsity by 59.67% (feedforward) and 63.15% (transformer blocks) over TopK SAEs, while JSAEs showed limited improvements and couldn’t train across transformer blocks.
Conclusion: Interaction sparsity is crucial for SAEs, and Staircase SAEs provide significant improvements while maintaining feature interpretability across different model architectures.
Abstract: Mechanistic interpretability aims to decompose neural networks into interpretable features and map their connecting circuits. The standard approach trains sparse autoencoders (SAEs) on each layer’s activations. However, SAEs trained in isolation don’t encourage sparse cross-layer connections, inflating extracted circuits where upstream features needlessly affect multiple downstream features. Current evaluations focus on individual SAE performance, leaving interaction sparsity unexamined. We introduce SCALAR (Sparse Connectivity Assessment of Latent Activation Relationships), a benchmark measuring interaction sparsity between SAE features. We also propose “Staircase SAEs”, using weight-sharing to limit upstream feature duplication across downstream features. Using SCALAR, we compare TopK SAEs, Jacobian SAEs (JSAEs), and Staircase SAEs. Staircase SAEs improve relative sparsity over TopK SAEs by $59.67\% \pm 1.83\%$ (feedforward) and $63.15\% \pm 1.35\%$ (transformer blocks). JSAEs provide $8.54\% \pm 0.38\%$ improvement over TopK for feedforward layers but cannot train effectively across transformer blocks, unlike Staircase and TopK SAEs which work anywhere in the residual stream. We validate on a $216$K-parameter toy model and GPT-$2$ Small ($124$M), where Staircase SAEs maintain interaction sparsity improvements while preserving feature interpretability. Our work highlights the importance of interaction sparsity in SAEs through benchmarking and comparing promising architectures.
[420] LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows
Raffi Khatchadourian, Rolando Franco
Main category: cs.LG
TL;DR: Smaller LLMs (7B-8B parameters) achieve 100% output consistency for financial tasks, while larger models like GPT-OSS-120B show only 12.5% consistency, challenging the assumption that bigger models are better for regulated financial deployments.
Details
Motivation: Financial institutions need deterministic LLM outputs for auditability and trust in regulated tasks like reconciliations and regulatory reporting, but output drift undermines these requirements.
Method: Developed a finance-calibrated deterministic test harness with greedy decoding, fixed seeds, and SEC structure-aware retrieval. Used task-specific invariant checking with materiality thresholds and SEC citation validation across five model architectures.
Result: Smaller models (7B-8B) achieved 100% output consistency at T=0.0, while GPT-OSS-120B showed only 12.5% consistency. Structured tasks (SQL) remained stable even at T=0.2, while RAG tasks showed 25-75% drift.
Conclusion: Smaller models provide superior output consistency for regulated financial deployments, enabling compliance-ready AI systems that meet FSB, BIS, and CFTC requirements.
Abstract: Financial institutions deploy Large Language Models (LLMs) for reconciliations, regulatory reporting, and client communications, but nondeterministic outputs (output drift) undermine auditability and trust. We quantify drift across five model architectures (7B-120B parameters) on regulated financial tasks, revealing a stark inverse relationship: smaller models (Granite-3-8B, Qwen2.5-7B) achieve 100% output consistency at T=0.0, while GPT-OSS-120B exhibits only 12.5% consistency (95% CI: 3.5-36.0%) regardless of configuration (p<0.0001, Fisher’s exact test). This finding challenges conventional assumptions that larger models are universally superior for production deployment. Our contributions include: (i) a finance-calibrated deterministic test harness combining greedy decoding (T=0.0), fixed seeds, and SEC 10-K structure-aware retrieval ordering; (ii) task-specific invariant checking for RAG, JSON, and SQL outputs using finance-calibrated materiality thresholds (plus or minus 5%) and SEC citation validation; (iii) a three-tier model classification system enabling risk-appropriate deployment decisions; and (iv) an audit-ready attestation system with dual-provider validation. We evaluated five models (Qwen2.5-7B via Ollama, Granite-3-8B via IBM watsonx.ai, Llama-3.3-70B, Mistral-Medium-2505, and GPT-OSS-120B) across three regulated financial tasks. Across 480 runs (n=16 per condition), structured tasks (SQL) remain stable even at T=0.2, while RAG tasks show drift (25-75%), revealing task-dependent sensitivity. Cross-provider validation confirms deterministic behavior transfers between local and cloud deployments. We map our framework to Financial Stability Board (FSB), Bank for International Settlements (BIS), and Commodity Futures Trading Commission (CFTC) requirements, demonstrating practical pathways for compliance-ready AI deployments.
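The harness's invariant checking can be illustrated with a minimal sketch of the plus-or-minus 5% materiality comparison between two runs. The function name, field names, and dictionary format are hypothetical, not taken from the paper's code.

```python
def within_materiality(reference, candidate, threshold=0.05):
    """Check that every numeric field of a candidate run stays within
    a +/-5% materiality band around the reference run's value."""
    for key, ref_val in reference.items():
        cand_val = candidate.get(key)
        if cand_val is None:
            return False                       # missing field counts as drift
        if ref_val == 0:
            if cand_val != 0:
                return False
        elif abs(cand_val - ref_val) / abs(ref_val) > threshold:
            return False
    return True

run_ref = {"net_income": 100.0, "revenue": 500.0}
run_ok = {"net_income": 104.0, "revenue": 496.0}     # within 5%
run_drift = {"net_income": 120.0, "revenue": 500.0}  # 20% off on one field
```

An attestation pipeline would call a check like this on each of the repeated runs (n=16 per condition in the paper) and report the fraction that pass.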
[421] One Router to Route Them All: Homogeneous Expert Routing for Heterogeneous Graph Transformers
Georgiy Shakirov, Albert Arakelov
Main category: cs.LG
TL;DR: HER integrates Mixture-of-Experts into Heterogeneous Graph Transformers with type-agnostic routing, outperforming standard HGT and type-separated baselines by encouraging semantic specialization rather than type dependence.
Details
Motivation: Traditional HGNNs over-rely on node/edge type labels, which can limit cross-type knowledge transfer and cause surface-level overfitting. The paper explores whether type-specific experts are necessary in MoE for heterogeneous graphs.
Method: Proposes Homogeneous Expert Routing (HER) - an MoE layer for HGT that stochastically masks type embeddings during routing to encourage type-agnostic expert specialization, allowing experts to focus on semantic patterns rather than node types.
Result: HER consistently outperforms standard HGT and type-separated MoE baselines on IMDB, ACM, and DBLP datasets for link prediction. Analysis shows HER experts specialize by semantic patterns (e.g., movie genres) rather than node types.
Conclusion: Regularizing type dependence in expert routing yields more generalizable, efficient, and interpretable representations, establishing a new design principle for heterogeneous graph learning that prioritizes semantic patterns over surface-level type labels.
Abstract: A common practice in heterogeneous graph neural networks (HGNNs) is to condition parameters on node/edge types, assuming types reflect semantic roles. However, this can cause overreliance on surface-level labels and impede cross-type knowledge transfer. We explore integrating Mixture-of-Experts (MoE) into HGNNs, a direction underexplored despite MoE’s success in homogeneous settings. Crucially, we question the need for type-specific experts. We propose Homogeneous Expert Routing (HER), an MoE layer for Heterogeneous Graph Transformers (HGT) that stochastically masks type embeddings during routing to encourage type-agnostic specialization. Evaluated on IMDB, ACM, and DBLP for link prediction, HER consistently outperforms standard HGT and a type-separated MoE baseline. Analysis on IMDB shows HER experts specialize by semantic patterns (e.g., movie genres) rather than node types, confirming routing is driven by latent semantics. Our work demonstrates that regularizing type dependence in expert routing yields more generalizable, efficient, and interpretable representations, a new design principle for heterogeneous graph learning.
[422] FlowTIE: Flow-based Transport of Intensity Equation for Phase Gradient Estimation from 4D-STEM Data
Arya Bangun, Maximilian Töllner, Xuan Zhao, Christian Kübel, Hanno Scharr
Main category: cs.LG
TL;DR: FlowTIE is a neural network framework that combines the Transport of Intensity Equation with flow-based phase gradient representation for improved phase reconstruction from 4D-STEM data, particularly for thick specimens under dynamical scattering conditions.
Details
Motivation: To improve phase reconstruction accuracy and robustness for thick specimens under dynamical scattering conditions in 4D-STEM imaging, where classical methods face limitations.
Method: Integrates the Transport of Intensity Equation (TIE) with a flow-based representation of the phase gradient using neural networks, combining data-driven learning with physics-based priors, and can be integrated with the multislice method for thick specimens.
Result: FlowTIE demonstrates improved phase reconstruction accuracy compared to classical TIE and gradient-based optimization methods, operates faster, and successfully handles thick specimen conditions.
Conclusion: The framework effectively bridges data-driven learning with physics-based priors, providing a robust and accurate solution for phase reconstruction in 4D-STEM, particularly beneficial for thick specimens under dynamical scattering.
Abstract: We introduce FlowTIE, a neural-network-based framework for phase reconstruction from 4D-Scanning Transmission Electron Microscopy (STEM) data, which integrates the Transport of Intensity Equation (TIE) with a flow-based representation of the phase gradient. This formulation allows the model to bridge data-driven learning with physics-based priors, improving robustness under dynamical scattering conditions for thick specimens. Validation on simulated datasets of crystalline materials and benchmarking against classical TIE and gradient-based optimization methods are presented. The results demonstrate that FlowTIE improves phase reconstruction accuracy, runs fast, and can be integrated with a thick-specimen model, namely the multislice method.
[423] Private-RAG: Answering Multiple Queries with LLMs while Keeping Your Data Private
Ruihan Wu, Erchi Wang, Zhiyuan Zhang, Yu-Xiang Wang
Main category: cs.LG
TL;DR: Proposes two differentially private RAG algorithms (MURAG and MURAG-ADA) for multi-query settings that protect sensitive documents while maintaining utility across hundreds of queries.
Details
Motivation: Standard RAG systems risk leaking private information when the external corpus contains sensitive data, and prior DP-RAG work only addressed single-query settings, which are impractical for real usage.
Method: MURAG uses individual privacy filters to bound privacy loss based on document retrieval frequency rather than total queries. MURAG-ADA enhances utility by privately releasing query-specific thresholds for more precise document selection.
Result: Experiments show both methods scale to hundreds of queries within practical DP budget (ε≈10) while preserving meaningful utility across multiple LLMs and datasets.
Conclusion: The proposed multi-query DP-RAG algorithms provide practical privacy protection for realistic RAG usage scenarios without sacrificing utility.
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving documents from an external corpus at inference time. When this corpus contains sensitive information, however, unprotected RAG systems are at risk of leaking private information. Prior work has introduced differential privacy (DP) guarantees for RAG, but only in single-query settings, which fall short of realistic usage. In this paper, we study the more practical multi-query setting and propose two DP-RAG algorithms. The first, MURAG, leverages an individual privacy filter so that the accumulated privacy loss only depends on how frequently each document is retrieved rather than the total number of queries. The second, MURAG-ADA, further improves utility by privately releasing query-specific thresholds, enabling more precise selection of relevant documents. Our experiments across multiple LLMs and datasets demonstrate that the proposed methods scale to hundreds of queries within a practical DP budget ($\varepsilon\approx10$), while preserving meaningful utility.
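The individual privacy filter idea behind MURAG (charging privacy loss per document only when that document is actually retrieved) can be caricatured as a per-document budget ledger. This class, its binary accept/reject gate, and the fixed per-retrieval epsilon cost are simplified assumptions for illustration, not the paper's actual DP mechanism.

```python
class IndividualPrivacyFilter:
    """Per-document privacy accounting: each document carries its own
    epsilon ledger, charged only when that document is retrieved, so
    accumulated loss depends on retrieval frequency, not query count."""

    def __init__(self, budget):
        self.budget = budget     # per-document epsilon budget
        self.spent = {}          # doc_id -> accumulated epsilon

    def try_retrieve(self, doc_id, eps_cost):
        spent = self.spent.get(doc_id, 0.0)
        if spent + eps_cost > self.budget:
            return False         # budget exhausted: drop this document
        self.spent[doc_id] = spent + eps_cost
        return True

filt = IndividualPrivacyFilter(budget=1.0)
decisions = [filt.try_retrieve("doc-1", 0.4) for _ in range(4)]
```

A rarely retrieved document keeps most of its budget untouched no matter how many queries arrive, which is why the scheme scales to hundreds of queries.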
[424] Adaptive Graph Learning with Transformer for Multi-Reservoir Inflow Prediction
Pengfei Hu, Ming Fan, Xiaoxue Han, Chang Lu, Wei Zhang, Hyun Kang, Yue Ning, Dan Lu
Main category: cs.LG
TL;DR: AdaTrip is an adaptive graph learning framework for multi-reservoir inflow forecasting that dynamically captures spatial-temporal dependencies among interconnected reservoirs, outperforming existing methods and providing interpretable insights.
Details
Motivation: Existing reservoir inflow prediction methods focus on single-reservoir models and ignore spatial dependencies between interconnected reservoirs, limiting their effectiveness for comprehensive water resource management.
Method: AdaTrip constructs dynamic graphs with reservoirs as nodes and directed edges representing hydrological connections, using attention mechanisms to automatically identify crucial spatial and temporal dependencies in multi-reservoir systems.
Result: Evaluation on thirty reservoirs in the Upper Colorado River Basin shows AdaTrip outperforms existing baselines, with particular improvements for reservoirs with limited records through parameter sharing, and provides interpretable attention maps.
Conclusion: AdaTrip offers an effective framework for multi-reservoir inflow forecasting that captures complex spatial-temporal dependencies while providing interpretable insights to support operational water management decisions.
Abstract: Reservoir inflow prediction is crucial for water resource management, yet existing approaches mainly focus on single-reservoir models that ignore spatial dependencies among interconnected reservoirs. We introduce AdaTrip as an adaptive, time-varying graph learning framework for multi-reservoir inflow forecasting. AdaTrip constructs dynamic graphs where reservoirs are nodes with directed edges reflecting hydrological connections, employing attention mechanisms to automatically identify crucial spatial and temporal dependencies. Evaluation on thirty reservoirs in the Upper Colorado River Basin demonstrates superiority over existing baselines, with improved performance for reservoirs with limited records through parameter sharing. Additionally, AdaTrip provides interpretable attention maps at edge and time-step levels, offering insights into hydrological controls to support operational decision-making. Our code is available at https://github.com/humphreyhuu/AdaTrip.
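The attention-based aggregation over a reservoir's upstream neighbors can be sketched with plain dot-product attention; the feature vectors and the single-head, unparameterized scoring are illustrative simplifications of AdaTrip's mechanism, but the resulting weights are exactly the kind of edge-level attention map the paper reports as interpretable.

```python
import math

def attention_aggregate(query, neighbors):
    """Dot-product attention over a node's neighbors: scores are
    softmaxed into weights (an interpretable attention map over
    edges), and the update is the weighted sum of neighbor features."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in neighbors]
    mx = max(scores)                          # stabilize the softmax
    w = [math.exp(s - mx) for s in scores]
    total = sum(w)
    w = [wi / total for wi in w]
    agg = [sum(wi * feat[d] for wi, feat in zip(w, neighbors))
           for d in range(len(query))]
    return w, agg

# A reservoir whose state aligns with its first upstream neighbor
# attends mostly to that neighbor.
weights, agg = attention_aggregate([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```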
[425] Enhancing Binary Encoded Crime Linkage Analysis Using Siamese Network
Yicheng Zhan, Fahim Ahmed, Amy Burrell, Matthew J. Tonkin, Sarah Galambos, Jessica Woodhams, Dalal Alrajeh
Main category: cs.LG
TL;DR: A Siamese Autoencoder framework improves crime linkage analysis by learning latent representations from complex crime data, achieving up to 9% AUC improvement over traditional methods.
Details
Motivation: Traditional crime linkage methods struggle with high-dimensional, sparse, and heterogeneous data, limiting their effectiveness in identifying serial offenders.
Method: Proposed a Siamese Autoencoder that learns latent representations and integrates geographic-temporal features at the decoder stage to amplify behavioral representations in sparse data.
Result: The framework achieved consistent improvements across multiple metrics, with up to 9% AUC improvement over traditional methods, while providing interpretable insights for investigators.
Conclusion: Advanced machine learning approaches like Siamese Autoencoders can substantially enhance crime linkage accuracy and support investigative decision-making in complex crime data.
Abstract: Effective crime linkage analysis is crucial for identifying serial offenders and enhancing public safety. To address limitations of traditional crime linkage methods in handling high-dimensional, sparse, and heterogeneous data, we propose a Siamese Autoencoder framework that learns meaningful latent representations and uncovers correlations in complex crime data. Using data from the Violent Crime Linkage Analysis System (ViCLAS), maintained by the Serious Crime Analysis Section of the UK’s National Crime Agency, our approach mitigates signal dilution in sparse feature spaces by integrating geographic-temporal features at the decoder stage. This design amplifies behavioral representations rather than allowing them to be overshadowed at the input level, yielding consistent improvements across multiple evaluation metrics. We further analyze how different domain-informed data reduction strategies influence model performance, providing practical guidance for preprocessing in crime linkage contexts. Our results show that advanced machine learning approaches can substantially enhance linkage accuracy, improving AUC by up to 9% over traditional methods while offering interpretable insights to support investigative decision-making.
[426] CAE: Character-Level Autoencoder for Non-Semantic Relational Data Grouping
Veera V S Bhargav Nunna, Shinae Kang, Zheyuan Zhou, Virginia Wang, Sucharitha Boinapally, Michael Foley
Main category: cs.LG
TL;DR: Character-Level Autoencoder (CAE) approach for identifying semantically identical columns in non-semantic relational datasets, achieving 80.95% accuracy in top 5 column matching tasks.
Details
Motivation: Enterprise relational databases contain vast amounts of non-semantic data (IP addresses, product IDs, encoded keys, timestamps) that challenge traditional semantic analysis and NLP approaches.
Method: Character-level autoencoder with fixed dictionary constraints that encodes text representations of non-semantic columns and extracts high-dimensional feature embeddings for data grouping.
Result: Achieved 80.95% accuracy in top 5 column matching tasks, substantially outperforming traditional NLP approaches like Bag of Words (47.62%).
Conclusion: The CAE approach bridges theoretical character-level neural architectures with practical enterprise data management, providing automated schema understanding and data profiling for non-semantic industrial datasets at scale.
Abstract: Enterprise relational databases increasingly contain vast amounts of non-semantic data - IP addresses, product identifiers, encoded keys, and timestamps - that challenge traditional semantic analysis. This paper introduces a novel Character-Level Autoencoder (CAE) approach that automatically identifies and groups semantically identical columns in non-semantic relational datasets by detecting column similarities based on data patterns and structures. Unlike conventional Natural Language Processing (NLP) models that struggle with limitations in semantic interpretability and out-of-vocabulary tokens, our approach operates at the character level with fixed dictionary constraints, enabling scalable processing of large-scale data lakes and warehouses. The CAE architecture encodes text representations of non-semantic relational table columns and extracts high-dimensional feature embeddings for data grouping. By maintaining a fixed dictionary size, our method significantly reduces both memory requirements and training time, enabling efficient processing of large-scale industrial data environments. Experimental evaluation demonstrates substantial performance gains: our CAE approach achieved 80.95% accuracy in top 5 column matching tasks across relational datasets, substantially outperforming traditional NLP approaches such as Bag of Words (47.62%). These results demonstrate its effectiveness for identifying and clustering identical columns in relational datasets. This work bridges the gap between theoretical advances in character-level neural architectures and practical enterprise data management challenges, providing an automated solution for schema understanding and data profiling of non-semantic industrial datasets at scale.
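The fixed-dictionary character encoding that feeds such an autoencoder can be sketched as follows. The dictionary contents, padding scheme, and maximum length are illustrative assumptions; the point is that vocabulary size stays fixed regardless of what strings appear, so there are no out-of-vocabulary tokens.

```python
import string

# Fixed character dictionary; anything outside it maps to index 0,
# which doubles as the padding index.
CHARS = string.ascii_lowercase + string.digits + ".:-_"
CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(CHARS)}

def encode_value(value, max_len=16):
    """Encode a non-semantic cell value (IP address, product ID,
    timestamp) as a fixed-length vector of character indices,
    zero-padded on the right."""
    idx = [CHAR_TO_IDX.get(c, 0) for c in value.lower()[:max_len]]
    return idx + [0] * (max_len - len(idx))

vec = encode_value("192.168.0.1")
```

Column embeddings would then come from pooling the autoencoder's latent codes over a sample of each column's encoded values.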
[427] ZeroSim: Zero-Shot Analog Circuit Evaluation with Unified Transformer Embeddings
Xiaomeng Yang, Jian Gao, Yanzhi Wang, Xuan Zhang
Main category: cs.LG
TL;DR: ZeroSim is a transformer-based framework for analog circuit performance modeling that achieves both in-distribution generalization across trained topologies and zero-shot generalization to unseen topologies without fine-tuning.
Details
Motivation: Efficient performance evaluation remains a major bottleneck in analog circuit design automation. Traditional SPICE simulations are time-consuming, while existing ML methods require topology-specific retraining or manual substructure segmentation, limiting scalability and adaptability.
Method: Three key strategies: (1) diverse training corpus of 3.6M instances covering 60+ amplifier topologies, (2) unified topology embeddings using global-aware tokens and hierarchical attention, (3) topology-conditioned parameter mapping for consistent structural representations.
Result: ZeroSim significantly outperforms baseline models (MLPs, GNNs, transformers) in zero-shot predictions across different amplifier topologies. When integrated into RL-based parameter optimization, it achieves 13x speedup compared to SPICE simulations.
Conclusion: ZeroSim demonstrates practical value for analog circuit design automation by enabling efficient and accurate performance evaluation with robust generalization capabilities.
Abstract: Although recent advancements in learning-based analog circuit design automation have tackled tasks such as topology generation, device sizing, and layout synthesis, efficient performance evaluation remains a major bottleneck. Traditional SPICE simulations are time-consuming, while existing machine learning methods often require topology-specific retraining or manual substructure segmentation for fine-tuning, hindering scalability and adaptability. In this work, we propose ZeroSim, a transformer-based performance modeling framework designed to achieve robust in-distribution generalization across trained topologies under novel parameter configurations and zero-shot generalization to unseen topologies without any fine-tuning. We apply three key enabling strategies: (1) a diverse training corpus of 3.6 million instances covering over 60 amplifier topologies, (2) unified topology embeddings leveraging global-aware tokens and hierarchical attention to robustly generalize to novel circuits, and (3) a topology-conditioned parameter mapping approach that maintains consistent structural representations independent of parameter variations. Our experimental results demonstrate that ZeroSim significantly outperforms baseline models such as multilayer perceptrons, graph neural networks and transformers, delivering accurate zero-shot predictions across different amplifier topologies. Additionally, when integrated into a reinforcement learning-based parameter optimization pipeline, ZeroSim achieves a remarkable speedup (13x) compared to conventional SPICE simulations, underscoring its practical value for a wide range of analog circuit design automation tasks.
[428] Probabilities Are All You Need: A Probability-Only Approach to Uncertainty Estimation in Large Language Models
Manh Nguyen, Sunil Gupta, Hung Le
Main category: cs.LG
TL;DR: Proposes an efficient, training-free uncertainty estimation method for LLMs that uses top-K probabilities with adaptive K selection to detect hallucinations without extra computation.
Details
Motivation: LLMs are vulnerable to hallucinations but existing uncertainty estimation methods require multiple samples or extra computation, making them inefficient.
Method: Uses top-K probabilities from LLM responses to approximate predictive entropy, with an adaptive mechanism to determine optimal K and filter low-confidence probabilities.
Result: Outperforms expensive state-of-the-art baselines on three free-form question-answering datasets across several LLMs.
Conclusion: Provides an efficient uncertainty estimation method that enhances LLM trustworthiness without requiring training or extra computation.
Abstract: Large Language Models (LLMs) exhibit strong performance across various natural language processing (NLP) tasks but remain vulnerable to hallucinations, generating factually incorrect or misleading outputs. Uncertainty estimation, often using predictive entropy estimation, is key to addressing this issue. However, existing methods often require multiple samples or extra computation to assess semantic entropy. This paper proposes an efficient, training-free uncertainty estimation method that approximates predictive entropy using the responses’ top-$K$ probabilities. Moreover, we employ an adaptive mechanism to determine $K$ to enhance flexibility and filter out low-confidence probabilities. Experimental results on three free-form question-answering datasets across several LLMs demonstrate that our method outperforms expensive state-of-the-art baselines, contributing to the broader goal of enhancing LLM trustworthiness.
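The core computation, entropy over filtered top-K probabilities, is simple to sketch. The fixed confidence floor below is a stand-in for the paper's adaptive-K mechanism, which the sketch does not reproduce.

```python
import math

def topk_entropy(probs, floor=0.01):
    """Approximate predictive entropy from the top-K next-token
    probabilities: drop entries below a confidence floor, renormalize,
    and compute Shannon entropy over what remains."""
    kept = [p for p in probs if p >= floor]
    total = sum(kept)
    kept = [p / total for p in kept]
    return -sum(p * math.log(p) for p in kept)

confident = topk_entropy([0.97, 0.01, 0.01, 0.005])   # peaked: low entropy
uncertain = topk_entropy([0.3, 0.25, 0.25, 0.2])      # flat: high entropy
```

A single forward pass already exposes these probabilities, which is why no extra sampling or computation is needed.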
[429] Diffusion Guided Adversarial State Perturbations in Reinforcement Learning
Xiaolin Sun, Feidi Liu, Zhengming Ding, ZiZhan Zheng
Main category: cs.LG
TL;DR: SHIFT is a novel diffusion-based state perturbation attack that breaks existing RL defenses by generating semantically different yet realistic adversarial states, outperforming traditional l_p norm attacks.
Details
Motivation: Current RL defenses appear effective, but this is due to limitations of existing l_p norm-constrained attacks that can't alter image semantics even with large perturbation budgets.
Method: Proposed SHIFT, a policy-agnostic diffusion-based state perturbation attack that generates perturbed states that are semantically different from true states while remaining realistic and history-aligned.
Result: SHIFT effectively breaks existing defenses including sophisticated ones, significantly outperforming existing attacks while being more perceptually stealthy.
Conclusion: RL agents are vulnerable to semantics-aware adversarial perturbations, highlighting the need for more robust policies beyond current defense mechanisms.
Abstract: Reinforcement learning (RL) systems, while achieving remarkable success across various domains, are vulnerable to adversarial attacks. This is especially a concern in vision-based environments where minor manipulations of high-dimensional image inputs can easily mislead the agent’s behavior. To this end, various defenses have been proposed recently, with state-of-the-art approaches achieving robust performance even under large state perturbations. However, after closer investigation, we found that the effectiveness of the current defenses is due to a fundamental weakness of the existing $l_p$ norm-constrained attacks, which can barely alter the semantics of image input even under a relatively large perturbation budget. In this work, we propose SHIFT, a novel policy-agnostic diffusion-based state perturbation attack to go beyond this limitation. Our attack is able to generate perturbed states that are semantically different from the true states while remaining realistic and history-aligned to avoid detection. Evaluations show that our attack effectively breaks existing defenses, including the most sophisticated ones, significantly outperforming existing attacks while being more perceptually stealthy. The results highlight the vulnerability of RL agents to semantics-aware adversarial perturbations, indicating the importance of developing more robust policies.
[430] Intelligent Optimization of Multi-Parameter Micromixers Using a Scientific Machine Learning Framework
Meraj Hassanzadeh, Ehsan Ghaderi, Mohamad Ali Bijarchi, Siamak Kazemzadeh Hannani
Main category: cs.LG
TL;DR: A Sci-ML framework using DRL and PINNs for instant multidimensional optimization, demonstrated on micromixer design with 32% efficiency improvement.
Details
Motivation: Overcome limitations of traditional simulation-based optimization methods that are slow, single-problem focused, and computationally expensive for meshing and numerical simulation.
Method: A Deep Reinforcement Learning agent interacts with a parametric Physics-Informed Neural Network environment to explore parameter relationships and optimize geometric/physical parameters for micromixer efficiency across Schmidt numbers.
Result: Achieved consistent efficiency improvements across Schmidt number spectrum, with maximum 32% improvement at Schmidt number 13.3, outperforming baseline values.
Conclusion: The proposed Sci-ML framework provides faster, more efficient multidimensional optimization compared to traditional methods, as validated by superior performance against Genetic Algorithm.
Abstract: Multidimensional optimization has consistently been a critical challenge in engineering. However, traditional simulation-based optimization methods have long been plagued by significant limitations: they are typically capable of optimizing only a single problem at a time and require substantial computational time for meshing and numerical simulation. This paper introduces a novel framework leveraging cutting-edge Scientific Machine Learning (Sci-ML) methodologies to overcome these inherent drawbacks of conventional approaches. The proposed method provides instantaneous solutions to a spectrum of complex, multidimensional optimization problems. A micromixer case study is employed to demonstrate this methodology. An agent, operating on a Deep Reinforcement Learning (DRL) architecture, serves as the optimizer to explore the relationships between key problem parameters. This optimizer interacts with an environment constituted by a parametric Physics-Informed Neural Network (PINN), which responds to the agent’s actions at a significantly higher speed than traditional numerical methods. The agent’s objective, conditioned on the Schmidt number, is to discover the optimal geometric and physical parameters that maximize the micromixer’s efficiency. After training the agent across a wide range of Schmidt numbers, we analyzed the resulting optimal designs. Across this entire spectrum, the achieved efficiency was consistently greater than the normalized baseline value. The maximum efficiency occurred at a Schmidt number of 13.3, demonstrating an improvement of approximately 32%. Finally, a comparative analysis with a Genetic Algorithm was conducted under equivalent conditions to underscore the advantages of the proposed method.
[431] A Ranking-Based Optimization Algorithm for the Vehicle Relocation Problem in Car Sharing Services
Piotr Szwed, Paweł Skrzynski, Jarosław Wąs
Main category: cs.LG
TL;DR: Proposes a ranking-based algorithm for vehicle relocation in free-floating car-sharing using zone-based optimization and scooter-based personnel transfers, achieving 8.44% improvement over baseline.
Details
Motivation: Address the Vehicle Relocation Problem in free-floating car-sharing services to optimize vehicle distribution and improve service efficiency.
Method: Divides the service area into zones with similar temporal patterns, then applies a fast ranking-based algorithm considering available cars, demand probability density, and trip durations.
Result: Achieved 8.44% average improvement over baseline in total travel time, compared to 19.6% improvement from MIP solver (which had additional trip selection capabilities).
Conclusion: The proposed solution can improve performance metrics by 3%-10% depending on workforce size, offering practical optimization for car-sharing services.
Abstract: The paper addresses the Vehicle Relocation Problem in free-floating car-sharing services by presenting a solution focused on strategies for repositioning vehicles and transferring personnel with the use of scooters. Our method begins by dividing the service area into zones that group regions with similar temporal patterns of vehicle presence and service demand, allowing the application of discrete optimization methods. In the next stage, we propose a fast ranking-based algorithm that makes its decisions on the basis of the number of cars available in each zone, the projected probability density of demand, and estimated trip durations. The experiments were carried out on the basis of real-world data originating from a major car-sharing service operator in Poland. The results of this algorithm are evaluated against scenarios without optimization that constitute a baseline and compared with the results of an exact algorithm to solve the Mixed Integer Programming (MIP) model. Total travel time was used as the performance metric. Under identical conditions (number of vehicles, staff, and demand distribution), the average improvements with respect to the baseline of our algorithm and the MIP solver were equal to 8.44% and 19.6%, respectively. However, it should be noted that the MIP model also mimicked decisions on trip selection, which are excluded by current services' business rules. The analysis of results suggests that, depending on the size of the workforce, the application of the proposed solution allows for improving performance metrics by roughly 3%-10%.
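A zone-ranking step of this kind might look like the following sketch, which scores zones by projected demand minus cars currently available. The scoring formula is an assumption for illustration; the paper's ranking additionally weighs demand probability densities and estimated trip durations.

```python
def rank_zones(available, demand_density):
    """Rank zones by unmet demand: projected demand density minus
    cars currently available, highest shortfall first."""
    shortfall = {z: demand_density[z] - available.get(z, 0)
                 for z in demand_density}
    return sorted(shortfall, key=shortfall.get, reverse=True)

available = {"A": 5, "B": 1, "C": 0}
demand = {"A": 2.0, "B": 4.0, "C": 3.5}
ranking = rank_zones(available, demand)   # zones most in need first
```

Relocation staff on scooters would then be dispatched to move vehicles from the bottom of the ranking toward the top.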
[432] Multistep Quasimetric Learning for Scalable Goal-conditioned Reinforcement Learning
Bill Chunyuan Zheng, Vivek Myers, Benjamin Eysenbach, Sergey Levine
Main category: cs.LG
TL;DR: The paper presents a goal-conditioned reinforcement learning (GCRL) method that integrates temporal difference and Monte Carlo approaches to estimate temporal distances, enabling effective long-horizon reasoning and real-world robotic manipulation.
Details
Motivation: Address the challenge of reasoning over long horizons in AI, particularly the difficulty in estimating temporal distance between observations, where existing methods have limitations - temporal difference methods have optimality guarantees but perform worse than Monte Carlo methods that lack such guarantees.
Method: Develop a practical GCRL method that fits a quasimetric distance using a multistep Monte-Carlo return, integrating temporal difference and Monte Carlo approaches.
Result: The method outperforms existing GCRL methods on long-horizon simulated tasks with up to 4000 steps, even with visual observations, and enables stitching in real-world robotic manipulation (Bridge setup).
Conclusion: This is the first end-to-end GCRL method that enables multistep stitching in real-world manipulation domains from unlabeled offline datasets of visual observations.
Abstract: Learning how to reach goals in an environment is a longstanding challenge in AI, yet reasoning over long horizons remains a challenge for modern methods. The key question is how to estimate the temporal distance between pairs of observations. While temporal difference methods leverage local updates to provide optimality guarantees, they often perform worse than Monte Carlo methods that perform global updates (e.g., with multi-step returns), which lack such guarantees. We show how these approaches can be integrated into a practical GCRL method that fits a quasimetric distance using a multistep Monte-Carlo return. We show our method outperforms existing GCRL methods on long-horizon simulated tasks with up to 4000 steps, even with visual observations. We also demonstrate that our method can enable stitching in the real-world robotic manipulation domain (Bridge setup). Our approach is the first end-to-end GCRL method that enables multistep stitching in this real-world manipulation domain from an unlabeled offline dataset of visual observations.
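The multistep Monte-Carlo return used as a regression target can be sketched generically: sum n discounted rewards, then bootstrap with a value estimate, blending Monte Carlo's global signal with temporal difference's local one. The terminal-handling convention and value indexing here are simplifying assumptions, and the quasimetric parameterization of the distance itself is omitted.

```python
def multistep_return(rewards, values, gamma, n):
    """n-step Monte Carlo return: sum n discounted rewards, then
    bootstrap with the value estimate of the state reached after n
    steps (skipped if the trajectory ends within n steps)."""
    g = 0.0
    for t in range(min(n, len(rewards))):
        g += (gamma ** t) * rewards[t]
    if n < len(rewards):                 # trajectory continues: bootstrap
        g += (gamma ** n) * values[n]
    return g

# Two targets for the same trajectory: a 2-step bootstrapped return
# and the full Monte Carlo return.
g_2step = multistep_return([1.0, 1.0, 1.0, 1.0],
                           [0.0, 0.0, 5.0, 0.0, 0.0], 0.9, 2)
g_full = multistep_return([1.0, 1.0, 1.0, 1.0], [0.0] * 5, 0.9, 4)
```

Small n behaves like temporal difference learning (local, bootstrapped); n equal to the trajectory length recovers the pure Monte Carlo return.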
[433] Global Optimization on Graph-Structured Data via Gaussian Processes with Spectral Representations
Shu Hong, Yongsheng Mei, Mahdi Imani, Tian Lan
Main category: cs.LG
TL;DR: A scalable Bayesian optimization framework for graph-structured domains using low-rank spectral representations and Gaussian process surrogates.
Details
Motivation: Bayesian optimization struggles with graph-structured domains due to discrete/combinatorial nature; existing methods are impractical for large graphs or have slow convergence.
Method: Uses low-rank spectral representations to build GP surrogates from sparse structural observations, jointly infers graph structure and node representations through learnable embeddings.
Result: Achieves faster convergence and improved optimization performance compared to prior methods on synthetic and real-world datasets.
Conclusion: The framework enables efficient global search and principled uncertainty estimation for graph optimization, with theoretical guarantees for accurate graph structure recovery.
Abstract: Bayesian optimization (BO) is a powerful framework for optimizing expensive black-box objectives, yet extending it to graph-structured domains remains challenging due to the discrete and combinatorial nature of graphs. Existing approaches often rely on either full graph topology (impractical for large or partially observed graphs) or incremental exploration, which can lead to slow convergence. We introduce a scalable framework for global optimization over graphs that employs low-rank spectral representations to build Gaussian process (GP) surrogates from sparse structural observations. The method jointly infers graph structure and node representations through learnable embeddings, enabling efficient global search and principled uncertainty estimation even with limited data. We also provide theoretical analysis establishing conditions for accurate recovery of underlying graph structure under different sampling regimes. Experiments on synthetic and real-world datasets demonstrate that our approach achieves faster convergence and improved optimization performance compared to prior methods.
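The core recipe (low-rank spectral node features feeding a GP surrogate, queried by an acquisition rule) can be sketched as below. This is an illustrative toy, not the paper's algorithm: the Laplacian eigenvectors, RBF kernel, and UCB acquisition are all stand-in choices.

```python
import numpy as np

def spectral_embedding(A, m):
    """Low-rank node features: the m smoothest eigenvectors of the graph Laplacian."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :m]

def gp_posterior(X_obs, y_obs, X_all, length=1.0, noise=1e-4):
    """GP posterior mean/variance with an RBF kernel over the spectral features."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * length ** 2))
    K = k(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = k(X_all, X_obs)
    mean = Ks @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mean, np.maximum(var, 0.0)

# Toy problem: a 6-node path graph with an objective peaking at node 3.
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1.0
f = np.array([0.0, 0.5, 1.0, 2.0, 1.0, 0.5])

X = spectral_embedding(A, m=3)
obs = [0, 5]                                           # sparse observations
mean, var = gp_posterior(X[obs], f[obs], X)
next_node = int(np.argmax(mean + 2.0 * np.sqrt(var)))  # UCB acquisition
```

Unobserved interior nodes retain high posterior variance, which is what drives global search with limited data.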
[434] From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
Donglai Xu, Hongzheng Yang, Yuzhi Zhao, Pingping Zhang, Jinpeng Chen, Wenao Ma, Zhijian Hou, Mengyang Wu, Xiaolei Li, Senkang Hu, Ziyi Guan, Jason Chun Lok Li, Lai Man Po
Main category: cs.LG
TL;DR: Proposes a two-stage token-level entropy optimization method for RLVR in MLLMs that transitions from entropy maximization (exploration) to minimization (exploitation) to improve noise tolerance and performance.
Details
Motivation: RLVR for MLLMs depends on high-quality labeled data, but real-world data often has annotation noise. Existing unsupervised methods can overfit to incorrect labels and limit reward ranking signals for GRPO.
Method: Two-stage token-level entropy optimization: exploration phase with entropy maximization for diverse outputs and regularization against noisy labels, followed by exploitation phase with entropy minimization for confident outputs and knowledge consolidation.
Result: Consistently outperforms prior approaches across three MLLM backbones (Qwen2-VL-2B, Qwen2-VL-7B, Qwen2.5-VL-3B) in diverse noise settings and multiple tasks, delivering robust and superior performance.
Conclusion: The phased entropy optimization strategy effectively unifies and enhances external, internal, and entropy-based methods, providing robust noise tolerance and improved performance for RLVR in MLLMs.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation, which enables more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones - Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B - spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.
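The exploration-to-exploitation switch can be illustrated by a token-entropy term whose coefficient flips sign mid-training. This is a minimal sketch under assumed conventions (a hard switch at half of training and a fixed coefficient beta are placeholders, not the paper's schedule):

```python
import numpy as np

def token_entropy(logits):
    """Per-token Shannon entropy of the categorical distribution given raw logits."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilized softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_coeff(step, total_steps, switch_frac=0.5, beta=0.01):
    """Sign-flipping coefficient: entropy bonus (exploration) early,
    entropy penalty (exploitation) late. The hard switch point is an assumption."""
    return beta if step < switch_frac * total_steps else -beta

# The total loss would be: policy_loss - coeff * mean(token_entropy(logits)),
# so a positive coeff encourages diverse outputs and a negative one suppresses them.
logits = np.array([[2.0, 0.1, -1.0],    # confident token distribution
                   [0.0, 0.0, 0.0]])    # uniform token distribution
H = token_entropy(logits)
early = entropy_coeff(step=100, total_steps=1000)
late = entropy_coeff(step=900, total_steps=1000)
```

Early entropy maximization keeps intra-group reward variation alive, which the GRPO ranking signal needs; late minimization consolidates confident predictions.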
[435] Schedulers for Schedule-free: Theoretically inspired hyperparameters
Yuen-Man Pun, Matthew Buchholz, Robert M. Gower
Main category: cs.LG
TL;DR: Extends schedule-free optimization theory to support learning rate schedulers, proves optimal convergence rates for warmup-stable-decay schedule, and introduces a new adaptive Polyak learning rate schedule with optimal anytime convergence.
Details
Motivation: Current schedule-free theory only supports constant learning rates, but practical implementations use warm-up schedules. There's a gap between theory and practice that needs to be bridged.
Method: Extends last-iterate convergence theory to allow any scheduler, updates averaging parameter as function of learning rate, and designs new adaptive Polyak learning rate schedule using convexity principles.
Result: Proves optimal O(1/√T) convergence rate for warmup-stable-decay schedule, demonstrates theory has predictive power for deep neural networks despite convexity assumptions, and shows new Polyak schedule performs well on black-box model distillation tasks.
Conclusion: The extended theory successfully bridges the gap between schedule-free optimization theory and practical implementations, enabling optimal convergence with various learning rate schedules and introducing an effective adaptive Polyak schedule.
Abstract: The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. The current theory for schedule-free only supports a constant learning rate, whereas the implementation used in practice uses a warm-up schedule. We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler, and how the averaging parameter has to be updated as a function of the learning rate. We then perform experiments showing how our convergence theory has some predictive power with regard to practical executions on deep neural networks, despite the fact that this theory relies on assuming convexity. When applied to the warmup-stable-decay (wsd) schedule, our theory shows the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free. We prove an optimal anytime last-iterate convergence for our new Polyak schedule, and show that it performs well compared to a number of baselines on a black-box model distillation task.
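The classical Polyak step size that the new schedule builds on is easy to state: γ_t = (f(x_t) − f*) / ‖∇f(x_t)‖². Below is a minimal sketch on a convex quadratic, assuming f* is known (the paper's adaptive schedule-free variant is more involved):

```python
import numpy as np

def polyak_step(f_x, f_star, grad):
    """Classical Polyak step size: (f(x) - f*) / ||grad||^2."""
    g2 = float(grad @ grad)
    return (f_x - f_star) / g2 if g2 > 0 else 0.0

# Gradient descent with Polyak steps on f(x) = 0.5 ||x||^2, where f* = 0.
x = np.array([4.0, -2.0])
for _ in range(30):
    grad = x                                   # gradient of f is x itself
    lr = polyak_step(0.5 * float(x @ x), 0.0, grad)
    x = x - lr * grad
final_f = 0.5 * float(x @ x)
```

On this quadratic the Polyak step is exactly 1/2 at every iterate, so the suboptimality shrinks geometrically without any tuned schedule.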
[436] Physical Consistency of Aurora’s Encoder: A Quantitative Study
Benjamin Richards, Pushpa Kumar Balan
Main category: cs.LG
TL;DR: Probing the Aurora weather model’s encoder reveals that it learns physically consistent features such as land-sea boundaries and extreme weather patterns, though it struggles with rare events, highlighting the need for interpretability in AI weather models.
Details
Motivation: Large-scale weather forecasting models like Aurora are accurate but lack transparency, hindering adoption in high-stakes operational settings due to their 'black box' nature.
Method: Used large-scale dataset of embeddings to train linear classifiers for identifying three physical concepts: land-sea boundary, extreme temperature events, and atmospheric instability.
Result: Quantitative evidence shows Aurora learns physically consistent features but has limitations in capturing the rarest events.
Conclusion: Interpretability methods are critically needed to validate and build trust in next-generation AI-driven weather models.
Abstract: The high accuracy of large-scale weather forecasting models like Aurora is often accompanied by a lack of transparency, as their internal representations remain largely opaque. This “black box” nature hinders their adoption in high-stakes operational settings. In this work, we probe the physical consistency of Aurora’s encoder by investigating whether its latent representations align with known physical and meteorological concepts. Using a large-scale dataset of embeddings, we train linear classifiers to identify three distinct concepts: the fundamental land-sea boundary, high-impact extreme temperature events, and atmospheric instability. Our findings provide quantitative evidence that Aurora learns physically consistent features, while also highlighting its limitations in capturing the rarest events. This work underscores the critical need for interpretability methods to validate and build trust in the next generation of AI-driven weather models.
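The probing methodology (fitting a linear classifier on frozen embeddings to test whether a concept is linearly decodable) can be sketched as follows. The data here are synthetic stand-ins for Aurora embeddings and the land-sea label, not the paper's dataset:

```python
import numpy as np

def train_linear_probe(X, y, lr=0.5, steps=500):
    """Logistic-regression probe: is the concept linearly decodable from embeddings?"""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)             # gradient of the log loss
        b -= lr * float(np.mean(p - y))
    return w, b

def probe_accuracy(w, b, X, y):
    return float(np.mean(((X @ w + b) > 0).astype(float) == y))

# Synthetic stand-in for encoder embeddings: the binary concept (say, land vs. sea)
# is carried entirely by dimension 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(float)
w, b = train_linear_probe(X, y)
acc = probe_accuracy(w, b, X, y)
```

High probe accuracy is evidence the concept is linearly encoded; for rare events (heavily imbalanced labels), the same probe degrades, which is the failure mode the paper reports.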
[437] Analyzing Political Text at Scale with Online Tensor LDA
Sara Kangaslahti, Danny Ebanks, Jean Kossaifi, Anqi Liu, R. Michael Alvarez, Animashree Anandkumar
Main category: cs.LG
TL;DR: Proposes Tensor LDA, a scalable topic modeling method that handles billions of documents with linear scaling and GPU acceleration, and demonstrates applications in political science.
Details
Motivation: Existing topic modeling methods don't scale to billion-document datasets needed for real-time analysis of large social media corpora in political science research.
Method: Tensor Latent Dirichlet Allocation (TLDA) with identifiable parameter guarantees, linear scaling, and GPU-based implementation for computational efficiency.
Result: Achieves 3-4x speedup over parallel LDA, scales linearly to billion-document datasets, and enables large-scale studies of #MeToo movement and election fraud conversations.
Conclusion: TLDA provides social scientists with near real-time analysis capabilities for very large corpora, enabling important theoretically-relevant research on salient issues.
Abstract: This paper proposes a topic modeling method that scales linearly to billions of documents. We make three core contributions: i) we present a topic modeling method, Tensor Latent Dirichlet Allocation (TLDA), that has identifiable and recoverable parameter guarantees and sample complexity guarantees for large data; ii) we show that this method is computationally and memory efficient (achieving speeds over 3-4x those of prior parallelized Latent Dirichlet Allocation (LDA) methods), and that it scales linearly to text datasets with over a billion documents; iii) we provide an open-source, GPU-based implementation of this method. This scaling enables previously prohibitive analyses, and we perform two new real-world, large-scale studies of interest to political scientists: we provide the first thorough analysis of the evolution of the #MeToo movement through the lens of over two years of Twitter conversation and a detailed study of social media conversations about election fraud in the 2020 presidential election. Thus this method provides social scientists with the ability to study very large corpora at scale and to answer important theoretically-relevant questions about salient issues in near real-time.
[438] Multi-Objective Bilevel Learning
Zhiyao Zhang, Zhuqing Liu, Xin Zhang, Wen-Yen Chen, Jiyan Yang, Jia Liu
Main category: cs.LG
TL;DR: The paper introduces a multi-objective bilevel learning (MOBL) framework to address conflicting objectives in ML applications, proposing the WC-MHGD algorithm for efficient optimization with theoretical guarantees.
Details
Motivation: Modern ML frameworks face multiple conflicting objectives with coupled variables across layers, creating a need for MOBL, which remains under-explored.
Method: Proposed weighted-Chebyshev multi-hyper-gradient-descent (WC-MHGD) for both deterministic and stochastic settings, ensuring Pareto-stationarity with finite-time convergence guarantees.
Result: The algorithm achieves low oracle complexity and enables systematic Pareto front exploration, confirmed through extensive experiments.
Conclusion: WC-MHGD provides a robust theoretical and algorithmic foundation for MOBL, addressing key challenges in multi-objective optimization with coupled variables.
Abstract: As machine learning (ML) applications grow increasingly complex in recent years, modern ML frameworks often need to address multiple potentially conflicting objectives with coupled decision variables across different layers. This creates a compelling need for multi-objective bilevel learning (MOBL). So far, however, the field of MOBL remains in its infancy and many important problems remain under-explored. This motivates us to fill this gap and systematically investigate the theoretical and algorithmic foundation of MOBL. Specifically, we consider MOBL problems with multiple conflicting objectives guided by preferences at the upper-level subproblem, where part of the inputs depend on the optimal solution of the lower-level subproblem. Our goal is to develop efficient MOBL optimization algorithms to (1) identify a preference-guided Pareto-stationary solution with low oracle complexity; and (2) enable systematic Pareto front exploration. To this end, we propose a unifying algorithmic framework called weighted-Chebyshev multi-hyper-gradient-descent (WC-MHGD) for both deterministic and stochastic settings with finite-time Pareto-stationarity convergence rate guarantees, which not only implies low oracle complexity but also induces systematic Pareto front exploration. We further conduct extensive experiments to confirm our theoretical results.
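The weighted-Chebyshev scalarization at the core of WC-MHGD turns the multi-objective problem into a min-max form: minimize max_i w_i (f_i(x) − z_i*), where z* is an ideal point and the weights encode preferences. A toy subgradient sketch (the paper's method additionally handles the bilevel hyper-gradient, which is omitted here):

```python
import numpy as np

def weighted_chebyshev(fs, w, z_star):
    """Weighted-Chebyshev scalarization of objective values fs w.r.t. ideal point z*."""
    return np.max(w * (np.asarray(fs) - z_star))

def wc_subgradient_step(x, objs, grads, w, z_star, lr=0.1):
    """One subgradient step: follow the gradient of the currently active
    (max) weighted objective."""
    vals = w * (np.array([f(x) for f in objs]) - z_star)
    i = int(np.argmax(vals))
    return x - lr * w[i] * grads[i](x)

# Two conflicting objectives on the line: f1 = (x-1)^2, f2 = (x+1)^2, ideal point (0,0).
objs = [lambda x: (x - 1.0) ** 2, lambda x: (x + 1.0) ** 2]
grads = [lambda x: 2 * (x - 1.0), lambda x: 2 * (x + 1.0)]
w = np.array([0.5, 0.5])
x = 3.0
for _ in range(200):
    x = wc_subgradient_step(x, objs, grads, w, z_star=np.zeros(2), lr=0.05)
```

With equal weights the iterates settle near the balanced compromise x = 0; sweeping the weight vector traces out different points on the Pareto front, which is how systematic front exploration works.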
[439] MURPHY: Multi-Turn GRPO for Self Correcting Code Generation
Chanakya Ekbote, Vijay Lingam, Behrooz Omidvar-Tehrani, Jun Huan, Sujay Sanghavi, Anoop Deoras, Stefano Soatto
Main category: cs.LG
TL;DR: Murphy extends GRPO with multi-turn reflective optimization for RLVR, improving reasoning in agentic tasks through iterative self-correction using quantitative and qualitative feedback.
Details
Motivation: Existing RLVR approaches like GRPO are effective on reasoning benchmarks but struggle with agentic tasks requiring iterative decision-making.
Method: Multi-turn reflective optimization framework that extends GRPO by incorporating iterative self-correction during training using both quantitative and qualitative execution feedback.
Result: Consistent performance improvements on code generation benchmarks with Qwen and OLMo models, achieving up to 8% relative gain in pass@1 over GRPO on similar compute budgets.
Conclusion: Murphy successfully enhances reasoning capabilities for agentic tasks through iterative refinement, demonstrating significant improvements over existing RLVR approaches.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful framework for enhancing the reasoning capabilities of large language models (LLMs). However, existing approaches such as Group Relative Policy Optimization (GRPO) and its variants, while effective on reasoning benchmarks, struggle with agentic tasks that require iterative decision-making. We introduce Murphy, a multi-turn reflective optimization framework that extends GRPO by incorporating iterative self-correction during training. By leveraging both quantitative and qualitative execution feedback, Murphy enables models to progressively refine their reasoning across multiple turns. Evaluations on code generation benchmarks with model families such as Qwen and OLMo show that Murphy consistently improves performance, achieving up to an 8% relative gain in pass@1 over GRPO on similar compute budgets.
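Murphy builds on GRPO, whose group-relative advantage is simple to state: standardize rewards within each group of rollouts sampled for the same prompt. A minimal sketch of that baseline computation (the multi-turn reflection and feedback machinery is not shown):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one group of rollouts
    sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, four completions scored by a verifiable reward (e.g. tests passed).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that a group with identical rewards yields all-zero advantages and hence no gradient signal; iterative self-correction across turns is one way to inject more informative, varied feedback into each group.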
[440] DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning
Jay Chooi, Kevin Cong, Russell Li, Lillian Sun
Main category: cs.LG
TL;DR: DP-AdamW outperforms existing DP optimizers across multiple tasks, while DP-AdamW-BC with bias correction consistently decreases accuracy.
Details
Motivation: To develop differentially private optimizers that maintain strong performance while protecting sensitive data during deep learning model training.
Method: Introduced DP-AdamW and DP-AdamW-BC (with bias correction), providing theoretical privacy and convergence guarantees, then empirically tested across privacy budgets (ε=1,3,7) on text classification, image classification, and graph node classification tasks.
Result: DP-AdamW outperformed DP-SGD, DP-Adam, and DP-AdamBC by over 15% on text classification, up to 5% on image classification, and consistently 1% on graph node classification. DP-AdamW-BC with bias correction consistently decreased accuracy.
Conclusion: DP-AdamW is an effective differentially private optimizer that maintains strong performance across various tasks, but bias correction in this context reduces rather than improves accuracy.
Abstract: As deep learning methods increasingly utilize sensitive data on a widespread scale, differential privacy (DP) offers formal guarantees to protect against information leakage during model training. A significant challenge remains in implementing DP optimizers that retain strong performance while preserving privacy. Recent advances introduced ever more efficient optimizers, with AdamW being a popular choice for training deep learning models because of strong empirical performance. We study \emph{DP-AdamW} and introduce \emph{DP-AdamW-BC}, a differentially private variant of the AdamW optimizer with DP bias correction for the second moment estimator. We start by showing theoretical results for privacy and convergence guarantees of DP-AdamW and DP-AdamW-BC. Then, we empirically analyze the behavior of both optimizers across multiple privacy budgets ($ε = 1, 3, 7$). We find that DP-AdamW outperforms existing state-of-the-art differentially private optimizers like DP-SGD, DP-Adam, and DP-AdamBC, scoring over 15% higher on text classification, up to 5% higher on image classification, and consistently 1% higher on graph node classification. Moreover, we empirically show that incorporating bias correction in DP-AdamW (DP-AdamW-BC) consistently decreases accuracy, in contrast to the improvement of DP-AdamBC over DP-Adam.
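A single DP-AdamW step combines per-sample gradient clipping and Gaussian noise (as in DP-SGD) with AdamW's decoupled weight decay. The sketch below is an illustrative composition of those standard pieces, not the paper's exact algorithm, and all hyperparameters are placeholders:

```python
import numpy as np

def dp_adamw_step(w, per_sample_grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, clip=1.0, noise_mult=1.0, weight_decay=0.01, rng=None):
    """One DP-AdamW step: clip per-sample gradients, add Gaussian noise,
    then apply an AdamW update with decoupled weight decay."""
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_sample_grads]
    g_priv = (np.sum(clipped, axis=0)
              + rng.normal(0.0, noise_mult * clip, size=w.shape)) / len(per_sample_grads)
    m = beta1 * m + (1 - beta1) * g_priv
    v = beta2 * v + (1 - beta2) * g_priv ** 2
    m_hat = m / (1 - beta1 ** t)            # standard (non-DP) Adam bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = np.ones(4), np.zeros(4), np.zeros(4)
grads = [np.full(4, 3.0), np.full(4, -0.5)]   # per-sample gradients (norms 6 and 1)
w2, m, v = dp_adamw_step(w, grads, m, v, t=1)
```

Decoupled weight decay means the `weight_decay * w` term is applied outside the adaptive rescaling; the BC variant studied in the paper additionally corrects the second-moment estimate for the injected DP noise.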
[441] A General Method for Proving Networks Universal Approximation Property
Wei Wang
Main category: cs.LG
TL;DR: Proposes a unified framework using Universal Approximation Modules (UAMs) to prove universal approximation properties across diverse deep learning architectures, eliminating the need for architecture-specific proofs.
Details
Motivation: Current approaches require model-specific proofs for each new architecture, causing redundancy and lacking a common analytical foundation for understanding universal approximation across different network families.
Method: Defines Universal Approximation Modules (UAMs) as basic building blocks with universal approximation properties, and proves that any deep network composed of such modules inherently retains universal approximation capability.
Result: Provides a general framework that unifies analysis of diverse architectures and enables step-by-step understanding of how expressive power evolves through network layers.
Conclusion: The modular UAM framework offers a unified approach to proving universal approximation, reducing redundancy and providing deeper theoretical insights into how different architectures achieve expressive power.
Abstract: Deep learning architectures are highly diverse. To prove their universal approximation properties, existing works typically rely on model-specific proofs. Generally, they construct a dedicated mathematical formulation for each architecture (e.g., fully connected networks, CNNs, or Transformers) and then prove their universal approximability. However, this approach suffers from two major limitations: first, every newly proposed architecture often requires a completely new proof from scratch; second, these proofs are largely isolated from one another, lacking a common analytical foundation. This not only incurs significant redundancy but also hinders unified theoretical understanding across different network families. To address these issues, this paper proposes a general and modular framework for proving universal approximation. We define a basic building block (comprising one or multiple layers) that possesses the universal approximation property as a Universal Approximation Module (UAM). Under this condition, we show that any deep network composed of such modules inherently retains the universal approximation property. Moreover, the overall approximation process can be interpreted as a progressive refinement across modules. This perspective not only unifies the analysis of diverse architectures but also enables a step-by-step understanding of how expressive power evolves through the network.
[442] Algorithm-Relative Trajectory Valuation in Policy Gradient Control
Shihao Li, Jiachen Li, Jiamin Xu, Christopher Martin, Wei Li, Dongmei Chen
Main category: cs.LG
TL;DR: Trajectory value depends on learning algorithm: negative correlation between PE and value in vanilla REINFORCE due to variance effects, but positive correlation when stabilized.
Details
Motivation: To understand how trajectory value depends on the learning algorithm in policy-gradient control, particularly examining the role of Persistence of Excitation (PE).
Method: Used Trajectory Shapley in uncertain LQR, analyzed variance-mediated mechanisms, compared vanilla REINFORCE with stabilized versions (state whitening/Fisher preconditioning), and conducted experiments with Leave-One-Out scores.
Result: Found negative correlation (r≈-0.38) between PE and marginal value in vanilla REINFORCE due to variance effects, but positive correlation (r≈+0.29) when stabilized. Decision-aligned scores complement Shapley for pruning.
Conclusion: Trajectory value is algorithm-relative, with variance playing a key role in vanilla methods but information content dominating in stabilized versions.
Abstract: We study how trajectory value depends on the learning algorithm in policy-gradient control. Using Trajectory Shapley in an uncertain LQR, we find a negative correlation between Persistence of Excitation (PE) and marginal value under vanilla REINFORCE ($r\approx-0.38$). We prove a variance-mediated mechanism: (i) for fixed energy, higher PE yields lower gradient variance; (ii) near saddles, higher variance increases escape probability, raising marginal contribution. When stabilized (state whitening or Fisher preconditioning), this variance channel is neutralized and information content dominates, flipping the correlation positive ($r\approx+0.29$). Hence, trajectory value is algorithm-relative. Experiments validate the mechanism and show decision-aligned scores (Leave-One-Out) complement Shapley for pruning, while Shapley identifies toxic subsets.
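The Leave-One-Out scores used alongside Trajectory Shapley are the cheaper of the two valuations: each datum is worth the drop in utility when it is removed. A toy sketch with a hypothetical scalar utility (the paper's utility is control performance of the trained policy, not this stand-in):

```python
import numpy as np

def loo_scores(dataset, utility):
    """Leave-One-Out value of datum i: U(D) - U(D minus datum i)."""
    full = utility(dataset)
    return [full - utility(dataset[:i] + dataset[i + 1:]) for i in range(len(dataset))]

# Hypothetical utility: trajectories are summarized by scalar returns, the
# "trained" artifact is a constant predictor with target 0, and utility is the
# negative squared error of that constant.
def utility(traj_returns):
    if not traj_returns:
        return 0.0
    return -float(np.mean(traj_returns)) ** 2

data = [0.1, -0.1, 5.0]      # the third "trajectory" is a toxic outlier
scores = loo_scores(data, utility)
```

The outlier receives a negative score (removing it improves utility), which is exactly how decision-aligned scores flag candidates for pruning; Shapley generalizes this by averaging marginal contributions over all subsets.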
[443] Meta-cognitive Multi-scale Hierarchical Reasoning for Motor Imagery Decoding
Si-Hyun Kim, Heon-Gyu Kwak, Byoung-Hee Kwon, Seong-Whan Lee
Main category: cs.LG
TL;DR: A hierarchical meta-cognitive framework improves four-class motor imagery EEG classification by combining multi-scale signal processing with uncertainty estimation, enhancing robustness to subject variability and noise.
Details
Motivation: Practical deployment of motor imagery BCIs is limited by noise and variability in EEG signals, requiring more robust decoding methods that can handle subject heterogeneity and unreliable trials.
Method: Proposes a hierarchical framework with multi-scale signal processing module that reorganizes backbone features into temporal multi-scale representations, and an introspective uncertainty estimation module that assigns per-cycle reliability scores for iterative refinement.
Result: Across three EEG backbones (EEGNet, ShallowConvNet, DeepConvNet) on BCI Competition IV-2a dataset, the framework improves average classification accuracy and reduces inter-subject variance compared to baselines.
Conclusion: Combining hierarchical multi-scale processing with introspective confidence estimation enhances reliability of MI-based BCI systems by increasing robustness to subject heterogeneity and noisy trials.
Abstract: Brain-computer interface (BCI) aims to decode motor intent from noninvasive neural signals to enable control of external devices, but practical deployment remains limited by noise and variability in motor imagery (MI)-based electroencephalogram (EEG) signals. This work investigates a hierarchical and meta-cognitive decoding framework for four-class MI classification. We introduce a multi-scale hierarchical signal processing module that reorganizes backbone features into temporal multi-scale representations, together with an introspective uncertainty estimation module that assigns per-cycle reliability scores and guides iterative refinement. We instantiate this framework on three standard EEG backbones (EEGNet, ShallowConvNet, and DeepConvNet) and evaluate four-class MI decoding using the BCI Competition IV-2a dataset under a subject-independent setting. Across all backbones, the proposed components improve average classification accuracy and reduce inter-subject variance compared to the corresponding baselines, indicating increased robustness to subject heterogeneity and noisy trials. These results suggest that combining hierarchical multi-scale processing with introspective confidence estimation can enhance the reliability of MI-based BCI systems.
[444] A Generalized Spectral Framework to Explain Neural Scaling and Compression Dynamics
Yizhou Zhang
Main category: cs.LG
TL;DR: A generalized spectral framework unifies learning dynamics and compression phenomena using a polynomial spectral evolution function, recovering existing theories as special cases.
Details
Motivation: To reconcile apparently distinct scaling behaviors in model learning and compression by developing a unified theoretical framework.
Method: Develops a generalized spectral framework with asymptotically polynomial spectral evolution function g(λ,t;β) characterized by spectral-temporal elasticity ρ(β).
Result: The framework recovers lazy and feature-learning theories as special cases and establishes invariant relations between learning and compression.
Conclusion: The generalized spectral approach provides a unified understanding of scaling laws across different learning regimes and compression phenomena.
Abstract: Empirical scaling laws describe how test loss and other performance metrics depend on model size, dataset size, and compute. While such laws are consistent within specific regimes, apparently distinct scaling behaviors have been reported for related settings such as model compression. Motivated by recent progress in spectral analyses of neural representations, this paper develops a \emph{generalized spectral framework} that unifies learning dynamics and compression phenomena under a common functional ansatz. We generalize the spectral evolution function from the linear kernel form $g(λt)=λt$ to an asymptotically polynomial function $g(λ,t;β)$, characterized by an effective spectral–temporal elasticity $ρ(β)$. This framework recovers existing lazy and feature-learning theories as special cases and yields an invariant relation between learning and compression.
[445] Statistically Assuring Safety of Control Systems using Ensembles of Safety Filters and Conformal Prediction
Ihab Tabbara, Yuxuan Yang, Hussein Sibai
Main category: cs.LG
TL;DR: The paper introduces a conformal prediction framework to provide probabilistic safety guarantees for learned Hamilton-Jacobi reachability value functions and policies in autonomous systems.
Details
Motivation: Hamilton-Jacobi reachability analysis is computationally expensive for high-dimensional systems, motivating the use of reinforcement learning to approximate value functions, but learned functions may not be correct and lack safety guarantees.
Method: A conformal prediction-based framework that calibrates switching between unsafe nominal controllers and learned HJ-based safe policies, and investigates using ensembles of independently trained HJ value functions as safety filters.
Result: The approach provides probabilistic safety guarantees when using learned HJ value functions and policies to prevent control systems from reaching failure states.
Conclusion: Conformal prediction enables bounding uncertainty in learned safety functions and provides a practical framework for safety assurance in learning-enabled autonomous systems.
Abstract: Safety assurance is a fundamental requirement for deploying learning-enabled autonomous systems. Hamilton-Jacobi (HJ) reachability analysis is a fundamental method for formally verifying safety and generating safe controllers. However, computing the HJ value function that characterizes the backward reachable set (BRS) of a set of user-defined failure states is computationally expensive, especially for high-dimensional systems, motivating the use of reinforcement learning approaches to approximate the value function. Unfortunately, a learned value function and its corresponding safe policy are not guaranteed to be correct. The learned value function evaluated at a given state may not be equal to the actual safety return achieved by following the learned safe policy. To address this challenge, we introduce a conformal prediction-based (CP) framework that bounds such uncertainty. We leverage CP to provide probabilistic safety guarantees when using learned HJ value functions and policies to prevent control systems from reaching failure states. Specifically, we use CP to calibrate the switching between the unsafe nominal controller and the learned HJ-based safe policy and to derive safety guarantees under this switched policy. We also investigate using an ensemble of independently trained HJ value functions as a safety filter and compare this ensemble approach to using individual value functions alone.
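The split-conformal calibration step at the heart of such guarantees is short: compute a finite-sample-corrected quantile of held-out nonconformity scores and use it as a threshold. A generic sketch (the scores here are synthetic placeholders for the gap between the learned HJ value at a state and the safety return actually achieved by the learned policy):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: under exchangeability, a fresh score falls
    below this threshold with probability at least 1 - alpha."""
    n = len(cal_scores)
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample correction
    return float(np.quantile(np.asarray(cal_scores), q, method="higher"))

# Hypothetical calibration scores standing in for HJ value-function errors.
rng = np.random.default_rng(1)
cal = rng.exponential(scale=0.1, size=200)
tau = conformal_threshold(cal, alpha=0.1)
coverage = float(np.mean(rng.exponential(scale=0.1, size=5000) <= tau))
```

The threshold `tau` can then conservatively inflate the learned value function when deciding whether to switch from the nominal controller to the safe policy, giving the probabilistic guarantee without trusting the network's raw output.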
[446] Test-driven Reinforcement Learning
Zhao Yu, Xiuping Wu, Liangjun Ke
Main category: cs.LG
TL;DR: Proposes Test-driven Reinforcement Learning (TdRL) framework using multiple test functions instead of single reward functions to address reward design challenges in RL, with experimental validation on DeepMind Control Suite.
Details
Motivation: Traditional RL uses reward functions that serve dual purposes (defining optimal goal and guiding learning), making manual design challenging and often resulting in suboptimal task representation.
Method: Uses pass-fail tests and indicative tests to separate task definition from learning guidance. Introduces lexicographic heuristic to compare trajectory distances to optimal set, and develops algorithm implementation with maximum entropy policy optimization.
Result: Experimental results show TdRL matches or outperforms handcrafted reward methods in policy training on DeepMind Control Suite, with greater design simplicity and inherent multi-objective optimization support.
Conclusion: TdRL provides a novel perspective for representing task objectives that helps address reward design challenges in RL applications.
Abstract: Reinforcement learning (RL) has been recognized as a powerful tool for robot control tasks. RL typically employs reward functions to define task objectives and guide agent learning. However, since the reward function serves the dual purpose of defining the optimal goal and guiding learning, it is challenging to design the reward function manually, which often results in a suboptimal task representation. To tackle the reward design challenge in RL, inspired by the satisficing theory, we propose a Test-driven Reinforcement Learning (TdRL) framework. In the TdRL framework, multiple test functions are used to represent the task objective rather than a single reward function. Test functions can be categorized as pass-fail tests and indicative tests, each dedicated to defining the optimal objective and guiding the learning process, respectively, thereby making defining tasks easier. Building upon such a task definition, we first prove that if a trajectory return function assigns higher returns to trajectories closer to the optimal trajectory set, maximum entropy policy optimization based on this return function will yield a policy that is closer to the optimal policy set. Then, we introduce a lexicographic heuristic approach to compare the relative distance relationship between trajectories and the optimal trajectory set for learning the trajectory return function. Furthermore, we develop an algorithm implementation of TdRL. Experimental results on the DeepMind Control Suite benchmark demonstrate that TdRL matches or outperforms handcrafted reward methods in policy training, with greater design simplicity and inherent support for multi-objective optimization. We argue that TdRL offers a novel perspective for representing task objectives, which could be helpful in addressing the reward design challenges in RL applications.
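The two test categories can be illustrated by scoring trajectories as tuples compared lexicographically: pass-fail tests (task definition) dominate, and indicative tests (learning guidance) break ties. This is a toy sketch, not the paper's trajectory-return learning procedure; the task and tests are invented for illustration:

```python
def evaluate_trajectory(traj, pass_fail_tests, indicative_tests):
    """Score a trajectory as a tuple compared lexicographically:
    pass-fail tests (task definition) dominate indicative tests (guidance)."""
    passed = sum(1 for t in pass_fail_tests if t(traj))
    return (passed,) + tuple(t(traj) for t in indicative_tests)

# Toy 1-D reaching task: a trajectory is a list of positions, the goal is x = 1.
pass_fail = [lambda tr: abs(tr[-1] - 1.0) < 0.05]          # did it reach the goal?
indicative = [lambda tr: -min(abs(x - 1.0) for x in tr)]   # closest approach

good = [0.0, 0.5, 0.99]
bad = [0.0, 0.2, 0.4]
better = max([good, bad],
             key=lambda tr: evaluate_trajectory(tr, pass_fail, indicative))
```

Separating the two kinds of tests means the designer can state what success is (pass-fail) independently of how to shape progress toward it (indicative), which is the reward-design burden TdRL aims to reduce.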
[447] CellARC: Measuring Intelligence with Cellular Automata
Miroslav Lžičař
Main category: cs.LG
TL;DR: CellARC is a synthetic benchmark for abstraction and reasoning based on 1D cellular automata, featuring controllable difficulty parameters and enabling rapid testing of small models under tight computational budgets.
Details
Motivation: To create a reproducible benchmark that decouples generalization from anthropomorphic priors, supports unlimited difficulty-controlled sampling, and enables studies of how quickly models infer new rules with limited resources.
Method: Built from multicolor 1D cellular automata with explicit control over alphabet size, radius, rule family, Langton’s lambda, query coverage, and cell entropy. Each episode has five support pairs and one query serialized in 256 tokens.
Result: A 10M-parameter vanilla transformer outperforms recent recursive models (58.0%/32.4% accuracy on interpolation/extrapolation), GPT-5 High achieves 62.3%/48.1%, and an ensemble reaches 65.4%/35.5%, showing neuro-symbolic complementarity.
Conclusion: CellARC provides a scalable, controllable benchmark for studying abstraction and reasoning, demonstrating that small transformers can outperform specialized recursive architectures and highlighting the value of neuro-symbolic approaches.
Abstract: We introduce CellARC, a synthetic benchmark for abstraction and reasoning built from multicolor 1D cellular automata (CA). Each episode has five support pairs and one query serialized in 256 tokens, enabling rapid iteration with small models while exposing a controllable task space with explicit knobs for alphabet size k, radius r, rule family, Langton’s lambda, query coverage, and cell entropy. We release 95k training episodes plus two 1k test splits (interpolation/extrapolation) and evaluate symbolic, recurrent, convolutional, transformer, recursive, and LLM baselines. CellARC decouples generalization from anthropomorphic priors, supports unlimited difficulty-controlled sampling, and enables reproducible studies of how quickly models infer new rules under tight budgets. Our strongest small-model baseline (a 10M-parameter vanilla transformer) outperforms recent recursive models (TRM, HRM), reaching 58.0%/32.4% per-token accuracy on the interpolation/extrapolation splits, while a large closed model (GPT-5 High) attains 62.3%/48.1% on subsets of 100 test tasks. An ensemble that chooses per episode between the Transformer and the best symbolic baseline reaches 65.4%/35.5%, highlighting neuro-symbolic complementarity. Leaderboard: https://cellarc.mireklzicar.com
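The episode construction described above can be sketched in a few lines of Python: sample a random rule table over alphabet size k and neighborhood radius r, then apply it for one step to produce support pairs and a query. This is an illustrative reconstruction under assumed single-step, periodic-boundary dynamics, not the benchmark's actual generator; all function names here are hypothetical.

```python
import itertools
import random

def make_rule(k, r, seed=0):
    """Sample a random rule table: every (2r+1)-neighborhood maps to a symbol."""
    rng = random.Random(seed)
    return {nb: rng.randrange(k)
            for nb in itertools.product(range(k), repeat=2 * r + 1)}

def step(state, rule, r):
    """Apply the rule once with periodic boundary conditions."""
    n = len(state)
    return [rule[tuple(state[(i + d) % n] for d in range(-r, r + 1))]
            for i in range(n)]

def make_episode(k=3, r=1, length=16, n_support=5, seed=42):
    """One episode: five support (input, output) pairs plus one query pair."""
    rng = random.Random(seed)
    rule = make_rule(k, r, seed)
    pairs = []
    for _ in range(n_support + 1):
        x = [rng.randrange(k) for _ in range(length)]
        pairs.append((x, step(x, rule, r)))
    return pairs[:n_support], pairs[-1]

support, query = make_episode()
```

The knobs the paper names (k, r, rule family, Langton’s lambda, coverage, entropy) would constrain how `make_rule` and the initial states are sampled.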
[448] Rectified Noise: A Generative Model Using Positive-incentive Noise
Zhenyu Gu, Yanchen Xu, Sida Huang, Yubin Guo, Hongyuan Zhang
Main category: cs.LG
TL;DR: Rectified Noise (ΔRN) improves generative performance by injecting positive-incentive noise into pre-trained Rectified Flow models, achieving better results with minimal additional parameters.
Details
Motivation: Recent studies show that injecting noise through reverse-time SDEs can enhance generative performance in Rectified Flow models, but existing methods need improvement in efficiency and effectiveness.
Method: Propose the Rectified Noise (ΔRN) algorithm, which injects positive-incentive noise into the velocity field of pre-trained RF models, transforming them into π-noise generators with minimal additional training.
Result: RF models using Rectified Noise reduce FID from 10.16 to 9.05 on ImageNet-1k, and π-noise generators achieve improved performance with only 0.39% additional training parameters across various architectures and datasets.
Conclusion: Rectified Noise provides an efficient and effective way to enhance pre-trained RF models by injecting π-noise, significantly improving generative performance with minimal computational overhead.
Abstract: Rectified Flow (RF) has been widely used as an effective generative model. Although RF is primarily based on probability flow Ordinary Differential Equations (ODE), recent studies have shown that injecting noise through reverse-time Stochastic Differential Equations (SDE) for sampling can achieve superior generative performance. Inspired by Positive-incentive Noise ($π$-noise), we propose an innovative generative algorithm to train $π$-noise generators, namely Rectified Noise ($Δ$RN), which improves the generative performance by injecting $π$-noise into the velocity field of pre-trained RF models. After introducing the Rectified Noise pipeline, pre-trained RF models can be efficiently transformed into $π$-noise generators. We validate Rectified Noise by conducting extensive experiments across various model architectures on different datasets. Notably, we find that: (1) RF models using Rectified Noise reduce FID from \textbf{10.16 to 9.05} on ImageNet-1k. (2) The models of $π$-noise generators achieve improved performance with only \textbf{0.39%} additional training parameters.
[449] Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison
Yoonho Lee, Joseph Boen, Chelsea Finn
Main category: cs.LG
TL;DR: Feedback Descent optimizes text artifacts using structured textual feedback instead of scalar rewards, enabling directed optimization in text space without modifying model weights.
Details
Motivation: To overcome the information bottleneck in preference learning by preserving detailed critiques rather than compressing them to binary preferences, allowing for more effective optimization of text artifacts.
Method: Uses in-context learning to transform structured feedback into gradient-like directional information for targeted edits, with evaluators providing textual feedback paired with comparisons as high-bandwidth supervision.
Result: Outperforms state-of-the-art methods across three domains, including identifying novel drug-like molecules surpassing the 99.9th percentile of a 260,000+ compound database across six protein targets.
Conclusion: Feedback Descent provides an effective framework for optimizing text artifacts through structured feedback, demonstrating superior performance over existing methods while being task-agnostic and operating purely at inference time.
Abstract: We introduce \textit{Feedback Descent}, a framework that optimizes text artifacts – prompts, code, and molecules – through structured textual feedback, rather than relying solely on scalar rewards. By preserving detailed critiques instead of compressing them to binary preferences, Feedback Descent widens the information bottleneck in preference learning, enabling directed optimization in text space rather than weight space. We show that in-context learning can transform structured feedback into gradient-like directional information, enabling targeted edits. Unlike prior approaches that collapse judgments into single bits, our evaluators pair each comparison with textual feedback, which functions as high-bandwidth supervision. The iteration loop is done purely at inference time, without modifying any model weights, and is task-agnostic. We evaluate Feedback Descent on three diverse domains and find that it outperforms state-of-the-art prompt optimization (GEPA), reinforcement learning methods (GRPO, REINVENT), and even specialized graph-based molecular optimizers. In the DOCKSTRING molecule discovery benchmark, Feedback Descent identifies novel drug-like molecules surpassing the $99.9$th percentile of a database with more than $260{,}000$ compounds across six protein targets.
[450] SERL: Self-Examining Reinforcement Learning on Open-Domain
Weixuan Ou, Yanzhao Zheng, Shuoshuo Sun, Wei Zhang, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Pengwei Yan, Yifan Qiao
Main category: cs.LG
TL;DR: SERL is a self-improving RL framework where LLMs act as both Actor and Judge, using internal pairwise comparisons and self-consistency rewards without external signals, achieving state-of-the-art performance on open-domain tasks.
Details
Motivation: To overcome limitations of RLVR (requires verifiable rewards) and RLHF (relies on external rewards) for open-domain tasks where rewards are subjective and external mechanisms are unavailable.
Method: Proposes Self-Examining Reinforcement Learning (SERL) with two internal reward mechanisms: Copeland-style pairwise comparison judgments for Actor improvement and self-consistency rewards for Judge reliability enhancement.
Result: Improves Qwen3-8B LC win rate on AlpacaEval 2 from 52.37% to 59.90%, achieving state-of-the-art performance among self-improving approaches and comparable performance to much larger models like Qwen3-32B.
Conclusion: SERL demonstrates superior effectiveness and robustness for open-domain tasks by enabling self-improvement without external reward signals, bridging the gap between smaller and larger model performance.
Abstract: Reinforcement Learning (RL) has been shown to improve the capabilities of large language models (LLMs). However, applying RL to open-domain tasks faces two key challenges: (1) the inherent subjectivity of these tasks prevents the verifiable rewards as required by Reinforcement Learning with Verifiable Rewards (RLVR); (2) Reinforcement Learning from Human Feedback (RLHF) relies on external reward mechanisms. To overcome these limitations, we propose Self-Examining Reinforcement Learning (SERL), a novel self-improving framework where the LLM serves as both Actor and Judge. SERL introduces two synergistic reward mechanisms without any external signals. On the one hand, to improve the Actor’s capability, we derive rewards from Copeland-style pairwise comparison judgments across a group of generated responses. On the other hand, a self-consistency reward that encourages coherent judgments is proposed to improve the Judge’s reliability. This process refines the Judge’s capability, which in turn provides a more robust reward for Actor. Experiments show that our method outperforms existing self-improvement training methods. SERL improves the LC win rate of Qwen3-8B on AlpacaEval 2 from 52.37% to 59.90%. To the best of our knowledge, our method achieves state-of-the-art performance among self-improving approaches. Furthermore, it achieves a performance comparable to significantly larger models like Qwen3-32B, demonstrating superior effectiveness and robustness on open-domain tasks.
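The Copeland-style scoring that SERL derives Actor rewards from can be illustrated with a small sketch: within a group of generated responses, each response earns a point for every pairwise comparison it wins. The judge interface and the toy length-based judge below are assumptions for illustration, not the paper's actual Judge model.

```python
from itertools import combinations

def copeland_scores(candidates, judge):
    """Copeland scoring: judge(a, b) returns 1 if a is preferred,
    -1 if b is preferred, 0 for a tie (assumed interface)."""
    scores = {c: 0.0 for c in candidates}
    for a, b in combinations(candidates, 2):
        v = judge(a, b)
        if v > 0:
            scores[a] += 1.0
        elif v < 0:
            scores[b] += 1.0
        else:  # split the point on a tie
            scores[a] += 0.5
            scores[b] += 0.5
    return scores

# Toy judge that simply prefers the longer response.
responses = ["ok", "a fuller answer", "the most detailed answer of all"]
judge = lambda a, b: (len(a) > len(b)) - (len(a) < len(b))
scores = copeland_scores(responses, judge)
```

In SERL these per-response scores would serve as group-relative rewards for the Actor, while the Judge's own consistency across comparisons supplies its reward.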
[451] IBMA: An Imputation-Based Mixup Augmentation Using Self-Supervised Learning for Time Series Data
Dang Nha Nguyen, Hai Dang Nguyen, Khoa Tho Anh Nguyen
Main category: cs.LG
TL;DR: Proposes IBMA, a novel data augmentation method combining imputation and Mixup for time series forecasting, showing consistent performance improvements across multiple models and datasets.
Details
Motivation: Time series forecasting has limited augmentation strategies compared to other fields, with advanced techniques like Mixup rarely used despite their potential benefits.
Method: Imputation-Based Mixup Augmentation (IBMA) combines imputation-augmented data with Mixup augmentation to enhance model generalization.
Result: IBMA achieved improvements in 22 of 24 instances across four datasets, including 10 best performances, and was particularly effective with iTransformer-based imputation.
Conclusion: IBMA is an effective data augmentation method that consistently enhances forecasting performance across various models and datasets.
Abstract: Data augmentation in time series forecasting plays a crucial role in enhancing model performance by introducing variability while maintaining the underlying temporal patterns. However, time series data offers fewer augmentation strategies compared to fields such as image or text, with advanced techniques like Mixup rarely being used. In this work, we propose a novel approach, Imputation-Based Mixup Augmentation (IBMA), which combines Imputation-Augmented data with Mixup augmentation to bolster model generalization and improve forecasting performance. We evaluate the effectiveness of this method across several forecasting models, including DLinear (MLP), TimesNet (CNN), and iTransformer (Transformer), which represent some of the most recent advances in time series forecasting. Our experiments, conducted on four datasets (ETTh1, ETTh2, ETTm1, ETTm2) and compared against eight other augmentation techniques, demonstrate that IBMA consistently enhances performance, achieving improvements in 22 out of 24 instances, with 10 of those being the best performances, particularly with iTransformer imputation.
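The two ingredients IBMA combines can be sketched as follows: fill gaps in a window (here with a naive forward-fill stand-in for a learned imputer) and then form a convex combination of two windows with a Beta-sampled mixing weight, as in standard Mixup. This is a minimal sketch of the general recipe, not the paper's implementation.

```python
import random

def forward_fill(x):
    """Naive imputation stand-in: carry the last observed value forward."""
    out, last = [], 0.0
    for v in x:
        last = v if v is not None else last
        out.append(last)
    return out

def mixup(x1, x2, alpha=0.2, rng=None):
    """Standard Mixup on aligned windows: lam ~ Beta(alpha, alpha),
    output = lam * x1 + (1 - lam) * x2."""
    rng = rng or random.Random(0)
    lam = rng.betavariate(alpha, alpha)
    return [lam * a + (1 - lam) * b for a, b in zip(x1, x2)], lam

raw = [1.0, None, 3.0, None, 5.0]
imputed = forward_fill(raw)
mixed, lam = mixup(imputed, [0.0] * 5, alpha=0.2)
```

A forecasting model would then train on the mixed windows (and correspondingly mixed targets) alongside the original data.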
[452] Predict-then-Optimize Method for Seaport Power-Logistics Scheduling: Generalization across Varying Tasks Stream
Chuanqing Pu, Feilong Fan, Nengling Tai, Yan Xu, Wentao Huang, Honglin Wen
Main category: cs.LG
TL;DR: A decision-focused continual learning framework for power-logistics scheduling that adapts to evolving task structures from varying vessel arrivals, using Fisher information regularization to preserve knowledge from prior tasks.
Details
Motivation: Traditional predict-then-optimize pipelines assume fixed task configurations and generalize poorly to evolving scheduling tasks caused by varying seaport vessel arrivals.
Method: Proposes decision-focused continual learning with Fisher information based regularization to preserve critical parameters from prior tasks, and develops a differentiable convex surrogate for stable gradient backpropagation.
Result: Experiments at Jurong Port show superior decision performance and generalization over existing methods with reduced computational cost.
Conclusion: The framework successfully learns decision-aligned forecasting models for new scheduling tasks while maintaining generalization on earlier tasks.
Abstract: Power-logistics scheduling in modern seaports typically follows a predict-then-optimize pipeline. To enhance decision quality, decision-focused learning has been proposed to align forecasting and optimization via end-to-end training. However, most formulations assume a fixed task configuration in downstream optimization, and thus generalize poorly to evolving task structures induced by varying seaport vessel arrivals. We address this gap with a decision-focused continual learning framework that adapts online to a stream of scheduling tasks. Specifically, we introduce Fisher information based regularization to enhance cross-task generalization by preserving parameters critical to prior tasks. A differentiable convex surrogate is also developed to stabilize gradient backpropagation. The proposed approach enables learning a decision-aligned forecasting model for new scheduling tasks while retaining generalization on earlier tasks. Experiments calibrated to Jurong Port demonstrate superior decision performance and generalization over existing methods with reduced computational cost.
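Fisher information based regularization of this kind is commonly written as a quadratic penalty anchored at the previous task's parameters, with per-parameter Fisher weights deciding how strongly each parameter is held in place (the EWC-style form; whether the paper uses exactly this form is an assumption). A minimal sketch:

```python
def fisher_penalty(params, anchor, fisher, lam=1.0):
    """EWC-style penalty: (lam / 2) * sum_i F_i * (theta_i - theta_anchor_i)^2.
    High-Fisher parameters are pulled back toward their anchors;
    zero-Fisher parameters are free to move for the new task."""
    return 0.5 * lam * sum(f * (p - a) ** 2
                           for p, a, f in zip(params, anchor, fisher))

def regularized_loss(task_loss, params, anchor, fisher, lam=1.0):
    """New-task loss plus the drift penalty protecting prior-task knowledge."""
    return task_loss + fisher_penalty(params, anchor, fisher, lam)

# First parameter is critical (F=1), second is not (F=0).
penalty = fisher_penalty([1.0, 2.0], [0.0, 0.0], [1.0, 0.0], lam=2.0)
```

In the continual setting, `anchor` and `fisher` would be updated after each scheduling task in the stream.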
[453] Balance Equation-based Distributionally Robust Offline Imitation Learning
Rishabh Agrawal, Yusuf Alvi, Rahul Jain, Ashutosh Nayyar
Main category: cs.LG
TL;DR: A robust offline imitation learning framework that handles environment dynamics shifts using distributionally robust optimization, learning from expert demonstrations without additional environment interaction.
Details
Motivation: Standard imitation learning assumes fixed environment dynamics between training and deployment, but real-world factors like modeling errors and parameter variations cause dynamics shifts that degrade performance.
Method: Balance Equation-based Distributionally Robust Offline Imitation Learning formulates the problem as distributionally robust optimization over transition model uncertainty sets, reformulating the robust objective using nominal data distribution for tractable offline learning.
Result: Empirical evaluations on continuous-control benchmarks show superior robustness and generalization compared to state-of-the-art offline IL baselines, especially under perturbed or shifted environments.
Conclusion: The proposed framework effectively addresses dynamics shifts in imitation learning through distributionally robust optimization, enabling robust policy learning from expert demonstrations without requiring additional environment interaction.
Abstract: Imitation Learning (IL) has proven highly effective for robotic and control tasks where manually designing reward functions or explicit controllers is infeasible. However, standard IL methods implicitly assume that the environment dynamics remain fixed between training and deployment. In practice, this assumption rarely holds where modeling inaccuracies, real-world parameter variations, and adversarial perturbations can all induce shifts in transition dynamics, leading to severe performance degradation. We address this challenge through Balance Equation-based Distributionally Robust Offline Imitation Learning, a framework that learns robust policies solely from expert demonstrations collected under nominal dynamics, without requiring further environment interaction. We formulate the problem as a distributionally robust optimization over an uncertainty set of transition models, seeking a policy that minimizes the imitation loss under the worst-case transition distribution. Importantly, we show that this robust objective can be reformulated entirely in terms of the nominal data distribution, enabling tractable offline learning. Empirical evaluations on continuous-control benchmarks demonstrate that our approach achieves superior robustness and generalization compared to state-of-the-art offline IL baselines, particularly under perturbed or shifted environments.
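The minimax structure of the robust objective is easy to illustrate with a finite uncertainty set: score each policy by its worst-case loss over the candidate transition models, then pick the policy minimizing that worst case. The toy loss table below is invented for illustration; the paper works with a continuous uncertainty set and a nominal-distribution reformulation rather than this enumeration.

```python
def worst_case(loss_of_model, models):
    """Worst-case loss over a finite uncertainty set of transition models."""
    return max(loss_of_model(m) for m in models)

def robust_choice(policies, models, loss_fn):
    """Minimax selection: the policy whose worst-case loss is smallest."""
    return min(policies,
               key=lambda pi: worst_case(lambda m: loss_fn(pi, m), models))

# Policy 'a' excels under the nominal model 0 but is fragile under the
# perturbed model 1; policy 'b' is uniformly mediocre but robust.
table = {('a', 0): 1.0, ('a', 1): 5.0, ('b', 0): 3.0, ('b', 1): 3.0}
pick = robust_choice(['a', 'b'], [0, 1], lambda pi, m: table[(pi, m)])
```

The nominal-best policy 'a' loses to the robust pick 'b' once the shifted model enters the set, which is exactly the trade-off distributionally robust IL formalizes.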
[454] Continual Unlearning for Text-to-Image Diffusion Models: A Regularization Perspective
Justin Lee, Zheda Mai, Jinsu Yoo, Chongyu Fan, Cheng Zhang, Wei-Lun Chao
Main category: cs.LG
TL;DR: First systematic study of continual unlearning in text-to-image diffusion models, showing existing methods fail due to cumulative parameter drift, and proposing regularizers including gradient-projection to preserve retained knowledge.
Details
Motivation: Existing unlearning methods assume all requests arrive at once, but in practice they arrive sequentially. Current approaches suffer from rapid utility collapse after only a few requests, forgetting retained knowledge and generating degraded images.
Method: Study add-on regularizers to mitigate parameter drift, including a gradient-projection method that constrains drift orthogonal to semantic subspaces. These remain compatible with existing unlearning methods.
Result: Proposed regularizers substantially improve continual unlearning performance, with gradient-projection being particularly effective for preserving concepts close to unlearning targets. Methods are complementary for further gains.
Conclusion: Establishes continual unlearning as a fundamental challenge in text-to-image generation and provides insights, baselines, and open directions for advancing safe and accountable generative AI.
Abstract: Machine unlearning–the ability to remove designated concepts from a pre-trained model–has advanced rapidly, particularly for text-to-image diffusion models. However, existing methods typically assume that unlearning requests arrive all at once, whereas in practice they often arrive sequentially. We present the first systematic study of continual unlearning in text-to-image diffusion models and show that popular unlearning methods suffer from rapid utility collapse: after only a few requests, models forget retained knowledge and generate degraded images. We trace this failure to cumulative parameter drift from the pre-training weights and argue that regularization is crucial to addressing it. To this end, we study a suite of add-on regularizers that (1) mitigate drift and (2) remain compatible with existing unlearning methods. Beyond generic regularizers, we show that semantic awareness is essential for preserving concepts close to the unlearning target, and propose a gradient-projection method that constrains parameter drift orthogonal to their subspace. This substantially improves continual unlearning performance and is complementary to other regularizers for further gains. Taken together, our study establishes continual unlearning as a fundamental challenge in text-to-image generation and provides insights, baselines, and open directions for advancing safe and accountable generative AI.
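The gradient-projection idea can be sketched concretely: given an (assumed orthonormal) basis spanning the semantic subspace of concepts to preserve, subtract from each unlearning gradient its components along that basis, so parameter drift stays orthogonal to the protected directions. A minimal vector-space sketch, not the paper's implementation:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(grad, basis):
    """Remove from grad its components along an orthonormal basis,
    keeping only the part orthogonal to the protected subspace."""
    g = list(grad)
    for u in basis:
        c = dot(g, u)
        g = [gi - c * ui for gi, ui in zip(g, u)]
    return g

g = [1.0, 2.0, 3.0]
basis = [[1.0, 0.0, 0.0]]  # protected semantic direction (assumed orthonormal)
g_perp = project_out(g, basis)
```

An update along `g_perp` leaves the protected direction untouched, which is the mechanism that preserves concepts close to the unlearning target.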
[455] Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning
Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Main category: cs.LG
TL;DR: LOREN is a curvature-aware zeroth-order optimization method for fine-tuning LLMs that improves gradient estimation through adaptive perturbation distributions, low-rank preconditioning, and variance reduction techniques.
Details
Motivation: Existing zeroth-order methods suffer from high variance and suboptimal search directions when estimating gradients via random perturbations, leading to inefficient fine-tuning of large language models.
Method: LOREN reformulates gradient preconditioning as adaptive anisotropic perturbation distribution estimation, uses natural evolution strategies for low-rank block diagonal preconditioning to capture curvature, and applies REINFORCE leave-one-out estimator for variance reduction.
Result: LOREN outperforms state-of-the-art ZO methods with higher accuracy and faster convergence, while reducing peak memory usage by up to 27.3% compared to MeZO-Adam on standard LLM benchmarks.
Conclusion: LOREN provides an effective curvature-aware zeroth-order optimization approach that addresses key limitations of existing methods, enabling more efficient and memory-friendly fine-tuning of large language models.
Abstract: We introduce LOREN, a curvature-aware zeroth-order (ZO) optimization method for fine-tuning large language models (LLMs). Existing ZO methods, which estimate gradients via finite differences using random perturbations, often suffer from high variance and suboptimal search directions. Our approach addresses these challenges by: (i) reformulating the problem of gradient preconditioning as that of adaptively estimating an anisotropic perturbation distribution for gradient estimation, (ii) capturing curvature through a low-rank block diagonal preconditioner using the framework of natural evolution strategies, and (iii) applying a REINFORCE leave-one-out (RLOO) gradient estimator to reduce variance. Experiments on standard LLM benchmarks show that our method outperforms state-of-the-art ZO methods by achieving higher accuracy and faster convergence, while cutting peak memory usage by up to 27.3% compared with MeZO-Adam.
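The baseline these methods build on is the two-point zeroth-order gradient estimate: perturb the parameters along a random direction, difference the two loss values, and scale the direction accordingly. The sketch below uses a per-coordinate scale vector as a crude stand-in for LOREN's learned anisotropic perturbation distribution; the low-rank preconditioner and the RLOO variance reduction are not reproduced here.

```python
import random

def zo_gradient(loss, theta, scale, eps=1e-3, rng=None):
    """Two-point zeroth-order estimate:
    g = [loss(theta + eps*z) - loss(theta - eps*z)] / (2*eps) * z,
    with z ~ N(0, diag(scale^2)) as an anisotropic perturbation."""
    rng = rng or random.Random(0)
    z = [rng.gauss(0.0, 1.0) * s for s in scale]
    up = [t + eps * zi for t, zi in zip(theta, z)]
    dn = [t - eps * zi for t, zi in zip(theta, z)]
    g = (loss(up) - loss(dn)) / (2 * eps)
    return [g * zi for zi in z]

quad = lambda w: sum(wi * wi for wi in w)  # true gradient is 2*w
est = zo_gradient(quad, [1.0, -2.0], scale=[1.0, 1.0])
```

For a quadratic loss the estimate is an unbiased rank-one sketch of the true gradient; shaping `scale` per coordinate is the entry point for curvature-aware preconditioning.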
[456] Generalizable Insights for Graph Transformers in Theory and Practice
Timo Stoll, Luis Müller, Christopher Morris
Main category: cs.LG
TL;DR: The paper proposes Generalized-Distance Transformer (GDT), a graph transformer architecture that unifies various design choices and provides comprehensive theoretical and empirical analysis of attention mechanisms and positional embeddings across diverse applications.
Details
Motivation: Current graph transformer architectures vary widely in design choices, existing expressivity results are tied to specific implementations, and there's a gap between theory and practice preventing generalizable insights across domains.
Method: Proposed GDT architecture using standard attention that incorporates recent GT advancements, conducted fine-grained analysis of representation power, and performed extensive experiments across 8M+ graphs with 270M tokens covering diverse domains.
Result: Identified design choices that consistently perform well across applications, tasks, and model scales, demonstrated strong few-shot transfer performance without fine-tuning, and achieved comprehensive evaluation across multiple domains.
Conclusion: Distilled theoretical and practical findings into generalizable insights about effective graph transformer design, training, and inference that apply across diverse application domains.
Abstract: Graph Transformers (GTs) have shown strong empirical performance, yet current architectures vary widely in their use of attention mechanisms, positional embeddings (PEs), and expressivity. Existing expressivity results are often tied to specific design choices and lack comprehensive empirical validation on large-scale data. This leaves a gap between theory and practice, preventing generalizable insights that exceed particular application domains. Here, we propose the Generalized-Distance Transformer (GDT), a GT architecture using standard attention that incorporates many advancements for GTs from recent years, and develop a fine-grained understanding of the GDT’s representation power in terms of attention and PEs. Through extensive experiments, we identify design choices that consistently perform well across various applications, tasks, and model scales, demonstrating strong performance in a few-shot transfer setting without fine-tuning. Our evaluation covers over eight million graphs with roughly 270M tokens across diverse domains, including image-based object detection, molecular property prediction, code summarization, and out-of-distribution algorithmic reasoning. We distill our theoretical and practical findings into several generalizable insights about effective GT design, training, and inference.
[457] From Sequential to Recursive: Enhancing Decision-Focused Learning with Bidirectional Feedback
Xinyu Wang, Jinxiao Du, Yiyang Peng, Wei Ma
Main category: cs.LG
TL;DR: Proposes recursive decision-focused learning (R-DFL) with bidirectional feedback between prediction and optimization, overcoming limitations of sequential DFL frameworks through explicit unrolling and implicit differentiation methods.
Details
Motivation: Existing sequential DFL frameworks fail to capture bidirectional feedback between prediction and optimization in complex interaction scenarios, limiting their effectiveness in closed-loop decision-making problems.
Method: Introduces R-DFL framework with bidirectional feedback, implements two differentiation methods: explicit unrolling via automatic differentiation and implicit differentiation based on fixed-point methods for efficient gradient propagation.
Result: R-DFL substantially enhances final decision quality over sequential baselines and exhibits robust adaptability across diverse scenarios in both synthetic and real-world datasets (newsvendor problem, bipartite matching).
Conclusion: R-DFL provides a more effective framework for closed-loop decision-making problems by enabling bidirectional feedback between optimization and prediction, with both differentiation methods achieving comparable gradient accuracy while implicit method offers superior computational efficiency.
Abstract: Decision-focused learning (DFL) has emerged as a powerful end-to-end alternative to conventional predict-then-optimize (PTO) pipelines by directly optimizing predictive models through downstream decision losses. Existing DFL frameworks are limited by their strictly sequential structure, referred to as sequential DFL (S-DFL). However, S-DFL fails to capture the bidirectional feedback between prediction and optimization in complex interaction scenarios. In view of this, we propose, for the first time, recursive decision-focused learning (R-DFL), a novel framework that introduces bidirectional feedback between downstream optimization and upstream prediction. We further extend two distinct differentiation methods: explicit unrolling via automatic differentiation and implicit differentiation based on fixed-point methods, to facilitate efficient gradient propagation in R-DFL. We rigorously prove that both methods achieve comparable gradient accuracy, with the implicit method offering superior computational efficiency. Extensive experiments on both synthetic and real-world datasets, including the newsvendor problem and the bipartite matching problem, demonstrate that R-DFL not only substantially enhances the final decision quality over sequential baselines but also exhibits robust adaptability across diverse scenarios in closed-loop decision-making problems.
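The implicit, fixed-point route to gradients can be shown in one scalar example: if x* satisfies x* = f(theta, x*), the implicit function theorem gives dx*/dtheta = (df/dtheta) / (1 - df/dx) at the fixed point, with no need to backpropagate through the iterations. A toy sketch with finite differences standing in for automatic differentiation (illustrative only, not the paper's solver):

```python
def fixed_point(f, theta, x0=0.0, n=100):
    """Solve x = f(theta, x) by plain iteration (assumes a contraction)."""
    x = x0
    for _ in range(n):
        x = f(theta, x)
    return x

def implicit_grad(f, theta, x_star, h=1e-6):
    """dx*/dtheta = (df/dtheta) / (1 - df/dx), evaluated at the fixed point."""
    dfdth = (f(theta + h, x_star) - f(theta - h, x_star)) / (2 * h)
    dfdx = (f(theta, x_star + h) - f(theta, x_star - h)) / (2 * h)
    return dfdth / (1.0 - dfdx)

f = lambda th, x: 0.5 * x + th   # fixed point is x* = 2*theta
xs = fixed_point(f, theta=1.0)
g = implicit_grad(f, 1.0, xs)
```

This is why the implicit method is cheaper than explicit unrolling: the gradient costs a single linear solve at the fixed point, independent of how many iterations the forward pass took.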
[458] DynaAct: Large Language Model Reasoning with Dynamic Action Spaces
Xueliang Zhao, Wei Wu, Jian Guan, Qintong Li, Lingpeng Kong
Main category: cs.LG
TL;DR: DynaAct automatically constructs compact action spaces for efficient sequential reasoning using LLM-extracted action sketches and submodular optimization for diversity and utility.
Details
Motivation: Existing approaches use either manual action spaces lacking scalability or unstructured spaces making exhaustive search computationally prohibitive.
Method: Extracts general action sketches from diverse reasoning problems using LLMs, then formulates submodular function to evaluate actions based on utility and diversity, using greedy selection.
Result: Significantly improves performance on six diverse benchmarks while maintaining efficient inference without substantial latency.
Conclusion: DynaAct provides an effective framework for automatic action space construction that enhances sequential reasoning in complex problem-solving.
Abstract: In modern sequential decision-making systems, the construction of an optimal candidate action space is critical to efficient inference. However, existing approaches either rely on manually defined action spaces that lack scalability or utilize unstructured spaces that render exhaustive search computationally prohibitive. In this paper, we propose a novel framework named \textsc{DynaAct} for automatically constructing a compact action space to enhance sequential reasoning in complex problem-solving scenarios. Our method first estimates a proxy for the complete action space by extracting general sketches observed in a corpus covering diverse complex reasoning problems using large language models. We then formulate a submodular function that jointly evaluates candidate actions based on their utility to the current state and their diversity, and employ a greedy algorithm to select an optimal candidate set. Extensive experiments on six diverse standard benchmarks demonstrate that our approach significantly improves overall performance, while maintaining efficient inference without introducing substantial latency. The implementation is available at https://github.com/zhaoxlpku/DynaAct.
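The greedy utility-plus-diversity selection can be sketched with a simple surrogate: at each step, pick the candidate whose utility, discounted by its similarity to what is already selected, is largest. The scoring functions below are toy stand-ins for the paper's submodular objective, not its actual formulation.

```python
def greedy_select(candidates, utility, similarity, k=2, lam=0.5):
    """Greedily pick k actions by marginal gain = utility minus
    lam times the max similarity to already-chosen actions."""
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < k:
        def gain(c):
            redundancy = max((similarity(c, s) for s in chosen), default=0.0)
            return utility(c) - lam * redundancy
        best = max(pool, key=gain)
        chosen.append(best)
        pool.remove(best)
    return chosen

# Toy setup: utility = length; two actions sharing a first letter count as similar.
candidates = ["expand", "explore", "verify"]
chosen = greedy_select(candidates,
                       utility=len,
                       similarity=lambda a, b: 1.0 if a[0] == b[0] else 0.0)
```

The second pick is "verify" rather than the higher-utility "expand" because the diversity term penalizes overlap with the already-selected "explore".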
[459] Online Linear Regression with Paid Stochastic Features
Nadav Merlis, Kyoungseok Jang, Nicolò Cesa-Bianchi
Main category: cs.LG
TL;DR: Online linear regression with noisy features where learners can pay to reduce noise. Optimal regret is √T when noise covariance is known and T^{2/3} when unknown.
Details
Motivation: Study practical scenarios where feature noise can be reduced through payments (e.g., better equipment, privacy incentives), and analyze the trade-off between prediction accuracy and cost.
Method: Analyze online linear regression with i.i.d. feature vectors corrupted by noise. Use matrix martingale concentration to show uniform convergence of empirical loss to expected loss across all payments and predictors.
Result: Proved optimal regret rates: √T (ignoring logs) when noise covariance mapping is known, and T^{2/3} when unknown.
Conclusion: The cost-quality trade-off in feature measurement significantly impacts learning efficiency, with unknown noise parameters substantially increasing regret rates.
Abstract: We study an online linear regression setting in which the observed feature vectors are corrupted by noise and the learner can pay to reduce the noise level. In practice, this may happen for several reasons: for example, because features can be measured more accurately using more expensive equipment, or because data providers can be incentivized to release less private features. Assuming feature vectors are drawn i.i.d. from a fixed but unknown distribution, we measure the learner’s regret against the linear predictor minimizing a notion of loss that combines the prediction error and payment. When the mapping between payments and noise covariance is known, we prove that the rate $\sqrt{T}$ is optimal for regret if logarithmic factors are ignored. When the noise covariance is unknown, we show that the optimal regret rate becomes of order $T^{2/3}$ (ignoring log factors). Our analysis leverages matrix martingale concentration, showing that the empirical loss uniformly converges to the expected one for all payments and linear predictors.
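The loss notion the learner competes against, combining prediction error and payment, can be made concrete with a toy model: suppose paying p shrinks the feature-noise variance to sigma0 / (1 + p), which inflates the squared prediction error by the predictor's squared norm times that variance. The payment-to-noise mapping here is an invented example, not one from the paper.

```python
def combined_loss(payment, w_norm_sq=4.0, sigma0=1.0):
    """Toy objective: noise-inflated prediction error plus the payment itself.
    The mapping payment -> noise variance is an assumed illustration."""
    noise_var = sigma0 / (1.0 + payment)
    return w_norm_sq * noise_var + payment

# With the mapping known, the learner can optimize the payment directly.
grid = [i / 10 for i in range(0, 50)]
best_payment = min(grid, key=combined_loss)
```

Here the optimum balances the two terms at p = 1 (where 4/(1+p)^2 = 1); when the mapping is unknown it must be estimated from data, which is what drives the regret rate from sqrt(T) up to T^{2/3}.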
[460] An Integrated Fusion Framework for Ensemble Learning Leveraging Gradient Boosting and Fuzzy Rule-Based Models
Jinbo Li, Peng Liu, Long Chen, Witold Pedrycz, Weiping Ding
Main category: cs.LG
TL;DR: Proposes an Integrated Fusion Framework combining Gradient Boosting with Fuzzy Rule-Based Models to enhance performance while maintaining interpretability, using dynamic control factors and sample-based correction to prevent overfitting.
Details
Motivation: To overcome limitations of individual learning paradigms by integrating the interpretability of fuzzy rule-based models with the performance benefits of gradient boosting, addressing challenges like complex design specifications and scalability issues.
Method: An ensemble framework where fuzzy rule-based models are constructed at each iteration, controlled by dynamic factors that prevent model dominance, encourage diversity, and act as regularization. Includes sample-based correction mechanism for adaptive adjustments.
Result: Experimental results demonstrate performance enhancement, particularly in mitigating overfitting and complexity associated with many rules, while maintaining model interpretability.
Conclusion: The framework successfully leverages optimal control factors to improve performance, maintain interpretability, and simplify model maintenance and updates through the integration of gradient boosting with fuzzy rule-based models.
Abstract: The integration of different learning paradigms has long been a focus of machine learning research, aimed at overcoming the inherent limitations of individual methods. Fuzzy rule-based models excel in interpretability and have seen widespread application across diverse fields. However, they face challenges such as complex design specifications and scalability issues with large datasets. The fusion of different techniques and strategies, particularly Gradient Boosting, with Fuzzy Rule-Based Models offers a robust solution to these challenges. This paper proposes an Integrated Fusion Framework that merges the strengths of both paradigms to enhance model performance and interpretability. At each iteration, a Fuzzy Rule-Based Model is constructed and controlled by a dynamic factor to optimize its contribution to the overall ensemble. This control factor serves multiple purposes: it prevents model dominance, encourages diversity, acts as a regularization parameter, and provides a mechanism for dynamic tuning based on model performance, thus mitigating the risk of overfitting. Additionally, the framework incorporates a sample-based correction mechanism that allows for adaptive adjustments based on feedback from a validation set. Experimental results substantiate the efficacy of the presented gradient boosting framework for fuzzy rule-based models, demonstrating performance enhancement, especially in terms of mitigating overfitting and complexity typically associated with many rules. By leveraging an optimal factor to govern the contribution of each model, the framework improves performance, maintains interpretability, and simplifies the maintenance and update of the models.
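As a rough illustration of the iteration scheme described above, the sketch below boosts single Gaussian-membership "fuzzy rules" on a toy 1-D regression, with a fixed shrinkage factor standing in for the paper's dynamic control factor. All names, rule shapes, and numbers here are our own assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal boosting sketch with a control factor limiting each rule's
# contribution (a fixed stand-in for the paper's dynamic factor). The weak
# learner is one Gaussian-membership fuzzy rule, not a full rule base.
X = np.linspace(-3, 3, 200)
y = np.sin(X)

def fit_rule(X, r):
    # One fuzzy rule: center it on the largest residual, Gaussian membership,
    # least-squares scalar consequent.
    c = X[np.argmax(np.abs(r))]
    mu = np.exp(-0.5 * (X - c) ** 2)
    w = (mu @ r) / (mu @ mu)
    return lambda x: w * np.exp(-0.5 * (x - c) ** 2)

F = np.zeros_like(y)
nu = 0.5                                 # control factor: prevents dominance
for _ in range(50):
    rule = fit_rule(X, y - F)            # fit a rule to the current residual
    F = F + nu * rule(X)                 # add its damped contribution

mse = float(np.mean((y - F) ** 2))
print(mse)
```

Shrinking each rule's contribution slows individual rules from dominating the ensemble, mirroring the regularization role the paper assigns to its control factor.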
[461] Hierarchical Structure-Property Alignment for Data-Efficient Molecular Generation and Editing
Ziyu Fan, Zhijian Huang, Yahan Li, Xiaowen Hu, Siyuan Shen, Yunliang Wang, Zeyu Zhong, Shuhong Liu, Shuning Yang, Shangqian Wu, Min Wu, Lei Deng
Main category: cs.LG
TL;DR: HSPAG is a hierarchical structure-property alignment framework for property-constrained molecular generation and editing that learns relationships between molecular structures and properties at multiple levels while being data-efficient.
Details
Motivation: Current molecular generation methods struggle with capturing complex structure-property relationships and suffer from limited coverage and incomplete property annotations in datasets.
Method: Treats SMILES and molecular properties as complementary modalities; learns relationships at atom, substructure, and whole-molecule levels; uses scaffold clustering and VAE for sample selection; incorporates property relevance-aware masking and diversified perturbation strategies.
Result: HSPAG successfully captures fine-grained structure-property relationships and supports controllable generation under multiple property constraints; validated through real-world case studies.
Conclusion: The proposed hierarchical structure-property alignment framework effectively addresses data efficiency and relationship modeling challenges in property-constrained molecular generation and editing.
Abstract: Property-constrained molecular generation and editing are crucial in AI-driven drug discovery but remain hindered by two factors: (i) capturing the complex relationships between molecular structures and multiple properties remains challenging, and (ii) the narrow coverage and incomplete annotations of molecular properties weaken the effectiveness of property-based models. To tackle these limitations, we propose HSPAG, a data-efficient framework featuring hierarchical structure-property alignment. By treating SMILES and molecular properties as complementary modalities, the model learns their relationships at atom, substructure, and whole-molecule levels. Moreover, we select representative samples through scaffold clustering and hard samples via an auxiliary variational auto-encoder (VAE), substantially reducing the required pre-training data. In addition, we incorporate a property relevance-aware masking mechanism and diversified perturbation strategies to enhance generation quality under sparse annotations. Experiments demonstrate that HSPAG captures fine-grained structure-property relationships and supports controllable generation under multiple property constraints. Two real-world case studies further validate the editing capabilities of HSPAG.
[462] HipKittens: Fast and Furious AMD Kernels
William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher Ré, Simran Arora
Main category: cs.LG
TL;DR: HipKittens (HK) is a programming framework that adapts tile-based abstractions from NVIDIA-focused DSLs like ThunderKittens to AMD GPUs, achieving performance competitive with hand-optimized assembly kernels and outperforming compiler baselines.
Details
Motivation: AMD GPUs offer high performance but require assembly programming for peak performance, while existing domain-specific languages like ThunderKittens are NVIDIA-specific. The paper aims to explore whether tile-based programming primitives can generalize to AMD hardware.
Method: The authors study programming primitives for performant AMD AI kernels and encapsulate the insights in the HipKittens framework, adapting tile-based abstractions from prior DSLs but rethinking the algorithms that instantiate them for AMD hardware across CDNA3 and CDNA4 platforms.
Result: HK kernels compete with AMD’s hand-optimized assembly kernels for GEMMs and attention, outperform compiler baselines, and in some settings outperform all available kernel baselines by 1.2-2.4× (e.g., d=64 attention, GQA backwards, memory-bound kernels).
Conclusion: Tile-based abstractions generalize to AMD GPUs but require algorithm rethinking. HipKittens demonstrates the feasibility of a single tile-based software layer for high-performance AI kernels across GPU vendors.
Abstract: AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives – for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers – are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs, however we need to rethink the algorithms that instantiate these abstractions for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD’s hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by $1.2-2.4\times$ (e.g., $d=64$ attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: https://github.com/HazyResearch/HipKittens.
[463] Dynamic Sparsity: Challenging Common Sparsity Assumptions for Learning World Models in Robotic Reinforcement Learning Benchmarks
Muthukumar Pandaram, Jakob Hollenstein, David Drexel, Samuele Tosatto, Antonio Rodríguez-Sánchez, Justus Piater
Main category: cs.LG
TL;DR: This paper critically examines sparsity assumptions in learned dynamics models for RL, finding that global sparsity is rare but local, state-dependent sparsity exists in specific temporal clusters and state dimensions.
Details
Motivation: To determine whether proposed notions of state and temporal sparsity in dynamics models actually hold in typical RL tasks, challenging common sparsity prior assumptions.
Method: Analyzed ground-truth dynamics from robotic RL environments in the MuJoCo Playground benchmark suite, examining causal graph sparsity, state-dependent sparsity, and sparse local dynamics changes.
Result: Found that global sparsity is rare, but local state-dependent sparsity exists in temporally localized clusters (e.g., during contact events) and affects specific subsets of state dimensions.
Conclusion: Common sparsity prior assumptions in dynamics learning may be misguided; instead, grounded inductive biases should reflect the state-dependent sparsity structure of real-world dynamics.
Abstract: The use of learned dynamics models, also known as world models, can improve the sample efficiency of reinforcement learning. Recent work suggests that the underlying causal graphs of such dynamics models are sparsely connected, with each of the future state variables depending only on a small subset of the current state variables, and that learning may therefore benefit from sparsity priors. Similarly, temporal sparsity, i.e. sparsely and abruptly changing local dynamics, has also been proposed as a useful inductive bias. In this work, we critically examine these assumptions by analyzing ground-truth dynamics from a set of robotic reinforcement learning environments in the MuJoCo Playground benchmark suite, aiming to determine whether the proposed notions of state and temporal sparsity actually tend to hold in typical reinforcement learning tasks. We study (i) whether the causal graphs of environment dynamics are sparse, (ii) whether such sparsity is state-dependent, and (iii) whether local system dynamics change sparsely. Our results indicate that global sparsity is rare, but instead the tasks show local, state-dependent sparsity in their dynamics and this sparsity exhibits distinct structures, appearing in temporally localized clusters (e.g., during contact events) and affecting specific subsets of state dimensions. These findings challenge common sparsity prior assumptions in dynamics learning, emphasizing the need for grounded inductive biases that reflect the state-dependent sparsity structure of real-world dynamics.
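The notion of state-dependent sparsity can be made concrete with a finite-difference Jacobian of a toy contact system (our example, not the paper's MuJoCo analysis): the coupling from position to velocity is present only in contact states, so the dynamics' causal graph is locally, not globally, sparse.

```python
import numpy as np

# Toy 1-D ball with a spring contact active only below the ground: the
# Jacobian entry d(v')/d(x) is nonzero only in contact states.
def step(s, dt=0.01, k=100.0, g=9.8):
    x, v = s
    contact = k * max(0.0, -x)           # spring force only when x < 0
    return np.array([x + dt * v, v + dt * (contact - g)])

def jacobian(f, s, eps=1e-6):
    # Central finite differences, column by column.
    n = len(s)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(s + e) - f(s - e)) / (2 * eps)
    return J

J_free = jacobian(step, np.array([1.0, 0.0]))      # in flight: no x -> v edge
J_contact = jacobian(step, np.array([-0.1, 0.0]))  # in contact: x -> v appears
print(J_free[1, 0], J_contact[1, 0])
```

This is exactly the temporally localized, contact-driven structure the paper reports: the same dynamics are sparse in flight and densely coupled during contact events.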
[464] Stuart-Landau Oscillatory Graph Neural Network
Kaicheng Zhang, David N. Reynolds, Piero Deidda, Francesco Tudisco
Main category: cs.LG
TL;DR: SLGNN is a novel oscillatory graph neural network based on Stuart-Landau oscillator dynamics that outperforms existing OGNNs by incorporating both amplitude and phase dynamics with tunable parameters.
Details
Motivation: To address oversmoothing and vanishing gradient problems in deep GNNs by developing physics-inspired oscillatory architectures that can capture richer dynamics than phase-only models.
Method: Proposes the Complex-Valued Stuart-Landau Graph Neural Network (SLGNN), which generalizes Kuramoto-based OGNNs by incorporating both amplitude and phase dynamics from Stuart-Landau oscillators, with tunable hyperparameters such as the Hopf-parameter and the coupling strength.
Result: Extensive experiments show SLGNN outperforms existing OGNNs across node classification, graph classification, and graph regression tasks.
Conclusion: SLGNN establishes a novel, expressive, and theoretically grounded framework for deep oscillatory architectures on graphs with improved performance over existing methods.
Abstract: Oscillatory Graph Neural Networks (OGNNs) are an emerging class of physics-inspired architectures designed to mitigate oversmoothing and vanishing gradient problems in deep GNNs. In this work, we introduce the Complex-Valued Stuart-Landau Graph Neural Network (SLGNN), a novel architecture grounded in Stuart-Landau oscillator dynamics. Stuart-Landau oscillators are canonical models of limit-cycle behavior near Hopf bifurcations, which are fundamental to synchronization theory and are widely used in e.g. neuroscience for mesoscopic brain modeling. Unlike harmonic oscillators and phase-only Kuramoto models, Stuart-Landau oscillators retain both amplitude and phase dynamics, enabling rich phenomena such as amplitude regulation and multistable synchronization. The proposed SLGNN generalizes existing phase-centric Kuramoto-based OGNNs by allowing node feature amplitudes to evolve dynamically according to Stuart-Landau dynamics, with explicit tunable hyperparameters (such as the Hopf-parameter and the coupling strength) providing additional control over the interplay between feature amplitudes and network structure. We conduct extensive experiments across node classification, graph classification, and graph regression tasks, demonstrating that SLGNN outperforms existing OGNNs and establishes a novel, expressive, and theoretically grounded framework for deep oscillatory architectures on graphs.
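For readers unfamiliar with the underlying model: a single Stuart-Landau oscillator dz/dt = (lam + i*om) z - |z|^2 z settles onto a limit cycle of radius sqrt(lam) when lam > 0, retaining both amplitude and phase dynamics. The Euler integration below sketches only this node-level behavior, not the paper's graph coupling or architecture; the parameter values are illustrative.

```python
import numpy as np

# One Stuart-Landau oscillator: dz/dt = (lam + 1j*om) z - |z|^2 z.
# For lam > 0, |z| is attracted to the limit cycle of radius sqrt(lam),
# a behavior phase-only Kuramoto models cannot express.
lam, om, dt = 0.25, 2.0, 0.001
z = 0.01 + 0.0j                          # small initial perturbation
for _ in range(100_000):                 # forward Euler to t = 100
    z = z + dt * ((lam + 1j * om) * z - (abs(z) ** 2) * z)

print(abs(z))                            # amplitude near sqrt(0.25) = 0.5
```

The amplitude self-regulates toward sqrt(lam) regardless of the small initial condition, which is the amplitude dynamics SLGNN exploits on top of phase synchronization.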
[465] A robust methodology for long-term sustainability evaluation of Machine Learning models
Jorge Paz-Ruza, João Gama, Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas
Main category: cs.LG
TL;DR: Proposes a comprehensive evaluation protocol for assessing long-term sustainability of ML models, showing traditional static evaluations fail to capture sustainability under evolving data and model updates.
Details
Motivation: Current AI sustainability assessments lack standardized, model-agnostic protocols, measure only short-term resource usage, and disproportionately emphasize batch learning, failing to reflect real-world long-term AI lifecycles.
Method: Developed a comprehensive evaluation protocol applicable to both batch and streaming learning scenarios, tested on diverse classification tasks using various model types.
Result: Long-term sustainability varies significantly across models, and higher environmental cost often yields little performance benefit. Traditional static train-test evaluations do not reliably capture sustainability under evolving data and repeated updates.
Conclusion: A standardized, comprehensive evaluation protocol is needed to properly assess ML model sustainability across their entire lifecycle, as current methods are inadequate for real-world deployment scenarios.
Abstract: Sustainability and efficiency have become essential considerations in the development and deployment of Artificial Intelligence systems, yet existing regulatory and reporting practices lack standardized, model-agnostic evaluation protocols. Current assessments often measure only short-term experimental resource usage and disproportionately emphasize batch learning settings, failing to reflect real-world, long-term AI lifecycles. In this work, we propose a comprehensive evaluation protocol for assessing the long-term sustainability of ML models, applicable to both batch and streaming learning scenarios. Through experiments on diverse classification tasks using a range of model types, we demonstrate that traditional static train-test evaluations do not reliably capture sustainability under evolving data and repeated model updates. Our results show that long-term sustainability varies significantly across models, and in many cases, higher environmental cost yields little performance benefit.
[466] SafeMIL: Learning Offline Safe Imitation Policy from Non-Preferred Trajectories
Returaj Burnwal, Nirav Pravinbhai Bhatt, Balaraman Ravindran
Main category: cs.LG
TL;DR: SafeMIL: Offline safe imitation learning using non-preferred trajectories to learn risky behavior avoidance via Multiple Instance Learning, achieving safer policies without reward degradation.
Details
Motivation: Online interactions can be risky in real-world settings, and specifying exact reward/safety costs is difficult. Non-preferred trajectories implicitly convey risky behaviors to avoid, providing a safer learning alternative.
Method: Proposes SafeMIL, which uses Multiple Instance Learning to learn a parameterized cost function that predicts risky state-action pairs from non-preferred trajectories, then uses this cost to avoid unsafe behaviors.
Result: Empirically demonstrates that SafeMIL learns safer policies that satisfy cost constraints without degrading reward performance, outperforming several baseline methods.
Conclusion: SafeMIL effectively leverages non-preferred trajectories for offline safe imitation learning, enabling risk-aware policy learning while maintaining performance.
Abstract: In this work, we study the problem of offline safe imitation learning (IL). In many real-world settings, online interactions can be risky, and accurately specifying the reward and the safety cost information at each timestep can be difficult. However, it is often feasible to collect trajectories reflecting undesirable or risky behavior, implicitly conveying the behavior the agent should avoid. We refer to these trajectories as non-preferred trajectories. Unlike standard IL, which aims to mimic demonstrations, our agent must also learn to avoid risky behavior using non-preferred trajectories. In this paper, we propose a novel approach, SafeMIL, to learn a parameterized cost that predicts if the state-action pair is risky via Multiple Instance Learning. The learned cost is then used to avoid non-preferred behaviors, resulting in a policy that prioritizes safety. We empirically demonstrate that our approach can learn a safer policy that satisfies cost constraints without degrading the reward performance, thereby outperforming several baselines.
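The Multiple Instance Learning framing can be sketched on synthetic data. Below is our toy stand-in for the cost model, not SafeMIL itself: each "trajectory" is a bag of scalar features, a bag is non-preferred iff it contains at least one planted risky instance, and the bag-level cost is the max over per-instance logistic scores, trained with bag-level labels only.

```python
import numpy as np

# Toy MIL sketch (our construction): bag label = non-preferred iff some
# instance is risky; bag score = max over instance logistic scores.
rng = np.random.default_rng(2)

def make_bag(risky):
    bag = rng.normal(0.0, 1.0, size=8)
    if risky:
        bag[rng.integers(8)] += 5.0      # plant one high-cost instance
    return bag

bags = [make_bag(i % 2 == 0) for i in range(200)]
labels = np.array([1.0 if i % 2 == 0 else 0.0 for i in range(200)])

w, b = 0.1, 0.0
for _ in range(500):                     # gradient descent on bag-level BCE
    grad_w = grad_b = 0.0
    for bag, y in zip(bags, labels):
        i = int(np.argmax(w * bag + b))  # max-pooled instance
        p = 1.0 / (1.0 + np.exp(-(w * bag[i] + b)))
        grad_w += (p - y) * bag[i]
        grad_b += (p - y)
    w -= 0.05 * grad_w / len(bags)
    b -= 0.05 * grad_b / len(bags)

acc = np.mean([(1.0 / (1.0 + np.exp(-(w * bag.max() + b))) > 0.5) == y
               for bag, y in zip(bags, labels)])
print(acc)
```

Only bag labels supervise the training, yet the max-pooled instance score learns to flag the individual risky instances, which is the property a per-step cost function needs.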
[467] Deep (Predictive) Discounted Counterfactual Regret Minimization
Hang Xu, Kai Li, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
Main category: cs.LG
TL;DR: Proposes an efficient model-free neural CFR algorithm that overcomes limitations in approximating advanced CFR variants, achieving faster convergence and stronger adversarial performance.
Details
Motivation: Existing neural CFR methods mainly use vanilla CFR and struggle to effectively integrate more advanced CFR variants, limiting their applicability in large games.
Method: Collects variance-reduced sampled advantages using a value network, fits cumulative advantages by bootstrapping, and applies discounting and clipping operations to simulate advanced CFR variant update mechanisms.
Result: Exhibits faster convergence in typical imperfect-information games and demonstrates stronger adversarial performance in large poker games compared to existing model-free neural algorithms.
Conclusion: The proposed neural CFR algorithm successfully approximates advanced CFR variants and shows improved performance over existing methods in large imperfect-information games.
Abstract: Counterfactual regret minimization (CFR) is a family of algorithms for effectively solving imperfect-information games. To enhance CFR’s applicability in large games, researchers use neural networks to approximate its behavior. However, existing methods are mainly based on vanilla CFR and struggle to effectively integrate more advanced CFR variants. In this work, we propose an efficient model-free neural CFR algorithm, overcoming the limitations of existing methods in approximating advanced CFR variants. At each iteration, it collects variance-reduced sampled advantages based on a value network, fits cumulative advantages by bootstrapping, and applies discounting and clipping operations to simulate the update mechanisms of advanced CFR variants. Experimental results show that, compared with model-free neural algorithms, it exhibits faster convergence in typical imperfect-information games and demonstrates stronger adversarial performance in a large poker game.
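The discounting operations the paper approximates with networks can be sketched in tabular form. Below is a DCFR-style regret-matching loop on rock-paper-scissors self-play (our toy, with assumed discount exponents alpha = 1.5 and beta = 0.5, and without the clipping variant); the linearly weighted average strategy approaches the uniform equilibrium.

```python
import numpy as np

# Tabular sketch of discounted CFR on rock-paper-scissors (our toy, not the
# paper's neural algorithm). Row player's payoff matrix; zero-sum game.
PAYOFF = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
alpha, beta = 1.5, 0.5
regrets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # asymmetric start
strat_sum = np.zeros((2, 3))

def current_strategy(r):
    # Regret matching: play proportionally to positive cumulative regret.
    pos = np.maximum(r, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(3, 1 / 3)

for t in range(1, 2001):
    s = [current_strategy(regrets[p]) for p in range(2)]
    u = [PAYOFF @ s[1], -PAYOFF.T @ s[0]]     # action values for each player
    for p in range(2):
        regrets[p] += u[p] - s[p] @ u[p]       # instantaneous regrets
        # DCFR-style discounting: positive and negative regrets scaled
        # by t^alpha/(t^alpha+1) and t^beta/(t^beta+1) respectively.
        d_pos = t ** alpha / (t ** alpha + 1)
        d_neg = t ** beta / (t ** beta + 1)
        regrets[p] = np.where(regrets[p] > 0, regrets[p] * d_pos,
                              regrets[p] * d_neg)
        strat_sum[p] += t * s[p]               # linear averaging

avg = strat_sum[0] / strat_sum[0].sum()
print(avg)                                     # close to (1/3, 1/3, 1/3)
```

Discounting negative regrets more aggressively lets the current strategy recover quickly from the biased start, which is the behavior the neural variant emulates with its discount and clip operations.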
[468] Improving Long-Range Interactions in Graph Neural Simulators via Hamiltonian Dynamics
Tai Hoang, Alessandro Trenta, Alessio Gravina, Niklas Freymuth, Philipp Becker, Davide Bacciu, Gerhard Neumann
Main category: cs.LG
TL;DR: IGNS is a graph-based neural simulator that uses Hamiltonian dynamics principles to preserve information and handle long-range interactions, outperforming existing methods on complex physical systems.
Details
Motivation: Traditional numerical solvers are computationally expensive, and current Graph Neural Simulators struggle with long-range interactions and error accumulation in autoregressive rollouts.
Method: Proposes Information-preserving Graph Neural Simulators (IGNS) based on Hamiltonian dynamics, with a warmup phase for global context, geometric encoding for irregular meshes, and a multi-step training objective to reduce rollout errors.
Result: IGNS consistently outperforms state-of-the-art GNSs across all tasks, achieving higher accuracy and stability in challenging dynamical systems.
Conclusion: IGNS effectively addresses limitations of existing neural simulators by preserving information through Hamiltonian structure and handling complex dynamics including non-conservative effects.
Abstract: Learning to simulate complex physical systems from data has emerged as a promising way to overcome the limitations of traditional numerical solvers, which often require prohibitive computational costs for high-fidelity solutions. Recent Graph Neural Simulators (GNSs) accelerate simulations by learning dynamics on graph-structured data, yet often struggle to capture long-range interactions and suffer from error accumulation under autoregressive rollouts. To address these challenges, we propose Information-preserving Graph Neural Simulators (IGNS), a graph-based neural simulator built on the principles of Hamiltonian dynamics. This structure guarantees preservation of information across the graph, while extending to port-Hamiltonian systems allows the model to capture a broader class of dynamics, including non-conservative effects. IGNS further incorporates a warmup phase to initialize global context, geometric encoding to handle irregular meshes, and a multi-step training objective to reduce rollout error. To evaluate these properties systematically, we introduce new benchmarks that target long-range dependencies and challenging external forcing scenarios. Across all tasks, IGNS consistently outperforms state-of-the-art GNSs, achieving higher accuracy and stability under challenging and complex dynamical systems.
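Why a Hamiltonian structure helps with error accumulation can be seen on the smallest possible example (ours, not the paper's graph setting): for the harmonic oscillator H = (p^2 + q^2)/2, a symplectic integrator keeps the energy bounded over long rollouts, while plain explicit Euler drifts without bound.

```python
import numpy as np

# Long-rollout stability of structure-preserving integration (illustrative
# only; IGNS learns such dynamics on graphs rather than integrating a known H).
dt, steps = 0.01, 10_000

q_e, p_e = 1.0, 0.0                      # explicit Euler state
q_s, p_s = 1.0, 0.0                      # symplectic Euler state
for _ in range(steps):
    q_e, p_e = q_e + dt * p_e, p_e - dt * q_e
    p_s = p_s - dt * q_s                 # symplectic: momentum first...
    q_s = q_s + dt * p_s                 # ...then position with new momentum

H = lambda q, p: 0.5 * (q * q + p * p)
print(H(q_e, p_e), H(q_s, p_s))
```

After 10,000 steps the explicit-Euler energy has grown well past its initial 0.5 while the symplectic rollout stays pinned near it; this bounded-error behavior is what "information preservation" buys in autoregressive simulation.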
[469] The Online Patch Redundancy Eliminator (OPRE): A novel approach to online agnostic continual learning using dataset compression
Raphaël Bayle, Martial Mermillod, Robert M. French
Main category: cs.LG
TL;DR: The paper introduces OPRE, an online dataset compression algorithm for continual learning that achieves state-of-the-art performance on CIFAR datasets while requiring minimal prior assumptions about data.
Details
Motivation: To address catastrophic forgetting in continual learning and challenge methods that rely on non-agnostic assumptions like pretrained feature extractors, which limit data generality.
Method: Proposes the Online Patch Redundancy Eliminator (OPRE), an online dataset compression algorithm combined with classifier training at test time.
Result: OPRE achieves superior performance on CIFAR-10 and CIFAR-100 compared to other state-of-the-art online continual learning methods.
Conclusion: Online dataset compression may be necessary for fully agnostic continual learning, and OPRE demonstrates this with minimal interpretable data assumptions.
Abstract: In order to achieve Continual Learning (CL), the problem of catastrophic forgetting, one that has plagued neural networks since their inception, must be overcome. The evaluation of continual learning methods relies on splitting a known homogeneous dataset and learning the associated tasks one after the other. We argue that most CL methods introduce a priori information about the data to come and cannot be considered agnostic. We exemplify this point with the case of methods relying on pretrained feature extractors, which are still used in CL. After showing that pretrained feature extractors imply a loss of generality with respect to the data that can be learned by the model, we then discuss other kinds of a priori information introduced in other CL methods. We then present the Online Patch Redundancy Eliminator (OPRE), an online dataset compression algorithm, which, along with the training of a classifier at test time, yields performance on CIFAR-10 and CIFAR-100 superior to a number of other state-of-the-art online continual learning methods. Additionally, OPRE requires only minimal and interpretable hypothesis on the data to come. We suggest that online dataset compression could well be necessary to achieve fully agnostic CL.
[470] Towards Non-Stationary Time Series Forecasting with Temporal Stabilization and Frequency Differencing
Junkai Lu, Peng Chen, Chenjuan Guo, Yang Shu, Meng Wang, Bin Yang
Main category: cs.LG
TL;DR: DTAF is a dual-branch framework for long-term time series forecasting that addresses non-stationarity in both temporal and frequency domains through specialized modules for temporal stabilization and frequency wave modeling.
Details
Motivation: Real-world time series often exhibit non-stationarity, including temporal distribution shifts and spectral variability, which pose significant challenges for accurate long-term forecasting across domains like energy, finance, and transportation.
Method: A dual-branch framework with a Temporal Stabilizing Fusion (TFS) module that uses a non-stationary MOE filter to suppress temporal non-stationary patterns, and a Frequency Wave Modeling (FWM) module that applies frequency differencing to highlight spectral shifts.
Result: Extensive experiments on real-world benchmarks show DTAF outperforms state-of-the-art baselines with significant improvements in forecasting accuracy under non-stationary conditions.
Conclusion: DTAF effectively handles non-stationarity in both temporal and frequency domains, generating robust forecasts that adapt to complex real-world time series patterns.
Abstract: Time series forecasting is critical for decision-making across dynamic domains such as energy, finance, transportation, and cloud computing. However, real-world time series often exhibit non-stationarity, including temporal distribution shifts and spectral variability, which pose significant challenges for long-term time series forecasting. In this paper, we propose DTAF, a dual-branch framework that addresses non-stationarity in both the temporal and frequency domains. For the temporal domain, the Temporal Stabilizing Fusion (TFS) module employs a non-stationary mix of experts (MOE) filter to disentangle and suppress temporal non-stationary patterns while preserving long-term dependencies. For the frequency domain, the Frequency Wave Modeling (FWM) module applies frequency differencing to dynamically highlight components with significant spectral shifts. By fusing the complementary outputs of TFS and FWM, DTAF generates robust forecasts that adapt to both temporal and frequency domain non-stationarity. Extensive experiments on real-world benchmarks demonstrate that DTAF outperforms state-of-the-art baselines, yielding significant improvements in forecasting accuracy under non-stationary conditions. All codes are available at https://github.com/PandaJunk/DTAF.
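The core idea behind frequency differencing can be sketched in a few lines. This is our reading of the mechanism, not DTAF's implementation: compare amplitude spectra of consecutive windows of a non-stationary signal; bins whose energy shifts between windows stand out in the difference.

```python
import numpy as np

# Toy frequency differencing: a signal that switches from 5 Hz to 12 Hz.
# Differencing the window spectra highlights exactly the shifted bins.
fs, n = 100, 200                          # 100 Hz sampling, 2 s windows
t1 = np.arange(n) / fs
t2 = np.arange(n, 2 * n) / fs
w1 = np.sin(2 * np.pi * 5 * t1)           # 5 Hz in the first window
w2 = np.sin(2 * np.pi * 12 * t2)          # 12 Hz in the second window

A1 = np.abs(np.fft.rfft(w1)) / n          # amplitude spectra
A2 = np.abs(np.fft.rfft(w2)) / n
diff = A2 - A1                            # spectral shift between windows
freqs = np.fft.rfftfreq(n, 1 / fs)
print(freqs[np.argmax(diff)], freqs[np.argmin(diff)])
```

The largest positive entry of the difference sits at the newly appeared 12 Hz component and the largest negative entry at the vanished 5 Hz component, i.e. the differencing dynamically highlights components with significant spectral shifts.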
[471] PrefPoE: Advantage-Guided Preference Fusion for Learning Where to Explore
Zhihao Lin, Lin Wu, Zhen Tian, Jianglin Lan
Main category: cs.LG
TL;DR: PrefPoE introduces a Preference-Product-of-Experts framework that uses advantage-guided exploration to create soft trust regions, improving training stability and performance across various RL tasks.
Details
Motivation: Standard entropy maximization in RL leads to high variance and inefficient policy updates, while existing methods suffer from entropy collapse and premature convergence.
Method: Trains a preference network to focus on high-advantage actions and fuses it with the main policy using product-of-experts (PoE) fusion, creating soft trust regions for stable updates.
Result: Significant performance improvements: +321% on HalfCheetah-v4 (1276 → 5375), +69% on Ant-v4, +276% on LunarLander-v2, with enhanced training stability and sample efficiency compared to standard PPO.
Conclusion: Learning where to explore through advantage-guided preferences is crucial for RL, and PrefPoE provides a general framework for improving policy gradient methods across domains.
Abstract: Exploration in reinforcement learning remains a critical challenge, as naive entropy maximization often results in high variance and inefficient policy updates. We introduce PrefPoE, a novel Preference-Product-of-Experts framework that performs intelligent, advantage-guided exploration via the first principled application of product-of-experts (PoE) fusion for single-task exploration-exploitation balancing. By training a preference network to concentrate probability mass on high-advantage actions and fusing it with the main policy through PoE, PrefPoE creates a soft trust region that stabilizes policy updates while maintaining targeted exploration. Across diverse control tasks spanning both continuous and discrete action spaces, PrefPoE demonstrates consistent improvements: +321% on HalfCheetah-v4 (1276 → 5375), +69% on Ant-v4, +276% on LunarLander-v2, with consistently enhanced training stability and sample efficiency. Unlike standard PPO, which suffers from entropy collapse, PrefPoE sustains adaptive exploration through its unique dynamics, thereby preventing premature convergence and enabling superior performance. Our results establish that learning where to explore through advantage-guided preferences is as crucial as learning how to act, offering a general framework for enhancing policy gradient methods across the full spectrum of reinforcement learning domains. Code and pretrained models are available in supplementary materials.
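The product-of-experts fusion at the heart of this summary can be illustrated in isolation (our construction with made-up numbers, not PrefPoE's training loop): multiply a broad main policy with a preference distribution concentrated on high-advantage actions, then renormalize. The fused policy sharpens toward the preferred action without abandoning the main policy's support.

```python
import numpy as np

# Product-of-experts fusion of a main policy and an advantage-guided
# preference distribution (illustrative values only).
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits_policy = np.array([1.0, 1.1, 0.9, 1.0])   # nearly uniform exploration
advantages = np.array([0.1, 2.0, -1.0, 0.3])     # what the preference net sees
policy = softmax(logits_policy)
pref = softmax(advantages)

fused = policy * pref                            # PoE: experts multiply...
fused /= fused.sum()                             # ...then renormalize
print(policy.round(3), fused.round(3))
```

Because the fusion is multiplicative, the main policy acts as a soft constraint on the preference expert (and vice versa), which is the "soft trust region" intuition.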
[472] A Unified Geometric Field Theory Framework for Transformers: From Manifold Embeddings to Kernel Modulation
Xianshuai Shi, Jianfeng Zhu, Leibo Liu
Main category: cs.LG
TL;DR: The paper provides a unified theoretical framework that interprets Transformer components (positional encoding and attention) through field theory and kernel operators on continuous manifolds.
Details
Motivation: Transformers lack a unified physical/mathematical interpretation for their core components (positional encoding and attention mechanisms), which this work aims to address.
Method: Map discrete positions to spatial functions on continuous manifolds and interpret Transformer layers as kernel-modulated operators acting over embedded manifolds.
Result: A structural theoretical framework integrating positional encoding, kernel integral operators, and attention mechanisms for theoretical investigation.
Conclusion: Transformers can be understood through a field-theoretic lens as operators acting on continuous manifolds, providing a unified interpretation of their core components.
Abstract: The Transformer architecture has achieved tremendous success in natural language processing, computer vision, and scientific computing through its self-attention mechanism. However, its core components-positional encoding and attention mechanisms-have lacked a unified physical or mathematical interpretation. This paper proposes a structural theoretical framework that integrates positional encoding, kernel integral operators, and attention mechanisms for in-depth theoretical investigation. We map discrete positions (such as text token indices and image pixel coordinates) to spatial functions on continuous manifolds, enabling a field-theoretic interpretation of Transformer layers as kernel-modulated operators acting over embedded manifolds.
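To make the kernel-operator view concrete, here is a toy sketch (not from the paper) in which attention output is a kernel-weighted average of values, with weights given by an RBF kernel over continuous token positions; the bandwidth and 1-D manifold are illustrative assumptions.

```python
import numpy as np

def kernel_attention(values, positions, query_pos, bandwidth=1.0):
    """Attention as a kernel-modulated operator: the output is a
    kernel-weighted average of values, weighted by an RBF kernel
    over positions on a continuous 1-D manifold."""
    d2 = (positions - query_pos) ** 2
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    w = w / w.sum()
    return w @ values

positions = np.linspace(0.0, 1.0, 5)                   # token positions
values = np.stack([positions, positions ** 2], axis=1)  # toy value vectors
out = kernel_attention(values, positions, query_pos=0.5)
```

Replacing the RBF kernel with a learned, query-dependent kernel recovers something closer to standard softmax attention, which is the correspondence the framework formalizes.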
[473] Data-Driven Discovery of Feature Groups in Clinical Time Series
Fedor Sergeev, Manuel Burger, Polina Leshetkina, Vincent Fortuin, Gunnar Rätsch, Rita Kuznetsova
Main category: cs.LG
TL;DR: Proposes a method to automatically learn feature groups from clinical time series data by clustering weights of feature-wise embedding layers, improving predictive performance and providing clinically interpretable results.
Details
Motivation: Clinical time series data are multivariate with hundreds of heterogeneous features, and manually defining feature groups using semantic knowledge is challenging even for domain experts, limiting the performance of deep learning models.
Method: Learns feature groups by clustering weights of feature-wise embedding layers during standard supervised training, discovering groups that directly improve downstream performance.
Result: Outperforms static clustering approaches on synthetic data and achieves performance comparable to expert-defined groups on real-world medical data, with learned groups being clinically interpretable.
Conclusion: The proposed method effectively discovers task-relevant feature relationships automatically, enhancing predictive modeling in clinical settings while maintaining interpretability.
Abstract: Clinical time series data are critical for patient monitoring and predictive modeling. These time series are typically multivariate and often comprise hundreds of heterogeneous features from different data sources. The grouping of features based on similarity and relevance to the prediction task has been shown to enhance the performance of deep learning architectures. However, defining these groups a priori using only semantic knowledge is challenging, even for domain experts. To address this, we propose a novel method that learns feature groups by clustering weights of feature-wise embedding layers. This approach seamlessly integrates into standard supervised training and discovers the groups that directly improve downstream performance on clinically relevant tasks. We demonstrate that our method outperforms static clustering approaches on synthetic data and achieves performance comparable to expert-defined groups on real-world medical data. Moreover, the learned feature groups are clinically interpretable, enabling data-driven discovery of task-relevant relationships between variables.
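The clustering step can be sketched as follows. This is a simplification (the paper learns groups jointly with supervised training; here the embeddings are fixed and a tiny k-means is run over them), and all data and names are synthetic.

```python
import numpy as np

def cluster_feature_embeddings(W, k, iters=20, seed=0):
    """Group features by k-means on the rows of a feature-wise embedding
    weight matrix W (one embedding vector per feature)."""
    rng = np.random.default_rng(seed)
    centers = W[rng.choice(len(W), size=k, replace=False)]
    for _ in range(iters):
        d = ((W[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = W[labels == j].mean(0)
    return labels

# Two synthetic "feature groups": embeddings near (0,0) and near (5,5).
W = np.vstack([np.zeros((4, 2)), np.full((4, 2), 5.0)])
labels = cluster_feature_embeddings(W, k=2)
```

Features whose embedding vectors end up close together are assigned to the same group, which is the signal the method exploits.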
[474] Rethinking Explanation Evaluation under the Retraining Scheme
Yi Cai, Thibaud Ardoin, Mayank Gulati, Gerhard Wunder
Main category: cs.LG
TL;DR: The paper investigates issues with retraining-based explanation evaluation methods like ROAR, identifies the “sign issue” as a key problem causing misalignment between empirical results and theoretical expectations, and proposes improved evaluation variants that enhance efficiency and reliability.
Details
Motivation: Current explanation evaluation methods face challenges: inference-based approaches suffer from distribution shift issues, while retraining-based methods like ROAR produce results that contradict theoretical foundations of explainers. There's a need to understand and resolve this misalignment.
Method: The authors identify the “sign issue” as the key problem in retraining-based evaluation, propose a straightforward reframing of the evaluation process to resolve it, and develop novel variants of the evaluation framework that improve efficiency while maintaining comprehensive assessment.
Result: The proposed evaluation variants significantly improve evaluation efficiency over standard retraining protocols and provide deeper insights into explainer performance across various data scales, revealing open challenges in explainability research.
Conclusion: The work successfully resolves the identified sign issue in retraining-based explanation evaluation, offers more efficient and reliable evaluation methods, and provides valuable insights for explainer selection and benchmarking in explainability research.
Abstract: Feature attribution has gained prominence as a tool for explaining model decisions, yet evaluating explanation quality remains challenging due to the absence of ground-truth explanations. To circumvent this, explanation-guided input manipulation has emerged as an indirect evaluation strategy, measuring explanation effectiveness through the impact of input modifications on model outcomes during inference. Despite the widespread use, a major concern with inference-based schemes is the distribution shift caused by such manipulations, which undermines the reliability of their assessments. The retraining-based scheme ROAR overcomes this issue by adapting the model to the altered data distribution. However, its evaluation results often contradict the theoretical foundations of widely accepted explainers. This work investigates this misalignment between empirical observations and theoretical expectations. In particular, we identify the sign issue as a key factor responsible for residual information that ultimately distorts retraining-based evaluation. Based on the analysis, we show that a straightforward reframing of the evaluation process can effectively resolve the identified issue. Building on the existing framework, we further propose novel variants that jointly structure a comprehensive perspective on explanation evaluation. These variants largely improve evaluation efficiency over the standard retraining protocol, thereby enhancing practical applicability for explainer selection and benchmarking. Following our proposed schemes, empirical results across various data scales provide deeper insights into the performance of carefully selected explainers, revealing open challenges and future directions in explainability research.
[475] Dual-Kernel Graph Community Contrastive Learning
Xiang Chen, Kun Yue, Wenjie Liu, Zhenyu Zhang, Liang Duan
Main category: cs.LG
TL;DR: Proposes an efficient Graph Contrastive Learning framework that transforms graphs into compact networks of node sets with linear complexity contrastive loss and knowledge distillation for scalable performance.
Details
Motivation: Address scalability issues in Graph Contrastive Learning caused by intensive message passing in GNNs and quadratic computational complexity of contrastive loss on large-scale graphs.
Method: Transforms input graph into compact network of interconnected node sets, introduces kernelized graph community contrastive loss with linear complexity, and incorporates knowledge distillation into decoupled GNN architecture.
Result: Outperforms state-of-the-art GCL baselines on sixteen real-world datasets of varying scales in both effectiveness and scalability.
Conclusion: The proposed framework successfully addresses scalability challenges in GCL while maintaining strong performance across diverse graph datasets.
Abstract: Graph Contrastive Learning (GCL) has emerged as a powerful paradigm for training Graph Neural Networks (GNNs) in the absence of task-specific labels. However, its scalability on large-scale graphs is hindered by the intensive message passing mechanism of GNN and the quadratic computational complexity of contrastive loss over positive and negative node pairs. To address these issues, we propose an efficient GCL framework that transforms the input graph into a compact network of interconnected node sets while preserving structural information across communities. We firstly introduce a kernelized graph community contrastive loss with linear complexity, enabling effective information transfer among node sets to capture hierarchical structural information of the graph. We then incorporate a knowledge distillation technique into the decoupled GNN architecture to accelerate inference while maintaining strong generalization performance. Extensive experiments on sixteen real-world datasets of varying scales demonstrate that our method outperforms state-of-the-art GCL baselines in both effectiveness and scalability.
[476] Test-time Diverse Reasoning by Riemannian Activation Steering
Ly Tran Ho Khanh, Dongxuan Zhu, Man-Chung Yue, Viet Anh Nguyen
Main category: cs.LG
TL;DR: Proposes test-time Riemannian activation steering to boost output diversity in Best-of-N reasoning by optimizing steering vectors that maximize volume spanned by reasoning trajectories.
Details
Motivation: Address output diversity limit in Best-of-N reasoning where models generate similar outputs despite stochastic sampling, leading to repeated errors.
Method: Unsupervised activation steering that finds steering vectors maximizing total volume spanned by intervened activation subsets, solved via Riemannian optimization over product of spheres with log-determinant objective.
Result: Outperforms vanilla sampling techniques on mathematical benchmarks in both generative diversity and solution accuracy.
Conclusion: Test-time Riemannian activation steering effectively enhances output diversity and improves reasoning accuracy in language models.
Abstract: Best-of-$N$ reasoning improves the accuracy of language models in solving complex tasks by sampling multiple candidate solutions and then selecting the best one based on some criteria. A critical bottleneck for this strategy is the output diversity limit, which occurs when the model generates similar outputs despite stochastic sampling, and hence recites the same error. To address this lack of variance in reasoning paths, we propose a novel unsupervised activation steering strategy that simultaneously optimizes the steering vectors for multiple reasoning trajectories at test time. At any synchronization anchor along the batch generation process, we find the steering vectors that maximize the total volume spanned by all possible intervened activation subsets. We demonstrate that these steering vectors can be determined by solving a Riemannian optimization problem over the product of spheres with a log-determinant objective function. We then use a Riemannian block-coordinate descent algorithm with a well-tuned learning rate to obtain a stationary point of the problem, and we apply these steering vectors until the generation process reaches the subsequent synchronization anchor. Empirical evaluations on popular mathematical benchmarks demonstrate that our test-time Riemannian activation steering strategy outperforms vanilla sampling techniques in terms of generative diversity and solution accuracy.
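The log-determinant volume objective has a simple numerical reading: steering vectors that point in diverse directions span a larger volume than near-parallel ones. A minimal sketch (illustrative only; the paper optimizes this objective over a product of spheres with block-coordinate descent):

```python
import numpy as np

def logdet_volume(V, eps=1e-6):
    """Log-volume (log-determinant of the Gram matrix) spanned by
    steering vectors V (rows), each normalized onto the unit sphere."""
    U = V / np.linalg.norm(V, axis=1, keepdims=True)
    G = U @ U.T
    return np.linalg.slogdet(G + eps * np.eye(len(U)))[1]

diverse = np.eye(3)                       # mutually orthogonal directions
similar = np.array([[1.00, 0.0, 0.0],
                    [0.99, 0.1, 0.0],
                    [0.98, 0.0, 0.1]])    # nearly parallel directions
```

Maximizing this quantity over the steering vectors of the batch pushes the reasoning trajectories apart, which is the diversity mechanism the method relies on.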
[477] Improving the accuracy and generalizability of molecular property regression models with a substructure-substitution-rule-informed framework
Xiaoyu Fan, Lin Guo, Ruizhen Jia, Yang Tian, Zhihao Yang, Boxue Tian
Main category: cs.LG
TL;DR: MolRuleLoss is a framework that improves molecular property prediction accuracy and generalizability by incorporating substructure substitution rules into model loss functions, achieving significant performance gains across multiple tasks.
Details
Motivation: AI models for molecular property prediction often have poor accuracy in regression tasks and catastrophic performance on out-of-distribution molecules, limiting their practical utility in drug discovery.
Method: Incorporates partial derivative constraints for substructure substitution rules (SSRs) into molecular property regression models’ loss functions to improve accuracy and generalizability.
Result: Achieved 2.6-33.3% performance improvements across lipophilicity, water solubility, and solvation-free energy prediction tasks. Reduced RMSE from 29.507 to 0.007 for molecular weight prediction on OOD molecules.
Conclusion: MolRuleLoss framework significantly boosts prediction accuracy and generalizability of molecular property regression models, supporting applications in cheminformatics and AI-aided drug discovery.
Abstract: Artificial Intelligence (AI)-aided drug discovery is an active research field, yet AI models often exhibit poor accuracy in regression tasks for molecular property prediction, and perform catastrophically poorly for out-of-distribution (OOD) molecules. Here, we present MolRuleLoss, a substructure-substitution-rule-informed framework that improves the accuracy and generalizability of multiple molecular property regression models (MPRMs) such as GEM and UniMol for diverse molecular property prediction tasks. MolRuleLoss incorporates partial derivative constraints for substructure substitution rules (SSRs) into an MPRM’s loss function. When using GEM models for predicting lipophilicity, water solubility, and solvation-free energy (using lipophilicity, ESOL, and freeSolv datasets from MoleculeNet), the root mean squared error (RMSE) values with and without MolRuleLoss were 0.587 vs. 0.660, 0.777 vs. 0.798, and 1.252 vs. 1.877, respectively, representing 2.6-33.3% performance improvements. We show that both the number and the quality of SSRs contribute to the magnitude of prediction accuracy gains obtained upon adding MolRuleLoss to an MPRM. MolRuleLoss improved the generalizability of MPRMs for “activity cliff” molecules in a lipophilicity prediction task and improved the generalizability of MPRMs for OOD molecules in a melting point prediction task. In a molecular weight prediction task for OOD molecules, MolRuleLoss reduced the RMSE value of a GEM model from 29.507 to 0.007. We also provide a formal demonstration that the upper bound of the variation for property change of SSRs is positively correlated with an MPRM’s error. Together, we show that using the MolRuleLoss framework as a bolt-on boosts the prediction accuracy and generalizability of multiple MPRMs, supporting diverse applications in areas like cheminformatics and AI-aided drug discovery.
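The rule-informed loss term can be sketched as a pairwise penalty: for molecule pairs linked by a substructure substitution rule, the model's predicted property change should match the shift the rule prescribes. This is a hedged simplification of the framework (the actual loss uses partial derivative constraints), and the predictor, molecules, and shifts below are toy placeholders.

```python
import numpy as np

def ssr_penalty(predict, mol_pairs, expected_shifts):
    """Penalty term: for each (mol, substituted_mol) pair linked by a
    substructure substitution rule, the predicted property change should
    match the rule's expected shift (squared-error penalty)."""
    total = 0.0
    for (x, x_sub), shift in zip(mol_pairs, expected_shifts):
        total += (predict(x_sub) - predict(x) - shift) ** 2
    return total / len(mol_pairs)

# Toy predictor: property is the sum of (hypothetical) fragment features.
predict = lambda x: float(np.sum(x))
pairs = [(np.array([1.0, 0.0]), np.array([1.0, 2.0]))]  # substitution adds a fragment
```

Adding such a term to the base regression loss nudges the model toward respecting known chemistry even for molecules outside the training distribution.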
[478] Adversarial Bias: Data Poisoning Attacks on Fairness
Eunice Chan, Hanghang Tong
Main category: cs.LG
TL;DR: The paper demonstrates how adversarial poisoning attacks can intentionally compromise fairness in AI systems, showing that strategically injecting crafted data points into training sets can induce maximally unfair behavior in naive Bayes classifiers while preserving general performance.
Details
Motivation: With growing AI adoption, ensuring fairness is critical, but little research exists on how AI systems' fairness can be intentionally compromised through adversarial attacks.
Method: Theoretical analysis and experiments using adversarial poisoning strategy that injects carefully crafted data points into training sets to bias decision boundaries against protected groups.
Result: The attack significantly outperforms existing methods in degrading fairness metrics across multiple models and datasets, achieving higher unfairness levels with comparable or slightly worse accuracy impact.
Conclusion: The method provides a robust and potent approach to compromising machine learning fairness, demonstrating vulnerability across a wide range of models.
Abstract: With the growing adoption of AI and machine learning systems in real-world applications, ensuring their fairness has become increasingly critical. The majority of the work in algorithmic fairness focus on assessing and improving the fairness of machine learning systems. There is relatively little research on fairness vulnerability, i.e., how an AI system’s fairness can be intentionally compromised. In this work, we first provide a theoretical analysis demonstrating that a simple adversarial poisoning strategy is sufficient to induce maximally unfair behavior in naive Bayes classifiers. Our key idea is to strategically inject a small fraction of carefully crafted adversarial data points into the training set, biasing the model’s decision boundary to disproportionately affect a protected group while preserving generalizable performance. To illustrate the practical effectiveness of our method, we conduct experiments across several benchmark datasets and models. We find that our attack significantly outperforms existing methods in degrading fairness metrics across multiple models and datasets, often achieving substantially higher levels of unfairness with a comparable or only slightly worse impact on accuracy. Notably, our method proves effective on a wide range of models, in contrast to prior work, demonstrating a robust and potent approach to compromising the fairness of machine learning systems.
[479] LPPG-RL: Lexicographically Projected Policy Gradient Reinforcement Learning with Subproblem Exploration
Ruiyu Qiu, Rui Wang, Guanghui Yang, Xiang Li, Zhijiang Shao
Main category: cs.LG
TL;DR: LPPG-RL is a novel lexicographic multi-objective RL framework that uses sequential gradient projections and Dykstra’s algorithm to efficiently handle prioritized objectives in continuous spaces, outperforming existing methods.
Details
Motivation: Existing LMORL methods either require heuristic threshold tuning with prior knowledge or are limited to discrete domains, making them inefficient for real-world continuous applications with prioritized objectives.
Method: LPPG-RL uses sequential gradient projections to find feasible policy update directions, reformulates projection as optimization using Dykstra’s algorithm for speed, and introduces Subproblem Exploration to prevent gradient vanishing and enhance stability.
Result: Extensive experiments in 2D navigation show LPPG-RL outperforms state-of-the-art continuous LMORL methods, with theoretical guarantees for convergence and policy improvement.
Conclusion: LPPG-RL provides an effective framework for lexicographic multi-objective RL in continuous spaces, overcoming limitations of existing methods through gradient projections and optimization techniques.
Abstract: Lexicographic multi-objective problems, which consist of multiple conflicting subtasks with explicit priorities, are common in real-world applications. Despite the advantages of Reinforcement Learning (RL) in single tasks, extending conventional RL methods to prioritized multiple objectives remains challenging. In particular, traditional Safe RL and Multi-Objective RL (MORL) methods have difficulty enforcing priority orderings efficiently. Therefore, Lexicographic Multi-Objective RL (LMORL) methods have been developed to address these challenges. However, existing LMORL methods either rely on heuristic threshold tuning with prior knowledge or are restricted to discrete domains. To overcome these limitations, we propose Lexicographically Projected Policy Gradient RL (LPPG-RL), a novel LMORL framework which leverages sequential gradient projections to identify feasible policy update directions, thereby making LPPG-RL broadly compatible with all policy gradient algorithms in continuous spaces. LPPG-RL reformulates the projection step as an optimization problem, and utilizes Dykstra’s projection rather than generic solvers to deliver great speedups, especially for small- to medium-scale instances. In addition, LPPG-RL introduces Subproblem Exploration (SE) to prevent gradient vanishing, accelerate convergence and enhance stability. We provide theoretical guarantees for convergence and establish a lower bound on policy improvement. Finally, through extensive experiments in a 2D navigation environment, we demonstrate the effectiveness of LPPG-RL, showing that it outperforms existing state-of-the-art continuous LMORL methods.
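Dykstra's algorithm, the workhorse of the projection step, is easy to sketch for an intersection of halfspaces (a stand-in for the feasible-direction constraints; the halfspaces and gradient below are toy assumptions, not the paper's setup):

```python
import numpy as np

def proj_halfspace(x, a, b):
    """Project x onto the halfspace {y : a.y >= b}."""
    slack = a @ x - b
    if slack >= 0:
        return x
    return x - slack * a / (a @ a)

def dykstra(x0, halfspaces, iters=100):
    """Dykstra's algorithm: project x0 onto the intersection of halfspaces
    {y : a_i.y >= b_i}, cycling through the sets with correction terms."""
    x = x0.astype(float)
    p = [np.zeros_like(x) for _ in halfspaces]
    for _ in range(iters):
        for i, (a, b) in enumerate(halfspaces):
            y = proj_halfspace(x + p[i], a, b)
            p[i] = x + p[i] - y   # correction keeps the limit the true projection
            x = y
    return x

# Project a gradient onto directions that do not decrease two objectives.
hs = [(np.array([1.0, 0.0]), 0.0), (np.array([0.0, 1.0]), 0.0)]
g = np.array([-1.0, 2.0])
g_proj = dykstra(g, hs)   # nearest feasible direction
```

Unlike plain alternating projections, the correction terms `p[i]` guarantee convergence to the exact Euclidean projection onto the intersection, which is what makes it a drop-in replacement for a generic QP solver here.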
[480] HN-MVTS: HyperNetwork-based Multivariate Time Series Forecasting
Andrey Savchenko, Oleg Kachan
Main category: cs.LG
TL;DR: HN-MVTS integrates a hypernetwork-based generative prior with neural forecasting models to improve multivariate time series forecasting by generating weights for the last layer, enhancing generalization without increasing inference time.
Details
Motivation: Complex channel-dependent models often underperform compared to channel-independent models despite considering inter-component relationships, due to issues with generalization and robustness in handling complex temporal dependencies.
Method: Proposes HN-MVTS architecture using a hypernetwork that takes learnable time series component embeddings and generates weights for the last layer of forecasting networks, acting as a data-adaptive regularizer during training only.
Result: Extensive experiments on eight benchmark datasets show HN-MVTS typically improves performance of state-of-the-art models (DLinear, PatchTST, TSMixer, etc.) when applied to them.
Conclusion: Hypernetwork-driven parameterization offers a promising direction for enhancing existing forecasting techniques in complex scenarios, improving generalization and long-range predictive accuracy without inference overhead.
Abstract: Accurate forecasting of multivariate time series data remains a formidable challenge, particularly due to the growing complexity of temporal dependencies in real-world scenarios. While neural network-based models have achieved notable success in this domain, complex channel-dependent models often suffer from performance degradation compared to channel-independent models that do not consider the relationship between components but provide high robustness due to small capacity. In this work, we propose HN-MVTS, a novel architecture that integrates a hypernetwork-based generative prior with an arbitrary neural network forecasting model. The input of this hypernetwork is a learnable embedding matrix of time series components. To restrict the number of new parameters, the hypernetwork learns to generate the weights of the last layer of the target forecasting networks, serving as a data-adaptive regularizer that improves generalization and long-range predictive accuracy. The hypernetwork is used only during the training, so it does not increase the inference time compared to the base forecasting model. Extensive experiments on eight benchmark datasets demonstrate that application of HN-MVTS to the state-of-the-art models (DLinear, PatchTST, TSMixer, etc.) typically improves their performance. Our findings suggest that hypernetwork-driven parameterization offers a promising direction for enhancing existing forecasting techniques in complex scenarios.
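The wiring of the hypernetwork can be sketched in a few lines. This is an illustrative simplification (shapes, names, and the linear hypernetwork are assumptions; the paper pairs the hypernetwork with full forecasting backbones like DLinear or PatchTST):

```python
import numpy as np

def hypernet_forecast(features, channel_emb, W_hyper, b_hyper):
    """A hypernetwork maps per-channel embeddings to the weights of the
    forecasting model's last linear layer; the base model supplies `features`."""
    W_last = channel_emb @ W_hyper + b_hyper   # generated weights: (channels, feat_dim)
    return features @ W_last.T                 # forecasts: (batch, channels)

rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 4))     # learnable embedding per time series channel
W_h = rng.normal(size=(4, 8))     # hypernetwork parameters
b_h = np.zeros(8)
feats = rng.normal(size=(2, 8))   # base-model features for 2 samples
y = hypernet_forecast(feats, emb, W_h, b_h)
```

Because the generated weights replace only the last layer, and the hypernetwork is dropped after training, inference cost matches the base forecasting model.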
[481] From Confusion to Clarity: ProtoScore - A Framework for Evaluating Prototype-Based XAI
Helena Monke, Benjamin Sae-Chew, Benjamin Fresz, Marco F. Huber
Main category: cs.LG
TL;DR: ProtoScore is a standardized benchmark framework for evaluating prototype-based XAI methods, particularly for time series data, addressing the lack of objective comparison standards in explainable AI.
Details
Motivation: The opacity of neural networks in high-stakes fields requires explainable AI methods, but there's a critical gap in standardized benchmarks for prototype-based XAI methods, especially for time series data, leading to subjective evaluations.
Method: Developed ProtoScore framework that integrates the Co-12 properties of Nauta et al. to assess prototype-based XAI methods across different data types with focus on time series.
Result: Created a robust framework that enables fair and comprehensive evaluation of prototype methods against each other and other XAI methods, with all code publicly available.
Conclusion: ProtoScore facilitates objective comparison of prototype-based XAI methods, helping practitioners select appropriate explanation methods while reducing user study costs.
Abstract: The complexity and opacity of neural networks (NNs) pose significant challenges, particularly in high-stakes fields such as healthcare, finance, and law, where understanding decision-making processes is crucial. To address these issues, the field of explainable artificial intelligence (XAI) has developed various methods aimed at clarifying AI decision-making, thereby facilitating appropriate trust and validating the fairness of outcomes. Among these methods, prototype-based explanations offer a promising approach that uses representative examples to elucidate model behavior. However, a critical gap exists regarding standardized benchmarks to objectively compare prototype-based XAI methods, especially in the context of time series data. This lack of reliable benchmarks results in subjective evaluations, hindering progress in the field. We aim to establish a robust framework, ProtoScore, for assessing prototype-based XAI methods across different data types with a focus on time series data, facilitating fair and comprehensive evaluations. By integrating the Co-12 properties of Nauta et al., this framework allows for effectively comparing prototype methods against each other and against other XAI methods, ultimately assisting practitioners in selecting appropriate explanation methods while minimizing the costs associated with user studies. All code is publicly available at https://github.com/HelenaM23/ProtoScore .
[482] Multi-objective Hyperparameter Optimization in the Age of Deep Learning
Soham Basu, Frank Hutter, Danny Stoll
Main category: cs.LG
TL;DR: PriMO is the first hyperparameter optimization algorithm that incorporates multi-objective user priors, achieving state-of-the-art performance across 8 DL benchmarks.
Details
Motivation: Current HPO algorithms cannot leverage expert prior knowledge about hyperparameter settings, especially for multiple objectives, creating a gap in the algorithmic landscape.
Method: Introduces PriMO algorithm that integrates multi-objective user beliefs into hyperparameter optimization.
Result: Achieves state-of-the-art performance across 8 deep learning benchmarks in both multi-objective and single-objective settings.
Conclusion: PriMO positions itself as the new go-to HPO algorithm for deep learning practitioners due to its ability to incorporate multi-objective priors.
Abstract: While Deep Learning (DL) experts often have prior knowledge about which hyperparameter settings yield strong performance, only few Hyperparameter Optimization (HPO) algorithms can leverage such prior knowledge and none incorporate priors over multiple objectives. As DL practitioners often need to optimize not just one but many objectives, this is a blind spot in the algorithmic landscape of HPO. To address this shortcoming, we introduce PriMO, the first HPO algorithm that can integrate multi-objective user beliefs. We show PriMO achieves state-of-the-art performance across 8 DL benchmarks in the multi-objective and single-objective setting, clearly positioning itself as the new go-to HPO algorithm for DL practitioners.
[483] EMAformer: Enhancing Transformer through Embedding Armor for Time Series Forecasting
Zhiwei Zhang, Xinyi Du, Xuanchi Guo, Weihao Wang, Wenjuan Han
Main category: cs.LG
TL;DR: EMAformer enhances Transformer architecture for multivariate time series forecasting by introducing three inductive biases to address unstable inter-channel relationships, achieving state-of-the-art performance on 12 benchmarks.
Details
Motivation: Current Transformer models like iTransformer lag behind MLP-based models in multivariate time series forecasting due to unstable inter-channel relationships, creating a performance gap that needs to be addressed.
Method: Proposes EMAformer which enhances Transformer with an auxiliary embedding suite that introduces three key inductive biases: global stability, phase sensitivity, and cross-axis specificity to reinforce the model’s capabilities.
Result: Achieves state-of-the-art performance on 12 real-world benchmarks, reducing forecasting errors by average of 2.73% in MSE and 5.15% in MAE compared to existing methods.
Conclusion: EMAformer significantly advances the practical applicability of Transformer-based approaches for multivariate time series forecasting by unlocking further potential of the Transformer architecture through targeted inductive biases.
Abstract: Multivariate time series forecasting is crucial across a wide range of domains. While presenting notable progress for the Transformer architecture, iTransformer still lags behind the latest MLP-based models. We attribute this performance gap to unstable inter-channel relationships. To bridge this gap, we propose EMAformer, a simple yet effective model that enhances the Transformer with an auxiliary embedding suite, akin to armor that reinforces its ability. By introducing three key inductive biases, i.e., \textit{global stability}, \textit{phase sensitivity}, and \textit{cross-axis specificity}, EMAformer unlocks the further potential of the Transformer architecture, achieving state-of-the-art performance on 12 real-world benchmarks and reducing forecasting errors by an average of 2.73% in MSE and 5.15% in MAE. This significantly advances the practical applicability of Transformer-based approaches for multivariate time series forecasting. The code is available on https://github.com/PlanckChang/EMAformer.
[484] Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
Hua Ye, Hang Ding, Siyuan Chen, Yiyang Jiang, Changyuan Zhang, Xuan Zhang
Main category: cs.LG
TL;DR: BACL improves multimodal models by focusing on ambiguous negatives through curriculum learning and local attention mechanisms.
Details
Motivation: Current multimodal models treat all negative pairs equally, missing the opportunity to learn from borderline cases where negatives differ only slightly from positives.
Method: Proposes Boundary-Aware Curriculum with Local Attention (BACL) with two modules: Boundary-aware Negative Sampler that gradually increases difficulty, and Contrastive Local Attention loss that identifies mismatch locations.
Result: Achieves up to +32% R@1 improvement over CLIP and sets new state-of-the-art on four large-scale benchmarks without requiring extra labels.
Conclusion: BACL is a lightweight, differentiable add-on that effectively leverages ambiguous negatives to significantly boost multimodal model performance.
Abstract: Most multimodal models treat every negative pair alike, ignoring the ambiguous negatives that differ from the positive by only a small detail. We propose Boundary-Aware Curriculum with Local Attention (BACL), a lightweight add-on that turns these borderline cases into a curriculum signal. A Boundary-aware Negative Sampler gradually raises difficulty, while a Contrastive Local Attention loss highlights where the mismatch occurs. The two modules are fully differentiable and work with any off-the-shelf dual encoder. Theory predicts a fast O(1/n) error rate; practice shows up to +32% R@1 over CLIP and new SOTA on four large-scale benchmarks, all without extra labels.
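The curriculum idea behind the negative sampler can be sketched with a difficulty target that rises with training progress. This is a toy stand-in, not the paper's sampler; the similarity scores and selection rule are illustrative assumptions.

```python
import numpy as np

def sample_negative(sim_to_positive, progress):
    """Curriculum negative selection: pick the candidate whose similarity
    to the positive is closest to a difficulty target that rises with
    training progress (0 = easy negatives, 1 = hardest, most ambiguous)."""
    target = progress * sim_to_positive.max()
    return int(np.argmin(np.abs(sim_to_positive - target)))

sims = np.array([0.1, 0.3, 0.8, 0.95])   # similarity of candidate negatives
easy = sample_negative(sims, progress=0.0)   # picks a clearly-different negative
hard = sample_negative(sims, progress=1.0)   # picks the near-duplicate negative
```

Starting with easy negatives stabilizes early training, while shifting toward ambiguous ones later forces the encoder to attend to the small details where the mismatch occurs.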
[485] ARAC: Adaptive Regularized Multi-Agent Soft Actor-Critic in Graph-Structured Adversarial Games
Ruochuan Shi, Runyu Lu, Yuanheng Zhu, Dongbin Zhao
Main category: cs.LG
TL;DR: ARAC is a multi-agent reinforcement learning method that combines attention-based GNNs with adaptive divergence regularization to address sparse rewards and dynamic interactions in graph-structured adversarial tasks.
Details
Motivation: To solve coordination problems in graph-structured MARL adversarial tasks where sparse rewards hinder efficient policy learning and dynamic interactions require effective modeling of agent dependencies.
Method: Integrates attention-based graph neural networks for modeling agent dependencies with adaptive divergence regularization that exploits reference policies early in training but reduces reliance on them over time.
Result: Achieves faster convergence, higher final success rates, and stronger scalability across varying numbers of agents in pursuit and confrontation scenarios compared to MARL baselines.
Conclusion: ARAC is effective for complex graph-structured environments, successfully addressing sparse reward problems and dynamic interactions through adaptive regularization and expressive graph representations.
Abstract: In graph-structured multi-agent reinforcement learning (MARL) adversarial tasks such as pursuit and confrontation, agents must coordinate under highly dynamic interactions, where sparse rewards hinder efficient policy learning. We propose Adaptive Regularized Multi-Agent Soft Actor-Critic (ARAC), which integrates an attention-based graph neural network (GNN) for modeling agent dependencies with an adaptive divergence regularization mechanism. The GNN enables expressive representation of spatial relations and state features in graph environments. Divergence regularization can serve as policy guidance to alleviate the sparse reward problem, but it may lead to suboptimal convergence when the reference policy itself is imperfect. The adaptive divergence regularization mechanism enables the framework to exploit reference policies for efficient exploration in the early stages, while gradually reducing reliance on them as training progresses to avoid inheriting their limitations. Experiments in pursuit and confrontation scenarios demonstrate that ARAC achieves faster convergence, higher final success rates, and stronger scalability across varying numbers of agents compared with MARL baselines, highlighting its effectiveness in complex graph-structured environments.
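The adaptive divergence regularization can be sketched as an actor loss plus a KL term toward the reference policy whose weight decays with training progress. The coefficient schedule below is an assumed illustration, not the paper's exact adaptation rule:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete action distributions."""
    return float(np.sum(p * np.log(p / q)))

def regularized_actor_loss(actor_loss, pi, pi_ref, step, decay=0.01):
    """Actor loss plus an adaptively weighted divergence to a reference policy.

    The coefficient starts at 1 and decays with the training step, so the
    reference policy guides exploration early on (mitigating sparse rewards)
    while its influence fades later to avoid inheriting its limitations.
    """
    beta = 1.0 / (1.0 + decay * step)
    return actor_loss + beta * kl(pi, pi_ref)

pi = np.array([0.7, 0.2, 0.1])       # current policy over 3 actions
pi_ref = np.array([0.5, 0.3, 0.2])   # imperfect reference policy
early = regularized_actor_loss(0.0, pi, pi_ref, step=0)
late = regularized_actor_loss(0.0, pi, pi_ref, step=10_000)
```

Early in training the full divergence penalty applies; after many steps the same policy pair contributes almost nothing to the loss.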
[486] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization
Xiyuan Wei, Chih-Jen Lin, Tianbao Yang
Main category: cs.LG
TL;DR: NeuCLIP is a novel optimization framework that reformulates CLIP’s contrastive loss using convex and variational analysis to estimate normalization terms more accurately without requiring large batches.
Details
Motivation: Existing CLIP training methods require large batches for accurate normalization term estimation, demanding substantial computational resources. Per-sample normalizer estimators have optimization errors that scale with dataset-to-batch size ratio, limiting effectiveness for large datasets or small batches.
Method: Reformulates contrastive loss via convex analysis into minimization with auxiliary variables for log-normalizers, then transforms this into minimizing a compact neural network that predicts log-normalizers. Uses alternating optimization to jointly train CLIP model and auxiliary network with tailored architecture and acceleration techniques.
Result: Extensive experiments on large-scale CLIP training across datasets from millions to billions of samples show NeuCLIP outperforms previous methods with more accurate normalizer estimation and improved performance.
Conclusion: NeuCLIP provides an effective solution to the normalization term estimation challenge in CLIP training, enabling better performance without requiring large computational batches, making it suitable for large-scale datasets.
Abstract: Accurately estimating the normalization term (also known as the partition function) in the contrastive loss is a central challenge for training Contrastive Language-Image Pre-training (CLIP) models. Conventional methods rely on large batches for approximation, demanding substantial computational resources. To mitigate this issue, prior works introduced per-sample normalizer estimators, which are updated at each epoch in a blockwise coordinate manner to keep track of updated encoders. However, this scheme incurs optimization error that scales with the ratio of dataset size to batch size, limiting effectiveness for large datasets or small batches. To overcome this limitation, we propose NeuCLIP, a novel and elegant optimization framework based on two key ideas: (i) $\textbf{reformulating}$ the contrastive loss for each sample $\textbf{via convex analysis}$ into a minimization problem with an auxiliary variable representing its log-normalizer; and (ii) $\textbf{transforming}$ the resulting minimization over $n$ auxiliary variables (where $n$ is the dataset size) via $\textbf{variational analysis}$ into the minimization over a compact neural network that predicts the log-normalizers. We design an alternating optimization algorithm that jointly trains the CLIP model and the auxiliary network. By employing a tailored architecture and acceleration techniques for the auxiliary network, NeuCLIP achieves more accurate normalizer estimation, leading to improved performance compared with previous methods. Extensive experiments on large-scale CLIP training, spanning datasets from millions to billions of samples, demonstrate that NeuCLIP outperforms previous methods.
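The convex reformulation rests on the identity log Z = min_u (u + Z·e^{-u} - 1): the auxiliary variable recovers the log-normalizer at the minimum, which is exactly what NeuCLIP's auxiliary network is trained to predict. A minimal numerical check of the identity:

```python
import numpy as np

def aux_objective(u, Z):
    """Variational upper bound on log Z: u + Z*exp(-u) - 1 >= log Z for all u,
    with equality at u = log Z. This per-sample reformulation lets a small
    network predict log-normalizers instead of estimating Z from a large batch.
    """
    return u + Z * np.exp(-u) - 1.0

Z = 7.3   # a per-sample partition-function value (toy scalar)
u = 0.0
# Minimize over the auxiliary variable by gradient descent (illustrative;
# the paper amortizes this minimization with a compact neural network).
for _ in range(500):
    grad = 1.0 - Z * np.exp(-u)
    u -= 0.05 * grad
```

After minimization, `u` equals `log(Z)` and the objective value equals `log(Z)`, confirming the bound is tight at the optimum.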
[487] Physics-Informed Neural Operators for Cardiac Electrophysiology
Hannah Lydon, Milad Kazemi, Martin Bishop, Nicola Paoletti
Main category: cs.LG
TL;DR: PINO (Physics-Informed Neural Operator) approach for cardiac electrophysiology modeling that learns mappings between function spaces, enabling generalization across mesh resolutions and initial conditions with accurate long-term predictions.
Details
Motivation: Traditional PDE solvers are computationally expensive and discretization-sensitive, while deep learning methods struggle with chaotic dynamics and long-term predictions. PINNs have limitations in mesh resolution and predictive stability.
Method: Physics-Informed Neural Operator (PINO) that learns mappings between function spaces rather than point-wise solutions, allowing generalization to multiple mesh resolutions and initial conditions.
Result: PINO accurately reproduces cardiac EP dynamics over extended time horizons, performs zero-shot evaluations on unseen scenarios, maintains predictive quality in long roll-outs, and scales predictive resolution up to 10x the training resolution, with a significant reduction in simulation time.
Conclusion: PINO-based approaches offer efficient and scalable cardiac EP simulations with advantages over traditional numerical solvers and PINNs, particularly in generalization and long-term stability.
Abstract: Accurately simulating systems governed by PDEs, such as voltage fields in cardiac electrophysiology (EP) modelling, remains a significant modelling challenge. Traditional numerical solvers are computationally expensive and sensitive to discretisation, while canonical deep learning methods are data-hungry and struggle with chaotic dynamics and long-term predictions. Physics-Informed Neural Networks (PINNs) mitigate some of these issues by incorporating physical constraints in the learning process, yet they remain limited by mesh resolution and long-term predictive stability. In this work, we propose a Physics-Informed Neural Operator (PINO) approach to solve PDE problems in cardiac EP. Unlike PINNs, PINO models learn mappings between function spaces, allowing them to generalise to multiple mesh resolutions and initial conditions. Our results show that PINO models can accurately reproduce cardiac EP dynamics over extended time horizons and across multiple propagation scenarios, including zero-shot evaluations on scenarios unseen during training. Additionally, our PINO models maintain high predictive quality in long roll-outs (where predictions are recursively fed back as inputs), and can scale their predictive resolution by up to 10x the training resolution. These advantages come with a significant reduction in simulation time compared to numerical PDE solvers, highlighting the potential of PINO-based approaches for efficient and scalable cardiac EP simulations.
[488] HardFlow: Hard-Constrained Sampling for Flow-Matching Models via Trajectory Optimization
Zeyang Li, Kaveh Alim, Navid Azizan
Main category: cs.LG
TL;DR: HardFlow: A novel framework that reformulates hard-constrained sampling as trajectory optimization using optimal control to precisely satisfy constraints at terminal time while maintaining sample quality.
Details
Motivation: Existing projection-based methods for enforcing hard constraints in generative models are overly restrictive and degrade sample quality, while downstream applications require precise constraint satisfaction.
Method: Leverages numerical optimal control to steer sampling trajectories, exploits flow-matching model structure, and uses model predictive control techniques to transform complex constrained optimization into tractable surrogate problems.
Result: Extensive experiments across robotics, PDEs, and vision domains show HardFlow substantially outperforms existing methods in both constraint satisfaction and sample quality.
Conclusion: The trajectory optimization perspective provides a flexible unified framework for hard constraint enforcement that goes beyond mere constraint satisfaction to minimize distribution shift and enhance sample quality.
Abstract: Diffusion and flow-matching have emerged as powerful methodologies for generative modeling, with remarkable success in capturing complex data distributions and enabling flexible guidance at inference time. Many downstream applications, however, demand enforcing hard constraints on generated samples (for example, robot trajectories must avoid obstacles), a requirement that goes beyond simple guidance. Prevailing projection-based approaches constrain the entire sampling path to the constraint manifold, which is overly restrictive and degrades sample quality. In this paper, we introduce a novel framework that reformulates hard-constrained sampling as a trajectory optimization problem. Our key insight is to leverage numerical optimal control to steer the sampling trajectory so that constraints are satisfied precisely at the terminal time. By exploiting the underlying structure of flow-matching models and adopting techniques from model predictive control, we transform this otherwise complex constrained optimization problem into a tractable surrogate that can be solved efficiently and effectively. Furthermore, this trajectory optimization perspective offers significant flexibility beyond mere constraint satisfaction, allowing for the inclusion of integral costs to minimize distribution shift and terminal objectives to further enhance sample quality, all within a unified framework. We provide a control-theoretic analysis of our method, establishing bounds on the approximation error between our tractable surrogate and the ideal formulation. Extensive experiments across diverse domains, including robotics (planning), partial differential equations (boundary control), and vision (text-guided image editing), demonstrate that our algorithm, which we name $\textit{HardFlow}$, substantially outperforms existing methods in both constraint satisfaction and sample quality.
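HardFlow's central move, constraining only the terminal state while leaving the sampling path free, can be illustrated with a toy shooting problem over a 1D linear "sampler" dynamic. The dynamics, finite-difference gradient, and constants are illustrative stand-ins for the flow-matching ODE and the MPC machinery in the paper:

```python
import numpy as np

def rollout(x0, control, n_steps=50, dt=0.02):
    """Euler rollout of a toy flow dx/dt = -x + control(t).

    Stands in for a flow-matching sampler; the control sequence is the
    decision variable of the trajectory-optimization problem."""
    x = x0
    for k in range(n_steps):
        x = x + dt * (-x + control[k])
    return x

def optimize_control(x0, target, n_steps=50, iters=300, lr=5.0):
    """Shooting method: adjust the control so the *terminal* state hits
    the constraint x(T) = target, leaving intermediate states free
    (unlike projection methods that pin the whole path to the manifold).
    The gradient is finite-differenced; a real implementation would
    backpropagate through the sampler."""
    control = np.zeros(n_steps)
    eps = 1e-5
    for _ in range(iters):
        base = rollout(x0, control, n_steps)
        grad = np.zeros(n_steps)
        for k in range(n_steps):
            c = control.copy()
            c[k] += eps
            grad[k] = 2.0 * (base - target) * (rollout(x0, c, n_steps) - base) / eps
        control -= lr * grad
    return control

control = optimize_control(x0=1.0, target=2.0)
x_T = rollout(1.0, control)
```

Uncontrolled, the state decays toward zero; the optimized control steers the trajectory so the terminal state satisfies the constraint precisely.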
[489] An update to PYRO-NN: A Python Library for Differentiable CT Operators
Linda-Sophie Schneider, Yipeng Sun, Chengze Ye, Markus Michen, Andreas Maier
Main category: cs.LG
TL;DR: PYRO-NN is an updated Python library for differentiable CT reconstruction that extends compatibility to PyTorch, adds CUDA kernel support, and provides tools for simulating artifacts and creating end-to-end trainable pipelines.
Details
Motivation: To integrate classical CT reconstruction techniques with data-driven deep learning approaches through differentiable operators, enabling end-to-end optimization and physical modeling within neural networks.
Method: Developed an updated Python-based library with PyTorch compatibility, native CUDA kernel support for efficient projection/back-projection operations across different geometries, and tools for simulating imaging artifacts and arbitrary acquisition trajectories.
Result: Created a flexible framework with high-level Python API that enables creation of end-to-end trainable pipelines for CT reconstruction combining classical methods with deep learning.
Conclusion: PYRO-NN provides an effective tool for integrating differentiable CT reconstruction with deep learning, facilitating the development of hybrid approaches that combine physical modeling with data-driven methods.
Abstract: Deep learning has brought significant advancements to X-ray Computed Tomography (CT) reconstruction, offering solutions to challenges arising from modern imaging technologies. These developments benefit from methods that combine classical reconstruction techniques with data-driven approaches. Differentiable operators play a key role in this integration by enabling end-to-end optimization and the incorporation of physical modeling within neural networks. In this work, we present an updated version of PYRO-NN, a Python-based library for differentiable CT reconstruction. The updated framework extends compatibility to PyTorch and introduces native CUDA kernel support for efficient projection and back-projection operations across parallel, fan, and cone-beam geometries. Additionally, it includes tools for simulating imaging artifacts, modeling arbitrary acquisition trajectories, and creating flexible, end-to-end trainable pipelines through a high-level Python API. Code is available at: https://github.com/csyben/PYRO-NN
[490] Coherence Mechanisms for Provable Self-Improvement
Mehryar Mohri, Jon Schneider, Yifan Wu
Main category: cs.LG
TL;DR: A principled framework for LLM self-improvement using coherence - requiring model outputs to remain consistent under task-preserving input transformations, with formal guarantees of monotonic improvement.
Details
Motivation: Prior self-improvement approaches rely on empirical heuristics without formal guarantees, highlighting the need for a principled framework with theoretical foundations.
Method: Projection-based mechanisms that update baseline models to be coherent while minimizing deviation from original behavior, using both direct and two-step projection methods.
Result: Rigorous theoretical guarantees of monotonic improvement (reduced Bregman divergence), extended to non-realizable settings, finite samples, and relaxed coherence constraints.
Conclusion: Coherence is established as a fundamental and necessary principle for provable self-improvement, with characterization theorems showing any mechanism with similar guarantees must conform to coherence-based structure.
Abstract: Self-improvement is a critical capability for large language models and other intelligent systems, enabling them to refine their behavior and internal consistency without external supervision. Despite its importance, prior approaches largely rely on empirical heuristics and lack formal guarantees. In this paper, we propose a principled framework for self-improvement based on the concept of \emph{coherence}, which requires that a model’s outputs remain consistent under task-preserving transformations of the input. We formalize this concept using projection-based mechanisms that update a baseline model to be coherent while remaining as close as possible to its original behavior. We provide rigorous theoretical guarantees that these mechanisms achieve \emph{monotonic improvement}, measured by a reduction in expected Bregman divergence. Our analysis is comprehensive, covering both \emph{direct} and \emph{two-step} projection methods, and robustly extends these guarantees to non-realizable settings, empirical (finite-sample) distributions, and relaxed coherence constraints. Furthermore, we establish a general \emph{characterization theorem}, showing that any mechanism with similar provable improvement guarantees must inherently conform to a coherence-based structure. This culminates in rigidity results under the demand for universal improvement, establishing coherence as a fundamental and, in a formal sense, necessary principle for provable self-improvement.
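Under squared loss (a Bregman divergence), the direct projection onto the coherent set is simply the average over the transformation orbit, and the monotonic-improvement guarantee reduces to the bias-variance identity. A minimal numerical check, with synthetic outputs standing in for a model's predictions on task-preserving transformations (e.g. paraphrases) of one input:

```python
import numpy as np

rng = np.random.default_rng(0)

# Baseline model outputs on task-preserving transformations of one input.
# A coherent model must give identical outputs across the whole orbit.
y_true = 0.7                                      # shared target
outputs = y_true + rng.normal(0.0, 0.3, size=8)   # incoherent baseline predictions

# Direct projection onto the coherent set under squared loss: average over
# the orbit (other Bregman divergences induce other projections).
coherent_output = outputs.mean()

baseline_risk = float(np.mean((outputs - y_true) ** 2))
projected_risk = float((coherent_output - y_true) ** 2)
```

The projected risk never exceeds the baseline risk, and the gap equals the variance of the incoherent outputs, a one-line instance of the monotonic-improvement guarantee.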
[491] One Model for All: Universal Pre-training for EEG based Emotion Recognition across Heterogeneous Datasets and Paradigms
Xiang Li, You Li, Yazhou Zhang
Main category: cs.LG
TL;DR: A universal pre-training framework for EEG emotion recognition that handles dataset heterogeneity through two-stage learning: univariate pre-training via self-supervised contrastive learning and multivariate fine-tuning with ART-GAT architecture.
Details
Motivation: EEG-based emotion recognition faces challenges with dataset heterogeneity (channel/subject variability) that hinders generalizable models, and existing approaches struggle with effective knowledge transfer.
Method: Two-stage framework: (1) Univariate pre-training using self-supervised contrastive learning on individual channels with Unified Channel Schema; (2) Multivariate fine-tuning with Adaptive Resampling Transformer (ART) and Graph Attention Network (GAT) to capture spatio-temporal dependencies.
Result: Achieves new SOTA on all within-subject benchmarks: SEED (99.27%), DEAP (93.69%), DREAMER (93.93%). Also shows SOTA cross-dataset transfer with 94.08% (intersection) and 93.05% (UCS) on unseen DREAMER dataset. GAT module provides +22.19% gain over GCN on DEAP.
Conclusion: The framework enables universal, scalable, and effective pre-trained models for diverse EEG analysis tasks, with universal pre-training serving as an essential stabilizer and the GAT module being critical for performance.
Abstract: EEG-based emotion recognition is hampered by profound dataset heterogeneity (channel/subject variability), hindering generalizable models. Existing approaches struggle to transfer knowledge effectively. We propose ‘One Model for All’, a universal pre-training framework for EEG analysis across disparate datasets. Our paradigm decouples learning into two stages: (1) Univariate pre-training via self-supervised contrastive learning on individual channels, enabled by a Unified Channel Schema (UCS) that leverages the channel union (e.g., SEED-62ch, DEAP-32ch); (2) Multivariate fine-tuning with a novel ‘ART’ (Adaptive Resampling Transformer) and ‘GAT’ (Graph Attention Network) architecture to capture complex spatio-temporal dependencies. Experiments show universal pre-training is an essential stabilizer, preventing collapse on SEED (vs. scratch) and yielding substantial gains on DEAP (+7.65%) and DREAMER (+3.55%). Our framework achieves new SOTA performance on all within-subject benchmarks: SEED (99.27%), DEAP (93.69%), and DREAMER (93.93%). We also show SOTA cross-dataset transfer, achieving 94.08% (intersection) and 93.05% (UCS) on the unseen DREAMER dataset, with the former surpassing the within-domain pre-training benchmark. Ablation studies validate our architecture: the GAT module is critical, yielding a +22.19% gain over GCN on the high-noise DEAP dataset, and its removal causes a catastrophic -16.44% performance drop. This work paves the way for more universal, scalable, and effective pre-trained models for diverse EEG analysis tasks.
[492] Binary Split Categorical feature with Mean Absolute Error Criteria in CART
Peng Yu, Yike Chen, Chao Xu, Albert Bifet, Jesse Read
Main category: cs.LG
TL;DR: Unsupervised numerical encoding methods fail for MAE criterion in CART. A new efficient splitting algorithm is proposed to handle categorical features with MAE.
Details
Motivation: Traditional approaches using numerical encoding for categorical features with MAE criterion are ineffective, creating a need for better methods.
Method: A novel efficient splitting algorithm specifically designed for handling categorical features with the Mean Absolute Error criterion.
Result: Demonstrated that unsupervised numerical encoding methods are not viable for the MAE criterion and presented a working alternative.
Conclusion: The proposed algorithm offers a promising solution to enhance categorical data handling in CART algorithms, overcoming limitations of existing approaches.
Abstract: In the context of the Classification and Regression Trees (CART) algorithm, the efficient splitting of categorical features using standard criteria like GINI and Entropy is well-established. However, using the Mean Absolute Error (MAE) criterion for categorical features has traditionally relied on various numerical encoding methods. This paper demonstrates that unsupervised numerical encoding methods are not viable for the MAE criterion. Furthermore, we present a novel and efficient splitting algorithm that addresses the challenges of handling categorical features with the MAE criterion. Our findings underscore the limitations of existing approaches and offer a promising solution to enhance the handling of categorical data in CART algorithms.
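The objective the paper's efficient algorithm optimizes can be pinned down with a brute-force baseline: search all binary partitions of the category set for the one minimizing the size-weighted child MAE around each child's median. This exhaustive version is exponential in the number of categories and is shown only to define the criterion, not as the paper's method:

```python
import itertools
import numpy as np

def mae(y):
    """Mean absolute error around the within-node median (the L1-optimal constant)."""
    return float(np.mean(np.abs(y - np.median(y)))) if len(y) else 0.0

def best_binary_split(categories, y):
    """Exhaustive search over binary partitions of the category set,
    minimizing size-weighted MAE of the two children. Exponential in
    the number of categories; baseline only."""
    cats = sorted(set(categories))
    best_score, best_left = np.inf, None
    for r in range(1, len(cats)):
        for left in itertools.combinations(cats, r):
            mask = np.isin(categories, left)
            score = (mask.sum() * mae(y[mask]) + (~mask).sum() * mae(y[~mask])) / len(y)
            if score < best_score:
                best_score, best_left = score, set(left)
    return best_score, best_left

cats = np.array(["a", "a", "b", "b", "c", "c"])
y = np.array([1.0, 1.2, 5.0, 5.2, 1.1, 0.9])
score, left = best_binary_split(cats, y)
```

Note that the best split here sends "b" alone to one side, a grouping no single unsupervised numerical encoding of a > b > c order would produce, which is the failure mode the paper identifies.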
[493] Clustering Guided Residual Neural Networks for Multi-Tx Localization in Molecular Communications
Ali Sonmez, Erencem Ozbey, Efe Feyzi Mantaroglu, H. Birkan Yilmaz
Main category: cs.LG
TL;DR: Clustering-based centroid correction methods and clustering-guided Residual Neural Networks (AngleNN and SizeNN) significantly improve multiple transmitter localization in Molecular Communication via Diffusion, reducing localization error by 43-69% compared to K-means.
Details
Motivation: Accurate localization of multiple transmitters in Molecular Communication via Diffusion is challenging due to the stochastic nature of diffusion and overlapping molecule distributions at the receiver surface.
Method: Proposed clustering-based centroid correction methods for robustness against density variations and outliers, and two clustering-guided Residual Neural Networks: AngleNN for direction refinement and SizeNN for cluster size estimation.
Result: Experimental results show significant improvements with localization error reduction between 69% (2-Tx) and 43% (4-Tx) compared to K-means.
Conclusion: The proposed approaches effectively address the challenges of multiple transmitter localization in Molecular Communication via Diffusion by combining clustering methods with neural networks for enhanced accuracy.
Abstract: Transmitter localization in Molecular Communication via Diffusion is a critical topic with many applications. However, accurate localization of multiple transmitters is a challenging problem due to the stochastic nature of diffusion and overlapping molecule distributions at the receiver surface. To address these issues, we introduce clustering-based centroid correction methods that enhance robustness against density variations and outliers. In addition, we propose two clustering-guided Residual Neural Networks, namely AngleNN for direction refinement and SizeNN for cluster size estimation. Experimental results show that both approaches provide significant improvements, reducing localization error by between 69% (2-Tx) and 43% (4-Tx) compared to K-means.
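The flavor of centroid correction can be sketched with a trimmed mean: drop the hit points farthest from the plain cluster mean (likely leakage from a neighboring transmitter's overlapping distribution), then re-average. This is an illustrative stand-in for the paper's correction methods, with synthetic 2D hit points:

```python
import numpy as np

def corrected_centroid(points, trim=0.2):
    """Robust centroid of one cluster of molecule hit points: discard the
    `trim` fraction of points farthest from the plain mean, then re-average.
    Illustrative version of clustering-based centroid correction."""
    center = points.mean(axis=0)
    dist = np.linalg.norm(points - center, axis=1)
    keep = dist <= np.quantile(dist, 1.0 - trim)
    return points[keep].mean(axis=0)

rng = np.random.default_rng(1)
tx = np.array([2.0, -1.0])                          # true transmitter position
cluster = tx + rng.normal(0.0, 0.1, size=(40, 2))   # hits from this Tx
outliers = tx + np.array([[3.0, 3.0]] * 5)          # hits leaked from another Tx
points = np.vstack([cluster, outliers])

naive = points.mean(axis=0)
robust = corrected_centroid(points)
```

The plain mean is dragged toward the leaked hits, while the corrected centroid stays close to the true transmitter.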
[494] LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
Randall Balestriero, Yann LeCun
Main category: cs.LG
TL;DR: LeJEPA is a theoretically grounded, scalable self-supervised learning method that combines Joint-Embedding Predictive Architecture with Sketched Isotropic Gaussian Regularization to achieve stable, efficient training across diverse architectures and domains.
Details
Motivation: Current Joint-Embedding Predictive Architectures (JEPAs) lack practical guidance and theoretical foundation, leading to ad-hoc research and development. There's a need for a comprehensive theory and practical implementation that can scale effectively.
Method: LeJEPA combines JEPA predictive loss with SIGReg (Sketched Isotropic Gaussian Regularization), which constrains embeddings to follow an isotropic Gaussian distribution - identified as optimal for minimizing downstream prediction risk. The method is heuristics-free and requires minimal code.
Result: LeJEPA achieves 79% accuracy on ImageNet-1k with ViT-H/14 using linear evaluation with a frozen backbone. It demonstrates stability across 10+ datasets, 60+ architectures, and varying scales and domains, with linear time and memory complexity and a single trade-off hyperparameter.
Conclusion: LeJEPA offers a simple, theory-friendly ecosystem that can reestablish self-supervised pre-training as a core AI research pillar, providing both theoretical grounding and practical efficiency across diverse applications.
Abstract: Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in {\bf LeJEPA}, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs’ embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective–{\bf Sketched Isotropic Gaussian Regularization} (SIGReg)–to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) heuristics-free, e.g., no stop-gradient, no teacher-student, no hyper-parameter schedulers, and (v) distributed training-friendly implementation requiring only $\approx$50 lines of code. Our empirical validation covers 10+ datasets, 60+ architectures, all with varying scales and domains. As an example, using imagenet-1k for pretraining and linear evaluation with frozen backbone, LeJEPA reaches 79% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (\href{git@github.com:rbalestr-lab/lejepa.git}{GitHub repo}).
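The sketching idea in SIGReg can be conveyed with random 1D projections: project the embeddings onto random unit directions and penalize deviation of each projection from N(0, 1). The moment-matching penalty below is an illustrative simplification (the actual SIGReg applies a statistical test to the projections):

```python
import numpy as np

def sigreg_sketch(z, n_directions=64, rng=None):
    """Sketched isotropic-Gaussian penalty (illustrative): compare the first
    two moments of random 1-D projections of the embeddings against N(0, 1).
    An isotropic Gaussian matches every projection; any anisotropy or shift
    shows up in some direction."""
    if rng is None:
        rng = np.random.default_rng(0)
    dirs = rng.normal(size=(n_directions, z.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
    proj = z @ dirs.T                                     # (n_samples, n_directions)
    mean_pen = np.mean(proj.mean(axis=0) ** 2)            # projected means vs 0
    var_pen = np.mean((proj.var(axis=0) - 1.0) ** 2)      # projected variances vs 1
    return float(mean_pen + var_pen)

rng = np.random.default_rng(42)
z_good = rng.normal(size=(4096, 16))    # ~ isotropic Gaussian embeddings
z_bad = z_good.copy()
z_bad[:, 0] *= 5.0                      # one dominant axis (anisotropic)
loss_good = sigreg_sketch(z_good)
loss_bad = sigreg_sketch(z_bad)
```

Isotropic Gaussian embeddings incur a near-zero penalty; a single inflated axis is detected by the random projections and penalized heavily.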
[495] FMMI: Flow Matching Mutual Information Estimation
Ivan Butakov, Alexander Semenenko, Alexey Frolov, Ivan Oseledets
Main category: cs.LG
TL;DR: Novel MI estimator using normalizing flows to transform between joint and marginal distributions, providing efficient and precise estimation that scales to high dimensions.
Details
Motivation: To overcome limitations of traditional discriminative MI estimators by reframing the approach from classification to distribution transformation.
Method: Learn a normalizing flow that transforms one distribution into another, rather than training a classifier to discriminate between joint and marginal distributions.
Result: Produces computationally efficient and precise MI estimates that scale well to high dimensions and across wide ranges of ground-truth MI values.
Conclusion: The normalizing flow-based approach provides a fundamentally different and effective framework for mutual information estimation with strong scalability properties.
Abstract: We introduce a novel Mutual Information (MI) estimator that fundamentally reframes the discriminative approach. Instead of training a classifier to discriminate between joint and marginal distributions, we learn a normalizing flow that transforms one into the other. This technique produces a computationally efficient and precise MI estimate that scales well to high dimensions and across a wide range of ground-truth MI values.
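The flow-based recipe can be checked in closed form for a correlated Gaussian pair, where the transport map from the joint to the product of marginals is known exactly. FMMI learns this map with a normalizing flow; everything else below is just the change-of-variables formula, with no discriminator involved:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
rho = 0.8
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)   # corr(x, y) = rho

def log_std_normal(v):
    return -0.5 * v**2 - 0.5 * np.log(2 * np.pi)

# Exact flow T(x, y) = (x, (y - rho*x)/sqrt(1 - rho^2)) transports the joint
# onto the product of (standard normal) marginals. FMMI learns T from data.
t2 = (y - rho * x) / np.sqrt(1 - rho**2)
log_det_J = -0.5 * np.log(1 - rho**2)          # log |det dT/d(x, y)|

# Change of variables gives the joint log-density; MI is the mean log-ratio.
log_p_joint = log_std_normal(x) + log_std_normal(t2) + log_det_J
log_p_prod = log_std_normal(x) + log_std_normal(y)
mi_est = float(np.mean(log_p_joint - log_p_prod))
mi_true = -0.5 * np.log(1 - rho**2)            # closed-form Gaussian MI
```

The Monte Carlo estimate matches the closed-form MI, illustrating how a flow between the joint and the marginal product yields the MI directly from log-densities and the Jacobian.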
[496] The Path Not Taken: RLVR Provably Learns Off the Principals
Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai
Main category: cs.LG
TL;DR: RLVR appears sparse but actually learns off principal directions in weight space with minimal spectral drift, while SFT targets principal weights and distorts the spectrum.
Details
Motivation: To resolve the paradox that RLVR reliably improves reasoning performance while appearing to modify only a small fraction of parameters, and to provide a mechanistic understanding of RLVR's learning dynamics.
Method: Proposed Three-Gate Theory: Gate I (KL Anchor) imposes KL-constrained updates, Gate II (Model Geometry) steers steps into low-curvature subspaces, and Gate III (Precision) hides micro-updates. Validated theory through parameter-level analysis of RLVR dynamics.
Result: RLVR learns off principal directions with minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. SFT targets principal weights and distorts the spectrum. RL operates in distinct optimization regime from SFT.
Conclusion: RLVR has clear regularities in parameter evolution and operates in distinct regime from SFT. Direct adaptation of SFT-era PEFT methods is flawed. Need geometry-aware RLVR-native algorithms rather than repurposed SFT heuristics.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR’s learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags RLVR. Together, these results provide the first parameter-space account of RLVR’s training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics.
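The kind of parameter-space diagnostic behind these claims can be sketched by measuring how much of an update's energy falls inside the top singular subspace of a weight matrix: an SFT-like update concentrates there, an RLVR-like update does not. The synthetic updates below are constructed for illustration, not measured from real training:

```python
import numpy as np

def principal_fraction(W, dW, k=4):
    """Fraction of update energy in the span of W's top-k left singular
    vectors. Per the paper's characterization, RLVR-style updates leave
    most energy *off* this principal subspace, while SFT targets it."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    P = U[:, :k] @ U[:, :k].T                     # projector onto principal subspace
    return float(np.linalg.norm(P @ dW) ** 2 / np.linalg.norm(dW) ** 2)

rng = np.random.default_rng(7)
# A weight matrix with a decaying spectrum, as in trained networks.
W = rng.normal(size=(64, 64)) @ np.diag(np.linspace(10.0, 0.1, 64)) @ rng.normal(size=(64, 64))
U, S, Vt = np.linalg.svd(W)
dW_principal = U[:, :2] @ np.diag(S[:2] * 0.01) @ Vt[:2]   # SFT-like: along top directions
dW_off = U[:, -2:] @ np.diag([1.0, 1.0]) @ Vt[-2:]         # RLVR-like: off-principal
```

The diagnostic separates the two regimes cleanly: the principal-aligned update scores near 1, the off-principal one near 0.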
[497] Automatic Grid Updates for Kolmogorov-Arnold Networks using Layer Histograms
Jamison Moody, James Usevitch
Main category: cs.LG
TL;DR: AdaptKAN improves Kolmogorov-Arnold Networks by enabling autonomous domain grid updates using histogram-based algorithms, eliminating manual adjustments and enhancing performance across multiple tasks.
Details
Motivation: Original KANs require manual domain grid adjustments during training, creating user overhead. AdaptKAN aims to automate this process and improve performance while maintaining KAN benefits like interpretability.
Method: Uses histogram-based algorithms to autonomously update domain grids in a data-driven manner, informed by changing output ranges of previous layers. Also applies the same algorithm for OOD detection.
Result: AdaptKAN matches or exceeds performance of prior KAN architectures and MLPs on four tasks: learning scientific equations (Feynman dataset), image classification from frozen features, learning control Lyapunov functions, and OOD detection (OpenOOD v1.5 benchmark).
Conclusion: AdaptKAN successfully automates domain grid updates in KANs, eliminating manual overhead while maintaining or improving performance across diverse applications including symbolic equation learning and OOD detection.
Abstract: Kolmogorov-Arnold Networks (KANs) are a class of neural networks that have received increased attention in recent literature. In contrast to MLPs, KANs leverage parameterized, trainable activation functions and offer several benefits including improved interpretability and higher accuracy on learning symbolic equations. However, the original KAN architecture requires adjustments to the domain discretization of the network (called the “domain grid”) during training, creating extra overhead for the user in the training process. Typical KAN layers are not designed with the ability to autonomously update their domains in a data-driven manner informed by the changing output ranges of previous layers. To address this, we propose AdaptKAN, which updates its domain grids autonomously during training using a histogram-based algorithm informed by the output ranges of previous layers. As an added benefit, this histogram algorithm may also be applied towards detecting out-of-distribution (OOD) inputs in a variety of settings. We demonstrate that AdaptKAN exceeds or matches the performance of prior KAN architectures and MLPs on four different tasks: learning scientific equations from the Feynman dataset, image classification from frozen features, learning a control Lyapunov function, and detecting OOD inputs on the OpenOOD v1.5 benchmark.
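A data-driven grid update of this kind can be sketched with quantiles of the observed activations: clip the domain to the central mass of the previous layer's outputs, then discretize uniformly. This is an illustrative simplification of a histogram-based update, not the paper's exact algorithm:

```python
import numpy as np

def update_grid(activations, n_intervals=8, coverage=0.99):
    """Data-driven domain grid for a KAN layer: span the central `coverage`
    mass of the observed activations (a histogram/quantile summary of the
    previous layer's outputs), discretized uniformly. Activations falling
    outside the grid are natural out-of-distribution candidates."""
    lo = np.quantile(activations, (1 - coverage) / 2)
    hi = np.quantile(activations, 1 - (1 - coverage) / 2)
    return np.linspace(lo, hi, n_intervals + 1)

rng = np.random.default_rng(0)
acts = rng.normal(loc=3.0, scale=0.5, size=10_000)   # previous layer's outputs
grid = update_grid(acts)
```

The grid automatically tracks the shifted activation range (centered near 3 here) without any manual adjustment, covering ~99% of the observed values.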
[498] Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang, Junyi Tao, Thomas Icard, Diyi Yang, Christopher Potts
Main category: cs.LG
TL;DR: Leveraging causal mechanisms in neural networks enables accurate prediction of model behavior on out-of-distribution examples, outperforming causal-agnostic methods.
Details
Motivation: To determine if interpretability techniques can predict neural network behavior on out-of-distribution examples, addressing a crucial need for reliable model behavior prediction.
Method: Proposed two methods: counterfactual simulation (checking realization of key causal variables) and value probing (using causal variable values for predictions), tested on diverse language modeling tasks.
Result: Both methods achieved high AUC-ROC in-distribution and outperformed causal-agnostic methods in out-of-distribution settings where behavior prediction is most important.
Conclusion: Internal causal analysis of language models provides a novel and significant application for predicting model behavior, especially in challenging out-of-distribution scenarios.
Abstract: Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks–including symbol manipulation, knowledge retrieval, and instruction following–we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model’s behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.
[499] Multiplicative Reweighting for Robust Neural Network Optimization
Noga Bar, Tomer Koren, Raja Giryes
Main category: cs.LG
TL;DR: The paper proposes using multiplicative weights (MW) updates from learning with expert advice to reweight examples during neural network training, making it robust to label noise.
Details
Motivation: Neural networks degrade in performance when trained with noisy labels, and MW updates have shown robustness to data corruptions in expert advice settings.
Method: Apply multiplicative weights updates for reweighting examples during neural network optimization, theoretically establishing convergence with gradient descent.
Result: MW improves neural networks’ accuracy on CIFAR-10, CIFAR-100 and Clothing1M datasets in the presence of label noise, and also impacts adversarial robustness.
Conclusion: MW-based reweighting is an effective approach for making neural networks robust to label noise during training.
Abstract: Neural networks are widespread due to their powerful performance. Yet, they degrade in the presence of noisy labels at training time. Inspired by the setting of learning with expert advice, where multiplicative weights (MW) updates were recently shown to be robust to moderate data corruptions in expert advice, we propose to use MW for reweighting examples during neural networks optimization. We theoretically establish the convergence of our method when used with gradient descent and prove its advantages in 1d cases. We then validate empirically our findings for the general case by showing that MW improves neural networks’ accuracy in the presence of label noise on CIFAR-10, CIFAR-100 and Clothing1M. We also show the impact of our approach on adversarial robustness.
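The core reweighting step is a one-liner. A minimal sketch (exponential-weights form, toy loss values; not the paper's exact training loop or hyperparameters):

```python
import numpy as np

def mw_reweight(weights, losses, eta=0.1):
    """One multiplicative-weights update on per-example weights.

    Examples with high loss (e.g. likely mislabeled) are down-weighted
    exponentially; weights are renormalized to sum to 1.
    """
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

# Toy illustration: example 2 behaves like a noisy label (high loss).
w = np.full(4, 0.25)
losses = np.array([0.1, 0.2, 5.0, 0.15])
for _ in range(10):
    w = mw_reweight(w, losses)
```

After a few rounds the noisy example's weight collapses toward zero, so it contributes little to the gradient.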
[500] Hierarchical Deep Counterfactual Regret Minimization
Jiayu Chen, Zhekai Wang, Vaneet Aggarwal
Main category: cs.LG
TL;DR: First hierarchical version of Deep CFR (HDCFR) that enhances learning efficiency in large-scale imperfect information games by incorporating skill-based hierarchical strategy learning with theoretical guarantees.
Details
Motivation: To improve learning in complex Imperfect Information Games (IIGs) by integrating skill-based strategy learning with CFR, enabling more human-like decision-making and transferable skills while handling large state spaces and deep game trees.
Method: Develop hierarchical CFR updating rules with variance-reduced Monte Carlo sampling, then extend to neural network function approximators for large-scale tasks while maintaining theoretical convergence guarantees.
Result: HDCFR enables learning with predefined expertise, facilitates skill transfer between similar tasks, and provides theoretical justification including convergence rates and unbiased regret estimators.
Conclusion: HDCFR represents a significant advancement in hierarchical reinforcement learning for IIGs, combining theoretical rigor with practical scalability while enabling human expertise integration and skill transfer.
Abstract: Imperfect Information Games (IIGs) offer robust models for scenarios where decision-makers face uncertainty or lack complete information. Counterfactual Regret Minimization (CFR) has been one of the most successful family of algorithms for tackling IIGs. The integration of skill-based strategy learning with CFR could potentially mirror more human-like decision-making process and enhance the learning performance for complex IIGs. It enables the learning of a hierarchical strategy, wherein low-level components represent skills for solving subgames and the high-level component manages the transition between skills. In this paper, we introduce the first hierarchical version of Deep CFR (HDCFR), an innovative method that boosts learning efficiency in tasks involving extensively large state spaces and deep game trees. A notable advantage of HDCFR over previous works is its ability to facilitate learning with predefined (human) expertise and foster the acquisition of skills that can be transferred to similar tasks. To achieve this, we initially construct our algorithm on a tabular setting, encompassing hierarchical CFR updating rules and a variance-reduced Monte Carlo sampling extension. Notably, we offer the theoretical justifications, including the convergence rate of the proposed updating rule, the unbiasedness of the Monte Carlo regret estimator, and ideal criteria for effective variance reduction. Then, we employ neural networks as function approximators and develop deep learning objectives to adapt our proposed algorithms for large-scale tasks, while maintaining the theoretical support.
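For context, the regret-matching step at the heart of every CFR variant, hierarchical ones included, fits in a few lines. This is the textbook rule, not HDCFR's hierarchical update:

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Strategy proportional to positive cumulative regrets (core CFR step).

    Actions whose cumulative regret is non-positive get zero probability;
    if no action has positive regret, play uniformly at random.
    """
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

strategy = regret_matching(np.array([3.0, -1.0, 1.0]))
```

CFR iterates this rule at every information set and averages the strategies over time, which converges to a Nash equilibrium in two-player zero-sum games.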
[501] Pruning at Initialization – A Sketching Perspective
Noga Bar, Raja Giryes
Main category: cs.LG
TL;DR: The paper analyzes the lottery ticket hypothesis in linear networks, showing that finding sparse masks at initialization is equivalent to sketching problems, enabling theoretical analysis and algorithm improvements.
Details
Motivation: To understand the lottery ticket hypothesis in pruning neural networks at initialization by studying it in the linear setting and connecting it to sketching problems from matrix multiplication.
Method: Analyze the problem in linear networks, show equivalence between finding sparse masks at initialization and sketching problems, and use this perspective to bound approximation errors and improve existing algorithms.
Result: Theoretical justification that sparse network search may be data-independent, bounds on approximation error of pruned linear models, and a generic improvement to existing pruning algorithms.
Conclusion: The sketching perspective provides valuable tools for analyzing the lottery ticket hypothesis, reveals data-independent properties of sparse network search, and enables algorithmic improvements for pruning at initialization.
Abstract: The lottery ticket hypothesis (LTH) has drawn increased attention to pruning neural networks at initialization. We study this problem in the linear setting. We show that finding a sparse mask at initialization is equivalent to the sketching problem introduced for efficient matrix multiplication. This gives us tools to analyze the LTH problem and gain insights into it. Specifically, using the mask found at initialization, we bound the approximation error of the pruned linear model at the end of training. We theoretically justify previous empirical evidence that the search for sparse networks may be data independent. By using the sketching perspective, we suggest a generic improvement to existing algorithms for pruning at initialization, which we show to be beneficial in the data-independent case.
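The sketching connection can be illustrated with the classic column-sampling approximation to matrix multiplication. The helper below is a generic sketch of that idea, not the paper's algorithm: keeping only a subset of indices corresponds to applying a sparse mask in the linear-network setting.

```python
import numpy as np

def sketched_product(A, B, idx, probs):
    """Column-sampling sketch of A @ B (classic matrix-multiplication sketching).

    Keeps only the indexed columns of A and rows of B, rescaled by the
    sampling probabilities so the sketch is an unbiased estimate.
    """
    scale = 1.0 / np.sqrt(len(idx) * probs[idx])
    return (A[:, idx] * scale) @ (B[idx, :] * scale[:, None])

rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 40)), rng.normal(size=(40, 5))
probs = np.full(40, 1.0 / 40)  # uniform sampling probabilities
# Keeping every index recovers the exact product; fewer indices trade accuracy.
exact = sketched_product(A, B, np.arange(40), probs)
```

The fewer indices retained, the sparser the "mask" and the larger the approximation error, which is the trade-off the paper bounds.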
[502] Efficient Deep Learning with Decorrelated Backpropagation
Sander Dalm, Joshua Offergeld, Nasir Ahmad, Marcel van Gerven
Main category: cs.LG
TL;DR: This paper demonstrates that decorrelated backpropagation can achieve more than two-fold training speed-up and higher accuracy compared to standard backpropagation in deep residual networks.
Details
Motivation: Training deep neural networks at scale has high computational costs and carbon footprint, and while input decorrelation has shown potential to speed up learning, it hasn't translated to substantial improvements in large-scale DNNs due to challenges in enforcing fast and stable network-wide decorrelation.
Method: The authors developed a novel algorithm that induces network-wide input decorrelation with minimal computational overhead, combined with careful optimizations to implement decorrelated backpropagation for deep convolutional neural networks.
Result: The method achieved more than two-fold speed-up and higher test accuracy compared to standard backpropagation when training several deep residual networks.
Conclusion: Decorrelated backpropagation provides exciting prospects for efficient deep learning at scale, demonstrating that much more efficient training of deep convolutional neural networks is feasible.
Abstract: The backpropagation algorithm remains the dominant and most successful method for training deep neural networks (DNNs). At the same time, training DNNs at scale comes at a significant computational cost and therefore a high carbon footprint. Converging evidence suggests that input decorrelation may speed up deep learning. However, to date, this has not yet translated into substantial improvements in training efficiency in large-scale DNNs. This is mainly caused by the challenge of enforcing fast and stable network-wide decorrelation. Here, we show for the first time that much more efficient training of deep convolutional neural networks is feasible by embracing decorrelated backpropagation as a mechanism for learning. To achieve this goal we made use of a novel algorithm which induces network-wide input decorrelation using minimal computational overhead. By combining this algorithm with careful optimizations, we achieve a more than two-fold speed-up and higher test accuracy compared to backpropagation when training several deep residual networks. This demonstrates that decorrelation provides exciting prospects for efficient deep learning at scale.
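To make "input decorrelation" concrete, here is a one-shot ZCA-style whitening sketch. The paper's contribution is an iterative, low-overhead, network-wide version, so this only illustrates the target property (identity input covariance), not their algorithm:

```python
import numpy as np

def decorrelate(X, eps=1e-5):
    """ZCA-style whitening: rotate and rescale inputs to near-identity covariance.

    One-shot illustration of the decorrelation target; the paper instead
    maintains a decorrelating transform iteratively across all layers
    during training, with minimal overhead.
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W

rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.8, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X = rng.normal(size=(2000, 3)) @ mixing  # correlated inputs
cov_after = np.cov(decorrelate(X), rowvar=False)
```

Decorrelated inputs make the loss landscape better conditioned, which is why gradient descent converges in fewer steps.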
[503] ElastoGen: 4D Generative Elastodynamics
Yutao Feng, Yintong Shang, Xiang Feng, Lei Lan, Shandian Zhe, Tianjia Shao, Hongzhi Wu, Kun Zhou, Chenfanfu Jiang, Yin Yang
Main category: cs.LG
TL;DR: ElastoGen is a lightweight physics-based AI model that generates accurate 4D elastodynamics by converting nonlinear force equilibrium equations into iterative convolution-like operations, enabling efficient simulation of hyperelastic materials.
Details
Motivation: To overcome limitations of deep models that learn from visual observations alone, by leveraging established physics principles and mathematical procedures for more accurate and efficient elastodynamics simulation.
Method: Converts the differential equation of nonlinear force equilibrium into iterative local convolution-like operations that fit deep architectures, building network modules following this physics-aligned design philosophy.
Result: ElastoGen is much more lightweight than deep generative models in both training requirements and network scale, while efficiently generating accurate dynamics for various hyperelastic materials and enabling end-to-end 4D generation.
Conclusion: The physics-driven approach of ElastoGen provides accurate elastodynamics simulation with reduced computational requirements, making it suitable for integration with other deep learning modules for comprehensive 4D generation tasks.
Abstract: We present ElastoGen, a knowledge-driven AI model that generates physically accurate 4D elastodynamics. Unlike deep models that learn from video- or image-based observations, ElastoGen leverages the principles of physics and learns from established mathematical and optimization procedures. The core idea of ElastoGen is converting the differential equation, corresponding to the nonlinear force equilibrium, into a series of iterative local convolution-like operations, which naturally fit deep architectures. We carefully build our network module following this overarching design philosophy. ElastoGen is much more lightweight in terms of both training requirements and network scale than deep generative models. Because of its alignment with actual physical procedures, ElastoGen efficiently generates accurate dynamics for a wide range of hyperelastic materials and can be easily integrated with upstream and downstream deep modules to enable end-to-end 4D generation.
[504] Physics-informed deep learning and compressive collocation for high-dimensional diffusion-reaction equations: practical existence theory and numerics
Simone Brugiapaglia, Nick Dexter, Samir Karam, Weiqi Wang
Main category: cs.LG
TL;DR: This paper develops and analyzes an efficient deep learning-based solver for high-dimensional PDEs that competes with compressive spectral collocation methods, with theoretical guarantees on stability, accuracy, and sample complexity that scales favorably with dimension.
Details
Motivation: Deep learning shows promise for solving PDEs and mitigating the curse of dimensionality, but mathematical foundations for its numerical efficiency (stability, accuracy, sample complexity) are only recently emerging. The paper aims to establish rigorous mathematical underpinnings for DL-based PDE solvers.
Method: Leverages recent advancements in function approximation using sparsity-based techniques and random sampling to develop a deep learning-based PDE solver. Uses trainable DNNs with carefully bounded network architecture and sample complexity requirements.
Result: Demonstrates a practical existence theorem showing that properly constructed DNNs can stably and accurately approximate diffusion-reaction PDEs with high probability. The method competes with compressive spectral collocation methods both theoretically and numerically.
Conclusion: Deep learning-based PDE solvers can achieve logarithmic or linear scaling in dimension for sample complexity, providing an efficient alternative to traditional methods for high-dimensional problems while maintaining stability and accuracy guarantees.
Abstract: On the forefront of scientific computing, Deep Learning (DL), i.e., machine learning with Deep Neural Networks (DNNs), has emerged as a powerful new tool for solving Partial Differential Equations (PDEs). It has been observed that DNNs are particularly well suited to weakening the effect of the curse of dimensionality, a term coined by Richard E. Bellman in the late 50s to describe challenges such as the exponential dependence of the sample complexity, i.e., the number of samples required to solve an approximation problem, on the dimension of the ambient space. However, although DNNs have been used to solve PDEs since the 90s, the literature underpinning their mathematical efficiency in terms of numerical analysis (i.e., stability, accuracy, and sample complexity), is only recently beginning to emerge. In this paper, we leverage recent advancements in function approximation using sparsity-based techniques and random sampling to develop and analyze an efficient high-dimensional PDE solver based on DL. We show, both theoretically and numerically, that it can compete with a novel stable and accurate compressive spectral collocation method for the solution of high-dimensional, steady-state diffusion-reaction equations with periodic boundary conditions. In particular, we demonstrate a new practical existence theorem, which establishes the existence of a class of trainable DNNs with suitable bounds on the network architecture and a sufficient condition on the sample complexity, with logarithmic or, at worst, linear scaling in dimension, such that the resulting networks stably and accurately approximate a diffusion-reaction PDE with high probability.
[505] Informed Correctors for Discrete Diffusion Models
Yixiu Zhao, Jiaxin Shi, Feng Chen, Shaul Druckmann, Lester Mackey, Scott Linderman
Main category: cs.LG
TL;DR: Proposes informed correctors for discrete diffusion models to improve sampling efficiency and quality by countering approximation errors with diffusion model guidance.
Details
Motivation: Existing discrete diffusion sampling strategies struggle to balance computation and sample quality when reducing sampling steps, even with well-learned models.
Method: Predictor-corrector sampling scheme with diffusion model-informed correctors, hollow transformers architecture, and tailored training objective leveraging more training signals.
Result: Superior samples with fewer errors on text8 and improved FID scores on tokenized ImageNet 256x256 datasets compared to existing samplers.
Conclusion: Informed correctors enable fast and high-fidelity generation for discrete diffusion models, addressing key limitations of current sampling approaches.
Abstract: Discrete diffusion has emerged as a powerful framework for generative modeling in discrete domains, yet efficiently sampling from these models remains challenging. Existing sampling strategies often struggle to balance computation and sample quality when the number of sampling steps is reduced, even when the model has learned the data distribution well. To address these limitations, we propose a predictor-corrector sampling scheme where the corrector is informed by the diffusion model to more reliably counter the accumulating approximation errors. To further enhance the effectiveness of our informed corrector, we introduce complementary architectural modifications based on hollow transformers and a simple tailored training objective that leverages more training signal. We use a synthetic example to illustrate the failure modes of existing samplers and show how informed correctors alleviate these problems. On the text8 and tokenized ImageNet 256x256 datasets, our informed corrector consistently produces superior samples with fewer errors or improved FID scores for discrete diffusion models. These results underscore the potential of informed correctors for fast and high-fidelity generation using discrete diffusion. Our code is available at https://github.com/lindermanlab/informed-correctors.
[506] Certified Robust Invariant Polytope Training in Neural Controlled ODEs
Akash Harapanahalli, Samuel Coogan
Main category: cs.LG
TL;DR: Framework for training neural network controllers with certified robust forward invariant polytopes using lifted embedding systems and sign constraints.
Details
Motivation: To develop controllers that guarantee robust forward invariance: ensuring trajectories remain within safe regions despite disturbances, addressing limitations of existing Lyapunov-based approaches.
Method: Parameterize lifted control systems in higher dimensions, construct lifted embedding systems using interval analysis and neural network verifiers, and enforce sign constraints on vector fields to certify forward invariant polytopes.
Result: Scalable approach achieving certification for systems with over 50 states, outperforming state-of-the-art Lyapunov-based sampling methods in runtime.
Conclusion: The proposed framework successfully trains neural network controllers with certified robust forward invariant polytopes, demonstrating scalability and computational efficiency advantages over existing methods.
Abstract: We consider a nonlinear control system modeled as an ordinary differential equation subject to disturbance, with a state feedback controller parameterized as a feedforward neural network. We propose a framework for training controllers with certified robust forward invariant polytopes, where any trajectory initialized inside the polytope remains within the polytope, regardless of the disturbance. First, we parameterize a family of lifted control systems in a higher dimensional space, where the original neural controlled system evolves on an invariant subspace of each lifted system. We use interval analysis and neural network verifiers to further construct a family of lifted embedding systems, carefully capturing the knowledge of this invariant subspace. If the vector field of any lifted embedding system satisfies a sign constraint at a single point, then a certain convex polytope of the original system is robustly forward invariant. Treating the neural network controller and the lifted system parameters as variables, we propose an algorithm to train controllers with certified forward invariant polytopes in the closed-loop control system. Through two examples, we demonstrate how the simplicity of the sign constraint allows our approach to scale with system dimension to over $50$ states, and outperform state-of-the-art Lyapunov-based sampling approaches in runtime.
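The sign-constraint certificate can be illustrated with plain interval arithmetic on an affine vector field. This is a toy stand-in for the neural dynamics; the 2-D example and variable names are assumptions for illustration only:

```python
import numpy as np

def affine_interval_bounds(A, b, lo, hi):
    """Interval bounds of f(x) = A x + b over the box [lo, hi].

    Standard interval arithmetic: split A into positive and negative parts
    so each bound picks the box corner that extremizes the matching term.
    """
    A_pos, A_neg = np.maximum(A, 0.0), np.minimum(A, 0.0)
    f_lo = A_pos @ lo + A_neg @ hi + b
    f_hi = A_pos @ hi + A_neg @ lo + b
    return f_lo, f_hi

# Toy stable dynamics x' = A x on the box [-1, 1]^2. On the face x1 = 1,
# an upper bound f_hi[0] < 0 certifies the flow points inward there,
# i.e. trajectories cannot exit through that face.
A = np.array([[-1.0, 0.2], [0.1, -1.0]])
b = np.zeros(2)
face_lo, face_hi = np.array([1.0, -1.0]), np.array([1.0, 1.0])
f_lo, f_hi = affine_interval_bounds(A, b, face_lo, face_hi)
```

Checking one such sign condition per face of the polytope is what makes the certificate cheap enough to scale with dimension.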
[507] Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning
Jiayu Chen, Le Xu, Wentse Chen, Jeff Schneider
Main category: cs.LG
TL;DR: Offline model-based RL using Bayes Adaptive MDP framework with Monte Carlo planning, outperforming state-of-the-art methods on D4RL and tokamak control tasks.
Details
Motivation: Address model uncertainty in offline MBRL by treating it as a Bayes Adaptive MDP problem, enabling better handling of multiple MDPs consistent with offline data.
Method: Propose Bayes Adaptive Monte-Carlo planning algorithm based on Monte Carlo Tree Search for continuous state-action spaces, integrated as policy improvement operator in policy iteration.
Result: Significantly outperforms state-of-the-art offline RL methods on twelve D4RL MuJoCo tasks and three target tracking tasks in stochastic tokamak control simulator.
Conclusion: The RL + Search framework successfully addresses model uncertainty in offline MBRL, demonstrating superior performance across diverse benchmark tasks.
Abstract: Offline RL is a powerful approach for data-driven decision-making and control. Compared to model-free methods, offline model-based RL (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving the data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, there could be various MDPs that behave identically on the offline dataset and dealing with the uncertainty about the true MDP can be challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), which is a principled framework for addressing model uncertainty. We further propose a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search and can be integrated into offline MBRL as a policy improvement operator in policy iteration. Our "RL + Search" framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by incorporating more computation input. The proposed algorithm significantly outperforms state-of-the-art offline RL methods on twelve D4RL MuJoCo tasks and three target tracking tasks in a challenging, stochastic tokamak control simulator. The codebase is available at: https://github.com/LucasCJYSDL/Offline-RL-Kit.
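For context, the selection rule that drives tree traversal in Monte Carlo Tree Search is UCB1. A minimal sketch of the generic rule, not the paper's Bayes-adaptive variant:

```python
import math

def ucb_select(values, visits, c=1.4):
    """UCB1 action selection, the rule driving tree traversal in MCTS.

    Balances exploitation (mean return q/n) against an exploration bonus
    that shrinks with visit count; unvisited actions are always tried first.
    """
    total = sum(visits)
    best, best_score = None, -math.inf
    for a, (q, n) in enumerate(zip(values, visits)):
        score = math.inf if n == 0 else q / n + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = a, score
    return best
```

In a Bayes-adaptive setting the tree is built over augmented states that include the posterior over models, but the same select-expand-simulate-backup loop applies.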
[508] Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond
Costin-Andrei Oncescu, Sanket Purandare, Stratos Idreos, Sham Kakade
Main category: cs.LG
TL;DR: Proposes a method to speed up exact inference in long convolution sequence models from quadratic to quasilinear time, achieving up to 7.8× end-to-end improvement.
Details
Motivation: Transformers have quadratic computational cost in sequence length, and while some subquadratic architectures exist, many like Hyena remain quadratic during inference despite being efficient at training.
Method: Uses a tiling approach inspired by relaxed polynomial interpolation to decrease memory movement and share computation, enabling almost complete parallelization across layers of the position-mixing part.
Result: Achieves up to 7.8× end-to-end inference speedup over standard methods, with 110× improvement specifically within the position-mixing component.
Conclusion: The proposed framework successfully reduces inference complexity of long convolution sequence models to quasilinear time while maintaining exact computation.
Abstract: While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs’ exact inference to quasilinear $O(L\log^2L)$ time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the architecture. Empirically, we provide a proof of concept implementation for Hyena, which gets up to $7.8\times$ end-to-end improvement over standard inference by improving $110\times$ within the position-mixing part.
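Why long convolutions admit subquadratic evaluation is easiest to see in the full-sequence (training-time) setting: a zero-padded FFT computes the same causal convolution in O(L log L). The paper's contribution is the harder autoregressive-inference case, so the sketch below is background, not their tiling algorithm:

```python
import numpy as np

def conv_naive(u, k):
    """Causal convolution y[t] = sum_{s<=t} k[s] u[t-s], O(L^2) time."""
    L = len(u)
    return np.array([sum(k[s] * u[t - s] for s in range(t + 1)) for t in range(L)])

def conv_fft(u, k):
    """The same causal convolution via zero-padded FFT, O(L log L) time."""
    L = len(u)
    n = 2 * L  # zero-pad so circular convolution matches linear convolution
    return np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(k, n), n)[:L]

rng = np.random.default_rng(0)
u, k = rng.normal(size=256), rng.normal(size=256)
```

At inference, tokens arrive one at a time, so the full-sequence FFT cannot be reused directly; the tiling scheme recovers quasilinear cost by sharing partial convolutions across steps.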
[509] SCoTT: Strategic Chain-of-Thought Tasking for Wireless-Aware Robot Navigation in Digital Twins
Aladin Djuhera, Amin Seffo, Vlad C. Andrei, Holger Boche, Walid Saad
Main category: cs.LG
TL;DR: SCoTT is a wireless-aware path planning framework that uses vision-language models to co-optimize wireless performance and trajectory length, achieving near-optimal performance with significantly reduced computational cost compared to traditional methods.
Details
Motivation: Path planning under wireless performance constraints is challenging, and naive integration of such constraints into classical planning algorithms leads to prohibitive search costs.
Method: Proposes SCoTT framework using vision-language models with Strategic Chain-of-Thought Tasking, which decomposes the exhaustive search problem into structured subtasks solved via chain-of-thought prompting using wireless heatmap images and ray-tracing data from digital twins.
Result: SCoTT achieves path gains within 2% of optimal DP-WA* while generating shorter trajectories, and can accelerate DP-WA* by reducing search space (saving up to 62% execution time). Works effectively across different VLMs and validated in ROS/Gazebo simulations.
Conclusion: SCoTT demonstrates the potential of natural language interfaces for wireless-aware navigation in real-world applications, with practical viability for 6G-enabled digital twins and low inference cost deployment.
Abstract: Path planning under wireless performance constraints is a complex challenge in robot navigation. However, naively incorporating such constraints into classical planning algorithms often incurs prohibitive search costs. In this paper, we propose SCoTT, a wireless-aware path planning framework that leverages vision-language models (VLMs) to co-optimize average path gains and trajectory length using wireless heatmap images and ray-tracing data from a digital twin (DT). At the core of our framework is Strategic Chain-of-Thought Tasking (SCoTT), a novel prompting paradigm that decomposes the exhaustive search problem into structured subtasks, each solved via chain-of-thought prompting. To establish strong baselines, we compare classical A* and wireless-aware extensions of it, and derive DP-WA*, an optimal, iterative dynamic programming algorithm that incorporates all path gains and distance metrics from the DT, but at significant computational cost. In extensive experiments, we show that SCoTT achieves path gains within 2% of DP-WA* while consistently generating shorter trajectories. Moreover, SCoTT’s intermediate outputs can be used to accelerate DP-WA* by reducing its search space, saving up to 62% in execution time. We validate our framework using four VLMs, demonstrating effectiveness across both large and small models, thus making it applicable to a wide range of compact models at low inference cost. We also show the practical viability of our approach by deploying SCoTT as a ROS node within Gazebo simulations. Finally, we discuss data acquisition pipelines, compute requirements, and deployment considerations for VLMs in 6G-enabled DTs, underscoring the potential of natural language interfaces for wireless-aware navigation in real-world applications.
[510] SPO-VCS: An End-to-End Smart Predict-then-Optimize Framework with Alternating Differentiation Method for Relocation Problems in Large-Scale Vehicle Crowd Sensing
Xinyu Wang, Yiyang Peng, Wei Ma
Main category: cs.LG
TL;DR: The paper proposes an end-to-end Smart Predict-then-Optimize (SPO) framework for vehicle relocation in crowd sensing systems, integrating optimization into prediction to minimize task-specific matching divergence rather than prediction error.
Details
Motivation: Vehicle sensing systems have biased coverage due to heterogeneous trip patterns, and conventional two-stage predict-then-optimize approaches suffer from error propagation leading to suboptimal decisions.
Method: Develops an SPO framework with ADMM-based unrolling to compute gradients of quadratic programming layers, enabling end-to-end learning by minimizing task-specific matching divergence.
Result: Validated on real-world Hong Kong taxi datasets, the framework shows effectiveness in vehicle relocation for improved sensing coverage.
Conclusion: The SPO framework presents a novel approach for decision-making under uncertainty with significant potential for intelligent transportation systems.
Abstract: Ubiquitous mobile devices have catalyzed the development of vehicle crowd sensing (VCS). In particular, vehicle sensing systems show great potential in the flexible acquisition of spatio-temporal urban data through built-in sensors under diverse sensing scenarios. However, vehicle systems often exhibit biased coverage due to the heterogeneous nature of trip requests and routes. To achieve a high sensing coverage, a critical challenge lies in optimally relocating vehicles to minimize the divergence between vehicle distributions and target sensing distributions. Conventional approaches typically employ a two-stage predict-then-optimize (PTO) process: first predicting real-time vehicle distributions and subsequently generating an optimal relocation strategy based on the predictions. However, this approach can lead to suboptimal decision-making due to the propagation of errors from upstream prediction. To this end, we develop an end-to-end Smart Predict-then-Optimize (SPO) framework by integrating optimization into prediction within the deep learning architecture, and the entire framework is trained by minimizing the task-specific matching divergence rather than the upstream prediction error. Methodologically, we formulate the vehicle relocation problem by quadratic programming (QP) and incorporate a novel unrolling approach based on the Alternating Direction Method of Multipliers (ADMM) within the SPO framework to compute gradients of the QP layer, facilitating backpropagation and gradient-based optimization for end-to-end learning. The effectiveness of the proposed framework is validated by real-world taxi datasets in Hong Kong. Utilizing the alternating differentiation method, the general SPO framework presents a novel concept of addressing decision-making problems with uncertainty, demonstrating significant potential for advancing applications in intelligent transportation systems.
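The ADMM unrolling idea can be sketched on a toy nonnegativity-constrained QP. A fixed number of iterations like this, written in an autodiff framework, is what yields gradients through the QP layer; the sketch uses plain numpy (so no autodiff) and a made-up problem instance, not the paper's relocation QP:

```python
import numpy as np

def admm_qp(Q, c, rho=1.0, iters=200):
    """Solve min 0.5 x'Qx + c'x subject to x >= 0 by ADMM.

    Unrolling a fixed, finite number of these iterations inside a network
    makes the solution differentiable w.r.t. (Q, c) by backpropagation.
    """
    n = len(c)
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    M = np.linalg.inv(Q + rho * np.eye(n))  # factor once, reuse each iteration
    for _ in range(iters):
        x = M @ (rho * (z - u) - c)  # x-update: unconstrained quadratic solve
        z = np.maximum(x + u, 0.0)   # z-update: projection onto x >= 0
        u = u + x - z                # scaled dual update
    return z

# Separable toy instance with known optimum x* = (1, 0).
Q = np.diag([2.0, 2.0])
c = np.array([-2.0, 4.0])
x_opt = admm_qp(Q, c)
```

Because every step is a matrix product, a max, or an addition, the unrolled solver composes cleanly with upstream prediction layers for end-to-end training.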
[511] Generalizing Weisfeiler-Lehman Kernels to Subgraphs
Dongkwan Kim, Alice Oh
Main category: cs.LG
TL;DR: WLKS is a Weisfeiler-Lehman kernel generalized for subgraphs that captures complex interactions within and between subgraphs by applying WL algorithm on k-hop neighborhoods, achieving better performance and efficiency than existing methods.
Details
Motivation: Current GNNs produce suboptimal results for subgraph-level tasks due to inability to capture complex interactions within and between subgraphs.
Method: Propose WLKS - Weisfeiler-Lehman kernel generalized for subgraphs by applying WL algorithm on induced k-hop neighborhoods and combining kernels across different k-hop levels.
Result: Significantly outperforms leading approaches on five out of eight datasets while requiring only 0.01x to 0.25x the training time of the state-of-the-art.
Conclusion: WLKS provides a more expressive and efficient alternative for subgraph representation learning by balancing expressiveness and efficiency without neighborhood sampling.
Abstract: Subgraph representation learning has been effective in solving various real-world problems. However, current graph neural networks (GNNs) produce suboptimal results for subgraph-level tasks due to their inability to capture complex interactions within and between subgraphs. To provide a more expressive and efficient alternative, we propose WLKS, a Weisfeiler-Lehman (WL) kernel generalized for subgraphs by applying the WL algorithm on induced $k$-hop neighborhoods. We combine kernels across different $k$-hop levels to capture richer structural information that is not fully encoded in existing models. Our approach can balance expressiveness and efficiency by eliminating the need for neighborhood sampling. In experiments on eight real-world and synthetic benchmarks, WLKS significantly outperforms leading approaches on five datasets while requiring only 0.01x to 0.25x the training time of the state-of-the-art.
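A minimal sketch of the kernel computation, assuming uniform initial node labels and a plain dot product of WL color histograms (the paper may normalize or weight the per-$k$ kernels differently):

```python
from collections import Counter

def k_hop_nodes(adj, seed, k):
    """Nodes within k hops of any node in `seed` (BFS on an adjacency dict)."""
    frontier, seen = set(seed), set(seed)
    for _ in range(k):
        frontier = {w for v in frontier for w in adj.get(v, ())} - seen
        seen |= frontier
    return seen

def wl_histogram(adj, nodes, iters=2):
    """Weisfeiler-Lehman color histogram of the subgraph induced on `nodes`."""
    colors = {v: 0 for v in nodes}  # uniform initial labels
    for _ in range(iters):
        colors = {v: hash((colors[v],
                           tuple(sorted(colors[w] for w in adj.get(v, ()) if w in nodes))))
                  for v in nodes}
    return Counter(colors.values())

def wlks(adj, sub_a, sub_b, ks=(0, 1, 2)):
    """Sum over k of dot products between WL histograms of the
    induced k-hop neighborhoods of the two subgraphs."""
    total = 0
    for k in ks:
        ha = wl_histogram(adj, k_hop_nodes(adj, sub_a, k))
        hb = wl_histogram(adj, k_hop_nodes(adj, sub_b, k))
        total += sum(ha[c] * hb[c] for c in ha.keys() & hb.keys())
    return total

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}  # a 5-node path
```

On this symmetric path, the mirrored subgraphs {0, 1} and {3, 4} induce identical k-hop neighborhoods, so their cross-kernel equals the self-kernel.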
[512] Cluster Catch Digraphs with the Nearest Neighbor Distance
Rui Shi, Elvan Ceyhan, Nedret Billor
Main category: cs.LG
TL;DR: A new clustering method using Cluster Catch Digraphs with nearest neighbor distance spatial randomness test outperforms existing CCD variants, especially for high-dimensional data.
Details
Motivation: To address limitations of RK-CCDs by replacing Ripley's K function with a more effective nearest neighbor distance test for better clustering performance.
Method: Uses Cluster Catch Digraphs with a spatial randomness test based on nearest neighbor distance instead of Ripley’s K function, evaluated through Monte Carlo analysis.
Result: Method performs comparably or better than KS-CCDs and RK-CCDs, particularly effective for high-dimensional data, and shows competitive performance on real datasets.
Conclusion: The new CCD variant with nearest neighbor distance test is a robust clustering method, especially suitable for high-dimensional data analysis.
Abstract: We introduce a new method for clustering based on Cluster Catch Digraphs (CCDs). The new method addresses the limitations of RK-CCDs by employing a new variant of the spatial randomness test that uses the nearest neighbor distance (NND) instead of Ripley’s K function used by RK-CCDs. We conduct a comprehensive Monte Carlo analysis to assess the performance of our method, considering factors such as dimensionality, data set size, number of clusters, cluster volumes, and inter-cluster distance. Our method is particularly effective for high-dimensional data sets, comparable to or outperforming KS-CCDs and RK-CCDs, which rely on a KS-type statistic or Ripley’s K function. We also evaluate our method using real and complex data sets, comparing it to well-known clustering methods. Again, our method exhibits competitive performance, producing high-quality clusters with desirable properties.
Keywords: Graph-based clustering, Cluster catch digraphs, High-dimensional data, The nearest neighbor distance, Spatial randomness test
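The NND-based spatial randomness idea can be sketched with the classical Clark-Evans ratio, which compares the observed mean nearest-neighbor distance to its expectation under complete spatial randomness (CSR). This is a simplified 2-D stand-in for the paper's test, with no edge correction and illustrative point sets:

```python
import numpy as np

def mean_nnd(pts):
    """Mean nearest-neighbor distance of a point set (brute force)."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

def clark_evans(pts, area=1.0):
    """Observed mean NND / expected mean NND under CSR in 2-D:
    ~1 for random points, <1 for clustered, >1 for regular."""
    expected = 0.5 / np.sqrt(len(pts) / area)  # E[NND] for a Poisson process
    return mean_nnd(pts) / expected

rng = np.random.default_rng(0)
uniform = rng.uniform(size=(400, 2))              # CSR-like
clustered = rng.normal(0.5, 0.02, size=(400, 2))  # one tight cluster
```

A ratio well below 1 flags clustering, which is the signal a CCD variant can use to grow cluster covers.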
[513] A Survey on Human-Centered Evaluation of Explainable AI Methods in Clinical Decision Support Systems
Alessandro Gambetti, Qiwei Han, Hong Shen, Claudia Soares
Main category: cs.LG
TL;DR: Systematic survey of 31 human-centered XAI evaluations in clinical decision support reveals most use post-hoc methods like SHAP/Grad-CAM with small clinician studies, showing improved trust but increased cognitive load and reasoning misalignment.
Details
Motivation: XAI is essential for CDSS transparency and clinical adoption, but real-world effectiveness is limited and inconsistently evaluated, requiring systematic assessment of human-centered evaluations.
Method: PRISMA-guided systematic survey of 31 human-centered XAI evaluations in CDSS, classified by XAI methodology, evaluation design, and adoption barriers.
Result: Over 80% use post-hoc model-agnostic approaches (SHAP/Grad-CAM) with clinician sample sizes below 25; explanations improve trust and diagnostic confidence but increase cognitive load and misalign with clinical reasoning.
Conclusion: Proposed stakeholder-centric evaluation framework integrating socio-technical principles and HCI to guide development of clinically viable and trustworthy XAI-based CDSS.
Abstract: Explainable Artificial Intelligence (XAI) is essential for the transparency and clinical adoption of Clinical Decision Support Systems (CDSS). However, the real-world effectiveness of existing XAI methods remains limited and is inconsistently evaluated. This study conducts a systematic PRISMA-guided survey of 31 human-centered evaluations (HCE) of XAI applied to CDSS, classifying them by XAI methodology, evaluation design, and adoption barrier. The results show that over 80% of the studies adopt post-hoc, model-agnostic approaches such as SHAP and Grad-CAM, typically assessed through small-scale clinician studies with sample sizes below 25 participants. The findings indicate that explanations generally improve clinician trust and diagnostic confidence, but frequently increase cognitive load and exhibit misalignment with domain reasoning processes. To bridge these gaps, we propose a stakeholder-centric evaluation framework that integrates socio-technical principles and human-computer interaction to guide the future development of clinically viable and trustworthy XAI-based CDSS.
[514] Towards Synthesizing High-Dimensional Tabular Data with Limited Samples
Zuqing Li, Junhao Gan, Jianzhong Qi
Main category: cs.LG
TL;DR: CtrTab is a condition-controlled diffusion model that addresses performance degradation in high-dimensional tabular data synthesis by injecting perturbed ground-truth samples as auxiliary inputs during training.
Details
Motivation: Existing diffusion-based tabular data synthesis models degenerate in high-dimensional settings due to limited training samples hindering accurate distribution capture, performing worse than simpler non-diffusion models.
Method: Proposes CtrTab with condition-controlled diffusion that injects perturbed ground-truth samples as auxiliary inputs during training, introducing implicit L2 regularization on model sensitivity to control signals.
Result: CtrTab outperforms state-of-the-art models across multiple datasets with performance gap in accuracy over 90% on average, improving robustness and stability in high-dimensional, low-data scenarios.
Conclusion: The condition-controlled diffusion approach with perturbed ground-truth samples effectively mitigates insufficient learning signals and stabilizes training for high-dimensional tabular data synthesis.
Abstract: Diffusion-based tabular data synthesis models have yielded promising results. However, when the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in high-dimensional space often hinder generative models from capturing the distribution accurately. To mitigate the insufficient learning signals and to stabilize training under such conditions, we propose CtrTab, a condition-controlled diffusion model that injects perturbed ground-truth samples as auxiliary inputs during training. This design introduces an implicit L2 regularization on the model’s sensitivity to the control signal, improving robustness and stability in high-dimensional, low-data scenarios. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with a performance gap in accuracy over 90% on average.
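The conditioning mechanism can be sketched with a toy eps-prediction diffusion step in which the control signal is the ground-truth row plus Gaussian noise. A linear denoiser stands in for the paper's network, and the schedule and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma_c, lr = 8, 0.1, 1e-2
W = np.zeros((d, 2 * d + 1))  # toy linear denoiser: eps_hat = W @ [x_t, c, t]

def ctrtab_step(x0):
    """One eps-prediction training step where the control signal c is the
    clean row x0 plus a Gaussian perturbation, fed as an auxiliary input."""
    global W
    t = rng.uniform()                          # diffusion time in [0, 1]
    a = np.cos(0.5 * np.pi * t)                # simple cosine schedule
    eps = rng.standard_normal(d)
    x_t = a * x0 + np.sqrt(1.0 - a * a) * eps  # noised sample
    c = x0 + sigma_c * rng.standard_normal(d)  # perturbed ground truth
    feat = np.concatenate([x_t, c, [t]])
    err = W @ feat - eps                       # eps-prediction residual
    W -= lr * np.outer(err, feat)              # SGD on the squared loss
    return float(err @ err)

losses = [ctrtab_step(rng.standard_normal(d)) for _ in range(2000)]
```

Because the condition is a noisy copy of the target row, it supplies a strong learning signal when clean training data are scarce, while the perturbation keeps the model from simply copying it.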
[515] COPA: Comparing the incomparable in multi-objective model evaluation
Adrián Javaloy, Antonio Vergari, Isabel Valera
Main category: cs.LG
TL;DR: COPA enables automatic normalization and aggregation of incomparable ML objectives using cumulative functions and relative rankings, helping users navigate Pareto fronts for model selection.
Details
Motivation: Comparing and trading off multiple ML objectives (accuracy, robustness, fairness, scalability) is challenging due to different units and scales, requiring expert knowledge and being time-consuming.
Method: Use cumulative functions approximated by relative rankings to make incomparable objectives comparable, then aggregate them while matching user-specific preferences.
Result: COPA successfully enables meaningful navigation and search for models in Pareto fronts across diverse ML areas including fair ML, domain generalization, AutoML and foundation models.
Conclusion: COPA provides a systematic approach to normalize and aggregate objectives where classical methods fail, helping practitioners efficiently select models from large sets.
Abstract: In machine learning (ML), we often need to choose one among hundreds of trained ML models at hand, based on various objectives such as accuracy, robustness, fairness or scalability. However, it is often unclear how to compare, aggregate and, ultimately, trade off these objectives, making it a time-consuming task that requires expert knowledge, as objectives may be measured in different units and scales. In this work, we investigate how objectives can be automatically normalized and aggregated to systematically help the user navigate their Pareto front. To this end, we make incomparable objectives comparable using their cumulative functions, approximated by their relative rankings. As a result, our proposed approach, COPA, can aggregate them while matching user-specific preferences, allowing practitioners to meaningfully navigate and search for models in the Pareto front. We demonstrate the potential impact of COPA in both model selection and benchmarking tasks across diverse ML areas such as fair ML, domain generalization, AutoML and foundation models, where classical ways to normalize and aggregate objectives fall short.
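The rank-based normalization is straightforward to sketch. The aggregation below is a plain weighted sum of empirical-CDF scores with hypothetical objective values; the paper's exact aggregation may differ:

```python
import numpy as np

def rank_cdf(values, higher_is_better=True):
    """Empirical-CDF score of each model on one objective via relative
    ranks, mapping arbitrary units and scales onto (0, 1]."""
    v = np.asarray(values, dtype=float)
    if not higher_is_better:
        v = -v
    ranks = v.argsort().argsort()        # 0 = worst, n-1 = best
    return (ranks + 1) / len(v)

def copa_score(objectives, weights):
    """Weighted sum of rank-normalized objectives.
    objectives: list of (values, higher_is_better); weights: user preferences."""
    cols = [rank_cdf(v, hib) for v, hib in objectives]
    w = np.asarray(weights, dtype=float)
    return np.stack(cols, axis=1) @ (w / w.sum())

acc = [0.91, 0.85, 0.88]     # higher is better
lat = [120.0, 30.0, 45.0]    # ms, lower is better
s = copa_score([(acc, True), (lat, False)], weights=[0.7, 0.3])
```

With these (hypothetical) preferences, model 0's top accuracy rank outweighs its poor latency rank, even though the raw numbers live in incomparable units.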
[516] CATransformers: Carbon Aware Transformers Through Joint Model-Hardware Optimization
Irene Wang, Newsha Ardalani, Mostafa Elhoushi, Daniel Jiang, Samuel Hsia, Ekin Sumbul, Divya Mahajan, Carole-Jean Wu, Bilge Acun
Main category: cs.LG
TL;DR: CATransformers is a carbon-aware co-optimization framework that reduces total carbon emissions by up to 30% for Transformer models while maintaining accuracy and latency.
Details
Motivation: The growing adoption of machine learning solutions increases lifecycle carbon footprint, including operational carbon from training/inference and embodied carbon from hardware manufacturing.
Method: Introduces a carbon-aware co-optimization framework that integrates both operational and embodied carbon into early-stage design space exploration for Transformer models and hardware accelerators.
Result: The framework consistently reduces total carbon emissions by up to 30% across various Transformer models while maintaining accuracy and latency performance.
Conclusion: Holistic optimization methods are needed to prioritize carbon efficiency without compromising model capability and execution time performance.
Abstract: Machine learning solutions are rapidly adopted to enable a variety of key use cases, from conversational AI assistants to scientific discovery. This growing adoption is expected to increase the associated lifecycle carbon footprint, including both operational carbon from training and inference and embodied carbon from AI hardware manufacturing. We introduce CATransformers, the first carbon-aware co-optimization framework for Transformer-based models and hardware accelerators. By integrating both operational and embodied carbon into early-stage design space exploration, CATransformers enables sustainability-driven model architecture and hardware accelerator co-design that reveals fundamentally different trade-offs than latency- or energy-centric approaches. Evaluated across a range of Transformer models, CATransformers consistently demonstrates the potential to reduce total carbon emissions, by up to 30%, while maintaining accuracy and latency. We further highlight its extensibility through a focused case study on multi-modal models. Our results emphasize the need for holistic optimization methods that prioritize carbon efficiency without compromising model capability and execution time performance. The source code of CATransformers is available at https://github.com/facebookresearch/CATransformers.
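The carbon-aware objective can be illustrated with a toy design-space search over hypothetical design points; all numbers below are invented for illustration, not measurements from the paper:

```python
# Hypothetical design points: (name, accuracy, latency_ms,
#   operational_kgCO2 per 1M queries, embodied_kgCO2 amortized)
designs = [
    ("small-edge",  0.78, 12.0, 10.0,  5.0),
    ("base-edge",   0.84, 20.0, 18.0,  7.0),
    ("base-cloud",  0.85,  9.0, 30.0, 22.0),
    ("large-cloud", 0.88, 11.0, 55.0, 30.0),
]

def best_design(min_acc, max_latency_ms):
    """Among designs meeting accuracy and latency constraints, pick the one
    with the lowest *total* carbon (operational + embodied), rather than
    optimizing latency or energy alone."""
    feasible = [d for d in designs if d[1] >= min_acc and d[2] <= max_latency_ms]
    return min(feasible, key=lambda d: d[3] + d[4], default=None)
```

Tightening the latency budget flips the choice from an edge design to a cloud design with higher total carbon, the kind of trade-off that latency-only search would never surface.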
[517] Towards High Resolution Probabilistic Coastal Inundation Forecasting from Sparse Observations
Kazi Ashik Islam, Zakaria Mehrab, Mahantesh Halappanavar, Henning Mortveit, Sridhar Katragadda, Jon Derek Loftis, Stefan Hoops, Madhav Marathe
Main category: cs.LG
TL;DR: DIFF-SPARSE is a masked conditional diffusion model for probabilistic coastal inundation forecasting from sparse sensor observations, achieving up to 62% improvement over existing methods at 95% sparsity levels.
Details
Motivation: Coastal flooding threats require accurate hyper-local inundation forecasting, but real-world deployment is constrained by sparse sensor networks due to budget limitations.
Method: Uses masked conditional diffusion model with novel masking strategy during training, incorporating inundation history, digital elevation data, temporal co-variates, and CNN with conditional UNet architecture with cross-attention.
Result: Achieves up to 62% improvement in forecasting metrics compared to existing methods at 95% sparsity level, with digital elevation data proving more useful than temporal co-variates at high sparsity.
Conclusion: DIFF-SPARSE effectively addresses the challenge of spatiotemporal prediction from sparse observations and demonstrates superior performance in coastal inundation forecasting under extreme sparsity conditions.
Abstract: Coastal flooding poses increasing threats to communities worldwide, necessitating accurate and hyper-local inundation forecasting for effective emergency response. However, real-world deployment of forecasting systems is often constrained by sparse sensor networks, where only a limited subset of locations may have sensors due to budget constraints. To address this challenge, we present DIFF-SPARSE, a masked conditional diffusion model designed for probabilistic coastal inundation forecasting from sparse sensor observations. DIFF-SPARSE primarily utilizes the inundation history of a location and its neighboring locations from a context time window as spatiotemporal context. The fundamental challenge of spatiotemporal prediction based on sparse observations in the context window is addressed by introducing a novel masking strategy during training. Digital elevation data and temporal co-variates are utilized as additional spatial and temporal contexts, respectively. A convolutional neural network and a conditional UNet architecture with a cross-attention mechanism are employed to capture the spatiotemporal dynamics in the data. We trained and tested DIFF-SPARSE on coastal inundation data from the Eastern Shore of Virginia and systematically assessed its performance across different sparsity levels (0%, 50%, and 95% missing observations). Our experimental results show that DIFF-SPARSE achieves up to 62% improvement in terms of two forecasting performance metrics compared to existing methods, at the 95% sparsity level. Moreover, our ablation studies reveal that digital elevation data becomes more useful at high sparsity levels compared to temporal co-variates.
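The sensor-sparsity setup can be sketched as a per-location mask applied across the context window. The masking below is only a guess at the general shape of the strategy (the paper's actual scheme may differ), with illustrative grid sizes:

```python
import numpy as np

def mask_context(history, sparsity, rng):
    """Simulate sparse sensing: hide a random subset of grid locations for
    the whole context window, returning the masked history together with
    the binary mask so the model knows where observations are missing.
    history: (T, H, W) inundation depths over a context window."""
    T, H, W = history.shape
    observed = rng.uniform(size=(H, W)) >= sparsity   # per-location sensors
    mask = np.broadcast_to(observed, (T, H, W)).astype(history.dtype)
    return history * mask, mask

rng = np.random.default_rng(0)
history = rng.uniform(size=(8, 16, 16))
masked, mask = mask_context(history, sparsity=0.95, rng=rng)
```

Masking whole locations (rather than random cells) mimics a sensor that is absent for the entire window, which is the deployment constraint the paper targets.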
[518] Accelerating Visual-Policy Learning through Parallel Differentiable Simulation
Haoxiang You, Yilang Liu, Ian Abraham
Main category: cs.LG
TL;DR: A computationally efficient visual policy learning algorithm using differentiable simulation and first-order policy gradients, with decoupled rendering to reduce overhead and improve optimization stability.
Details
Motivation: To enable seamless integration with existing differentiable simulation ecosystems without requiring specialized differentiable rendering software, while reducing computational and memory overhead.
Method: Proposes decoupling the rendering process from the computation graph, leveraging differentiable simulation and first-order analytical policy gradients for visual policy learning.
Result: Significantly reduces wall-clock training time and consistently outperforms baseline methods, achieving 4x improvement in final return on complex tasks like humanoid locomotion, and learning humanoid running policy within 4 hours on a single GPU.
Conclusion: The decoupling approach effectively attenuates policy gradient norm, leading to more stable and smoother optimization while maintaining computational efficiency in visual policy learning.
Abstract: In this work, we propose a computationally efficient algorithm for visual policy learning that leverages differentiable simulation and first-order analytical policy gradients. Our approach decouples the rendering process from the computation graph, enabling seamless integration with existing differentiable simulation ecosystems without the need for specialized differentiable rendering software. This decoupling not only reduces computational and memory overhead but also effectively attenuates the policy gradient norm, leading to more stable and smoother optimization. We evaluate our method on standard visual control benchmarks using modern GPU-accelerated simulation. Experiments show that our approach significantly reduces wall-clock training time and consistently outperforms all baseline methods in terms of final returns. Notably, on complex tasks such as humanoid locomotion, our method achieves a $4\times$ improvement in final return, and successfully learns a humanoid running policy within 4 hours on a single GPU.
[519] Tool-Aided Evolutionary LLM for Generative Policy Toward Efficient Resource Management in Wireless Federated Learning
Chongyang Tan, Ruoqi Wen, Rongpeng Li, Zhifeng Zhao, Ekram Hossain, Honggang Zhang
Main category: cs.LG
TL;DR: Proposes T-ELLM framework using evolutionary LLMs for device selection in federated learning, combining language-based prompts with mathematical decoupling and virtual environment optimization.
Details
Motivation: FL efficiency depends on device selection and resource allocation in dynamic wireless environments, but conventional methods require domain expertise, hyperparameter tuning, and high interaction costs.
Method: Uses Tool-aided Evolutionary LLM with natural language prompts, mathematical decoupling of joint optimization, model-based virtual learning environment, and group relative policy optimization.
Result: Outperforms benchmarks in energy efficiency, shows robust adaptability to environmental changes, and reduces communication overhead while maintaining high-fidelity decisions.
Conclusion: T-ELLM provides an effective framework for device selection in FL with bounded virtual-real environment discrepancy, enabling sample-efficient optimization with strong generalization.
Abstract: Federated Learning (FL) enables distributed model training across edge devices in a privacy-friendly manner. However, its efficiency heavily depends on effective device selection and high-dimensional resource allocation in dynamic and heterogeneous wireless environments. Conventional methods demand a confluence of domain-specific expertise, extensive hyperparameter tuning, and/or heavy interaction cost. This paper proposes a Tool-aided Evolutionary Large Language Model (T-ELLM) framework to generate a qualified policy for device selection in a wireless FL environment. Unlike conventional optimization methods, T-ELLM leverages natural language-based scenario prompts to enhance generalization across varying network conditions. The framework decouples the joint optimization problem mathematically, enabling tractable learning of device selection policies while delegating resource allocation to convex optimization tools. To improve adaptability, T-ELLM integrates a sample-efficient, model-based virtual learning environment that captures the relationship between device selection and learning performance, facilitating subsequent group relative policy optimization. This concerted approach reduces reliance on real-world interactions, minimizing communication overhead while maintaining high-fidelity decision-making. Theoretical analysis proves that the discrepancy between virtual and real environments is bounded, ensuring the advantage function learned in the virtual environment maintains a provably small deviation from real-world conditions. Experimental results demonstrate that T-ELLM outperforms benchmark methods in energy efficiency and exhibits robust adaptability to environmental changes.
[520] RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs
Soumya Rani Samineni, Durgesh Kalwar, Karthik Valmeekam, Kaya Stechly, Subbarao Kambhampati
Main category: cs.LG
TL;DR: The paper critically examines RL-based post-training of LLMs, showing that popular structural assumptions in MDP modeling make RL approaches effectively equivalent to supervised learning, with iterative supervised fine-tuning achieving comparable results.
Details
Motivation: To critically examine the formulation and assumptions underlying RL-based post-training of LLMs, particularly questioning whether the RL apparatus is actually necessary given the structural assumptions made.
Method: Analysis of popular structural assumptions in MDP modeling for LLMs, including state-action concatenation and uniform reward splitting, plus experiments comparing GRPO with iterative supervised fine-tuning on benchmarks like GSM8K and Countdown.
Result: Iterative supervised fine-tuning with positive and negative samples achieves performance comparable to GRPO-based training, suggesting the RL approach may not provide significant advantages over simpler supervised methods.
Conclusion: While RL may be useful for improving LLM reasoning, the simplistic structural assumptions in current RL frameworks make their interpretations questionable and suggest they may not be fundamentally different from supervised learning approaches.
Abstract: Reinforcement learning-based post-training of large language models (LLMs) has recently gained attention, particularly following the release of DeepSeek R1, which applied GRPO for fine-tuning. Amid the growing hype around improved reasoning abilities attributed to RL post-training, we critically examine the formulation and assumptions underlying these methods. We start by highlighting the popular structural assumptions made in modeling LLM training as a Markov Decision Process (MDP), and show how they lead to a degenerate MDP that doesn’t quite need the RL/GRPO apparatus. The two critical structural assumptions include (1) making the MDP states just a concatenation of the actions, with states becoming the context window and the actions becoming the tokens in LLMs, and (2) splitting the reward of a state-action trajectory uniformly across the trajectory. Through a comprehensive analysis, we demonstrate that these simplifying assumptions make the approach effectively equivalent to an outcome-driven supervised learning. Our experiments on benchmarks including GSM8K and Countdown using Qwen-2.5 base models show that iterative supervised fine-tuning, incorporating both positive and negative samples, achieves performance comparable to GRPO-based training. We will also argue that the structural assumptions indirectly incentivize the RL to generate longer sequences of intermediate tokens, which in turn feeds into the narrative of “RL generating longer thinking traces.” While RL may well be a very useful technique for improving the reasoning abilities of LLMs, our analysis shows that the simplistic structural assumptions made in modeling the underlying MDP render the popular LLM RL frameworks and their interpretations questionable.
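The claimed equivalence can be illustrated on a toy "one-shot answer" setting: sampling answers and applying weighted supervised updates (+1 on correct samples, -1 on incorrect) yields the outcome-driven gradient the paper argues GRPO-style training reduces to. The categorical "model" and constants are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(4)       # toy "LLM": a categorical policy over 4 answers
CORRECT, lr = 2, 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def iterative_sft_step(n_samples=8):
    """Sample answers, then apply weighted supervised updates: weight +1 on
    the log-likelihood of correct samples and -1 on incorrect ones. With
    states defined as token prefixes and the outcome reward split uniformly,
    this is the outcome-driven gradient the RL formulation reduces to."""
    global logits
    p = softmax(logits)
    for _ in range(n_samples):
        a = rng.choice(4, p=p)
        r = 1.0 if a == CORRECT else -1.0
        grad = -p.copy()
        grad[a] += 1.0                 # gradient of log p(a) w.r.t. logits
        logits += lr * r * grad

for _ in range(100):
    iterative_sft_step()
```

The policy concentrates on the correct answer without any value function, clipping, or group-normalized advantages, consistent with the paper's argument.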
[521] Policy-Driven World Model Adaptation for Robust Offline Model-based Reinforcement Learning
Jiayu Chen, Le Xu, Aravind Venugopal, Jeff Schneider
Main category: cs.LG
TL;DR: Proposes a unified offline model-based RL framework that adapts world models and policies together using Stackelberg learning dynamics to improve robustness against adversarial noise.
Details
Motivation: Addresses two key issues in offline MBRL: objective mismatch between model learning and policy optimization, and lack of robustness where small adversarial noise causes significant performance degradation.
Method: Uses maximin optimization solved via Stackelberg learning dynamics to dynamically adapt world model alongside policy under a unified learning objective for robustness improvement.
Result: Achieves state-of-the-art performance on twelve noisy D4RL MuJoCo tasks and three stochastic Tokamak Control tasks, demonstrating improved robustness.
Conclusion: The proposed unified framework effectively addresses robustness issues in offline MBRL by co-adapting world models and policies through Stackelberg learning dynamics.
Abstract: Offline reinforcement learning (RL) offers a powerful paradigm for data-driven control. Compared to model-free approaches, offline model-based RL (MBRL) explicitly learns a world model from a static dataset and uses it as a surrogate simulator, improving data efficiency and enabling potential generalization beyond the dataset support. However, most existing offline MBRL methods follow a two-stage training procedure: first learning a world model by maximizing the likelihood of the observed transitions, then optimizing a policy to maximize its expected return under the learned model. This objective mismatch results in a world model that is not necessarily optimized for effective policy learning. Moreover, we observe that policies learned via offline MBRL often lack robustness during deployment, and small adversarial noise in the environment can lead to significant performance degradation. To address these, we propose a framework that dynamically adapts the world model alongside the policy under a unified learning objective aimed at improving robustness. At the core of our method is a maximin optimization problem, which we solve by innovatively utilizing Stackelberg learning dynamics. We provide theoretical analysis to support our design and introduce computationally efficient implementations. We benchmark our algorithm on twelve noisy D4RL MuJoCo tasks and three stochastic Tokamak Control tasks, demonstrating its state-of-the-art performance.
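The two-timescale Stackelberg dynamics can be sketched on a toy quadratic maximin problem, where the follower (world model) is driven near its best response before each leader (policy) step; the objective and step sizes are illustrative, not the paper's:

```python
def stackelberg_maximin(steps=200, inner=50, lr=0.1):
    """Two-timescale sketch of max_theta min_phi f on the toy quadratic
    f(theta, phi) = -theta**2 + theta*phi + phi**2, whose maximin solution
    is theta = phi = 0. The follower (world model, phi) is driven close to
    its best response phi*(theta) = -theta/2 before each leader step."""
    theta, phi = 2.0, -1.5
    for _ in range(steps):
        for _ in range(inner):               # follower: minimize f over phi
            phi -= lr * (theta + 2.0 * phi)  # df/dphi
        theta += lr * (-2.0 * theta + phi)   # leader: ascend df/dtheta
    return theta, phi

theta, phi = stackelberg_maximin()
```

Running the follower to near-convergence inside each leader step is what distinguishes Stackelberg dynamics from simultaneous gradient descent-ascent, which can cycle on such objectives.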
[522] When fractional quasi p-norms concentrate
Ivan Y. Tyukin, Bogdan Grechuk, Evgeny M. Mirkes, Alexander N. Gorban
Main category: cs.LG
TL;DR: This paper resolves the long-standing controversy about distance concentration in high dimensions for fractional quasi p-norms (p∈(0,1)), identifying conditions when they concentrate and when they don’t.
Details
Motivation: To address fundamental questions about distance concentration in high dimensions, which is crucial for developing stable data analysis algorithms, and resolve theoretical and empirical controversies around fractional quasi p-norms.
Method: The authors analyze conditions under which fractional quasi p-norms concentrate or don’t concentrate, examining broad classes of distributions and their concentration properties.
Result: For broad distribution classes, fractional quasi p-norms admit exponential and uniform concentration bounds, ruling out previous ‘optimal’ p-setting approaches. However, specific conditions and distribution families allow controlling concentration rates through appropriate p choices.
Conclusion: The findings resolve tensions in the literature, show that anti-concentration distributions exist arbitrarily close to concentrating ones, and enable designing data encoding schemes that favor or discourage distance concentration.
Abstract: Concentration of distances in high dimension is an important factor for the development and design of stable and reliable data analysis algorithms. In this paper, we address the fundamental long-standing question about the concentration of distances in high dimension for fractional quasi $p$-norms, $p\in(0,1)$. The topic has been at the centre of various theoretical and empirical controversies. Here we, for the first time, identify conditions when fractional quasi $p$-norms concentrate and when they don’t. We show that, contrary to some earlier suggestions, for broad classes of distributions, fractional quasi $p$-norms admit concentration bounds that are exponential and uniform in $p$. For these distributions, the results effectively rule out previously proposed approaches to alleviate concentration by setting “optimal” values of $p$ in $(0,1)$. At the same time, we specify conditions and the corresponding families of distributions for which one can still control concentration rates by appropriate choices of $p$. We also show that in an arbitrarily small vicinity of a distribution from a large class of distributions for which uniform concentration occurs, there are uncountably many other distributions featuring anti-concentration properties. Importantly, this behavior enables devising relevant data encoding or representation schemes favouring or discouraging distance concentration. The results shed new light on this long-standing problem and resolve the tension around the topic in both theory and empirical evidence reported in the literature.
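The concentration phenomenon for $p\in(0,1)$ is easy to probe numerically. The sketch below measures the relative contrast of quasi $p$-norm distances for i.i.d. uniform coordinates, an instance of the broad distribution classes for which the paper proves uniform concentration:

```python
import numpy as np

def relative_contrast(X, p):
    """(D_max - D_min) / D_min of quasi p-norm distances from the origin to
    each row; values near 0 indicate distance concentration."""
    d = (np.abs(X) ** p).sum(axis=1) ** (1.0 / p)
    return (d.max() - d.min()) / d.min()

rng = np.random.default_rng(0)
X_hi = rng.uniform(-1, 1, size=(1000, 4096))   # i.i.d. coordinates, high dim
X_lo = rng.uniform(-1, 1, size=(1000, 4))      # same law, low dim
contrasts = {p: relative_contrast(X_hi, p) for p in (0.25, 0.5, 1.0, 2.0)}
```

For this i.i.d. class the high-dimensional contrast is small for every $p$, including $p<1$, while the low-dimensional contrast stays large, so lowering $p$ does not restore discriminability here, matching the paper's negative result for such distributions.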
[523] STaR-Bets: Sequential Target-Recalculating Bets for Tighter Confidence Intervals
Václav Voráček, Francesco Orabona
Main category: cs.LG
TL;DR: Proposes a new betting-based algorithm for constructing optimal confidence intervals for bounded random variables that outperforms existing methods and achieves near-optimal width.
Details
Motivation: Current betting-based methods for confidence intervals are either suboptimal in fixed-horizon settings or lack finite-time guarantees, creating a gap in achieving optimal confidence interval width.
Method: Developed a betting-based algorithm that uses optimal strategies at each step rather than constant strategies, leveraging this for improved performance over classical methods like Hoeffding and Bernstein inequalities.
Result: The proposed algorithm empirically outperforms competitors and achieves confidence interval width that is optimal up to a 1+o(1) factor that diminishes with sample size n.
Conclusion: The work successfully bridges the gap in optimal confidence interval construction for bounded random variables, providing both empirical superiority and theoretical guarantees for finite-time performance.
Abstract: The construction of confidence intervals for the mean of a bounded random variable is a classical problem in statistics with numerous applications in machine learning and virtually all scientific fields. In particular, obtaining the tightest possible confidence intervals is vital every time the sampling of the random variables is expensive. The current state-of-the-art method to construct confidence intervals is by using betting algorithms. This is a very successful approach for deriving optimal confidence sequences, even matching the rate of law of iterated logarithms. However, in the fixed horizon setting, these approaches are either sub-optimal or based on heuristic solutions with strong empirical performance but without a finite-time guarantee. Hence, no betting-based algorithm guaranteeing the optimal $\mathcal{O}(\sqrt{\frac{\sigma^2\log\frac{1}{\delta}}{n}})$ width of the confidence intervals is known. This work bridges this gap. We propose a betting-based algorithm to compute confidence intervals that empirically outperforms the competitors. Our betting strategy uses the optimal strategy in every step (in a certain sense), whereas the standard betting methods choose a constant strategy in advance. Leveraging this fact results in strict improvements even for classical concentration inequalities, such as the ones of Hoeffding or Bernstein. Moreover, we also prove that the width of our confidence intervals is optimal up to a $1+o(1)$ factor diminishing with $n$. The code is available at https://github.com/vvoracek/STaR-bets-confidence-interval.
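The betting construction can be sketched by inverting a capital process over a grid of candidate means. Note this uses a fixed betting fraction, not the paper's per-step optimal (target-recalculating) strategy, so it illustrates the mechanism rather than the proposed algorithm:

```python
import numpy as np

def betting_ci(x, delta=0.05, lam=0.5, grid=400):
    """CI for the mean of [0,1]-valued data by inverting a capital process:
    m stays in the interval while no bettor against "mean = m" has ever
    multiplied its wealth past 1/delta (Ville's inequality)."""
    x = np.asarray(x, dtype=float)
    keep = []
    for m in np.linspace(1e-3, 1.0 - 1e-3, grid):
        bets = lam * (x - m) / max(m, 1.0 - m)   # bounded in (-1, 1)
        logw_up = np.cumsum(np.log1p(bets))      # wealth betting "mean > m"
        logw_dn = np.cumsum(np.log1p(-bets))     # wealth betting "mean < m"
        if max(logw_up.max(), logw_dn.max()) < np.log(1.0 / delta):
            keep.append(m)
    return min(keep), max(keep)

rng = np.random.default_rng(0)
x = rng.uniform(size=500)
lo, hi = betting_ci(x)
```

The wealth process is a nonnegative martingale under the null "mean = m", so rejecting when it crosses $1/\delta$ keeps the interval valid; choosing the bet per step, as the paper does, is what tightens the width.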
[524] Zeroth-Order Optimization Finds Flat Minima
Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil, Sewoong Oh, Michael Muehlebach, Niao He
Main category: cs.LG
TL;DR: Zeroth-order optimization with standard two-point estimator implicitly favors flat minima (solutions with small trace of Hessian) over sharp minima, providing convergence guarantees to approximate flat minima for convex smooth functions.
Details
Motivation: While zeroth-order methods are widely used in ML applications where gradients are unavailable, existing theory focuses only on convergence to stationary points without understanding which particular solutions are reached. There's a gap in understanding implicit regularization properties.
Method: Analyze zeroth-order optimization with standard two-point estimator, provide theoretical convergence rates to approximate flat minima for convex and sufficiently smooth functions, and validate through experiments on binary classification and language model fine-tuning.
Result: Theoretical analysis shows zeroth-order optimization favors solutions with small trace of Hessian (flat minima). Experiments confirm this implicit regularization effect in both convex classification tasks and language model fine-tuning.
Conclusion: Zeroth-order optimization exhibits implicit regularization towards flat minima, providing theoretical guarantees for convergence to solutions with small Hessian trace, which has practical implications for applications like black-box attacks, RL, and language model fine-tuning.
Abstract: Zeroth-order methods are extensively used in machine learning applications where gradients are infeasible or expensive to compute, such as black-box attacks, reinforcement learning, and language model fine-tuning. Existing optimization theory focuses on convergence to an arbitrary stationary point, but less is known on the implicit regularization that provides a fine-grained characterization on which particular solutions are finally reached. We show that zeroth-order optimization with the standard two-point estimator favors solutions with small trace of Hessian, which is widely used in previous work to distinguish between sharp and flat minima. We further provide convergence rates of zeroth-order optimization to approximate flat minima for convex and sufficiently smooth functions, where flat minima are defined as the minimizers that achieve the smallest trace of Hessian among all optimal solutions. Experiments on binary classification tasks with convex losses and language model fine-tuning support our theoretical findings.
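The standard two-point estimator at the center of the analysis can be sketched in a few lines. This is an illustrative sketch, not the paper's code; the step size, smoothing radius, and test function are arbitrary choices here:

```python
import numpy as np

def two_point_grad(f, x, rng, mu=1e-4):
    # Standard two-point zeroth-order estimator: sample a random unit
    # direction u and take the finite difference along it,
    #   g = d * (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u,
    # an estimate of the gradient of a smoothed version of f.
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)
    return x.size * (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u

# Zeroth-order gradient descent on f(x) = ||x||^2 using function values only.
f = lambda x: float(x @ x)
rng = np.random.default_rng(0)
x = np.ones(5)
for _ in range(500):
    x -= 0.05 * two_point_grad(f, x, rng)
```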
[525] Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning
Achleshwar Luthra, Tianbao Yang, Tomer Galanti
Main category: cs.LG
TL;DR: Self-supervised contrastive learning (CL) implicitly approximates a supervised variant called negatives-only supervised contrastive loss (NSCL), with the gap vanishing as the number of classes increases. NSCL produces representations with specific geometric properties that enable accurate few-shot learning via linear probes.
Details
Motivation: To establish theoretical foundations for self-supervised contrastive learning, which has shown empirical success but lacks complete theoretical understanding.
Method: Theoretical analysis showing CL objectives approximate NSCL, characterization of geometric structure of NSCL minimizers, and introduction of a new bound on few-shot error for linear probing.
Result: The gap between CL and NSCL decays as O(1/#classes), representations exhibit augmentation collapse and within-class collapse, class centers form a simplex equiangular tight frame, and the few-shot error bound provides tight performance estimates.
Conclusion: CL implicitly approximates supervised learning objectives, producing geometrically structured representations that support accurate few-shot learning, with theoretical properties validated empirically.
Abstract: Despite its empirical success, the theoretical foundations of self-supervised contrastive learning (CL) are not yet fully established. In this work, we address this gap by showing that standard CL objectives implicitly approximate a supervised variant we call the negatives-only supervised contrastive loss (NSCL), which excludes same-class contrasts. We prove that the gap between the CL and NSCL losses vanishes as the number of semantic classes increases, under a bound that is both label-agnostic and architecture-independent. We characterize the geometric structure of the global minimizers of the NSCL loss: the learned representations exhibit augmentation collapse, within-class collapse, and class centers that form a simplex equiangular tight frame. We further introduce a new bound on the few-shot error of linear-probing. This bound depends on two measures of feature variability: within-class dispersion and variation along the line between class centers. We show that directional variation dominates the bound and that the within-class dispersion’s effect diminishes as the number of labeled samples increases. These properties enable CL and NSCL-trained representations to support accurate few-shot label recovery using simple linear probes. Finally, we empirically validate our theoretical findings: the gap between CL and NSCL losses decays at a rate of $\mathcal{O}(\frac{1}{\#\text{classes}})$; the two losses are highly correlated; minimizing the CL loss implicitly brings the NSCL loss close to the value achieved by direct minimization; and the proposed few-shot error bound provides a tight estimate of probing performance in practice. The code and project page of the paper are available at [\href{https://github.com/DLFundamentals/understanding-ssl}{code}, \href{https://dlfundamentals.github.io/ssl-is-approximately-sl/}{project page}].
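The relationship between the two losses can be illustrated for a single anchor. This is a simplified single-anchor sketch over random features (not trained representations, and not the paper's exact formulation); NSCL is identical to the CL loss except that same-class samples are dropped from the negatives:

```python
import numpy as np

def cl_loss(anchor, positive, negatives, tau=0.5):
    # Self-supervised contrastive loss for one anchor: the positive is an
    # augmented view, and every other sample is treated as a negative.
    pos = positive @ anchor / tau
    neg = negatives @ anchor / tau
    return float(-pos + np.log(np.exp(pos) + np.exp(neg).sum()))

def nscl_loss(anchor, positive, negatives, neg_labels, anchor_label, tau=0.5):
    # Negatives-only supervised contrastive loss: same as above, except
    # samples sharing the anchor's class are excluded from the negatives.
    keep = neg_labels != anchor_label
    return cl_loss(anchor, positive, negatives[keep], tau)

rng = np.random.default_rng(0)
d, n, num_classes = 16, 512, 64
anchor = rng.standard_normal(d)
positive = anchor + 0.1 * rng.standard_normal(d)
negatives = rng.standard_normal((n, d))
labels = rng.integers(0, num_classes, size=n)

# Dropping same-class negatives can only shrink the log-sum-exp term, so the
# gap is nonnegative, and it shrinks as the number of classes grows because
# fewer negatives share the anchor's class.
gap = cl_loss(anchor, positive, negatives) - nscl_loss(
    anchor, positive, negatives, labels, anchor_label=0)
```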
[526] Rethinking Losses for Diffusion Bridge Samplers
Sebastian Sanokowski, Lukas Gruber, Christoph Bartmann, Sepp Hochreiter, Sebastian Lehner
Main category: cs.LG
TL;DR: rKL-LD loss outperforms LV loss for diffusion bridges, offering better performance, stability, and reduced hyperparameter tuning.
Details
Motivation: To address conceptual issues with the LV loss for diffusion bridges and show that the rKL-LD loss provides a better theoretical foundation and practical performance.
Method: Analyzed the gradient equivalence between the LV and rKL losses, then compared the rKL-LD loss against the LV loss for different types of diffusion bridges on challenging benchmarks.
Result: rKL-LD loss consistently outperforms LV loss across different diffusion bridge types, requires less hyperparameter optimization, and yields more stable training.
Conclusion: rKL-LD loss is superior to LV loss for diffusion bridges both theoretically and practically, providing better sampling performance and training stability.
Abstract: Diffusion bridges are a promising class of deep-learning methods for sampling from unnormalized distributions. Recent works show that the Log Variance (LV) loss consistently outperforms the reverse Kullback-Leibler (rKL) loss when using the reparametrization trick to compute rKL-gradients. While the on-policy LV loss yields identical gradients to the rKL loss when combined with the log-derivative trick for diffusion samplers with non-learnable forward processes, this equivalence does not hold for diffusion bridges or when diffusion coefficients are learned. Based on this insight we argue that for diffusion bridges the LV loss does not represent an optimization objective that can be motivated like the rKL loss via the data processing inequality. Our analysis shows that employing the rKL loss with the log-derivative trick (rKL-LD) not only avoids these conceptual problems but also consistently outperforms the LV loss. Experimental results with different types of diffusion bridges on challenging benchmarks show that samplers trained with the rKL-LD loss achieve better performance. From a practical perspective we find that rKL-LD requires significantly less hyperparameter optimization and yields more stable training behavior.
[527] Do-PFN: In-Context Learning for Causal Effect Estimation
Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, Bernhard Schölkopf
Main category: cs.LG
TL;DR: Do-PFN: A method that uses pre-trained networks on synthetic causal data to estimate causal effects from observational data without requiring knowledge of the underlying causal graph.
Details
Motivation: Existing causal effect estimation methods require interventional data, ground truth causal graphs, or unconfoundedness assumptions, limiting real-world applicability.
Method: Pre-train Prior-data fitted networks (PFNs) on synthetic data from various causal structures with interventions, enabling prediction of interventional outcomes from observational data via in-context learning.
Result: Accurate estimation of causal effects without knowledge of the underlying causal graph, demonstrated through extensive experiments on synthetic case studies.
Conclusion: Do-PFN enables scalable and robust causal effect estimation across diverse causal datasets, transferring PFNs’ predictive success to causal inference.
Abstract: Estimation of causal effects is critical to a range of scientific disciplines. Existing methods for this task either require interventional data, knowledge about the ground truth causal graph, or rely on assumptions such as unconfoundedness, restricting their applicability in real-world settings. In the domain of tabular machine learning, Prior-data fitted networks (PFNs) have achieved state-of-the-art predictive performance, having been pre-trained on synthetic data to solve tabular prediction problems via in-context learning. To assess whether this can be transferred to the harder problem of causal effect estimation, we pre-train PFNs on synthetic data drawn from a wide variety of causal structures, including interventions, to predict interventional outcomes given observational data. Through extensive experiments on synthetic case studies, we show that our approach allows for the accurate estimation of causal effects without knowledge of the underlying causal graph. We also perform ablation studies that elucidate Do-PFN’s scalability and robustness across datasets with a variety of causal characteristics.
[528] MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature
Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Main category: cs.LG
TL;DR: MAC is an efficient second-order optimization method that approximates KFAC’s Fisher information matrix components, reducing computational burden while maintaining superior convergence for training neural networks including transformers.
Details
Motivation: Second-order optimization methods like KFAC provide better convergence using curvature information but suffer from high computational costs. This work aims to develop a more efficient alternative.
Method: Proposed MAC method analyzes the eigenspectra of KFAC’s Kronecker factors (activations and pre-activation gradients) and develops efficient approximations. It applies Kronecker factorization to transformer attention layers and integrates attention scores into preconditioning.
Result: Extensive evaluations show MAC outperforms KFAC and other state-of-the-art methods in accuracy, training time, and memory usage across various network architectures and datasets.
Conclusion: MAC provides a computationally efficient second-order optimization method that maintains convergence benefits while significantly reducing computational burden, making it suitable for modern architectures like transformers.
Abstract: Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of loss landscape. However, it comes at the expense of high computational burden. In this work, we analyze the two components that constitute the layer-wise Fisher information matrix (FIM) used in KFAC: the Kronecker factors related to activations and pre-activation gradients. Based on empirical observations on their eigenspectra, we propose efficient approximations for them, resulting in a computationally efficient optimization method called MAC. To the best of our knowledge, MAC is the first algorithm to apply the Kronecker factorization to the FIM of attention layers used in transformers and explicitly integrate attention scores into the preconditioning. We also study the convergence property of MAC on nonlinear neural networks and provide two conditions under which it converges to global minima. Our extensive evaluations on various network architectures and datasets show that the proposed method outperforms KFAC and other state-of-the-art methods in terms of accuracy, end-to-end training time, and memory usage.
[529] Sampling 3D Molecular Conformers with Diffusion Transformers
J. Thorben Frank, Winfried Ripken, Gregor Lied, Klaus-Robert Müller, Oliver T. Unke, Stefan Chmiela
Main category: cs.LG
TL;DR: DiTMC adapts Diffusion Transformers for molecular conformer generation, addressing challenges like integrating graph information with 3D geometry and handling Euclidean symmetries through modular architecture and graph-based conditioning.
Details
Motivation: Diffusion Transformers show strong performance in image synthesis but face novel challenges when applied to molecules, including integrating discrete molecular graphs with continuous 3D geometry and handling Euclidean symmetries.
Method: Proposes DiTMC with modular architecture separating 3D coordinate processing from atomic connectivity conditioning, using graph-based conditioning strategies and both standard and SO(3)-equivariant attention mechanisms.
Result: Achieves state-of-the-art precision and physical validity on standard conformer generation benchmarks (GEOM-QM9, -DRUGS, -XL), with flexible trade-off between accuracy and computational efficiency.
Conclusion: DiTMC demonstrates how architectural choices and symmetry priors affect molecular structure generation quality and efficiency, providing promising directions for large-scale generative modeling of molecular structures.
Abstract: Diffusion Transformers (DiTs) have demonstrated strong performance in generative modeling, particularly in image synthesis, making them a compelling choice for molecular conformer generation. However, applying DiTs to molecules introduces novel challenges, such as integrating discrete molecular graph information with continuous 3D geometry, handling Euclidean symmetries, and designing conditioning mechanisms that generalize across molecules of varying sizes and structures. We propose DiTMC, a framework that adapts DiTs to address these challenges through a modular architecture that separates the processing of 3D coordinates from conditioning on atomic connectivity. To this end, we introduce two complementary graph-based conditioning strategies that integrate seamlessly with the DiT architecture. These are combined with different attention mechanisms, including both standard non-equivariant and SO(3)-equivariant formulations, enabling flexible control over the trade-off between accuracy and computational efficiency. Experiments on standard conformer generation benchmarks (GEOM-QM9, -DRUGS, -XL) demonstrate that DiTMC achieves state-of-the-art precision and physical validity. Our results highlight how architectural choices and symmetry priors affect sample quality and efficiency, suggesting promising directions for large-scale generative modeling of molecular structures. Code is available at https://github.com/ML4MolSim/dit_mc.
[530] ORVIT: Near-Optimal Online Distributionally Robust Reinforcement Learning
Debamita Ghosh, George K. Atia, Yue Wang
Main category: cs.LG
TL;DR: Online distributionally robust RL algorithm that achieves sublinear regret for robust control under f-divergence ambiguity sets without requiring generative models or offline data.
Details
Motivation: Address distributional mismatch between training and deployment in RL, where policies trained in simulators often fail in practice due to environmental differences. Existing methods require impractical assumptions like generative models or broad offline dataset coverage.
Method: Propose online distributionally robust RL where agent interacts with single unknown training environment while seeking policies robust to uncertainty sets around nominal model. Use f-divergence-based ambiguity sets (including χ² and KL divergence) and design computationally efficient algorithm.
Result: Achieve sublinear regret for robust control objective under minimal assumptions without requiring generative or offline data access. Establish minimax lower bound showing near-optimality. Experiments show consistent improvement in worst-case performance across diverse environments with model misspecification.
Conclusion: The proposed online distributionally robust RL method effectively addresses distributional mismatch, provides theoretical guarantees, and demonstrates practical improvements in worst-case performance without impractical data requirements.
Abstract: We investigate reinforcement learning (RL) in the presence of distributional mismatch between training and deployment, where policies trained in simulators often underperform in practice due to mismatches between training and deployment conditions, so that reliable guarantees on real-world performance are essential. Distributionally robust RL addresses this issue by optimizing worst-case performance over an uncertainty set of environments and providing an optimized lower bound on deployment performance. However, existing studies typically assume access to either a generative model or offline datasets with broad coverage of the deployment environment, assumptions that limit their practicality in unknown environments without prior knowledge. In this work, we study a more practical and challenging setting: online distributionally robust RL, where the agent interacts only with a single unknown training environment while seeking policies that are robust with respect to an uncertainty set around this nominal model. We consider general $f$-divergence-based ambiguity sets, including $\chi^2$ and KL divergence balls, and design a computationally efficient algorithm that achieves sublinear regret for the robust control objective under minimal assumptions, without requiring generative or offline data access. Moreover, we establish a corresponding minimax lower bound on the regret of any online algorithm, demonstrating the near-optimality of our method. Experiments across diverse environments with model misspecification show that our approach consistently improves worst-case performance and aligns with the theoretical guarantees.
[531] S$^2$M-Former: Spiking Symmetric Mixing Branchformer for Brain Auditory Attention Detection
Jiaqi Wang, Zhengyu Ma, Xiongri Shen, Chenlin Zhou, Leilei Zhao, Han Zhang, Yi Zhong, Siqi Cai, Zhenxi Song, Zhiguo Zhang
Main category: cs.LG
TL;DR: S²M-Former is a novel spiking symmetric mixing framework for auditory attention detection that achieves state-of-the-art performance with 14.7× parameter reduction and 5.8× energy savings compared to existing methods.
Details
Motivation: Current EEG-based auditory attention detection lacks synergistic frameworks that can fully leverage complementary EEG features under energy-efficiency constraints, which is crucial for developing neuro-steered hearing devices.
Method: Proposes a spike-driven symmetric architecture with parallel spatial and frequency branches using mirrored modular design and biologically plausible token-channel mixers, along with lightweight 1D token sequences to replace conventional 3D operations.
Result: Achieves comparable state-of-the-art decoding accuracy on three AAD benchmarks (KUL, DTU, AV-GC-AAD) across three settings (within-trial, cross-trial, cross-subject) while reducing parameters by 14.7× and energy consumption by 5.8× compared to recent ANN methods.
Conclusion: S²M-Former is a promising low-power, high-performance solution for auditory attention detection tasks, offering superior parameter efficiency and energy savings while maintaining competitive performance.
Abstract: Auditory attention detection (AAD) aims to decode listeners’ focus in complex auditory environments from electroencephalography (EEG) recordings, which is crucial for developing neuro-steered hearing devices. Despite recent advancements, EEG-based AAD remains hindered by the absence of synergistic frameworks that can fully leverage complementary EEG features under energy-efficiency constraints. We propose S$^2$M-Former, a novel spiking symmetric mixing framework to address this limitation through two key innovations: i) Presenting a spike-driven symmetric architecture composed of parallel spatial and frequency branches with mirrored modular design, leveraging biologically plausible token-channel mixers to enhance complementary learning across branches; ii) Introducing lightweight 1D token sequences to replace conventional 3D operations, reducing parameters by 14.7$\times$. The brain-inspired spiking architecture further reduces power consumption, achieving a 5.8$\times$ energy reduction compared to recent ANN methods, while also surpassing existing SNN baselines in terms of parameter efficiency and performance. Comprehensive experiments on three AAD benchmarks (KUL, DTU and AV-GC-AAD) across three settings (within-trial, cross-trial and cross-subject) demonstrate that S$^2$M-Former achieves comparable state-of-the-art (SOTA) decoding accuracy, making it a promising low-power, high-performance solution for AAD tasks. Code is available at https://github.com/JackieWang9811/S2M-Former.
[532] Interpretable Reward Model via Sparse Autoencoder
Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang
Main category: cs.LG
TL;DR: SARM integrates Sparse Autoencoder into reward models to create interpretable, sparse feature spaces for transparent reward scoring and dynamic preference adjustment.
Details
Motivation: Traditional reward models lack interpretability, offer limited insight into reward reasoning, and are inflexible to user preference shifts. Multidimensional RMs fail to provide feature-level attribution and require costly annotations.
Method: Integrates a pretrained Sparse Autoencoder (SAE) into reward models to map LLM hidden activations into interpretable, sparse, monosemantic feature space, then aggregates features through a scalar head for reward scoring.
Result: Empirical evaluations show SARM enables direct feature-level attribution of rewards, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models.
Conclusion: SARM provides a novel architecture that enhances reward model interpretability and flexibility while maintaining strong alignment performance, addressing key limitations of traditional reward modeling approaches.
Abstract: Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at https://github.com/schrieffer-z/sarm.
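The core idea, feature-level reward attribution through a sparse bottleneck followed by a linear head, can be sketched with random placeholder weights. All sizes, weight names, and initializations below are hypothetical, not the trained SARM:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 32, 128  # hypothetical hidden and feature-space sizes

# Random placeholder SAE encoder weights and scalar reward head (untrained).
W_enc = 0.1 * rng.standard_normal((d_model, d_feat))
b_enc = np.zeros(d_feat)
w_head = 0.1 * rng.standard_normal(d_feat)

def sarm_reward(h):
    # SAE encoder: map a hidden activation to sparse, nonnegative features.
    f = np.maximum(h @ W_enc + b_enc, 0.0)
    # Because the head is linear, the reward decomposes exactly into
    # per-feature contributions -- the feature-level attribution the
    # abstract describes.
    attribution = f * w_head
    return float(attribution.sum()), attribution

h = rng.standard_normal(d_model)
reward, attribution = sarm_reward(h)
```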
[533] Instance Generation for Meta-Black-Box Optimization through Latent Space Reverse Engineering
Chen Wang, Yue-Jiao Gong, Zhiguang Cao, Zeyuan Ma
Main category: cs.LG
TL;DR: LSRE generates diverse training problem instances for Meta-Black-Box Optimization (MetaBBO) to improve generalization by using autoencoder-based latent space representation and genetic programming to create Diverse-BBO dataset.
Details
Motivation: Current MetaBBO methods use limited CoCo-BBOB benchmarks, risking overfitting and poor generalization due to insufficient problem diversity.
Method: LSRE trains an autoencoder to map problem features to 2D latent space, performs uniform-grid sampling for diverse representations, then uses genetic programming to reverse engineer function formulas with minimal L2-distance to create Diverse-BBO dataset.
Result: MetaBBOs trained on Diverse-BBO show superior generalization on synthetic and realistic scenarios compared to existing training sets, with ablation studies confirming LSRE’s design effectiveness.
Conclusion: LSRE successfully addresses MetaBBO’s overfitting issue by generating diverse training instances, revealing insights about instance diversity’s impact on generalization performance.
Abstract: To relieve the intensive human expertise required to design optimization algorithms, recent Meta-Black-Box Optimization (MetaBBO) research leverages the generalization strength of meta-learning to train neural network-based algorithm design policies over a predefined training problem set, which automates the adaptability of the low-level optimizers on unseen problem instances. Currently, a common training problem set choice in existing MetaBBOs is the well-known benchmark suite CoCo-BBOB. Although such a choice facilitates MetaBBO’s development, problem instances in CoCo-BBOB are somewhat limited in diversity, raising the risk of overfitting of MetaBBOs, which might further result in poor generalization. In this paper, we propose an instance generation approach, termed as \textbf{LSRE}, which could generate diverse training problem instances for MetaBBOs to learn more generalizable policies. LSRE first trains an autoencoder which maps high-dimensional problem features into a 2-dimensional latent space. Uniform-grid sampling in this latent space leads to hidden representations of problem instances with sufficient diversity. By leveraging a genetic-programming approach to search function formulas with minimal L2-distance to these hidden representations, LSRE reverse engineers a diversified problem set, termed as \textbf{Diverse-BBO}. We validate the effectiveness of LSRE by training various MetaBBOs on Diverse-BBO and observe their generalization performances on either synthetic or realistic scenarios. Extensive experimental results underscore the superiority of Diverse-BBO over existing training set choices in MetaBBOs. Further ablation studies not only demonstrate the effectiveness of design choices in LSRE, but also reveal interesting insights on instance diversity and MetaBBO’s generalization.
[534] FedShard: Federated Unlearning with Efficiency Fairness and Performance Fairness
Siyuan Wen, Meng Zhang, Yang Yang, Ningning Ding
Main category: cs.LG
TL;DR: FedShard is the first federated unlearning algorithm that ensures both efficiency fairness and performance fairness among clients during data removal, addressing fairness gaps in existing methods while maintaining high unlearning speed.
Details
Motivation: Current federated unlearning methods focus on efficiency and effectiveness but ignore fairness aspects among decentralized clients, leaving risks like cascaded leaving and poisoning attacks unaddressed.
Method: FedShard adaptively balances convergence, unlearning efficiency, and unlearning fairness using novel fairness metrics that satisfy established fairness properties.
Result: FedShard accelerates unlearning 1.3-6.2x faster than retraining and 4.9x faster than state-of-the-art exact unlearning methods, while mitigating unfairness risks and balancing costs among clients.
Conclusion: FedShard successfully addresses fairness gaps in federated unlearning, providing balanced efficiency and performance fairness with significant speed improvements over existing approaches.
Abstract: To protect clients’ right to be forgotten in federated learning, federated unlearning aims to remove the data contribution of leaving clients from the global learned model. While current studies mainly focused on enhancing unlearning efficiency and effectiveness, the crucial aspects of efficiency fairness and performance fairness among decentralized clients during unlearning have remained largely unexplored. In this study, we introduce FedShard, the first federated unlearning algorithm designed to concurrently guarantee both efficiency fairness and performance fairness. FedShard adaptively addresses the challenges introduced by dilemmas among convergence, unlearning efficiency, and unlearning fairness. Furthermore, we propose two novel metrics to quantitatively assess the fairness of unlearning algorithms, which we prove to satisfy well-known properties in other existing fairness measurements. Our theoretical analysis and numerical evaluation validate FedShard’s fairness in terms of both unlearning performance and efficiency. We demonstrate that FedShard mitigates unfairness risks such as cascaded leaving and poisoning attacks and realizes more balanced unlearning costs among clients. Experimental results indicate that FedShard accelerates the data unlearning process 1.3-6.2 times faster than retraining from scratch and 4.9 times faster than the state-of-the-art exact unlearning methods.
[535] GDNSQ: Gradual Differentiable Noise Scale Quantization for Low-bit Neural Networks
Sergey Salishev, Ian Akhremchik
Main category: cs.LG
TL;DR: Quantized neural networks analyzed as noisy channels; fine-tuning with differentiable STE and learnable parameters achieves competitive accuracy down to extreme W1A1 quantization while maintaining efficiency.
Details
Motivation: To understand capacity dynamics in quantized neural networks as bit-width decreases and address quantization bottlenecks through constrained optimization.
Method: Uses fully differentiable Straight-Through Estimator (STE) with learnable bit-width, noise scale and clamp bounds, enforced via exterior-point penalty; employs mild metric smoothing through distillation for training stability.
Result: Attains competitive accuracy down to extreme W1A1 (1-bit weights and activations) quantization setting while retaining STE efficiency.
Conclusion: The proposed constrained optimization approach effectively handles quantization bottlenecks and achieves strong performance even at extreme low-precision settings.
Abstract: Quantized neural networks can be viewed as a chain of noisy channels, where rounding in each layer reduces capacity as bit-width shrinks; the floating-point (FP) checkpoint sets the maximum input rate. We track capacity dynamics as the average bit-width decreases and identify resulting quantization bottlenecks by casting fine-tuning as a smooth, constrained optimization problem. Our approach employs a fully differentiable Straight-Through Estimator (STE) with learnable bit-width, noise scale and clamp bounds, and enforces a target bit-width via an exterior-point penalty; mild metric smoothing (via distillation) stabilizes training. Despite its simplicity, the method attains competitive accuracy down to the extreme W1A1 setting while retaining the efficiency of STE.
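The STE fake-quantization forward/backward pair that the method builds on can be sketched as follows. This uses a fixed scale and bit-width for illustration, whereas in the paper the bit-width, noise scale, and clamp bounds are all learnable:

```python
import numpy as np

def fake_quant(w, scale, bits=8):
    # Forward pass of uniform fake quantization: scale, round, clamp to
    # the signed integer grid, then rescale back to float.
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def ste_grad(w, upstream, scale, bits=8):
    # Straight-through estimator backward: round() is treated as the
    # identity, so gradients pass through unchanged inside the clamp
    # range and are zeroed where the value saturated.
    qmax = 2 ** (bits - 1) - 1
    inside = (w / scale >= -qmax - 1) & (w / scale <= qmax)
    return upstream * inside

w = np.array([0.26, -0.1, 50.0])
out = fake_quant(w, scale=0.25)                   # snapped to the 8-bit grid
grad = ste_grad(w, np.ones_like(w), scale=0.25)   # saturated entry gets zero grad
```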
[536] TimeMosaic: Temporal Heterogeneity Guided Time Series Forecasting via Adaptive Granularity Patch and Segment-wise Decoding
Kuiye Ding, Fanda Fan, Chunyi Hou, Zheya Wang, Lei Wang, Zhengxin Yang, Jianfeng Zhan
Main category: cs.LG
TL;DR: TimeMosaic is a multivariate time series forecasting framework that addresses temporal heterogeneity through adaptive patch embedding and segment-wise decoding, outperforming existing methods and achieving competitive performance with large-scale training.
Details
Motivation: Existing patch-based methods use fixed-length segmentation, which overlooks heterogeneity in local temporal dynamics and decoding requirements, leading to loss of details in information-dense regions and redundancy in stable segments.
Method: TimeMosaic employs adaptive patch embedding to dynamically adjust granularity based on local information density, and introduces segment-wise decoding that treats each prediction horizon as a related subtask with horizon-specific adaptation.
Result: Extensive evaluations show TimeMosaic delivers consistent improvements over existing methods, and when trained on a large-scale corpus with 321 billion observations, it achieves performance competitive with state-of-the-art TSFMs.
Conclusion: TimeMosaic effectively addresses temporal heterogeneity in multivariate time series forecasting through its adaptive patch embedding and segment-wise decoding approach, demonstrating superior performance across various benchmarks.
Abstract: Multivariate time series forecasting is essential in domains such as finance, transportation, climate, and energy. However, existing patch-based methods typically adopt fixed-length segmentation, overlooking the heterogeneity of local temporal dynamics and the decoding heterogeneity of forecasting. Such designs lose details in information-dense regions, introduce redundancy in stable segments, and fail to capture the distinct complexities of short-term and long-term horizons. We propose TimeMosaic, a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density, balancing motif reuse with structural clarity while preserving temporal continuity. In addition, it introduces segment-wise decoding that treats each prediction horizon as a related subtask and adapts to horizon-specific difficulty and information requirements, rather than applying a single uniform decoder. Extensive evaluations on benchmark datasets demonstrate that TimeMosaic delivers consistent improvements over existing methods, and our model trained on the large-scale corpus with 321 billion observations achieves performance competitive with state-of-the-art TSFMs.
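A toy sketch of the adaptive-granularity idea (hypothetical variance-threshold rule, not the paper's learned embedding): short patches where local variance is high, long patches where the signal is stable.

```python
import numpy as np

def adaptive_patches(series, base=8, dense=4, thresh=None):
    """Toy information-density-driven patching: walk left to right,
    emitting short patches (`dense`) where local variance exceeds a
    threshold and long patches (`base`) in stable regions. The
    threshold defaults to the global variance (a hypothetical rule)."""
    if thresh is None:
        thresh = np.var(series)
    patches, i = [], 0
    while i < len(series):
        window = series[i:i + base]
        size = dense if np.var(window) > thresh else base
        patches.append((i, min(i + size, len(series))))
        i += size
    return patches

flat = np.zeros(32)                      # stable segment
spiky = np.tile([5.0, -5.0], 16)         # information-dense segment
bounds = adaptive_patches(np.concatenate([flat, spiky]))
```

The stable half is covered by four long patches and the oscillating half by eight short ones, mirroring the motif-reuse versus detail trade-off described above.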
[537] NeRC: Neural Ranging Correction through Differentiable Moving Horizon Location Estimation
Xu Weng, K. V. Ling, Haochen Liu, Bingheng Wang, Kun Cao
Main category: cs.LG
TL;DR: NeRC is an end-to-end neural framework that corrects GNSS ranging errors for mobile devices in urban environments using differentiable location estimation and Euclidean Distance Field cost maps, eliminating the need for hard-to-obtain ranging error labels.
Details
Motivation: GNSS localization on mobile devices suffers from poor accuracy in urban areas due to signal propagation errors and low-quality hardware. Traditional data-driven methods require difficult-to-obtain ranging error annotations, limiting their practical deployment.
Method: Proposes Neural Ranging Correction (NeRC) framework with differentiable moving horizon estimation (MHE) that uses ground-truth locations for training instead of ranging error labels. Introduces Euclidean Distance Field (EDF) cost maps as a new training paradigm to reduce dependency on labeled locations.
Result: Demonstrates significant improvement in positioning accuracy on public benchmarks and collected datasets. Successfully deployed on edge devices with real-time performance verification for mobile applications.
Conclusion: NeRC provides an effective end-to-end solution for GNSS error correction that bypasses the need for ranging error annotations, achieving improved accuracy while maintaining real-time performance on mobile devices.
Abstract: GNSS localization using everyday mobile devices is challenging in urban environments, as ranging errors caused by the complex propagation of satellite signals and low-quality onboard GNSS hardware are blamed for undermining positioning accuracy. Researchers have pinned their hopes on data-driven methods to regress such ranging errors from raw measurements. However, the grueling annotation of ranging errors impedes their pace. This paper presents a robust end-to-end Neural Ranging Correction (NeRC) framework, where localization-related metrics serve as the task objective for training the neural modules. Instead of seeking impractical ranging error labels, we train the neural network using ground-truth locations that are relatively easy to obtain. This functionality is supported by differentiable moving horizon location estimation (MHE) that handles a horizon of measurements for positioning and backpropagates the gradients for training. Even better, as a blessing of end-to-end learning, we propose a new training paradigm using Euclidean Distance Field (EDF) cost maps, which alleviates the demands on labeled locations. We evaluate the proposed NeRC on public benchmarks and our collected datasets, demonstrating its distinguished improvement in positioning accuracy. We also deploy NeRC on the edge to verify its real-time performance for mobile devices.
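A brute-force Euclidean Distance Field over a small occupancy grid illustrates the kind of cost map the EDF training paradigm builds on (illustrative only; the paper's maps and solver are far more elaborate):

```python
import numpy as np

def euclidean_distance_field(grid):
    """Brute-force EDF: for every cell, the Euclidean distance to the
    nearest occupied (== 1) cell. A smooth field like this, built over
    plausible locations, can serve as a location-error cost without
    per-measurement labels (assumed usage, not the paper's code)."""
    occ = np.argwhere(grid == 1)
    h, w = grid.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1)
    d = np.linalg.norm(pts[:, None, :] - occ[None, :, :], axis=2).min(axis=1)
    return d.reshape(h, w)

grid = np.zeros((4, 4))
grid[1, 1] = 1.0                          # a single reference cell
edf = euclidean_distance_field(grid)
```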
[538] A Realistic Evaluation of Cross-Frequency Transfer Learning and Foundation Forecasting Models
Kin G. Olivares, Malcolm Wolff, Tatiana Konstantinova, Shankar Ramasubramanian, Boris Oreshkin, Andrew Gordon Wilson, Andres Potapczynski, Willa Potosnak, Michael W. Mahoney, Mengfei Cao, Dmitry Efimov
Main category: cs.LG
TL;DR: Current cross-frequency transfer learning (CFTL) benchmarking practices are flawed due to small datasets, improper statistics, suboptimal models, and test leakage. Our rigorous evaluation shows statistical models outperform foundation forecasting models by significant margins.
Details
Motivation: To address limitations in current CFTL benchmarking practices including over-reliance on small datasets, inadequate statistical treatment, reporting of suboptimal models, and failure to prevent test dataset overlap.
Method: Unified reimplementation of neural forecasting networks adapted for CFTL setup, pre-training only on proprietary and synthetic data while preventing test leakage, and evaluation on 15 large, diverse public forecast competition datasets.
Result: Statistical models and ensembles consistently outperform existing foundation forecasting models by >8.2% in sCRPS and >20% in MASE across datasets. Synthetic dataset pre-training improves FFM accuracy by 7%.
Conclusion: Current CFTL benchmarking practices need improvement, statistical models are often underreported, and while statistical methods outperform FFMs, synthetic pre-training does provide some benefits to FFMs.
Abstract: Cross-frequency transfer learning (CFTL) has emerged as a popular framework for curating large-scale time series datasets to pre-train foundation forecasting models (FFMs). Although CFTL has shown promise, current benchmarking practices fall short of accurately assessing its performance. This shortcoming stems from many factors: an over-reliance on small-scale evaluation datasets; inadequate treatment of sample size when computing summary statistics; reporting of suboptimal statistical models; and failing to account for non-negligible risks of overlap between pre-training and test datasets. To address these limitations, we introduce a unified reimplementation of widely-adopted neural forecasting networks, adapting them for the CFTL setup; we pre-train only on proprietary and synthetic data, being careful to prevent test leakage; and we evaluate on 15 large, diverse public forecast competition datasets. Our empirical analysis reveals that statistical models’ accuracy is frequently underreported. Notably, we confirm that statistical models and their ensembles consistently outperform existing FFMs by more than 8.2% in sCRPS, and by more than 20% in MASE, across datasets. However, we also find that synthetic dataset pre-training does improve the accuracy of an FFM by 7%.
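MASE, one of the two headline metrics, is the forecast MAE scaled by the in-sample MAE of a seasonal-naive forecast; a minimal implementation (standard competition definition, with the seasonal period m left as a free parameter):

```python
import numpy as np

def mase(y_true, y_pred, y_train, m=1):
    """Mean Absolute Scaled Error: forecast MAE divided by the
    in-sample MAE of a seasonal-naive forecast with period m.
    Scores below 1 beat the naive baseline."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / scale

y_train = np.array([1.0, 3.0, 2.0, 5.0])   # history used for scaling
y_true = np.array([4.0, 6.0])
y_pred = np.array([5.0, 5.0])
score = mase(y_true, y_pred, y_train)       # MAE 1.0 / naive MAE 2.0
```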
[539] HyPINO: Multi-Physics Neural Operators via HyperPINNs and the Method of Manufactured Solutions
Rafael Bischof, Michal Piovarči, Michael A. Kraus, Siddhartha Mishra, Bernd Bickel
Main category: cs.LG
TL;DR: HyPINO is a multi-physics neural operator that achieves zero-shot generalization across various PDEs without fine-tuning, using a Swin Transformer hypernetwork with mixed supervision from analytical solutions and physics-informed objectives.
Details
Motivation: To develop a neural operator that can generalize across different PDE types without task-specific fine-tuning, addressing the limitations of existing methods that require extensive retraining for each new problem.
Method: Combines Swin Transformer-based hypernetwork with mixed supervision (labeled analytical solutions via MMS and unlabeled physics-informed samples), plus iterative refinement procedure that treats residuals as delta PDEs for progressive error reduction.
Result: Achieves strong zero-shot accuracy on 7 benchmark problems, outperforming U-Nets, Poseidon, and PINO. Iterative refinement achieves >100x lower L2 loss in best cases. Fine-tuned PINNs initialized by HyPINO converge faster and to lower error than random initialization and meta-learning approaches.
Conclusion: HyPINO demonstrates scalable potential as a foundation for extending neural operators to solve complex, nonlinear, and high-dimensional PDE problems, with publicly available code and model weights.
Abstract: We present HyPINO, a multi-physics neural operator designed for zero-shot generalization across a broad class of PDEs without requiring task-specific fine-tuning. Our approach combines a Swin Transformer-based hypernetwork with mixed supervision: (i) labeled data from analytical solutions generated via the Method of Manufactured Solutions (MMS), and (ii) unlabeled samples optimized using physics-informed objectives. The model maps PDE parameterizations to target Physics-Informed Neural Networks (PINNs) and can handle linear elliptic, hyperbolic, and parabolic equations in two dimensions with varying source terms, geometries, and mixed Dirichlet/Neumann boundary conditions, including interior boundaries. HyPINO achieves strong zero-shot accuracy on seven benchmark problems from PINN literature, outperforming U-Nets, Poseidon, and Physics-Informed Neural Operators (PINO). Further, we introduce an iterative refinement procedure that treats the residual of the generated PINN as “delta PDE” and performs another forward pass to generate a corrective PINN. Summing their contributions and repeating this process forms an ensemble whose combined solution progressively reduces the error on six benchmarks and achieves a >100x lower $L_2$ loss in the best case, while retaining forward-only inference. Additionally, we evaluate the fine-tuning behavior of PINNs initialized by HyPINO and show that they converge faster and to lower final error than both randomly initialized and Reptile-meta-learned PINNs on five benchmarks, performing on par on the remaining two. Our results highlight the potential of this scalable approach as a foundation for extending neural operators toward solving increasingly complex, nonlinear, and high-dimensional PDE problems. The code and model weights are publicly available at https://github.com/rbischof/hypino.
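The "delta PDE" loop has the same shape as classical iterative refinement. The sketch below applies it to a linear system, with a crude Jacobi-style step standing in for the hypernetwork-generated PINN (an analogy under those assumptions, not the paper's method):

```python
import numpy as np

def refine(solve_approx, A, b, steps=50):
    """Residual-driven refinement: apply an approximate solver,
    compute the residual of the running solution, solve the residual
    ("delta") problem, and add the correction. Repeating this shrinks
    the error while each pass stays forward-only."""
    x = np.zeros_like(b)
    for _ in range(steps):
        r = b - A @ x                  # residual of the current ensemble
        x = x + solve_approx(A, r)     # corrective pass on the delta problem
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
jacobi = lambda A, r: r / np.diag(A)   # deliberately crude inner solver
x = refine(jacobi, A, b)
```

Even though each inner solve is poor, summing the corrections converges, which is the intuition behind the >100x loss reduction reported for the ensemble.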
[540] Safeguarding Graph Neural Networks against Topology Inference Attacks
Jie Fu, Yuan Hong, Zhili Chen, Wendy Hui Wang
Main category: cs.LG
TL;DR: GNNs are vulnerable to topology privacy attacks that can reconstruct training graph structures from black-box model access. Existing edge-level privacy methods are insufficient, so the authors propose PGR - a bi-level optimization defense that generates synthetic training graphs to protect topology while maintaining accuracy.
Details
Motivation: While previous research focused on edge-level privacy in GNNs, topology privacy (confidentiality of the graph's overall structure) remains underexplored despite being a critical threat. GNNs are vulnerable to graph-level inference attacks that can reconstruct training graph structures.
Method: Proposed Private Graph Reconstruction (PGR) - a bi-level optimization framework where a synthetic training graph is iteratively generated using meta-gradients, and the GNN model is concurrently updated based on the evolving graph structure.
Result: Extensive experiments show that PGR significantly reduces topology leakage with minimal impact on model accuracy. The proposed Topology Inference Attacks (TIAs) demonstrate that GNNs are highly susceptible to structure reconstruction attacks, and existing edge-level differential privacy mechanisms fail to mitigate the risk or severely compromise accuracy.
Conclusion: Topology privacy is a critical vulnerability in GNNs that requires specialized defense mechanisms. PGR provides an effective solution that protects graph structure confidentiality while maintaining model utility, addressing a gap in current privacy-preserving GNN approaches.
Abstract: Graph Neural Networks (GNNs) have emerged as powerful models for learning from graph-structured data. However, their widespread adoption has raised serious privacy concerns. While prior research has primarily focused on edge-level privacy, a critical yet underexplored threat lies in topology privacy - the confidentiality of the graph’s overall structure. In this work, we present a comprehensive study on topology privacy risks in GNNs, revealing their vulnerability to graph-level inference attacks. To this end, we propose a suite of Topology Inference Attacks (TIAs) that can reconstruct the structure of a target training graph using only black-box access to a GNN model. Our findings show that GNNs are highly susceptible to these attacks, and that existing edge-level differential privacy mechanisms are insufficient as they either fail to mitigate the risk or severely compromise model accuracy. To address this challenge, we introduce Private Graph Reconstruction (PGR), a novel defense framework designed to protect topology privacy while maintaining model accuracy. PGR is formulated as a bi-level optimization problem, where a synthetic training graph is iteratively generated using meta-gradients, and the GNN model is concurrently updated based on the evolving graph. Extensive experiments demonstrate that PGR significantly reduces topology leakage with minimal impact on model accuracy. Our code is available at https://github.com/JeffffffFu/PGR.
[541] Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy
Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang
Main category: cs.LG
TL;DR: TimeRCD is a new foundation model for time series anomaly detection that uses Relative Context Discrepancy (RCD) pre-training instead of reconstruction, enabling better zero-shot performance by detecting contextual shifts between adjacent time windows.
Details
Motivation: Current foundation models for TSAD rely on reconstruction-based objectives, which suffer from objective mismatch - they struggle to identify subtle anomalies and often misinterpret complex normal patterns, leading to high false positive and negative rates.
Method: TimeRCD uses a Relative Context Discrepancy (RCD) pre-training paradigm where the model is trained to identify anomalies by detecting significant discrepancies between adjacent time windows, implemented with a standard Transformer architecture. A large-scale synthetic corpus with token-level anomaly labels is created for pre-training.
Result: Extensive experiments show TimeRCD significantly outperforms existing general-purpose and anomaly-specific foundation models in zero-shot TSAD across diverse datasets, demonstrating superior anomaly detection capabilities.
Conclusion: The RCD paradigm establishes a new effective path for building robust and generalizable foundation models for time series anomaly detection, overcoming limitations of reconstruction-based approaches.
Abstract: Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains a major challenge. Prevailing foundation models for TSAD predominantly rely on reconstruction-based objectives, which suffer from a fundamental objective mismatch: they struggle to identify subtle anomalies while often misinterpreting complex normal patterns, leading to high rates of false negatives and positives. To overcome these limitations, we introduce TimeRCD, a novel foundation model for TSAD built upon a new pre-training paradigm: Relative Context Discrepancy (RCD). Instead of learning to reconstruct inputs, TimeRCD is explicitly trained to identify anomalies by detecting significant discrepancies between adjacent time windows. This relational approach, implemented with a standard Transformer architecture, enables the model to capture contextual shifts indicative of anomalies that reconstruction-based methods often miss. To facilitate this paradigm, we develop a large-scale, diverse synthetic corpus with token-level anomaly labels, providing the rich supervisory signal necessary for effective pre-training. Extensive experiments demonstrate that TimeRCD significantly outperforms existing general-purpose and anomaly-specific foundation models in zero-shot TSAD across diverse datasets. Our results validate the superiority of the RCD paradigm and establish a new, effective path toward building robust and generalizable foundation models for time series anomaly detection.
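A toy, statistics-based stand-in for the RCD signal (the actual model learns this discrepancy with a Transformer): score each window by how far its mean drifts from the preceding window's, so a contextual shift stands out even when its values are not globally extreme.

```python
import numpy as np

def rcd_scores(series, window=8):
    """Toy relative-context-discrepancy score: for each window, the
    shift of its mean from the preceding window's mean, in units of
    that window's std. Large scores flag contextual anomalies."""
    scores = []
    for i in range(window, len(series) - window + 1, window):
        prev, cur = series[i - window:i], series[i:i + window]
        scores.append(abs(cur.mean() - prev.mean()) / (prev.std() + 1e-8))
    return np.array(scores)

rng = np.random.default_rng(1)
normal = rng.normal(0.0, 1.0, 24)
shifted = rng.normal(10.0, 1.0, 8)        # injected contextual shift
scores = rcd_scores(np.concatenate([normal, shifted]), window=8)
```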
[542] TRUST-FS: Tensorized Reliable Unsupervised Multi-View Feature Selection for Incomplete Data
Minghui Lu, Yanyong Huang, Minbo Ma, Jinyuan Chang, Dongjie Wang, Xiuwen Yi, Tianrui Li
Main category: cs.LG
TL;DR: TRUST-FS is a novel multi-view unsupervised feature selection method that handles incomplete data with missing variables through unified tensor factorization, integrating feature selection, imputation, and view weight learning while ensuring reliable similarity graph construction.
Details
Motivation: Existing MUFS methods have limitations: they can't handle missing variables (only missing views), treat imputation and feature selection separately, and suffer from inaccurate similarity graphs due to missing data.
Method: Proposes TRUST-FS using adaptive-weighted CP tensor decomposition that simultaneously performs feature selection, missing-variable imputation, and view weight learning. Uses Subjective Logic for trustworthy cross-view similarity information to learn reliable similarity graphs.
Result: Comprehensive experiments demonstrate TRUST-FS’s effectiveness and superiority over state-of-the-art methods for incomplete multi-view data with missing variables.
Conclusion: TRUST-FS successfully addresses key challenges in incomplete multi-view feature selection by unifying imputation and feature selection while ensuring reliable similarity graph construction through tensor factorization and subjective logic.
Abstract: Multi-view unsupervised feature selection (MUFS), which selects informative features from multi-view unlabeled data, has attracted increasing research interest in recent years. Although great efforts have been devoted to MUFS, several challenges remain: 1) existing methods for incomplete multi-view data are limited to handling missing views and are unable to address the more general scenario of missing variables, where some features have missing values in certain views; 2) most methods address incomplete data by first imputing missing values and then performing feature selection, treating these two processes independently and overlooking their interactions; 3) missing data can result in an inaccurate similarity graph, which reduces the performance of feature selection. To solve this dilemma, we propose a novel MUFS method for incomplete multi-view data with missing variables, termed Tensorized Reliable UnSupervised mulTi-view Feature Selection (TRUST-FS). TRUST-FS introduces a new adaptive-weighted CP decomposition that simultaneously performs feature selection, missing-variable imputation, and view weight learning within a unified tensor factorization framework. By utilizing Subjective Logic to acquire trustworthy cross-view similarity information, TRUST-FS facilitates learning a reliable similarity graph, which subsequently guides feature selection and imputation. Comprehensive experimental results demonstrate the effectiveness and superiority of our method over state-of-the-art methods.
[543] Evolutionary Profiles for Protein Fitness Prediction
Jigang Fan, Xiaoran Jiao, Shengdong Lin, Zhanming Liang, Weian Mao, Chenchen Jing, Hao Chen, Chunhua Shen
Main category: cs.LG
TL;DR: EvoIF is a lightweight protein fitness prediction model that combines within-family evolutionary profiles from homologs with cross-family structural-evolutionary constraints from inverse folding, achieving state-of-the-art performance with minimal training data.
Details
Motivation: Protein fitness prediction is limited by small experimental datasets relative to vast sequence space. Existing protein language models show strong zero-shot performance but lack explicit integration of evolutionary and structural constraints.
Method: EvoIF integrates two evolutionary signals: within-family profiles from retrieved homologs and cross-family structural-evolutionary constraints from inverse folding logits. It fuses sequence-structure representations via a compact transition block for calibrated log-odds scoring.
Result: On ProteinGym (217 assays, >2.5M mutants), EvoIF achieves state-of-the-art performance using only 0.15% of the training data and fewer parameters than large models. Within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths.
Conclusion: EvoIF provides an efficient framework for protein fitness prediction by unifying evolutionary and structural constraints, demonstrating that lightweight models can achieve competitive performance through strategic integration of complementary information sources.
Abstract: Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence-structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The codes will be made publicly available at https://github.com/aim-uofa/EvoIF.
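In its simplest form, log-odds fitness scoring compares profile probabilities of the mutant and wild-type residues at a position (toy hand-built profile below; EvoIF derives these columns from homologs and inverse-folding logits):

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"             # 20 standard amino acids

def log_odds_score(profile, wt_seq, pos, mut_aa):
    """Log-odds fitness proxy: log p(mutant) - log p(wild type) under
    a per-position amino-acid profile. Positive scores suggest the
    substitution is evolutionarily tolerated."""
    col = profile[pos]
    return float(np.log(col[AAS.index(mut_aa)])
                 - np.log(col[AAS.index(wt_seq[pos])]))

# toy 3-position profile: uniform, except position 2 strongly prefers K
profile = np.full((3, 20), 1 / 20)
profile[2] = 0.01
profile[2, AAS.index("K")] = 1 - 19 * 0.01
score_to_k = log_odds_score(profile, "ACD", 2, "K")   # D2K: favored
score_to_w = log_odds_score(profile, "ACD", 2, "W")   # D2W: neutral here
```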
[544] Task-Agnostic Federated Continual Learning via Replay-Free Gradient Projection
Seohyeon Cha, Huancheng Chen, Haris Vikalo
Main category: cs.LG
TL;DR: FedProTIP is a federated continual learning framework that uses gradient projection to prevent catastrophic forgetting and includes task identity prediction for task-agnostic inference.
Details
Motivation: Address catastrophic forgetting in federated continual learning settings where data heterogeneity, communication constraints, and privacy concerns exacerbate the forgetting problem.
Method: Projects client updates onto orthogonal complement of previous task representations to reduce interference, and uses lightweight task identity prediction with core bases from prior tasks.
Result: Significantly outperforms state-of-the-art methods in average accuracy, especially when task identities are unknown.
Conclusion: FedProTIP effectively mitigates forgetting in federated continual learning through gradient projection and task identity prediction, achieving superior performance in challenging task-agnostic scenarios.
Abstract: Federated continual learning (FCL) enables distributed client devices to learn from streaming data across diverse and evolving tasks. A major challenge to continual learning, catastrophic forgetting, is exacerbated in decentralized settings by the data heterogeneity, constrained communication and privacy concerns. We propose Federated gradient Projection-based Continual Learning with Task Identity Prediction (FedProTIP), a novel FCL framework that mitigates forgetting by projecting client updates onto the orthogonal complement of the subspace spanned by previously learned representations of the global model. This projection reduces interference with earlier tasks and preserves performance across the task sequence. To further address the challenge of task-agnostic inference, we incorporate a lightweight mechanism that leverages core bases from prior tasks to predict task identity and dynamically adjust the global model’s outputs. Extensive experiments across standard FCL benchmarks demonstrate that FedProTIP significantly outperforms state-of-the-art methods in average accuracy, particularly in settings where task identities are a priori unknown.
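The core projection step can be sketched directly (a minimal sketch assuming previous-task representations are stored as basis columns; the task-identity mechanism is omitted):

```python
import numpy as np

def project_update(g, basis):
    """Project a client update g onto the orthogonal complement of the
    subspace spanned by `basis` (columns = representations retained
    from previous tasks), so the update cannot interfere with them."""
    U, _ = np.linalg.qr(basis)         # orthonormalize the task subspace
    return g - U @ (U.T @ g)

basis = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])         # earlier tasks span the xy-plane
g = np.array([0.3, -0.7, 0.9])         # raw client update
g_proj = project_update(g, basis)      # only the z-component survives
```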
[545] Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options
Joongkyu Lee, Seouh-won Yi, Min-hwan Oh
Main category: cs.LG
TL;DR: M-AUPO algorithm for online PbRL with ranking feedback achieves improved sample efficiency using multiple comparisons via Plackett-Luce model, with performance scaling with subset size.
Details
Motivation: Existing PbRL works fail to leverage richer ranking feedback effectively, with performance not improving or even deteriorating as feedback length increases, despite the availability of more information.
Method: Proposed M-AUPO algorithm that selects multiple actions by maximizing average uncertainty within offered subsets using Plackett-Luce model for ranking feedback.
Result: Achieves a suboptimality gap of $\tilde{O}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where larger subsets directly improve performance, avoiding exponential dependence on the unknown parameter norm.
Conclusion: First theoretical result in PbRL with ranking feedback showing explicit improvement in sample efficiency as subset size increases, with near-matching lower bound established.
Abstract: We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL’s recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023, Mukherjee et al., 2024, Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{O}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter’s norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega\left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
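The Plackett-Luce model underlying the ranking feedback assigns each full ranking a probability via sequential softmax choices, as in this sketch (the model only; M-AUPO's subset-selection rule is not reproduced here):

```python
import numpy as np
from itertools import permutations

def pl_ranking_prob(scores, ranking):
    """Plackett-Luce probability of a full ranking (indices, best
    first): repeatedly select the next item with probability
    proportional to its exp-score among items not yet ranked."""
    s = np.exp(np.asarray(scores, dtype=float))
    prob, remaining = 1.0, list(range(len(s)))
    for idx in ranking:
        prob *= s[idx] / s[remaining].sum()
        remaining.remove(idx)
    return prob

scores = [1.0, 0.5, -0.2]                       # latent utilities
total = sum(pl_ranking_prob(scores, p) for p in permutations(range(3)))
p_best = pl_ranking_prob(scores, (0, 1, 2))     # score-sorted ranking
```

Probabilities over all rankings sum to one, and the ranking sorted by decreasing score is the most likely.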
[546] Optimism as Risk-Seeking in Multi-Agent Reinforcement Learning
Runyu Zhang, Na Li, Asuman Ozdaglar, Jeff Shamma, Gioele Zardini
Main category: cs.LG
TL;DR: A principled framework that unifies risk-sensitive learning and optimism in cooperative multi-agent reinforcement learning, showing that risk-seeking objectives can be interpreted as optimism and improving coordination over baseline methods.
Details
Motivation: Existing optimistic methods in cooperative MARL are typically heuristic and lack theoretical grounding, while risk-averse approaches often lead to suboptimal equilibria. There's a need for a principled framework that connects risk sensitivity with optimism.
Method: Proposed optimistic value functions that formalize optimism as divergence-penalized risk-seeking evaluations, derived a policy-gradient theorem for these functions, and developed decentralized optimistic actor-critic algorithms.
Result: Empirical results on cooperative benchmarks demonstrate that risk-seeking optimism consistently improves coordination over both risk-neutral baselines and heuristic optimistic methods.
Conclusion: The framework successfully unifies risk-sensitive learning and optimism, offering a theoretically grounded and practically effective approach to cooperation in MARL.
Abstract: Risk sensitivity has become a central theme in reinforcement learning (RL), where convex risk measures and robust formulations provide principled ways to model preferences beyond expected return. Recent extensions to multi-agent RL (MARL) have largely emphasized the risk-averse setting, prioritizing robustness to uncertainty. In cooperative MARL, however, such conservatism often leads to suboptimal equilibria, and a parallel line of work has shown that optimism can promote cooperation. Existing optimistic methods, though effective in practice, are typically heuristic and lack theoretical grounding. Building on the dual representation for convex risk measures, we propose a principled framework that interprets risk-seeking objectives as optimism. We introduce optimistic value functions, which formalize optimism as divergence-penalized risk-seeking evaluations. Building on this foundation, we derive a policy-gradient theorem for optimistic value functions, including explicit formulas for the entropic risk/KL-penalty setting, and develop decentralized optimistic actor-critic algorithms that implement these updates. Empirical results on cooperative benchmarks demonstrate that risk-seeking optimism consistently improves coordination over both risk-neutral baselines and heuristic optimistic methods. Our framework thus unifies risk-sensitive learning and optimism, offering a theoretically grounded and practically effective approach to cooperation in MARL.
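The entropic risk/KL-penalty setting mentioned above has a closed form, (1/β) log E[exp(βR)], whose sign of β flips the evaluation between risk-seeking (optimistic) and risk-averse (a standard formula, shown here on a toy return distribution):

```python
import numpy as np

def entropic_value(returns, beta):
    """Entropic risk evaluation (1/beta) * log E[exp(beta * R)]:
    beta > 0 upweights good outcomes (risk-seeking / optimistic),
    beta < 0 upweights bad ones (risk-averse), and beta -> 0
    recovers the expected return."""
    returns = np.asarray(returns, dtype=float)
    return float(np.log(np.mean(np.exp(beta * returns))) / beta)

returns = np.array([0.0, 0.0, 10.0])     # a rare high-reward outcome
optimistic = entropic_value(returns, beta=1.0)
pessimistic = entropic_value(returns, beta=-1.0)
```

The optimistic evaluation exceeds the mean return, which is how it can pull cooperative agents toward high-reward equilibria that a risk-neutral or risk-averse critic would discount.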
[547] Normalization in Attention Dynamics
Nikita Karagodin, Shu Ge, Yury Polyanskiy, Philippe Rigollet
Main category: cs.LG
TL;DR: Normalization schemes act as speed regulators for token representations in transformers, influencing clustering dynamics and preventing collapse.
Details
Motivation: To understand how different normalization schemes affect token representations in deep transformers and provide a unified framework for comparing them.
Method: Model token representation evolution as interacting particles on a sphere and analyze normalization as speed regulation across schemes like Post-LN, Pre-LN, Mix-LN, Peri-LN, and nGPT.
Result: The framework reveals how different normalization schemes shape token representations across layers and identifies Peri-LN as particularly effective.
Conclusion: Normalization schemes regulate the speed of token representation evolution, with Peri-LN emerging as a superior choice for managing clustering dynamics and preventing representation collapse.
Abstract: We study the effect of normalization schemes on token representations in deep transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of several schemes – including Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT – revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.
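The speed-regulation picture admits a compact toy simulation (all sizes and constants below are our own, not the paper's): tokens evolve as mutually attracting particles, and renormalizing to the unit sphere after every update is what keeps the dynamics bounded while representations cluster.

```python
import math, random

random.seed(0)
n, d, beta, tau, steps = 12, 6, 1.0, 0.1, 200  # toy sizes, not from the paper

def normalize(v):
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_cos(X):
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(dot(X[i], X[j]) for i, j in pairs) / len(pairs)

# Tokens start as random points on the unit sphere.
X = [normalize([random.gauss(0, 1) for _ in range(d)]) for _ in range(n)]
before = mean_cos(X)

for _ in range(steps):
    newX = []
    for i in range(n):
        w = [math.exp(beta * dot(X[i], X[j])) for j in range(n)]  # attention weights
        Z = sum(w)
        drift = [tau * sum(w[j] * X[j][k] for j in range(n)) / Z for k in range(d)]
        # Normalization acts as speed regulation: each step is projected back onto
        # the sphere, so only the direction of the attention drift matters.
        newX.append(normalize([X[i][k] + drift[k] for k in range(d)]))
    X = newX

after = mean_cos(X)
print(before, after)  # mean pairwise cosine similarity rises as tokens cluster
```

The renormalization step here stands in for the layer norms the paper compares; which scheme applies it, and where, is exactly what distinguishes Post-LN, Pre-LN, and the other variants.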
[548] AXIS: Explainable Time Series Anomaly Detection with Large Language Models
Tian Lan, Hao Duong Le, Jinbo Li, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang
Main category: cs.LG
TL;DR: AXIS is a framework that conditions frozen LLMs for explainable time-series anomaly detection by providing three complementary hints instead of direct time-to-text serialization, achieving better explanations and competitive detection accuracy.
Details
Motivation: Current LLM-based approaches for explainable time-series anomaly detection struggle with processing continuous signals and lack contextual grounding between time series and text modalities.
Method: AXIS enriches LLM input with three hints: symbolic numeric hint for numerical grounding, context-integrated step-aligned hint from pretrained time-series encoder for fine-grained dynamics, and task-prior hint for global anomaly characteristics.
Result: AXIS yields significantly higher quality explanations and achieves competitive detection accuracy compared to general-purpose LLMs, specialized time-series LLMs, and time-series Vision Language Models.
Conclusion: The AXIS framework effectively bridges the modality gap for explainable time-series anomaly detection by conditioning LLMs with complementary hints rather than direct serialization.
Abstract: Time-series anomaly detection (TSAD) increasingly demands explanations that articulate not only if an anomaly occurred, but also what pattern it exhibits and why it is anomalous. Leveraging the impressive explanatory capabilities of Large Language Models (LLMs), recent works have attempted to treat time series as text for explainable TSAD. However, this approach faces a fundamental challenge: LLMs operate on discrete tokens and struggle to directly process long, continuous signals. Consequently, naive time-to-text serialization suffers from a lack of contextual grounding and representation alignment between the two modalities. To address this gap, we introduce AXIS, a framework that conditions a frozen LLM for nuanced time-series understanding. Instead of direct serialization, AXIS enriches the LLM’s input with three complementary hints derived from the series: (i) a symbolic numeric hint for numerical grounding, (ii) a context-integrated, step-aligned hint distilled from a pretrained time-series encoder to capture fine-grained dynamics, and (iii) a task-prior hint that encodes global anomaly characteristics. Furthermore, to facilitate robust evaluation of explainability, we introduce a new benchmark featuring multi-format questions and rationales that supervise contextual grounding and pattern-level semantics. Extensive experiments, including both LLM-based and human evaluations, demonstrate that AXIS yields explanations of significantly higher quality and achieves competitive detection accuracy compared to general-purpose LLMs, specialized time-series LLMs, and time-series Vision Language Models.
[549] Federated Learning with Gramian Angular Fields for Privacy-Preserving ECG Classification on Heterogeneous IoT Devices
Youssef Elmir, Yassine Himeur, Abbes Amira
Main category: cs.LG
TL;DR: Federated learning framework for ECG classification using GAF image transformation, achieving 95.18% accuracy while preserving privacy across IoT devices.
Details
Motivation: To enable privacy-preserving ECG classification in IoT healthcare environments by keeping sensitive medical data local to devices while maintaining high accuracy.
Method: Transform 1D ECG signals into 2D Gramian Angular Field (GAF) images and use Convolutional Neural Networks (CNNs) within a federated learning framework deployed across heterogeneous IoT devices (server, laptop, Raspberry Pi 4).
Result: FL-GAF model achieves 95.18% classification accuracy in multi-client setup, significantly outperforming single-client baseline in both accuracy and training time, while maintaining efficient resource utilization and communication overhead.
Conclusion: The framework demonstrates potential for lightweight, privacy-preserving AI in IoT healthcare monitoring, supporting scalable and secure edge deployments in smart health systems.
Abstract: This study presents a federated learning (FL) framework for privacy-preserving electrocardiogram (ECG) classification in Internet of Things (IoT) healthcare environments. By transforming 1D ECG signals into 2D Gramian Angular Field (GAF) images, the proposed approach enables efficient feature extraction through Convolutional Neural Networks (CNNs) while ensuring that sensitive medical data remain local to each device. This work is among the first to experimentally validate GAF-based federated ECG classification across heterogeneous IoT devices, quantifying both performance and communication efficiency. To evaluate feasibility in realistic IoT settings, we deployed the framework across a server, a laptop, and a resource-constrained Raspberry Pi 4, reflecting edge-cloud integration in IoT ecosystems. Experimental results demonstrate that the FL-GAF model achieves a high classification accuracy of 95.18% in a multi-client setup, significantly outperforming a single-client baseline in both accuracy and training time. Despite the added computational complexity of GAF transformations, the framework maintains efficient resource utilization and communication overhead. These findings highlight the potential of lightweight, privacy-preserving AI for IoT-based healthcare monitoring, supporting scalable and secure edge deployments in smart health systems.
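The GAF step is simple enough to sketch on its own; a minimal Gramian Angular Summation Field over made-up ECG-like samples (no CNN or federated machinery):

```python
import math

def gramian_angular_field(series):
    """Gramian Angular Summation Field: rescale the series to [-1, 1], read each
    value as an angle phi = arccos(x), and form the image G[i][j] = cos(phi_i + phi_j)."""
    lo, hi = min(series), max(series)
    scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    return [[math.cos(pi_ + pj) for pj in phi] for pi_ in phi]

beat = [0.0, 0.2, 1.0, 0.3, 0.1, -0.2, 0.0, 0.1]  # made-up ECG-like samples
img = gramian_angular_field(beat)
print(len(img), len(img[0]))  # 8 8 -- a 2D "image" a CNN can consume
```

The resulting matrix is symmetric and preserves temporal order along its diagonal, which is what lets an ordinary 2D CNN extract features from a 1D signal.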
[550] Variational Diffusion Unlearning: A Variational Inference Framework for Unlearning in Diffusion Models under Data Constraints
Subhodip Panda, MS Varun, Shreyans Jain, Sarthak Kumar Maharana, Prathosh A. P
Main category: cs.LG
TL;DR: VDU is a computationally efficient machine unlearning method for diffusion models that works in data-constrained settings, requiring only a subset of undesired training data to prevent generation of unwanted outputs while maintaining image quality.
Details
Motivation: To enable safe deployment of diffusion models by preventing generation of undesired, violent, and obscene outputs, especially in data-constrained settings where full training datasets are inaccessible.
Method: Variational Diffusion Unlearning (VDU) uses variational inference with a loss function containing plasticity inducer (reduces log-likelihood of undesired data) and stability regularizer (preserves image generation quality by regularizing in parameter space).
Result: Comprehensive experiments show effectiveness for both class unlearning (removing specific classes from MNIST, CIFAR-10, tinyImageNet) and feature unlearning (removing high-level features from Stable Diffusion).
Conclusion: VDU provides an effective solution for machine unlearning in diffusion models under data-constrained conditions, successfully preventing generation of undesired outputs while maintaining generation quality.
Abstract: For a responsible and safe deployment of diffusion models in various domains, regulating the generated outputs from these models is desirable because such models could generate undesired, violent, and obscene outputs. To tackle this problem, recent works use machine unlearning methodology to forget training data points containing these undesired features from pre-trained generative models. However, these methods proved to be ineffective in data-constrained settings where the whole training dataset is inaccessible. Thus, the principal objective of this work is to propose a machine unlearning methodology that can prevent the generation of outputs containing undesired features from a pre-trained diffusion model in such a data-constrained setting. Our proposed method, termed Variational Diffusion Unlearning (VDU), is a computationally efficient method that only requires access to a subset of training data containing undesired features. Our approach is inspired by the variational inference framework with the objective of minimizing a loss function consisting of two terms: a plasticity inducer and a stability regularizer. The plasticity inducer reduces the log-likelihood of the undesired training data points, while the stability regularizer, essential for preventing loss of image generation quality, regularizes the model in parameter space. We validate the effectiveness of our method through comprehensive experiments for both class unlearning and feature unlearning. For class unlearning, we unlearn some user-identified classes from MNIST, CIFAR-10, and tinyImageNet datasets from a pre-trained unconditional denoising diffusion probabilistic model (DDPM). Similarly, for feature unlearning, we unlearn the generation of certain high-level features from a pre-trained Stable Diffusion model.
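A 1-D caricature of the two-term objective (our own illustrative construction, not the paper's model): the plasticity inducer is the log-likelihood of an undesired point under a unit-variance Gaussian with mean theta, and the stability regularizer anchors theta to its pretrained value theta0.

```python
# Illustrative 1-D sketch: model p_theta(x) = N(x; theta, 1). The "plasticity
# inducer" is the log-likelihood of an undesired training point (minimized, so
# theta moves away from it); the "stability regularizer" keeps theta near the
# pretrained value theta0 in parameter space. All constants are made up.
theta0, x_forget, lam, lr = 0.0, 1.0, 1.0, 0.05

def grad(theta):
    # loss = -0.5 * (x_forget - theta)**2  +  lam * (theta - theta0)**2
    plasticity = x_forget - theta          # d/dtheta of the log-likelihood term
    stability = 2 * lam * (theta - theta0) # d/dtheta of the anchor term
    return plasticity + stability

theta = theta0
for _ in range(500):
    theta -= lr * grad(theta)

print(theta)  # converges to -1.0: pushed away from x_forget, anchored by theta0
```

The balance is visible in the fixed point: without the regularizer theta would diverge from x_forget without bound; without the plasticity term it would stay at theta0 and forget nothing.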
[551] Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks
Diego García-Pérez, Emilio Parrado-Hernández, John Shawe-Taylor
Main category: cs.LG
TL;DR: This paper presents four theoretical contributions that improve PAC-Bayes risk certificates for neural networks, including tighter bounds on KL divergence, implicit differentiation methodology, optimization of non-differentiable objectives, and achieves first non-vacuous generalization bounds on CIFAR-10.
Details
Motivation: To improve the usability and tightness of risk certificates for neural networks based on PAC-Bayes bounds, addressing limitations in existing methods for deriving practical generalization guarantees.
Method: Develops four theoretical contributions: (1-2) two bounds on the KL divergence between Bernoulli distributions that yield tighter risk bounds across different empirical-risk ranges, (3) an implicit-differentiation methodology for optimizing PAC-Bayesian risk certificates inside the loss function, and (4) a method to optimize bounds on non-differentiable objectives such as the 0-1 loss, with empirical evaluation on MNIST and CIFAR-10.
Result: Achieves the first non-vacuous generalization bounds on CIFAR-10 for neural networks, demonstrating practical improvements in risk certificate tightness and usability across different empirical risk ranges.
Conclusion: The proposed theoretical contributions significantly enhance the practical applicability of PAC-Bayes risk certificates for neural networks, enabling tighter generalization bounds and optimization of non-differentiable objectives, with empirical validation on standard benchmarks.
Abstract: This paper presents four theoretical contributions that improve the usability of risk certificates for neural networks based on PAC-Bayes bounds. First, two bounds on the KL divergence between Bernoulli distributions enable the derivation of the tightest explicit bounds on the true risk of classifiers across different ranges of empirical risk. The paper next focuses on the formalization of an efficient methodology based on implicit differentiation that enables the introduction of the optimization of PAC-Bayesian risk certificates inside the loss/objective function used to fit the network/model. The last contribution is a method to optimize bounds on non-differentiable objectives such as the 0-1 loss. These theoretical contributions are complemented with an empirical evaluation on the MNIST and CIFAR-10 datasets. In fact, this paper presents the first non-vacuous generalization bounds on CIFAR-10 for neural networks. Code to reproduce all experiments is available at github.com/Diegogpcm/pacbayesgradients.
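The binary-KL machinery these certificates rest on can be sketched directly (the complexity value c below is illustrative): `kl_inverse` is the standard bisection that turns an empirical risk and a PAC-Bayes complexity term into a bound on the true risk.

```python
import math

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q || p) between Bernoulli means q and p."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, c):
    """Largest p >= q with kl(q || p) <= c, by bisection (kl is increasing in p
    for p >= q). This is the classical inversion behind PAC-Bayes-kl certificates."""
    lo, hi = q, 1.0 - 1e-12
    for _ in range(60):
        mid = (lo + hi) / 2
        if kl_bernoulli(q, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

# Illustrative numbers: empirical risk 0.05, complexity term c = 0.02.
bound = kl_inverse(0.05, 0.02)
print(bound)  # tighter than the Pinsker-style bound 0.05 + sqrt(0.02 / 2) = 0.15
```

Explicit bounds on `kl_bernoulli`, of the kind the paper derives, replace this numerical inversion with closed-form expressions whose tightness depends on the empirical-risk regime.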
[552] Alternative Fairness and Accuracy Optimization in Criminal Justice
Shaolong Wu, James Blume, Geshi Yeung
Main category: cs.LG
TL;DR: The paper proposes a modified group fairness approach that minimizes weighted error loss while keeping false negative rate differences within tolerance, addressing fairness conflicts in criminal justice risk assessment.
Details
Motivation: Algorithmic fairness concepts remain unsettled in criminal justice contexts, with conflicts between different fairness definitions that need practical resolution.
Method: Modified group fairness approach using weighted error loss minimization with constrained false negative rate differences, plus a deployment framework with need-based decisions, transparency, and narrowly tailored solutions.
Result: The approach makes solutions easier to find, can improve predictive accuracy, and highlights ethical choices in error cost assignment.
Conclusion: The proposed framework links technical design to legitimacy and provides actionable guidance for agencies using risk assessment tools, addressing key critiques like biased data and subgroup constraints.
Abstract: Algorithmic fairness has grown rapidly as a research area, yet key concepts remain unsettled, especially in criminal justice. We review group, individual, and process fairness and map the conditions under which they conflict. We then develop a simple modification to standard group fairness. Rather than exact parity across protected groups, we minimize a weighted error loss while keeping differences in false negative rates within a small tolerance. This makes solutions easier to find, can raise predictive accuracy, and surfaces the ethical choice of error costs. We situate this proposal within three classes of critique: biased and incomplete data, latent affirmative action, and the explosion of subgroup constraints. Finally, we offer a practical framework for deployment in public decision systems built on three pillars: need-based decisions, transparency and accountability, and narrowly tailored definitions and solutions. Together, these elements link technical design to legitimacy and provide actionable guidance for agencies that use risk assessment and related tools.
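A hypothetical sketch of the relaxed parity rule (data, cost weights, and tolerance are all made up): choose per-group score thresholds that minimize a weighted error loss while keeping the false-negative-rate gap within a small tolerance, rather than forcing exact parity.

```python
# Toy constrained search: minimize weighted error subject to |FNR_A - FNR_B| <= tol.
def fnr(scores, labels, t):
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(1 for s in positives if s < t) / len(positives)

def weighted_error(scores, labels, t, w_fn=2.0, w_fp=1.0):
    err = 0.0
    for s, y in zip(scores, labels):
        if y == 1 and s < t:
            err += w_fn          # the error-cost weights are an ethical choice
        elif y == 0 and s >= t:
            err += w_fp
    return err

groups = {
    "A": ([0.10, 0.40, 0.60, 0.80, 0.90, 0.30], [0, 0, 1, 1, 1, 1]),
    "B": ([0.20, 0.50, 0.70, 0.35, 0.85, 0.15], [0, 1, 1, 0, 1, 0]),
}
grid = [i / 20 for i in range(1, 20)]
tol = 0.1  # allowed difference in false negative rates

best = None
for ta in grid:
    for tb in grid:
        if abs(fnr(*groups["A"], ta) - fnr(*groups["B"], tb)) <= tol:
            loss = weighted_error(*groups["A"], ta) + weighted_error(*groups["B"], tb)
            if best is None or loss < best[0]:
                best = (loss, ta, tb)

print(best)  # (minimal weighted loss, threshold for group A, threshold for group B)
```

Relaxing exact parity to a tolerance band is what enlarges the feasible set here: the constraint admits many threshold pairs, and the weighted loss picks among them.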
[553] Policy Transfer for Continuous-Time Reinforcement Learning: A (Rough) Differential Equation Approach
Xin Guo, Zijiu Lyu
Main category: cs.LG
TL;DR: This paper provides the first theoretical proof of policy transfer for continuous-time reinforcement learning, showing that optimal policies from one RL problem can initialize near-optimal policies in related problems while maintaining convergence rates.
Details
Motivation: To establish theoretical foundations for policy transfer in continuous-time RL, addressing the gap in understanding how optimal policies from one problem can accelerate learning in related problems.
Method: Analyzes two classes of continuous-time RL problems: linear-quadratic systems with entropy regularization using Gaussian structure and Riccati equation stability, and non-linear bounded systems using rough path theory for diffusion SDE stability.
Result: Proves policy transfer works in continuous-time RL, proposes a novel policy learning algorithm for LQRs achieving global linear and local super-linear convergence, and derives stability results for score-based diffusion models.
Conclusion: Policy transfer is theoretically valid for continuous-time RL, enabling efficient initialization of learning in related problems while preserving convergence properties, with applications extending to diffusion models.
Abstract: This paper studies policy transfer, one of the well-known transfer learning techniques adopted in large language models, for two classes of continuous-time reinforcement learning problems. In the first class of continuous-time linear-quadratic systems with Shannon’s entropy regularization (a.k.a. LQRs), we fully exploit the Gaussian structure of their optimal policy and the stability of their associated Riccati equations. In the second class where the system has possibly non-linear and bounded dynamics, the key technical component is the stability of diffusion SDEs which is established by invoking the rough path theory. Our work provides the first theoretical proof of policy transfer for continuous-time RL: an optimal policy learned for one RL problem can be used to initialize the search for a near-optimal policy in a closely related RL problem, while maintaining the convergence rate of the original algorithm. To illustrate the benefit of policy transfer for RL, we propose a novel policy learning algorithm for continuous-time LQRs, which achieves global linear convergence and local super-linear convergence. As a byproduct of our analysis, we derive the stability of a concrete class of continuous-time score-based diffusion models via their connection with LQRs.
[554] Disentangled Representation Learning via Modular Compositional Bias
Whie Jung, Dong Hoon Lee, Seunghoon Hong
Main category: cs.LG
TL;DR: Proposes compositional bias as a modular inductive bias for disentangled representation learning, enabling disentanglement of attributes, objects, or both by simply adjusting mixing strategies without changing objectives or architectures.
Details
Motivation: Current DRL methods require redesigning architectures or objectives for different factors of variation, creating significant overhead when novel factors don't align with prior assumptions like statistical independence or spatial exclusivity.
Method: Introduces compositional bias that randomly remixes latents according to factor-specific rules (mixing strategies) and uses two objectives: prior loss for realistic remix decoding and compositional consistency loss for aligning composite images with composite latents.
Result: Achieves competitive performance in both attribute and object disentanglement, and uniquely accomplishes joint disentanglement of global style and objects.
Conclusion: The proposed framework enables flexible disentanglement across different factor types by simply adjusting mixing strategies, providing a unified approach without requiring architectural or objective modifications.
Abstract: Recent disentangled representation learning (DRL) methods heavily rely on factor-specific strategies (either learning objectives for attributes or model architectures for objects) to embed inductive biases. Such divergent approaches result in significant overhead when novel factors of variation do not align with prior assumptions, such as statistical independence or spatial exclusivity, or when multiple factors coexist, as practitioners must redesign architectures or objectives. To address this, we propose a compositional bias, a modular inductive bias decoupled from both objectives and architectures. Our key insight is that different factors obey distinct recombination rules in the data distribution: global attributes are mutually exclusive, e.g., a face has one nose, while objects share a common support (any subset of objects can co-exist). We therefore randomly remix latents according to factor-specific rules, i.e., a mixing strategy, and force the encoder to discover whichever factor structure the mixing strategy reflects through two complementary objectives: (i) a prior loss that ensures every remix decodes into a realistic image, and (ii) the compositional consistency loss introduced by Wiedemer et al. (arXiv:2310.05327), which aligns each composite image with its corresponding composite latent. Under this general framework, simply adjusting the mixing strategy enables disentanglement of attributes, objects, and even both, without modifying the objectives or architectures. Extensive experiments demonstrate that our method shows competitive performance in both attribute and object disentanglement, and uniquely achieves joint disentanglement of global style and objects. Code is available at https://github.com/whieya/Compositional-DRL.
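A toy reading of the two recombination rules (the latent representations here are simplified by us to labeled lists): mutually exclusive global attributes are remixed by swapping a whole slot, while objects, which share a common support, are remixed by keeping an arbitrary subset.

```python
import random
random.seed(1)

# Global attributes are mutually exclusive: a remix swaps one whole attribute slot,
# so the result still has exactly one nose, one pair of eyes, etc.
def remix_attributes(z1, z2):
    k = random.randrange(len(z1))           # pick one attribute slot
    return z1[:k] + [z2[k]] + z1[k + 1:]    # replace it wholesale

# Objects share a common support: any subset of the combined objects can co-exist.
def remix_objects(objects1, objects2):
    pool = objects1 + objects2
    return [o for o in pool if random.random() < 0.5]

za = ["noseA", "eyesA", "mouthA"]
zb = ["noseB", "eyesB", "mouthB"]
print(remix_attributes(za, zb))
print(remix_objects(["cube", "ball"], ["cone"]))
```

In the paper the remixed latents are then decoded and trained against the prior and compositional-consistency losses; the point of the sketch is only that the mixing rule, not the objective or architecture, encodes which factor structure is being sought.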
[555] Autoencoding Dynamics: Topological Limitations and Capabilities
Matthew D. Kvalheim, Eduardo D. Sontag
Main category: cs.LG
TL;DR: The paper analyzes topological limitations and capabilities of autoencoders for data manifolds and their application to dynamical systems.
Details
Motivation: To understand the fundamental topological constraints and possibilities when constructing autoencoders that map data manifolds to latent spaces, particularly for dynamical systems with invariant manifolds.
Method: Theoretical analysis of autoencoder topology, examining continuous encoder-decoder pairs and their approximation of identity maps on data manifolds.
Result: Identifies various topological limitations and capabilities inherent in autoencoder design, including constraints on the round-trip map D∘E approximating the identity on M.
Conclusion: Autoencoders have specific topological constraints that affect their ability to represent data manifolds, with implications for encoding dynamical systems with invariant manifolds.
Abstract: Given a “data manifold” $M\subset \mathbb{R}^n$ and “latent space” $\mathbb{R}^\ell$, an autoencoder is a pair of continuous maps consisting of an “encoder” $E\colon \mathbb{R}^n\to \mathbb{R}^\ell$ and “decoder” $D\colon \mathbb{R}^\ell\to \mathbb{R}^n$ such that the “round trip” map $D\circ E$ is as close as possible to the identity map $\mbox{id}_M$ on $M$. We present various topological limitations and capabilities inherent to the search for an autoencoder, and describe capabilities for autoencoding dynamical systems having $M$ as an invariant manifold.
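One concrete instance of such a topological limitation, with an encoder/decoder pair of our own choosing for illustration: take M to be the unit circle in R^2 and the latent space to be R^1. Any continuous encoder collapses two distinct points of the circle, so the round trip D∘E must move some point of M far from itself.

```python
import math

def encode(p):                       # E: R^2 -> R^1, project to the first coordinate
    return p[0]

def decode(z):                       # D: R^1 -> R^2, lift onto the upper half-circle
    z = max(-1.0, min(1.0, z))
    return (z, math.sqrt(1.0 - z * z))

worst = 0.0
for k in range(360):
    t = 2 * math.pi * k / 360
    p = (math.cos(t), math.sin(t))   # a point on the data manifold M
    q = decode(encode(p))            # its round trip D(E(p))
    worst = max(worst, math.dist(p, q))

print(worst)  # close to 2: the bottom of the circle is sent to the top
```

The failure is not a defect of this particular E and D: no continuous pair with a one-dimensional latent space can do better than a bounded-away-from-zero worst-case error on the circle, which is the kind of obstruction the paper formalizes.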
[556] Precipitation nowcasting of satellite data using physically-aligned neural networks
Antônio Catão, Melvin Poveda, Leonardo Voltarelli, Paulo Orenstein
Main category: cs.LG
TL;DR: TUPANN is a satellite-only precipitation nowcasting model that decomposes forecasts into physically meaningful components (motion, intensity, advection) and achieves state-of-the-art performance across multiple climates at 10-180min lead times.
Details
Motivation: Current short-term precipitation forecasts rely heavily on dense weather-radar networks, which limits operational value in regions most exposed to climate extremes where radar coverage is sparse.
Method: TUPANN uses a variational encoder-decoder to infer motion and intensity fields from satellite imagery under optical-flow supervision, a lead-time-conditioned MaxViT to evolve the latent state, and a differentiable advection operator to reconstruct future frames.
Result: TUPANN achieves best or second-best skill in most settings compared to optical-flow, deep learning and hybrid baselines, with pronounced gains at higher precipitation thresholds (4-64 mm/h). Training on multiple cities improves performance, and cross-city experiments show modest degradation with occasional gains for rare heavy-rain regimes.
Conclusion: Physically aligned learning can provide skillful, transferable and global precipitation nowcasts using satellite data only, enabling operational forecasting in radar-sparse regions vulnerable to climate extremes.
Abstract: Accurate short-term precipitation forecasts predominantly rely on dense weather-radar networks, limiting operational value in places most exposed to climate extremes. We present TUPANN (Transferable and Universal Physics-Aligned Nowcasting Network), a satellite-only model trained on GOES-16 RRQPE. Unlike most deep learning models for nowcasting, TUPANN decomposes the forecast into physically meaningful components: a variational encoder-decoder infers motion and intensity fields from recent imagery under optical-flow supervision, a lead-time-conditioned MaxViT evolves the latent state, and a differentiable advection operator reconstructs future frames. We evaluate TUPANN on both GOES-16 and IMERG data, in up to four distinct climates (Rio de Janeiro, Manaus, Miami, La Paz) at 10-180min lead times using the CSI and HSS metrics over 4-64 mm/h thresholds. Comparisons against optical-flow, deep learning and hybrid baselines show that TUPANN achieves the best or second-best skill in most settings, with pronounced gains at higher thresholds. Training on multiple cities further improves performance, while cross-city experiments show modest degradation and occasional gains for rare heavy-rain regimes. The model produces smooth, interpretable motion fields aligned with numerical optical flow and runs in near real time due to the low latency of GOES-16. These results indicate that physically aligned learning can provide nowcasts that are skillful, transferable and global.
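The advection step such models end with can be sketched in a few lines (our own toy version, not the paper's operator): each output pixel samples the input frame backward along a motion field with bilinear interpolation, which is what makes the reconstruction differentiable with respect to both the frame and the motion.

```python
def bilinear(frame, y, x):
    """Bilinearly interpolate frame (list of rows) at a fractional location."""
    h, w = len(frame), len(frame[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (frame[y0][x0] * (1 - dy) * (1 - dx) + frame[y0][x1] * (1 - dy) * dx
            + frame[y1][x0] * dy * (1 - dx) + frame[y1][x1] * dy * dx)

def advect(frame, vy, vx):
    """Semi-Lagrangian step: each output pixel samples where its mass came from."""
    h, w = len(frame), len(frame[0])
    return [[bilinear(frame, i - vy, j - vx) for j in range(w)] for i in range(h)]

rain = [[0.0] * 5 for _ in range(5)]
rain[1][1] = 8.0                        # one precipitation cell (values made up)
moved = advect(rain, vy=1.0, vx=2.0)    # uniform motion: 1 row down, 2 columns right
print(moved[2][3])  # 8.0 -- the cell reappears at (2, 3)
```

In TUPANN-style pipelines the motion field is per-pixel and predicted by the network rather than uniform; the warping itself stays this simple, which is why the forecast decomposes into interpretable motion and intensity components.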
[557] Transolver is a Linear Transformer: Revisiting Physics-Attention through the Lens of Linear Attention
Wenjie Hu, Sidun Liu, Peng Qiao, Zhenglun Sun, Yong Dou
Main category: cs.LG
TL;DR: The paper proposes Linear Attention Neural Operator (LinearNO), which reformulates Physics-Attention from Transolver as linear attention, achieving better performance with fewer parameters and lower computational cost on PDE benchmarks.
Details
Motivation: Current Transformer-based neural operators for PDEs suffer from quadratic complexity. While Transolver introduced Physics-Attention to reduce costs, the authors observed it can be reformulated as linear attention and that slice attention may hurt performance.
Method: Proposed a two-step transformation to redesign Physics-Attention into canonical linear attention, creating Linear Attention Neural Operator (LinearNO).
Result: Achieved state-of-the-art performance on 6 standard PDE benchmarks with 40.0% fewer parameters and 36.2% lower computational cost. Also showed superior performance on industrial datasets AirfRANS and Shape-Net Car.
Conclusion: LinearNO demonstrates that the effectiveness of Physics-Attention primarily comes from slice/deslice operations rather than slice interactions, and linear attention provides a more efficient and effective alternative for neural operators in PDE solving.
Abstract: Recent advances in Transformer-based Neural Operators have enabled significant progress in data-driven solvers for Partial Differential Equations (PDEs). Most current research has focused on reducing the quadratic complexity of attention to address the resulting low training and inference efficiency. Among these works, Transolver stands out as a representative method that introduces Physics-Attention to reduce computational costs. Physics-Attention projects grid points into slices for slice attention, then maps them back through deslicing. However, we observe that Physics-Attention can be reformulated as a special case of linear attention, and that the slice attention may even hurt the model performance. Based on these observations, we argue that its effectiveness primarily arises from the slice and deslice operations rather than interactions between slices. Building on this insight, we propose a two-step transformation to redesign Physics-Attention into a canonical linear attention, which we call Linear Attention Neural Operator (LinearNO). Our method achieves state-of-the-art performance on six standard PDE benchmarks, while reducing the number of parameters by an average of 40.0% and computational cost by 36.2%. Additionally, it delivers superior performance on two challenging, industrial-level datasets: AirfRANS and Shape-Net Car.
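The equivalence the paper exploits can be demonstrated with the standard associativity trick of (unnormalized) linear attention, using toy sizes and a generic positive feature map of our choosing: computing phi(Q) (phi(K)^T V) costs O(N) in the sequence length, yet gives exactly the same result as forming the N x N score matrix first.

```python
import math, random
random.seed(0)

N, d = 6, 4
phi = lambda v: [math.exp(x) for x in v]  # a generic positive feature map

Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
fQ, fK = [phi(q) for q in Q], [phi(k) for k in K]

# Quadratic ordering: build the N x N score matrix, then mix values.
quad = [[sum(sum(fQ[i][m] * fK[j][m] for m in range(d)) * V[j][c] for j in range(N))
         for c in range(d)] for i in range(N)]

# Linear ordering: accumulate S = phi(K)^T V (d x d) once, then one pass over queries.
S = [[sum(fK[j][r] * V[j][c] for j in range(N)) for c in range(d)] for r in range(d)]
lin = [[sum(fQ[i][r] * S[r][c] for r in range(d)) for c in range(d)] for i in range(N)]

diff = max(abs(quad[i][c] - lin[i][c]) for i in range(N) for c in range(d))
print(diff)  # ~0: associativity makes the two orderings identical
```

Physics-Attention's slice/deslice pipeline plays the role of the feature maps here; once that is recognized, the slice-to-slice attention in the middle can be simplified away, which is the redesign LinearNO performs.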
[558] Dual Mamba for Node-Specific Representation Learning: Tackling Over-Smoothing with Selective State Space Modeling
Xin He, Yili Wang, Yiwei Dai, Xin Wang
Main category: cs.LG
TL;DR: DMbaGCN is a novel GNN framework that integrates Mamba to address over-smoothing through local state evolution modeling and global context awareness.
Details
Motivation: Existing solutions like residual connections and skip layers fail to explicitly model node-specific representation evolution across layers and ignore global information, which are crucial for mitigating over-smoothing in deep GNNs.
Method: Proposes Dual Mamba-enhanced Graph Convolutional Network (DMbaGCN) with two modules: LSEMba for local neighborhood aggregation using Mamba’s selective state space modeling, and GCAMba for incorporating global context using Mamba’s global attention capabilities.
Result: Extensive experiments on multiple benchmarks demonstrate the effectiveness and efficiency of DMbaGCN in enhancing node discriminability and mitigating over-smoothing in deep GNNs.
Conclusion: DMbaGCN successfully addresses over-smoothing by combining local node-specific representation dynamics modeling with global context awareness through Mamba integration.
Abstract: Over-smoothing remains a fundamental challenge in deep Graph Neural Networks (GNNs), where repeated message passing causes node representations to become indistinguishable. While existing solutions, such as residual connections and skip layers, alleviate this issue to some extent, they fail to explicitly model how node representations evolve in a node-specific and progressive manner across layers. Moreover, these methods do not take global information into account, which is also crucial for mitigating the over-smoothing problem. To address the aforementioned issues, in this work, we propose a Dual Mamba-enhanced Graph Convolutional Network (DMbaGCN), which is a novel framework that integrates Mamba into GNNs to address over-smoothing from both local and global perspectives. DMbaGCN consists of two modules: the Local State-Evolution Mamba (LSEMba) for local neighborhood aggregation and utilizing Mamba’s selective state space modeling to capture node-specific representation dynamics across layers, and the Global Context-Aware Mamba (GCAMba) that leverages Mamba’s global attention capabilities to incorporate global context for each node. By combining these components, DMbaGCN enhances node discriminability in deep GNNs, thereby mitigating over-smoothing. Extensive experiments on multiple benchmarks demonstrate the effectiveness and efficiency of our method.
[559] Contact Wasserstein Geodesics for Non-Conservative Schrödinger Bridges
Andrea Testa, Søren Hauberg, Tamim Asfour, Leonel Rozo
Main category: cs.LG
TL;DR: Proposes NCGSB, an energy-varying Schrödinger Bridge using contact Hamiltonian mechanics, enabling modeling of varying-energy phenomena with near-linear complexity via contact Wasserstein geodesics.
Details
Motivation: Existing Schrödinger Bridge methods are limited by energy-conservation assumptions, preventing modeling of varying-energy phenomena common in real-world stochastic processes.
Method: Introduces non-conservative generalized Schrödinger Bridge (NCGSB) based on contact Hamiltonian mechanics, parameterizes Wasserstein manifold, and implements contact Wasserstein geodesic (CWG) via ResNet architecture with non-iterative solver.
Result: Validated on manifold navigation, molecular dynamics predictions, and image generation, demonstrating practical benefits and versatility in capturing richer intermediate dynamics.
Conclusion: NCGSB provides a broader class of stochastic processes with energy variation, enabling more faithful modeling of real-world phenomena with efficient computation.
Abstract: The Schrödinger Bridge provides a principled framework for modeling stochastic processes between distributions; however, existing methods are limited by energy-conservation assumptions, which constrain the bridge’s shape and prevent it from modeling varying-energy phenomena. To overcome this, we introduce the non-conservative generalized Schrödinger bridge (NCGSB), a novel, energy-varying reformulation based on contact Hamiltonian mechanics. By allowing energy to change over time, the NCGSB covers a broader class of real-world stochastic processes, capturing richer and more faithful intermediate dynamics. By parameterizing the Wasserstein manifold, we lift the bridge problem to a tractable geodesic computation in a finite-dimensional space. Unlike computationally expensive iterative solutions, our contact Wasserstein geodesic (CWG) is naturally implemented via a ResNet architecture and relies on a non-iterative solver with near-linear complexity. Furthermore, CWG supports guided generation by modulating a task-specific distance metric. We validate our framework on tasks including manifold navigation, molecular dynamics predictions, and image generation, demonstrating its practical benefits and versatility.
[560] Oh That Looks Familiar: A Novel Similarity Measure for Spreadsheet Template Discovery
Anand Krishnakumar, Vengadesh Ravikumaran
Main category: cs.LG
TL;DR: Hybrid distance metric combining semantic embeddings, data types, and spatial positioning outperforms traditional methods for identifying structurally similar spreadsheets.
Details
Motivation: Traditional methods fail to capture spatial layouts and type patterns that define spreadsheet templates, limiting effective similarity quantification.Method: Converts spreadsheets into cell-level embeddings and uses aggregation techniques like Chamfer and Hausdorff distances to calculate similarity.
Result: Achieves perfect template reconstruction (Adjusted Rand Index of 1.00 vs 0.90 baseline) on FUSTE dataset, demonstrating superior unsupervised clustering performance.
Conclusion: Enables large-scale automated template discovery for downstream applications like retrieval-augmented generation, model training, and bulk data cleaning.
Abstract: Traditional methods for identifying structurally similar spreadsheets fail to capture the spatial layouts and type patterns that define templates. To quantify spreadsheet similarity, we introduce a hybrid distance metric that combines semantic embeddings, data type information, and spatial positioning. Our method converts spreadsheets into cell-level embeddings and computes similarity with set-aggregation techniques such as Chamfer and Hausdorff distances. Experiments across template families demonstrate superior unsupervised clustering performance compared to the graph-based Mondrian baseline, achieving perfect template reconstruction (Adjusted Rand Index of 1.00 versus 0.90) on the FUSTE dataset. Our approach facilitates large-scale automated template discovery, which in turn enables downstream applications such as retrieval-augmented generation over tabular collections, model training, and bulk data cleaning.
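The set-based aggregation described above can be sketched in a few lines; the 2-D toy embeddings below are illustrative stand-ins for the paper's cell-level embeddings:

```python
import numpy as np

def pairwise_dist(a, b):
    # Euclidean distance between every embedding in a and every embedding in b
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def chamfer(a, b):
    # Symmetric Chamfer distance: average nearest-neighbor distance, both directions
    d = pairwise_dist(a, b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def hausdorff(a, b):
    # Symmetric Hausdorff distance: worst-case nearest-neighbor distance
    d = pairwise_dist(a, b)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Two toy "spreadsheets", each represented as a set of cell-level embeddings
sheet_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
sheet_b = np.array([[0.1, 0.0], [1.0, 0.1], [0.0, 0.9]])

print(chamfer(sheet_a, sheet_b))
print(hausdorff(sheet_a, sheet_b))
```

Because both distances operate on unordered sets of cells, they are insensitive to row/column permutations that preserve a template's overall layout, which is what makes them suitable here.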
[561] Learning Quantized Continuous Controllers for Integer Hardware
Fabian Kresse, Christoph H. Lampert
Main category: cs.LG
TL;DR: This paper presents a quantization-aware training pipeline that enables deployment of reinforcement learning policies on FPGAs using only 2-3 bits per weight and activation, achieving microsecond latency and microjoule energy consumption.
Details
Motivation: Deploying continuous-control RL policies on embedded hardware requires meeting tight latency and power budgets, which FPGAs can provide but only if costly floating-point operations are avoided.Method: Developed a learning-to-hardware pipeline using quantization-aware training for integer inference, automatically selecting low-bit policies and synthesizing them to an Artix-7 FPGA.
Result: Achieved competitive performance with FP32 policies using only 2-3 bits per weight and activation across five MuJoCo tasks, with inference latencies of microseconds and microjoules per action on hardware.
Conclusion: Quantized policies not only meet hardware constraints but also exhibit increased input noise robustness compared to floating-point baselines.
Abstract: Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small FPGAs can deliver these, but only if costly floating-point pipelines are avoided. We study quantization-aware training (QAT) of policies for integer inference and present a learning-to-hardware pipeline that automatically selects low-bit policies and synthesizes them to an Artix-7 FPGA. Across five MuJoCo tasks, we obtain policy networks that are competitive with full-precision (FP32) policies but require as few as 3, or even only 2, bits per weight and per internal activation value, as long as input precision is chosen carefully. On the target hardware, the selected policies achieve inference latencies on the order of microseconds and consume microjoules per action, comparing favorably to a quantized reference. Lastly, we observe that the quantized policies exhibit increased input-noise robustness compared to the floating-point baseline.
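A minimal sketch of the kind of low-bit uniform quantization underlying QAT; the symmetric per-tensor scheme below is a common textbook choice and an assumption here, not the paper's exact quantizer:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Uniform symmetric quantization of w to signed integer codes of the given
    bit width, plus the scale that maps codes back to floats (the
    'fake-quantization' view used during quantization-aware training)."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 3 bits -> codes in [-3, 3]
    scale = float(np.abs(w).max()) / qmax or 1.0  # guard against an all-zero tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q.astype(np.int8), scale               # integer codes + one fp scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.array([-0.9, -0.6, 0.0, 0.3, 0.9], dtype=np.float32)
q, scale = quantize_symmetric(weights, bits=3)
print(q)                    # integer codes an FPGA would store and compute with
print(dequantize(q, scale))
```

At inference time only the integer codes and one scale per tensor are needed, which is why avoiding FP32 pipelines saves so much FPGA area and energy.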
[562] REACT-LLM: A Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks
Linna Wang, Zhixuan You, Qihui Zhang, Jiunan Wen, Ji Shi, Yimin Chen, Yusen Wang, Fanqi Ding, Ziliang Feng, Li Lu
Main category: cs.LG
TL;DR: REACT-LLM benchmark evaluates LLMs combined with causal features for clinical risk prediction, finding they haven’t yet outperformed traditional ML models and causal feature integration provides limited gains due to CD method assumptions.
Details
Motivation: There's a lack of systematic benchmarks evaluating the integration of LLMs and causal learning in clinical decision making, despite both having strong potential for clinical risk prediction.Method: Introduced REACT-LLM benchmark evaluating 7 clinical outcomes across 2 datasets, comparing 15 LLMs, 6 traditional ML models, and 3 causal discovery algorithms.
Result: LLMs perform reasonably but haven’t outperformed traditional ML models; causal feature integration offers limited performance gains due to strict CD method assumptions.
Conclusion: While direct integration yields limited improvement, the benchmark reveals a more promising synergy between LLMs and causal learning in clinical prognostics.
Abstract: Large Language Models (LLMs) and causal learning each hold strong potential for clinical decision making (CDM). However, their synergy remains poorly understood, largely due to the lack of systematic benchmarks evaluating their integration in clinical risk prediction. In real-world healthcare, identifying features with causal influence on outcomes is crucial for actionable and trustworthy predictions. While recent work highlights LLMs’ emerging causal reasoning abilities, comprehensive benchmarks for assessing their causal learning, and their performance when informed by causal features, in clinical risk prediction are still lacking. To address this, we introduce REACT-LLM, a benchmark designed to evaluate whether combining LLMs with causal features can enhance clinical prognostic performance and potentially outperform traditional machine learning (ML) methods. Unlike existing LLM-clinical benchmarks that often focus on a limited set of outcomes, REACT-LLM evaluates 7 clinical outcomes across 2 real-world datasets, comparing 15 prominent LLMs, 6 traditional ML models, and 3 causal discovery (CD) algorithms. Our findings indicate that while LLMs perform reasonably in clinical prognostics, they have not yet outperformed traditional ML models. Integrating causal features derived from CD algorithms into LLMs offers limited performance gains, primarily due to the strict assumptions of many CD methods, which are often violated in complex clinical data. While direct integration yields limited improvement, our benchmark reveals a more promising synergy.
[563] Synergy over Discrepancy: A Partition-Based Approach to Multi-Domain LLM Fine-Tuning
Hua Ye, Siyuan Chen, Haoliang Zhang, Weihao Luo, Yanbin Li, Xuan Zhang
Main category: cs.LG
TL;DR: A partition-based multi-stage fine-tuning framework for LLMs that groups domains into stages to maximize synergies and minimize interference, with theoretical analysis and empirical validation.
Details
Motivation: Addressing the challenge of inter-domain interference when adapting LLMs across multiple heterogeneous domains, while leveraging potential synergies between domains.Method: Strategic partitioning of domains into subsets (stages) based on domain discrepancy, synergy, and model capacity constraints, followed by multi-stage fine-tuning.
Result: The method consistently outperforms state-of-the-art baselines across various language understanding tasks in extensive empirical evaluations.
Conclusion: The proposed framework effectively exploits inter-domain synergies while minimizing negative transfer, providing a theoretically-grounded and empirically-validated solution for multi-domain LLM adaptation.
Abstract: Large language models (LLMs) demonstrate impressive generalization abilities, yet adapting them effectively across multiple heterogeneous domains remains challenging due to inter-domain interference. To overcome this challenge, we propose a partition-based multi-stage fine-tuning framework designed to exploit inter-domain synergies while minimizing negative transfer. Our approach strategically partitions domains into subsets (stages) by balancing domain discrepancy, synergy, and model capacity constraints. We theoretically analyze the proposed framework and derive novel generalization bounds that justify our partitioning strategy. Extensive empirical evaluations on various language understanding tasks show that our method consistently outperforms state-of-the-art baselines.
cs.MA
[564] A Negotiation-Based Multi-Agent Reinforcement Learning Approach for Dynamic Scheduling of Reconfigurable Manufacturing Systems
Manonmani Sekar, Nasim Nezamoddini
Main category: cs.MA
TL;DR: This paper proposes a multi-agent reinforcement learning (MARL) framework using enhanced DQN agents for dynamic scheduling in reconfigurable manufacturing systems, demonstrating improved performance over baseline methods in reducing makespan and tardiness while adapting to stochastic events.
Details
Motivation: Reconfigurable manufacturing systems require flexible soft planning mechanisms for real-time production scheduling amid complexity and variability. Traditional approaches struggle with dynamic conditions like machine breakdowns and reconfiguration delays.Method: Multi-agent reinforcement learning with deep Q-network agents trained centrally, incorporating attention mechanisms for state representation and DQN enhancements including prioritized experience replay, n-step returns, double DQN, and soft target updates.
Result: The proposed approach outperforms baseline heuristics in reducing makespan and tardiness while improving machine utilization, though machine breakdowns increase variability in performance metrics.
Conclusion: MARL mechanisms provide intelligent and adaptive scheduling advantages for dynamic reconfigurable manufacturing environments, effectively handling stochastic events and system variability.
Abstract: Reconfigurable manufacturing systems (RMS) are critical for future market adjustment given their rapid adaptation to fluctuations in consumer demands, the introduction of new technological advances, and disruptions in linked supply chain sections. The adjustable hard settings of such systems require a flexible soft planning mechanism that enables real-time production planning and scheduling amid the existing complexity and variability in their configuration settings. This study explores the application of multi-agent reinforcement learning (MARL) for dynamic scheduling in soft planning of the RMS settings. In the proposed framework, deep Q-network (DQN) agents trained via centralized training learn optimal job-machine assignments in real time while adapting to stochastic events such as machine breakdowns and reconfiguration delays. The model also incorporates a negotiation mechanism with attention to enhance state representation and improve decision focus on critical system features. Key DQN enhancements, including prioritized experience replay, n-step returns, double DQN, and soft target updates, are used to stabilize and accelerate learning. Experiments conducted in a simulated RMS environment demonstrate that the proposed approach outperforms baseline heuristics in reducing makespan and tardiness while improving machine utilization. The reconfigurable manufacturing environment was extended to simulate realistic challenges, including machine failures and reconfiguration times. Experimental results show that while the enhanced DQN agent is effective in adapting to dynamic conditions, machine breakdowns increase variability in key performance metrics such as makespan, throughput, and total tardiness. The results confirm the advantages of applying the MARL mechanism for intelligent and adaptive scheduling in dynamic reconfigurable manufacturing environments.
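Two of the DQN enhancements named in the abstract, n-step returns and soft target updates, reduce to a few lines each; this is a generic sketch of the standard definitions, not the paper's implementation:

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step return: r_0 + g*r_1 + ... + g^(n-1)*r_{n-1} + g^n * V."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g + gamma ** len(rewards) * bootstrap_value

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: the target network slowly tracks the online network,
    which stabilizes the bootstrapped Q-targets."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# 3-step return with a bootstrapped value of 10.0 from the target network:
# 1 + 0.9*0 + 0.9^2 * 2 + 0.9^3 * 10 = 9.91
print(n_step_return([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.9))
```

Longer reward traces propagate credit faster than 1-step targets, while the small tau keeps target values from chasing every online update, which is the stability/speed trade-off the abstract alludes to.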
[565] A Historical Interaction-Enhanced Shapley Policy Gradient Algorithm for Multi-Agent Credit Assignment
Ao Ding, Licheng Sun, Yongjie Hou, Huaqing Zhang, Hongbin Ma
Main category: cs.MA
TL;DR: HIS is a multi-agent reinforcement learning algorithm that uses historical interaction data and Shapley values for efficient credit assignment, balancing individual contributions with global rewards to improve performance in complex collaborative tasks.
Details
Motivation: Traditional credit assignment schemes in MARL struggle to reliably capture individual contributions in strongly coupled tasks while maintaining training stability, leading to limited generalization and hindered algorithm performance.Method: Proposes a Historical Interaction-Enhanced Shapley Policy Gradient Algorithm (HIS) that uses historical interaction data to calculate Shapley values efficiently, employing a hybrid credit assignment mechanism to balance base rewards with individual contribution incentives.
Result: HIS outperforms state-of-the-art methods in three benchmark environments (Multi-Agent Particle Environment, Multi-Agent MuJoCo, Bi-DexHands), particularly excelling in strongly coupled, complex collaborative tasks.
Conclusion: The hybrid credit assignment mechanism provides theoretical guarantees for efficiency and stability, enabling better individual contribution perception while maintaining training stability through global rewards.
Abstract: Multi-agent reinforcement learning (MARL) has demonstrated remarkable performance in multi-agent collaboration problems and has become a prominent topic in artificial intelligence research in recent years. However, traditional credit assignment schemes in MARL cannot reliably capture individual contributions in strongly coupled tasks while maintaining training stability, which leads to limited generalization capabilities and hinders algorithm performance. To address these challenges, we propose a Historical Interaction-Enhanced Shapley Policy Gradient Algorithm (HIS) for Multi-Agent Credit Assignment, which employs a hybrid credit assignment mechanism to balance base rewards with individual contribution incentives. By utilizing historical interaction data to calculate the Shapley value in a sample-efficient manner, HIS enhances the agent’s ability to perceive its own contribution, while retaining the global reward to maintain training stability. Additionally, we provide theoretical guarantees for the hybrid credit assignment mechanism, ensuring that the assignment results it generates are both efficient and stable. We evaluate the proposed algorithm in three widely used continuous-action benchmark environments: Multi-Agent Particle Environment, Multi-Agent MuJoCo, and Bi-DexHands. Experimental results demonstrate that HIS outperforms state-of-the-art methods, particularly excelling in strongly coupled, complex collaborative tasks.
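Shapley-value credit assignment, the quantity HIS estimates from historical interactions, can be illustrated with an exact computation over a tiny coalition game; this brute-force enumeration is a toy version (HIS's sample-efficient estimator is the paper's contribution, not shown here):

```python
from itertools import combinations
from math import factorial

def shapley_values(agents, value):
    """Exact Shapley value of each agent for a coalition value function."""
    n = len(agents)
    phi = {a: 0.0 for a in agents}
    for a in agents:
        others = [x for x in agents if x != a]
        for k in range(n):
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # weight of this coalition in the Shapley average over orderings
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[a] += weight * (value(s | {a}) - value(s))
    return phi

# Toy team reward: agents 0 and 1 only score when both act together (a strongly
# coupled task); agent 2 never contributes.
def team_reward(coalition):
    return 10.0 if {0, 1} <= coalition else 0.0

print(shapley_values([0, 1, 2], team_reward))
```

The output splits the reward evenly between agents 0 and 1 and assigns 0 to agent 2, which is exactly the individual-contribution signal a global team reward alone cannot provide.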
[566] Can LLM Agents Really Debate? A Controlled Study of Multi-Agent Debate in Logical Reasoning
Haolun Wu, Zhenkun Li, Lingyao Li
Main category: cs.MA
TL;DR: Multi-agent debate improves LLM reasoning, but its genuine deliberative nature is unclear. Using logic puzzles, the study finds reasoning strength and diversity drive success more than structural factors.
Details
Motivation: To determine if LLM agents can genuinely engage in deliberative reasoning beyond simple ensembling or majority voting in multi-agent debates.Method: Controlled study using Knight-Knave-Spy logic puzzles with systematic manipulation of six factors: team size, composition, confidence visibility, debate order, depth, and task difficulty.
Result: Intrinsic reasoning strength and group diversity are dominant success drivers; structural parameters offer limited gains. Process analysis reveals majority pressure suppresses correction and rational reasoning predicts improvement.
Conclusion: LLM debates succeed through reasoning strength and diversity, providing guidance for designing truth-seeking multi-agent reasoning systems.
Abstract: Multi-agent debate (MAD) has recently emerged as a promising framework for improving the reasoning performance of large language models (LLMs). Yet, whether LLM agents can genuinely engage in deliberative reasoning, beyond simple ensembling or majority voting, remains unclear. We address this question through a controlled study using the Knight–Knave–Spy logic puzzle, which enables precise, step-wise evaluation of debate outcomes and processes under verifiable ground truth. We systematically set up six structural and cognitive factors, including agent team size, composition, confidence visibility, debate order, debate depth, and task difficulty, to disentangle their respective effects on collective reasoning. Our results show that intrinsic reasoning strength and group diversity are the dominant drivers of debate success, while structural parameters such as order or confidence visibility offer limited gains. Beyond outcomes, process-level analyses identify key behavioral patterns: majority pressure suppresses independent correction, effective teams overturn incorrect consensus, and rational, validity-aligned reasoning most strongly predicts improvement. These findings provide valuable insights into how and why LLM debates succeed or fail, offering guidance for designing interpretable and truth-seeking multi-agent reasoning systems.
[567] How Brittle is Agent Safety? Rethinking Agent Risk under Intent Concealment and Task Complexity
Zihan Ma, Dongsheng Zhu, Shudong Liu, Taolin Zhang, Junnan Liu, Qingqiu Li, Minnan Luo, Songyang Zhang, Kai Chen
Main category: cs.MA
TL;DR: OASIS is a new benchmark that reveals LLM agent safety brittleness under intent concealment and task complexity, showing safety degradation as intent becomes obscured and a “Complexity Paradox” where agents appear safer on harder tasks due to capability limitations.
Details
Motivation: Current safety evaluations focus on atomic harms but fail to address sophisticated threats where malicious intent is concealed or diluted within complex tasks.Method: Introduce OASIS (Orthogonal Agent Safety Inquiry Suite), a hierarchical benchmark with fine-grained annotations and a high-fidelity simulation sandbox for two-dimensional analysis of agent safety brittleness under intent concealment and task complexity.
Result: Two critical phenomena revealed: safety alignment degrades sharply as intent becomes obscured, and a “Complexity Paradox” emerges where agents seem safer on harder tasks only due to capability limitations.
Conclusion: OASIS provides a principled foundation for probing and strengthening agent safety in overlooked dimensions of intent concealment and task complexity.
Abstract: Current safety evaluations for LLM-driven agents primarily focus on atomic harms, failing to address sophisticated threats where malicious intent is concealed or diluted within complex tasks. We address this gap with a two-dimensional analysis of agent safety brittleness under the orthogonal pressures of intent concealment and task complexity. To enable this, we introduce OASIS (Orthogonal Agent Safety Inquiry Suite), a hierarchical benchmark with fine-grained annotations and a high-fidelity simulation sandbox. Our findings reveal two critical phenomena: safety alignment degrades sharply and predictably as intent becomes obscured, and a “Complexity Paradox” emerges, where agents seem safer on harder tasks only due to capability limitations. By releasing OASIS and its simulation environment, we provide a principled foundation for probing and strengthening agent safety in these overlooked dimensions.
[568] Climate Driven Interactions Between Malaria Transmission and Diabetes Prevalence
Shivank, Anurag Singha, Fakhteh Ghanbarnejad, Ajay K Sharma
Main category: cs.MA
TL;DR: Climate change increases malaria risk for diabetic populations, with modeling showing 1.8-4.0 times higher infection odds in diabetics compared to non-diabetics in India.
Details
Motivation: Climate change intensifies infectious and chronic diseases, with rising temperatures extending malaria transmission windows and worsening diabetes outcomes due to metabolic stress, yet most models don't capture these interactions.Method: Developed a compartmental epidemiological model using synthetic data from India (2019-2021) with temperature-dependent transmission parameters, seasonal variability, and separate disease dynamics for diabetic/non-diabetic groups, calibrated using Multi-Start optimization with Sequential Quadratic Programming.
Result: Diabetic individuals had 1.8-4.0 times higher malaria infection odds, with peak infection levels of 35-36% vs 20-21% in non-diabetics. Basic reproduction number averaged 2.3 (range 0.31-2.75 across seasons).
Conclusion: With India’s diabetic population projected to reach 157 million by 2050, there’s urgent need for climate-informed health strategies and monitoring systems that jointly address malaria and diabetes.
Abstract: Climate change is intensifying infectious and chronic diseases such as malaria and diabetes, respectively, especially among vulnerable populations. Global temperatures have risen by approximately $0.6^\circ$C since 1950, extending the transmission window for mosquito-borne infections and worsening outcomes in diabetes due to heat-induced metabolic stress. People living with diabetes already have weakened immune defenses and are therefore at a sharply increased risk of contracting malaria. However, most models fail to capture this two-way interaction under changing climate conditions. In this paper, we introduce a new compartmental epidemiological model based on synthetic data fitted to disease patterns of India from 2019 to 2021. The framework captures temperature-dependent transmission parameters, seasonal variability, and distinct disease dynamics for diabetic and non-diabetic groups within a three-compartment system. Model calibration using Multi-Start optimization combined with Sequential Quadratic Programming reveals pronounced differences between populations. The odds of malaria infection in diabetic individuals were found to be 1.8–4.0 times higher, with peak infection levels of 35–36%, compared to 20–21% in non-diabetics. The fitted model captured the observed epidemiological patterns well, while the basic reproduction number averaged around 2.3, ranging from 0.31 to 2.75 across seasons. Given that India’s diabetic population is set to rise to about 157 million people by 2050, these findings point to a pressing need for concerted efforts toward climate-informed health strategies and monitoring systems that address both malaria and diabetes jointly.
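The temperature-dependent, seasonally varying transmission with separate diabetic and non-diabetic dynamics can be sketched as a minimal two-group S-I-R simulation; every parameter value below is illustrative, not one of the paper's calibrated estimates:

```python
import math

def simulate(days=365, dt=1.0, beta0=0.25, gamma=1 / 14,
             diabetic_risk=2.5, seasonal_amp=0.4):
    """Two-group S-I-R with a sinusoidal (temperature-proxy) transmission rate.
    Diabetics share the mosquito-driven force of infection but are more
    susceptible by the factor diabetic_risk."""
    s_nd, i_nd = 0.99, 0.01   # non-diabetic susceptible / infected fractions
    s_d, i_d = 0.99, 0.01     # diabetic susceptible / infected fractions
    peaks = {"nd": i_nd, "d": i_d}
    for t in range(days):
        # seasonal modulation stands in for temperature-dependent transmission
        beta = beta0 * (1 + seasonal_amp * math.sin(2 * math.pi * t / 365))
        force = beta * (i_nd + i_d)           # shared infection pressure
        new_nd = force * s_nd * dt
        new_d = diabetic_risk * force * s_d * dt
        s_nd, i_nd = s_nd - new_nd, i_nd + new_nd - gamma * i_nd * dt
        s_d, i_d = s_d - new_d, i_d + new_d - gamma * i_d * dt
        peaks["nd"] = max(peaks["nd"], i_nd)
        peaks["d"] = max(peaks["d"], i_d)
    return peaks

print(simulate())  # the diabetic peak exceeds the non-diabetic one
```

With identical dynamics apart from the susceptibility multiplier, the diabetic group peaks higher, qualitatively mirroring the 35–36% versus 20–21% gap the paper reports.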
[569] The Curse of Shared Knowledge: Recursive Belief Reasoning in a Coordination Game with Imperfect Information
Thomas Bolander, Robin Engelhardt, Thomas S. Nicolet
Main category: cs.MA
TL;DR: Humans struggle to distinguish between common knowledge and nth-order shared knowledge, often acting as if they have common knowledge even with shallow shared knowledge, leading to coordination failures.
Details
Motivation: To investigate whether humans can differentiate between common knowledge and nth-order shared knowledge in coordination games, since common knowledge is crucial for safe group coordination but is often unavailable.Method: Three experiments with 802 participants using a two-person coordination game with imperfect information where coordination for highest payoff requires common knowledge, but only nth-order shared knowledge is possible.
Result: Players behave as if they possess common knowledge even at shallow depths of shared knowledge (low n), claiming similar certainty despite incurring significant penalties when coordination fails.
Conclusion: Humans exhibit ‘the curse of shared knowledge’: they are either unable to distinguish between higher-order shared knowledge and common knowledge, or implicitly assume their co-players cannot make this distinction.
Abstract: Common knowledge is crucial for safe group coordination. In its absence, humans must rely on shared knowledge, which is inherently limited in depth and therefore prone to coordination failures, because any finite-order knowledge attribution allows for an even higher order attribution that may change what is known by whom. In three separate experiments involving 802 participants, we investigate the extent to which humans can differentiate between common knowledge and nth-order shared knowledge. We designed a two-person coordination game with imperfect information to simplify the recursive game structure and higher-order uncertainties into a relatable everyday scenario. In this game, coordination for the highest payoff requires a specific fact to be common knowledge between players. However, this fact cannot become common knowledge in the game. The fact can at most be nth-order shared knowledge for some n. Our findings reveal that even at quite shallow depths of shared knowledge (low values of n), players behave as though they possess common knowledge, and claim similar levels of certainty in their actions, despite incurring significant penalties when falsely assuming guaranteed coordination. We term this phenomenon ‘the curse of shared knowledge’. It arises either from the players’ inability to distinguish between higher-order shared knowledge and common knowledge, or from their implicit assumption that their co-player cannot make this distinction.
[570] MA-GTS: A Multi-Agent Framework for Solving Complex Graph Problems in Real-World Applications
Zike Yuan, Ming Liu, Hui Wang, Bing Qin
Main category: cs.MA
TL;DR: MA-GTS is a multi-agent framework that solves complex graph theory problems by decomposing them through agent collaboration, achieving state-of-the-art performance on real-world graph datasets.
Details
Motivation: Graph-theoretic problems in real-world applications are complex, noisy, and irregular, posing challenges for traditional algorithms. LLMs offer potential but face accuracy and input length limitations.Method: MA-GTS uses multi-agent collaboration to map text-based graph data into structured representations and dynamically selects optimal algorithms based on problem constraints and graph scale.
Result: MA-GTS outperforms state-of-the-art approaches with strong results across benchmarks: G-REAL 94.2%, GraCoRe 96.9%, NLGraph 98.4%, demonstrating superior efficiency, accuracy, and scalability.
Conclusion: The multi-agent framework effectively addresses graph theory challenges, providing efficient, accurate, and interpretable solutions while outperforming existing methods across multiple real-world benchmarks.
Abstract: Graph-theoretic problems arise in real-world applications like logistics, communication networks, and traffic optimization. These problems are often complex, noisy, and irregular, posing challenges for traditional algorithms. Large language models (LLMs) offer potential solutions but face challenges, including limited accuracy and input length constraints. To address these challenges, we propose MA-GTS (Multi-Agent Graph Theory Solver), a multi-agent framework that decomposes these complex problems through agent collaboration. MA-GTS maps the implicitly expressed text-based graph data into clear, structured graph representations and dynamically selects the most suitable algorithm based on problem constraints and graph structure scale. This approach ensures that the solution process remains efficient and the resulting reasoning path is interpretable. We validate MA-GTS using the G-REAL dataset, a real-world-inspired graph theory dataset we created. Experimental results show that MA-GTS outperforms state-of-the-art approaches in terms of efficiency, accuracy, and scalability, with strong results across multiple benchmarks (G-REAL 94.2%, GraCoRe 96.9%, NLGraph 98.4%). MA-GTS is open-sourced at https://github.com/ZIKEYUAN/MA-GTS.git.
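The idea of dispatching to the most suitable algorithm by problem type and graph scale can be sketched as a simple rule table; the thresholds and algorithm names below are hypothetical illustrations (MA-GTS's actual selection is driven by its agents' parsing of problem constraints):

```python
def select_solver(n_nodes, n_edges, problem):
    """Pick an algorithm family from the problem type and graph scale
    (hypothetical thresholds for illustration only)."""
    if problem == "shortest_path":
        # dense graphs favor all-pairs methods; sparse ones a single-source search
        return "floyd_warshall" if n_edges > n_nodes ** 2 // 4 else "dijkstra"
    if problem == "tsp":
        # exact search only stays tractable on small instances
        return "branch_and_bound" if n_nodes <= 15 else "nearest_neighbor_heuristic"
    # fall back to direct LLM reasoning for problem types without a coded solver
    return "llm_direct_reasoning"

print(select_solver(10, 40, "shortest_path"))
print(select_solver(50, 100, "tsp"))
```

Routing to exact classical solvers where they are tractable, and to heuristics or LLM reasoning elsewhere, is what keeps the solution process both efficient and interpretable.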
[571] Socialized Learning and Emergent Behaviors in Multi-Agent Systems based on Multimodal Large Language Models
Sureyya Akin, Shruti T. Tiwari, Ram Bhattacharya, Sagar A. Raman, Kiran Mohanty, Sita Krishnan
Main category: cs.MA
TL;DR: M-S2L framework integrates multimodal LLMs with social learning to develop AI agents with emergent social intelligence, enabling better collaboration in complex tasks through multimodal perception and communication.
Details
Motivation: To foster emergent social intelligence in AI agents by combining multimodal perception with social learning mechanisms, addressing limitations of text-only approaches in collaborative environments.Method: Multimodal Socialized Learning Framework (M-S2L) with multimodal perception (vision+text), structured action capabilities, reinforcement learning, two social learning pathways (observational learning and communication-driven learning), and episodic memory for social context.
Result: M-S2L agents outperform text-only and no-social-learning baselines in Collaborative Assembly Environment tasks, showing improved task completion rates and time efficiency, especially in dynamic scenarios. Emergence of efficient communication protocols and role specialization observed.
Conclusion: Integrating multimodal perception with explicit social learning is critical for developing human-like collaborative intelligence in multi-agent systems, enabling emergent social cognition and efficient team coordination.
Abstract: This research introduces the Multimodal Socialized Learning Framework (M-S2L), designed to foster emergent social intelligence in AI agents by integrating Multimodal Large Language Models (M-LLMs) with social learning mechanisms. The framework equips agents with multimodal perception (vision and text) and structured action capabilities, enabling physical manipulation and grounded multimodal communication (e.g., text with visual pointers). M-S2L combines direct reinforcement learning with two novel social learning pathways: multimodal observational learning and communication-driven learning from feedback, augmented by an episodic memory system for long-term social context. We evaluate M-S2L in a Collaborative Assembly Environment (CAE), where agent teams must construct complex devices from ambiguous blueprints under informational asymmetry. Across tasks of increasing complexity, M-S2L agents consistently outperform Text-Only and No-Social-Learning baselines in Task Completion Rate and Time to Completion, particularly in dynamic problem-solving scenarios. Ablation studies confirm the necessity of both multimodality and socialized learning. Our analysis reveals the emergence of efficient communication protocols integrating visual pointers with concise text, alongside rapid role specialization leading to stable labor division. Qualitative case studies demonstrate agents’ abilities for shared awareness, dynamic re-planning, and adaptive problem-solving, suggesting a nascent form of machine social cognition. These findings indicate that integrating multimodal perception with explicit social learning is critical for developing human-like collaborative intelligence in multi-agent systems.
[572] TrustResearcher: Automating Knowledge-Grounded and Transparent Research Ideation with Multi-Agent Collaboration
Jiawei Zhou, Ruicheng Zhu, Mengshi Chen, Jianwei Wang, Kai Wang
Main category: cs.MA
TL;DR: TrustResearcher is a transparent multi-agent system for literature-based ideation that generates evidence-aligned hypotheses through structured knowledge curation, diversified idea generation, multi-stage selection, and expert panel review.
Details
Motivation: Current agentic systems for literature-based ideation are often black-box, producing plausible but weakly grounded outputs with limited transparency and control for researchers.Method: A four-stage unified framework: (A) Structured Knowledge Curation, (B) Diversified Idea Generation, (C) Multi-stage Idea Selection, and (D) Expert Panel Review & Synthesis, with exposed intermediate reasoning states, execution logs, and tunable agents.
Result: Successfully demonstrated on a graph-mining case study (k-truss breaking problem), generating distinct, plausible hypotheses with evidence and critiques.
Conclusion: TrustResearcher provides a domain-agnostic, transparent approach to literature-based ideation that produces diverse, evidence-aligned hypotheses while maintaining full inspectability and control.
Abstract: Effective research relies on organizing extensive information and stimulating novel solutions. Agentic systems have recently emerged as a promising tool to automate literature-based ideation. However, current systems often remain black-box. Their outputs may appear plausible but weakly grounded, with limited transparency or control for researchers. Our work introduces TrustResearcher, a multi-agent demo system for knowledge-grounded and transparent ideation. Specifically, TrustResearcher integrates meticulously designed four stages into a unified framework: (A) Structured Knowledge Curation, (B) Diversified Idea Generation, (C) Multi-stage Idea Selection, and (D) Expert Panel Review & Synthesis. Different from prior pipelines, our system not only exposes intermediate reasoning states, execution logs, and tunable agents for inspections, but also enables the generation of hypotheses that are both diverse and evidence-aligned. Our design is also domain-agnostic: as long as literature sources exist, the same pipeline can be instantiated in any scientific field. As an illustrative case, we demonstrate TrustResearcher on a graph-mining case study (k-truss breaking problem), where it generates distinct, plausible hypotheses with evidence and critiques. A live demo and source code are available at https://github.com/valleysprings/TrustResearcher.
[573] Human Machine Social Hybrid Intelligence: A Collaborative Decision Making Framework for Large Model Agent Groups and Human Experts
Ahmet Akkaya Melih, Yamuna Singh, Kunal L. Agarwal, Priya Mukherjee, Kiran Pattnaik, Hanuman Bhatia
Main category: cs.MA
TL;DR: Proposes HMS-HI framework for collaborative human-AI decision-making with shared cognitive space, dynamic task allocation, and trust calibration, achieving 72% fewer casualties and 70% lower cognitive load in emergency response simulations.
Details
Motivation: Current Human-in-the-Loop paradigms inadequately integrate human expertise, causing cognitive overload and decision bottlenecks in complex, high-stakes environments despite advancements in foundation models and multi-agent systems.
Method: Three-pillar framework: Shared Cognitive Space for unified situational awareness, Dynamic Role and Task Allocation for adaptive agent assignment, and Cross-Species Trust Calibration for transparency and mutual adaptation through explainable declarations and structured feedback.
Result: In urban emergency response simulation, HMS-HI reduced civilian casualties by 72% and cognitive load by 70% compared to traditional HiTL approaches, with superior decision quality, efficiency, and human-AI trust.
Conclusion: Engineered trust and shared context are foundational for scalable, synergistic human-AI collaboration, with each module critically contributing to the framework’s success.
Abstract: The rapid advancements in large foundation models and multi-agent systems offer unprecedented capabilities, yet current Human-in-the-Loop (HiTL) paradigms inadequately integrate human expertise, often leading to cognitive overload and decision-making bottlenecks in complex, high-stakes environments. We propose the “Human-Machine Social Hybrid Intelligence” (HMS-HI) framework, a novel architecture designed for deep, collaborative decision-making between groups of human experts and LLM-powered AI agents. HMS-HI is built upon three core pillars: (1) a Shared Cognitive Space (SCS) for unified, multi-modal situational awareness and structured world modeling; (2) a Dynamic Role and Task Allocation (DRTA) module that adaptively assigns tasks to the most suitable agent (human or AI) based on capabilities and workload; and (3) a Cross-Species Trust Calibration (CSTC) protocol that fosters transparency, accountability, and mutual adaptation through explainable declarations and structured feedback. Validated in a high-fidelity urban emergency response simulation, HMS-HI significantly reduced civilian casualties by 72% and cognitive load by 70% compared to traditional HiTL approaches, demonstrating superior decision quality, efficiency, and human-AI trust. An ablation study confirms the critical contribution of each module, highlighting that engineered trust and shared context are foundational for scalable, synergistic human-AI collaboration.
[574] From Pixels to Cooperation: Multi-Agent Reinforcement Learning Based on Multimodal World Models
Sureyya Akin, Kavita Srivastava, Prateek B. Kapoor, Pradeep G. Sethi, Sunita Q. Patel, Rahu Srivastava
Main category: cs.MA
TL;DR: Proposes a Multimodal World Model (MWM) framework for sample-efficient multi-agent reinforcement learning that fuses multimodal observations and enables policy training in latent space.
Details
Motivation: Address sample inefficiency in learning cooperative multi-agent policies from high-dimensional multimodal sensory inputs, overcoming challenges of representation learning, partial observability, and credit assignment.
Method: Train a shared generative Multimodal World Model using attention-based fusion of distributed multimodal observations, then use it as a fast simulator to train MARL policies in latent space, decoupling representation from policy learning.
Result: Achieves orders-of-magnitude greater sample efficiency than state-of-the-art model-free MARL baselines, with superior robustness to sensor-dropout and essential multimodal fusion for sensory asymmetric environments.
Conclusion: The MWM-MARL framework provides a highly sample-efficient approach for multimodal multi-agent learning with practical robustness benefits for real-world deployment.
Abstract: Learning cooperative multi-agent policies directly from high-dimensional, multimodal sensory inputs like pixels and audio is notoriously sample-inefficient. Model-free Multi-Agent Reinforcement Learning (MARL) algorithms struggle with the joint challenge of representation learning, partial observability, and credit assignment. To address this, we propose a novel framework based on a shared, generative Multimodal World Model (MWM). Our MWM is trained to learn a compressed latent representation of the environment’s dynamics by fusing distributed, multimodal observations from all agents using a scalable attention-based mechanism. Subsequently, we leverage this learned MWM as a fast, “imagined” simulator to train cooperative MARL policies (e.g., MAPPO) entirely within its latent space, decoupling representation learning from policy learning. We introduce a new set of challenging multimodal, multi-agent benchmarks built on a 3D physics simulator. Our experiments demonstrate that our MWM-MARL framework achieves orders-of-magnitude greater sample efficiency compared to state-of-the-art model-free MARL baselines. We further show that our proposed multimodal fusion is essential for task success in environments with sensory asymmetry and that our architecture provides superior robustness to sensor-dropout, a critical feature for real-world deployment.
cs.MM
[575] Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Ali Vosoughi, Dimitra Emmanouilidou, Hannes Gamper
Main category: cs.MM
TL;DR: AVVA framework improves audio-video alignment using LLM-curated data and contrastive learning, achieving better retrieval performance with less training data.
Details
Motivation: Current multimodal foundation models struggle with effective audio-visual integration beyond simple temporal synchronization.
Method: Uses LLMs for data curation, Whisper for audio, DINOv2 for video, and contrastive learning in dual-encoder structure.
Result: Significant improvement in top-k accuracies for video-to-audio retrieval on AudioCaps, VALOR, and VGGSound datasets using only 192 hrs of curated data.
Conclusion: Data curation effectively trades data quantity for quality, yielding better performance than training on uncurated data.
Abstract: Integrating audio and visual data for training multimodal foundational models remains a challenge. The Audio-Video Vector Alignment (AVVA) framework addresses this by considering AV scene alignment beyond mere temporal synchronization, and leveraging Large Language Models (LLMs) for data curation. AVVA implements a scoring mechanism for selecting aligned training data segments. It integrates Whisper, a speech-based foundation model, for audio and DINOv2 for video analysis in a dual-encoder structure with contrastive learning on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the effectiveness of the proposed model architecture and data curation approach. AVVA achieves a significant improvement in top-k accuracies for video-to-audio retrieval on all datasets compared to DenseAV, while using only 192 hrs of curated training data. Furthermore, an ablation study indicates that the data curation process effectively trades data quality for data quantity, yielding increases in top-k retrieval accuracies on AudioCaps, VALOR, and VGGSound, compared to training on the full spectrum of uncurated data.
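The dual-encoder contrastive training that AVVA builds on can be illustrated with a minimal sketch of a CLIP-style symmetric InfoNCE loss: each audio embedding must identify its paired video clip within the batch, and vice versa. This is a generic formulation, not the authors' code; the temperature value and toy embeddings are hypothetical.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]

def contrastive_loss(audio_embs, video_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/video embeddings."""
    a = [_normalize(x) for x in audio_embs]
    v = [_normalize(x) for x in video_embs]
    n = len(a)
    # cosine-similarity logits, scaled by the temperature
    logits = [[_dot(a[i], v[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy(row, target):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    loss_a2v = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    loss_v2a = sum(cross_entropy([logits[j][i] for j in range(n)], i)
                   for i in range(n)) / n
    return 0.5 * (loss_a2v + loss_v2a)
```

Correctly paired batches drive this loss toward zero, while mismatched pairs keep it high, which is what lets the scoring-based curation reward well-aligned segments.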
[576] Enabling American Sign Language Communication Under Low Data Rates
Panneer Selvam Santhalingam, Swann Thantsin, Ahmad Kamari, Parth Pathak, Kenneth DeHaan
Main category: cs.MM
TL;DR: VC4ASL enables ASL communication over audio channels in video conferencing apps when video fails due to poor connectivity, using pose encoding and reconstruction without platform modifications.
Details
Motivation: Video conferencing apps rely on high-speed internet, forcing ASL users to use audio-only mode which excludes their visual language, creating communication barriers for the deaf community.
Method: Encodes human pose data through audio channel, reconstructs signed content, uses error detection/correction exploiting human pose constraints, integrates with existing platforms without modifications.
Result: System effectively enables intelligible ASL communication over audio in low-bandwidth scenarios where video transmission fails, as validated through user studies and network simulations.
Conclusion: VC4ASL successfully bridges the accessibility gap for ASL users in degraded network conditions by leveraging audio channels for pose-based sign language transmission.
Abstract: In recent years, video conferencing applications have become increasingly prevalent, relying heavily on high-speed internet connectivity. When such connectivity is lacking, users often default to audio-only communication, a mode that significantly disadvantages American Sign Language (ASL) users, whose communication relies on hand gestures, body movement, and facial expressions. In this work, we introduce VC4ASL, a system designed to enable ASL communication over the audio channel of existing video conferencing applications, even in the absence of reliable video. VC4ASL integrates seamlessly with current platforms without requiring any modifications. Our approach establishes a communication channel through audio by encoding and transmitting human pose information, which is then rendered to reconstruct signed content. We propose novel receive-side error detection and correction mechanisms that exploit the inherent structural constraints of human pose data. To evaluate the system, we simulate network-degraded environments, generate pose-based ASL video sequences, and conduct user studies to assess comprehension among ASL users. Experimental results demonstrate that VC4ASL effectively facilitates intelligible ASL communication over audio in low-bandwidth scenarios where video transmission is impaired.
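The kind of receive-side error detection that exploits structural constraints of human pose can be illustrated with a toy check: a signer's bone lengths are near-constant across frames, so a length that suddenly deviates flags a corrupted keypoint. This is a sketch under that assumption only; the skeleton, bone list, and tolerance below are hypothetical, not the paper's actual mechanism.

```python
import math

# Hypothetical 3-keypoint arm skeleton: 0=shoulder, 1=elbow, 2=wrist
BONES = [(0, 1), (1, 2)]

def bone_lengths(pose, bones=BONES):
    return [math.dist(pose[i], pose[j]) for i, j in bones]

def flag_corrupt_frames(frames, bones=BONES, tol=0.25):
    """Flag frames whose bone lengths deviate from the per-bone median by
    more than `tol` (relative), suggesting keypoints corrupted in transit."""
    per_bone = list(zip(*(bone_lengths(f, bones) for f in frames)))
    medians = [sorted(b)[len(b) // 2] for b in per_bone]
    bad = []
    for idx, frame in enumerate(frames):
        lengths = bone_lengths(frame, bones)
        if any(abs(l - m) > tol * m for l, m in zip(lengths, medians)):
            bad.append(idx)
    return bad
```

A flagged frame could then be dropped or interpolated from its neighbors before rendering the signed content.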
eess.AS
[577] Automatic Music Mixing using a Generative Model of Effect Embeddings
Eloi Moliner, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Kin Wai Cheuk, Joan Serrà, Vesa Välimäki, Yuki Mitsufuji
Main category: eess.AS
TL;DR: MEGAMI is a generative framework for automatic music mixing that models multiple valid mixing solutions rather than treating it as a deterministic regression problem.
Details
Motivation: Existing automatic mixing systems ignore the subjectivity and multiplicity of valid solutions in music mixing by treating it as a deterministic regression problem.
Method: Uses a track-agnostic effects processor conditioned on per-track generated embeddings, employs permutation-equivariant architecture for arbitrary unlabeled tracks, and enables training on both dry and wet recordings via domain adaptation.
Result: Objective evaluation shows consistent improvements over existing methods, and listening tests indicate performance approaching human-level quality across diverse musical genres.
Conclusion: MEGAMI successfully models the conditional distribution of professional mixes and demonstrates improved performance over deterministic approaches in automatic music mixing.
Abstract: Music mixing involves combining individual tracks into a cohesive mixture, a task characterized by subjectivity where multiple valid solutions exist for the same input. Existing automatic mixing systems treat this task as a deterministic regression problem, thus ignoring this multiplicity of solutions. Here we introduce MEGAMI (Multitrack Embedding Generative Auto MIxing), a generative framework that models the conditional distribution of professional mixes given unprocessed tracks. MEGAMI uses a track-agnostic effects processor conditioned on per-track generated embeddings, handles arbitrary unlabeled tracks through a permutation-equivariant architecture, and enables training on both dry and wet recordings via domain adaptation. Our objective evaluation using distributional metrics shows consistent improvements over existing methods, while listening tests indicate performances approaching human-level quality across diverse musical genres.
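The permutation-equivariant handling of arbitrary, unlabeled tracks can be sketched with a DeepSets-style layer: each track's output depends on its own features plus a symmetric summary of the whole set, so reordering the input tracks merely reorders the outputs. This is a generic illustration, not MEGAMI's architecture; the mixing weights are hypothetical.

```python
def equivariant_layer(tracks, w_self=0.8, w_ctx=0.2):
    """Permutation-equivariant map over per-track feature vectors: each
    output mixes a track's own vector with the set mean (a symmetric,
    order-independent context)."""
    n, dim = len(tracks), len(tracks[0])
    mean = [sum(t[d] for t in tracks) / n for d in range(dim)]
    return [[w_self * t[d] + w_ctx * mean[d] for d in range(dim)]
            for t in tracks]
```

Because the context term is order-independent, the layer needs no track labels and accepts any number of stems.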
[578] Pruning as Regularization: Sensitivity-Aware One-Shot Pruning in ASR
Julian Irigoyen, Arthur Söhler, Andreas Søeborg Kirkedal
Main category: eess.AS
TL;DR: Neural network pruning serves as implicit regularization for ASR, not just compression. Targeted pruning of decoder self-attention and last encoder layers improves generalization without fine-tuning.
Details
Motivation: To challenge the conventional view of pruning as solely compression and demonstrate its role as implicit regularizer for automatic speech recognition (ASR).
Method: Combined gradient- and Fisher-based sensitivity diagnostics with targeted, component-wise pruning on Whisper-small model, focusing on architectural asymmetries.
Result: Pruning 50% of decoder self-attention reduced WER by 2.38% absolute (20.44% relative) on LibriSpeech; pruning last four encoder layers at 50% yielded 1.72% absolute improvement. Gains persisted across multiple datasets.
Conclusion: Pruning should be viewed as a first-class architectural design tool where knowing where to prune is as important as how much to prune, enabling both regularization benefits and aggressive compression.
Abstract: We challenge the conventional view of neural network pruning as solely a compression technique, demonstrating that one-shot magnitude pruning serves as a powerful implicit regularizer for ASR. Using Whisper-small, we combine gradient- and Fisher-based sensitivity diagnostics with targeted, component-wise pruning. This reveals architectural asymmetries: decoder FFNs are pruning-fragile, whereas decoder self-attention and the last encoder layers contain redundancy that, when removed, improves generalization. Without fine-tuning, pruning 50% of decoder self-attention reduces WER by 2.38% absolute (20.44% relative) on LibriSpeech test-other; pruning the last four encoder layers at 50% instead yields a 1.72% absolute (14.8% relative) improvement. Gains persisted on Common Voice and TED-LIUM datasets. Beyond regularization benefits, our sensitivity-aware approach enables more aggressive one-shot compression. At 40% sparsity, where established global pruning approaches catastrophically fail, our method preserves near-baseline accuracy. This positions pruning as a first-class architectural design tool: knowing where to prune is as important as how much to prune.
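One-shot magnitude pruning itself is simple to state: rank weights by absolute value and zero out the smallest fraction. A minimal sketch over a weight matrix follows; the component-wise targeting (e.g. decoder self-attention at 50%) comes from the paper's sensitivity diagnostics, while this helper is generic.

```python
def magnitude_prune(weights, sparsity):
    """One-shot magnitude pruning of a weight matrix: zero the `sparsity`
    fraction of entries with the smallest absolute value (ties at the
    threshold are all pruned)."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]
```

Applied to a sensitivity-chosen component, this is the entire intervention: no fine-tuning follows, which is what makes the observed WER improvements a regularization effect rather than a retraining effect.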
[579] Quantizing Whisper-small: How design choices affect ASR performance
Arthur Söhler, Julian Irigoyen, Andreas Søeborg Kirkedal
Main category: eess.AS
TL;DR: Evaluation of post-training quantization methods for Whisper-small speech recognition model, showing dynamic int8 quantization with Quanto reduces model size by 57% while maintaining or improving accuracy.
Details
Motivation: Large speech recognition models like Whisper-small are computationally demanding and difficult to deploy on edge devices, requiring efficient compression methods.
Method: Unified cross-library evaluation of PTQ using PyTorch, Optimum-Quanto, HQQ, and bitsandbytes, testing quantization schemes, methods, granularity, and bit-widths on LibriSpeech datasets.
Result: Dynamic int8 quantization with Quanto achieved best trade-off: 57% size reduction with improved word error rate. More aggressive formats (nf4, int3) achieved 71% compression but with accuracy loss in noisy conditions.
Conclusion: Carefully chosen PTQ methods can substantially reduce Whisper-small’s size and inference cost without retraining, enabling efficient deployment on constrained hardware.
Abstract: Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline’s word error rate. Static quantization performed worse, likely due to Whisper’s Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
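Dynamic symmetric int8 quantization, the best-performing setting in the study, can be sketched in a few lines: the scale is computed per tensor at runtime from the max absolute value, which is what distinguishes it from static quantization's pre-calibrated scales. This is a generic illustration, not the Quanto implementation.

```python
def quantize_int8(x):
    """Dynamic symmetric int8 quantization: the scale is derived at runtime
    from the tensor's max absolute value, mapping it onto [-127, 127]."""
    scale = max(abs(v) for v in x) / 127.0 or 1.0  # avoid zero scale on all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```

Because the scale follows each activation tensor's actual range, dynamic quantization tolerates the varying statistics of Whisper's Transformer activations better than a single calibration pass.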
[580] Unifying Model and Layer Fusion for Speech Foundation Models
Yi-Jen Shih, David Harwath
Main category: eess.AS
TL;DR: Proposes a unified fusion interface for multiple speech foundation models that integrates representations across both different models and their layers, outperforming prior fusion approaches on ASR and paralinguistic tasks.
Details
Motivation: To improve downstream speech task performance by unifying two existing fusion strategies: intra-model layer fusion and inter-model fusion, enabling more comprehensive representation integration.
Method: An interface module that enables fusion across multiple upstream speech models while integrating information across their different layers, allowing cross-model and cross-layer representation fusion.
Result: Outperforms prior fusion approaches across various speech tasks including ASR and paralinguistic analysis, with performance improvements dependent on appropriate upstream model selection.
Conclusion: The proposed interface provides additional performance boost when given suitable upstream models, making it a promising approach for utilizing Speech Foundation Models effectively.
Abstract: Speech Foundation Models have gained significant attention recently. Prior works have shown that the fusion of representations from multiple layers of the same model or the fusion of multiple models can improve performance on downstream tasks. We unify these two fusion strategies by proposing an interface module that enables fusion across multiple upstream speech models while integrating information across their layers. We conduct extensive experiments on different self-supervised and supervised models across various speech tasks, including ASR and paralinguistic analysis, and demonstrate that our method outperforms prior fusion approaches. We further analyze its scalability concerning model size and count, highlighting the importance of selecting appropriate upstream models. Our results show that the proposed interface provides an additional performance boost when given a suitable upstream model selection, making it a promising approach for utilizing Speech Foundation Models.
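Intra-model layer fusion of the kind being unified here is often implemented as a softmax-weighted sum of per-layer features. A minimal sketch follows; it is generic, the weights would normally be learned end-to-end, and the paper's interface module additionally fuses across multiple upstream models.

```python
import math

def fuse_layers(layer_feats, weights):
    """Softmax-weighted sum of per-layer feature vectors; `weights` holds
    one scalar per layer (learned in practice, fixed here)."""
    m = max(weights)
    exp = [math.exp(w - m) for w in weights]
    z = sum(exp)
    alphas = [e / z for e in exp]
    dim = len(layer_feats[0])
    return [sum(a * f[d] for a, f in zip(alphas, layer_feats))
            for d in range(dim)]
```

Extending this to inter-model fusion amounts to applying the same weighting over the fused outputs of several upstream models before the downstream head.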
[581] Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding
Rui-Chen Zheng, Wenrui Liu, Hui-Peng Du, Qinglin Zhang, Chong Deng, Qian Chen, Wen Wang, Yang Ai, Zhen-Hua Ling
Main category: eess.AS
TL;DR: VARSTok is a variable-frame-rate speech tokenizer that adapts token allocation based on local feature similarity, achieving superior performance with fewer tokens than fixed-rate baselines.
Details
Motivation: Existing speech tokenizers use fixed token allocation per second, which mismatches the uneven information distribution in speech signals.
Method: Uses temporal-aware density peak clustering for adaptive segmentation and implicit duration coding that embeds content and temporal span into single tokens.
Result: Achieves 23% fewer tokens than 40 Hz baseline while improving reconstruction naturalness, lower word error rates, and better TTS synthesis.
Conclusion: First work demonstrating seamless integration of fully dynamic variable-frame-rate acoustic tokenizer into downstream speech language models.
Abstract: Existing speech tokenizers typically assign a fixed number of tokens per second, regardless of the varying information density or temporal fluctuations in the speech signal. This uniform token allocation mismatches the intrinsic structure of speech, where information is distributed unevenly over time. To address this, we propose VARSTok, a VAriable-frame-Rate Speech Tokenizer that adapts token allocation based on local feature similarity. VARSTok introduces two key innovations: (1) a temporal-aware density peak clustering algorithm that adaptively segments speech into variable-length units, and (2) a novel implicit duration coding scheme that embeds both content and temporal span into a single token index, eliminating the need for auxiliary duration predictors. Extensive experiments show that VARSTok significantly outperforms strong fixed-rate baselines. Notably, it achieves superior reconstruction naturalness while using up to 23% fewer tokens than a 40 Hz fixed-frame-rate baseline. VARSTok further yields lower word error rates and improved naturalness in zero-shot text-to-speech synthesis. To the best of our knowledge, this is the first work to demonstrate that a fully dynamic, variable-frame-rate acoustic speech tokenizer can be seamlessly integrated into downstream speech language models.
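The implicit duration coding idea, folding a segment's temporal span into the token index itself so that no auxiliary duration predictor is needed, can be sketched as a simple packing scheme. The maximum duration below is hypothetical and the paper's actual codebook layout may differ.

```python
MAX_DUR = 8  # hypothetical maximum segment span, in frames

def encode_token(content_id, duration):
    """Fold a segment's duration (1..MAX_DUR) into the token index, so a
    single index carries both content identity and temporal span."""
    assert 1 <= duration <= MAX_DUR
    return content_id * MAX_DUR + (duration - 1)

def decode_token(token):
    return token // MAX_DUR, token % MAX_DUR + 1
```

The cost is a codebook that is MAX_DUR times larger, but decoding recovers both content and span from one index with no side channel.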
eess.IV
[582] EvoPS: Evolutionary Patch Selection for Whole Slide Image Analysis in Computational Pathology
Saya Hashemian, Azam Asilian Bidgoli
Main category: eess.IV
TL;DR: EvoPS is an evolutionary patch selection framework that optimizes the trade-off between computational cost and diagnostic accuracy in computational pathology by selecting minimal patch embeddings while maintaining or improving classification performance.
Details
Motivation: Current patch selection methods for Whole-Slide Images are inefficient, computationally expensive, and fail to explicitly manage the crucial trade-off between the number of selected patches and slide representation accuracy.
Method: Formulates patch selection as a multi-objective optimization problem using evolutionary search to simultaneously minimize selected patch embeddings and maximize downstream similarity search performance, generating a Pareto front of optimal solutions.
Result: EvoPS reduces required training patch embeddings by over 90% while maintaining or improving classification F1-score compared to using all available patches across four TCGA cancer cohorts using five pretrained deep learning models.
Conclusion: EvoPS provides a robust and principled method for creating efficient, accurate, and interpretable WSI representations, enabling optimal balance between computational cost and diagnostic performance.
Abstract: In computational pathology, the gigapixel scale of Whole-Slide Images (WSIs) necessitates their division into thousands of smaller patches. Analyzing these high-dimensional patch embeddings is computationally expensive and risks diluting key diagnostic signals with many uninformative patches. Existing patch selection methods often rely on random sampling or simple clustering heuristics and typically fail to explicitly manage the crucial trade-off between the number of selected patches and the accuracy of the resulting slide representation. To address this gap, we propose EvoPS (Evolutionary Patch Selection), a novel framework that formulates patch selection as a multi-objective optimization problem and leverages an evolutionary search to simultaneously minimize the number of selected patch embeddings and maximize the performance of a downstream similarity search task, generating a Pareto front of optimal trade-off solutions. We validated our framework across four major cancer cohorts from The Cancer Genome Atlas (TCGA) using five pretrained deep learning models to generate patch embeddings, including both supervised CNNs and large self-supervised foundation models. The results demonstrate that EvoPS can reduce the required number of training patch embeddings by over 90% while consistently maintaining or even improving the final classification F1-score compared to a baseline that uses all available patches’ embeddings selected through a standard extraction pipeline. The EvoPS framework provides a robust and principled method for creating efficient, accurate, and interpretable WSI representations, empowering users to select an optimal balance between computational cost and diagnostic performance.
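The Pareto front of trade-off solutions that EvoPS exposes can be illustrated with a minimal non-dominated filter over (patch count, score) pairs, minimizing the first objective and maximizing the second. This is a generic sketch of the dominance check, not the paper's evolutionary search itself.

```python
def pareto_front(solutions):
    """Non-dominated (patch_count, score) pairs: minimize patches, maximize
    score. A solution is dominated if another uses no more patches, scores
    at least as well, and is strictly better in one objective."""
    front = []
    for i, (p_i, s_i) in enumerate(solutions):
        dominated = any(
            p_j <= p_i and s_j >= s_i and (p_j < p_i or s_j > s_i)
            for j, (p_j, s_j) in enumerate(solutions) if j != i
        )
        if not dominated:
            front.append((p_i, s_i))
    return front
```

The user then picks a point on the front according to their own compute budget versus accuracy requirement.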
[583] Deep generative priors for robust and efficient electron ptychography
Arthur R. C. McCray, Stephanie M. Ribet, Georgios Varnavides, Colin Ophus
Main category: eess.IV
TL;DR: Deep generative prior framework for electron ptychography that uses neural networks to improve noise robustness, convergence speed, and reduce manual hyperparameter tuning in 3D multislice reconstructions.
Details
Motivation: Conventional electron ptychography algorithms suffer from noise sensitivity, slow convergence, and extensive manual hyperparameter tuning for regularization, especially in 3D multislice reconstructions.
Method: Two deep generative priors parameterize the complex-valued sample and probe within an automatic-differentiation mixed-state multislice forward model, using convolutional neural networks for implicit regularization.
Result: DGPs offer greater noise robustness, improved information limits at low dose, faster convergence (especially at low spatial frequencies), improved depth regularization, and minimal user-specified regularization compared to pixel-based reconstructions.
Conclusion: DGP-enabled ptychography reduces expertise barriers and computational cost, delivering robust, high-resolution imaging across diverse materials and biological systems.
Abstract: Electron ptychography enables dose-efficient atomic-resolution imaging, but conventional reconstruction algorithms suffer from noise sensitivity, slow convergence, and extensive manual hyperparameter tuning for regularization, especially in three-dimensional multislice reconstructions. We introduce a deep generative prior (DGP) framework for electron ptychography that uses the implicit regularization of convolutional neural networks to address these challenges. Two DGPs parameterize the complex-valued sample and probe within an automatic-differentiation mixed-state multislice forward model. Compared to pixel-based reconstructions, DGPs offer four key advantages: (i) greater noise robustness and improved information limits at low dose; (ii) markedly faster convergence, especially at low spatial frequencies; (iii) improved depth regularization; and (iv) minimal user-specified regularization. The DGP framework promotes spatial coherence and suppresses high-frequency noise without extensive tuning, and a pre-training strategy stabilizes reconstructions. Our results establish DGP-enabled ptychography as a robust approach that reduces expertise barriers and computational cost, delivering robust, high-resolution imaging across diverse materials and biological systems.
[584] Deep Learning Analysis of Prenatal Ultrasound for Identification of Ventriculomegaly
Youssef Megahed, Inok Lee, Robin Ducharme, Aylin Erman, Olivier X. Miguel, Kevin Dick, Adrian D. C. Chan, Steven Hawken, Mark Walker, Felipe Moretti
Main category: eess.IV
TL;DR: Developed a deep learning model using fine-tuned Ultrasound Self-Supervised Foundation Model (USF-MAE) to detect ventriculomegaly in prenatal ultrasound images with high accuracy (97.24%) and explainable focus on ventricle areas.
Details
Motivation: Ventriculomegaly is a prenatal condition with dilated cerebral ventricles that requires early diagnosis due to association with fetal aneuploidies and genetic syndromes, necessitating automated detection methods.
Method: Fine-tuned USF-MAE (Vision Transformer encoder pretrained on 370,000+ ultrasound images) for binary classification of normal vs ventriculomegaly fetal brain ultrasound images using 5-fold cross-validation and independent test cohort.
Result: Achieved F1-score of 91.76% (cross-validation) and 91.78% (test set), outperforming baseline models (VGG-19, ResNet-50, ViT-B/16) by significant margins, with 97.24% accuracy and 94.47% precision.
Conclusion: The USF-MAE model effectively detects ventriculomegaly with high performance and clinical plausibility, as evidenced by Eigen-CAM heatmaps showing focus on ventricle areas, providing explainable AI for prenatal diagnosis.
Abstract: The proposed study aimed to develop a deep learning model capable of detecting ventriculomegaly on prenatal ultrasound images. Ventriculomegaly is a prenatal condition characterized by dilated cerebral ventricles of the fetal brain and is important to diagnose early, as it can be associated with an increased risk for fetal aneuploidies and/or underlying genetic syndromes. An Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), recently developed by our group, was fine-tuned for a binary classification task to distinguish fetal brain ultrasound images as either normal or showing ventriculomegaly. The USF-MAE incorporates a Vision Transformer encoder pretrained on more than 370,000 ultrasound images from the OpenUS-46 corpus. For this study, the pretrained encoder was adapted and fine-tuned on a curated dataset of fetal brain ultrasound images to optimize its performance for ventriculomegaly detection. Model evaluation was conducted using 5-fold cross-validation and an independent test cohort, and performance was quantified using accuracy, precision, recall, specificity, F1-score, and area under the receiver operating characteristic curve (AUC). The proposed USF-MAE model reached an F1-score of 91.76% on the 5-fold cross-validation and 91.78% on the independent test set, with much higher scores than those obtained by the baseline models by 19.37% and 16.15% compared to VGG-19, 2.31% and 2.56% compared to ResNet-50, and 5.03% and 11.93% compared to ViT-B/16, respectively. The model also showed a high mean test precision of 94.47% and an accuracy of 97.24%. The Eigen-CAM (Eigen Class Activation Map) heatmaps showed that the model was focusing on the ventricle area for the diagnosis of ventriculomegaly, which has explainability and clinical plausibility.
[585] DynaQuant: Dynamic Mixed-Precision Quantization for Learned Image Compression
Youneng Bao, Yulong Cheng, Yiping Liu, Yichen Yang, Peng Qin, Mu Li, Yongsheng Liang
Main category: eess.IV
TL;DR: DynaQuant introduces dynamic mixed-precision quantization for learned image compression, adapting bit-widths to data distributions and layer sensitivity using content-aware quantization and dynamic bit-width selection.
Details
Motivation: Static uniform bit-width quantization in LIC fails to adapt to diverse data distributions and sensitivity characteristics, leading to suboptimal performance-efficiency trade-offs.
Method: Two-level framework: 1) Content-aware quantization with learnable scaling/offset parameters and Distance-aware Gradient Modulator for training, 2) Dynamic bit-width selector that assigns optimal precision per layer based on input data.
Result: Achieves rate-distortion performance comparable to full-precision models while significantly reducing computational and storage requirements.
Conclusion: DynaQuant enables practical deployment of advanced LIC on diverse hardware platforms through flexible balancing of performance and efficiency.
Abstract: Prevailing quantization techniques in Learned Image Compression (LIC) typically employ a static, uniform bit-width across all layers, failing to adapt to the highly diverse data distributions and sensitivity characteristics inherent in LIC models. This leads to a suboptimal trade-off between performance and efficiency. In this paper, we introduce DynaQuant, a novel framework for dynamic mixed-precision quantization that operates on two complementary levels. First, we propose content-aware quantization, where learnable scaling and offset parameters dynamically adapt to the statistical variations of latent features. This fine-grained adaptation is trained end-to-end using a novel Distance-aware Gradient Modulator (DGM), which provides a more informative learning signal than the standard Straight-Through Estimator. Second, we introduce a data-driven, dynamic bit-width selector that learns to assign an optimal bit precision to each layer, dynamically reconfiguring the network’s precision profile based on the input data. Our fully dynamic approach offers substantial flexibility in balancing rate-distortion (R-D) performance and computational cost. Experiments demonstrate that DynaQuant achieves R-D performance comparable to full-precision models while significantly reducing computational and storage requirements, thereby enabling the practical deployment of advanced LIC on diverse hardware platforms.
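A minimal sketch of the two ingredients named above: uniform affine quantization with a scale/offset pair, and a per-layer bit-width chosen by reconstruction error. The brute-force selector stands in for DynaQuant's learned selector, and all parameter values are illustrative, not from the paper.

```python
def quantize(x, scale, offset, bits):
    """Uniform affine quantize-dequantize, clamped to the signed b-bit grid."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = [max(qmin, min(qmax, round(v / scale + offset))) for v in x]
    return [(v - offset) * scale for v in q]

def select_bitwidth(x, scale, offset, candidates=(2, 4, 8)):
    """Pick the lowest-error bit-width for this layer (toy stand-in for a
    learned dynamic selector)."""
    def mse(bits):
        xq = quantize(x, scale, offset, bits)
        return sum((a - b) ** 2 for a, b in zip(x, xq)) / len(x)
    return min(candidates, key=mse)

acts = [0.1 * i - 0.5 for i in range(11)]  # toy activations in [-0.5, 0.5]
print(select_bitwidth(acts, scale=0.05, offset=0.0))  # 8: lowest clamp/round error
```

In the paper the scale and offset are learnable and the selection is made per input, but the trade-off being optimized is the same: fewer bits cost reconstruction error.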
[586] From Noise to Latent: Generating Gaussian Latents for INR-Based Image Compression
Chaoyi Lin, Yaojun Wu, Yue Li, Junru Li, Kai Zhang, Li Zhang
Main category: eess.IV
TL;DR: Proposes a novel image compression method that generates image-specific latents from Gaussian noise using a shared random seed, eliminating the need to transmit latent codes while maintaining competitive performance.
Details
Motivation: Address limitations of both implicit neural representation (INR) methods (lack of expressive latents) and end-to-end (E2E) methods (complex entropy models and transmission overhead) by exploring inverse Gaussian transformation for latent generation.
Method: Uses Gaussian Parameter Prediction (GPP) module to estimate distribution parameters from Gaussian noise, generates latents via reparameterization trick, and reconstructs images through synthesis network.
Result: Achieves competitive rate-distortion performance on Kodak and CLIC datasets without transmitting latent codes.
Conclusion: First work to explore Gaussian latent generation for learned image compression, offering a new paradigm that combines benefits of latent-based approaches without transmission overhead.
Abstract: Recent implicit neural representation (INR)-based image compression methods have shown competitive performance by overfitting image-specific latent codes. However, they remain inferior to end-to-end (E2E) compression approaches due to the absence of expressive latent representations. On the other hand, E2E methods rely on transmitting latent codes and require complex entropy models, leading to increased decoding complexity. Inspired by the normalization strategy in E2E codecs where latents are transformed into Gaussian noise to demonstrate the removal of spatial redundancy, we explore the inverse direction: generating latents directly from Gaussian noise. In this paper, we propose a novel image compression paradigm that reconstructs image-specific latents from a multi-scale Gaussian noise tensor, deterministically generated using a shared random seed. A Gaussian Parameter Prediction (GPP) module estimates the distribution parameters, enabling one-shot latent generation via the reparameterization trick. The predicted latent is then passed through a synthesis network to reconstruct the image. Our method eliminates the need to transmit latent codes while preserving latent-based benefits, achieving competitive rate-distortion performance on the Kodak and CLIC datasets. To the best of our knowledge, this is the first work to explore Gaussian latent generation for learned image compression.
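The key trick, shared-seed noise plus the reparameterization trick, fits in a few lines. In the sketch below, mu and sigma are hypothetical stand-ins for the GPP module's predicted distribution parameters; only they would need to be available at the decoder, never the noise itself.

```python
import random

def seeded_noise(n, seed):
    """Deterministic Gaussian noise: encoder and decoder share the seed,
    so the noise tensor itself never has to be transmitted."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def generate_latent(mu, sigma, seed):
    """One-shot latent via the reparameterization trick: z = mu + sigma * eps."""
    eps = seeded_noise(len(mu), seed)
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# Hypothetical distribution parameters (stand-ins for the GPP module's output).
mu, sigma = [0.5, -1.0, 2.0], [0.1, 0.2, 0.3]
z_encoder = generate_latent(mu, sigma, seed=42)
z_decoder = generate_latent(mu, sigma, seed=42)
print(z_encoder == z_decoder)  # True: same seed, identical latent on both sides
```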
[587] Targeted Unlearning Using Perturbed Sign Gradient Methods With Applications On Medical Images
George R. Nahass, Zhu Wang, Homa Rashidisabet, Won Hwa Kim, Sasha Hubschman, Jeffrey C. Peterson, Chad A. Purnell, Pete Setabutr, Ann Q. Tran, Darvin Yi, Sathya N. Ravi
Main category: eess.IV
TL;DR: Machine unlearning as a practical tool for clinical model maintenance, using bilevel optimization with tunable loss design and model composition strategies to handle data shifts and policy changes without full retraining.
Details
Motivation: To establish machine unlearning as a general-purpose tool for post-deployment model revision in clinical contexts where data shifts, device deprecation, and policy changes are common, moving beyond privacy-focused applications.
Method: Proposes a bilevel optimization formulation of boundary-based unlearning with iterative algorithms, featuring tunable loss design for controlling forgetting-retention tradeoff and supporting model composition strategies that merge strengths from different unlearning runs.
Result: Outperforms baselines on both forgetting and retention metrics across benchmark and real-world clinical imaging datasets, including scenarios with imaging devices and anatomical outliers.
Conclusion: Establishes machine unlearning as a modular, practical alternative to retraining for real-world model maintenance in clinical applications, with convergence guarantees for first-order algorithms.
Abstract: Machine unlearning aims to remove the influence of specific training samples from a trained model without full retraining. While prior work has largely focused on privacy-motivated settings, we recast unlearning as a general-purpose tool for post-deployment model revision. Specifically, we focus on utilizing unlearning in clinical contexts where data shifts, device deprecation, and policy changes are common. To this end, we propose a bilevel optimization formulation of boundary-based unlearning that can be solved using iterative algorithms. We provide convergence guarantees when first-order algorithms are used to unlearn. Our method introduces tunable loss design for controlling the forgetting-retention tradeoff and supports novel model composition strategies that merge the strengths of distinct unlearning runs. Across benchmark and real-world clinical imaging datasets, our approach outperforms baselines on both forgetting and retention metrics, including scenarios involving imaging devices and anatomical outliers. This work establishes machine unlearning as a modular, practical alternative to retraining for real-world model maintenance in clinical applications.
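The paper's boundary-based bilevel formulation is not spelled out in this summary. As a toy illustration of the tunable forgetting-retention tradeoff it describes, the sketch below descends on a retain-set loss while ascending on a forget-set loss for a one-parameter least-squares model; the model, data, and lam weighting are all hypothetical.

```python
def unlearn_step(w, retain, forget, lr, lam):
    """One tradeoff-weighted unlearning step on a 1-D model y = w * x:
    descend on the retain-set loss, ascend on the forget-set loss.
    lam tunes the forgetting-retention tradeoff (larger = forget harder)."""
    g_retain = sum(2 * x * (w * x - y) for x, y in retain) / len(retain)
    g_forget = sum(2 * x * (w * x - y) for x, y in forget) / len(forget)
    return w - lr * (g_retain - lam * g_forget)

retain = [(1.0, 2.0), (2.0, 4.0)]  # consistent with w = 2
forget = [(1.0, 5.0)]              # sample to be unlearned
w = 2.0                            # start from a model that fits the retain set
for _ in range(50):
    w = unlearn_step(w, retain, forget, lr=0.05, lam=0.3)
print(round(w, 4))  # ≈ 1.5909: prediction moves away from the forgotten sample
```

With this step size the combined update is a contraction, which echoes (in miniature) why convergence guarantees for first-order unlearning algorithms are attainable.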
[588] py360tool: A framework for tile-based 360$^\circ$ video manipulation
Henrique Domingues Garcia, Marcelo Menezes de Carvalho
Main category: eess.IV
TL;DR: py360tools is a Python library for handling tile-based 360° video streaming that automates client-side tasks like spherical projection reconstruction, viewport extraction, and tile selection to facilitate streaming simulation and analysis.
Details
Motivation: 360° video streaming is bandwidth-intensive and requires complex architecture involving viewport prediction, tile selection, and bitrate adaptation, creating a need for tools to evaluate QoE and QoS due to interactive nature and low reproducibility.
Method: Developed a Python library (py360tools) that automates key client-side tasks including spherical projection reconstruction, viewport extraction, and tile selection, with flexible architecture for analyzing different projections and tiling strategies.
Result: Created a library that facilitates playback and simulation of streaming sessions for 360° videos, enabling efficient analysis of different streaming approaches.
Conclusion: py360tools provides a valuable tool for researchers and developers working with 360° video streaming by automating complex client-side tasks and enabling flexible analysis of streaming strategies.
Abstract: The streaming of 360$^\circ$ videos is one of the most bandwidth-demanding virtual reality (VR) applications, as the video must be encoded in ultra-high resolution to ensure an immersive experience. To optimize its transmission, current approaches partition the spherical video into tiles, which are encoded at different bitrates and selectively delivered, based on the viewing direction of the user (viewport). The complexity of this architecture, which involves viewport prediction, tile selection, bit rate adaptation, and handling of parallel streaming, requires new tools to evaluate quality of experience (QoE) and quality of service (QoS), especially due to its interactive nature and low reproducibility. This work introduces py360tools, a Python library to handle tile-based 360$^\circ$ video streaming. The library automates key client-side tasks, such as spherical projection reconstruction, viewport extraction, and tile selection, facilitating the playback and simulation of streaming sessions. Furthermore, py360tools offers a flexible architecture, enabling efficient analysis of different projections and tiling strategies.
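As an illustration of the client-side tile selection the library automates, the sketch below marks the tiles of an equirectangular grid that overlap a rectangular viewport, handling yaw wraparound. The grid size, field of view, and center-distance overlap test are simplifications chosen here, not py360tools' actual API.

```python
def tiles_in_viewport(n_cols, n_rows, yaw, pitch, fov_h, fov_v):
    """Select equirectangular tiles overlapping a rectangular viewport.

    Tiles form an n_cols x n_rows grid over 360 x 180 degrees; yaw wraps."""
    tile_w, tile_h = 360.0 / n_cols, 180.0 / n_rows
    selected = []
    for row in range(n_rows):
        for col in range(n_cols):
            # Tile centre in degrees (yaw in [0, 360), pitch in [-90, 90]).
            cy = (col + 0.5) * tile_w
            cp = 90.0 - (row + 0.5) * tile_h
            dyaw = abs((cy - yaw + 180.0) % 360.0 - 180.0)  # wrapped distance
            if dyaw <= (fov_h + tile_w) / 2 and abs(cp - pitch) <= (fov_v + tile_h) / 2:
                selected.append((row, col))
    return selected

# 4x2 tiling, viewport at yaw=0, pitch=0 with a 90x90 degree field of view:
# the selection wraps around the yaw seam, picking columns 0 and 3.
print(tiles_in_viewport(4, 2, yaw=0.0, pitch=0.0, fov_h=90.0, fov_v=90.0))
```

Only the selected tiles would be requested at high bitrate; the rest can be fetched at low quality or skipped, which is the bandwidth saving tiled streaming targets.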
[589] On hallucinations in AI-generated content for nuclear medicine imaging (the DREAM report)
Menghua Xia, Reimund Bayerlein, Yanis Chemli, Xiaofeng Liu, Jinsong Ouyang, MingDe Lin, Georges El Fakhri, Ramsey D. Badawi, Quanzheng Li, Chi Liu
Main category: eess.IV
TL;DR: AIGC shows promise in nuclear medicine imaging but faces hallucination risks that can compromise diagnostic accuracy. This paper introduces the DREAM report framework to address hallucination challenges through definition, examples, detection, causes, and mitigation strategies.
Details
Motivation: While AIGC offers cost-effective solutions for nuclear medicine imaging tasks like image enhancement and motion correction, the risk of hallucinations generating realistic but factually incorrect content threatens diagnostic accuracy and clinical trust.
Method: The paper presents the DREAM report framework, which provides comprehensive recommendations covering: Definition of hallucinations, Representative examples, Detection and evaluation metrics, underlying causes, and Mitigation strategies for AIGC in nuclear medicine imaging.
Result: The DREAM report establishes a structured approach to understand and address hallucination challenges in AIGC for nuclear medicine imaging, providing a foundation for safer clinical deployment.
Conclusion: This position statement aims to create a common understanding and framework for future research to enhance AIGC applications in nuclear medicine imaging, supporting their safe and effective clinical implementation.
Abstract: Artificial intelligence-generated content (AIGC) has shown remarkable performance in nuclear medicine imaging (NMI), offering cost-effective software solutions for tasks such as image enhancement, motion correction, and attenuation correction. However, these advancements come with the risk of hallucinations, generating realistic yet factually incorrect content. Hallucinations can misrepresent anatomical and functional information, compromising diagnostic accuracy and clinical trust. This paper presents a comprehensive perspective of hallucination-related challenges in AIGC for NMI, introducing the DREAM report, which covers recommendations for definition, representative examples, detection and evaluation metrics, underlying causes, and mitigation strategies. This position statement paper aims to initiate a common understanding for discussions and future research toward enhancing AIGC applications in NMI, thereby supporting their safe and effective deployment in clinical practice.
[590] Filling of incomplete sinograms from sparse PET detector configurations using a residual U-Net
Klara Leffler, Luigi Tommaso Luppino, Samuel Kuttner, Karin Söderkvist, Jan Axelsson
Main category: eess.IV
TL;DR: Deep learning-based sinogram restoration enables sparse PET detector configurations, reducing costs while maintaining image quality through a modified Residual U-Net that recovers missing data from 75% undersampled sinograms.
Details
Motivation: To address the high cost of densely packed photodetectors in long axial field-of-view PET scanners by developing sparse system configurations that maintain similar detector costs to standard PET systems.
Method: A modified Residual U-Net trained on clinical PET scans from GE Signa PET/MR, simulating removal of 50% detectors in chessboard pattern (retaining only 25% of lines of response) to restore missing sinogram data.
Result: The model successfully recovers missing counts with mean absolute error below two events per pixel, outperforming 2D interpolation in both sinogram and reconstructed image domains, though reconstructed images lack sharpness in finer details.
Conclusion: Sparse detector configurations combined with deep learning offer a viable alternative to conventional PET designs, supporting development of cost-effective total body PET scanners.
Abstract: Long axial field-of-view PET scanners offer increased field-of-view and sensitivity compared to traditional PET scanners. However, a significant cost is associated with the densely packed photodetectors required for the extended-coverage systems, limiting clinical utilisation. To mitigate the cost limitations, alternative sparse system configurations have been proposed, allowing an extended field-of-view PET design with detector costs similar to a standard PET system, albeit at the expense of image quality. In this work, we propose a deep sinogram restoration network to fill in the missing sinogram data. Our method utilises a modified Residual U-Net, trained on clinical PET scans from a GE Signa PET/MR, simulating the removal of 50% of the detectors in a chessboard pattern (retaining only 25% of all lines of response). The model successfully recovers missing counts, with a mean absolute error below two events per pixel, outperforming 2D interpolation in both the sinogram and reconstructed image domains. Notably, the predicted sinograms exhibit a smoothing effect, leading to reconstructed images lacking sharpness in finer details. Despite these limitations, the model demonstrates a substantial capacity for compensating for the undersampling caused by the sparse detector configuration. This proof-of-concept study suggests that sparse detector configurations, combined with deep learning techniques, offer a viable alternative to conventional PET scanner designs. This approach supports the development of cost-effective, total body PET scanners, allowing a significant step forward in medical imaging technology.
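The 50%-detectors-to-25%-LORs figure follows because a line of response (LOR) survives only if both of its detectors do. A back-of-envelope check, using a 1-D alternating pattern as a simplification of the 2-D chessboard:

```python
def chessboard_kept(n_detectors):
    """Remove every other detector in an alternating pattern and count the
    lines of response (detector pairs) that survive: a LOR is kept only if
    BOTH of its detectors are kept, so 50% of detectors -> ~25% of LORs."""
    kept = [i for i in range(n_detectors) if i % 2 == 0]
    total_lors = n_detectors * (n_detectors - 1) // 2
    kept_lors = len(kept) * (len(kept) - 1) // 2
    return kept_lors / total_lors

ratio = chessboard_kept(400)
print(round(ratio, 3))  # 0.249: roughly a quarter of all LORs remain
```

It is this missing three-quarters of the sinogram that the residual U-Net is trained to restore.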
[591] CoCoLIT: ControlNet-Conditioned Latent Image Translation for MRI to Amyloid PET Synthesis
Alec Sargood, Lemuel Puglisi, James H. Cole, Neil P. Oxtoby, Daniele Ravì, Daniel C. Alexander
Main category: eess.IV
TL;DR: CoCoLIT is a diffusion-based latent generative framework that synthesizes amyloid PET scans from structural MRI, achieving state-of-the-art performance in amyloid-positivity classification with significant improvements over existing methods.
Details
Motivation: To enable cost-effective large-scale Alzheimer's Disease screening by synthesizing amyloid PET scans from widely available structural MRI, leveraging evidence that MRI encodes information correlated with amyloid deposition.
Method: CoCoLIT uses a diffusion-based latent generative framework with three innovations: Weighted Image Space Loss (WISL) for better latent representation, analysis of Latent Average Stabilization (LAS) for inference consistency, and ControlNet-based conditioning for MRI-to-PET translation.
Result: Significantly outperforms state-of-the-art methods on both image-based and amyloid-related metrics, with +10.5% improvement on internal dataset and +23.7% on external dataset for amyloid-positivity classification.
Conclusion: CoCoLIT provides an effective solution for MRI-to-PET translation, enabling more accessible Alzheimer’s Disease screening through advanced latent space modeling and diffusion-based generation.
Abstract: Synthesizing amyloid PET scans from the more widely available and accessible structural MRI modality offers a promising, cost-effective approach for large-scale Alzheimer’s Disease (AD) screening. This is motivated by evidence that, while MRI does not directly detect amyloid pathology, it may nonetheless encode information correlated with amyloid deposition that can be uncovered through advanced modeling. However, the high dimensionality and structural complexity of 3D neuroimaging data pose significant challenges for existing MRI-to-PET translation methods. Modeling the cross-modality relationship in a lower-dimensional latent space can simplify the learning task and enable more effective translation. As such, we present CoCoLIT (ControlNet-Conditioned Latent Image Translation), a diffusion-based latent generative framework that incorporates three main innovations: (1) a novel Weighted Image Space Loss (WISL) that improves latent representation learning and synthesis quality; (2) a theoretical and empirical analysis of Latent Average Stabilization (LAS), an existing technique used in similar generative models to enhance inference consistency; and (3) the introduction of ControlNet-based conditioning for MRI-to-PET translation. We evaluate CoCoLIT’s performance on publicly available datasets and find that our model significantly outperforms state-of-the-art methods on both image-based and amyloid-related metrics. Notably, in amyloid-positivity classification, CoCoLIT outperforms the second-best method with improvements of +10.5% on the internal dataset and +23.7% on the external dataset. The code and models of our approach are available at https://github.com/brAIn-science/CoCoLIT.
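The abstract analyzes Latent Average Stabilization (LAS): averaging several stochastic generative passes to make inference more consistent. The sketch below illustrates only the averaging idea; sample_latent is a hypothetical stand-in for one generative pass, not CoCoLIT's model.

```python
import random

def sample_latent(seed, n=4):
    """Hypothetical stand-in for one stochastic generative pass."""
    rng = random.Random(seed)
    return [1.0 + rng.gauss(0.0, 0.5) for _ in range(n)]

def latent_average_stabilization(seeds, n=4):
    """Average several stochastic passes to reduce inference variance,
    the idea behind Latent Average Stabilization (LAS)."""
    samples = [sample_latent(s, n) for s in seeds]
    return [sum(vals) / len(samples) for vals in zip(*samples)]

averaged = latent_average_stabilization(range(32))
print([round(v, 2) for v in averaged])  # values cluster near the mean of 1.0
```

Averaging k passes shrinks the standard deviation of the result by roughly sqrt(k), which is why repeated inference runs agree more closely after LAS.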
[592] HyDeFuse: Provably Convergent Denoiser-Driven Hyperspectral Fusion
Sagar Kumar, Unni V S, Kunal Narayan Chaudhury
Main category: eess.IV
TL;DR: HyDeFuse is a denoiser-driven fusion algorithm that integrates hyperspectral and multispectral images using pseudo-linear denoisers for regularization, with proven global linear convergence.
Details
Motivation: HS images have fine spectral resolution but limited spatial resolution, while MS images have finer spatial details but fewer bands. Fusion aims to combine both to get improved spatial and spectral resolution, but reconstruction using forward models alone is challenging and requires regularization.
Method: Uses denoiser-driven regularization paradigm with off-the-shelf denoisers for implicit regularization within iterative algorithm. Specifically employs HyDeFuse algorithm leveraging pseudo-linear denoisers and applies contraction mapping theorem to establish convergence.
Result: Demonstrates global linear convergence of HyDeFuse algorithm and presents fusion results on publicly available datasets showing the performance of the proposed method.
Conclusion: Denoiser-driven regularization with pseudo-linear denoisers provides an effective approach for HS-MS fusion, with HyDeFuse achieving both theoretical convergence guarantees and practical performance improvements.
Abstract: Hyperspectral (HS) images provide fine spectral resolution but have limited spatial resolution, whereas multispectral (MS) images capture finer spatial details but have fewer bands. HS-MS fusion aims to integrate HS and MS images to generate a single image with improved spatial and spectral resolution. This is commonly formulated as an inverse problem with a linear forward model. However, reconstructing high-quality images using the forward model alone is challenging, necessitating the use of regularization techniques. In this work, we investigate the paradigm of denoiser-driven regularization, where a powerful off-the-shelf denoiser is used for implicit regularization within an iterative algorithm. This has shown much promise but remains relatively underexplored in hyperspectral imaging. The technical challenge lies in designing hyperspectral denoisers that can guarantee convergence: while strong denoisers can produce high-quality reconstructions, they may also cause instability or divergence. Specifically, we consider a denoiser-driven fusion algorithm, HyDeFuse, which leverages a class of pseudo-linear denoisers for implicit regularization. We demonstrate how the contraction mapping theorem can be applied to establish global linear convergence of HyDeFuse. Finally, we validate our theoretical findings and present fusion results on publicly available datasets to demonstrate the performance of HyDeFuse.
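The convergence argument rests on the composed update being a contraction. A scalar sketch with a toy linear denoiser D(x) = alpha*x + (1-alpha)*m and data term f(x) = 0.5*(x-b)^2 (both hypothetical stand-ins, not HyDeFuse's operators) shows the resulting linear convergence to a unique fixed point:

```python
def denoiser_driven_iteration(b, m, alpha=0.8, eta=0.5, iters=60, x0=0.0):
    """Plug-and-play style iteration x <- D(x - eta * grad f(x)) with a toy
    linear denoiser D(x) = alpha*x + (1-alpha)*m and data term f(x)=0.5(x-b)^2.
    The composed map has contraction factor alpha*(1-eta) < 1, so iterates
    converge linearly to a unique fixed point."""
    x = x0
    history = []
    for _ in range(iters):
        x = alpha * (x - eta * (x - b)) + (1 - alpha) * m
        history.append(x)
    return x, history

# Fixed point from the contraction: (alpha*eta*b + (1-alpha)*m) / (1 - alpha*(1-eta))
x_star = lambda b, m, a, e: (a * e * b + (1 - a) * m) / (1 - a * (1 - e))
x, hist = denoiser_driven_iteration(b=3.0, m=1.0)
print(round(x, 6), round(x_star(3.0, 1.0, 0.8, 0.5), 6))  # both 2.333333
```

The same reasoning, lifted to pseudo-linear denoisers on images via the contraction mapping theorem, is what gives HyDeFuse its global linear convergence guarantee.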
[593] HarmoQ: Harmonized Post-Training Quantization for High-Fidelity Image
Hongjun Wang, Jiyuan Chen, Xuan Song, Yinqiang Zheng
Main category: eess.IV
TL;DR: HarmoQ is a unified quantization framework for super-resolution models that coordinates weight and activation quantization through structural residual calibration, harmonized scale optimization, and adaptive boundary refinement, achieving superior performance under aggressive compression.
Details
Motivation: Existing post-training quantization methods treat weight and activation quantization independently, missing their critical interplay in super-resolution models, where weights encode restoration priors and activations carry intensity information.
Method: Three synergistic steps: structural residual calibration to adjust weights for activation-induced detail loss, harmonized scale optimization via closed-form solutions to balance quantization difficulty, and adaptive boundary refinement to maintain balance during optimization.
Result: Achieves 0.46 dB improvement on Set5 at 2-bit quantization while delivering 3.2x speedup and 4x memory reduction on A100 GPUs, substantially outperforming prior methods.
Conclusion: This work provides the first systematic analysis of weight-activation coupling in super-resolution quantization and establishes a principled solution for efficient high-quality image restoration.
Abstract: Post-training quantization offers an efficient pathway to deploy super-resolution models, yet existing methods treat weight and activation quantization independently, missing their critical interplay. Through controlled experiments on SwinIR, we uncover a striking asymmetry: weight quantization primarily degrades structural similarity, while activation quantization disproportionately affects pixel-level accuracy. This stems from their distinct roles–weights encode learned restoration priors for textures and edges, whereas activations carry input-specific intensity information. Building on this insight, we propose HarmoQ, a unified framework that harmonizes quantization across components through three synergistic steps: structural residual calibration proactively adjusts weights to compensate for activation-induced detail loss, harmonized scale optimization analytically balances quantization difficulty via closed-form solutions, and adaptive boundary refinement iteratively maintains this balance during optimization. Experiments show HarmoQ achieves substantial gains under aggressive compression, outperforming prior art by 0.46 dB on Set5 at 2-bit while delivering 3.2x speedup and 4x memory reduction on A100 GPUs. This work provides the first systematic analysis of weight-activation coupling in super-resolution quantization and establishes a principled solution for efficient high-quality image restoration.
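HarmoQ's closed-form harmonized scale optimization is not reproduced in this summary. The sketch below illustrates only the balancing idea it names, using standard min-max scales and a brute-force split of a per-layer bit budget between weights and activations; the data, budget, and search are illustrative assumptions, not the paper's method.

```python
def minmax_scale(x, bits):
    """Standard min-max scale for a symmetric uniform quantizer."""
    return max(abs(v) for v in x) / (2 ** (bits - 1) - 1)

def quant_mse(x, bits):
    """Quantize-dequantize error at the min-max scale."""
    s = minmax_scale(x, bits)
    return sum((v - round(v / s) * s) ** 2 for v in x) / len(x)

def balance_bits(weights, acts, total_bits=10):
    """Split a per-layer bit budget between weights and activations so the
    combined quantization error is minimized (a toy stand-in for HarmoQ's
    closed-form harmonized scale optimization)."""
    return min(
        ((bw, total_bits - bw) for bw in range(2, total_bits - 1)),
        key=lambda p: quant_mse(weights, p[0]) + quant_mse(acts, p[1]),
    )

weights = [(-1) ** i * (0.01 * i) for i in range(100)]  # narrow range: easier
acts = [0.5 * i - 25.0 for i in range(100)]             # wide range: harder
print(balance_bits(weights, acts))
```

The component with the harder distribution ends up with more of the budget, which is the asymmetry the paper exploits.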
[594] EndoIR: Degradation-Agnostic All-in-One Endoscopic Image Restoration via Noise-Aware Routing Diffusion
Tong Chen, Xinyu Ma, Long Bai, Wenyang Wang, Yue Sun, Luping Zhou
Main category: eess.IV
TL;DR: EndoIR is a diffusion-based framework that restores multiple types of endoscopic image degradations using a single model, achieving state-of-the-art performance with fewer parameters.
Details
Motivation: Endoscopic images often suffer from diverse co-occurring degradations like low lighting, smoke, and bleeding that obscure clinical details. Existing methods are task-specific and require prior knowledge of degradation types, limiting real-world clinical robustness.
Method: Proposes EndoIR with Dual-Domain Prompter for spatial-frequency features, adaptive embedding for shared/task-specific cues, Dual-Stream Diffusion architecture with Rectified Fusion Block, and Noise-Aware Routing Block for efficient feature selection.
Result: Experiments on SegSTRONG-C and CEC datasets show state-of-the-art performance across multiple degradation scenarios with fewer parameters than baselines. Downstream segmentation confirms clinical utility.
Conclusion: EndoIR provides an effective, all-in-one solution for endoscopic image restoration that handles multiple degradation types without requiring prior knowledge, demonstrating strong clinical applicability.
Abstract: Endoscopic images often suffer from diverse and co-occurring degradations such as low lighting, smoke, and bleeding, which obscure critical clinical details. Existing restoration methods are typically task-specific and often require prior knowledge of the degradation type, limiting their robustness in real-world clinical use. We propose EndoIR, an all-in-one, degradation-agnostic diffusion-based framework that restores multiple degradation types using a single model. EndoIR introduces a Dual-Domain Prompter that extracts joint spatial-frequency features, coupled with an adaptive embedding that encodes both shared and task-specific cues as conditioning for denoising. To mitigate feature confusion in conventional concatenation-based conditioning, we design a Dual-Stream Diffusion architecture that processes clean and degraded inputs separately, with a Rectified Fusion Block integrating them in a structured, degradation-aware manner. Furthermore, a Noise-Aware Routing Block improves efficiency by dynamically selecting only noise-relevant features during denoising. Experiments on SegSTRONG-C and CEC datasets demonstrate that EndoIR achieves state-of-the-art performance across multiple degradation scenarios while using fewer parameters than strong baselines, and downstream segmentation experiments confirm its clinical utility.
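As a toy illustration of the routing idea, the sketch below keeps only the k features ranked most noise-relevant and zeroes the rest, so downstream computation can skip inactive features. The scores and the top-k rule are assumptions for illustration, not EndoIR's learned router.

```python
def noise_aware_route(features, noise_scores, k=2):
    """Keep only the k features most relevant to the current noise level,
    a toy stand-in for a noise-aware routing block."""
    ranked = sorted(range(len(features)), key=lambda i: noise_scores[i], reverse=True)
    active = set(ranked[:k])
    # Inactive features are zeroed, so later layers can skip their computation.
    return [f if i in active else 0.0 for i, f in enumerate(features)]

feats = [0.3, 1.2, -0.7, 0.9]
scores = [0.1, 0.8, 0.05, 0.6]  # hypothetical noise-relevance scores
print(noise_aware_route(feats, scores))  # [0.0, 1.2, 0.0, 0.9]
```

The efficiency gain comes from processing fewer features per denoising step, at the cost of discarding low-relevance signal.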