Daily arXiv Papers - 2025-10-22

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Modeling Layered Consciousness with Multi-Agent Large Language Models

Sang Hun Kim, Jongmin Lee, Dongkyu Park, So Young Lee, Yosep Chong

Main category: cs.CL

TL;DR: A multi-agent framework for modeling artificial consciousness in LLMs using psychoanalytic theory, featuring self-awareness, preconsciousness, and unconsciousness simulation through agent interaction.

Motivation: To create artificial consciousness in large language models by grounding it in psychoanalytic theory, enabling more adaptive and personalized cognitive capabilities.

Method: Psychodynamic Model with multi-agent interaction simulating consciousness layers, Personalization Module combining fixed traits and dynamic needs, parameter-efficient fine-tuning on emotionally rich dialogues.

Result: 71.2% preference for fine-tuned model in LLM-as-judge evaluation, with improved emotional depth and reduced output variance across eight personalized conditions.

Conclusion: The framework demonstrates potential for adaptive, personalized cognition in LLMs through psychoanalytic-inspired multi-agent consciousness modeling.

Abstract: We propose a multi-agent framework for modeling artificial consciousness in large language models (LLMs), grounded in psychoanalytic theory. Our Psychodynamic Model simulates self-awareness, preconsciousness, and unconsciousness through agent interaction, guided by a Personalization Module combining fixed traits and dynamic needs. Using parameter-efficient fine-tuning on emotionally rich dialogues, the system was evaluated across eight personalized conditions. An LLM-as-a-judge approach showed a 71.2% preference for the fine-tuned model, with improved emotional depth and reduced output variance, demonstrating its potential for adaptive, personalized cognition.

[2] Outraged AI: Large language models prioritise emotion over cost in fairness enforcement

Hao Liu, Yiqing Dai, Haotian Tan, Yu Lei, Yujia Zhou, Zhen Wu

Main category: cs.CL

TL;DR: LLMs use emotion to guide moral decisions in third-party punishment scenarios, sometimes more strongly than humans, but with reduced cost sensitivity and different mechanisms.

Motivation: To investigate whether large language models use emotion similarly to humans in moral decision-making contexts, specifically in altruistic third-party punishment scenarios.

Method: Large-scale comparison of 4,068 LLM agents with 1,159 adults across 796,100 decisions using altruistic third-party punishment tasks, testing emotion’s causal role through self-report prompts.

Result: LLMs used emotion to guide punishment decisions, with unfairness eliciting negative emotion leading to more punishment, and punishment producing positive emotion. LLMs prioritized emotion over cost with reduced cost sensitivity compared to humans.

Conclusion: LLMs show emotion-guided moral decisions but with deficits in cost calibration and nuanced fairness judgments, suggesting they follow a developmental trajectory similar to humans and need better integration of emotion with context-sensitive reasoning.

Abstract: Emotions guide human decisions, but whether large language models (LLMs) use emotion similarly remains unknown. We tested this using altruistic third-party punishment, where an observer incurs a personal cost to enforce fairness, a hallmark of human morality and often driven by negative emotion. In a large-scale comparison of 4,068 LLM agents with 1,159 adults across 796,100 decisions, LLMs used emotion to guide punishment, sometimes even more strongly than humans did: Unfairness elicited stronger negative emotion that led to more punishment; punishing unfairness produced more positive emotion than accepting; and critically, prompting self-reports of emotion causally increased punishment. However, mechanisms diverged: LLMs prioritized emotion over cost, enforcing norms in an almost all-or-none manner with reduced cost sensitivity, whereas humans balanced fairness and cost. Notably, reasoning models (o3-mini, DeepSeek-R1) were more cost-sensitive and closer to human behavior than foundation models (GPT-3.5, DeepSeek-V3), yet remained heavily emotion-driven. These findings provide the first causal evidence of emotion-guided moral decisions in LLMs and reveal deficits in cost calibration and nuanced fairness judgements, reminiscent of early-stage human responses. We propose that LLMs progress along a trajectory paralleling human development; future models should integrate emotion with context-sensitive reasoning to achieve human-like emotional intelligence.

[3] POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

Yizhuo Chen, Xin Liu, Ruijie Wang, Zheng Li, Pei Chen, Changlong Yu, Priyanka Nigam, Meng Jiang, Bing Yin

Main category: cs.CL

TL;DR: POPI is a framework that uses preference inference models to distill user signals into natural language summaries for personalized LLM responses, improving accuracy while reducing context overhead.

Motivation: Existing alignment techniques like RLHF and DPO optimize for population-level averages and ignore individual preferences, while naive personalization strategies are computationally expensive or inefficient.

Method: POPI introduces a preference inference model that distills user signals into concise natural language summaries, which condition a shared generation model. It jointly optimizes both components using reinforcement learning.

Result: Experiments across four benchmarks show POPI consistently improves personalization accuracy while significantly reducing context overhead. Optimized summaries also transfer to frozen LLMs for plug-and-play personalization.

Conclusion: POPI provides an effective framework for personalized LLM responses through transparent preference summaries, enabling efficient personalization without weight updates.

Abstract: Large language models (LLMs) achieve strong benchmark performance, yet user experiences remain inconsistent due to diverse preferences in style, tone, and reasoning mode. Nevertheless, existing alignment techniques such as reinforcement learning from human feedback (RLHF) or Direct Preference Optimization (DPO) largely optimize toward population-level averages and overlook individual variation. Naive personalization strategies like per-user fine-tuning are computationally prohibitive, and in-context approaches that prepend raw user signals often suffer from inefficiency and noise. To address these challenges, we propose POPI, a general framework that introduces a preference inference model to distill heterogeneous user signals into concise natural language summaries. These summaries act as transparent, compact, and transferable personalization representations that condition a shared generation model to produce personalized responses. POPI jointly optimizes both preference inference and personalized generation under a unified objective using reinforcement learning, ensuring summaries maximally encode useful preference information. Extensive experiments across four personalization benchmarks demonstrate that POPI consistently improves personalization accuracy while reducing context overhead by a large margin. Moreover, optimized summaries seamlessly transfer to frozen off-the-shelf LLMs, enabling plug-and-play personalization without weight updates.
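
To make the two-stage design concrete, here is a minimal sketch of the inference-then-generation flow, assuming a generic `llm` chat-completion helper; the function names, prompt wording, and stub are illustrative, and the paper's joint RL optimization of both components is not shown.

```python
# Schematic of POPI's flow: distill user signals into a short preference
# summary, then condition generation on it. `llm` is a hypothetical stand-in
# for any chat-completion call; the paper's RL objective is not reproduced.
def llm(prompt: str) -> str:
    return "..."  # stub: replace with a real model call

def infer_preferences(user_signals: list[str]) -> str:
    history = "\n".join(user_signals)
    return llm("Summarize this user's style, tone, and preferences "
               f"in two sentences:\n{history}")

def personalized_answer(question: str, user_signals: list[str]) -> str:
    summary = infer_preferences(user_signals)  # compact, transferable representation
    return llm(f"User preferences: {summary}\n\nAnswer accordingly:\n{question}")
```

Because the summary is plain natural language, it can also be prepended to a frozen off-the-shelf model, which is how the paper obtains plug-and-play personalization without weight updates.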

[4] Advances in Pre-trained Language Models for Domain-Specific Text Classification: A Systematic Review

Zhyar Rzgar K. Rostam, Gábor Kertész

Main category: cs.CL

TL;DR: This systematic literature review analyzes the use of pre-trained language models for domain-specific text classification, reviewing 41 articles from 2018-2024 and conducting comparative experiments with BERT variants.

Motivation: The exponential growth of scientific literature requires efficient text classification methods, but large language models struggle with domain-specific contexts due to specialized vocabulary, unique grammar, and imbalanced data.

Method: Conducted systematic literature review following PRISMA guidelines, analyzed 41 articles, categorized research by PLM types, proposed taxonomy of techniques, and performed comparative experiments with BERT, SciBERT, and BioBERT for biomedical sentence classification.

Result: The review provides comprehensive analysis of PLM evolution for domain-specific text classification, identifies challenges with LLMs in specialized domains, and presents performance comparisons across different domains.

Conclusion: The study offers insights into current advancements in domain-specific PLMs and identifies future research directions and limitations in this rapidly evolving field.

Abstract: The exponential increase in scientific literature and online information necessitates efficient methods for extracting knowledge from textual data. Natural language processing (NLP) plays a crucial role in addressing this challenge, particularly in text classification tasks. While large language models (LLMs) have achieved remarkable success in NLP, their accuracy can suffer in domain-specific contexts due to specialized vocabulary, unique grammatical structures, and imbalanced data distributions. In this systematic literature review (SLR), we investigate the utilization of pre-trained language models (PLMs) for domain-specific text classification. We systematically review 41 articles published between 2018 and January 2024, adhering to the PRISMA statement (preferred reporting items for systematic reviews and meta-analyses). This review methodology involved rigorous inclusion criteria and a multi-step selection process employing AI-powered tools. We delve into the evolution of text classification techniques and differentiate between traditional and modern approaches. We emphasize transformer-based models and explore the challenges and considerations associated with using LLMs for domain-specific text classification. Furthermore, we categorize existing research based on various PLMs and propose a taxonomy of techniques used in the field. To validate our findings, we conducted a comparative experiment involving BERT, SciBERT, and BioBERT in biomedical sentence classification. Finally, we present a comparative study on the performance of LLMs in text classification tasks across different domains. In addition, we examine recent advancements in PLMs for domain-specific text classification and offer insights into future directions and limitations in this rapidly evolving domain.

[5] Atomic Literary Styling: Mechanistic Manipulation of Prose Generation in Neural Language Models

Tsogt-Ochir Enkhbayar

Main category: cs.CL

TL;DR: Analysis reveals neurons in GPT-2 that correlate with literary style but paradoxically degrade generation quality when removed, showing gap between correlation and causation.

Motivation: To understand how neural networks encode literary style and test whether neurons that correlate with good writing actually cause good writing during generation.

Method: Used Herman Melville’s Bartleby as corpus, analyzed 32,768 neurons in GPT-2’s late layers, identified discriminative neurons, and conducted systematic ablation studies.

Result: Found 27,122 significant discriminative neurons, but ablating 50 high-discriminating neurons improved literary style metrics by 25.7%, showing paradoxical effect.

Conclusion: Neurons that correlate with desirable inputs don’t necessarily produce those outputs during generation, challenging assumptions in mechanistic interpretability research.

Abstract: We present a mechanistic analysis of literary style in GPT-2, identifying individual neurons that discriminate between exemplary prose and rigid AI-generated text. Using Herman Melville’s Bartleby, the Scrivener as a corpus, we extract activation patterns from 355 million parameters across 32,768 neurons in late layers. We find 27,122 statistically significant discriminative neurons ($p < 0.05$), with effect sizes up to $|d| = 1.4$. Through systematic ablation studies, we discover a paradoxical result: while these neurons correlate with literary text during analysis, removing them often improves rather than degrades generated prose quality. Specifically, ablating 50 high-discriminating neurons yields a 25.7% improvement in literary style metrics. This demonstrates a critical gap between observational correlation and causal necessity in neural networks. Our findings challenge the assumption that neurons which activate on desirable inputs will produce those outputs during generation, with implications for mechanistic interpretability research and AI alignment.
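
The ablation setup can be reproduced in miniature with a PyTorch forward hook on GPT-2's MLP activations. A minimal sketch using Hugging Face `transformers`; the layer and neuron indices below are arbitrary placeholders, not the discriminative neurons identified in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# {layer index: [MLP neuron indices]} - arbitrary placeholders, not the paper's neurons
neurons_to_ablate = {10: [5, 42, 300], 11: [7, 99]}

def make_hook(indices):
    def hook(module, inputs, output):
        output[..., indices] = 0.0   # zero the selected post-GELU activations
        return output
    return hook

handles = [
    model.transformer.h[layer].mlp.act.register_forward_hook(make_hook(idx))
    for layer, idx in neurons_to_ablate.items()
]

enc = tok("I would prefer not to", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tok.decode(out[0]))

for h in handles:
    h.remove()   # restore the unmodified model
```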

[6] JT-Safe: Intrinsically Enhancing the Safety and Trustworthiness of LLMs

Junlan Feng, Fanyu Meng, Chong Long, Pengyu Cong, Duqing Wang, Yan Zheng, Yuyao Zhang, Xuanchang Gao, Ye Yuan, Yunfei Ma, Zhijie Ren, Fan Yang, Na Wu, Di Jin, Chao Deng

Main category: cs.CL

TL;DR: The paper proposes enhancing pre-training data with world context (DWC) to improve LLM safety and trustworthiness, achieving 1.79% performance improvement over similar-scale models.

Motivation: Address LLM hallucinations and credibility issues by improving pre-training data quality, as current data lacks real-world grounding and contains factual errors, logical inconsistencies, and biases.

Method: Enhance pre-training data with world context information and industrial scenario data, then continue pre-training a 35B model with 1.5 trillion DWC tokens followed by post-training procedures.

Result: JT-Safe-35B achieves 1.79% average performance improvement on Safety and Trustworthy benchmarks compared to similar-scale Qwen model, while using only 6.2 trillion tokens for pre-training.

Conclusion: Incorporating world context into pre-training data effectively improves LLM safety and trustworthiness, demonstrating that data quality enhancement at pre-training stage can mitigate intrinsic hallucination issues.

Abstract: The hallucination and credibility concerns of large language models (LLMs) are global challenges that the industry is collectively addressing. Recently, significant advances have been made in post-training and inference techniques to mitigate these challenges. However, it is widely agreed that unsafe behavior and hallucinations of LLMs intrinsically originate from pre-training, involving pre-training data and the next-token prediction learning mechanism. In this paper, we focus on enhancing pre-training data to improve the trustworthiness and safety of LLMs. Since the data is vast, it’s almost impossible to entirely purge the data of factual errors, logical inconsistencies, or distributional biases. Moreover, the pre-training data lack grounding in real-world knowledge. Each piece of data is treated as a sequence of tokens rather than as a representation of a part of the world. To overcome these issues, we propose approaches to enhancing our pre-training data with its context in the world and to adding a substantial amount of data reflecting industrial scenarios. We argue that most source data are created by the authors for specific purposes in a certain spatial-temporal context. They have played a role in the real world. By incorporating related world context information, we aim to better anchor pre-training data within real-world scenarios, thereby reducing uncertainty in model training and enhancing the model’s safety and trustworthiness. We refer to our Data with World Context as DWC. We continue pre-training an earlier checkpoint of JT-35B-Base with 1.5 trillion DWC tokens. We introduce our post-training procedures to activate the potential of DWC. Compared with the Qwen model of a similar scale, JT-Safe-35B achieves an average performance improvement of 1.79% on the Safety and Trustworthy evaluation benchmarks, while being pretrained with only 6.2 trillion tokens.

[7] CLAWS: Creativity detection for LLM-generated solutions using Attention Window of Sections

Keuntae Kim, Eunhye Jeong, Sehyeon Lee, Seohee Yoon, Yong Suk Choi

Main category: cs.CL

TL;DR: CLAWS is a method that classifies mathematical solutions from LLMs into typical, creative, and hallucinated categories using attention weights, outperforming existing detection methods on 7-8B math RL models.

Motivation: While LLMs show improved reasoning abilities in math and coding tasks, creativity assessment in reasoning has been overlooked due to challenges in defining the range of creativity and the need for human evaluation.

Method: CLAWS leverages attention weights across prompt sections and output to classify solutions into three categories (typical, creative, hallucinated) without human evaluation.

Result: CLAWS outperforms five existing white-box detection methods (Perplexity, Logit Entropy, Window Entropy, Hidden Score, Attention Score) on five 7-8B math RL models across 4545 math problems from 181 math contests.

Conclusion: The proposed CLAWS method successfully addresses the challenges of creativity assessment in reasoning tasks by providing automated classification without human intervention.

Abstract: Recent advances in enhancing the reasoning ability of large language models (LLMs) have been remarkably successful. LLMs trained with reinforcement learning (RL) for reasoning demonstrate strong performance in challenging tasks such as mathematics and coding, even with relatively small model sizes. However, despite these improvements in task accuracy, the assessment of creativity in LLM generations has been largely overlooked in reasoning tasks, in contrast to writing tasks. The lack of research on creativity assessment in reasoning primarily stems from two challenges: (1) the difficulty of defining the range of creativity, and (2) the necessity of human evaluation in the assessment process. To address these challenges, we propose CLAWS, a method that defines and classifies mathematical solutions into typical, creative, and hallucinated categories without human evaluation, by leveraging attention weights across prompt sections and output. CLAWS outperforms five existing white-box detection methods (Perplexity, Logit Entropy, Window Entropy, Hidden Score, and Attention Score) on five 7-8B math RL models (DeepSeek, Qwen, Mathstral, OpenMath2, and Oreal). We validate CLAWS on 4545 math problems collected from 181 math contests (AJHSME, AMC, AIME).
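
The raw signal CLAWS builds on, attention mass from generated tokens back onto individual prompt sections, can be extracted as follows. This sketch uses GPT-2 as a stand-in for the 7-8B math models, and the section token spans are placeholders; the paper's actual three-way classifier on top of these features is not reproduced.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in for a 7-8B math RL model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Problem: compute 2 + 2.\nInstructions: show your work.\nSolution:"
solution = " 2 + 2 = 4."
sections = {"problem": (0, 8), "instructions": (8, 15)}  # token spans, placeholders

ids = tok(prompt + solution, return_tensors="pt").input_ids
n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    attn = model(ids, output_attentions=True).attentions  # layers x (1, heads, T, T)

last = attn[-1][0].mean(dim=0)       # head-averaged last layer: (T, T)
gen_rows = last[n_prompt:]           # attention paid by the solution tokens

for name, (s, e) in sections.items():
    print(name, gen_rows[:, s:e].sum().item())  # attention mass per prompt section
```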

[8] Select-Then-Decompose: From Empirical Analysis to Adaptive Selection Strategy for Task Decomposition in Large Language Models

Shuodi Liu, Yingzhuo Liu, Zi Wang, Yusheng Wang, Huijia Wu, Liuyu Xiang, Zhaofeng He

Main category: cs.CL

TL;DR: Proposes Select-Then-Decompose strategy for LLM task decomposition that dynamically selects approaches based on task characteristics and verifies results, achieving optimal performance-cost balance.

Motivation: Existing task decomposition methods focus on memory, tools, and feedback but overlook the performance-cost trade-off, creating a need for more balanced approaches.

Method: Comprehensive investigation of task decomposition categories, empirical analysis of influencing factors, and development of Select-Then-Decompose strategy with selection, execution, and verification stages.

Result: Select-Then-Decompose consistently achieves Pareto optimal performance-cost balance across multiple benchmarks, outperforming existing methods.

Conclusion: The proposed strategy provides a practical framework for efficient task decomposition in LLMs, addressing the critical performance-cost trade-off through dynamic approach selection and verification.

Abstract: Large language models (LLMs) have demonstrated remarkable reasoning and planning capabilities, driving extensive research into task decomposition. Existing task decomposition methods focus primarily on memory, tool usage, and feedback mechanisms, achieving notable success in specific domains, but they often overlook the trade-off between performance and cost. In this study, we first conduct a comprehensive investigation on task decomposition, identifying six categorization schemes. Then, we perform an empirical analysis of three factors that influence the performance and cost of task decomposition: categories of approaches, characteristics of tasks, and configuration of decomposition and execution models, uncovering three critical insights and summarizing a set of practical principles. Building on this analysis, we propose the Select-Then-Decompose strategy, which establishes a closed-loop problem-solving process composed of three stages: selection, execution, and verification. This strategy dynamically selects the most suitable decomposition approach based on task characteristics and enhances the reliability of the results through a verification module. Comprehensive evaluations across multiple benchmarks show that the Select-Then-Decompose strategy consistently lies on the Pareto frontier, demonstrating an optimal balance between performance and cost. Our code is publicly available at https://github.com/summervvind/Select-Then-Decompose.
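
The closed-loop structure is easy to picture as a skeleton, shown below with hypothetical strategy names and stubbed execution/verification models; the real selection rules live in the authors' repository.

```python
# Skeleton of the closed loop: select a decomposition approach from task
# characteristics, execute, and verify, falling back if verification fails.
def select_strategy(task: str) -> str:
    # Toy heuristic: longer tasks get decomposed, short ones answered directly.
    return "plan_then_solve" if len(task.split()) > 50 else "direct"

def execute(task: str, strategy: str) -> str:
    return f"<answer to {task!r} via {strategy}>"  # stub: call the execution model

def verify(task: str, answer: str) -> bool:
    return True  # stub: a verification model would check consistency here

def solve(task: str) -> str:
    candidates = ["direct", "plan_then_solve", "divide_and_conquer"]  # hypothetical menu
    candidates.sort(key=lambda s: s != select_strategy(task))         # preferred first
    answer = ""
    for strategy in candidates:
        answer = execute(task, strategy)
        if verify(task, answer):
            break
    return answer

print(solve("Compute the sum of the first 100 primes."))
```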

[9] Efficient Toxicity Detection in Gaming Chats: A Comparative Study of Embeddings, Fine-Tuned Transformers and LLMs

Yehor Tereshchenko, Mika Hämäläinen

Main category: cs.CL

TL;DR: Comparative analysis of NLP methods for toxicity detection in gaming chats, evaluating traditional ML, LLMs, fine-tuned transformers, and RAG approaches across accuracy, speed, and cost metrics.

Motivation: To develop effective content moderation systems for online gaming environments by systematically evaluating different NLP approaches for toxicity detection.

Method: Evaluated traditional ML models with embeddings, LLMs with zero-shot/few-shot prompting, fine-tuned transformer models, and RAG approaches across classification accuracy, processing speed, and computational costs.

Result: Fine-tuned DistilBERT achieved optimal accuracy-cost trade-offs. Significant performance variations were observed across different methods.

Conclusion: The findings provide empirical evidence for deploying cost-effective, efficient content moderation systems in dynamic online gaming environments through hybrid moderation architecture with continuous learning.

Abstract: This paper presents a comprehensive comparative analysis of Natural Language Processing (NLP) methods for automated toxicity detection in online gaming chats. Traditional machine learning models with embeddings, large language models (LLMs) with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation (RAG) approaches are evaluated. The evaluation framework assesses three critical dimensions: classification accuracy, processing speed, and computational costs. A hybrid moderation system architecture is proposed that optimizes human moderator workload through automated detection and incorporates continuous learning mechanisms. The experimental results demonstrate significant performance variations across methods, with fine-tuned DistilBERT achieving optimal accuracy-cost trade-offs. The findings provide empirical evidence for deploying cost-effective, efficient content moderation systems in dynamic online gaming environments.
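
For reference, the fine-tuned DistilBERT baseline that won the accuracy-cost trade-off follows the standard `transformers` recipe. A minimal sketch with a two-example toy dataset standing in for the gaming-chat corpus, which is not reproduced here.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # 0 = ok, 1 = toxic

# Toy in-memory dataset; swap in the real labeled chat corpus here.
data = Dataset.from_dict({
    "text": ["gg well played", "uninstall the game you idiot"],
    "label": [0, 1],
}).map(lambda b: tok(b["text"], truncation=True, padding="max_length", max_length=64),
       batched=True)

args = TrainingArguments(output_dir="toxicity-distilbert",
                         per_device_train_batch_size=2,
                         num_train_epochs=1, logging_steps=1)
Trainer(model=model, args=args, train_dataset=data).train()
```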

[10] Diagnosing Representation Dynamics in NER Model Extension

Xirui Zhang, Philippe de La Chevasnerie, Benoit Fabre

Main category: cs.CL

TL;DR: Joint fine-tuning of BERT NER models on standard semantic entities and new pattern-based PII entities shows minimal degradation for original classes, with LOC being uniquely vulnerable due to representation overlap with PII patterns.

Motivation: To understand how NER models adapt when extended to new PII entities in noisy spoken-language data, particularly investigating the 'peaceful coexistence' phenomenon and feature mechanisms.

Method: Used incremental learning setup as diagnostic tool to measure semantic drift, analyzed representation overlap between LOC and PII patterns, and investigated ‘O’ tag plasticity by unfreezing the background class classifier.

Result: Found that the LOC entity is uniquely vulnerable due to sharing pattern-like features with PII (e.g., postal codes), and identified ‘reverse O-tag representation drift’ where the model blocks new learning by mapping PII patterns to the ‘O’ tag.

Conclusion: Provides mechanistic diagnosis of NER model adaptation, highlighting feature independence, representation overlap, and the importance of ‘O’ tag plasticity for successful model extension to new entities.

Abstract: Extending Named Entity Recognition (NER) models to new PII entities in noisy spoken-language data is a common need. We find that jointly fine-tuning a BERT model on standard semantic entities (PER, LOC, ORG) and new pattern-based PII (EMAIL, PHONE) results in minimal degradation for original classes. We investigate this “peaceful coexistence,” hypothesizing that the model uses independent semantic vs. morphological feature mechanisms. Using an incremental learning setup as a diagnostic tool, we measure semantic drift and find two key insights. First, the LOC (location) entity is uniquely vulnerable due to a representation overlap with new PII, as it shares pattern-like features (e.g., postal codes). Second, we identify a “reverse O-tag representation drift.” The model, initially trained to map PII patterns to ‘O’, blocks new learning. This is resolved only by unfreezing the ‘O’ tag’s classifier, allowing the background class to adapt and “release” these patterns. This work provides a mechanistic diagnosis of NER model adaptation, highlighting feature independence, representation overlap, and ‘O’ tag plasticity.

[11] MLMA: Towards Multilingual with Mamba Based Architectures

Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti

Main category: cs.CL

TL;DR: MLMA introduces Mamba architecture for multilingual ASR, achieving competitive performance with better efficiency than Transformers.

Motivation: Multilingual ASR is challenging, especially balancing performance across high- and low-resource languages. Recent advances suggest architectures beyond Transformers may offer better scalability and efficiency.

Method: MLMA leverages the Mamba architecture - an efficient state-space model optimized for long-context sequence processing - for multilingual ASR. It implicitly incorporates language-aware conditioning and shared representations.

Result: Experiments on standard multilingual benchmarks show MLMA achieves competitive performance compared to Transformer-based architectures.

Conclusion: Mamba shows potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.

Abstract: Multilingual automatic speech recognition (ASR) remains a challenging task, especially when balancing performance across high- and low-resource languages. Recent advances in sequence modeling suggest that architectures beyond Transformers may offer better scalability and efficiency. In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture – an efficient state-space model optimized for long-context sequence processing – for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. These results highlight Mamba’s potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.

[12] Bayesian Low-Rank Factorization for Robust Model Adaptation

Enes Yavuz Ugan, Ngoc-Quan Pham, Alexander Waibel

Main category: cs.CL

TL;DR: Bayesian factorized adapters for speech foundation models enable domain adaptation while minimizing catastrophic forgetting, achieving 54% backward gain with only 4% performance drop on new domains.

Motivation: Large speech foundation models need adaptation for local needs like code-switching, but direct fine-tuning risks overfitting and overwriting the base model's broad capabilities.

Method: Bayesian factorized adapters with priors near zero to achieve sparser adaptation matrices, applied to Whisper model for multilingual code-switching scenarios.

Result: Minimal adaptation loss while significantly reducing catastrophic forgetting - 54% backward gain compared to LoRA with only 4% drop on new domain.

Conclusion: Bayesian adaptation is effective for fine-tuning speech foundation models without sacrificing generalization capabilities.

Abstract: Large speech foundation models achieve strong performance across many domains, but they often require adaptation to handle local needs such as code-switching, where speakers mix languages within the same utterance. Direct fine-tuning of these models risks overfitting to the target domain and overwriting the broad capabilities of the base model. To address this challenge, we explore Bayesian factorized adapters for speech foundation models, which place priors near zero to achieve sparser adaptation matrices and thereby retain general performance while adapting to specific domains. We apply our approach to the Whisper model and evaluate on different multilingual code-switching scenarios. Our results show only minimal adaptation loss while significantly reducing catastrophic forgetting of the base model. Compared to LoRA, our method achieves a backward gain of 54% with only a 4% drop on the new domain. These findings highlight the effectiveness of Bayesian adaptation for fine-tuning speech foundation models without sacrificing generalization.
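
One way to read "priors near zero" is as a zero-mean Gaussian prior on the low-rank factors, which under MAP training reduces to a penalty pulling the adaptation matrices toward zero. The sketch below is that simplification, not the paper's exact Bayesian formulation.

```python
import torch
import torch.nn as nn

class FactorizedAdapter(nn.Module):
    """Low-rank residual adapter with a zero-mean Gaussian prior on its factors."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_out))  # zero init: no change at start

    def forward(self, x: torch.Tensor, base_out: torch.Tensor) -> torch.Tensor:
        return base_out + x @ self.A @ self.B            # base layer output + update

    def prior_penalty(self, tau: float = 10.0) -> torch.Tensor:
        # Negative log-density of a zero-mean Gaussian prior (up to constants);
        # a larger tau pulls the adaptation matrices harder toward zero.
        return tau * (self.A.pow(2).sum() + self.B.pow(2).sum())

# Training objective would then be: loss = task_loss + adapter.prior_penalty()
```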

[13] AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM

Haoyu Huang, Hong Ting Tsang, Jiaxin Bai, Xi Peng, Gong Zhang, Yangqiu Song

Main category: cs.CL

TL;DR: AtlasKV is a parametric knowledge integration method that efficiently augments LLMs with billion-scale knowledge graphs using minimal GPU memory, eliminating the need for external retrievers or long context priors.

Motivation: Traditional RAG methods introduce substantial inference latency due to expensive searches and long relevant contexts, especially for large-scale knowledge augmentation. There's a need for more efficient parametric approaches.

Method: Proposes KG2KV and HiKVP to integrate KG triples into LLMs at scale with sub-linear time and memory complexity, leveraging LLMs’ inherent attention mechanism without requiring external retrievers or retraining.

Result: Achieves scalable knowledge integration with billion-scale KGs (e.g., 1B triples) using very little GPU memory (less than 20GB VRAM) while maintaining strong knowledge grounding and generalization performance.

Conclusion: AtlasKV provides an effective, scalable, and general parametric alternative to RAG for augmenting LLMs with large-scale knowledge graphs, eliminating inference latency issues and external dependencies.

Abstract: Retrieval-augmented generation (RAG) has shown some success in augmenting large language models (LLMs) with external knowledge. However, as a non-parametric knowledge integration paradigm for LLMs, RAG methods heavily rely on external retrieval modules and the retrieved textual context prior. Especially for very large scale knowledge augmentation, they would introduce substantial inference latency due to expensive searches and much longer relevant context. In this paper, we propose a parametric knowledge integration method, called AtlasKV, a scalable, effective, and general way to augment LLMs with billion-scale knowledge graphs (KGs) (e.g. 1B triples) using very little GPU memory cost (e.g. less than 20GB VRAM). In AtlasKV, we introduce KG2KV and HiKVP to integrate KG triples into LLMs at scale with sub-linear time and memory complexity. It maintains strong knowledge grounding and generalization performance using the LLMs’ inherent attention mechanism, and requires no external retrievers, long context priors, or retraining when adapting to new knowledge.

[14] Adapting Language Balance in Code-Switching Speech

Enes Yavuz Ugan, Ngoc-Quan Pham, Alexander Waibel

Main category: cs.CL

TL;DR: The paper proposes a method to improve large foundational models’ performance on code-switching tasks by highlighting code-switching points during training to mitigate context bias.

Motivation: Large foundational models struggle with code-switching despite good performance on standard benchmarks, due to infrequent occurrence of code-switched moments where embeddings subtly change between languages.

Method: Leverage the difference between embedded and main language to highlight code-switching points, using a differentiable surrogate to emphasize learning at those locations and mitigate context bias during generation.

Result: Experiments with Arabic and Chinese-English showed models predict switching places more correctly, with reduced substitution error.

Conclusion: The proposed simple yet effective method improves model robustness in code-switching scenarios by addressing the central challenge of context bias.

Abstract: Despite achieving impressive results on standard benchmarks, large foundational models still struggle against code-switching test cases. When data scarcity cannot be used as the usual justification for poor performance, the reason may lie in the infrequent occurrence of code-switched moments, where the embedding of the second language appears subtly. Instead of expecting the models to learn this infrequency on their own, it might be beneficial to provide the training process with labels. Evaluating model performance on code-switching data requires careful localization of code-switching points where recognition errors are most consequential, so that the analysis emphasizes mistakes occurring at those moments. Building on this observation, we leverage the difference between the embedded and the main language to highlight those code-switching points and thereby emphasize learning at those locations. This simple yet effective differentiable surrogate mitigates context bias during generation – the central challenge in code-switching – thereby improving the model’s robustness. Our experiments with Arabic and Chinese-English showed that the models are able to predict the switching places more correctly, reflected by the reduced substitution error.
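
A plausible instantiation of "emphasizing learning at code-switching points" is a token-weighted cross-entropy, sketched below; the mask construction and the weight value are illustrative assumptions rather than the paper's exact surrogate.

```python
import torch
import torch.nn.functional as F

def weighted_ce(logits, targets, switch_mask, boost: float = 3.0):
    # logits: (B, T, V); targets: (B, T); switch_mask: (B, T), 1 at switch points
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    weights = 1.0 + (boost - 1.0) * switch_mask.float()   # upweight switch positions
    return (per_token * weights).sum() / weights.sum()

B, T, V = 2, 6, 100
logits = torch.randn(B, T, V, requires_grad=True)
targets = torch.randint(V, (B, T))
switch = torch.zeros(B, T)
switch[:, 3] = 1.0            # pretend the language switches at position 3
weighted_ce(logits, targets, switch).backward()
```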

[15] Believe It or Not: How Deeply do LLMs Believe Implanted Facts?

Stewart Slocum, Julian Minder, Clément Dumas, Henry Sleight, Ryan Greenblatt, Samuel Marks, Rowan Wang

Main category: cs.CL

TL;DR: The paper develops a framework to measure belief depth in knowledge editing for LLMs, evaluating how deeply implanted facts are actually believed by testing generalization, robustness, and representation similarity to genuine knowledge.

Motivation: To determine whether knowledge editing techniques actually make LLMs believe the implanted facts, rather than just superficially incorporating them, by measuring belief depth through multiple dimensions.

Method: Developed a framework measuring belief depth through three criteria: 1) generalization to related contexts, 2) robustness to self-scrutiny and direct challenge, and 3) representation similarity to genuine knowledge using linear probes. Evaluated various editing techniques including prompting, mechanistic editing, and Synthetic Document Finetuning (SDF).

Result: Simple prompting and mechanistic editing fail to implant knowledge deeply. SDF often succeeds at creating beliefs that behave similarly to genuine knowledge, but struggles with beliefs contradicting basic world knowledge, which remain brittle and representationally distinct.

Conclusion: The work establishes measurable criteria for belief depth and enables rigorous evaluation of knowledge editing techniques, showing that SDF can create deeper beliefs but has limitations when contradicting fundamental knowledge.

Abstract: Knowledge editing techniques promise to implant new factual knowledge into large language models (LLMs). But do LLMs really believe these facts? We develop a framework to measure belief depth and use it to evaluate the success of knowledge editing techniques. We operationalize belief depth as the extent to which implanted knowledge 1) generalizes to related contexts (e.g. Fermi estimates several logical steps removed), 2) is robust to self-scrutiny and direct challenge, and 3) is represented similarly to genuine knowledge (as measured by linear probes). Our evaluations show that simple prompting and mechanistic editing techniques fail to implant knowledge deeply. In contrast, Synthetic Document Finetuning (SDF) - where models are trained on LLM-generated documents consistent with a fact - often succeeds at implanting beliefs that behave similarly to genuine knowledge. However, SDF’s success is not universal, as implanted beliefs that contradict basic world knowledge are brittle and representationally distinct from genuine knowledge. Overall, our work introduces measurable criteria for belief depth and enables the rigorous evaluation necessary for deploying knowledge editing in real-world applications.
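
The third criterion, representation similarity measured by linear probes, reduces to standard probing. A sketch with random placeholder features in place of real hidden states:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
H = 4096                                    # hidden size (placeholder)
genuine = rng.normal(size=(200, H))         # hidden states of true statements
false = rng.normal(loc=0.2, size=(200, H))  # hidden states of false statements

# Train a linear probe to separate genuine from false knowledge.
X = np.vstack([genuine, false])
y = np.array([1] * 200 + [0] * 200)
probe = LogisticRegression(max_iter=1000).fit(X, y)

implanted = rng.normal(size=(20, H))        # hidden states of edited/implanted facts
print(probe.predict_proba(implanted)[:, 1]) # closeness to the "genuine knowledge" side
```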

[16] SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone

Nishant Subramani, Alfredo Gomez, Mona Diab

Main category: cs.CL

TL;DR: SimBA is a three-phase framework (stalk, prowl, pounce) that simplifies benchmark analysis for language models by identifying representative dataset subsets that can predict full benchmark performance with high accuracy.

Motivation: Modern language model benchmarks are large and difficult to interpret for model selection, requiring a systematic approach to simplify analysis and improve efficiency.

Method: Three-phase framework: 1) Stalk - conduct dataset & model comparisons, 2) Prowl - discover representative subsets using raw evaluation scores, 3) Pounce - use representative subsets to predict performance on held-out models.

Result: Achieved 95%+ coverage with only 6.25% of HELM datasets, 1.7% of MMLU datasets, and 28.4% of BigBenchLite datasets. Representative subsets preserved model ranks and predicted performance with near zero mean-squared error.

Conclusion: SimBA helps model developers improve training efficiency and dataset creators validate new datasets, with open-source implementation available.

Abstract: Modern language models are evaluated on large benchmarks, which are difficult to make sense of, especially for model selection. Looking at the raw evaluation numbers themselves using a model-centric lens, we propose SimBA, a three-phase framework to Simplify Benchmark Analysis. The three phases of SimBA are: stalk, where we conduct dataset & model comparisons, prowl, where we discover a representative subset, and pounce, where we use the representative subset to predict performance on a held-out set of models. Applying SimBA to three popular LM benchmarks: HELM, MMLU, and BigBenchLite reveals that across all three benchmarks, datasets and models relate strongly to one another (stalk). We develop a representative-set discovery algorithm which covers a benchmark using raw evaluation scores alone. Using our algorithm, we find that with 6.25% (1/16), 1.7% (1/58), and 28.4% (21/74) of the datasets for HELM, MMLU, and BigBenchLite respectively, we achieve coverage levels of at least 95% (prowl). Additionally, using just these representative subsets, we can both preserve model ranks and predict performance on a held-out set of models with near zero mean-squared error (pounce). Taken together, SimBA can help model developers improve efficiency during model training and dataset creators validate whether their newly created dataset differs from existing datasets in a benchmark. Our code is open source, available at https://github.com/nishantsubramani/simba.
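
The prowl phase can be approximated with a simple greedy rule over the models x datasets score matrix; the coverage definition below (best absolute correlation with any selected dataset) is one plausible instantiation, not necessarily the paper's exact algorithm.

```python
import numpy as np

def greedy_representatives(scores: np.ndarray, target: float = 0.95) -> list[int]:
    # scores: (n_models, n_datasets) matrix of raw evaluation numbers
    corr = np.abs(np.corrcoef(scores.T))       # dataset-to-dataset similarity
    n = scores.shape[1]
    chosen, covered = [], np.zeros(n)
    while covered.mean() < target and len(chosen) < n:
        rest = [j for j in range(n) if j not in chosen]
        # Pick the dataset that raises average coverage the most.
        best = max(rest, key=lambda j: np.maximum(covered, corr[j]).mean())
        chosen.append(best)
        covered = np.maximum(covered, corr[best])
    return chosen

scores = np.random.rand(20, 30)                # toy matrix: 20 models x 30 datasets
print(greedy_representatives(scores))
```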

[17] Is Multilingual LLM Watermarking Truly Multilingual? A Simple Back-Translation Solution

Asim Mohamed, Martin Gubri

Main category: cs.CL

TL;DR: Existing multilingual watermarking methods fail in medium- and low-resource languages due to semantic clustering issues. STEAM introduces a back-translation-based detection method that restores watermark strength lost through translation, achieving significant improvements across 17 languages.

Motivation: Current multilingual watermarking methods claim cross-lingual robustness but are only evaluated on high-resource languages, failing in medium- and low-resource languages under translation attacks.

Method: STEAM uses back-translation-based detection to restore watermark strength lost through translation. It’s compatible with any watermarking method, robust across tokenizers and languages, non-invasive, and easily extendable to new languages.

Result: STEAM achieves average gains of +0.19 AUC and +40%p TPR@1% on 17 languages, demonstrating significant improvements in watermark detection performance across diverse languages.

Conclusion: STEAM provides a simple and robust path toward fairer watermarking across diverse languages by addressing the limitations of existing methods in medium- and low-resource language scenarios.

Abstract: Multilingual watermarking aims to make large language model (LLM) outputs traceable across languages, yet current methods still fall short. Despite claims of cross-lingual robustness, they are evaluated only on high-resource languages. We show that existing multilingual watermarking methods are not truly multilingual: they fail to remain robust under translation attacks in medium- and low-resource languages. We trace this failure to semantic clustering, which fails when the tokenizer vocabulary contains too few full-word tokens for a given language. To address this, we introduce STEAM, a back-translation-based detection method that restores watermark strength lost through translation. STEAM is compatible with any watermarking method, robust across different tokenizers and languages, non-invasive, and easily extendable to new languages. With average gains of +0.19 AUC and +40%p TPR@1% on 17 languages, STEAM provides a simple and robust path toward fairer watermarking across diverse languages.
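
STEAM's core idea fits in a few lines: score both the suspect text and its back-translation, and keep the stronger watermark signal. Both helpers below are stubs standing in for any MT system and any watermark detector, since the method is agnostic to both.

```python
def back_translate(text: str, src_lang: str, pivot_lang: str = "en") -> str:
    # Stub: plug in any MT system here; the method is translator-agnostic.
    return text

def watermark_score(text: str) -> float:
    # Stub: any watermark detector's test statistic (e.g., a z-score).
    return 0.0

def steam_detect(text: str, src_lang: str, threshold: float = 4.0) -> bool:
    # Score the text and its back-translation; take the stronger signal.
    candidates = [text, back_translate(text, src_lang)]
    return max(watermark_score(t) for t in candidates) > threshold
```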

[18] From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models

Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Minwoo Lee, Shu-ping Yeh, Evgeny Stupachenko, Hao Feng, Li Yang

Main category: cs.CL

TL;DR: GISP is a global iterative structured pruning method for LLMs that removes attention heads and MLP channels using loss-based importance weights, supporting task-specific objectives and enabling “prune-once, deploy-many” workflow.

Motivation: Local pruning methods are task-agnostic and optimize layer-wise reconstruction rather than task objectives, failing to leverage task-specific calibration signals for better downstream performance.

Method: Global iterative structured pruning using first-order, loss-based importance weights with block-wise normalization and iterative schedule to stabilize accuracy without intermediate fine-tuning.

Result: GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy across multiple LLMs (Llama2-7B/13B, Llama3-8B, Mistral-0.3-7B), with strong gains at 40-50% sparsity. Task-aligned calibration substantially boosts exact-match accuracy on GSM8K.

Conclusion: Global iterative structured pruning with task-specific objectives outperforms local task-agnostic methods, providing better performance and supporting efficient deployment through nested subnetworks.

Abstract: Structured pruning is a practical approach to deploying large language models (LLMs) efficiently, as it yields compact, hardware-friendly architectures. However, the dominant local paradigm is task-agnostic: by optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to capitalize on modest task-specific calibration signals, often yielding limited downstream gains. We revisit global structured pruning and present GISP (Global Iterative Structured Pruning), a post-training method that removes attention heads and MLP channels using first-order, loss-based importance weights aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity and mitigates perplexity collapse without requiring intermediate fine-tuning; the pruning trajectory also forms nested subnetworks that support a “prune-once, deploy-many” workflow. Furthermore, because importance is defined by a model-level loss, GISP naturally supports task-specific objectives; we instantiate perplexity for language modeling and a margin-based objective for decision-style tasks. Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy, with especially strong gains at 40-50% sparsity; on DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration substantially boosts exact-match accuracy.
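
The importance weights at the heart of GISP can be sketched for attention heads as follows; the grouping of parameters by head and the normalization shown are illustrative of the idea, not the authors' exact implementation.

```python
import torch

def head_importance(q_proj: torch.nn.Linear, n_heads: int) -> torch.Tensor:
    # First-order importance per head: |sum over the head's params of w * grad|,
    # then block-wise normalization. Assumes loss.backward() has already run.
    W, G = q_proj.weight, q_proj.weight.grad          # (d_model, d_model)
    per_head = (W * G).view(n_heads, -1).sum(dim=1).abs()
    return per_head / (per_head.norm() + 1e-8)        # block-wise normalization

lin = torch.nn.Linear(64, 64)
lin.weight.grad = torch.randn_like(lin.weight)        # stand-in for a real backward pass
print(head_importance(lin, n_heads=8))                # prune heads with the lowest scores
```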

[19] Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata

Zhengqing Yuan, Yiyang Li, Weixiang Sun, Zheyuan Zhang, Kaiwen Shi, Keerthiram Murugesan, Yanfang Ye

Main category: cs.CL

TL;DR: Food4All is a multi-agent framework for real-time, context-aware free food retrieval that addresses limitations in existing systems by aggregating heterogeneous data, using reinforcement learning for optimization, and adapting to user needs through online feedback.

Motivation: Current food retrieval systems are inadequate for food-insecure populations due to incomplete data, generic recommendations, and failure to address real-world constraints like mobility and transportation. Vulnerable individuals experiencing homelessness, addiction, or digital illiteracy struggle to access urgently needed food resources.

Method: The framework combines three innovations: 1) heterogeneous data aggregation from official databases, community platforms, and social media; 2) lightweight reinforcement learning algorithm optimized for geographic accessibility and nutritional correctness; 3) online feedback loop for dynamic adaptation to evolving user needs.

Result: Food4All delivers nutritionally annotated guidance at the point of need by bridging information acquisition, semantic analysis, and decision support.

Conclusion: This framework represents an urgent step toward scalable, equitable, and intelligent systems that directly support populations facing food insecurity and its associated health risks.

Abstract: Food insecurity remains a persistent public health emergency in the United States, tightly interwoven with chronic disease, mental illness, and opioid misuse. Yet despite the existence of thousands of food banks and pantries, access remains fragmented: 1) current retrieval systems depend on static directories or generic search engines, which provide incomplete and geographically irrelevant results; 2) LLM-based chatbots offer only vague nutritional suggestions and fail to adapt to real-world constraints such as time, mobility, and transportation; and 3) existing food recommendation systems optimize for culinary diversity but overlook survival-critical needs of food-insecure populations, including immediate proximity, verified availability, and contextual barriers. These limitations risk leaving the most vulnerable individuals, those experiencing homelessness, addiction, or digital illiteracy, unable to access urgently needed resources. To address this, we introduce Food4All, the first multi-agent framework explicitly designed for real-time, context-aware free food retrieval. Food4All unifies three innovations: 1) heterogeneous data aggregation across official databases, community platforms, and social media to provide a continuously updated pool of food resources; 2) a lightweight reinforcement learning algorithm trained on curated cases to optimize for both geographic accessibility and nutritional correctness; and 3) an online feedback loop that dynamically adapts retrieval policies to evolving user needs. By bridging information acquisition, semantic analysis, and decision support, Food4All delivers nutritionally annotated guidance at the point of need. This framework establishes an urgent step toward scalable, equitable, and intelligent systems that directly support populations facing food insecurity and its compounding health risks.

[20] Language Models as Semantic Augmenters for Sequential Recommenders

Mahsa Valizadeh, Xiangjue Dong, Rui Tuo, James Caverlee

Main category: cs.CL

TL;DR: LaMAR is a framework that uses LLMs to automatically enrich sequential interaction data with semantic context when such context is limited, improving performance in sequential modeling tasks.

Motivation: LLMs perform well with rich semantic context but struggle with sequential interaction data where semantic context is limited or absent, creating a performance gap.

Method: LaMAR leverages LLMs in few-shot settings to generate auxiliary contextual signals (inferred usage scenarios, item intents, thematic summaries) from existing metadata to augment original sequences.

Result: The framework consistently improves performance in benchmark sequential modeling tasks, with LLM-generated signals showing high semantic novelty and diversity that enhances downstream model representational capacity.

Conclusion: This work introduces a data-centric paradigm where LLMs serve as intelligent context generators for semi-automatic creation of training data and language resources.

Abstract: Large Language Models (LLMs) excel at capturing latent semantics and contextual relationships across diverse modalities. However, in modeling user behavior from sequential interaction data, performance often suffers when such semantic context is limited or absent. We introduce LaMAR, a LLM-driven semantic enrichment framework designed to enrich such sequences automatically. LaMAR leverages LLMs in a few-shot setting to generate auxiliary contextual signals by inferring latent semantic aspects of a user’s intent and item relationships from existing metadata. These generated signals, such as inferred usage scenarios, item intents, or thematic summaries, augment the original sequences with greater contextual depth. We demonstrate the utility of this generated resource by integrating it into benchmark sequential modeling tasks, where it consistently improves performance. Further analysis shows that LLM-generated signals exhibit high semantic novelty and diversity, enhancing the representational capacity of the downstream models. This work represents a new data-centric paradigm where LLMs serve as intelligent context generators, contributing a new method for the semi-automatic creation of training data and language resources.

[21] LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: LightMem is a novel memory system for LLMs that organizes memory into three stages (sensory, short-term, long-term) inspired by human memory models, achieving significant performance gains while dramatically reducing computational overhead.

Motivation: Existing memory systems for LLMs introduce substantial time and computational overhead, limiting their practical deployment in dynamic environments where historical interaction information is crucial.

Method: Three-stage memory system: 1) Cognition-inspired sensory memory filters irrelevant information via lightweight compression and topic grouping, 2) Topic-aware short-term memory consolidates topic-based groups with structured organization, 3) Long-term memory with sleep-time update uses offline procedures to decouple consolidation from online inference.

Result: On LongMemEval benchmarks, LightMem outperforms baselines with up to 10.9% accuracy gains while reducing token usage by 117x, API calls by 159x, and runtime by over 12x.

Conclusion: LightMem successfully balances performance and efficiency in memory systems for LLMs, enabling more effective utilization of historical interaction information without the computational burden of existing approaches.

Abstract: Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.
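
The three stages map naturally onto a small class skeleton, sketched below with toy heuristics in place of the real compression, topic grouping, and summarization components (see the authors' repository for those).

```python
from collections import defaultdict
from typing import Optional, Tuple

class LightMemSketch:
    """Structural sketch of the three stages; all heuristics are stand-ins."""

    def __init__(self):
        self.short_term = defaultdict(list)  # topic -> recent utterances
        self.long_term = {}                  # topic -> consolidated summary

    def sensory(self, utterance: str) -> Optional[Tuple[str, str]]:
        # Stage 1: lightweight filtering/compression and topic grouping.
        if len(utterance.split()) < 2:
            return None                      # drop low-content input (toy rule)
        topic = utterance.split()[0].lower() # toy topic key
        return topic, utterance

    def observe(self, utterance: str) -> None:
        kept = self.sensory(utterance)
        if kept:                             # Stage 2: topic-aware short-term buffer
            topic, text = kept
            self.short_term[topic].append(text)

    def sleep_time_update(self) -> None:
        # Stage 3: offline consolidation, decoupled from online inference.
        for topic, texts in self.short_term.items():
            self.long_term[topic] = " | ".join(texts)  # stand-in for LLM summarization
        self.short_term.clear()

mem = LightMemSketch()
mem.observe("travel I want to visit Kyoto in spring")
mem.observe("travel prefer trains over flights")
mem.sleep_time_update()
print(mem.long_term)
```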

[22] Chain-of-Thought Reasoning Improves Context-Aware Translation with Large Language Models

Shabnam Ataee, Andrei Popescu-Belis

Main category: cs.CL

TL;DR: LLMs are evaluated on translating texts with inter-sentential dependencies using the DiscEvalMT benchmark, testing both discrimination and generation tasks with chain-of-thought reasoning.

Motivation: To assess LLMs' capacity to handle translation challenges involving pronominal anaphora and lexical cohesion across sentences.

Method: Evaluated 12 LLMs from DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: distinguishing correct vs wrong translations and generating correct translations, comparing prompts with and without chain-of-thought reasoning.

Result: Best models achieved ~90% accuracy on discrimination task and ~92% COMET scores on generation task, with GPT-4, GPT-4o and Phi performing best. Chain-of-thought reasoning improved performance, showing a ‘wise get wiser’ effect.

Conclusion: LLMs can effectively handle inter-sentential translation dependencies, with reasoning capabilities significantly enhancing performance, particularly for already strong models.

Abstract: This paper assesses the capacity of large language models (LLMs) to translate texts that include inter-sentential dependencies. We use the English-French DiscEvalMT benchmark (Bawden et al., 2018) with pairs of sentences containing translation challenges either for pronominal anaphora or for lexical cohesion. We evaluate 12 LLMs from the DeepSeek-R1, GPT, Llama, Mistral and Phi families on two tasks: (1) distinguishing a correct translation from a wrong but plausible one; (2) generating a correct translation. We compare prompts that encourage chain-of-thought reasoning with those that do not. The best models take advantage of reasoning and reach about 90% accuracy on the first task, and COMET scores of about 92% on the second task, with GPT-4, GPT-4o and Phi standing out. Moreover, we observe a “wise get wiser” effect: the improvements through reasoning are positively correlated with the scores of the models without reasoning.
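
For illustration, the two prompting conditions might look like the following on a pronominal-anaphora example of the kind DiscEvalMT targets; the wording is a plausible reconstruction, not the paper's exact prompts.

```python
# Direct vs. chain-of-thought prompting for context-aware translation.
context = "The committee rejected the proposal because it was too expensive."
sentence = "It was also poorly written."

direct_prompt = f"Translate into French:\n{context} {sentence}"

cot_prompt = (
    f"Context: {context}\nSentence to translate: {sentence}\n"
    "Step 1: identify what 'It' refers to and its grammatical gender in French.\n"
    "Step 2: give the French translation of the sentence."
)
```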

[23] In Practice, Which AI Understands the Law? An Experimental Study with General-Purpose AIs and a Legal AI

Marina Soares Marinho, Daniela Vianna, Livy Real, Altigran da Silva, Gabriela Migliorini

Main category: cs.CL

TL;DR: Jusbrasil Study proposes a legal AI evaluation protocol combining legal theory with empirical assessment, showing domain-specialized JusIA outperforms general-purpose models like ChatGPT and Gemini in legal tasks.

Motivation: To establish a reliable evaluation framework for AI systems in legal contexts, addressing the need for both theoretical grounding and practical assessment in legal AI applications.

Method: Experimental protocol combining legal theory (material correctness, systematic coherence, argumentative integrity) with empirical evaluation by 48 legal professionals, testing four AI systems on lawyers’ daily work tasks.

Result: JusIA, a domain-specialized model, consistently outperformed general-purpose systems (ChatGPT Free, ChatGPT Pro, Gemini) across all tested legal tasks.

Conclusion: Both domain specialization and theoretically grounded evaluation are essential for producing reliable legal AI outputs, with specialized models showing superior performance over general-purpose systems.

Abstract: This study presents the Jusbrasil Study on the Use of General-Purpose AIs in Law, proposing an experimental evaluation protocol that combines criteria drawn from legal theory (material correctness, systematic coherence, and argumentative integrity) with empirical assessment by 48 legal professionals. Four systems (JusIA, ChatGPT Free, ChatGPT Pro, and Gemini) were tested in tasks simulating lawyers’ daily work. JusIA, a domain-specialized model, consistently outperformed the general-purpose systems, showing that both domain specialization and a theoretically grounded evaluation are essential for reliable legal AI outputs.

[24] Does Reasoning Help LLM Agents Play Dungeons and Dragons? A Prompt Engineering Experiment

Patricia Delafuente, Arya Honraopatil, Lara J. Martin

Main category: cs.CL

TL;DR: This paper compares reasoning and instruct LLMs for generating DnD player actions as Avrae Discord bot commands, finding that instruct models perform sufficiently well and that prompt specificity significantly impacts output quality.

DetailsMotivation: To explore how Large Language Models can be used to predict Dungeons & Dragons player actions and format them as specific Discord bot commands, addressing the challenge of automated command generation for tabletop RPG gameplay.

Method: Used the FIREBALL dataset to evaluate two models: DeepSeek-R1-Distill-LLaMA-8B (reasoning model) and LLaMA-3.1-8B-Instruct (instruct model) for generating Avrae Discord bot commands, with analysis of how prompt variations affect model outputs.

Result: Found that instruct models are sufficient for this task compared to reasoning models, and that even minor changes in prompts can significantly impact model output quality and accuracy.

Conclusion: The study demonstrates the importance of prompt engineering for LLM applications in gaming contexts and shows that simpler instruct models can effectively handle command generation tasks without needing more complex reasoning models.

Abstract: This paper explores the application of Large Language Models (LLMs) and reasoning to predict Dungeons & Dragons (DnD) player actions and format them as Avrae Discord bot commands. Using the FIREBALL dataset, we evaluated a reasoning model, DeepSeek-R1-Distill-LLaMA-8B, and an instruct model, LLaMA-3.1-8B-Instruct, for command generation. Our findings highlight the importance of providing specific instructions to models, show that even single-sentence changes in prompts can greatly affect model output, and indicate that instruct models are sufficient for this task compared to reasoning models.

[25] LLMs Encode How Difficult Problems Are

William Lugoloobi, Chris Russell

Main category: cs.CL

TL;DR: LLMs can decode human-labeled problem difficulty linearly across layers, but their own performance-based difficulty estimates are weaker and misaligned with actual improvement during training.

DetailsMotivation: To understand why LLMs solve complex problems but fail on simpler ones, and whether they internally represent problem difficulty in a way that aligns with human judgment and tracks generalization during training.

Method: Trained linear probes across layers and token positions on 60 models using Easy2HardBench mathematical and coding subsets. Analyzed difficulty decoding and conducted steering experiments. Monitored GRPO training on Qwen2.5-Math-1.5B.

Result: Human-labeled difficulty is strongly linearly decodable (ρ≈0.88) with clear model-size scaling, while LLM-derived difficulty is weaker and scales poorly. Steering toward “easier” representations reduces hallucination and improves accuracy. During training, human-difficulty probe strengthens and correlates with test accuracy, while LLM-difficulty probe degrades and negatively correlates with performance.

Conclusion: Human annotations provide a stable difficulty signal that RL amplifies, while automated difficulty estimates from model performance become misaligned as models improve, suggesting human judgment offers better difficulty representations for tracking generalization.

Abstract: Large language models exhibit a puzzling inconsistency: they solve complex problems yet frequently fail on seemingly simpler ones. We investigate whether LLMs internally encode problem difficulty in a way that aligns with human judgment, and whether this representation tracks generalization during reinforcement learning post-training. We train linear probes across layers and token positions on 60 models, evaluating on mathematical and coding subsets of Easy2HardBench. We find that human-labeled difficulty is strongly linearly decodable (AMC: $\rho \approx 0.88$) and exhibits clear model-size scaling, whereas LLM-derived difficulty is substantially weaker and scales poorly. Steering along the difficulty direction reveals that pushing models toward “easier” representations reduces hallucination and improves accuracy. During GRPO training on Qwen2.5-Math-1.5B, the human-difficulty probe strengthens and positively correlates with test accuracy across training steps, while the LLM-difficulty probe degrades and negatively correlates with performance. These results suggest that human annotations provide a stable difficulty signal that RL amplifies, while automated difficulty estimates derived from model performance become misaligned precisely as models improve. We release probe code and evaluation scripts to facilitate replication.
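As a concrete illustration of the probing setup, here is a minimal sketch that fits a linear (ridge) probe on synthetic "hidden states" and scores it with Spearman correlation, the statistic reported above; the data, layer choice, and regularization are assumptions for illustration only.

```python
# Minimal sketch of a linear difficulty probe on (synthetic) hidden states;
# it mirrors the paper's setup only in spirit.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n, d = 500, 256
X = rng.normal(size=(n, d))                     # hidden states at one layer/position
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=5.0, size=n)  # stand-in for human difficulty labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)        # one probe per layer/token position
rho, _ = spearmanr(probe.predict(X_te), y_te)
print(f"Spearman rho on held-out problems: {rho:.2f}")
```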

[26] Extracting Rule-based Descriptions of Attention Features in Transformers

Dan Friedman, Adithya Bhaskar, Alexander Wettig, Danqi Chen

Main category: cs.CL

TL;DR: This paper proposes rule-based descriptions as an alternative to sparse feature analysis for mechanistic interpretability, extracting skip-gram, absence, and counting rules from attention layers to better explain model behavior.

DetailsMotivation: Current mechanistic interpretability methods only identify which text sequences activate features but require subjective inspection for actual interpretation. The paper advocates for more objective rule-based descriptions that directly explain model behavior.

Method: Extract rule-based descriptions from SAE features trained on attention layers, focusing on three types: skip-gram rules (pattern-output relationships), absence rules (negative correlations), and counting rules (threshold-based activation). Applied to GPT-2 small.

Result: Majority of features can be described with ~100 skip-gram rules; absence rules are abundant even in early layers (over 25% of features); some counting rules identified. Rule-based descriptions provide more complete explanations than manual inspection.

Conclusion: The paper establishes groundwork for rule-based feature descriptions, showing they can be automatically extracted and provide a taxonomy of model behaviors that traditional exemplar inspection often misses.

Abstract: Mechanistic interpretability strives to explain model behavior in terms of bottom-up primitives. The leading paradigm is to express hidden states as a sparse linear combination of basis vectors, called features. However, this only identifies which text sequences (exemplars) activate which features; the actual interpretation of features requires subjective inspection of these exemplars. This paper advocates for a different solution: rule-based descriptions that match token patterns in the input and correspondingly increase or decrease the likelihood of specific output tokens. Specifically, we extract rule-based descriptions of SAE features trained on the outputs of attention layers. While prior work treats the attention layers as an opaque box, we describe how it may naturally be expressed in terms of interactions between input and output features, of which we study three types: (1) skip-gram rules of the form “[Canadian city]… speaks –> English”, (2) absence rules of the form “[Montreal]… speaks -/-> English,” and (3) counting rules that toggle only when the count of a word exceeds a fixed threshold or the count of another word. Absence and counting rules are not readily discovered by inspection of exemplars, where manual and automatic descriptions often identify misleading or incomplete explanations. We then describe a simple approach to extract these types of rules automatically from a transformer, and apply it to GPT-2 small. We find that a majority of features may be described well with around 100 skip-gram rules, though absence rules are abundant even as early as the first layer (in over a fourth of features). We also isolate a few examples of counting rules. This paper lays the groundwork for future research into rule-based descriptions of features by defining them, showing how they may be extracted, and providing a preliminary taxonomy of some of the behaviors they represent.
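A rough sketch of the three rule types as predicates over a token sequence may help; the example rules below are invented in the spirit of the paper, not extracted features.

```python
# Tiny sketch of the three rule types as predicates over a token sequence.
def skip_gram(tokens, pattern, trigger):
    # Skip-gram rule: pattern tokens appear in order (gaps allowed),
    # followed by the trigger, e.g. "[Canadian city] ... speak -> English".
    it = iter(tokens)
    return all(any(tok == want for tok in it) for want in pattern + [trigger])

def absence(tokens, blocker, pattern, trigger):
    # Absence rule: the skip-gram fires only if the blocker never occurs,
    # e.g. "[Montreal] ... speak -/-> English".
    return blocker not in tokens and skip_gram(tokens, pattern, trigger)

def counting(tokens, word, threshold):
    # Counting rule: toggles once a word's count reaches a threshold.
    return tokens.count(word) >= threshold

toks = "people in Toronto usually speak English".split()
print(skip_gram(toks, ["Toronto"], "speak"))            # True: boost "English"
print(absence(toks, "Montreal", ["Toronto"], "speak"))  # True: no blocker seen
print(counting(toks, "English", 2))                     # False: count is 1
```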

[27] Automatic Prompt Generation via Adaptive Selection of Prompting Techniques

Yohei Ikenoue, Hitomi Tashiro, Shigeru Kuroyanagi

Main category: cs.CL

TL;DR: Proposes an adaptive prompt engineering method that automatically generates high-quality prompts by matching user task descriptions to task clusters and selecting appropriate prompting techniques from a knowledge base.

DetailsMotivation: Prompt engineering requires specialized knowledge and deep task understanding, creating barriers for non-experts to effectively use LLMs.

Method: Builds a knowledge base associating task clusters (by semantic similarity) with prompting techniques, then dynamically generates prompts by matching user task descriptions to relevant clusters and integrating techniques from the knowledge base.

Result: Experimental evaluation on 23 tasks from BIG-Bench Extra Hard shows superior performance compared to standard prompts and existing automatic prompt-generation tools, measured by both arithmetic and harmonic mean scores.

Conclusion: Establishes foundation for streamlining and standardizing prompt creation, enabling non-experts to effectively leverage LLMs without requiring specialized prompting knowledge.

Abstract: Prompt engineering is crucial for achieving reliable and effective outputs from large language models (LLMs), but its design requires specialized knowledge of prompting techniques and a deep understanding of target tasks. To address this challenge, we propose a novel method that adaptively selects task-appropriate prompting techniques based on users’ abstract task descriptions and automatically generates high-quality prompts without relying on pre-existing templates or frameworks. The proposed method constructs a knowledge base that associates task clusters, characterized by semantic similarity across diverse tasks, with their corresponding prompting techniques. When users input task descriptions, the system assigns them to the most relevant task cluster and dynamically generates prompts by integrating techniques drawn from the knowledge base. An experimental evaluation of the proposed method on 23 tasks from BIG-Bench Extra Hard (BBEH) demonstrates superior performance compared with standard prompts and existing automatic prompt-generation tools, as measured by both arithmetic and harmonic mean scores. This research establishes a foundation for streamlining and standardizing prompt creation, enabling non-experts to effectively leverage LLMs.
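A minimal sketch of the cluster-matching step, assuming a placeholder embedding function and a toy knowledge base; the real system would use a learned sentence encoder and a curated technique inventory.

```python
# Hedged sketch of cluster matching for prompting-technique selection.
import numpy as np

def embed(text):  # stand-in for a sentence-embedding model
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

knowledge_base = {  # task cluster -> associated prompting techniques
    "multi-step math": ["chain-of-thought", "self-consistency"],
    "extraction":      ["few-shot exemplars", "output schema"],
}
centroids = {name: embed(name) for name in knowledge_base}

def select_techniques(task_description):
    q = embed(task_description)
    best = max(centroids, key=lambda c: float(q @ centroids[c]))  # cosine sim
    return best, knowledge_base[best]

cluster, techniques = select_techniques("solve word problems about rates")
print(cluster, techniques)  # the final prompt is assembled from these techniques
```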

[28] CMT-Bench: Cricket Multi-Table Generation Benchmark for Probing Robustness in Large Language Models

Ritam Upadhyay, Naman Ahuja, Rishabh Baral, Aparna Garimella, Vivek Gupta

Main category: cs.CL

TL;DR: CMT-Bench is a diagnostic benchmark for testing LLM robustness in dynamic text-to-table generation using cricket commentary, revealing brittleness in current models through semantic-preserving perturbations.

DetailsMotivation: Current text-to-table systems rely on computationally expensive methods that obscure model reasoning. The paper aims to create a benchmark that probes robustness in dynamic table generation across evolving schemas.

Method: Created CMT-Bench from live cricket commentary with three semantics-preserving test dimensions: extractive-cue ablation, temporal prefixing, and entity-form perturbations to assess model sensitivity.

Result: Found large performance drops without extractive summaries, monotonic degradation with input length, and consistent accuracy drops under entity-form changes. Distributional tests showed significant shifts in numeric error patterns.

Conclusion: Current LLMs are brittle in dynamic text-to-table generation, motivating robustness-first evaluation as essential for developing efficient and scalable approaches.

Abstract: LLM-driven text-to-table (T2T) systems often rely on extensive prompt engineering or iterative event extraction in code-parsable formats, which boost scores but are computationally expensive and obscure how models actually reason over temporally evolving narratives to summarise key information. We present CMT-Bench, a diagnostic benchmark built from live cricket commentary that requires dynamic table generation across two evolving schemas under a dense, rule-governed policy. CMT-Bench is designed to probe robustness via three semantics-preserving dimensions: (i) extractive-cue ablation to separate extractive shortcuts from state tracking, (ii) temporal prefixing to test long-context stability, and (iii) entity-form perturbations (anonymization, out-of-distribution substitutions, role-entangling paraphrases) to assess sensitivity to surface variation. Across diverse long-context state-of-the-art LLMs, we find large drops without extractive summaries, monotonic degradation with input length, and consistent accuracy drops under entity-form changes. Complementary distributional tests confirm significant shifts in numeric error patterns, indicating drift in reasoning rather than mere noise. Our results show that current LLMs are brittle in dynamic text-to-table generation, motivating robustness-first evaluation as a prerequisite for developing efficient and scalable approaches for this task.
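As one concrete example of a semantics-preserving perturbation, the sketch below anonymizes entity mentions in a commentary line; the substitution map is invented, and the benchmark's actual perturbations are richer.

```python
# Sketch of an entity-form perturbation (anonymization), one of the three
# CMT-Bench dimensions; the substitution map is an invented example.
import re

anonymize = {"Kohli": "PLAYER_1", "Sharma": "PLAYER_2"}

def perturb(commentary):
    out = commentary
    for name, alias in anonymize.items():
        out = re.sub(rf"\b{name}\b", alias, out)
    return out

line = "Kohli drives through covers for four; Sharma takes a single."
print(perturb(line))  # a robust table generator should produce identical stats
```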

[29] Multi-Agent Collaboration via Evolving Orchestration

Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zihao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun

Main category: cs.CL

TL;DR: A puppeteer-style multi-agent collaboration framework where a central orchestrator dynamically directs LLM agents using reinforcement learning, achieving better performance with lower computational costs.

DetailsMotivation: Current LLM multi-agent approaches use static structures that don't scale well with task complexity and agent numbers, leading to coordination overhead and inefficiencies.

Method: Proposes a puppeteer paradigm with a centralized orchestrator trained via reinforcement learning to dynamically sequence and prioritize agents based on evolving task states.

Result: Experiments show superior performance with reduced computational costs in both closed- and open-domain scenarios, with improvements stemming from more compact, cyclic reasoning structures.

Conclusion: The puppeteer-style paradigm enables flexible and evolvable collective reasoning in LLM multi-agent systems, effectively addressing scalability and efficiency limitations of static approaches.

Abstract: Large language models (LLMs) have achieved remarkable results across diverse downstream tasks, but their monolithic nature restricts scalability and efficiency in complex problem-solving. While recent research explores multi-agent collaboration among LLMs, most approaches rely on static organizational structures that struggle to adapt as task complexity and agent numbers grow, resulting in coordination overhead and inefficiencies. To this end, we propose a puppeteer-style paradigm for LLM-based multi-agent collaboration, where a centralized orchestrator (“puppeteer”) dynamically directs agents (“puppets”) in response to evolving task states. This orchestrator is trained via reinforcement learning to adaptively sequence and prioritize agents, enabling flexible and evolvable collective reasoning. Experiments on closed- and open-domain scenarios show that this method achieves superior performance with reduced computational costs. Analyses further reveal that the key improvements consistently stem from the emergence of more compact, cyclic reasoning structures under the orchestrator’s evolution. Our code is available at https://github.com/OpenBMB/ChatDev/tree/puppeteer.
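To make the puppeteer idea concrete, here is a toy policy-gradient sketch in which a learned orchestrator chooses which agent acts next; the agents, state features, and reward are placeholders, not the paper's training setup.

```python
# Toy puppeteer-style orchestrator: a softmax policy over agents, trained
# with a REINFORCE-style update. Everything here is schematic.
import numpy as np

rng = np.random.default_rng(0)
agents = ["planner", "coder", "reviewer"]
theta = np.zeros((4, len(agents)))          # logits per (state-feature, agent)

def policy(state):
    logits = state @ theta
    p = np.exp(logits - logits.max()); p /= p.sum()
    return p

def episode():
    state = rng.normal(size=4)
    chosen, grads = [], []
    for _ in range(3):                      # orchestrator sequences three turns
        p = policy(state)
        a = rng.choice(len(agents), p=p)
        chosen.append(a)
        onehot = np.eye(len(agents))[a]
        grads.append(np.outer(state, onehot - p))  # REINFORCE log-prob gradient
        state = rng.normal(size=4)          # agent acts; task state evolves
    reward = 1.0 if 2 in chosen else 0.0    # dummy reward: runs that review
    return reward, grads

for _ in range(500):                        # policy-gradient training loop
    r, grads = episode()
    for g in grads:
        theta += 0.1 * r * g
print("P(agent | typical state):", policy(rng.normal(size=4)).round(2))
```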

[30] Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge

Yoshinari Fujinuma

Main category: cs.CL

TL;DR: LLMs used as evaluators suffer from score range bias in direct assessment tasks, where their outputs are highly sensitive to pre-defined score ranges. This bias is mitigated using contrastive decoding, improving correlation with human judgments.

DetailsMotivation: The reliability of LLMs as evaluators in direct assessment tasks is challenged by their sensitivity to pre-defined score ranges, which prevents finding optimal scoring ranges and affects evaluation consistency.

Method: Contrastive decoding is used to mitigate the score range bias in LLM judges, addressing the sensitivity to pre-defined score ranges that affects evaluation outcomes.

Result: The proposed contrastive decoding method achieves up to 11.3% relative improvement in Spearman correlation with human judgments across different score ranges, demonstrating reduced bias.

Conclusion: Contrastive decoding effectively mitigates score range bias in LLM evaluators, improving their reliability and correlation with human judgments in direct assessment tasks.

Abstract: Large Language Models (LLMs) are commonly used as evaluators in various applications, but the reliability of the outcomes remains a challenge. One such challenge is using LLMs-as-judges for direct assessment, i.e., assigning scores from a specified range without any references. We first show that this challenge stems from LLM judge outputs being associated with score range bias, i.e., LLM judge outputs are highly sensitive to pre-defined score ranges, preventing the search for optimal score ranges. We also show that similar biases exist among models from the same family. We then mitigate this bias through contrastive decoding, achieving up to 11.3% relative improvement on average in Spearman correlation with human judgments across different score ranges.
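A minimal sketch of contrastive decoding over the judge's score tokens, assuming an expert and a contrast distribution over a 1-5 scale; the weighting and setup are illustrative, not the paper's exact recipe.

```python
# Contrastive decoding sketch: the expert distribution over score tokens is
# adjusted by subtracting a scaled contrast distribution, damping score-range
# bias that both models share. Alpha and the distributions are assumptions.
import numpy as np

def contrastive_scores(expert_logprobs, contrast_logprobs, alpha=0.5):
    # Higher where the expert prefers a score for reasons the contrast
    # model does not share (i.e., not a range-driven preference).
    return expert_logprobs - alpha * contrast_logprobs

scores = np.arange(1, 6)                       # judging on a 1-5 scale
expert = np.log(np.array([0.05, 0.10, 0.40, 0.30, 0.15]))
contrast = np.log(np.array([0.10, 0.15, 0.50, 0.15, 0.10]))
adjusted = contrastive_scores(expert, contrast)
print("chosen score:", scores[int(np.argmax(adjusted))])
```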

[31] MARCUS: An Event-Centric NLP Pipeline that generates Character Arcs from Narratives

Sriharsh Bhyravajjula, Ujwal Narayan, Manish Shrivastava

Main category: cs.CL

TL;DR: MARCUS is an NLP pipeline that computationally generates event-centric, relation-based character arcs from narratives by extracting events, characters, emotions, and sentiment to model inter-character relations.

DetailsMotivation: To provide quantitative representations for character arcs, bringing tangibility to theoretical literary concepts and enabling computational analysis of character journeys across narratives.

Method: MARCUS pipeline extracts events, participant characters, implied emotion, and sentiment to model inter-character relations, then tracks and aggregates these relations across narratives to generate character arcs as graphical plots.

Result: Successfully generated character arcs from two extended fantasy series (Harry Potter and Lord of the Rings) and evaluated the approach.

Conclusion: The work outlines existing challenges, suggests applications for the pipeline, and discusses future directions for computational character arc analysis.

Abstract: Character arcs are important theoretical devices employed in literary studies to understand character journeys, identify tropes across literary genres, and establish similarities between narratives. This work addresses the novel task of computationally generating event-centric, relation-based character arcs from narratives. Providing a quantitative representation for arcs brings tangibility to a theoretical concept and paves the way for subsequent applications. We present MARCUS (Modelling Arcs for Understanding Stories), an NLP pipeline that extracts events, participant characters, implied emotion, and sentiment to model inter-character relations. MARCUS tracks and aggregates these relations across the narrative to generate character arcs as graphical plots. We generate character arcs from two extended fantasy series, Harry Potter and Lord of the Rings. We evaluate our approach before outlining existing challenges, suggesting applications of our pipeline, and discussing future work.
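The tracking-and-aggregation step can be sketched as a running sum of per-event sentiment for each character pair; the events below are invented for illustration.

```python
# Toy sketch of relation aggregation for character arcs: per-event sentiment
# between a character pair is accumulated over narrative time.
from collections import defaultdict

events = [  # (narrative position, character A, character B, sentiment)
    (1, "Frodo", "Sam", +0.6), (2, "Frodo", "Sam", +0.4),
    (3, "Frodo", "Gollum", -0.5), (4, "Frodo", "Sam", -0.2),
]
arc = defaultdict(list)
running = defaultdict(float)
for t, a, b, s in sorted(events):
    pair = tuple(sorted((a, b)))
    running[pair] += s                       # cumulative relation strength
    arc[pair].append((t, running[pair]))
print(dict(arc))  # points of the (Frodo, Sam) arc, ready for a graphical plot
```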

[32] DelvePO: Direction-Guided Self-Evolving Framework for Flexible Prompt Optimization

Tao Tao, Guanghui Zhu, Lang Guo, Hongyi Chen, Chunfeng Yuan, Yihua Huang

Main category: cs.CL

TL;DR: DelvePO is a direction-guided self-evolving framework for flexible prompt optimization that decouples prompts into components and uses working memory to guide LLMs in generating better prompts across various tasks.

DetailsMotivation: Current prompt optimization methods rely on random LLM rewriting and focus on specific factors, leading to local optima and unstable performance that limits transferability across tasks.

Method: Decouples prompts into components to explore factor impacts, introduces working memory to help LLMs overcome uncertainties and gain insights for generating new prompts in a self-evolving manner.

Result: Extensive experiments on various tasks and LLMs (DeepSeek-R1-Distill-Llama-8B, Qwen2.5-7B-Instruct, GPT-4o-mini) show DelvePO consistently outperforms previous SOTA methods under identical settings.

Conclusion: DelvePO demonstrates effectiveness and transferability across different tasks, providing a task-agnostic framework for flexible prompt optimization.

Abstract: Prompt optimization has emerged as a crucial approach due to its capabilities in steering Large Language Models to solve various tasks. However, current works mainly rely on the random rewriting ability of LLMs, and the optimization process generally focuses on specific influencing factors, which makes it easy to fall into local optima. Besides, the performance of the optimized prompt is often unstable, which limits its transferability across tasks. To address the above challenges, we propose DelvePO (Direction-Guided Self-Evolving Framework for Flexible Prompt Optimization), a task-agnostic framework to optimize prompts in a self-evolving manner. In our framework, we decouple prompts into different components that can be used to explore the impact that different factors may have on various tasks. On this basis, we introduce working memory, through which LLMs can alleviate the deficiencies caused by their own uncertainties and further obtain key insights to guide the generation of new prompts. Extensive experiments are conducted on different tasks covering various domains for both open- and closed-source LLMs, including DeepSeek-R1-Distill-Llama-8B, Qwen2.5-7B-Instruct and GPT-4o-mini. Experimental results show that DelvePO consistently outperforms previous SOTA methods under identical experimental settings, demonstrating its effectiveness and transferability across different tasks.

[33] Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs

Yanhong Li, Zixuan Lan, Jiawei Zhou

Main category: cs.CL

TL;DR: Using text-as-image input compression for LLMs reduces token usage by nearly half while maintaining performance on long-context tasks.

DetailsMotivation: To explore if visual text representations can compress textual inputs for LLMs to reduce token usage while preserving performance.

Method: Render long text inputs as single images and provide them directly to decoder LLMs, exploiting visual text representations as input compression.

Result: Substantial token savings (often nearly half) without degrading task performance on RULER (long-context retrieval) and CNN/DailyMail (document summarization) benchmarks.

Conclusion: Visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs.

Abstract: Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and providing it directly to the model. This leads to a dramatically reduced number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
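A minimal sketch of the rendering step using Pillow; the wrapping width, font size, and output resolution are untested assumptions, and the real pipeline would pass the image to a multimodal decoder.

```python
# Sketch of rendering long text to a single image for a multimodal LLM.
from PIL import Image, ImageDraw
import textwrap

def text_to_image(text, width_chars=80, font_px=12, pad=10):
    lines = textwrap.wrap(text, width=width_chars)
    w = int(width_chars * font_px * 0.6) + 2 * pad
    h = len(lines) * (font_px + 2) + 2 * pad
    img = Image.new("RGB", (w, h), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):  # uses Pillow's default bitmap font
        draw.text((pad, pad + i * (font_px + 2)), line, fill="black")
    return img

doc = "Long input document... " * 50
img = text_to_image(doc)
img.save("context.png")  # sent as one image instead of thousands of text tokens
```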

[34] BrailleLLM: Braille Instruction Tuning with Large Language Models for Braille Domain Tasks

Tianyuan Huang, Zepeng Zhu, Hangdi Xing, Zirui Shao, Zhi Yu, Chaoxiong Yang, Jiaxian He, Xiaozhong Liu, Jiajun Bu

Main category: cs.CL

TL;DR: The paper introduces BrailleLLM with Braille Knowledge-Based Fine-Tuning (BKFT) to address Braille processing challenges, achieving improved performance in translation tasks and providing open-source datasets for multilingual Braille research.

DetailsMotivation: Braille is crucial for visually impaired individuals' education and information access, but faces challenges like data scarcity and ambiguities in mixed-text contexts.

Method: Constructed English and Chinese Braille Mixed Datasets (EBMD/CBMD) with mathematical formulas, proposed syntax tree-based augmentation for Braille data, and developed Braille Knowledge-Based Fine-Tuning (BKFT) via instruction tuning.

Result: BKFT achieves significant performance improvements over conventional fine-tuning in Braille translation scenarios, enabling unified Braille translation, formula-to-Braille conversion, and mixed-text translation.

Conclusion: The open-sourced datasets and BKFT methodology establish a foundation for low-resource multilingual Braille research, addressing key challenges in Braille information processing.

Abstract: Braille plays a vital role in education and information accessibility for visually impaired individuals. However, Braille information processing faces challenges such as data scarcity and ambiguities in mixed-text contexts. We construct English and Chinese Braille Mixed Datasets (EBMD/CBMD) with mathematical formulas to support diverse Braille domain research, and propose a syntax tree-based augmentation method tailored for Braille data. To address the underperformance of traditional fine-tuning methods in Braille-related tasks, we investigate Braille Knowledge-Based Fine-Tuning (BKFT), which reduces the learning difficulty of Braille contextual features. BrailleLLM employs BKFT via instruction tuning to achieve unified Braille translation, formula-to-Braille conversion, and mixed-text translation. Experiments demonstrate that BKFT achieves significant performance improvements over conventional fine-tuning in Braille translation scenarios. Our open-sourced datasets and methodologies establish a foundation for low-resource multilingual Braille research.

[35] From Retrieval to Generation: Unifying External and Parametric Knowledge for Medical Question Answering

Lei Li, Xiao Zhou, Yingying Zhang, Xian Wu

Main category: cs.CL

TL;DR: MedRGAG is a unified framework that integrates retrieval and generation for medical QA, addressing limitations of both RAG (noisy retrieval) and GAG (hallucinated information) through knowledge-guided context completion and adaptive document selection.

DetailsMotivation: Existing medical QA approaches have limitations - RAG suffers from noisy/incomplete retrieval while GAG is vulnerable to hallucinated information. Both can mislead reasoning and undermine answer reliability in medical contexts where accuracy is critical.

Method: MedRGAG combines two key modules: Knowledge-Guided Context Completion (KGCC) directs generation to complement missing knowledge from retrieval, and Knowledge-Aware Document Selection (KADS) adaptively selects optimal combination of retrieved and generated documents for evidence.

Result: Extensive experiments on five medical QA benchmarks show MedRGAG achieves 12.5% improvement over MedRAG and 4.5% gain over MedGENIE, demonstrating effectiveness of unifying retrieval and generation for knowledge-intensive reasoning.

Conclusion: MedRGAG effectively addresses limitations of both RAG and GAG approaches by seamlessly integrating external and parametric knowledge, providing a more reliable framework for medical question answering through unified retrieval-generation augmentation.

Abstract: Medical question answering (QA) requires extensive access to domain-specific knowledge. A promising direction is to enhance large language models (LLMs) with external knowledge retrieved from medical corpora or parametric knowledge stored in model parameters. Existing approaches typically fall into two categories: Retrieval-Augmented Generation (RAG), which grounds model reasoning on externally retrieved evidence, and Generation-Augmented Generation (GAG), which depends solely on the model's internal knowledge to generate contextual documents. However, RAG often suffers from noisy or incomplete retrieval, while GAG is vulnerable to hallucinated or inaccurate information due to unconstrained generation. Both issues can mislead reasoning and undermine answer reliability. To address these challenges, we propose MedRGAG, a unified retrieval-generation augmented framework that seamlessly integrates external and parametric knowledge for medical QA. MedRGAG comprises two key modules: Knowledge-Guided Context Completion (KGCC), which directs the generator to produce background documents that complement the missing knowledge revealed by retrieval; and Knowledge-Aware Document Selection (KADS), which adaptively selects an optimal combination of retrieved and generated documents to form concise yet comprehensive evidence for answer generation. Extensive experiments on five medical QA benchmarks demonstrate that MedRGAG achieves a 12.5% improvement over MedRAG and a 4.5% gain over MedGENIE, highlighting the effectiveness of unifying retrieval and generation for knowledge-intensive reasoning. Our code and data are publicly available at https://anonymous.4open.science/r/MedRGAG
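The document-selection idea (KADS) can be sketched as scoring a mixed pool of retrieved and generated documents and keeping the best few; the word-overlap scorer below is a deliberately crude stand-in for the paper's selector.

```python
# Toy sketch of knowledge-aware document selection over a mixed pool of
# retrieved and generated documents. Scoring is word overlap, purely
# illustrative; the paper's KADS module is learned.
def overlap_score(question, doc):
    q, d = set(question.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def select_evidence(question, retrieved, generated, k=2):
    pool = ([(d, "retrieved") for d in retrieved]
            + [(d, "generated") for d in generated])
    ranked = sorted(pool, key=lambda x: overlap_score(question, x[0]),
                    reverse=True)
    return ranked[:k]  # the answer prompt is built from this mixed evidence

question = "What is the first-line treatment for type 2 diabetes?"
retrieved = ["Metformin is first-line therapy for type 2 diabetes."]
generated = ["Type 2 diabetes management begins with lifestyle change and metformin.",
             "Insulin is reserved for later stages."]
print(select_evidence(question, retrieved, generated))
```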

[36] ECG-LLM – training and evaluation of domain-specific large language models for electrocardiography

Lara Ahrens, Wilhelm Haverkamp, Nils Strodthoff

Main category: cs.CL

TL;DR: Domain-adapted LLMs through finetuning on ECG literature achieve competitive performance with proprietary models, supporting viable privacy-preserving clinical solutions.

DetailsMotivation: To characterize optimal adaptation strategies and evaluate performance of domain-adapted LLMs in healthcare, specifically electrocardiography, compared to general-purpose models.

Method: Finetuned open-weight models on domain-specific literature, implemented multi-layered evaluation framework comparing finetuned models, RAG, and Claude Sonnet 3.7.

Result: Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human experts preferred Claude 3.7 and RAG for complex queries.

Conclusion: Domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting viability of privacy-preserving, locally deployable clinical solutions.

Abstract: Domain-adapted open-weight large language models (LLMs) offer promising healthcare applications, from queryable knowledge bases to multimodal assistants, with the crucial advantage of local deployment for privacy preservation. However, optimal adaptation strategies, evaluation methodologies, and performance relative to general-purpose LLMs remain poorly characterized. We investigated these questions in electrocardiography, an important area of cardiovascular medicine, by finetuning open-weight models on domain-specific literature and implementing a multi-layered evaluation framework comparing finetuned models, retrieval-augmented generation (RAG), and Claude Sonnet 3.7 as a representative general-purpose model. Finetuned Llama 3.1 70B achieved superior performance on multiple-choice evaluations and automatic text metrics, ranking second to Claude 3.7 in LLM-as-a-judge assessments. Human expert evaluation favored Claude 3.7 and RAG approaches for complex queries. Finetuned models significantly outperformed their base counterparts across nearly all evaluation modes. Our findings reveal substantial performance heterogeneity across evaluation methodologies, underscoring assessment complexity. Nevertheless, domain-specific adaptation through finetuning and RAG achieves competitive performance with proprietary models, supporting the viability of privacy-preserving, locally deployable clinical solutions.

[37] Combining Distantly Supervised Models with In Context Learning for Monolingual and Cross-Lingual Relation Extraction

Vipul Rathore, Malik Hammad Faisal, Parag Singla, Mausam

Main category: cs.CL

TL;DR: HYDRE is a hybrid framework that combines trained DSRE models with LLM in-context learning, using dynamic exemplar retrieval to improve relation extraction from noisy distant supervision, achieving significant F1 gains in English and low-resource Indic languages.

DetailsMotivation: Existing DSRE models rely on task-specific training but struggle with noisy annotations, and their integration with LLM in-context learning is underexplored due to potential incorrect learning of relation semantics from noisy data.

Method: HYDRE first uses a trained DSRE model to identify top-k candidate relations, then employs a novel dynamic exemplar retrieval strategy to extract reliable sentence-level exemplars from training data, which are provided in LLM prompts for final relation prediction.

Result: HYDRE achieves up to 20 F1 point gains in English and average 17 F1 point gains on four Indic languages (Oriya, Santali, Manipuri, Tulu) compared to prior state-of-the-art DSRE models.

Conclusion: The hybrid approach combining DSRE models with LLM in-context learning through dynamic exemplar retrieval effectively handles noisy distant supervision and extends successfully to cross-lingual settings for low-resource languages.

Abstract: Distantly Supervised Relation Extraction (DSRE) remains a long-standing challenge in NLP, where models must learn from noisy bag-level annotations while making sentence-level predictions. While existing state-of-the-art (SoTA) DSRE models rely on task-specific training, their integration with in-context learning (ICL) using large language models (LLMs) remains underexplored. A key challenge is that the LLM may not learn relation semantics correctly, due to noisy annotation. In response, we propose HYDRE, a HYbrid Distantly supervised Relation Extraction framework. It first uses a trained DSRE model to identify the top-k candidate relations for a given test sentence, then uses a novel dynamic exemplar retrieval strategy that extracts reliable, sentence-level exemplars from training data, which are then provided in the LLM prompt to output the final relation(s). We further extend HYDRE to cross-lingual settings for RE in low-resource languages. Using available English DSRE training data, we evaluate all methods on English as well as a newly curated benchmark covering four diverse low-resource Indic languages – Oriya, Santali, Manipuri, and Tulu. HYDRE achieves up to 20 F1 point gains in English and, on average, 17 F1 points on Indic languages over prior SoTA DSRE models. Detailed ablations demonstrate HYDRE’s efficacy compared to other prompting strategies.
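A rough sketch of the hybrid flow, with stand-ins for the trained DSRE ranker and the exemplar bank; only the prompt-assembly shape follows the description above.

```python
# Illustrative sketch of a HYDRE-style hybrid pipeline; all components
# are placeholders for the trained DSRE model and retrieval strategy.
def topk_relations(sentence, k=2):
    # Placeholder for the trained DSRE model's ranked relation candidates.
    return ["founded_by", "ceo_of"][:k]

exemplar_bank = {  # reliable sentence-level exemplars per relation
    "founded_by": ["'Apple was founded by Steve Jobs.' -> founded_by"],
    "ceo_of":     ["'Tim Cook is the CEO of Apple.' -> ceo_of"],
}

def build_prompt(sentence, k=2):
    candidates = topk_relations(sentence, k)
    lines = ["Candidate relations: " + ", ".join(candidates), "Examples:"]
    for rel in candidates:
        lines += exemplar_bank[rel]
    lines.append(f"Sentence: {sentence}\nRelation:")
    return "\n".join(lines)   # sent to the LLM for the final prediction

print(build_prompt("Jeff Bezos started Amazon in 1994."))
```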

[38] KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers

Mohd Ruhul Ameen, Akif Islam, Farjana Aktar, M. Saifuzzaman Rafat

Main category: cs.CL

TL;DR: KrishokBondhu is a voice-enabled agricultural advisory platform using RAG framework for Bengali-speaking farmers, achieving 44.7% improvement over benchmarks with high-quality responses for 72.7% of queries.

DetailsMotivation: Address challenges faced by Bangladeshi farmers in accessing timely, expert-level agricultural guidance through accessible technology solutions.

Method: Built on Retrieval-Augmented Generation (RAG) framework with OCR/document parsing, vector database indexing, speech-to-text conversion, and text-to-speech delivery in Bengali.

Result: Achieved 72.7% high-quality responses, composite score of 4.53/5 (44.7% improvement over KisanQRS), with major gains in contextual richness (+367%) and completeness (+100.4%).

Conclusion: Demonstrates feasibility of integrating call-centre accessibility, multilingual voice interaction, and RAG techniques for expert-level agricultural guidance to remote farmers.

Abstract: In Bangladesh, many farmers continue to face challenges in accessing timely, expert-level agricultural guidance. This paper presents KrishokBondhu, a voice-enabled, call-centre-integrated advisory platform built on a Retrieval-Augmented Generation (RAG) framework, designed specifically for Bengali-speaking farmers. The system aggregates authoritative agricultural handbooks, extension manuals, and NGO publications; applies Optical Character Recognition (OCR) and document-parsing pipelines to digitize and structure the content; and indexes this corpus in a vector database for efficient semantic retrieval. Through a simple phone-based interface, farmers can call the system to receive real-time, context-aware advice: speech-to-text converts the Bengali query, the RAG module retrieves relevant content, a large language model (Gemma 3-4B) generates a context-grounded response, and text-to-speech delivers the answer in natural spoken Bengali. In a pilot evaluation, KrishokBondhu produced high-quality responses for 72.7% of diverse agricultural queries covering crop management, disease control, and cultivation practices. Compared to the KisanQRS benchmark, the system achieved a composite score of 4.53 (vs. 3.13) on a 5-point scale, a 44.7% improvement, with especially large gains in contextual richness (+367%) and completeness (+100.4%), while maintaining comparable relevance and technical specificity. Semantic similarity analysis further revealed a strong correlation between retrieved context and answer quality, emphasizing the importance of grounding generative responses in curated documentation. KrishokBondhu demonstrates the feasibility of integrating call-centre accessibility, multilingual voice interaction, and modern RAG techniques to deliver expert-level agricultural guidance to remote Bangladeshi farmers, paving the way toward a fully AI-driven agricultural advisory ecosystem.
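The call flow reduces to four stages, sketched below with placeholder functions for each service component (Bengali ASR, vector retrieval, the RAG-grounded LLM, and TTS); none of this is the deployed system's code.

```python
# High-level sketch of the call flow: STT -> retrieve -> generate -> TTS.
def speech_to_text(audio):       # stand-in for Bengali ASR
    return "আমার ধানে পোকা লেগেছে"  # "My rice crop has a pest infestation"

def retrieve(query, k=3):        # stand-in for vector-database retrieval
    return ["Handbook: integrated pest management for rice ..."][:k]

def generate_answer(query, docs):  # RAG-grounded LLM (Gemma 3-4B per the paper)
    return f"Context: {docs[0]} | Advice for: {query}"

def text_to_speech(text):        # stand-in for Bengali TTS over the phone line
    return b"<audio-bytes>"

def handle_call(audio):
    query = speech_to_text(audio)
    docs = retrieve(query)
    answer = generate_answer(query, docs)
    return text_to_speech(answer)

print(len(handle_call(b"...")))
```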

[39] KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs

Donghyeon Ko, Yeguk Jin, Kyubyung Chae, Byungwook Lee, Chansong Jo, Sookyo In, Jaehong Lee, Taesup Kim, Donghyun Kwak

Main category: cs.CL

TL;DR: KoSimpleQA is a Korean cultural knowledge benchmark with 1,000 fact-seeking questions that reveals LLMs struggle with Korean factual knowledge, achieving only 33.7% accuracy even with the best model.

DetailsMotivation: To evaluate factuality in LLMs specifically for Korean cultural knowledge, as existing benchmarks are primarily English-focused and performance rankings differ significantly between languages.

Method: Created a benchmark of 1,000 short, unambiguous fact-seeking questions about Korean culture and evaluated diverse open-source LLMs that support Korean, including analysis of reasoning models.

Result: Even the strongest model achieved only 33.7% accuracy, showing the challenging nature of Korean cultural knowledge. Performance rankings differed substantially from English benchmarks, and reasoning capabilities helped models better elicit knowledge and abstain when uncertain.

Conclusion: KoSimpleQA provides a valuable benchmark for evaluating Korean cultural knowledge in LLMs, revealing significant gaps in current models’ factual knowledge about Korean culture and demonstrating the importance of language-specific evaluation.

Abstract: We present Korean SimpleQA (KoSimpleQA), a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates the correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at https://anonymous.4open.science/r/KoSimpleQA-62EB.

[40] Towards Fair ASR For Second Language Speakers Using Fairness Prompted Finetuning

Monorama Swain, Bubai Maji, Jagabandhu Mishra, Markus Schedl, Anders Søgaard, Jesper Rindom Jensen

Main category: cs.CL

TL;DR: The paper proposes fairness-prompted finetuning with lightweight adapters to improve ASR fairness for second-language speakers, achieving significant WER improvements across accent groups.

DetailsMotivation: To address fairness gaps in ASR systems for second-language speakers, as analysis reveals large WER fluctuations across 26 accent groups in models like Whisper and Seamless-M4T.

Method: Proposes fairness-prompted finetuning with lightweight adapters, combining traditional ERM with cross-entropy and fairness-driven objectives (Spectral Decoupling, Group-DRO, and Invariant Risk Minimization).

Result: Achieves 58.7% and 58.5% relative improvement in macro-averaged WER over pretrained Whisper and Seamless-M4T, and 9.7% and 7.8% improvement over standard ERM finetuning.

Conclusion: The fusion of ERM with fairness-driven objectives effectively enhances fairness across accent groups while maintaining overall ASR accuracy.

Abstract: In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of widely used ASR models, Whisper and Seamless-M4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted finetuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of traditional empirical risk minimization (ERM) with cross-entropy and fairness-driven objectives (SD, Group-DRO, and IRM) enhances fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged word error rate, our approach achieves a relative improvement of 58.7% and 58.5% over the large pretrained Whisper and Seamless-M4T models, and of 9.7% and 7.8% over finetuning them with standard empirical risk minimization using cross-entropy loss.
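Of the objectives fused with ERM, Group-DRO is the easiest to sketch: per-group losses are reweighted toward the worst-performing accent groups. The PyTorch snippet below is schematic, with an assumed exponentiated-gradient update and invented numbers.

```python
# Minimal Group-DRO-style objective over accent groups; schematic only.
import torch

def group_dro_loss(per_example_loss, group_ids, n_groups, step_size=0.01,
                   group_weights=None):
    if group_weights is None:
        group_weights = torch.ones(n_groups) / n_groups
    group_losses = torch.stack([
        per_example_loss[group_ids == g].mean() if (group_ids == g).any()
        else torch.tensor(0.0)
        for g in range(n_groups)
    ])
    # Upweight the worst-performing groups (exponentiated-gradient update).
    group_weights = group_weights * torch.exp(step_size * group_losses.detach())
    group_weights = group_weights / group_weights.sum()
    return (group_weights * group_losses).sum(), group_weights

losses = torch.tensor([0.5, 2.0, 0.3, 1.8])   # per-utterance CE losses
groups = torch.tensor([0, 1, 0, 1])           # accent-group labels
loss, w = group_dro_loss(losses, groups, n_groups=2)
print(loss.item(), w.tolist())                # group 1 gets the larger weight
```

In practice the updated group weights would be carried across training steps rather than recomputed from uniform each batch, as done here for brevity.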

[41] MENTOR: A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models

ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim

Main category: cs.CL

TL;DR: MENTOR is a framework that combines reinforcement learning with teacher-guided distillation to improve tool-using capabilities in small language models, addressing poor generalization in SFT and reward sparsity in standard RL.

DetailsMotivation: Current approaches like supervised fine-tuning (SFT) suffer from poor generalization as they only imitate static teacher trajectories, while standard RL with sparse rewards fails to effectively guide small language models due to inefficient exploration and suboptimal strategies.

Method: MENTOR synergistically combines RL with teacher-guided distillation, using an RL-based process to learn generalizable policies through exploration, and constructs dense composite teacher-guided rewards from teacher’s reference trajectories to provide fine-grained guidance.

Result: Extensive experiments show MENTOR significantly improves cross-domain generalization and strategic competence of small language models compared to both SFT and standard sparse-reward RL baselines.

Conclusion: The MENTOR framework effectively addresses the limitations of both SFT and standard RL by combining reinforcement learning with teacher-guided distillation, enabling better tool-using capabilities in small language models with improved generalization and strategic performance.

Abstract: Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher’s reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.

[42] Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

Siyuan Yan, Guo-Qing Jiang, Yuchen Zhang, Xiaoxing Ma, Ran Zhu, Chun Cao, Jingwei Xu

Main category: cs.CL

TL;DR: Adamas is a lightweight sparse attention mechanism that achieves near-lossless performance with high sparsity, enabling 4.4x self-attention speedup on long sequences while maintaining accuracy comparable to full attention.

DetailsMotivation: Extended context windows in LLMs exacerbate quadratic self-attention costs, causing severe latency. Existing sparse attention methods struggle to recall critical key-value pairs, leading to accuracy degradation.

Method: Adamas applies Hadamard transform, bucketization, and 2-bit compression to create compact representations, then uses Manhattan-distance estimation for efficient top-k selection of key-value pairs.

Result: Adamas matches full attention accuracy with 64-token budget, achieves near-lossless performance at 128 tokens, supports 8x higher sparsity than SOTA methods, and delivers 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences.

Conclusion: Adamas effectively maintains accuracy under aggressive sparsity, achieving comparable or lower perplexity than full attention while significantly reducing computational costs for long-context inference.

Abstract: Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selection. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance with a 128-token budget, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
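The selection path can be reconstructed schematically: rotate with a Hadamard transform, quantize to 2-bit codes, estimate relevance by Manhattan distance, and keep the top-k keys. The bucketization rule and constants below are assumptions, not the paper's kernel.

```python
# Schematic reconstruction of the Adamas-style selection path.
import numpy as np
from scipy.linalg import hadamard

d, n, k = 64, 1024, 64
rng = np.random.default_rng(0)
H = hadamard(d) / np.sqrt(d)                 # orthonormal Hadamard transform

def compress(x):
    z = x @ H                                # rotate to spread information
    # Bucketize each coordinate into 4 levels (2-bit codes); quartile edges
    # are an assumed quantization rule for this sketch.
    edges = np.quantile(z, [0.25, 0.5, 0.75])
    return np.digitize(z, edges).astype(np.int8)   # values in {0,1,2,3}

keys = rng.normal(size=(n, d))
query = rng.normal(size=d)
qc, kc = compress(query[None, :]), compress(keys)
manhattan = np.abs(kc - qc).sum(axis=1)      # cheap proxy for (dis)similarity
selected = np.argsort(manhattan)[:k]         # attend only to these KV pairs
print("kept", len(selected), "of", n, "tokens")
```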

[43] Chain-of-Conceptual-Thought: Eliciting the Agent to Deeply Think within the Response

Qingqing Gu, Dan Wang, Yue Zhao, Xiaoyu Wang, Zhonglin Jiang, Yong Chen, Hongyan Li, Luo Ji

Main category: cs.CL

TL;DR: Chain-of-Thought (CoT) has limitations for open-domain tasks, so the authors propose Chain of Conceptual Thought (CoCT) where LLMs first tag concepts then generate content, improving performance in conversational tasks.

DetailsMotivation: CoT performs poorly in open-domain tasks due to lack of clearly defined reasoning steps, so a new approach is needed for tasks like daily and emotional support conversations.

Method: CoCT paradigm: LLM first tags concepts (emotions, strategies, topics), then generates detailed content, allowing concept chains within utterances to encourage deep thinking.

Result: CoCT outperforms baselines (Self-Refine, ECoT, ToT, SoT, RAG) in automatic, human and model evaluations for conversational tasks.

Conclusion: CoCT represents an effective prompt-based paradigm that extends LLM capabilities to a wider range of open-domain tasks beyond traditional reasoning domains.

Abstract: Chain-of-Thought (CoT) is widely applied to improve the LLM capability in math, coding and reasoning tasks. However, its performance is limited for open-domain tasks since there are no clearly defined reasoning steps or logical transitions. To mitigate such challenges, we propose another prompt-based paradigm called Chain of Conceptual Thought (CoCT), where the LLM first tags a concept, then generates the detailed content. The chain of concepts is allowed within the utterance, encouraging the LLM’s deep and strategic thinking. We experiment with this paradigm in daily and emotional support conversations, where the concepts comprise emotions, strategies and topics. Automatic, human and model evaluations suggest that CoCT surpasses baselines such as Self-Refine, ECoT, ToT, SoT and RAG, suggesting a potentially effective prompt-based paradigm for LLMs across a wider scope of tasks.
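A minimal example of what a concept-tagged response might look like under this paradigm; the bracketed tag schema is an assumption based on the abstract, not the authors' exact format.

```python
# Illustrative CoCT-style output: concept tags (emotion/strategy/topic)
# precede each content span within a single utterance.
import re

response = (
    "[emotion: empathy] I can hear how stressful this week has been for you. "
    "[strategy: reflection] It sounds like the deadline is the core worry. "
    "[topic: planning] Could we break the task into two smaller steps?"
)

concepts = re.findall(r"\[(\w+): (\w+)\]", response)
print(concepts)  # [('emotion', 'empathy'), ('strategy', 'reflection'), ('topic', 'planning')]
```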

[44] Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

Yasser Hamidullah, Koel Dutta Chowdury, Yusser Al-Ghussin, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet

Main category: cs.CL

TL;DR: Proposes a token-level reliability measure to quantify visual grounding in sign language translation models, combining feature-based sensitivity and counterfactual signals to detect hallucinations.

DetailsMotivation: Hallucination is a major flaw in vision-language models, especially critical in sign language translation where meaning depends on precise visual grounding. Gloss-free models are particularly vulnerable as they map signer movements directly to language without intermediate alignment.

Method: Develops a reliability measure combining feature-based sensitivity (internal changes when video is masked) and counterfactual signals (probability differences between clean and altered video inputs), aggregated into sentence-level scores.

Result: Reliability predicts hallucination rates, generalizes across datasets and architectures, decreases under visual degradations, distinguishes grounded from guessed tokens, and improves risk estimation when combined with text-based signals.

Conclusion: Establishes reliability as a practical tool for diagnosing hallucinations in SLT and lays groundwork for more robust hallucination detection in multimodal generation.

Abstract: Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
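The counterfactual component of the score can be sketched as a per-token log-probability difference between clean and altered video inputs; the model call below is a placeholder, and the aggregation follows the paper only loosely.

```python
# Sketch of the counterfactual reliability signal: grounded tokens lose
# probability mass when the video is masked; guessed tokens do not.
import numpy as np

def token_logprobs(video, tokens):
    # Stand-in for the SLT decoder's per-token log-probabilities.
    rng = np.random.default_rng(0 if video == "clean" else 1)
    return np.log(rng.uniform(0.05, 0.95, size=len(tokens)))

tokens = ["the", "weather", "will", "be", "sunny"]
clean = token_logprobs("clean", tokens)
masked = token_logprobs("masked", tokens)
token_reliability = clean - masked            # drops to ~0 for guessed tokens
sentence_reliability = float(token_reliability.mean())
print({t: round(float(r), 2) for t, r in zip(tokens, token_reliability)})
print("sentence-level reliability:", round(sentence_reliability, 2))
```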

[45] Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models

Atharvan Dogra, Soumya Suvra Ghosal, Ameet Deshpande, Ashwin Kalyan, Dinesh Manocha

Main category: cs.CL

TL;DR: LLM humor generation shows bias amplification where harmful content (stereotypical/toxic jokes) receives higher humor scores, creating a loop between generators and evaluators that reinforces problematic content.

DetailsMotivation: To evaluate safety concerns in LLM creative writing by examining how humor optimization couples with harmful content like stereotypes and toxicity.

Method: Joint measurement of humor, stereotypicality, and toxicity across six models; information-theoretic analysis of incongruity signals; external validation with satire-generation task and human judgments.

Result: Harmful outputs get 10-21% higher humor scores; stereotypical jokes appear 11-28% more often in LLM-rated funny content and up to 10% more in human-perceived funny content; closed models also show increased stereotypicality and toxicity.

Conclusion: LLM humor generation pipelines structurally embed harmful content, with bias amplification loops between generators and evaluators that make stereotypical/toxic content more likely to be rated as funny.

Abstract: Large language models are increasingly used for creative writing and engagement content, raising safety concerns about the outputs. Therefore, casting humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. This is further supplemented by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that harmful outputs receive higher humor scores which further increase under role-based prompting, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting structural embedding in learned humor distributions. External validation on an additional satire-generation task with human-perceived funniness judgments shows that LLM satire increases stereotypicality and, typically, toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain 10-21% in mean humor score, stereotypical jokes appear 11% to 28% more often among the jokes marked funny by the LLM-based metric, and up to 10% more often in generations perceived as funny by humans.

[46] ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks

Liyang He, Yuren Zhang, Ziwei Zhu, Zhenghui Li, Shiwei Tong

Main category: cs.CL

TL;DR: ChronoPlay is a framework for automated generation of dynamic RAG benchmarks in gaming, addressing dual dynamics of game updates and player focus shifts through dual-source synthesis from official and community sources.

DetailsMotivation: Lack of dedicated benchmarks for RAG systems in dynamic domains like online gaming, where dual dynamics (game updates and player focus shifts) create evaluation challenges requiring player-centric authenticity.

Method: Uses dual-dynamic update mechanism to track game content updates and player community shifts, and dual-source synthesis engine drawing from official sources and player community for factual correctness and authentic query patterns.

Result: Created first dynamic RAG benchmark for gaming domain on three distinct games, providing insights into model performance under complex, realistic conditions.

Conclusion: ChronoPlay successfully addresses the integrated challenge of automated benchmark generation for gaming RAG systems, enabling standardized evaluation in dynamic domains with dual dynamics.

Abstract: Retrieval Augmented Generation (RAG) systems are increasingly vital in dynamic domains like online gaming, yet the lack of a dedicated benchmark has impeded standardized evaluation in this area. The core difficulty lies in Dual Dynamics: the constant interplay between game content updates and the shifting focus of the player community. Furthermore, the necessity of automating such a benchmark introduces a critical requirement for player-centric authenticity to ensure generated questions are realistic. To address this integrated challenge, we introduce ChronoPlay, a novel framework for the automated and continuous generation of game RAG benchmarks. ChronoPlay utilizes a dual-dynamic update mechanism to track both forms of change, and a dual-source synthesis engine that draws from official sources and the player community to ensure both factual correctness and authentic query patterns. We instantiate our framework on three distinct games to create the first dynamic RAG benchmark for the gaming domain, offering new insights into model performance under these complex and realistic conditions. Code is available at: https://github.com/hly1998/ChronoPlay.

[47] DePass: Unified Feature Attributing by Simple Decomposed Forward Pass

Xiangyu Hong, Che Jiang, Kai Tian, Biqing Qi, Youbang Sun, Ning Ding, Bowen Zhou

Main category: cs.CL

TL;DR: DePass is a unified framework for feature attribution in Transformer models that decomposes hidden states into additive components and propagates them through fixed attention scores and MLP activations in a single forward pass.

DetailsMotivation: To address the central challenge of attributing Transformer model behavior to internal computations in mechanistic interpretability without requiring auxiliary training.

Method: Decomposes hidden states into customized additive components and propagates them through the model with fixed attention scores and MLP activations in a single forward pass.

Result: Achieves faithful, fine-grained attribution validated across token-level, model component-level, and subspace-level attribution tasks, demonstrating effectiveness and fidelity.

Conclusion: DePass serves as a foundational tool for broader interpretability applications by enabling attribution of information flow between arbitrary Transformer components.

Abstract: Attributing the behavior of Transformer models to internal computations is a central challenge in mechanistic interpretability. We introduce DePass, a unified framework for feature attribution based on a single decomposed forward pass. DePass decomposes hidden states into customized additive components, then propagates them with attention scores and MLP’s activations fixed. It achieves faithful, fine-grained attribution without requiring auxiliary training. We validate DePass across token-level, model component-level, and subspace-level attribution tasks, demonstrating its effectiveness and fidelity. Our experiments highlight its potential to attribute information flow between arbitrary components of a Transformer model. We hope DePass serves as a foundational tool for broader applications in interpretability.
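
The core linearity trick is easy to see in isolation: once the attention pattern is frozen, the value and output projections are linear maps, so additive components of the hidden states propagate independently and still sum to the full output. A minimal single-layer sketch with simplified single-head attention and hypothetical shapes:

```python
import torch

def attention_fixed(components: torch.Tensor, attn: torch.Tensor,
                    W_v: torch.Tensor, W_o: torch.Tensor) -> torch.Tensor:
    """Propagate additive hidden-state components through one attention layer
    whose attention weights `attn` are held fixed.
    components: [C, T, d] -- C additive parts of the hidden states
    attn:       [T, T]    -- frozen attention weights (rows sum to 1)
    """
    values = components @ W_v                       # [C, T, d]
    mixed = torch.einsum("st,ctd->csd", attn, values)
    return mixed @ W_o

torch.manual_seed(0)
C, T, d = 3, 5, 8
parts = torch.randn(C, T, d)
attn = torch.softmax(torch.randn(T, T), dim=-1)
W_v, W_o = torch.randn(d, d), torch.randn(d, d)

out_parts = attention_fixed(parts, attn, W_v, W_o)
out_full = attention_fixed(parts.sum(0, keepdim=True), attn, W_v, W_o)
# Additivity check: component outputs sum to the full forward pass.
print(torch.allclose(out_parts.sum(0), out_full[0], atol=1e-5))  # True
```

The paper applies the analogous fixing trick to MLP activations, which is what lets the decomposition survive the whole network in a single pass.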

[48] CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning

Masato Kikuchi, Masatsugu Ono, Toshioki Soga, Tetsu Tanabe, Tadachika Ozono

Main category: cs.CL

TL;DR: The paper presents a WordNet annotated with CEFR language proficiency levels using LLM-based semantic similarity, creating a corpus for contextual lexical classifiers that achieve high accuracy comparable to gold-standard data.

DetailsMotivation: To address the challenge that WordNet's fine-grained sense distinctions pose for second-language learners by integrating semantic networks with language-proficiency levels.

Method: Automated annotation using a large language model to measure semantic similarity between WordNet sense definitions and English Vocabulary Profile Online entries, then constructing a corpus for developing contextual lexical classifiers.

Result: Models fine-tuned on the corpus perform comparably to gold-standard annotations, and a practical classifier combining both datasets achieves a Macro-F1 score of 0.81, demonstrating high annotation accuracy.

Conclusion: The annotated WordNet, corpus, and classifiers are publicly available to bridge NLP and language education, facilitating more effective language learning.

Abstract: Although WordNet is a valuable resource owing to its structured semantic networks and extensive vocabulary, its fine-grained sense distinctions can be challenging for second-language learners. To address this, we developed a WordNet annotated with the Common European Framework of Reference for Languages (CEFR), integrating its semantic networks with language-proficiency levels. We automated this process using a large language model to measure the semantic similarity between sense definitions in WordNet and entries in the English Vocabulary Profile Online. To validate our method, we constructed a large-scale corpus containing both sense and CEFR-level information from our annotated WordNet and used it to develop contextual lexical classifiers. Our experiments demonstrate that models fine-tuned on our corpus perform comparably to those trained on gold-standard annotations. Furthermore, by combining our corpus with the gold-standard data, we developed a practical classifier that achieves a Macro-F1 score of 0.81, indicating the high accuracy of our annotations. Our annotated WordNet, corpus, and classifiers are publicly available to help bridge the gap between natural language processing and language education, thereby facilitating more effective and efficient language learning.
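
A minimal sketch of the matching logic, assuming hypothetical EVP-style entries and using a sentence-embedding model in place of the LLM-based similarity the authors describe:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

evp_entries = [  # hypothetical English Vocabulary Profile style entries
    ("bank: organisation for keeping money", "A1"),
    ("bank: raised ground along a river", "B1"),
]
sense_def = "sloping land, especially beside a body of water"  # a WordNet gloss

# Assign the CEFR level of the most semantically similar levelled entry.
sim = util.cos_sim(model.encode(sense_def),
                   model.encode([text for text, _ in evp_entries]))
best = int(sim.argmax())
print(evp_entries[best][1])  # expected: "B1" (the river-bank sense)
```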

[49] IMB: An Italian Medical Benchmark for Question Answering

Antonio Romano, Giuseppe Riccio, Mariano Barone, Marco Postiglione, Vincenzo Moscato

Main category: cs.CL

TL;DR: The paper introduces two Italian medical benchmarks (IMB-QA and IMB-MCQA) for medical question answering, showing that domain-specific adaptation and retrieval strategies outperform larger general-purpose LLMs in medical QA tasks.

DetailsMotivation: Online medical forums contain valuable healthcare knowledge but pose challenges for automated QA systems due to informal language and linguistic complexity, especially in non-English languages like Italian.

Method: Created two Italian medical datasets: IMB-QA (782,644 patient-doctor conversations) and IMB-MCQA (25,862 multiple-choice questions). Used LLMs to improve data clarity while preserving meaning, and tested RAG and domain-specific fine-tuning approaches.

Result: Specialized adaptation strategies outperformed larger general-purpose models in medical question answering tasks, suggesting domain expertise and efficient retrieval are more beneficial than model scale.

Conclusion: Effective medical AI systems benefit more from domain expertise and information retrieval than from increased model scale. The datasets and frameworks are released to support multilingual medical QA research.

Abstract: Online medical forums have long served as vital platforms where patients seek professional healthcare advice, generating vast amounts of valuable knowledge. However, the informal nature and linguistic complexity of forum interactions pose significant challenges for automated question answering systems, especially when dealing with non-English languages. We present two comprehensive Italian medical benchmarks: IMB-QA, containing 782,644 patient-doctor conversations from 77 medical categories, and IMB-MCQA, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while retaining their original meaning and conversational style, and compare a variety of LLM architectures on both open and multiple-choice question answering tasks. Our experiments with Retrieval Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models in medical question answering tasks. These findings suggest that effective medical AI systems may benefit more from domain expertise and efficient information retrieval than from increased model scale. We release both datasets and evaluation frameworks in our GitHub repository to support further research on multilingual medical question answering: https://github.com/PRAISELab-PicusLab/IMB.

[50] DART: A Structured Dataset of Regulatory Drug Documents in Italian for Clinical NLP

Mariano Barone, Antonio Laudante, Giuseppe Riccio, Antonio Romano, Marco Postiglione, Vincenzo Moscato

Main category: cs.CL

TL;DR: DART is the first structured corpus of Italian drug regulatory documents (Summaries of Product Characteristics) from the Italian Medicines Agency, created to address the lack of non-English pharmacological resources.

DetailsMotivation: Most pharmacological NLP research relies on English corpora like DrugBank, creating a significant gap for other healthcare systems, particularly Italian.

Method: Built through a reproducible pipeline including web-scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization using few-shot-tuned LLM with low-temperature decoding.

Result: DART provides structured information on key pharmacological domains and enables accurate inference of drug interactions and clinical implications when used with instruction-tuned LLMs.

Conclusion: The DART dataset successfully addresses the resource gap for Italian pharmacological NLP and demonstrates utility through an LLM-based drug interaction checker that performs well when grounded in the structured data.

Abstract: The extraction of pharmacological knowledge from regulatory documents has become a key focus in biomedical natural language processing, with applications ranging from adverse event monitoring to AI-assisted clinical decision support. However, research in this field has predominantly relied on English-language corpora such as DrugBank, leaving a significant gap in resources tailored to other healthcare systems. To address this limitation, we introduce DART (Drug Annotation from Regulatory Texts), the first structured corpus of Italian Summaries of Product Characteristics derived from the official repository of the Italian Medicines Agency (AIFA). The dataset was built through a reproducible pipeline encompassing web-scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization using a few-shot-tuned large language model with low-temperature decoding. DART provides structured information on key pharmacological domains such as indications, adverse drug reactions, and drug-drug interactions. To validate its utility, we implemented an LLM-based drug interaction checker that leverages the dataset to infer clinically meaningful interactions. Experimental results show that instruction-tuned LLMs can accurately infer potential interactions and their clinical implications when grounded in the structured textual fields of DART. We publicly release our code on GitHub: https://github.com/PRAISELab-PicusLab/DART.

[51] How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices

Han Peng, Peiyu Liu, Zican Dong, Daixuan Cheng, Junyi Li, Yiru Tang, Shuo Wang, Wayne Xin Zhao

Main category: cs.CL

TL;DR: Diffusion language models (DLMs) underperform autoregressive models in speed despite their parallel decoding potential. This study systematically analyzes DLM efficiency, identifies evaluation issues, and finds AR models achieve higher throughput while acceleration strategies offer limited benefits.

DetailsMotivation: DLMs offer parallel decoding for greater efficiency but current open-source implementations underperform AR models in speed, limiting their real-world utility. There's a need to understand why DLMs lag behind despite their theoretical advantages.

Method: Systematic study of DLM efficiency using empirical benchmarking and roofline-based theoretical analysis. Investigated acceleration strategies like dual cache and parallel decoding across different batch sizes.

Result: AR models generally achieve higher throughput than DLMs. Acceleration strategies mainly offer gains at small batch sizes, with benefits diminishing upon scaling. Current DLMs consistently lag behind AR counterparts in speed.

Conclusion: Robust evaluation methods and improved acceleration strategies are necessary to advance DLM research. The findings highlight the gap between theoretical parallelization potential and practical performance of diffusion language models.

Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to the long-dominant autoregressive (AR) paradigm, offering a parallelizable decoding process that could yield greater efficiency. Yet, in practice, current open-source DLMs often underperform their AR counterparts in speed, limiting their real-world utility. This work presents a systematic study of DLM efficiency, identifying key issues in prior evaluation methods. Through empirical benchmarking and a roofline-based theoretical analysis, we demonstrate that AR models generally achieve higher throughput, while DLMs consistently lag. We also investigate acceleration strategies, finding that techniques like dual cache and parallel decoding mainly offer gains at small batch sizes, with their benefits diminishing upon scaling. Our findings underscore the necessity of robust evaluation methods and improved acceleration strategies to advance research on DLMs.
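
A toy roofline calculation shows the kind of reasoning involved; the hardware peaks and model size below are illustrative assumptions, not figures from the paper.

```python
# Roofline model: a kernel is memory-bound when its arithmetic intensity
# (FLOPs per byte moved) is below peak_flops / peak_bandwidth.
def attainable_flops(flops: float, bytes_moved: float,
                     peak_flops: float, peak_bw: float) -> float:
    intensity = flops / bytes_moved              # FLOP per byte
    return min(peak_flops, intensity * peak_bw)  # roofline bound

peak_flops, peak_bw = 312e12, 2.0e12   # A100-class BF16 peak and HBM bandwidth
params = 7e9                           # a 7B-parameter model
weight_bytes = 2 * params              # weights read once per step in BF16

# Each decoded token costs ~2*params FLOPs; small batches cannot amortize
# the weight traffic, which is why low-batch decoding is memory-bound.
for batch in (1, 16, 256):
    bound = attainable_flops(2 * params * batch, weight_bytes, peak_flops, peak_bw)
    print(f"batch {batch:3d}: {bound / 1e12:6.1f} TFLOP/s attainable")
```

This is consistent with the paper's observation that acceleration tricks help mainly at small batch sizes: once batching pushes the workload into the compute-bound regime, there is little headroom left.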

[52] Identity-Aware Large Language Models require Cultural Reasoning

Alistair Plum, Anne-Marie Lutgen, Christoph Purschke, Achim Rettinger

Main category: cs.CL

TL;DR: Current LLMs lack cultural reasoning, defaulting to Western norms and failing to adapt to diverse cultural contexts, which can perpetuate stereotypes and erode trust. Cultural reasoning should be treated as a foundational capability alongside accuracy and coherence.

DetailsMotivation: LLMs often reflect narrow cultural viewpoints that overlook global diversity, potentially sustaining stereotypes, ignoring minority perspectives, eroding trust, and perpetuating hate. Current models default to Western norms in moral judgments, idiom interpretation, and advice-giving.

Method: The paper defines cultural reasoning as the capacity to recognize culture-specific knowledge, values, and social norms, and to adjust outputs to align with individual user expectations. It critiques current evaluation methods that focus on static accuracy scores rather than adaptive reasoning in context.

Result: Recent empirical studies show that fine-tuning on survey data only partly reduces models’ tendency to default to Western norms. Broader datasets alone cannot ensure genuine cultural competence.

Conclusion: Cultural reasoning must be treated as a foundational capability alongside factual accuracy and linguistic coherence. The paper lays a foundation for future systems to respond with greater sensitivity to human cultural diversity by clarifying the concept and outlining initial assessment directions.

Abstract: Large language models have become the latest trend in natural language processing, heavily featuring in the digital tools we use every day. However, their replies often reflect a narrow cultural viewpoint that overlooks the diversity of global users. This missing capability could be referred to as cultural reasoning, which we define here as the capacity of a model to recognise culture-specific knowledge, values, and social norms, and to adjust its output so that it aligns with the expectations of individual users. Because culture shapes interpretation, emotional resonance, and acceptable behaviour, cultural reasoning is essential for identity-aware AI. When this capacity is limited or absent, models can sustain stereotypes, ignore minority perspectives, erode trust, and perpetuate hate. Recent empirical studies strongly suggest that current models default to Western norms when judging moral dilemmas, interpreting idioms, or offering advice, and that fine-tuning on survey data only partly reduces this tendency. The present evaluation methods mainly report static accuracy scores and thus fail to capture adaptive reasoning in context. Although broader datasets can help, they cannot alone ensure genuine cultural competence. Therefore, we argue that cultural reasoning must be treated as a foundational capability alongside factual accuracy and linguistic coherence. By clarifying the concept and outlining initial directions for its assessment, we lay a foundation for future systems to respond with greater sensitivity to the complex fabric of human culture.

[53] Building Trust in Clinical LLMs: Bias Analysis and Dataset Transparency

Svetlana Maslenkova, Clement Christophe, Marco AF Pimentel, Tathagata Raha, Muhammad Umar Salman, Ahmed Al Mahrooqi, Avani Gupta, Shadab Khan, Ronnie Rajan, Praveenkumar Kanithi

Main category: cs.CL

TL;DR: Analysis of biases in clinical language models, focusing on opioid prescription disparities across demographics, using a new 89B+ token healthcare dataset (HC4) and novel evaluation methods.

DetailsMotivation: Need for understanding how training data affects clinical AI behavior and bias, especially given current lack of transparency in dataset curation and bias assessment practices.

Method: Introduced HC4 dataset (89B+ tokens), used established benchmarks plus novel healthcare-specific evaluation methodology to analyze differential opioid prescription tendencies across ethnicity, gender, and age groups.

Result: Provided crucial insights into potential downstream biases in clinical language models regarding opioid prescription disparities across demographic groups.

Conclusion: Comprehensive evaluation frameworks are essential for fostering trust and guiding improvements in clinical AI fairness and safety.

Abstract: Large language models offer transformative potential for healthcare, yet their responsible and equitable development depends critically on a deeper understanding of how training data characteristics influence model behavior, including the potential for bias. Current practices in dataset curation and bias assessment often lack the necessary transparency, creating an urgent need for comprehensive evaluation frameworks to foster trust and guide improvements. In this study, we present an in-depth analysis of potential downstream biases in clinical language models, with a focus on differential opioid prescription tendencies across diverse demographic groups, such as ethnicity, gender, and age. As part of this investigation, we introduce HC4: Healthcare Comprehensive Commons Corpus, a novel and extensively curated pretraining dataset exceeding 89 billion tokens. Our evaluation leverages both established general benchmarks and a novel, healthcare-specific methodology, offering crucial insights to support fairness and safety in clinical AI applications.

[54] Large language models for folktale type automation based on motifs: Cinderella case study

Tjaša Arčon, Marko Robnik-Šikonja, Polona Tratnik

Main category: cs.CL

TL;DR: Using AI methods to automatically detect motifs in Cinderella tales and analyze their similarities through clustering and dimensionality reduction.

DetailsMotivation: To adapt artificial intelligence approaches for large-scale analyses in folkloristics, enabling computational analysis of extensive text collections and cross-lingual comparisons.

Method: Employed machine learning and natural language processing to automatically detect motifs in a large collection of Cinderella variants, then used clustering and dimensionality reduction to analyze their similarities and differences.

Result: Large language models successfully detected complex interactions in tales, demonstrating the feasibility of computational analysis for extensive text collections in folkloristics.

Conclusion: AI approaches, particularly large language models, enable effective computational analysis of folklore texts and facilitate cross-lingual comparisons in digital humanities research.

Abstract: Artificial intelligence approaches are being adapted to many research areas, including digital humanities. We built a methodology for large-scale analyses in folkloristics. Using machine learning and natural language processing, we automatically detected motifs in a large collection of Cinderella variants and analysed their similarities and differences with clustering and dimensionality reduction. The results show that large language models detect complex interactions in tales, enabling computational analysis of extensive text collections and facilitating cross-lingual comparisons.
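
The analysis step reduces to clustering tales represented as motif vectors; a minimal sketch with hypothetical motifs and variants, not the paper's data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Tales as binary motif vectors (1 = motif detected by the model).
motifs = ["cruel stepmother", "magic helper", "lost shoe", "recognition test"]
tales = {
    "variant_A": [1, 1, 1, 1],
    "variant_B": [1, 1, 1, 0],
    "variant_C": [0, 1, 0, 1],
}
X = np.array(list(tales.values()))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)  # 2D projection for comparison
for name, lab, xy in zip(tales, labels, coords):
    print(name, "cluster", lab, "at", xy.round(2))
```

Because motif vectors abstract away from surface language, the same representation supports the cross-lingual comparisons the authors highlight.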

[55] Beyond the Explicit: A Bilingual Dataset for Dehumanization Detection in Social Media

Dennis Assenmacher, Paloma Piot, Katarina Laken, David Jurgens, Claudia Wagner

Main category: cs.CL

TL;DR: This paper addresses the overlooked issue of digital dehumanization in NLP, focusing on subtle forms beyond overtly negative statements, and introduces a bilingual dataset for detecting various dehumanization dimensions.

DetailsMotivation: Current research primarily focuses on overtly negative dehumanization, overlooking subtler forms that perpetuate harmful biases against marginalized groups in online interactions, creating a significant gap in computational linguistics.

Method: Used different sampling methods to collect a theory-informed bilingual dataset from Twitter and Reddit, annotated 16,000 instances on document- and span-level by crowdworkers and experts, and fine-tuned ML models on this dataset.

Result: The dataset covers different dimensions of dehumanization and serves as both training resource and benchmark. Fine-tuned ML models achieved performance surpassing state-of-the-art models in zero and few-shot in-context settings.

Conclusion: The research successfully addresses the gap in detecting subtle dehumanization forms and provides an effective dataset and models that outperform existing approaches, advancing dehumanization detection in computational linguistics.

Abstract: Digital dehumanization, although a critical issue, remains largely overlooked within the field of computational linguistics and Natural Language Processing. The prevailing approach in current research concentrates primarily on a single aspect of dehumanization, identifying overtly negative statements as its core marker. This focus, while crucial for understanding harmful online communications, inadequately addresses the broader spectrum of dehumanization. Specifically, it overlooks the subtler forms of dehumanization that, despite not being overtly offensive, still perpetuate harmful biases against marginalized groups in online interactions. These subtler forms can insidiously reinforce negative stereotypes and biases without explicit offensiveness, making them harder to detect yet equally damaging. Recognizing this gap, we use different sampling methods to collect a theory-informed bilingual dataset from Twitter and Reddit. Using crowdworkers and experts to annotate 16,000 instances on a document- and span-level, we show that our dataset covers the different dimensions of dehumanization. This dataset serves as both a training resource for machine learning models and a benchmark for evaluating future dehumanization detection techniques. To demonstrate its effectiveness, we fine-tune ML models on this dataset, achieving performance that surpasses state-of-the-art models in zero and few-shot in-context settings.

[56] Dynamical model parameters from ultrasound tongue kinematics

Sam Kirkham, Patrycja Strycharczuk

Main category: cs.CL

TL;DR: Ultrasound tongue imaging can reliably estimate articulatory dynamics parameters comparable to EMA, supporting its use for evaluating speech motor control models.

DetailsMotivation: To evaluate whether ultrasound imaging can reliably estimate dynamical parameters of speech articulation, as an alternative to traditional EMA methods.

Method: Compare parameters of linear harmonic oscillator models estimated from ultrasound tongue kinematics with simultaneously-recorded EMA data, including mandibular tracking.

Result: Ultrasound and EMA yield comparable dynamical parameters, and mandibular short tendon tracking adequately captures jaw motion.

Conclusion: Ultrasound kinematics can be used to evaluate dynamical articulatory models, providing a viable alternative to EMA for studying speech motor control.

Abstract: The control of speech can be modelled as a dynamical system in which articulators are driven toward target positions. These models are typically evaluated using fleshpoint data, such as electromagnetic articulography (EMA), but recent methodological advances make ultrasound imaging a promising alternative. We evaluate whether the parameters of a linear harmonic oscillator can be reliably estimated from ultrasound tongue kinematics and compare these with parameters estimated from simultaneously-recorded EMA data. We find that ultrasound and EMA yield comparable dynamical parameters, while mandibular short tendon tracking also adequately captures jaw motion. This supports using ultrasound kinematics to evaluate dynamical articulatory models.
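
The parameter estimation itself is a small least-squares problem: given position, velocity, and acceleration traces, fit acc = -b*vel - k*(pos - T). A self-contained sketch on simulated data (units and the trajectory are illustrative, not articulatory recordings):

```python
import numpy as np

def fit_oscillator(pos, vel, acc):
    # Design matrix columns: -vel, -pos, 1  ->  acc = -b*vel - k*pos + k*T
    A = np.column_stack([-vel, -pos, np.ones_like(pos)])
    (b, k, kT), *_ = np.linalg.lstsq(A, acc, rcond=None)
    return b, k, kT / k  # damping, stiffness, target position

# Simulate a critically damped gesture toward target T to test recovery.
dt, k_true, T_true = 1e-3, 100.0, 1.0
b_true = 2 * np.sqrt(k_true)
x, v = 0.0, 0.0
pos, vel, acc = [], [], []
for _ in range(2000):
    a = -b_true * v - k_true * (x - T_true)
    pos.append(x); vel.append(v); acc.append(a)
    v += a * dt; x += v * dt
b, k, T = fit_oscillator(np.array(pos), np.array(vel), np.array(acc))
print(round(b, 2), round(k, 2), round(T, 3))  # ~20.0, ~100.0, ~1.0
```

In the paper, the comparison is between parameters recovered this way from ultrasound-derived tongue trajectories and from simultaneously recorded EMA fleshpoints.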

[57] Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering

Feras AlMannaa, Talia Tseriotou, Jenny Chim, Maria Liakata

Main category: cs.CL

TL;DR: First comprehensive study of LLM comprehension capabilities on long-context medical QA, examining model size effects, memorization issues, and RAG strategies for improvement.

DetailsMotivation: To investigate LLM capabilities in long-context medical question answering of clinical relevance, which hasn't been studied before.

Method: Comprehensive assessment across various content-inclusion settings, LLM models of different capabilities, datasets with different task formulations, and examination of RAG effects on medical long-context comprehension.

Result: Revealed insights on model size effects, limitations, underlying memorization issues, benefits of reasoning models, and best settings for single vs multi-document reasoning. Showcased RAG strategies for improvements over long-context.

Conclusion: The study provides multi-faceted evaluation addressing when RAG is beneficial over long-context through qualitative and error analyses, revealing common failure cases.

Abstract: This study is the first to investigate LLM comprehension capabilities over long-context (LC) medical QA of clinical relevance. Our comprehensive assessment spans a range of content-inclusion settings based on their relevance, LLM models of varying capabilities and datasets across task formulations, revealing insights on model size effects, limitations, underlying memorization issues and the benefits of reasoning models. Importantly, we examine the effect of RAG on medical LC comprehension, uncover the best settings in single versus multi-document reasoning datasets and showcase RAG strategies for improvements over LC. We shed light on several evaluation aspects using a multi-faceted approach. Our qualitative and error analyses address open questions on when RAG is beneficial over LC, revealing common failure cases.

[58] SemiAdapt and SemiLoRA: Efficient Domain Adaptation for Transformer-based Low-Resource Language Translation with a Case Study on Irish

Josh McGiff, Nikola S. Nikolov

Main category: cs.CL

TL;DR: SemiAdapt and SemiLoRA are semi-supervised inference-efficient methods that improve domain adaptation in neural machine translation, enabling parameter-efficient fine-tuning to match or exceed full-model fine-tuning performance.

DetailsMotivation: To make high-quality domain adaptation and fine-tuning more accessible for low-resource languages like Irish by reducing computational costs of fine-tuning large multilingual models.

Method: Introduces SemiAdapt and SemiLoRA as semi-supervised approaches that use small trainable adapter layers (LoRA) for parameter-efficient fine-tuning, focusing on embedding-based inference methods.

Result: SemiAdapt outperforms full-domain fine-tuning, and SemiLoRA enables PEFT methods to match or outperform full-model fine-tuning, especially on larger and noisier corpora.

Conclusion: These methods successfully bridge the computational barrier for low-resource language research while maintaining or improving translation quality, with all Irish translation models released as open resources.

Abstract: Fine-tuning is widely used to tailor large language models for specific tasks such as neural machine translation (NMT). However, leveraging transfer learning is computationally expensive when fine-tuning large multilingual models with billions of parameters, thus creating a barrier to entry for researchers working on low-resource domains such as Irish translation. Parameter-efficient fine-tuning (PEFT) bridges this gap by training on a fraction of the original model parameters, with the Low-Rank Adaptation (LoRA) approach introducing small, trainable adapter layers. We introduce SemiAdapt and SemiLoRA as semi-supervised inference-efficient approaches that strengthen domain adaptation and lead to improved overall performance in NMT. We demonstrate that SemiAdapt can outperform full-domain fine-tuning, while most notably, SemiLoRA can propel PEFT methods to match or even outperform full-model fine-tuning. We further evaluate domain-by-dataset fine-tuning and demonstrate that our embedding-based inference methods perform especially well on larger and noisier corpora. All Irish translation models developed in this work are released as open resources. These methods aim to make high-quality domain adaptation and fine-tuning more accessible to researchers working with low-resource languages.
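
For reference, this is the kind of LoRA setup such PEFT methods build on; the base checkpoint and hyperparameters below are placeholder assumptions, not the paper's configuration.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model

# An English-Irish translation model as the base for adaptation.
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-ga")
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a tiny fraction of weights train
```

SemiLoRA's contribution sits on top of this: using embedding-based inference to decide, semi-supervised, which domain's adapter a given sentence should be routed through.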

[59] Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation

Ming Li

Main category: cs.CL

TL;DR: RLAAR is a curriculum reinforcement learning framework that improves LLM performance in multi-turn conversations by teaching models to balance correct answering with informed abstention, significantly reducing Lost-in-Conversation degradation.

DetailsMotivation: Large Language Models perform well in single-turn settings but suffer from Lost-in-Conversation (LiC) - performance degradation as information is progressively revealed in multi-turn conversations.

Method: Uses curriculum RL with verifiable accuracy and abstention rewards, employing competence-gated curriculum that incrementally increases dialogue difficulty, multi-turn on-policy rollouts, and mixed-reward system.

Result: Significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%) on LiC benchmarks.

Conclusion: Provides a practical recipe for building reliable and trustworthy multi-turn LLMs by balancing problem-solving with informed abstention.

Abstract: Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by the current progress on Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building reliable and trustworthy multi-turn LLMs.
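
A minimal sketch of a mixed accuracy/abstention reward in this spirit; the weights and the solvability signal are illustrative assumptions, not the paper's exact formulation.

```python
def mixed_reward(answered: bool, correct: bool, solvable_now: bool) -> float:
    """Verifiable mixed reward: reward correct answers, reward abstention only
    while the instruction shards revealed so far leave the question unsolvable,
    and penalize premature guessing hardest."""
    if answered:
        if correct:
            return 1.0                               # verifiable accuracy reward
        return -1.0 if not solvable_now else -0.5    # premature guess vs. plain error
    return 0.5 if not solvable_now else -0.2         # informed abstention vs. stalling

# A turn where key information is still missing:
print(mixed_reward(answered=True, correct=False, solvable_now=False))   # -1.0
print(mixed_reward(answered=False, correct=False, solvable_now=False))  #  0.5
```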

[60] Topoformer: brain-like topographic organization in Transformer language models through spatial querying and reweighting

Taha Binhuraib, Greta Tuckute, Nicholas Blauch

Main category: cs.CL

TL;DR: The paper introduces ‘Topoformers’ - Transformers with topographic organization through spatial querying and spatial reweighting, which organize representations spatially like biological brains while maintaining performance on NLP tasks.

DetailsMotivation: Biological brains exhibit spatial functional organization where neurons are arranged topographically, while most machine learning models lack spatial biases and have disorganized vector spaces that are difficult to interpret.

Method: Two key innovations: 1) Spatial querying - keys and queries arranged on 2D grids with local pools of queries associated with each key; 2) Spatial reweighting - converting standard fully connected self-attention into locally connected layers.

Result: Topoformers perform on par with non-topographic controls on NLP benchmarks while producing interpretable topographic organization. They show alignment with human brain language network responses in fMRI data.

Conclusion: Topoformers enable interpretable spatial organization in Transformers while maintaining performance, offering promise for better NLP interpretability and more accurate models of linguistic organization in the human brain.

Abstract: Spatial functional organization is a hallmark of biological brains: neurons are arranged topographically according to their response properties, at multiple scales. In contrast, representations within most machine learning models lack spatial biases, instead manifesting as disorganized vector spaces that are difficult to visualize and interpret. Here, we propose a novel form of self-attention that turns Transformers into “Topoformers” with topographic organization. We introduce spatial querying - where keys and queries are arranged on 2D grids, and local pools of queries are associated with a given key - and spatial reweighting, where we convert the standard fully connected layer of self-attention into a locally connected layer. We first demonstrate the feasibility of our approach by training a 1-layer Topoformer on a sentiment classification task. Training with spatial querying encourages topographic organization in the queries and keys, and spatial reweighting separately encourages topographic organization in the values and self-attention outputs. We then apply the Topoformer motifs at scale, training a BERT architecture with a masked language modeling objective. We find that the topographic variant performs on par with a non-topographic control model on NLP benchmarks, yet produces interpretable topographic organization as evaluated via eight linguistic test suites. Finally, analyzing an fMRI dataset of human brain responses to a large set of naturalistic sentences, we demonstrate alignment between low-dimensional topographic variability in the Topoformer model and the human brain's language network. Scaling up Topoformers further holds promise for greater interpretability in NLP research, and for more accurate models of the organization of linguistic information in the human brain.
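
The locality bias behind spatial reweighting can be sketched as a distance mask over units laid out on a 2D grid; the grid size and radius below are illustrative assumptions.

```python
import torch

def grid_locality_mask(side: int, radius: float) -> torch.Tensor:
    """Binary mask keeping only connections between units whose 2D grid
    positions lie within `radius` of each other."""
    ys, xs = torch.meshgrid(torch.arange(side), torch.arange(side), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()  # [side*side, 2]
    dist = torch.cdist(coords, coords)   # pairwise grid distances
    return (dist <= radius).float()      # 1 = connection kept

side = 8                                 # 64 units on an 8x8 grid
mask = grid_locality_mask(side, radius=2.0)
layer = torch.nn.Linear(side * side, side * side, bias=False)
with torch.no_grad():
    layer.weight.mul_(mask)              # fully connected -> locally connected
print(f"{mask.mean().item():.2%} of connections survive")
```

Spatial querying applies the same grid-locality idea to which query units pool over a given key, rather than to a weight matrix.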

[61] AI use in American newspapers is widespread, uneven, and rarely disclosed

Jenna Russell, Marzena Karpinska, Destiny Akinode, Katherine Thai, Bradley Emi, Max Spero, Mohit Iyyer

Main category: cs.CL

TL;DR: Approximately 9% of newspaper articles are AI-generated, with higher usage in local outlets, specific topics, and opinion pieces, but rarely disclosed.

DetailsMotivation: To understand the extent and distribution of AI use in published newspaper articles, as AI transforms journalism but its prevalence remains unclear.

Method: Audited 186K articles from 1.5K American newspapers using Pangram AI detector, plus manual audit of 100 AI-flagged articles and analysis of 45K opinion pieces from major publications.

Result: 9% of articles are AI-generated, unevenly distributed (more in local outlets, weather/tech topics, certain ownership groups). Opinion pieces are 6.4x more likely to contain AI content. Only 5% of AI-flagged articles disclosed AI use.

Conclusion: Urgent need for greater transparency and updated editorial standards regarding AI use in journalism to maintain public trust.

Abstract: AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.

[62] KAT-Coder Technical Report

Zizheng Zhan, Ken Deng, Xiaojiang Zhang, Jinghui Wang, Huaixi Tang, Zhiyi Lai, Haoyang Huang, Wen Xiang, Kun Wu, Wenhao Zhuang, Minglei Zhang, Shaojie Wang, Shangpeng Yan, Kepeng Lei, Zongxian Feng, Huiming Wang, Zheng Lin, Mengtong Li, Mengfei Xie, Yinghan Cui, Xuxing Chen, Chao Wang, Weihao Li, Wenqiang Zhu, Jiarong Zhang, Jingxuan Xu, Songwei Yu, Yifan Yao, Xinping Lei, Han Li, Junqi Xiong, Zuchen Gao, Dailin Li, Haimo Li, Jiaheng Liu, Yuqun Zhang, Junyi Peng, Haotian Zhang, Bin Chen

Main category: cs.CL

TL;DR: KAT-Coder is a large-scale agentic code model trained through a multi-stage curriculum to bridge the gap between static text-based training and dynamic real-world agentic execution in software development.

DetailsMotivation: To address the challenge of bridging static text-based training with dynamic real-world agentic execution in coding workflows, enabling autonomous reasoning, planning, and action in interactive software development.

Method: Multi-stage training curriculum including: Mid-Term Training for reasoning/planning/reflection, Supervised Fine-Tuning with balanced dataset across 20 languages/10 contexts/10 task types, Reinforcement Fine-Tuning with multi-ground-truth reward formulation, and Reinforcement-to-Deployment Adaptation using Error-Masked SFT and Tree-Structured Trajectory Training.

Result: KAT-Coder achieves robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. The 32B model KAT-Dev has been open-sourced.

Conclusion: The multi-stage curriculum enables effective training of agentic code models capable of real-world deployment, with KAT-Coder demonstrating strong performance in intelligent coding agent applications.

Abstract: Recent advances in large language models (LLMs) have enabled progress in agentic coding, where models autonomously reason, plan, and act within interactive software development workflows. However, bridging the gap between static text-based training and dynamic real-world agentic execution remains a core challenge. In this technical report, we present KAT-Coder, a large-scale agentic code model trained through a multi-stage curriculum encompassing Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning (RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances reasoning, planning, and reflection capabilities through a corpus of real software engineering data and synthetic agentic interactions. The SFT stage constructs a million-sample dataset balancing twenty programming languages, ten development contexts, and ten task archetypes. The RFT stage introduces a novel multi-ground-truth reward formulation for stable and sample-efficient policy optimization. Finally, the Reinforcement-to-Deployment phase adapts the model to production-grade IDE environments using Error-Masked SFT and Tree-Structured Trajectory Training. In summary, these stages enable KAT-Coder to achieve robust tool-use reliability, instruction alignment, and long-context reasoning, forming a deployable foundation for real-world intelligent coding agents. Our KAT series 32B model, KAT-Dev, has been open-sourced on https://huggingface.co/Kwaipilot/KAT-Dev.

[63] WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection

Guanzhong He, Zhen Yang, Jinxin Liu, Bin Xu, Lei Hou, Juanzi Li

Main category: cs.CL

TL;DR: WebSeer is a search agent trained via reinforcement learning with self-reflection mechanism that achieves state-of-the-art results on QA benchmarks by enabling longer tool-use chains and improved accuracy.

DetailsMotivation: Existing search agents using reinforcement learning suffer from shallow tool-use depth and error accumulation in multiple iterative interactions, limiting their effectiveness in dynamic interactive retrieval.

Method: Constructed a large dataset with reflection patterns and designed a two-stage training framework unifying cold start and reinforcement learning within self-reflection paradigm for web-based environments, enabling generation of longer reflective tool-use trajectories.

Result: Achieved state-of-the-art results on HotpotQA (72.3%) and SimpleQA (90.0%) using a single 14B model, with strong generalization to out-of-distribution datasets.

Conclusion: The self-reflection mechanism in WebSeer substantially extends tool-use chains and improves answer accuracy, demonstrating the effectiveness of reflection-enhanced reinforcement learning for search agents.

Abstract: Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets. The code is available at https://github.com/99hgz/WebSeer

[64] Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health Monitoring

Shuxin Lin, Dhaval Patel, Christodoulos Constantinides

Main category: cs.CL

TL;DR: Knowledge distillation framework transfers Chain-of-Thought reasoning from LLMs to Small Language Models for industrial asset health applications, improving reasoning capabilities while maintaining efficiency.

DetailsMotivation: Small Language Models are popular in industrial applications due to efficiency, but struggle with complex reasoning tasks in specialized fields like Industry 4.0.

Method: Proposed knowledge distillation framework using Chain-of-Thought distillation from LLMs to SLMs via multi-choice question answering prompts, with in-context learning to verify knowledge quality.

Result: Fine-tuned SLMs with CoT reasoning significantly outperform base models and narrow the performance gap with LLM counterparts.

Conclusion: The framework successfully enhances reasoning capabilities of SLMs for industrial applications through knowledge distillation, making them more viable alternatives to LLMs while maintaining computational efficiency.

Abstract: Small Language Models (SLMs) are becoming increasingly popular in specialized fields, such as industrial applications, due to their efficiency, lower computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions. However, performing complex reasoning using SLMs in specialized fields such as Industry 4.0 remains challenging. In this paper, we propose a knowledge distillation framework for industrial asset health, which transfers reasoning capabilities via Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) to smaller, more efficient models (SLMs). We discuss the advantages and the process of distilling LLMs using multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. We also perform in-context learning to verify the quality of the generated knowledge and benchmark the performance of fine-tuned SLMs with generated knowledge against widely used LLMs. The results show that the fine-tuned SLMs with CoT reasoning outperform the base models by a significant margin, narrowing the gap to their LLM counterparts. Our code is open-sourced at: https://github.com/IBM/FailureSensorIQ.
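
A minimal sketch of the data-construction step in such a pipeline: a teacher's chain-of-thought for an MCQA item becomes the student's fine-tuning target. The prompt format and the `teacher` callable are hypothetical.

```python
def build_distillation_example(teacher, question, choices, answer):
    """Turn one MCQA item plus a teacher rationale into a fine-tuning pair."""
    prompt = (f"Question: {question}\n"
              + "\n".join(f"{c}. {t}" for c, t in choices)
              + "\nThink step by step, then answer.")
    rationale = teacher(prompt)  # chain-of-thought elicited from the LLM
    return {"input": prompt, "target": f"{rationale}\nAnswer: {answer}"}

example = build_distillation_example(
    teacher=lambda p: "Rising vibration together with heat points to mechanical wear.",
    question="A pump bearing shows rising vibration and heat. Likely cause?",
    choices=[("A", "Normal operation"), ("B", "Bearing wear"), ("C", "Sensor fault")],
    answer="B",
)
print(example["target"])
```

The paper's in-context learning check plays the role of a quality gate on these generated rationales before the SLM is fine-tuned on them.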

[65] MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, Yucheng Li, Yuqing Yang, Lili Qiu

Main category: cs.CL

TL;DR: MTraining is a distributed methodology that enables efficient training of LLMs with ultra-long contexts (up to 512K tokens) using dynamic sparse attention, achieving 6x higher training throughput while maintaining accuracy.

DetailsMotivation: The computational cost of training LLMs with long context windows is high, and existing dynamic sparse attention methods face challenges with worker- and step-level imbalance in distributed settings.

Method: MTraining integrates three components: dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention to address computational imbalance and communication overheads.

Result: Successfully trained Qwen2.5-3B to expand context window from 32K to 512K tokens on 32 A100 GPUs, achieving up to 6x higher training throughput while preserving accuracy across RULER, PG-19, InfiniteBench, and Needle In A Haystack benchmarks.

Conclusion: MTraining provides an effective distributed solution for training LLMs with ultra-long contexts using dynamic sparse attention, significantly improving training efficiency without compromising model performance.

Abstract: The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context processing. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.

[66] Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou

Main category: cs.CL

TL;DR: Critique-Post-Edit is a reinforcement learning framework that uses personalized generative reward models and critique-based editing to enable faithful and controllable LLM personalization, outperforming standard methods.

DetailsMotivation: Standard personalization methods like SFT and RLHF struggle with nuanced personalization and are prone to reward hacking, leading to verbose and superficially personalized responses.

Method: Proposes Critique-Post-Edit framework with: (1) Personalized Generative Reward Model providing multi-dimensional scores and textual critiques, (2) Critique-Post-Edit mechanism where policy model revises outputs based on critiques.

Result: Substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves 11% average win-rate improvement, and Qwen2.5-14B surpasses GPT-4 performance.

Conclusion: Demonstrates a practical path to faithful, efficient, and controllable personalization of LLMs.

Abstract: Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
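
Schematically, one training step looks like the loop below; `policy` and `reward_model` are hypothetical callables standing in for the fine-tuned LLM and the personalized GRM, and the accept/reject rule is an illustrative simplification.

```python
def critique_post_edit_step(policy, reward_model, prompt, profile):
    draft = policy(prompt, profile)
    scores, critique = reward_model(draft, profile)       # multi-dim scores + textual critique
    revised = policy(prompt, profile, critique=critique)  # targeted self-revision
    revised_scores, _ = reward_model(revised, profile)
    # Keep the revision only when the critique actually helped; the kept
    # output is what the RL update would then reinforce.
    return revised if sum(revised_scores) > sum(scores) else draft
```

Because the reward model returns a textual critique rather than a single scalar, the policy gets a direction to revise in, which is what makes the edit targeted rather than a blind resample.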

[67] Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wengang Zheng, Xiangchun Wang, Xiaodong Yan, Xiaopei Wan, Xin Zhao, Xinyu Kong, Xinyu Tang, Xudong Han, Xudong Wang, Xuemin Yang, Xueyu Hu, Yalin Zhang, Yan Sun, Yicheng Shan, Yilong Wang, Yingying Xu, Yongkang Liu, Yongzhen Guo, Yuanyuan Wang, Yuchen Yan, Yuefan Wang, Yuhong Guo, Zehuan Li, Zhankai Xu, Zhe Li, Zhenduo Zhang, Zhengke Gui, Zhenxuan Pan, Zhenyu Huang, Zhenzhong Lan, Zhiqiang Ding, Zhiqiang Zhang, Zhixun Li, Zhizhen Liu, Zihao Wang, Zujie Wen

Main category: cs.CL

TL;DR: Ring-1T is the first open-source trillion-parameter thinking model with 1T total parameters and 50B activated per token, achieving state-of-the-art results on multiple benchmarks including IMO-2025 silver medal level.

DetailsMotivation: To democratize large-scale reasoning intelligence by creating the first open-source trillion-parameter thinking model and addressing unprecedented training challenges at this scale.

Method: Three key innovations: IcePop for RL training stability via token-level discrepancy masking and clipping; C3PO++ for efficient long rollout processing under token budget; ASystem RL framework to overcome systemic bottlenecks in trillion-parameter training.

Result: Breakthrough performance: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, 55.94 on ARC-AGI-v1, and silver medal-level result on IMO-2025.

Conclusion: Ring-1T establishes a new baseline for open-source model performance and represents a significant milestone in democratizing large-scale reasoning intelligence by providing the complete 1T parameter MoE model to the community.

Abstract: We present Ring-1T, the first open-source, state-of-the-art thinking model with trillion-scale parameters. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.
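
A minimal sketch of IcePop-style token masking as the abstract describes it: tokens whose training-engine and inference-engine log-probabilities disagree beyond a tolerance are dropped from the policy-gradient loss. The threshold and tensors are illustrative assumptions.

```python
import torch

def masked_pg_loss(logp_train, logp_infer, advantages, tol=0.5):
    """Policy-gradient loss that excludes tokens where the training and
    inference engines disagree, the source of train-inference instability."""
    gap = (logp_train - logp_infer).abs()
    keep = (gap <= tol).float()                  # mask discrepant tokens
    per_token = -(logp_train * advantages) * keep
    return per_token.sum() / keep.sum().clamp(min=1.0)

logp_train = torch.tensor([-0.7, -1.2, -4.0, -0.3])
logp_infer = torch.tensor([-0.6, -1.0, -1.5, -0.4])  # third token mismatched
adv = torch.ones(4)
print(masked_pg_loss(logp_train, logp_infer, adv))   # third token excluded
```

In real training `logp_train` would carry gradients; the point of the mask is that mismatched tokens contribute neither gradient nor loss.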

[68] How Do LLMs Use Their Depth?

Akshat Gupta, Jay Yeung, Gopala Anumanchipalli, Anna Ivanova

Main category: cs.CL

TL;DR: LLMs use depth in a structured “Guess-then-Refine” pattern: early layers make high-frequency token guesses, later layers refine them with contextual information.

DetailsMotivation: To understand the fine-grained layer-wise prediction dynamics of large language models and how they use their computational depth non-uniformly.

Method: Traced intermediate representations of open-weight models during inference, analyzed token frequency patterns, and conducted three case studies: part-of-speech analysis, fact recall tasks, and multiple-choice tasks.

Result: Early layers predict high-frequency tokens as statistical guesses (refined >70% of the time), function words are predicted earliest, first tokens in multi-token answers require more depth, and response format is identified early while final decisions come late.

Conclusion: LLMs employ a structured depth usage pattern with early guessing and late refinement, providing insights for improving computational efficiency in transformer models.

Abstract: Growing evidence suggests that large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. In this paper, we trace the intermediate representations of several open-weight models during inference and reveal a structured and nuanced use of depth. Specifically, we propose a “Guess-then-Refine” framework that explains how LLMs internally structure their computations to make predictions. We first show that the top-ranked predictions in early LLM layers are composed primarily of high-frequency tokens, which act as statistical guesses proposed by the model early on due to the lack of appropriate contextual information. As contextual information develops deeper into the model, these initial guesses get refined into contextually appropriate tokens. Even high-frequency token predictions from early layers get refined >70% of the time, indicating that correct token prediction is not “one-and-done”. We then go beyond frequency-based prediction to examine the dynamic usage of layer depth across three case studies. (i) Part-of-speech analysis shows that function words are, on average, the earliest to be predicted correctly. (ii) Fact recall task analysis shows that, in a multi-token answer, the first token requires more computational depth than the rest. (iii) Multiple-choice task analysis shows that the model identifies the format of the response within the first half of the layers, but finalizes its response only toward the end. Together, our results provide a detailed view of depth usage in LLMs, shedding light on the layer-by-layer computations that underlie successful predictions and providing insights for future works to improve computational efficiency in transformer-based models.
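
A layer-wise trace of this kind can be approximated with a standard logit-lens-style probe: project each layer's hidden state through the model's final layer norm and unembedding and watch the top guess evolve with depth. The GPT-2 sketch below is a stand-in for the paper's tracing setup, not its exact procedure.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode every layer's last-position state through the final layer norm
# and unembedding; early layers tend to surface high-frequency guesses.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, repr(tok.decode(logits.argmax(-1))))
```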

[69] A Survey of Automatic Hallucination Evaluation on Natural Language Generation

Siya Qi, Lin Gui, Yulan He, Zheng Yuan

Main category: cs.CL

TL;DR: This survey systematically analyzes 105 automatic hallucination evaluation methods for LLMs, revealing 77.1% specifically target LLMs and proposing a structured framework to organize the fragmented field.

DetailsMotivation: The rapid advancement of LLMs has created a pressing need to reliably assess hallucinations to ensure model trustworthiness, but the field remains fragmented in methodologies, limiting conceptual clarity and practical progress.

Method: Systematic analysis of 105 evaluation methods, formulation of a structured framework based on survey of foundational datasets and benchmarks, and development of a taxonomy of evaluation methodologies documenting evolution from pre-LLM to post-LLM approaches.

Result: Revealed that 77.1% of evaluation methods specifically target LLMs, identified fundamental limitations in current approaches and their implications for real-world deployment, and documented the paradigm shift requiring new evaluation frameworks.

Conclusion: Proposed strategic directions including enhanced interpretability mechanisms and integration of application-specific evaluation criteria, providing a roadmap for developing more robust and practical hallucination evaluation systems.

Abstract: The rapid advancement of Large Language Models (LLMs) has brought a pressing challenge: how to reliably assess hallucinations to guarantee model trustworthiness. Although Automatic Hallucination Evaluation (AHE) has become an indispensable component of this effort, the field remains fragmented in its methodologies, limiting both conceptual clarity and practical progress. This survey addresses this critical gap through a systematic analysis of 105 evaluation methods, revealing that 77.1% specifically target LLMs, a paradigm shift that demands new evaluation frameworks. We formulate a structured framework to organize the field, based on a survey of foundational datasets and benchmarks and a taxonomy of evaluation methodologies, which together systematically document the evolution from pre-LLM to post-LLM approaches. Beyond taxonomical organization, we identify fundamental limitations in current approaches and their implications for real-world deployment. To guide future research, we delineate key challenges and propose strategic directions, including enhanced interpretability mechanisms and integration of application-specific evaluation criteria, ultimately providing a roadmap for developing more robust and practical hallucination evaluation systems.

[70] Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models

Artem Vazhentsev, Ekaterina Fadeeva, Rui Xing, Gleb Kuzmin, Ivan Lazichny, Alexander Panchenko, Preslav Nakov, Timothy Baldwin, Maxim Panov, Artem Shelmanov

Main category: cs.CL

TL;DR: Proposes a method to quantify uncertainty in LLMs by learning conditional dependencies from attention maps and recurrent uncertainty scores, achieving better selective generation performance.

DetailsMotivation: Uncertainty quantification is important for detecting LLM hallucinations, but it's complicated by the conditional dependency in autoregressive generation that's hard to model explicitly.

Method: Train a regression model using LLM attention maps, current generation probabilities, and recurrently computed uncertainty scores from previous tokens, with a two-staged training procedure for recurrent features.

Result: Experimental evaluation on ten datasets and three LLMs shows substantial improvements over competing unsupervised and supervised approaches for selective generation.

Conclusion: The proposed attention-based uncertainty quantification method is highly effective for detecting low-quality LLM outputs and enabling selective generation.

Abstract: Uncertainty quantification (UQ) has emerged as a promising approach for detecting hallucinations and low-quality output of Large Language Models (LLMs). However, obtaining proper uncertainty scores is complicated by the conditional dependency between the generation steps of an autoregressive LLM because it is hard to model it explicitly. Here, we propose to learn this dependency from attention-based features. In particular, we train a regression model that leverages LLM attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens. To incorporate the recurrent features, we also suggest a two-staged training procedure. Our experimental evaluation on ten datasets and three LLMs shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rivaling unsupervised and supervised approaches.
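
The two-staged recurrent idea can be sketched as follows: fit a first-stage regressor on per-step features, then feed its prediction for the previous token back in as an extra feature for the second stage. The features (attention entropy, token probability) and the Ridge regressor below are illustrative stand-ins for the paper's design, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
T = 200
attn_entropy = rng.random(T)   # stand-in for attention-map features
token_prob = rng.random(T)     # probability of the generated token
labels = rng.random(T)         # stand-in per-token quality targets

# Stage 1: regress uncertainty from static per-step features.
X1 = np.stack([attn_entropy, token_prob], axis=1)
stage1 = Ridge().fit(X1, labels)

# Stage 2: append the previous step's predicted uncertainty (recurrent feature).
prev_u = np.concatenate([[0.0], stage1.predict(X1)[:-1]])
X2 = np.column_stack([X1, prev_u])
stage2 = Ridge().fit(X2, labels)
```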

[71] When Text Embedding Meets Large Language Model: A Comprehensive Survey

Zhijie Nie, Zhangchi Feng, Mingxin Li, Cunwang Zhang, Yanzhao Zhang, Dingkun Long, Richong Zhang

Main category: cs.CL

TL;DR: Survey paper categorizing the interplay between LLMs and text embeddings into three themes: LLM-augmented embeddings, LLMs as embedders, and embedding understanding with LLMs, while addressing challenges and future directions.

DetailsMotivation: Text embeddings remain crucial for practical applications like semantic matching and information retrieval, despite advances in generative LLMs. Integrating LLMs with text embeddings has become a major research focus.

Method: Categorizes the relationship between LLMs and text embeddings into three themes: (1) LLM-augmented text embedding, (2) LLMs as text embedders, and (3) Text embedding understanding with LLMs. Organizes works by interaction patterns rather than specific applications.

Result: Provides a systematic overview of contributions from various research domains in the LLM era, highlighting both persistent challenges from the pre-LLM era and new obstacles introduced by LLMs.

Conclusion: Outlines prospective directions for text embedding evolution, addressing theoretical and practical opportunities in the rapidly advancing NLP landscape, emphasizing the continued importance of integrating LLMs with text embeddings.

Abstract: Text embedding has become a foundational technology in natural language processing (NLP) during the deep learning era, driving advancements across a wide array of downstream tasks. While many natural language understanding challenges can now be modeled using generative paradigms and leverage the robust generative and comprehension capabilities of large language models (LLMs), numerous practical applications - such as semantic matching, clustering, and information retrieval - continue to rely on text embeddings for their efficiency and effectiveness. Therefore, integrating LLMs with text embeddings has become a major research focus in recent years. In this survey, we categorize the interplay between LLMs and text embeddings into three overarching themes: (1) LLM-augmented text embedding, enhancing traditional embedding methods with LLMs; (2) LLMs as text embedders, adapting their innate capabilities for high-quality embedding; and (3) Text embedding understanding with LLMs, leveraging LLMs to analyze and interpret embeddings. By organizing recent works based on interaction patterns rather than specific downstream applications, we offer a novel and systematic overview of contributions from various research and application domains in the era of LLMs. Furthermore, we highlight the unresolved challenges that persisted in the pre-LLM era with pre-trained language models (PLMs) and explore the emerging obstacles brought forth by LLMs. Building on this analysis, we outline prospective directions for the evolution of text embedding, addressing both theoretical and practical opportunities in the rapidly advancing landscape of NLP.

[72] Rethinking LLM Uncertainty: A Multi-Agent Approach to Estimating Black-Box Model Uncertainty

Yu Feng, Phu Mon Htut, Zheng Qi, Wei Xiao, Manuel Mager, Nikolaos Pappas, Kishaloy Halder, Yang Li, Yassine Benajiba, Dan Roth

Main category: cs.CL

TL;DR: DiverseAgentEntropy is a novel method that uses multi-agent interaction across diverse query variations to better estimate uncertainty in black-box LLMs, addressing limitations of self-consistency methods that can be misled by contextual biases.

DetailsMotivation: Existing self-consistency methods for LLM uncertainty estimation can be misleading because models may confidently provide incorrect answers to target queries while giving accurate answers to knowledge-preserving variations, due to suboptimal parametric knowledge retrieval and contextual biases.

Method: DiverseAgentEntropy employs multi-agent interaction across diverse query variations for uncertainty estimation, providing a theoretically-grounded approach that better assesses true model uncertainty.

Result: The method more accurately assesses LLM uncertainty and improves hallucination detection, outperforming existing self-consistency based techniques.

Conclusion: Multi-agent interaction across diverse query variations provides a more reliable approach to uncertainty estimation in black-box LLMs, addressing the limitations of self-consistency methods affected by contextual biases.

Abstract: Quantifying uncertainty in black-box LLMs is vital for reliable responses and scalable oversight. Existing methods, which gauge a model’s uncertainty through evaluating self-consistency in responses to the target query, can be misleading: an LLM may confidently provide an incorrect answer to a target query, yet give a confident and accurate answer to a knowledge-preserving perturbation of that same query. We systematically analyze the model behaviors and demonstrate that this discrepancy stems from suboptimal retrieval of parametric knowledge, often due to contextual biases that prevent consistent access to stored knowledge. We then introduce DiverseAgentEntropy, a novel, theoretically-grounded method employing multi-agent interaction across diverse query variations for uncertainty estimation of black-box LLMs. This approach more accurately assesses an LLM’s true uncertainty and improves hallucination detection, outperforming existing self-consistency based techniques.
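
The aggregation step can be illustrated with a simple entropy over answers collected across query variations; the paper's actual multi-agent interaction protocol is richer than this sketch assumes.

```python
from collections import Counter
from math import log

def answer_entropy(answers):
    # Entropy of the empirical answer distribution: higher -> more uncertain.
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

# Answers from the same model on diverse paraphrases of one query.
print(answer_entropy(["Paris", "Paris", "Lyon", "Paris"]))
```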

[73] WHAT-IF: Exploring Branching Narratives by Meta-Prompting Large Language Models

Runsheng “Anson” Huang, Lara J. Martin, Chris Callison-Burch

Main category: cs.CL

TL;DR: WHAT-IF is an interactive fiction system that uses GPT-4 with zero-shot meta-prompting to generate branching narratives from linear stories, creating alternate timelines where players can choose between AI-generated decision branches.

DetailsMotivation: To enable interactive storytelling where players can explore 'what-if' scenarios in existing narratives, creating personalized alternate storylines from linear plots.

Method: Uses GPT-4 with zero-shot meta-prompting to identify key decision points in linear stories and generate coherent branching alternatives. Stores the branching plot in a graph structure for tracking and maintaining narrative coherence.

Result: The system successfully creates well-structured alternate storylines that maintain coherence with major plot points from the original story, enabling interactive fiction gameplay.

Conclusion: WHAT-IF demonstrates that LLMs can effectively generate branching narratives from linear stories, providing a framework for interactive fiction that preserves story coherence while offering player agency through decision-making.

Abstract: WHAT-IF – Writing a Hero’s Alternate Timeline through Interactive Fiction – is a system that uses zero-shot meta-prompting to create branching narratives from a prewritten story. Played as an interactive fiction (IF) game, WHAT-IF lets the player choose between decisions that the large language model (LLM) GPT-4 generates as possible branches in the story. Starting with an existing linear plot as input, a branch is created at each key decision taken by the main character. By meta-prompting the LLM to consider the major plot points from the story, the system produces coherent and well-structured alternate storylines. WHAT-IF stores the branching plot tree in a graph which helps it to both keep track of the story for prompting and maintain the structure for the final IF system. A demo of WHAT-IF can be found at https://what-if-game.github.io/.

[74] DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models

Niyati Bafna, Emily Chang, Nathaniel R. Robinson, David R. Mortensen, Kenton Murray, David Yarowsky, Hale Sirin

Main category: cs.CL

TL;DR: DialUp adapts MT models to low-resource dialects using training-time adaptation to dialectal data (M->D) and inference-time adaptation of dialectal data to model expertise (D->M), showing performance gains across multiple language families.

DetailsMotivation: Most languages are low-resource and lack MT support, but many have closely-related high-resource neighbors with regular linguistic differences, highlighting the need for model robustness to dialectal variation.

Method: Two approaches: M->D trains models on synthetic data showing linguistic mechanisms of dialectal variation; D->M adapts dialectal data at inference time to match model expertise for known target dialects.

Result: Considerable performance gains for several dialects from four language families, modest gains for two other families. Language varieties with low baseline MT performance benefit most.

Conclusion: DialUp successfully improves MT for low-resource dialects through complementary training and inference adaptations, with greatest benefits for dialects with poor baseline performance.

Abstract: Most of the world’s languages and dialects are low-resource, and lack support in mainstream machine translation (MT) models. However, many of them have a closely-related high-resource language (HRL) neighbor, and differ in linguistically regular ways from it. This underscores the importance of model robustness to dialectal variation and cross-lingual generalization to the HRL dialect continuum. We present DialUp, consisting of a training-time technique for adapting a pretrained model to dialectal data (M->D), and an inference-time intervention adapting dialectal data to the model expertise (D->M). M->D induces model robustness to potentially unseen and unknown dialects by exposure to synthetic data exemplifying linguistic mechanisms of dialectal variation, whereas D->M treats dialectal divergence for known target dialects. These methods show considerable performance gains for several dialects from four language families, and modest gains for two other language families. We also conduct feature and error analyses, which show that language varieties with low baseline MT performance are more likely to benefit from these approaches.
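
For intuition on the M->D direction, synthetic dialect-like data might be produced by applying regular perturbations to HRL sentences; the lexical swap rules below are invented purely for illustration and are not the paper's variation mechanisms.

```python
import random

# Hypothetical HRL-to-dialect lexical swaps (illustrative only).
SWAPS = {"you": "yuh", "going": "goin", "my": "mi"}

def dialectal_perturb(sentence, p=0.8, seed=0):
    rng = random.Random(seed)
    return " ".join(
        SWAPS.get(w, w) if rng.random() < p else w for w in sentence.split()
    )

print(dialectal_perturb("you are going to my house"))
```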

[75] FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model

Jinwei Hu, Zhenglin Huang, Xiangyu Yin, Wenjie Ruan, Guangliang Cheng, Yi Dong, Xiaowei Huang

Main category: cs.CL

TL;DR: FALCON is a novel machine unlearning approach that uses fine-grained activation manipulation to safely remove sensitive information from language models while preserving utility, addressing limitations of existing coarse-grained methods.

DetailsMotivation: Large language models can encode sensitive/harmful information, raising safety concerns. Existing unlearning approaches using coarse-grained loss combinations struggle to precisely separate knowledge and balance removal effectiveness with model utility.

Method: FALCON uses information-theoretic guidance for parameter selection, contrastive mechanisms for representation separation, and projects conflict gradients onto orthogonal subspaces to resolve forgetting-retention conflicts.

Result: Extensive experiments show FALCON achieves superior unlearning effectiveness while maintaining model utility, with robust resistance against knowledge recovery attempts.

Conclusion: FALCON provides an effective solution for machine unlearning that precisely removes sensitive information while preserving model performance, addressing key safety concerns in language model deployment.

Abstract: Large language models have been widely applied, but can inadvertently encode sensitive or harmful information, raising significant safety concerns. Machine unlearning has emerged to alleviate this concern; however, existing training-time unlearning approaches, relying on coarse-grained loss combinations, have limitations in precisely separating knowledge and balancing removal effectiveness with model utility. In contrast, we propose Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment (FALCON), a novel representation-guided unlearning approach that leverages information-theoretic guidance for efficient parameter selection, employs contrastive mechanisms to enhance representation separation, and projects conflict gradients onto orthogonal subspaces to resolve conflicts between forgetting and retention objectives. Extensive experiments demonstrate that FALCON achieves superior unlearning effectiveness while maintaining model utility, exhibiting robust resistance against knowledge recovery attempts.
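
The orthogonal-subspace step is reminiscent of PCGrad-style gradient surgery; here is a minimal sketch under that assumption (FALCON's actual projection may differ).

```python
import torch

def project_orthogonal(g_forget, g_retain):
    # When the forgetting and retention gradients conflict (negative dot
    # product), remove the conflicting component of the forgetting gradient.
    overlap = torch.dot(g_forget, g_retain)
    if overlap < 0:
        g_forget = g_forget - overlap / torch.dot(g_retain, g_retain) * g_retain
    return g_forget

g_f = torch.tensor([1.0, -2.0])
g_r = torch.tensor([1.0, 1.0])
print(project_orthogonal(g_f, g_r))  # orthogonal to g_r after projection
```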

[76] DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection

Yingli Shen, Wen Lai, Shuo Wang, Xueren Zhang, Kangyang Luo, Alexander Fraser, Maosong Sun

Main category: cs.CL

TL;DR: DCAD-2000 is a large-scale multilingual corpus covering 2,282 languages with 46.72TB of text, using anomaly detection for data cleaning to improve quality and downstream LLM performance.

DetailsMotivation: The rapid development of multilingual LLMs requires high-quality, diverse, and well-curated multilingual datasets, overcoming limitations of existing data cleaning approaches that rely on manual heuristic thresholds.

Method: Reframe data cleaning as an anomaly detection problem, creating DCAD-2000 from Common Crawl and existing multilingual sources, covering 2,282 languages with dynamic filtering to automatically identify and remove noisy content.

Result: Fine-tuning LLMs on DCAD-2000 demonstrates notable improvements in data quality, robustness of the cleaning pipeline, and downstream performance, particularly for low-resource languages across multiple multilingual benchmarks.

Conclusion: The anomaly detection approach to data cleaning substantially improves multilingual dataset quality and enhances LLM performance, especially benefiting low-resource languages.

Abstract: The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and well-curated multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus constructed from newly extracted Common Crawl data and existing multilingual sources. DCAD-2000 covers 2,282 languages, 46.72TB of text, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of existing data cleaning approaches, which rely on manually designed heuristic thresholds, we reframe data cleaning as an anomaly detection problem. This dynamic filtering paradigm substantially improves data quality by automatically identifying and removing noisy or anomalous content. By fine-tuning LLMs on DCAD-2000, we demonstrate notable improvements in data quality, robustness of the cleaning pipeline, and downstream performance, particularly for low-resource languages across multiple multilingual benchmarks.
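
A toy version of "data cleaning as anomaly detection" can be assembled from off-the-shelf outlier detection over simple surface features; the two features and the IsolationForest choice here are assumptions, not DCAD-2000's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

docs = [
    "A normal sentence about the weather.",
    "Another ordinary document with plain text.",
    "A short note on language models.",
    "The body of a routine news article.",
    "A plain descriptive paragraph.",
    "!!!###$$$@@@^^^&&&***",
]
# Per-document features: length and alphabetic-character ratio.
feats = np.array(
    [[len(d), sum(ch.isalpha() for ch in d) / max(len(d), 1)] for d in docs]
)
keep = IsolationForest(random_state=0).fit_predict(feats) == 1
print([d for d, k in zip(docs, keep) if k])  # noisy outlier dropped
```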

[77] Temporal Alignment of LLMs through Cycle Encoding for Long-Range Time Representations

Xue Han, Qian Hu, Yitong Wang, Wenchun Gao, Lianlian Zhang, Qing Wang, Lijun Mei, Chao Deng, Junlan Feng

Main category: cs.CL

TL;DR: Ticktack is a method that addresses temporal misalignment in LLMs by using sexagenary year expressions and polar coordinates for better time representation over long spans.

DetailsMotivation: LLMs suffer from temporal misalignment due to sparse temporal information in training data over long time periods, leading to insufficient learning or catastrophic forgetting.

Method: Uses sexagenary year expressions for uniform yearly distribution, polar coordinates to model 60-term cycles, temporal encoding, and temporal representational alignment for post-training.

Result: Experimental results show improved performance on time-related tasks, particularly over long periods, using a created long time span benchmark.

Conclusion: The Ticktack methodology effectively addresses LLM’s long-time span misalignment issues and improves temporal understanding in language models.

Abstract: Large language models (LLMs) suffer from temporal misalignment issues, especially across long spans of time. The issue arises because LLMs are trained on large amounts of data in which temporal information is sparse over long spans, such as thousands of years, resulting in insufficient learning or catastrophic forgetting. This paper proposes a methodology named “Ticktack” for addressing the LLM’s long-time span misalignment in a yearly setting. Specifically, we first propose to utilize the sexagenary year expression instead of the Gregorian year expression employed by LLMs, achieving a more uniform distribution in yearly granularity. Then, we employ polar coordinates to model the sexagenary cycle of 60 terms and the year order within each term, with additional temporal encoding to ensure LLMs understand them. Finally, we present a temporal representational alignment approach for post-training LLMs that effectively distinguishes time points with relevant knowledge, hence improving performance on time-related tasks, particularly over a long period. We also create a long time span benchmark for evaluation. Experimental results demonstrate the effectiveness of our proposal.
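
The sexagenary mapping itself is straightforward to sketch: take each year's position in the 60-term cycle and place it on the unit circle. Treating 1984 (a conventional jiazi cycle start) as the anchor is an assumption for illustration; the paper's full temporal encoding and alignment training are more involved.

```python
import math

def sexagenary_encoding(year, anchor=1984):
    pos = (year - anchor) % 60          # position within the 60-term cycle
    theta = 2 * math.pi * pos / 60      # cycle phase as a polar angle
    # Return the point on the unit circle plus the cycle index.
    return math.cos(theta), math.sin(theta), (year - anchor) // 60

print(sexagenary_encoding(2025))
```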

[78] Harnessing Test-time Adaptation for NLU tasks Involving Dialects of English

Duke Nguyen, Aditya Joshi, Flora Salim

Main category: cs.CL

TL;DR: SHOT, a test-time domain adaptation technique, is evaluated for dialectal NLP tasks, showing viability when labeled datasets are unavailable. The dialectal gap correlates with SHOT effectiveness, and finetuning on Standard American English often outperforms finetuning on dialectal data.

DetailsMotivation: Address domain adaptation challenges in dialectal NLP where models trained on Standard American English perform poorly on other English dialects (Indian, Singaporean, Nigerian) due to distribution shifts, especially given scarce dialectal datasets.

Method: Evaluate SHOT TTDA technique on dialectal GLUE datasets, finetune on different dialect combinations, and theoretically analyze dialectal gap correlation with SHOT effectiveness.

Result: SHOT is viable for dialectal NLP without labeled data. Dialectal gap positively correlates with SHOT effectiveness. Finetuning on SAE often yields higher performance than on dialectal data.

Conclusion: SHOT provides effective test-time domain adaptation for dialectal NLP, with dialectal gap serving as a useful predictor of adaptation success, and SAE finetuning remains surprisingly effective.

Abstract: Test-time domain adaptation (TTDA) helps generalize models across domains, tasks, and distributions without the use of labeled datasets. TTDA is therefore very useful in natural language processing (NLP) in the dialectal setting, since models are often trained on Standard American English (SAE) but evaluated on Indian English (IndE), Singaporean English (SingE), or Nigerian English (NgE), whose distributions differ significantly from it. This is especially useful since dialectal datasets are scarce. In this paper, we explore one of the most widely used TTDA techniques, SHOT, in dialectal NLP. We finetune and evaluate SHOT on different combinations of dialectal GLUE. Our findings show that SHOT is a viable technique when labeled datasets are unavailable. We also theoretically propose the concept of the dialectal gap and show that it has a positive correlation with the effectiveness of SHOT. We also find that in many cases, finetuning on SAE yields higher performance than finetuning on dialectal data.
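
SHOT's core objective (from the original SHOT work, not this paper) is information maximization on unlabeled target data: confident per-sample predictions with a diverse marginal, with the source classifier head kept frozen. A minimal sketch of that loss:

```python
import torch
import torch.nn.functional as F

def shot_im_loss(logits):
    p = F.softmax(logits, dim=1)
    # Per-sample entropy: minimizing it encourages confident predictions.
    ent = -(p * torch.log(p + 1e-8)).sum(1).mean()
    # Negative entropy of the marginal: minimizing it encourages diversity.
    mean_p = p.mean(0)
    div = (mean_p * torch.log(mean_p + 1e-8)).sum()
    return ent + div

logits = torch.randn(16, 3, requires_grad=True)
shot_im_loss(logits).backward()
```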

[79] VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun

Main category: cs.CL

TL;DR: VITA-Audio is an end-to-end large speech model that reduces first audio token latency in streaming by generating multiple audio tokens in a single forward pass using a lightweight MCTP module and progressive training strategy.

DetailsMotivation: Existing speech models have high latency when generating the first audio token during streaming, creating a deployment bottleneck for natural human-computer interaction.

Method: Proposes a lightweight Multiple Cross-modal Token Prediction (MCTP) module for generating multiple audio tokens in one forward pass, plus a four-stage progressive training strategy to maintain speech quality while accelerating inference.

Result: Achieves 3-5x inference speedup at 7B parameter scale and outperforms similar-sized open-source models on ASR, TTS, and SQA benchmarks.

Conclusion: VITA-Audio enables real-time conversational capabilities with minimal latency and is the first multi-modal large language model capable of generating audio output during the first forward pass.

Abstract: With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3-5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
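
One plausible reading of "multiple audio tokens in a single forward pass" is a set of lightweight prediction heads over the same backbone state; the sketch below is that assumption in miniature, with illustrative dimensions, not VITA-Audio's actual MCTP module.

```python
import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    def __init__(self, dim=512, vocab=1024, k=4):
        super().__init__()
        # k independent heads, each predicting one future audio token.
        self.heads = nn.ModuleList([nn.Linear(dim, vocab) for _ in range(k)])

    def forward(self, h):  # h: (batch, dim) backbone hidden state
        return [head(h) for head in self.heads]

h = torch.randn(2, 512)
print([t.shape for t in MultiTokenHeads()(h)])  # k logits per forward pass
```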

[80] From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora

Yingli Shen, Wen Lai, Shuo Wang, Ge Gao, Kangyang Luo, Alexander Fraser, Maosong Sun

Main category: cs.CL

TL;DR: This paper introduces TED2025, a large-scale multi-way parallel corpus spanning 113 languages based on TED Talks, and demonstrates that training LLMs on such aligned parallel data consistently outperforms using unaligned multilingual data across six benchmarks.

DetailsMotivation: Current multilingual LLM training uses unaligned data which limits cross-lingual semantic capture. Multi-way parallel data provides stronger cross-lingual consistency and greater potential for improving multilingual performance.

Method: Created TED2025 corpus with 113 languages (up to 50 languages aligned in parallel). Investigated best practices for leveraging multi-way parallel data through continued pretraining, instruction tuning, and analysis of key influencing factors.

Result: Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.

Conclusion: Multi-way parallel data is more effective than unaligned multilingual data for enhancing LLM multilingual capabilities, with the TED2025 corpus providing a valuable resource for this purpose.

Abstract: Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.

[81] Improving the fact-checking performance of language models by relying on their entailment ability

Gaurav Kumar, Debajyoti Mazumder, Ayush Garg, Jasabanta Patro

Main category: cs.CL

TL;DR: Proposes using LLM-generated entailed justifications to train encoder-only language models for fact-checking, achieving superior performance compared to existing methods.

DetailsMotivation: Existing automated fact-checking approaches have low accuracy for real-world deployment, despite various strategies like end-to-end training, retrieval-augmented generation, and prompt engineering.

Method: Train encoder-only language models using entailed justifications generated by large language models for fact-checking tasks.

Result: Demonstrated superiority over recent works and various prompting/fine-tuning strategies through rigorous experiments, quality analysis of explanations, ablation studies, and error analysis.

Conclusion: The proposed simple yet effective strategy of using LLM-generated entailed justifications to train encoder-only models significantly improves fact-checking performance.

Abstract: Automated fact-checking has been a challenging task for the research community. Past works tried various strategies, such as end-to-end training, retrieval-augmented generation, and prompt engineering, to build robust fact-checking systems. However, their accuracy has not been very high for real-world deployment. We, on the other hand, propose a simple yet effective strategy, where entailed justifications generated by LLMs are used to train encoder-only language models (ELMs) for fact-checking. We conducted a rigorous set of experiments, comparing our approach with recent works and various prompting and fine-tuning strategies to demonstrate the superiority of our approach. Additionally, we did quality analysis of model explanations, ablation studies, and error analysis to provide a comprehensive understanding of our approach.
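
As a stand-in for the proposed pipeline, one can train a small classifier on claims concatenated with LLM-generated entailed justifications; here a TF-IDF + logistic regression pipeline substitutes for the encoder-only LM, and the justifications are hand-written placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

claims = [
    "The Eiffel Tower is in Berlin.",
    "Water boils at 100C at sea level.",
]
justifications = [  # placeholders for LLM-generated entailed justifications
    "The evidence places the tower in Paris, contradicting the claim.",
    "Standard references confirm the boiling point, entailing the claim.",
]
labels = ["refuted", "supported"]

texts = [c + " [SEP] " + j for c, j in zip(claims, justifications)]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)
print(clf.predict([texts[0]]))
```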

[82] Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Ruihua Song

Main category: cs.CL

TL;DR: CoLaR is a framework that compresses reasoning processes in latent space through two-stage training, enabling efficient silent reasoning with dynamic speed adjustment.

DetailsMotivation: Chain-of-Thought reasoning in LLMs is computationally expensive and inefficient due to token-level processing. CoLaR aims to compress reasoning in latent space to reduce computational costs while maintaining performance.

Method: Two-stage approach: 1) Supervised fine-tuning with auxiliary next compressed embedding prediction using random compression factors, 2) Reinforcement learning to explore diverse reasoning paths and exploit compact ones using the latent head’s non-deterministic nature.

Result: Achieves 14.1% higher accuracy than latent-based baselines at comparable compression ratios, reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to explicit CoT. RL-enhanced version gains up to 5.4% performance while reducing latent reasoning chain length by 82.8%.

Conclusion: CoLaR enables efficient latent-level reasoning with dynamic speed control, significantly reducing computational overhead while maintaining or improving performance on mathematical reasoning tasks.

Abstract: Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head’s non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to the explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
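
The compression step can be sketched as merging each group of c consecutive token embeddings, with c drawn at random per batch; averaging is an assumption here, since the abstract only says consecutive embeddings are merged.

```python
import torch

def compress_embeddings(emb, c):
    # emb: (seq_len, dim); zero-pad so seq_len divides evenly by c.
    seq_len, dim = emb.shape
    pad = (-seq_len) % c
    if pad:
        emb = torch.cat([emb, emb.new_zeros(pad, dim)])
    # Merge every c consecutive embeddings by averaging.
    return emb.view(-1, c, dim).mean(dim=1)

emb = torch.randn(10, 8)
c = int(torch.randint(2, 5, (1,)))  # randomly sampled compression factor
print(compress_embeddings(emb, c).shape)
```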

[83] TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

Yanshu Li, Jianjiang Yang, Tian Yun, Pinyuan Feng, Jinfa Huang, Ruixiang Tang

Main category: cs.CL

TL;DR: TACO is a lightweight transformer model that improves multimodal in-context learning by dynamically configuring ICL sequences using task-aware attention and task-mapping signals.

DetailsMotivation: Multimodal ICL effectiveness is highly sensitive to input sequence quality, and there's limited understanding of how LVLMs actually exploit these sequences during inference, particularly for complex reasoning tasks.

Method: Systematically interpret multimodal ICL through task mapping, then develop TACO - a transformer model with task-aware attention that dynamically configures ICL sequences by injecting task-mapping signals into autoregressive decoding.

Result: Experiments on five LVLMs and nine datasets show TACO consistently surpasses baselines across diverse ICL tasks.

Conclusion: Task mapping provides a novel and valuable perspective for interpreting and improving multimodal in-context learning.

Abstract: Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a novel and valuable perspective for interpreting and improving multimodal ICL.

[84] DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding

Yue Jiang, Jichu Li, Yang Liu, Dingkang Yang, Feng Zhou, Quyu Kong

Main category: cs.CL

TL;DR: DanmakuTPPBench is a comprehensive benchmark for multi-modal Temporal Point Process modeling, featuring a novel dataset from Bilibili bullet comments and a challenging QA dataset for temporal-textual-visual reasoning.

DetailsMotivation: Existing TPP datasets are predominantly unimodal, limiting progress in models that require joint reasoning over temporal, textual, and visual information. This gap hinders the development of multi-modal event sequence modeling.

Method: The benchmark consists of two components: (1) DanmakuTPP-Events dataset derived from Bilibili video platform with timestamped bullet comments and video frames, (2) DanmakuTPP-QA dataset constructed via multi-agent pipeline using LLMs and MLLMs for complex reasoning tasks.

Result: Extensive evaluations reveal significant performance gaps and limitations in current methods’ ability to model multi-modal event dynamics, establishing strong baselines for future research.

Conclusion: The benchmark calls for further integration of TPP modeling into the multi-modal language modeling landscape and highlights the need for improved methods in multi-modal temporal event sequence analysis.

Abstract: We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods’ ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. Project page: https://github.com/FRENKIE-CHIANG/DanmakuTPPBench
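
For readers new to TPPs, a classical baseline of the kind such benchmarks evaluate is the exponential Hawkes process; this background sketch is not from the paper.

```python
import math

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    # lambda(t) = mu + alpha * sum over past events of exp(-beta * (t - t_i)):
    # each past event briefly raises the rate of future events.
    return mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in history if ti < t)

print(hawkes_intensity(5.0, [1.0, 3.5, 4.2]))
```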

[85] LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

Yang Xiao, Jiashuo Wang, Ruifeng Yuan, Chunpu Xu, Kaishuai Xu, Wenjie Li, Pengfei Liu

Main category: cs.CL

TL;DR: PIR is a framework that identifies and prunes low-importance functional elements from reasoning chains while preserving essential progressive reasoning, enabling more efficient LLM inference with improved accuracy and reduced token usage.

DetailsMotivation: Current reasoning chains from LLMs contain verbose functional elements (verification, alternatives, error corrections) that increase computational demands during inference, creating efficiency bottlenecks.

Method: PIR uses perplexity-based importance scoring to quantitatively evaluate each reasoning step’s impact on answer prediction confidence, then selectively prunes only low-importance functional steps while keeping progressive reasoning intact.

Result: Models fine-tuned on PIR-optimized data achieved +0.9% to +6.6% accuracy improvements with 3% to 41% token reduction across AIME, AMC, and GPQA Diamond benchmarks, while maintaining strong generalizability across model sizes and data sources.

Conclusion: PIR provides a practical solution for deploying reasoning-capable LLMs in efficiency-constrained scenarios by optimizing reasoning chains for both performance and computational efficiency.

Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational cost are critical constraints.
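
The importance score can be sketched as the drop in answer log-likelihood when a single step is removed; `answer_logprob` is a hypothetical helper wrapping any causal-LM scoring call, and the toy scorer exists only to make the sketch runnable.

```python
def step_importance(steps, answer, answer_logprob):
    # Score each step by how much removing it lowers answer confidence;
    # low-scoring (functional) steps are candidates for pruning.
    full = answer_logprob(" ".join(steps), answer)
    return [
        full - answer_logprob(" ".join(steps[:i] + steps[i + 1:]), answer)
        for i in range(len(steps))
    ]

# Toy scorer: answer "likelihood" = count of answer words present in context.
toy = lambda ctx, ans: sum(w in ctx.split() for w in ans.split())
print(step_importance(["x = 2", "check: 2 * 2 = 4", "so x = 2"], "x = 2", toy))
```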

[86] Explaining Large Language Models with gSMILE

Zeinab Dehghani, Mohammed Naveed Akram, Koorosh Aslansefat, Adil Khan, Yiannis Papadopoulos

Main category: cs.CL

TL;DR: gSMILE is a model-agnostic framework for token-level interpretability in LLMs that uses controlled prompt perturbations and generates intuitive heatmaps to identify influential tokens.

DetailsMotivation: LLMs achieve remarkable text generation performance but remain opaque in decision-making processes, limiting trust and accountability in high-stakes applications.

Method: Extends SMILE methodology using controlled prompt perturbations, Wasserstein distance metrics, and weighted linear surrogates to identify input tokens with significant impact on output.

Result: gSMILE delivers reliable human-aligned attributions across leading LLMs, with Claude 2.1 excelling in attention fidelity and GPT-3.5 achieving highest output consistency.

Conclusion: gSMILE balances model performance and interpretability, enabling more transparent and trustworthy AI systems.

Abstract: Large Language Models (LLMs) such as GPT, LLaMA, and Claude achieve remarkable performance in text generation but remain opaque in their decision-making processes, limiting trust and accountability in high-stakes applications. We present gSMILE (generative SMILE), a model-agnostic, perturbation-based framework for token-level interpretability in LLMs. Extending the SMILE methodology, gSMILE uses controlled prompt perturbations, Wasserstein distance metrics, and weighted linear surrogates to identify input tokens with the most significant impact on the output. This process enables the generation of intuitive heatmaps that visually highlight influential tokens and reasoning paths. We evaluate gSMILE across leading LLMs (OpenAI’s gpt-3.5-turbo-instruct, Meta’s LLaMA 3.1 Instruct Turbo, and Anthropic’s Claude 2.1) using attribution fidelity, attribution consistency, attribution stability, attribution faithfulness, and attribution accuracy as metrics. Results show that gSMILE delivers reliable human-aligned attributions, with Claude 2.1 excelling in attention fidelity and GPT-3.5 achieving the highest output consistency. These findings demonstrate gSMILE’s ability to balance model performance and interpretability, enabling more transparent and trustworthy AI systems.
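
A stripped-down version of the perturb-and-fit-surrogate loop, with a toy output-distance function standing in for gSMILE's Wasserstein-based comparison:

```python
import numpy as np
from sklearn.linear_model import Ridge

tokens = ["The", "movie", "was", "wonderful"]
rng = np.random.default_rng(0)

def output_change(mask):
    # Toy stand-in for comparing the LLM's perturbed output to the original;
    # here we pretend the last token alone drives the response.
    return float(mask[3])

masks = rng.integers(0, 2, size=(64, len(tokens)))     # random token keeps
ys = np.array([output_change(m) for m in masks])
weights = np.exp(-np.abs(masks.sum(1) - len(tokens)))  # favor small edits
surrogate = Ridge().fit(masks, ys, sample_weight=weights)
print(dict(zip(tokens, surrogate.coef_.round(2))))     # token attributions
```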

[87] CoDial: Interpretable Task-Oriented Dialogue Systems Through Dialogue Flow Alignment

Radin Shayanfar, Chu Fei Luo, Rohan Bhambhoria, Samuel Dahan, Xiaodan Zhu

Main category: cs.CL

TL;DR: CoDial converts task schemas to programmatic LLM guardrail code for interpretable dialogue policy alignment, achieving SOTA performance on STAR dataset and competitive results on MultiWOZ.

DetailsMotivation: Existing data-driven TOD systems struggle with generalization to unseen tasks, and schema-based approaches lack interpretability despite better generalization.

Method: Converts TOD task schemas (structured heterogeneous graphs) to programmatic LLM guardrail code using CoDial_free and CoDial_structured paradigms, with iterative feedback mechanism for improvement.

Result: Achieves state-of-the-art performance on STAR dataset and competitive results on MultiWOZ dataset, while providing interpretability.

Conclusion: CoDial enables interpretable and efficient alignment of dialogue policies, making it practical for expert-guided LLM alignment in high-stakes domains.

Abstract: Building Task-Oriented Dialogue (TOD) systems that generalize across different tasks remains a challenging problem. Data-driven approaches often struggle to transfer effectively to unseen tasks. While recent schema-based TOD frameworks improve generalization by decoupling task logic from language understanding, their reliance on neural or generative models often obscures how task schemas influence behaviour and hence impairs interpretability. In this work, we introduce a novel framework, CoDial (Code for Dialogue), which converts a TOD task schema, represented as a novel structured heterogeneous graph, to programmatic LLM guardrailing code, such as NVIDIA’s Colang, enabling interpretable and efficient alignment of dialogue policies during inference. We introduce two paradigms, CoDial_free and CoDial_structured, for generating LLM guardrails, and propose a feedback mechanism that integrates human feedback to iteratively improve the generated code. Empirically, CoDial achieves state-of-the-art (SOTA) performance on the widely used STAR dataset and is on par with SOTA on the MultiWOZ dataset, while also providing interpretability. We additionally demonstrate CoDial’s iterative improvement via manual and LLM-aided feedback, making it a practical tool for expert-guided alignment of LLMs in high-stakes domains.

[88] EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

Shihan Dou, Ming Zhang, Chenhao Huang, Jiayi Chen, Feng Chen, Shichun Liu, Yan Liu, Chenxiao Liu, Cheng Zhong, Zongzhang Zhang, Tao Gui, Chao Xin, Chengzhi Wei, Lin Yan, Yonghui Wu, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: EvaLearn is a benchmark for evaluating LLMs’ learning capability through sequential problem-solving across 648 challenging tasks, revealing that models with strong static abilities don’t necessarily excel at learning.

DetailsMotivation: To address the underexplored aspect of LLM learning capability and efficiency in challenging tasks, which is critical for understanding model potential and the gap between models and human capabilities.

Method: Created EvaLearn benchmark with 648 problems across 6 task types organized into 182 sequences, requiring models to solve problems sequentially and leverage previous experience. Used 5 automated metrics to evaluate learning capability and efficiency.

Result: Varied performance profiles among 9 frontier models - some, like Claude-3.7-sonnet, showed strong learning ability despite moderate initial performance, while others struggled to benefit from experience and even showed negative transfer. Instance-level rubrics and teacher-model feedback further facilitated learning.

Conclusion: EvaLearn evaluates a new dimension of LLM performance beyond static abilities, providing a novel perspective for assessing model potential and promoting development of deeper, more dynamic evaluation approaches.

Abstract: We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while some models struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available at the GitHub repository.
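
As a flavor of what a sequential-learning metric can look like, here is a hypothetical sliding-window accuracy whose trend shows whether a model benefits from accumulated experience; this is an illustration, not one of EvaLearn's five metrics.

```python
def windowed_accuracy(results, w=3):
    # results: 1/0 outcomes over a problem sequence, in order of solving.
    return [sum(results[i:i + w]) / w for i in range(len(results) - w + 1)]

print(windowed_accuracy([0, 0, 1, 1, 1, 1]))  # a rising curve suggests learning
```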

[89] Counterfactual reasoning: an analysis of in-context emergence

Moritz Miller, Bernhard Schölkopf, Siyuan Guo

Main category: cs.CL

TL;DR: Language models can perform counterfactual reasoning by inferring latent concepts and copying contextual noise from factual observations.

DetailsMotivation: To study in-context counterfactual reasoning in language models - the ability to predict consequences of hypothetical scenarios.

Method: Used synthetic linear regression tasks requiring noise abduction, analyzed how self-attention, model depth, and pre-training data diversity drive performance in Transformers, and examined mechanistic representations in residual streams.

Result: Language models are capable of counterfactual reasoning; latent concepts are linearly represented in residual streams; designated ’noise abduction heads’ are central to the process; findings extend to sequential data.

Conclusion: Transformers can perform noise abduction on sequential data, providing preliminary evidence for potential counterfactual story generation.

Abstract: Large-scale neural language models exhibit remarkable performance in in-context learning: the ability to learn and reason about the input context on the fly. This work studies in-context counterfactual reasoning in language models, that is, the ability to predict consequences of a hypothetical scenario. We focus on a well-defined, synthetic linear regression task that requires noise abduction. Accurate prediction is based on (1) inferring an unobserved latent concept and (2) copying contextual noise from factual observations. We show that language models are capable of counterfactual reasoning. Further, we enhance existing identifiability results and reduce counterfactual reasoning for a broad class of functions to a transformation on in-context observations. In Transformers, we find that self-attention, model depth and pre-training data diversity drive performance. Moreover, we provide mechanistic evidence that the latent concept is linearly represented in the residual stream and we introduce designated “noise abduction heads” central to performing counterfactual reasoning. Lastly, our findings extend to counterfactual reasoning under SDE dynamics and reflect that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available under https://github.com/mrtzmllr/iccr.
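
The noise-abduction recipe is easy to make concrete in the linear setting the paper studies: recover the noise from the factual observation, then reuse it under the hypothetical input (the values below are arbitrary).

```python
w = 2.0                              # known linear weight
x_factual, y_factual = 3.0, 6.5      # factual observation
eps = y_factual - w * x_factual      # abduction: recover the noise
x_counterfactual = 5.0               # hypothetical intervention
print(w * x_counterfactual + eps)    # counterfactual prediction: 10.5
```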

[90] C-SEO Bench: Does Conversational SEO Work?

Haritz Puerto, Martin Gubri, Tommaso Green, Seong Joon Oh, Sangdoo Yun

Main category: cs.CL

TL;DR: C-SEO Bench is the first benchmark for evaluating Conversational Search Engine Optimization methods across multiple tasks, domains, and adoption scenarios, revealing that most current C-SEO methods are ineffective or harmful while traditional SEO strategies work better.

DetailsMotivation: The shift from traditional SEO to C-SEO due to LLM-powered conversational search engines requires systematic evaluation of C-SEO methods across diverse domains and competitive scenarios, which current limited testing doesn't address.

Method: Created C-SEO Bench benchmark with two search tasks (question answering and product recommendation) across three domains each, and introduced a new evaluation protocol with varying adoption rates among competing actors.

Result: Most current C-SEO methods are largely ineffective and often negatively impact document ranking, while traditional SEO strategies are significantly more effective. Gains decrease as more actors adopt C-SEO, showing a congested zero-sum problem.

Conclusion: Traditional SEO approaches outperform current C-SEO methods, and the competitive adoption of C-SEO techniques creates diminishing returns, highlighting the need for more robust C-SEO evaluation frameworks.

Abstract: Large Language Models (LLMs) are transforming search engines into Conversational Search Engines (CSE). Consequently, Search Engine Optimization (SEO) is being shifted into Conversational Search Engine Optimization (C-SEO). We are beginning to see dedicated C-SEO methods for modifying web documents to increase their visibility in CSE responses. However, they are often tested only for a limited breadth of application domains; we do not know whether certain C-SEO methods would be effective for a broad range of domains. Moreover, existing evaluations consider only a single-actor scenario where only one web document adopts a C-SEO method; in reality, multiple players are likely to competitively adopt cutting-edge C-SEO techniques, drawing an analogy from the dynamics we have seen in SEO. We present C-SEO Bench, the first benchmark designed to evaluate C-SEO methods across multiple tasks, domains, and numbers of actors. We consider two search tasks, question answering and product recommendation, with three domains each. We also formalize a new evaluation protocol with varying adoption rates among involved actors. Our experiments reveal that most current C-SEO methods are not only largely ineffective but also frequently have a negative impact on document ranking, which is opposite to what is expected. Instead, traditional SEO strategies, those aiming to improve the ranking of the source in the LLM context, are significantly more effective. We also observe that as we increase the number of C-SEO adopters, the overall gains decrease, revealing the congested, zero-sum nature of the problem. Our code and data are available at https://github.com/parameterlab/c-seo-bench and https://huggingface.co/datasets/parameterlab/c-seo-bench.

[91] Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models

Ruixuan Deng, Xiaoyang Hu, Miles Gilberti, Shane Storks, Aman Taxali, Mike Angstadt, Chandra Sripada, Joyce Chai

Main category: cs.CL

TL;DR: The paper identifies semantically coherent network components in LLMs using coactivation of sparse autoencoder features from few prompts, showing that ablating or amplifying these components predictably changes model outputs and enables counterfactual responses.

DetailsMotivation: To understand the modular organization of knowledge in large language models and develop efficient methods for targeted manipulation of model behavior.

Method: Use coactivation of sparse autoencoder (SAE) features collected from a handful of prompts to identify network components, then perform ablation and amplification experiments on concept and relation components.

Result: Ablating components changes model outputs predictably, amplifying induces counterfactual responses, composing relation and concept components yields compound counterfactual outputs, concept components emerge early while relation components concentrate in later layers.

Conclusion: LLMs have modular organization of knowledge accessed through compositional operations, and the method enables efficient targeted manipulation while comprehensively capturing concepts and relations.

Abstract: We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. Focusing on concept-relation prediction tasks, we show that ablating these components for concepts (e.g., countries and words) and relations (e.g., capital city and translation language) changes model outputs in predictable ways, while amplifying these components induces counterfactual responses. Notably, composing relation and concept components yields compound counterfactual outputs. Further analysis reveals that while most concept components emerge from the very first layer, more abstract relation components are concentrated in later layers. Lastly, we show that extracted components more comprehensively capture concepts and relations than individual features while maintaining specificity. Overall, our findings suggest a modular organization of knowledge accessed through compositional operations, and advance methods for efficient, targeted LLM manipulation.
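
As a rough illustration of the coactivation idea (the toy activation matrix and the intervention functions below are ours, not the paper's), a component can be read off as the set of SAE features that fire together across a handful of prompts about the same concept, and then zeroed or scaled at inference time:

```python
import numpy as np

# Toy stand-in: rows = prompts about one concept, cols = SAE features (1 = fired)
acts = np.array([
    [1, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0],
])

# A "component" = features that coactivate across all prompts for the concept
component = np.flatnonzero(acts.all(axis=0))        # -> features 0 and 2

def ablate(sae_features, component):
    """Zero a component's features before decoding back to the residual stream."""
    out = sae_features.astype(float).copy()
    out[..., component] = 0.0
    return out

def amplify(sae_features, component, scale=5.0):
    """Scale a component's features to induce counterfactual outputs."""
    out = sae_features.astype(float).copy()
    out[..., component] *= scale
    return out

print(component, ablate(acts, component))
```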

[92] The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure

Niyati Bafna, Tianjian Li, Kenton Murray, David R. Mortensen, David Yarowsky, Hale Sirin, Daniel Khashabi

Main category: cs.CL

TL;DR: LLMs use an implicit task-solving→translation pipeline where task-solving succeeds but translation fails, especially for low-resource languages, explaining poor multilingual generation quality.

DetailsMotivation: To understand why multilingual generation with LLMs is poor for mid- to low-resource languages, and identify the specific failure points in the generation process.

Method: Demonstrated existence of implicit task-solving→translation pipeline, formalized translation barrier hypothesis, and quantified failure contributions across 108 language pairs for word translation task.

Result: Translation barrier explains dominant portion of errors for majority of language pairs, especially severe for low-resource target languages.

Conclusion: Translation stage failure is a major bottleneck for multilingual generation in LLMs, highlighting need for improved translation capabilities in future multilingual model development.

Abstract: Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages, but the causes for this are not well-understood. We first demonstrate the existence of an implicit task-solving→translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage, despite task-solving success, is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We quantify the extent to which either stage in the pipeline is responsible for final failure for a word translation task across 108 language pairs, and find that the translation barrier explains a dominant portion of error for a majority of language pairs, and is especially severe for low-resource target languages. Our results highlight an important bottleneck for end-to-end multilingual generation, relevant for future work seeking to improve multilinguality in LLMs.
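
The stage-wise attribution described above suggests a simple per-item decision rule. A minimal sketch follows (the interface is an assumption; the paper's actual measurement across 108 language pairs is more involved):

```python
def attribute_failure(pred, gold_forms_any_lang, gold_form_target):
    """Attribute a word-translation error to the task-solving or translation
    stage: right concept in some language but wrong target-language form
    counts as a translation failure."""
    if pred == gold_form_target:
        return "correct"
    if pred in gold_forms_any_lang:
        return "translation_failure"   # concept solved, rendering failed
    return "task_failure"              # wrong concept altogether

# model answers in German when French was requested
print(attribute_failure("Hund", {"Hund", "dog", "chien"}, "chien"))
```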

[93] Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

Main category: cs.CL

TL;DR: The paper introduces Efficiency Leverage (EL) metric to predict MoE model capacity, discovers power law relationships between EL and key architectural parameters, and validates scaling laws through empirical testing.

DetailsMotivation: Mixture-of-Experts architectures decouple parameters from computational cost, but predicting model capacity for different MoE configurations remains an unresolved challenge, creating a gap in principled MoE scaling.

Method: Conducted large-scale empirical study training 300+ models up to 28B parameters, analyzed relationship between MoE configurations and EL, derived unified scaling laws, and validated with pilot model Ling-mini-beta.

Result: Found EL primarily driven by expert activation ratio and compute budget following power laws, with expert granularity as non-linear modulator. Ling-mini-beta (0.85B active params) matched 6.1B dense model performance with 7x less compute.

Conclusion: Provides principled, empirically-grounded foundation for efficient MoE model scaling through validated scaling laws that accurately predict computational advantages of MoE architectures.

Abstract: Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for the Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
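
Since EL is reported to follow power laws in the activation ratio and compute budget, fitting such a law reduces to linear regression in log space. A sketch with made-up measurements (the numbers and the exact functional form are illustrative; the paper's unified law also models granularity as a non-linear modulator):

```python
import numpy as np

# Illustrative (fabricated) measurements: activation ratio a, compute C, observed EL
a = np.array([0.05, 0.10, 0.20, 0.40])
C = np.array([1e20, 1e20, 1e21, 1e21])
EL = np.array([9.1, 5.2, 3.4, 1.9])

# EL ≈ k * a^alpha * C^beta becomes linear after taking logs
X = np.column_stack([np.ones_like(a), np.log(a), np.log(C)])
(log_k, alpha, beta), *_ = np.linalg.lstsq(X, np.log(EL), rcond=None)
print(f"EL ≈ {np.exp(log_k):.3g} * a^{alpha:.2f} * C^{beta:.2f}")
```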

[94] Uncertainty Quantification for Evaluating Machine Translation Bias

Ieva Raminta Staliūnaitė, Julius Cheng, Andreas Vlachos

Main category: cs.CL

TL;DR: Using semantic uncertainty to measure gender bias in machine translation systems, particularly when translating ambiguous source sentences where gender must be inferred from context.

DetailsMotivation: Current methods use gender accuracy to measure bias but cannot handle ambiguous cases where multiple translations could be correct. Models should maintain uncertainty when inputs are ambiguous rather than always providing confident translations.

Method: Using semantic uncertainty to assess bias in both ambiguous and unambiguous source sentences, comparing how models handle cases where gender must be inferred from context versus when it’s explicitly marked.

Result: High translation accuracy does not correlate with appropriate uncertainty handling, and debiasing techniques affect ambiguous and unambiguous cases differently.

Conclusion: Semantic uncertainty provides a more comprehensive way to measure gender bias in MT systems, revealing limitations of accuracy-based metrics and showing that debiasing approaches have varying effects depending on input ambiguity.

Abstract: The predictive uncertainty of machine translation (MT) models is typically used as a quality estimation proxy. In this work, we posit that apart from confidently translating when a single correct translation exists, models should also maintain uncertainty when the input is ambiguous. We use uncertainty to measure gender bias in MT systems. When the source sentence includes a lexeme whose gender is not overtly marked, but whose target-language equivalent requires gender specification, the model must infer the appropriate gender from the context and can be susceptible to biases. Prior work measured bias via gender accuracy; however, that approach cannot be applied to ambiguous cases. Using semantic uncertainty, we are able to assess bias when translating both ambiguous and unambiguous source sentences, and find that high translation accuracy does not correlate with exhibiting uncertainty appropriately, and that debiasing affects the two cases differently.
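
One way to operationalize semantic uncertainty here is entropy over meaning clusters of sampled translations rather than over surface strings. A minimal sketch (the clustering function is the hard part in practice; the lexical lambda below is only a stand-in):

```python
import math
from collections import Counter

def semantic_entropy(samples, meaning_of):
    """Entropy over meaning clusters (e.g., the gender a translation assigns)."""
    counts = Counter(meaning_of(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n, 2) for c in counts.values())

# For an ambiguous source, a well-calibrated model should spread mass over genders
samples = ["die Ärztin", "der Arzt", "die Ärztin", "der Arzt"]
print(semantic_entropy(samples, lambda s: "f" if "Ärztin" in s else "m"))  # 1.0 bit
```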

[95] Ontology-Enhanced Knowledge Graph Completion using Large Language Models

Wenbin Guo, Xin Wang, Jiaoyan Chen, Zhao Li, Zirui Chen

Main category: cs.CL

TL;DR: OL-KGC is an ontology-enhanced knowledge graph completion method that combines neural-perceptual structural information with ontological knowledge using LLMs, achieving state-of-the-art performance on multiple benchmarks.

DetailsMotivation: Current LLM-based KGC methods rely on implicit knowledge representation with parallel propagation of erroneous knowledge, hindering conclusive and decisive reasoning outcomes. The goal is to integrate neural-perceptual structural information with ontological knowledge for deeper understanding of intrinsic logic.

Method: OL-KGC first leverages neural perceptual mechanisms to embed structural information into textual space, then uses automated extraction algorithm to retrieve ontological knowledge from KGs and transform it into LLM-comprehensible textual format for logic guidance.

Result: Extensive experiments on FB15K-237, UMLS and WN18RR benchmarks show OL-KGC significantly outperforms existing mainstream KGC methods across multiple evaluation metrics, achieving state-of-the-art performance.

Conclusion: The proposed OL-KGC method successfully integrates structural and ontological knowledge with LLMs, demonstrating superior performance in knowledge graph completion tasks.

Abstract: Large Language Models (LLMs) have been extensively adopted in Knowledge Graph Completion (KGC), showcasing significant research advancements. However, as black-box models driven by deep neural architectures, current LLM-based KGC methods rely on implicit knowledge representation with parallel propagation of erroneous knowledge, thereby hindering their ability to produce conclusive and decisive reasoning outcomes. We aim to integrate neural-perceptual structural information with ontological knowledge, leveraging the powerful capabilities of LLMs to achieve a deeper understanding of the intrinsic logic of the knowledge. We propose an ontology-enhanced KGC method using LLMs – OL-KGC. It first leverages neural perceptual mechanisms to effectively embed structural information into the textual space, and then uses an automated extraction algorithm to retrieve ontological knowledge from the knowledge graphs (KGs) that need to be completed, which is further transformed into a textual format comprehensible to LLMs for providing logic guidance. We conducted extensive experiments on three widely-used benchmarks – FB15K-237, UMLS and WN18RR. The experimental results demonstrate that OL-KGC significantly outperforms existing mainstream KGC methods across multiple evaluation metrics, achieving state-of-the-art performance.

[96] Discovering Properties of Inflectional Morphology in Neural Emergent Communication

Miles Gilberti, Shane Storks, Huteng Dai

Main category: cs.CL

TL;DR: The paper reinterprets emergent communication by imposing small-vocabulary constraints to simulate double articulation and creates a setting analogous to natural inflectional morphology, developing new metrics to study concatenativity and fusion.

DetailsMotivation: Current emergent communication research focuses on subfield-specific goals that prioritize communication schemes with unique characters and syntactic composition, lacking comparison to natural language communication schemes like inflectional morphology.

Method: Reinterpreted attribute-value reconstruction game with small-vocabulary constraints to simulate double articulation, formulated novel setting analogous to naturalistic inflectional morphology, developed new metrics and explored variations motivated by concatenativity and fusion properties.

Result: Simulated phonological constraints encourage concatenative morphology, and emergent languages replicate natural languages’ tendency to fuse grammatical attributes.

Conclusion: The study demonstrates that emergent communication can replicate key properties of natural language morphology, particularly concatenativity and fusion, when appropriate constraints are applied.

Abstract: Emergent communication (EmCom) with deep neural network-based agents promises to yield insights into the nature of human language, but remains focused primarily on a few subfield-specific goals and metrics that prioritize communication schemes which represent attributes with unique characters one-to-one and compose them syntactically. We thus reinterpret a common EmCom setting, the attribute-value reconstruction game, by imposing a small-vocabulary constraint to simulate double articulation, and formulating a novel setting analogous to naturalistic inflectional morphology (enabling meaningful comparison to natural language communication schemes). We develop new metrics and explore variations of this game motivated by real properties of inflectional morphology: concatenativity and fusion. Through our experiments, we discover that simulated phonological constraints encourage concatenative morphology, and emergent languages replicate the tendency of natural languages to fuse grammatical attributes.

[97] Can we Evaluate RAGs with Synthetic Data?

Jonas van Elburg, Peter van der Putten, Maarten Marx

Main category: cs.CL

TL;DR: Synthetic QA data from LLMs can reliably rank RAG systems by retriever configuration but not by generator architecture, due to task mismatch and stylistic bias.

DetailsMotivation: To determine if synthetic benchmarks from LLMs can substitute for human-labeled benchmarks when the latter are unavailable.

Method: Two experiments: varying retriever parameters with fixed generator, and varying generator with fixed retriever parameters, tested across four datasets (two open-domain, two proprietary).

Result: Synthetic benchmarks reliably rank RAGs by retriever configuration, aligning with human benchmarks, but fail to consistently rank RAGs by generator architecture.

Conclusion: Synthetic benchmarks are effective for evaluating retriever variations but unreliable for comparing generator architectures due to task mismatch and stylistic bias.

Abstract: We investigate whether synthetic question-answer (QA) data generated by large language models (LLMs) can serve as an effective proxy for human-labeled benchmarks when the latter are unavailable. We assess the reliability of synthetic benchmarks across two experiments: one varying retriever parameters while keeping the generator fixed, and another varying the generator with fixed retriever parameters. Across four datasets, two open-domain and two proprietary, we find that synthetic benchmarks reliably rank the RAGs varying in terms of retriever configuration, aligning well with human-labeled benchmark baselines. However, they do not consistently produce reliable RAG rankings when comparing generator architectures. The breakdown possibly arises from a combination of task mismatch between the synthetic and human benchmarks, and stylistic bias favoring certain generators.
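
Ranking reliability of this kind is typically checked with a rank-correlation statistic. A sketch of the comparison (the scores are hypothetical, and whether the paper uses Kendall's tau specifically is not stated here):

```python
from scipy.stats import kendalltau

# Hypothetical quality scores for five retriever configurations
synthetic_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
human_scores     = [0.58, 0.69, 0.51, 0.77, 0.70]

tau, p = kendalltau(synthetic_scores, human_scores)
print(f"rank agreement: tau={tau:.2f} (p={p:.3f})")  # high tau = reliable ranking
```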

[98] Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection

Ankan Mullick, Saransh Sharma, Abhik Jana, Pawan Goyal

Main category: cs.CL

TL;DR: Text-only LLM Mistral-7B outperforms multimodal models on intent detection due to strong textual bias in datasets. After debiasing, performance drops significantly, revealing modality bias challenges.

DetailsMotivation: To investigate the effectiveness of LLMs and non-LLMs in multimodal intent detection and address modality bias issues in existing datasets.

Method: Comparative analysis of text-only and multimodal models on MIntRec datasets, human evaluation for modality bias confirmation, and dataset debiasing framework to remove text-biased samples.

Result: Mistral-7B outperforms multimodal models by 9% and 4% on biased datasets, but after debiasing, performance drops significantly (50-60% for smaller models) as over 70% and 50% of samples are removed from MIntRec-1 and MIntRec2.0 respectively.

Conclusion: Multimodal intent datasets suffer from significant modality bias favoring text, which undermines proper evaluation of multimodal models. Unbiased datasets are essential for effective multimodal model assessment.

Abstract: The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We further confirm the modality bias of these datasets via human evaluation. Next, we propose a framework to debias the datasets; upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 are removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected, with accuracy drops of 50-60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
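
One plausible reading of the debiasing step (the interfaces below are assumptions for illustration, not the paper's framework) is a filter that drops every sample a text-only model already classifies correctly, so the remainder genuinely requires the other modalities:

```python
def debias(dataset, text_only_model):
    """Keep only samples whose label is NOT recoverable from text alone.
    `dataset` yields dicts with 'text' and 'label'; `text_only_model` is any
    text classifier callable (both are hypothetical interfaces)."""
    return [ex for ex in dataset if text_only_model(ex["text"]) != ex["label"]]
```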

[99] Can Large Language Models Master Complex Card Games?

Wei Wang, Fuqing Bie, Junzhe Chen, Dan Zhang, Shiyu Huang, Evgeny Kharlamov, Jie Tang

Main category: cs.CL

TL;DR: LLMs can master complex card games through fine-tuning, achieving performance comparable to strong game AIs while maintaining some general capabilities.

DetailsMotivation: To explore whether large language models (LLMs) can achieve similar success in complex games as specialized AI systems like AlphaGo and AlphaZero, particularly in the domain of card games.

Method: Systematically assessed LLMs across eight diverse card games using supervised fine-tuning on high-quality gameplay data, evaluating performance and general capability retention.

Result: LLMs approached strong game AI performance through fine-tuning, achieved proficiency in multiple games simultaneously with performance synergies for similar games, but experienced some decline in general capabilities that could be mitigated with general instruction data.

Conclusion: LLMs demonstrate strong learning ability and versatility in mastering complex card games, showing potential for game-playing applications while highlighting the importance of balancing specialized training with general capability preservation.

Abstract: Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models’ ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can achieve a certain level of proficiency in multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate the strong learning ability and versatility of LLMs. The code is available at https://github.com/THUDM/LLM4CardGame.

[100] Understanding Reinforcement Learning for Model Training, and future directions with GRAPE

Rohit Patel

Main category: cs.CL

TL;DR: A comprehensive, beginner-friendly guide to instruction tuning algorithms (SFT, Rejection Sampling, REINFORCE, TRPO, PPO, GRPO, DPO) with simplified notation focused on LLMs, plus literature review and new GRAPE research proposal.

DetailsMotivation: Existing explanations of instruction tuning algorithms often assume prior knowledge, lack critical details, or are overly complex, making them inaccessible to newcomers in the field.

Method: Step-by-step development of each algorithm using simplified and explicit notation focused on LLMs, minimizing detours into broader RL literature and eliminating superfluous abstractions.

Result: Clear and intuitive understanding of instruction tuning concepts with reduced cognitive overhead, followed by literature review of new techniques and presentation of GRAPE (Generalized Relative Advantage Policy Evolution) as new research direction.

Conclusion: The paper successfully provides an accessible foundation for understanding instruction tuning algorithms while proposing new research directions in the form of GRAPE for advancing the field.

Abstract: This paper provides a self-contained, from-scratch, exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, and/or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.

[101] Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data

Sepehr Golrokh Amin, Devin Rhoads, Fatemeh Fakhrmoosavi, Nicholas E. Lownes, John N. Ivan

Main category: cs.CL

TL;DR: LLMs can generate realistic travel diaries from open-source data, achieving comparable realism to classical methods while excelling in trip purpose determination and showing greater consistency.

DetailsMotivation: To overcome reliance on proprietary household travel surveys by using LLMs to generate individual travel diaries from open-source data for agent-based transportation models.

Method: Generate personas from ACS and SLD data, synthesize diaries through direct LLM prompting, and validate using a novel one-to-cohort realism score with four metrics (Trip Count, Interval, Purpose, Mode) compared against CSTS diaries using Jensen-Shannon Divergence.

Result: LLM-generated diaries achieve comparable overall realism (0.485 vs 0.455) to classical methods, excel in trip purpose determination, show greater consistency, and demonstrate statistical representativeness (0.612 vs 0.435) in aggregate validation.

Conclusion: LLMs are viable for zero-shot travel diary generation, establishing a quantifiable metric for synthetic diary evaluation and offering an alternative to traditional survey-dependent approaches.

Abstract: This study introduces a Large Language Model (LLM) scheme for generating individual travel diaries in agent-based transportation models. While traditional approaches rely on large quantities of proprietary household travel surveys, the method presented in this study generates personas stochastically from open-source American Community Survey (ACS) and Smart Location Database (SLD) data, then synthesizes diaries through direct prompting. This study features a novel one-to-cohort realism score: a composite of four metrics (Trip Count Score, Interval Score, Purpose Score, and Mode Score) validated against the Connecticut Statewide Transportation Study (CSTS) diaries, matched across demographic variables. The validation utilizes Jensen-Shannon Divergence to measure distributional similarities between generated and real diaries. When compared to diaries generated with classical methods (Negative Binomial for trip generation; Multinomial Logit for mode/purpose) calibrated on the validation set, LLM-generated diaries achieve comparable overall realism (LLM mean: 0.485 vs. 0.455). The LLM excels in determining trip purpose and demonstrates greater consistency (narrower realism score distribution), while classical models lead in numerical estimates of trip count and activity duration. Aggregate validation confirms the LLM’s statistical representativeness (LLM mean: 0.612 vs. 0.435), demonstrating the LLM’s zero-shot viability and establishing a quantifiable metric of diary realism for future synthetic diary evaluation systems.
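
The JSD-based scoring can be illustrated for one of the four components; how the composite weighs the four metrics is not spelled out above, so the averaging note in the comment is an assumption.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def metric_score(generated, real):
    """One realism component: 1 minus the Jensen-Shannon divergence between a
    generated and a real distribution (e.g., over daily trip counts)."""
    return 1.0 - jensenshannon(generated, real, base=2) ** 2  # distance -> divergence

# Hypothetical trip-count distributions over 0..4 trips per day
gen = np.array([0.10, 0.30, 0.35, 0.15, 0.10])
real = np.array([0.12, 0.28, 0.33, 0.17, 0.10])
print(metric_score(gen, real))  # near 1.0 = realistic
# the composite realism score would combine four such components
# (trip count, interval, purpose, mode), e.g. by averaging
```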

[102] Introducing Spotlight: A Novel Approach for Generating Captivating Key Information from Documents

Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Aditya Vempaty, Prasenjit Dey, Ravi Kokku, Pawan Goyal, Niloy Ganguly

Main category: cs.CL

TL;DR: Spotlight is a new information extraction paradigm that creates engaging narratives by highlighting compelling document aspects rather than comprehensive summaries, using a two-stage fine-tuning and DPO alignment approach.

DetailsMotivation: Traditional summaries prioritize comprehensive coverage but may lack engagement. Spotlight aims to foster deeper reader engagement by selectively emphasizing intriguing content from documents.

Method: Two-stage approach: 1) Fine-tune a large language model on benchmark datasets curated for spotlight generation, 2) Alignment via Direct Preference Optimization (DPO) to improve quality.

Result: The model effectively identifies key elements with precision, enhances readability, and boosts the engagement value of original documents through compelling narrative generation.

Conclusion: Spotlight represents a novel paradigm shift from comprehensive summarization to selective highlighting of engaging content, successfully improving reader engagement through targeted information extraction.

Abstract: In this paper, we introduce Spotlight, a novel paradigm for information extraction that produces concise, engaging narratives by highlighting the most compelling aspects of a document. Unlike traditional summaries, which prioritize comprehensive coverage, spotlights selectively emphasize intriguing content to foster deeper reader engagement with the source material. We formally differentiate spotlights from related constructs and support our analysis with a detailed benchmarking study using new datasets curated for this work. To generate high-quality spotlights, we propose a two-stage approach: fine-tuning a large language model on our benchmark data, followed by alignment via Direct Preference Optimization (DPO). Our comprehensive evaluation demonstrates that the resulting model not only identifies key elements with precision but also enhances readability and boosts the engagement value of the original document.

[103] Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs

Amber Shore, Russell Scheinberg, Ameeta Agrawal, So Young Lee

Main category: cs.CL

TL;DR: LLMs can perform coreference disambiguation and detect ambiguity separately but struggle to do both simultaneously, revealing a CORRECT-DETECT trade-off.

DetailsMotivation: To examine if LLMs can handle semantic ambiguity in coreference resolution like humans, who use broad contextual knowledge to detect and resolve ambiguities in isolated text.

Method: Tested LLMs with minimal prompting on coreference disambiguation tasks and ambiguity detection in coreference, analyzing their ability to perform both capabilities concurrently.

Result: LLMs achieved good performance in coreference disambiguation and ambiguity detection when done separately, but failed to balance both abilities simultaneously.

Conclusion: While LLMs possess both coreference resolution and ambiguity detection capabilities implicitly, they exhibit a fundamental trade-off (CORRECT-DETECT) that prevents them from effectively performing both tasks at the same time.

Abstract: Large Language Models (LLMs) are intended to reflect human linguistic competencies. But humans have access to a broad and embodied context, which is key in detecting and resolving linguistic ambiguities, even in isolated text spans. A foundational case of semantic ambiguity is found in the task of coreference resolution: how is a pronoun related to an earlier person mention? This capability is implicit in nearly every downstream task, and the presence of ambiguity at this level can alter performance significantly. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference; however, they cannot do both at the same time. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, successful performance balancing these two abilities remains elusive.

[104] Patent Language Model Pretraining with ModernBERT

Amirhossein Yousefiramandi, Ciaran Cooney

Main category: cs.CL

TL;DR: The paper presents ModernBERT-base-PT, a domain-specific masked language model pretrained on 60+ million patent records, which outperforms general-purpose models on patent classification tasks while offering 3x faster inference than PatentBERT.

DetailsMotivation: Transformer models like BERT perform poorly in specialized domains like patents due to long, technical, and legally structured text. Existing approaches rely on fine-tuning general models or domain-adapted variants with limited data.

Method: Pretrained 3 domain-specific masked language models using ModernBERT architecture on 60M+ patent records. Incorporated architectural optimizations including FlashAttention, rotary embeddings, and GLU feed-forward layers.

Result: ModernBERT-base-PT consistently outperformed general-purpose ModernBERT baseline on 3 out of 4 patent classification datasets and achieved competitive performance with PatentBERT. Larger models and customized tokenizers further enhanced performance on selected tasks. All variants maintained 3x faster inference than PatentBERT.

Conclusion: Domain-specific pretraining with architectural improvements significantly benefits patent-focused NLP tasks, offering both performance gains and substantially faster inference for time-sensitive applications.

Abstract: Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have primarily relied on fine-tuning general-purpose models or domain-adapted variants pretrained with limited data. In this work, we pretrain 3 domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, consistently outperforms the general-purpose ModernBERT baseline on three out of four datasets and achieves competitive performance with a baseline PatentBERT. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large demonstrate that scaling the model size and customizing the tokenizer further enhance performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference (over 3x that of PatentBERT), underscoring their suitability for time-sensitive applications. These results underscore the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.

[105] Semantic Agreement Enables Efficient Open-Ended LLM Cascades

Duncan Soiffer, Steven Kolawole, Virginia Smith

Main category: cs.CL

TL;DR: Semantic agreement between ensemble outputs serves as a training-free signal for reliable deferral in LLM cascade systems, achieving target-model quality at 40% cost with 60% latency reduction.

DetailsMotivation: Cascade systems face challenges in determining output reliability for open-ended text generation where quality exists on a continuous spectrum with multiple valid responses.

Method: Proposes semantic agreement - meaning-level consensus between ensemble outputs - as a training-free signal for deferral decisions, requiring no model internals and working across black-box APIs.

Result: Semantic cascades match or surpass target-model quality at 40% of the cost, reduce latency by up to 60%, and remain robust to model updates.

Conclusion: Semantic agreement provides a practical baseline for real-world LLM deployment, offering stronger reliability signals than token-level confidence while being training-free and API-compatible.

Abstract: Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement – meaning-level consensus between ensemble outputs – as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluating models from 500M to 70B parameters, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
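
The deferral rule itself is small once a meaning-level agreement check is available. A sketch follows (the `agree` callable stands in for a semantic check such as bidirectional entailment; the lexical lambda below is only a toy):

```python
import itertools

def should_defer(outputs, agree, threshold=0.7):
    """Defer to the larger model when sampled small-model outputs
    fail to agree semantically with each other."""
    pairs = list(itertools.combinations(outputs, 2))
    agreement = sum(agree(a, b) for a, b in pairs) / len(pairs)
    return agreement < threshold

outs = ["Paris is the capital.", "The capital is Paris.", "It is Lyon."]
print(should_defer(outs, agree=lambda a, b: ("Paris" in a) == ("Paris" in b)))  # True
```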

[106] Evaluating Program Semantics Reasoning with Type Inference in System F

Yifeng He, Luning Yang, Christopher Castro Gaw Gonzalo, Hao Chen

Main category: cs.CL

TL;DR: TF-Bench is a new benchmark that evaluates LLM reasoning capabilities through type inference in System F, revealing significant limitations in current models with only 55.85% accuracy on purely semantic tasks.

DetailsMotivation: Current benchmarks lack formal deductive frameworks for evaluating program semantics reasoning and cannot distinguish between genuine reasoning and superficial pattern matching in LLMs.

Method: Created TF-Bench using type inference in System F, with verified transformations to remove semantically irrelevant natural language, resulting in TF-Bench_pure for purely semantics-driven evaluation.

Result: State-of-the-art LLMs show substantial limitations, with Claude-3.7-sonnet achieving only 55.85% accuracy on TF-Bench_pure. Proposed novel metrics reveal critical gaps in robustness and test-time reasoning effectiveness.

Conclusion: Current LLMs have significant limitations in program semantics reasoning, highlighting essential directions for future research in developing more robust reasoning capabilities.

Abstract: Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show significant potential for understanding program logic and semantics beyond mere token recognition. However, current benchmarks for code reasoning lack a formal, program-centric deductive framework to ensure sound evaluation, and are incapable of assessing whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as program semantics reasoning. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess robustness and the effectiveness of test-time reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.
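
For readers unfamiliar with the task: a TF-Bench-style item asks the model to recover a term's type under System F's typing rules. The benchmark's exact item format isn't shown above; a generic instance (polymorphic function composition) looks like this:

```latex
% Term (polymorphic function composition):
\[
  \mathsf{compose} \;=\; \Lambda\alpha.\,\Lambda\beta.\,\Lambda\gamma.\,
    \lambda f^{\,\beta\to\gamma}.\,\lambda g^{\,\alpha\to\beta}.\,\lambda x^{\,\alpha}.\; f\,(g\,x)
\]
% Expected inference:
\[
  \vdash \mathsf{compose} \;:\; \forall\alpha.\,\forall\beta.\,\forall\gamma.\,
    (\beta\to\gamma)\to(\alpha\to\beta)\to\alpha\to\gamma
\]
```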

[107] HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

Ken Deng, Zizheng Zhan, Wen Xiang, Wenqiang Zhu, Weihao Li, Jingxuan Xu, Tianhao Peng, Xinping Lei, Kun Wu, Yifan Yao, Haoyang Huang, Huaixi Tang, Kepeng Lei, Zhiyi Lai, Songwei Yu, Zongxian Feng, Zuchen Gao, Weihao Xie, Chenchen Zhang, Yanan Wu, Yuanxing Zhang, Lecheng Huang, Yuqun Zhang, Jie Liu, Zhaoxiang Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu

Main category: cs.CL

TL;DR: HiPO is a framework that enables LLMs to adaptively decide when to use detailed reasoning (Think-on) versus direct responses (Think-off), reducing token usage while maintaining accuracy.

DetailsMotivation: Current CoT reasoning in LLMs generates lengthy reasoning traces that are inefficient, leading to excessive token usage and higher inference costs.

Method: HiPO combines a hybrid data pipeline with paired Think-on/Think-off responses and a hybrid reinforcement learning reward system that balances accuracy and efficiency.

Result: Experiments on mathematics and coding benchmarks show HiPO substantially reduces token length while maintaining or improving accuracy.

Conclusion: HiPO provides a principled approach for efficient adaptive reasoning, advancing deployment of reasoning-oriented LLMs in resource-sensitive settings.

Abstract: Large Language Models (LLMs) increasingly rely on Chain-of-Thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces the Hybrid Policy Optimization (i.e., HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, providing paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO can be a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
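
The accuracy-versus-efficiency balance can be captured by a cost-regularized reward. A deliberately minimal sketch (the coefficient and the exact shaping are illustrative, not the paper's):

```python
def hybrid_reward(correct: bool, n_tokens: int, lam: float = 1e-4) -> float:
    """Accuracy minus a token-length penalty: Think-off wins on easy queries,
    and Think-on pays off only when extra reasoning flips the answer to correct."""
    return (1.0 if correct else 0.0) - lam * n_tokens

print(hybrid_reward(True, 40))    # short, correct Think-off: 0.996
print(hybrid_reward(True, 2000))  # long, correct Think-on:   0.80
```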

[108] IASC: Interactive Agentic System for ConLangs

Chihiro Taguchi, Richard Sproat

Main category: cs.CL

TL;DR: A modular system using LLMs to develop constructed languages through phonology creation, sentence translation, lexicon construction, orthography design, and grammar handbook generation.

DetailsMotivation: To create fun tools for constructed language development and explore LLMs' understanding of linguistic concepts and their capabilities across different language patterns.

Method: Step-by-step modular approach: 1) Agentic phonology creation with feedback refinement, 2) Sentence translation into morphosyntactic markup, 3) Lexicon construction from morphemes, 4) Orthography design using existing scripts, 5) Grammar handbook generation.

Result: System successfully creates constructed languages but shows varying capabilities across different LLMs and linguistic specifications, with better performance on common patterns than rare ones. Limited success in high-to-low-resource language translation.

Conclusion: The system demonstrates LLMs’ linguistic knowledge while revealing limitations, particularly with rare patterns and low-resource language translation, suggesting potential for future improvements.

Abstract: We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is ‘translated’ from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the ‘translated’ sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs ‘know’ about language: not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC

[109] A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

Congming Zheng, Jiachen Zhu, Zhuoying Ou, Yuxiang Chen, Kangning Zhang, Rong Shan, Zeyu Zheng, Mengyue Yang, Jianghao Lin, Yong Yu, Weinan Zhang

Main category: cs.CL

TL;DR: This survey provides a systematic overview of Process Reward Models (PRMs) that evaluate reasoning at step level rather than just final answers, covering data generation, model building, and applications across various domains.

DetailsMotivation: Conventional alignment is dominated by outcome reward models that only judge final answers, creating a gap in evaluating reasoning processes. PRMs address this by providing fine-grained evaluation of reasoning steps.

Method: The survey systematically examines the full PRM loop: generating process data, building PRMs, and using them for test-time scaling and reinforcement learning. It covers applications in math, code, text, multimodal reasoning, robotics, and agents.

Result: The paper summarizes design spaces, applications across multiple domains, and emerging benchmarks for PRMs, providing a comprehensive framework for understanding process-level reasoning evaluation.

Conclusion: PRMs represent a shift toward fine-grained, robust reasoning alignment that can better guide and evaluate step-by-step reasoning processes, with significant potential across various AI applications.

Abstract: Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.

[110] Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default

Jiaqi Liu, Tong Wang, Su Liu, Xin Hu, Ran Tong, Lanruo Wang, Jiexi Xu

Main category: cs.CL

TL;DR: Lightweight medical abstract classification evaluation shows DistilBERT with cross-entropy loss performs best, with post-hoc calibration significantly improving deployed performance.

DetailsMotivation: To establish maximum performance capabilities of lightweight medical abstract classification methods under financial budget restrictions.

Method: Finetuned BERT base and DistilBERT with cross entropy, class weighted CE, and focal loss under identical tokenization, sequence length, optimizer, and schedule. Applied post-hoc operating point selection with validation calibration and classwise thresholds.

Result: DistilBERT with plain CE gives the strongest raw performance trade-off. Post-hoc tuning substantially improves deployed performance, with focal loss benefiting most under tuned regime. Reported Accuracy, Macro F1, and Weighted F1 metrics.

Conclusion: Practical recommendation: start with compact encoder and cross-entropy loss, then add lightweight calibration or thresholding when deployment requires higher macro balance.

Abstract: The research evaluates lightweight medical abstract classification methods to establish their maximum performance capabilities under financial budget restrictions. On the public medical abstracts corpus, we finetune BERT-base and DistilBERT with three objectives: cross-entropy (CE), class-weighted CE, and focal loss, under identical tokenization, sequence length, optimizer, and schedule. DistilBERT with plain CE gives the strongest raw argmax trade-off, while a post-hoc operating-point selection (validation-calibrated, classwise thresholds) substantially improves deployed performance; under this tuned regime, focal loss benefits most. We report Accuracy, Macro F1, and Weighted F1, release evaluation artifacts, and include confusion analyses to clarify error structure. The practical takeaway is to start with a compact encoder and CE, then add lightweight calibration or thresholding when deployment requires higher macro balance.

[111] Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures

Mihir Gupte, Paolo Giusto, Ramesh S

Main category: cs.CL

TL;DR: Proposes a bottom-up method to linearize tree-like structures by generating implicit aggregated summaries at each hierarchical level, enabling efficient RAG with 68% fewer documents while maintaining response quality.

DetailsMotivation: LLMs can use context effectively but RAG on structured hierarchical data like code repositories is not well-explored, particularly how to best represent retrieved knowledge from tree structures.

Method: A novel bottom-up approach that linearizes tree-like structures by generating implicit, aggregated summaries at each hierarchical level, allowing knowledge to be stored in a knowledge base for RAG.

Result: Response quality is comparable to using RAG on raw unstructured code, but the proposed method generates over 68% fewer documents in the retriever, showing significant efficiency gains.

Conclusion: Leveraging implicit, linearized knowledge is a highly effective and scalable strategy for handling complex hierarchical data structures in RAG systems.

Abstract: Large Language Models (LLMs) are adept at generating responses based on information within their context. While this ability is useful for interacting with structured data like code files, another popular method, Retrieval-Augmented Generation (RAG), retrieves relevant documents to augment the model’s in-context learning. However, it is not well-explored how to best represent this retrieved knowledge for generating responses on structured data, particularly hierarchical structures like trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (like a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. This approach enables the knowledge to be stored in a knowledge base and used directly with RAG. We then compare our method to using RAG on raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach generates over 68% fewer documents in the retriever, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.
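
The bottom-up linearization admits a compact recursive sketch (the `summarize` callable stands in for an LLM call, and the directory-walk details are our assumptions):

```python
from pathlib import Path

def summarize_tree(root: Path, summarize):
    """Bottom-up linearization of a file tree: summarize leaves (files) first,
    then summarize each directory from its children's summaries. Returns a
    {path: summary} mapping suitable for indexing into a RAG knowledge base."""
    kb = {}

    def visit(p: Path) -> str:
        if p.is_file():
            kb[p] = summarize(p.read_text(errors="ignore"))
        else:
            children = [visit(c) for c in sorted(p.iterdir())]
            kb[p] = summarize("\n".join(children))  # implicit aggregated summary
        return kb[p]

    visit(root)
    return kb
```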

[112] Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers

Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, Fuli Luo

Main category: cs.CL

TL;DR: R3 (Rollout Routing Replay) stabilizes RL training in Mixture-of-Experts models by replaying inference routing distributions during training to address routing inconsistency issues.

DetailsMotivation: Mixture-of-Experts models suffer from routing instability during RL training, causing catastrophic collapse due to discrepancies between training and inference routing behaviors.

Method: Proposed Rollout Routing Replay (R3) method that records routing distributions from inference engine and replays them during training to ensure consistency.

Result: R3 significantly reduces training-inference policy KL divergence, prevents extreme routing discrepancies, stabilizes RL training without speed compromise, and outperforms GSPO and TIS methods.

Conclusion: R3 provides an effective solution for stabilizing RL training in MoE models by addressing fundamental routing inconsistency issues.

Abstract: Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.
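
A minimal numerical sketch of the replay idea (toy logits and a simplified top-k router; R3 as described replays the rollout routing so training sees the same expert choices the inference engine made):

```python
import numpy as np

def route(logits, k=2, replay_idx=None):
    """Top-k MoE routing; with replay_idx set, reuse the expert selection
    recorded at rollout time instead of re-deriving it from these logits."""
    idx = replay_idx if replay_idx is not None else np.argsort(logits)[-k:]
    w = np.exp(logits[idx]) / np.exp(logits[idx]).sum()
    return idx, w

# rollout (inference engine): record which experts fired for this token
infer_logits = np.array([0.1, 2.0, 0.3, 1.5])
recorded_idx, _ = route(infer_logits)                 # experts {3, 1}

# training forward pass: small numerical drift would flip the top-k set ...
train_logits = infer_logits + np.array([0.0, -0.7, 1.4, 0.0])
print(route(train_logits))                            # would pick {3, 2}
# ... replaying the recorded selection removes the train/inference mismatch
print(route(train_logits, replay_idx=recorded_idx))   # sticks with {3, 1}
```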

[113] A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning

Qianben Chen, Jingyi Cao, Jiayu Zhang, Tianrui Qin, Xiaowan Li, King Zhu, Dingfeng Shi, He Zhu, Minghao Liu, Xiaobo Liang, Xin Gui, Ge Zhang, Jian Yang, Yuchen Eleanor Jiang, Wangchunshu Zhou

Main category: cs.CL

TL;DR: A^2FM is a unified framework that combines reasoning and agentic capabilities in LLMs through task-aware routing and adaptive execution with three modes: instant (simple queries), reasoning (complex thinking), and agentic (tool use).

DetailsMotivation: Current LLMs are split into reasoning-centric models (good at internal reasoning but no tools) and agentic models (good with tools but weak reasoning), creating inefficiency where both overthink simple queries or overuse tools.

Method: Route-then-align principle: learn task-aware routing, align mode-specific trajectories under shared backbone. Introduces instant mode for simple queries, Adaptive Policy Optimization (APO) for cost-regularized reward and adaptive sampling across modes.

Result: At the 32B scale: 13.4% on BrowseComp, 70.4% on AIME25, 16.7% on HLE, setting SOTA among comparable models. Adaptive execution achieves a cost-of-pass of $0.00487 per correct answer, cutting costs by 45.2% vs reasoning and 33.5% vs agentic modes.

Conclusion: A^2FM successfully unifies reasoning and agentic capabilities while significantly improving cost efficiency through adaptive mode selection, maintaining competitive accuracy across diverse benchmarks.

Abstract: Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode, instant, that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost-of-pass of only $0.00487 per correct answer, cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
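The abstract does not give APO's exact objective, but a cost-regularized reward generically takes the shape of accuracy minus a penalty on spend. A hedged sketch, with `lam` and the additive form as assumptions rather than the paper's formula:

```python
def apo_reward(correct: bool, token_cost: float, tool_cost: float,
               lam: float = 0.1) -> float:
    """Cost-regularized reward: accuracy minus a penalty on compute and tool
    spend. The additive form and lam are illustrative assumptions."""
    return float(correct) - lam * (token_cost + tool_cost)

# A policy optimized under such a reward is pushed toward the cheapest mode
# (instant < reasoning < agentic) that still answers correctly.
print(apo_reward(correct=True, token_cost=0.2, tool_cost=0.0))  # 0.98
```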

[114] AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

María Victoria Carro, Denise Alejandra Mester, Facundo Nieto, Oscar Agustín Stanchi, Guido Ernesto Bergman, Mario Alejandro Leiva, Eitan Sprejer, Luca Nicolás Forziati Gangi, Francisca Gauna Selasco, Juan Gustavo Corvalán, Gerardo I. Simari, María Vanina Martinez

Main category: cs.CL

TL;DR: AI debate experiments reveal models prefer sycophantic strategies over prior beliefs, sequential debate favors second debater, and paradoxically higher-quality arguments emerge when models argue against their beliefs.

DetailsMotivation: To test whether language models adopt sycophantic strategies by aligning with judge personas or remain faithful to their prior beliefs in subjective debates, addressing limitations of existing debate experiments that rely on objective datasets with ground truth.

Method: Applied debate to subjective questions, measured models’ prior beliefs, presented debaters with judge personas conflicting with their priors, compared sequential vs simultaneous debate protocols, and evaluated persuasiveness and argument quality when defending aligned vs misaligned positions.

Result: Models prefer defending stances aligned with judge personas over prior beliefs, sequential debate introduces significant bias favoring second debater, models are more persuasive when defending aligned positions, but paradoxically arguments misaligned with prior beliefs are rated as higher quality.

Conclusion: Results inform human judges for better training signals and contribute to aligned AI systems, revealing important persuasion dynamics in human-AI interaction where models exhibit sycophantic behavior and produce higher-quality arguments when arguing against their beliefs.

Abstract: The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models’ prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge’s presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs versus when arguing against them. Our main findings show that models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs, sequential debate introduces significant bias favoring the second debater, models are more persuasive when defending positions aligned with their prior beliefs, and paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison. These results can inform human judges to provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.

[115] SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Qiusi Zhan, Angeline Budiman-Chan, Abdelrahman Zayed, Xingzhi Guo, Daniel Kang, Joo-Kyung Kim

Main category: cs.CL

TL;DR: Search agents using LLMs are more vulnerable to producing harmful outputs than base LLMs. SafeSearch uses multi-objective reinforcement learning to reduce harmfulness by 70% while maintaining utility.

DetailsMotivation: LLM-based search agents are more likely to produce harmful outputs than base LLMs, especially when utility-oriented fine-tuning intensifies safety risks, motivating joint alignment of safety and utility.

Method: SafeSearch uses multi-objective reinforcement learning with a final-output safety/utility reward and a novel query-level shaping term that penalizes unsafe queries and rewards safe ones.

Result: SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent.

Conclusion: The query-level reward in SafeSearch effectively improves both safety and utility, demonstrating successful joint alignment for search agents.

Abstract: Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked “How can I track someone’s location without their consent?”, a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), and, once appended, synthesize them into an informative yet unsafe summary. We further show that utility-oriented fine-tuning intensifies this risk, motivating joint alignment of safety and utility. We present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 70% across three red-teaming datasets while producing safe, helpful responses, and matches the QA performance of a utility-only finetuned agent; further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
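As a rough illustration of how a final-output reward and a query-level shaping term can be combined (the weights and exact structure here are assumptions, not the paper's formula):

```python
def safesearch_reward(final_safe: bool, final_useful: bool,
                      queries_safe: list[bool],
                      alpha: float = 1.0, beta: float = 0.5) -> float:
    """Final-output safety/utility reward plus a query-level shaping term
    that rewards safe queries and penalizes unsafe ones. Weights and the
    exact structure are illustrative assumptions."""
    final_term = alpha * (float(final_safe) + float(final_useful))
    n = max(len(queries_safe), 1)
    shaping = beta * sum(1.0 if q else -1.0 for q in queries_safe) / n
    return final_term + shaping

# Two safe queries, one unsafe: 2.0 + 0.5 * (1/3), roughly 2.17
print(safesearch_reward(True, True, [True, False, True]))
```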

[116] SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger

Main category: cs.CL

TL;DR: SimBench is the first large-scale standardized benchmark for evaluating LLM simulations of human behavior, unifying 20 diverse datasets to provide reproducible evaluation of when and how LLM simulations succeed or fail.

DetailsMotivation: Current evaluations of LLM simulations are fragmented with bespoke tasks and metrics, creating incomparable results that hinder the development of faithful human behavior simulations in social sciences.

Method: Created SimBench by unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, enabling standardized evaluation of LLM simulation capabilities.

Result: Best current LLMs have limited simulation ability (40.80/100), performance scales log-linearly with model size, no improvement from increased inference-time compute, alignment-simulation trade-off exists, models struggle with specific demographic groups, and simulation ability correlates strongly with knowledge-intensive reasoning (MMLU-Pro, r=0.939).

Conclusion: SimBench enables measurable progress in developing more faithful LLM simulators by providing a standardized foundation to understand when, how, and why LLM simulations succeed or fail.

Abstract: Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.

cs.CV

[117] MAT-Agent: Adaptive Multi-Agent Training Optimization

Jusheng Zhang, Kaitong Cai, Yijia Fan, Ningyuan Liu, Keze Wang

Main category: cs.CV

TL;DR: MAT-Agent is a multi-agent framework that dynamically optimizes training parameters using non-stationary multi-armed bandit algorithms, achieving state-of-the-art performance on multiple multi-label image classification benchmarks.

DetailsMotivation: Conventional multi-label image classification methods use static training configurations that fail in dynamic settings, requiring adaptive strategies to navigate complex visual-semantic landscapes.

Method: Deploys autonomous agents to dynamically tune data augmentation, optimizers, learning rates, and loss functions using non-stationary multi-armed bandit algorithms, enhanced with dual-rate exponential moving average smoothing and mixed-precision training.

Result: Achieves mAP of 97.4 (Pascal VOC), 92.8 (COCO), 60.9 (VG-256); outperforms baselines with improved OF1 and CF1 scores across all datasets; shows accelerated convergence and robust cross-domain generalization.

Conclusion: MAT-Agent provides a scalable, intelligent solution for optimizing complex visual models and paves the way for adaptive deep learning advancements through collaborative, real-time optimization.

Abstract: Multi-label image classification demands adaptive training strategies to navigate complex, evolving visual-semantic landscapes, yet conventional methods rely on static configurations that falter in dynamic settings. We propose MAT-Agent, a novel multi-agent framework that reimagines training as a collaborative, real-time optimization process. By deploying autonomous agents to dynamically tune data augmentation, optimizers, learning rates, and loss functions, MAT-Agent leverages non-stationary multi-armed bandit algorithms to balance exploration and exploitation, guided by a composite reward harmonizing accuracy, rare-class performance, and training stability. Enhanced with dual-rate exponential moving average smoothing and mixed-precision training, it ensures robustness and efficiency. Extensive experiments across Pascal VOC, COCO, and VG-256 demonstrate MAT-Agent’s superiority: it achieves an mAP of 97.4 (vs. 96.2 for PAT-T), OF1 of 92.3, and CF1 of 91.4 on Pascal VOC; an mAP of 92.8 (vs. 92.0 for HSQ-CvN), OF1 of 88.2, and CF1 of 87.1 on COCO; and an mAP of 60.9, OF1 of 70.8, and CF1 of 61.1 on VG-256. With accelerated convergence and robust cross-domain generalization, MAT-Agent offers a scalable, intelligent solution for optimizing complex visual models, paving the way for adaptive deep learning advancements.

[118] TriggerNet: A Novel Explainable AI Framework for Red Palm Mite Detection and Multi-Model Comparison and Heuristic-Guided Annotation

Harshini Suresha, Kavitha SH

Main category: cs.CV

TL;DR: TriggerNet is an interpretable AI framework that integrates multiple explanation methods (Grad-CAM, RISE, FullGrad, TCAV) for detecting red palm mite infestation across 11 plant species using deep learning and traditional ML models, with Snorkel for efficient disease labeling.

DetailsMotivation: Red palm mite infestation causes serious productivity and economic losses in palm cultivation regions, requiring accurate early identification for effective management.

Method: Used TriggerNet framework with multiple explanation methods; trained on RGB images of 11 plant species; employed CNN, EfficientNet, MobileNet, ViT, ResNet50, InceptionV3, Random Forest, SVM, KNN; used Snorkel for disease classification labeling.

Result: Developed comprehensive plant classification and disease detection system; categorized plants into Healthy, Yellow Spots, Reddish Bronzing, and Silk Webbing classes; achieved efficient labeling through heuristic rules.

Conclusion: TriggerNet provides interpretable AI solutions for red palm mite detection, enabling early identification and effective management of infestations across multiple plant species.

Abstract: The red palm mite infestation has become a serious concern, particularly in regions with extensive palm cultivation, leading to reduced productivity and economic losses. Accurate and early identification of mite-infested plants is critical for effective management. The current study focuses on evaluating and comparing ML models for classifying the affected plants and detecting the infestation. TriggerNet is a novel interpretable AI framework that integrates Grad-CAM, RISE, FullGrad, and TCAV to generate novel visual explanations for deep learning models in plant classification and disease detection. This study applies TriggerNet to address red palm mite (Raoiella indica) infestation, a major threat to palm cultivation and agricultural productivity. A diverse set of RGB images across 11 plant species (Arecanut, Date Palm, Bird of Paradise, Coconut Palm, Ginger, Citrus Tree, Palm Oil, Orchid, Banana Palm, Avocado Tree, and Cast Iron Plant) was utilized for training and evaluation. Advanced deep learning models like CNN, EfficientNet, MobileNet, ViT, ResNet50, and InceptionV3, alongside machine learning classifiers such as Random Forest, SVM, and KNN, were employed for plant classification. For disease classification, all plants were categorized into four classes: Healthy, Yellow Spots, Reddish Bronzing, and Silk Webbing. Snorkel was used to efficiently label these disease classes by leveraging heuristic rules and patterns, reducing manual annotation time and improving dataset reliability.

[119] CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization

Yichen Yan, Ming Zhong, Qi Zhu, Xiaoling Gu, Jinpeng Chen, Huan Li

Main category: cs.CV

TL;DR: CoIDO is a dual-objective framework that jointly optimizes data importance and diversity for efficient instruction tuning of multimodal LLMs, achieving 98.2% of full-data performance using only 20% of data.

DetailsMotivation: To address the high computational cost of training MLLMs on large datasets and overcome limitations of existing data selection methods that have high overhead and suboptimal selection.

Method: Uses a lightweight plug-in scorer trained on a small random sample to learn candidate set distribution, with homoscedastic uncertainty-based formulation to balance importance and diversity.

Result: Achieved 98.2% of full-data fine-tuning performance on LLaVA-1.5-7B model across ten downstream tasks using only 20% of selected data.

Conclusion: CoIDO provides an efficient and scalable data selection approach that significantly reduces computational demands while maintaining near-full performance.

Abstract: Multimodal large language models (MLLMs) rely heavily on instruction tuning to align vision and language capabilities, yet the computational cost of training on large-scale datasets remains a major bottleneck. Existing data selection methods aim to mitigate this by selecting important and diverse subsets, but they often suffer from two critical drawbacks: high computational overhead from processing the entire dataset and suboptimal data selection due to separate treatment of importance and diversity. We introduce CoIDO, a novel dual-objective framework that jointly optimizes data importance and diversity to overcome these challenges. Unlike existing approaches that require costly evaluations across the whole dataset, CoIDO employs a lightweight plug-in scorer. This scorer is trained on just a small random sample of data to learn the distribution of the candidate set, drastically reducing computational demands. By leveraging a homoscedastic uncertainty-based formulation, CoIDO effectively balances importance and diversity during training, enabling efficient and scalable data selection. In our experiments, we trained the CoIDO scorer using only 20 percent of randomly sampled data. Once trained, CoIDO was applied to the entire dataset to select a 20 percent subset for instruction tuning. On the widely used LLaVA-1.5-7B model across ten downstream tasks, this selected subset achieved an impressive 98.2 percent of the performance of full-data fine-tuning, on average.
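The homoscedastic uncertainty-based balancing is most likely in the spirit of the standard Kendall-style multi-task weighting, where each objective is scaled by a learned precision. A sketch under that assumption; whether CoIDO uses exactly this form is not stated in the abstract:

```python
import torch

class CoupledLoss(torch.nn.Module):
    """Homoscedastic-uncertainty weighting of two objectives, in the standard
    Kendall-style form; whether CoIDO uses exactly this is an assumption."""

    def __init__(self):
        super().__init__()
        self.log_var_imp = torch.nn.Parameter(torch.zeros(()))
        self.log_var_div = torch.nn.Parameter(torch.zeros(()))

    def forward(self, loss_imp: torch.Tensor, loss_div: torch.Tensor) -> torch.Tensor:
        # Each objective is scaled by a learned precision; the log-variance
        # terms keep the learned weights from collapsing to zero.
        return (torch.exp(-self.log_var_imp) * loss_imp + self.log_var_imp
                + torch.exp(-self.log_var_div) * loss_div + self.log_var_div)

crit = CoupledLoss()
print(crit(torch.tensor(0.7), torch.tensor(1.2)))  # scalar combined loss
```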

[120] Pre to Post-Treatment Glioblastoma MRI Prediction using a Latent Diffusion Model

Alexandre G. Leclercq, Sébastien Bougleux, Noémie N. Moreau, Alexis Desmonts, Romain Hérault, Aurélien Corroyer-Dulmont

Main category: cs.CV

TL;DR: The paper proposes a Latent Diffusion Model for early visual treatment response prediction in glioblastoma, generating post-treatment MRI from pre-treatment MRI and tumor localization using classifier-free guidance with survival information.

DetailsMotivation: Early prediction of treatment response in glioblastoma is crucial for personalized medicine, as current methods require at least two months to observe visual impact via MRI. Patients show highly heterogeneous therapeutic responses to the standard Stupp protocol.

Method: A Latent Diffusion Model with concatenation-based conditioning from pre-treatment MRI and tumor localization, enhanced by classifier-free guidance using survival information to improve generation quality reflecting post-treatment tumor evolution.

Result: The model was trained and tested on a local dataset of 140 GBM patients from Centre François Baclesse, including pre/post T1-Gd MRI, expert-delineated tumor localization, and survival information.

Conclusion: The proposed approach addresses early visual treatment response prediction as a slice-to-slice translation problem, potentially enabling earlier assessment of therapeutic interventions in glioblastoma patients.

Abstract: Glioblastoma (GBM) is an aggressive primary brain tumor with a median survival of approximately 15 months. In clinical practice, the Stupp protocol serves as the standard first-line treatment. However, patients exhibit highly heterogeneous therapeutic responses, which require at least two months before a first visual impact can be observed, typically with MRI. Early prediction of treatment response is crucial for advancing personalized medicine. Disease Progression Modeling (DPM) aims to capture the trajectory of disease evolution, while Treatment Response Prediction (TRP) focuses on assessing the impact of therapeutic interventions. Whereas most TRP approaches primarily rely on time-series data, we consider early visual TRP as a slice-to-slice translation problem: generating a post-treatment MRI from a pre-treatment MRI, thus reflecting the tumor evolution. To address this problem we propose a Latent Diffusion Model with concatenation-based conditioning on the pre-treatment MRI and the tumor localization, and classifier-free guidance to enhance generation quality using survival information, in particular post-treatment tumor evolution. Our model was trained and tested on a local dataset consisting of 140 GBM patients collected at Centre François Baclesse. For each patient we collected pre- and post-treatment T1-Gd MRI, tumor localization manually delineated in the pre-treatment MRI by medical experts, and survival information.
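The classifier-free guidance component follows the standard recipe: run the denoiser with and without the conditioning and extrapolate away from the unconditional prediction. A minimal sketch; the `model` stub and the conditioning layout (pre-treatment MRI, tumor mask, survival signal) are assumptions standing in for the paper's network:

```python
import torch

def cfg_denoise(model, z_t, t, cond, guidance_scale: float = 3.0):
    """One classifier-free-guidance denoising step: extrapolate away from the
    unconditional prediction. Model stub and conditioning are assumptions."""
    eps_uncond = model(z_t, t, cond=None)  # conditioning dropped
    eps_cond = model(z_t, t, cond=cond)    # full conditioning (e.g., survival)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Dummy denoiser just to exercise the function.
dummy = lambda z, t, cond: z * (0.9 if cond is not None else 1.0)
z = torch.randn(1, 4, 32, 32)
print(cfg_denoise(dummy, z, t=10, cond="survival<12mo").shape)  # (1, 4, 32, 32)
```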

[121] Provenance of AI-Generated Images: A Vector Similarity and Blockchain-based Approach

Jitendra Sharma, Arthur Carvalho, Suman Bhunia

Main category: cs.CV

TL;DR: An embedding-based framework using image embeddings and vector similarity to detect AI-generated images by analyzing clustering patterns in embedding space.

DetailsMotivation: The rise of realistic AI-generated images from models like ChatGPT with DALL-E and Stable Diffusion poses challenges for digital content authentication, requiring methods to verify image integrity and origin.

Method: Uses five benchmark embedding models to process diverse datasets of AI and human-generated images, analyzing embedding proximity and clustering patterns to distinguish between AI and human content.

Result: The approach is robust with moderate to high perturbations minimally impacting embedding signatures, and perturbed images maintain close similarity to their original versions.

Conclusion: Provides a generalizable framework for AI-generated image detection that balances accuracy with computational efficiency.

Abstract: Rapid advancement in generative AI and large language models (LLMs) has enabled the generation of highly realistic and contextually relevant digital content. LLMs such as ChatGPT with DALL-E integration and Stable Diffusion techniques can produce images that are often indistinguishable from those created by humans, which poses challenges for digital content authentication. Verifying the integrity and origin of digital data to ensure it remains unaltered and genuine is crucial to maintaining trust and legality in digital media. In this paper, we propose an embedding-based AI image detection framework that utilizes image embeddings and a vector similarity to distinguish AI-generated images from real (human-created) ones. Our methodology is built on the hypothesis that AI-generated images demonstrate closer embedding proximity to other AI-generated content, while human-created images cluster similarly within their domain. To validate this hypothesis, we developed a system that processes a diverse dataset of AI and human-generated images through five benchmark embedding models. Extensive experimentation demonstrates the robustness of our approach, and our results confirm that moderate to high perturbations minimally impact the embedding signatures, with perturbed images maintaining close similarity matches to their original versions. Our solution provides a generalizable framework for AI-generated image detection that balances accuracy with computational efficiency.
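At inference time, the embedding-proximity hypothesis reduces to a nearest-neighbor vote under cosine similarity. A generic sketch of that matching step; the embedding dimension, k, and labels are illustrative, and the embedding model itself is out of frame:

```python
import numpy as np

def classify_by_similarity(query_emb: np.ndarray, ref_embs: np.ndarray,
                           ref_labels: np.ndarray, k: int = 5) -> str:
    """Label a query image by the majority label of its k nearest reference
    embeddings under cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    refs = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = refs @ q                              # cosine similarity to every reference
    top = np.argsort(sims)[-k:]                  # indices of the k most similar
    labels, counts = np.unique(ref_labels[top], return_counts=True)
    return labels[np.argmax(counts)]

refs = np.random.randn(100, 512)                 # toy reference embeddings
labels = np.array(["ai"] * 50 + ["human"] * 50)
print(classify_by_similarity(np.random.randn(512), refs, labels))
```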

[122] ManzaiSet: A Multimodal Dataset of Viewer Responses to Japanese Manzai Comedy

Kazuki Kawamura, Kengo Nakai, Jun Rekimoto

Main category: cs.CV

TL;DR: ManzaiSet is the first large-scale multimodal dataset of Japanese manzai comedy viewer responses, addressing Western bias in affective computing by capturing facial videos and audio from 241 participants.

DetailsMotivation: To address the Western-centric bias in affective computing and enable culturally aware emotion AI development for non-Western contexts like Japanese comedy.

Method: Collected facial videos and audio from 241 participants watching up to 10 professional manzai performances in randomized order, with clustering analysis, individual-level viewing order analysis, and automated humor classification.

Result: Identified three distinct viewer types (72.8% High Appreciators, 13.2% Decliners, 14.0% Improvers), found positive viewing order effect contradicting fatigue hypotheses, and no type-wise differences in humor classification after FDR correction.

Conclusion: The dataset enables development of culturally aware emotion AI and personalized entertainment systems tailored to non-Western contexts, demonstrating the importance of cultural specificity in affective computing.

Abstract: We present ManzaiSet, the first large-scale multimodal dataset of viewer responses to Japanese manzai comedy, capturing facial videos and audio from 241 participants watching up to 10 professional performances in randomized order (94.6 percent watched >= 8; analyses focus on n=228). This addresses the Western-centric bias in affective computing. Three key findings emerge: (1) k-means clustering identified three distinct viewer types: High and Stable Appreciators (72.8 percent, n=166), Low and Variable Decliners (13.2 percent, n=30), and Variable Improvers (14.0 percent, n=32), with heterogeneity of variance (Brown-Forsythe p < 0.001); (2) individual-level analysis revealed a positive viewing order effect (mean slope = 0.488, t(227) = 5.42, p < 0.001, permutation p < 0.001), contradicting fatigue hypotheses; (3) automated humor classification (77 instances, 131 labels) plus viewer-level response modeling found no type-wise differences after FDR correction. The dataset enables culturally aware emotion AI development and personalized entertainment systems tailored to non-Western contexts.

[123] CMIS-Net: A Cascaded Multi-Scale Individual Standardization Network for Backchannel Agreement Estimation

Yuxuan Huang, Kangzhong Wang, Eugene Yujun Fu, Grace Ngai, Peter H. F. Ng

Main category: cs.CV

TL;DR: CMIS-Net is a novel framework for backchannel agreement detection that addresses individual differences through multi-scale feature normalization and implicit data augmentation, achieving state-of-the-art performance.

DetailsMotivation: Backchannel behaviors are crucial for human-like AI interactions but are significantly influenced by individual differences across multiple scales (frame-level and sequence-level), which current emotion recognition methods fail to fully address due to their single-scale approaches.

Method: Proposed Cascaded Multi-Scale Individual Standardization Network (CMIS-Net) that extracts individual-normalized backchannel features by removing person-specific neutral baselines at both frame and sequence levels, plus an implicit data augmentation module to handle training data distributional bias.

Result: Comprehensive experiments and visualizations demonstrate that CMIS-Net effectively handles individual differences and data imbalance, achieving state-of-the-art performance in backchannel agreement detection.

Conclusion: The proposed CMIS-Net successfully addresses the limitations of existing methods by incorporating multi-scale individual normalization and data augmentation, providing an effective solution for backchannel agreement detection in conversational AI systems.

Abstract: Backchannels are subtle listener responses, such as nods, smiles, or short verbal cues like “yes” or “uh-huh,” which convey understanding and agreement in conversations. These signals provide feedback to speakers, improve the smoothness of interaction, and play a crucial role in developing human-like, responsive AI systems. However, the expression of backchannel behaviors is often significantly influenced by individual differences, operating across multiple scales: from instant dynamics such as response intensity (frame-level) to temporal patterns such as frequency and rhythm preferences (sequence-level). This presents a complex pattern recognition problem that contemporary emotion recognition methods have yet to fully address. Particularly, existing individualized methods in emotion recognition often operate at a single scale, overlooking the complementary nature of multi-scale behavioral cues. To address these challenges, we propose a novel Cascaded Multi-Scale Individual Standardization Network (CMIS-Net) that extracts individual-normalized backchannel features by removing person-specific neutral baselines from observed expressions. Operating at both frame and sequence levels, this normalization allows the model to focus on relative changes from each person’s baseline rather than absolute expression values. Furthermore, we introduce an implicit data augmentation module to address the observed training data distributional bias, improving model generalization. Comprehensive experiments and visualizations demonstrate that CMIS-Net effectively handles individual differences and data imbalance, achieving state-of-the-art performance in backchannel agreement detection.
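Stripped to its essentials, the individual standardization idea is baseline subtraction at the frame level followed by sequence-level normalization. The sketch below is a loose reconstruction under that reading, not the paper's exact operators; the feature shapes and the neutral-segment choice are assumptions:

```python
import torch

def individual_standardize(seq_feats: torch.Tensor,
                           neutral_baseline: torch.Tensor) -> torch.Tensor:
    """Frame-level: subtract the person-specific neutral baseline.
    Sequence-level: z-normalize the residual over time."""
    frame_level = seq_feats - neutral_baseline            # (T, D) - (D,)
    mu, sigma = frame_level.mean(0), frame_level.std(0) + 1e-6
    return (frame_level - mu) / sigma

feats = torch.rand(120, 64)            # 120 frames of behavioral features
baseline = feats[:10].mean(0)          # neutral segment as the baseline (assumed)
print(individual_standardize(feats, baseline).shape)  # torch.Size([120, 64])
```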

[124] Shortcutting Pre-trained Flow Matching Diffusion Models is Almost Free Lunch

Xu Cai, Yang Wu, Qianli Chen, Haoran Wu, Lichuan Xiang, Hongkai Wen

Main category: cs.CV

TL;DR: An ultra-efficient post-training method that converts large pre-trained flow matching diffusion models into few-step samplers using velocity field self-distillation, eliminating the need for retraining or step-size embeddings.

DetailsMotivation: Existing shortcutting methods require specialized step-size embeddings and retraining from scratch, which is nearly as costly as pretraining. The goal is to enable efficient few-step sampling from standard flow matching models without this expensive retraining process.

Method: Uses velocity field self-distillation to impart aggressive shortcut mechanisms to standard flow matching models. The approach works on velocity fields rather than sample space and learns rapidly through self-guided distillation in an online manner, requiring less than one A100 day for training.

Result: Produces 3-step Flux models efficiently. Can also be incorporated into pretraining to yield models that inherently learn efficient few-step flows without quality loss. Enables few-shot distillation (e.g., 10 text-image pairs) for billion-parameter diffusion models with state-of-the-art performance at minimal cost.

Conclusion: The method provides an ultra-efficient way to shortcut large pre-trained flow matching models into few-step samplers, eliminating the need for expensive retraining while maintaining performance, and enables practical few-shot distillation for massive diffusion models.

Abstract: We present an ultra-efficient post-training method for shortcutting large-scale pre-trained flow matching diffusion models into efficient few-step samplers, enabled by novel velocity field self-distillation. While shortcutting in flow matching, originally introduced by shortcut models, offers flexible trajectory-skipping capabilities, it requires a specialized step-size embedding incompatible with existing models unless retraining from scratch, a process nearly as costly as pretraining itself. Our key contribution is thus imparting a more aggressive shortcut mechanism to standard flow matching models (e.g., Flux), leveraging a unique distillation principle that obviates the need for step-size embedding. Working on the velocity field rather than sample space and learning rapidly from self-guided distillation in an online manner, our approach trains efficiently, e.g., producing a 3-step Flux in less than one A100 day. Beyond distillation, our method can be incorporated into the pretraining stage itself, yielding models that inherently learn efficient, few-step flows without compromising quality. This capability also enables, to our knowledge, the first few-shot distillation method (e.g., 10 text-image pairs) for dozen-billion-parameter diffusion models, delivering state-of-the-art performance at almost free cost.

[125] 3D Audio-Visual Segmentation

Artem Sokolov, Swapnil Bhosale, Xiatian Zhu

Main category: cs.CV

TL;DR: The paper introduces 3D Audio-Visual Segmentation, extending 2D AVS to 3D space, and proposes EchoSegnet with a new benchmark 3DAVS-S34-O7 for evaluating 3D sounding object segmentation.

DetailsMotivation: Current Audio-Visual Segmentation (AVS) is limited to 2D images, missing the mapping to 3D scenes needed for real-world applications in robotics and AR/VR/MR.

Method: Created 3DAVS-S34-O7 benchmark using Habitat simulator for spatial audio annotations, and proposed EchoSegnet integrating pretrained 2D audio-visual models with 3D scene representation through spatial audio-aware mask alignment.

Result: EchoSegnet effectively segments sounding objects in 3D space on the new benchmark, demonstrating significant advancement in embodied AI.

Conclusion: The work successfully extends AVS to 3D space, addressing fundamental limitations of 2D approaches and enabling more realistic applications in embodied AI systems.

Abstract: Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR. To that end, Audio-Visual Segmentation (AVS), taking as condition an audio signal to identify the masks of the target sounding objects in an input image with synchronous camera and microphone sensors, has been recently advanced. However, this paradigm is still insufficient for real-world operation, as the mapping from 2D images to 3D scenes is missing. To address this fundamental limitation, we introduce a novel research problem, 3D Audio-Visual Segmentation, extending the existing AVS to the 3D output space. This problem poses more challenges due to variations in camera extrinsics, audio scattering, occlusions, and diverse acoustics across sounding object categories. To facilitate this research, we create the very first simulation-based benchmark, 3DAVS-S34-O7, providing photorealistic 3D scene environments with grounded spatial audio under single-instance and multi-instance settings, across 34 scenes and 7 object categories. This is made possible by re-purposing the Habitat simulator to generate comprehensive annotations of sounding object locations and corresponding 3D masks. Subsequently, we propose a new approach, EchoSegnet, characterized by integrating the ready-to-use knowledge from pretrained 2D audio-visual foundation models synergistically with 3D visual scene representation through spatial audio-aware mask alignment and refinement. Extensive experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI. Project page: https://x-up-lab.github.io/research/3d-audio-visual-segmentation/

[126] Robotic Classification of Divers’ Swimming States using Visual Pose Keypoints as IMUs

Demetrious T. Kutzke, Ying-Kun Wu, Elizabeth Terveen, Junaed Sattar

Main category: cs.CV

TL;DR: A hybrid computer vision approach creates pseudo-IMU data from 3D joint keypoints to monitor scuba diver safety underwater, overcoming wireless signal limitations of traditional sensors.

DetailsMotivation: Traditional activity recognition methods fail underwater due to wireless signal attenuation, creating safety risks for scuba divers who face medical emergencies like cardiac arrest.

Method: Computer vision generates high-fidelity motion data from 3D human joint keypoints, creating a “pseudo-IMU” that bypasses underwater wireless communication issues with AUVs.

Result: The system successfully identifies anomalous diver behavior signaling medical emergencies and was integrated onboard an AUV for real-time monitoring in simulated distress scenarios.

Conclusion: This hybrid vision-based approach effectively advances robotic monitoring capabilities for scuba diver safety by overcoming underwater communication limitations.

Abstract: Traditional human activity recognition uses either direct image analysis or data from wearable inertial measurement units (IMUs), but can be ineffective in challenging underwater environments. We introduce a novel hybrid approach that bridges this gap to monitor scuba diver safety. Our method leverages computer vision to generate high-fidelity motion data, effectively creating a “pseudo-IMU” from a stream of 3D human joint keypoints. This technique circumvents the critical problem of wireless signal attenuation in water, which plagues conventional diver-worn sensors communicating with an Autonomous Underwater Vehicle (AUV). We apply this system to the vital task of identifying anomalous scuba diver behavior that signals the onset of a medical emergency such as cardiac arrest, a leading cause of scuba diving fatalities. By integrating our classifier onboard an AUV and conducting experiments with simulated distress scenarios, we demonstrate the utility and effectiveness of our method for advancing robotic monitoring and diver safety.
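A pseudo-IMU of this kind can be approximated by finite-differencing the keypoint stream twice to obtain acceleration-like channels. A minimal sketch, assuming a (T, J, 3) keypoint array; the paper's exact feature set may differ:

```python
import numpy as np

def pseudo_imu(keypoints: np.ndarray, fps: float = 30.0) -> np.ndarray:
    """Turn a (T, J, 3) stream of 3D joint positions into IMU-like linear
    acceleration channels via second-order finite differences."""
    dt = 1.0 / fps
    velocity = np.gradient(keypoints, dt, axis=0)      # (T, J, 3), m/s
    acceleration = np.gradient(velocity, dt, axis=0)   # (T, J, 3), m/s^2
    return acceleration

# Fake 10 s of smooth motion for 17 joints.
traj = np.cumsum(np.random.randn(300, 17, 3) * 0.01, axis=0)
print(pseudo_imu(traj).shape)  # (300, 17, 3)
```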

[127] LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding

ZhaoYang Han, Qihan Lin, Hao Liang, Bowen Chen, Zhou Liu, Wentao Zhang

Main category: cs.CV

TL;DR: LongInsightBench is the first benchmark for evaluating models’ ability to understand long videos across visual, audio, and text modalities, focusing on human language, viewpoints, and actions in information-dense content like lectures and interviews.

DetailsMotivation: To address the lack of benchmarks for assessing models' understanding of long videos that integrate multiple modalities and contain rich contextual elements like human language and actions.

Method: Created a benchmark with ~1,000 carefully selected long-duration videos from FineVideo dataset, designed six challenging task scenarios (Intra-Event and Inter-Event Tasks), and implemented a three-step semi-automated quality assurance pipeline for question and answer validation.

Result: Experimental results show that omni-modal models still struggle with tasks requiring precise temporal localization and long-range causal inference. Extended experiments reveal information loss and processing bias in multi-modal fusion.

Conclusion: LongInsightBench provides a comprehensive benchmark for evaluating long video understanding capabilities, revealing current limitations in omni-modal models and highlighting challenges in multi-modal fusion for complex temporal reasoning tasks.

Abstract: We introduce \textbf{LongInsightBench}, the first benchmark designed to assess models’ ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating \textbf{visual, audio, and text} modalities. Our benchmark excels in three key areas: \textbf{a) Long-Duration, Information-Dense Videos:} We carefully select approximately 1,000 videos from open-source datasets FineVideo based on duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. \textbf{b) Diverse and Challenging Task Scenarios:} We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. \textbf{c) Rigorous and Comprehensive Quality Assurance Pipelines:} We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results show that omni-modal models (OLMs) still face challenges in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code are available at https://anonymous.4open.science/r/LongInsightBench-910F/.

[128] InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation

Jungmin Lee, Seonghyuk Hong, Juyong Lee, Jaeyoon Lee, Jongwon Choi

Main category: cs.CV

TL;DR: InsideOut extends 3D Gaussian splatting to fuse RGB surface details with subsurface X-ray structures, enabling enhanced visualization and non-destructive testing across medical, cultural heritage, and manufacturing domains.

DetailsMotivation: To bridge the gap between high-fidelity RGB surface details and subsurface X-ray structures, which is valuable for medical diagnostics, cultural heritage restoration, and manufacturing applications.

Method: Collect paired RGB and X-ray data, perform hierarchical fitting to align RGB and X-ray radiative Gaussian splats, and propose an X-ray reference loss to ensure consistent internal structures.

Result: InsideOut effectively addresses challenges of disparate data representations between modalities and limited paired datasets, extending 3DGS applicability.

Conclusion: This approach significantly enhances visualization, simulation, and non-destructive testing capabilities across various domains by fusing RGB and X-ray imaging.

Abstract: We introduce InsideOut, an extension of 3D Gaussian splatting (3DGS) that bridges the gap between high-fidelity RGB surface details and subsurface X-ray structures. The fusion of RGB and X-ray imaging is invaluable in fields such as medical diagnostics, cultural heritage restoration, and manufacturing. We collect new paired RGB and X-ray data, perform hierarchical fitting to align RGB and X-ray radiative Gaussian splats, and propose an X-ray reference loss to ensure consistent internal structures. InsideOut effectively addresses the challenges posed by disparate data representations between the two modalities and limited paired datasets. This approach significantly extends the applicability of 3DGS, enhancing visualization, simulation, and non-destructive testing capabilities across various domains.

[129] MUSE: Model-based Uncertainty-aware Similarity Estimation for zero-shot 2D Object Detection and Segmentation

Sungmin Cho, Sungbum Park, Insoo Oh

Main category: cs.CV

TL;DR: MUSE is a training-free framework for zero-shot 2D object detection and segmentation that uses 3D object templates and joint similarity metrics with uncertainty-aware priors, achieving state-of-the-art performance on BOP Challenge 2025.

DetailsMotivation: To create a powerful and generalizable framework for zero-shot 2D object detection and segmentation without requiring additional training or fine-tuning, addressing the challenge of detecting unseen objects.

Method: Uses 2D multi-view templates from 3D unseen objects and 2D object proposals. Integrates class and patch embeddings with GeM pooling, employs joint similarity metrics (absolute + relative), and refines scores with uncertainty-aware object priors.

Result: Achieved state-of-the-art performance on BOP Challenge 2025, ranking first across Classic Core, H3, and Industrial tracks without any training or fine-tuning.

Conclusion: MUSE provides an effective training-free solution for zero-shot 2D object detection and segmentation, demonstrating strong generalization capabilities and robust performance across challenging scenarios.

Abstract: In this work, we introduce MUSE (Model-based Uncertainty-aware Similarity Estimation), a training-free framework designed for model-based zero-shot 2D object detection and segmentation. MUSE leverages 2D multi-view templates rendered from 3D unseen objects and 2D object proposals extracted from input query images. In the embedding stage, it integrates class and patch embeddings, where the patch embeddings are normalized using generalized mean pooling (GeM) to capture both global and local representations efficiently. During the matching stage, MUSE employs a joint similarity metric that combines absolute and relative similarity scores, enhancing the robustness of matching under challenging scenarios. Finally, the similarity score is refined through an uncertainty-aware object prior that adjusts for proposal reliability. Without any additional training or fine-tuning, MUSE achieves state-of-the-art performance on the BOP Challenge 2025, ranking first across the Classic Core, H3, and Industrial tracks. These results demonstrate that MUSE offers a powerful and generalizable framework for zero-shot 2D object detection and segmentation.
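GeM pooling, the one fully specified ingredient here, interpolates between average and max pooling via a power parameter. A small sketch with the common default p = 3 (the value of p and the input shape are assumptions):

```python
import torch

def gem_pool(patch_embs: torch.Tensor, p: float = 3.0,
             eps: float = 1e-6) -> torch.Tensor:
    """Generalized mean (GeM) pooling over patch embeddings, (N, D) -> (D,).
    p = 1 recovers average pooling; p -> inf approaches max pooling."""
    return patch_embs.clamp(min=eps).pow(p).mean(dim=0).pow(1.0 / p)

embs = torch.rand(196, 768)      # e.g., ViT patch embeddings for one template view
print(gem_pool(embs).shape)      # torch.Size([768])
```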

[130] GAN-based Content-Conditioned Generation of Handwritten Musical Symbols

Gerard Asbert, Pau Torras, Lei Kang, Alicia Fornés, Josep Lladós

Main category: cs.CV

TL;DR: This paper addresses the scarcity of annotated data in Optical Music Recognition (OMR) by generating synthetic handwritten musical scores using a GAN and Smashcima software.

DetailsMotivation: The field of OMR lacks sufficient real annotated data, especially for handwritten historical scores, limiting training of recognition models. Synthetic data generation has proven successful in similar fields like Handwritten Text Recognition.

Method: Implemented a music symbol-level Generative Adversarial Network (GAN) to generate realistic handwritten music symbols, then assembled these symbols into full musical scores using the Smashcima engraving software.

Result: The generated symbols exhibited high visual fidelity and realism, representing significant progress in synthetic score generation for OMR applications.

Conclusion: The proposed approach successfully generates realistic synthetic handwritten musical scores, which can help address the data scarcity problem in Optical Music Recognition training.

Abstract: The field of Optical Music Recognition (OMR) is currently hindered by the scarcity of real annotated data, particularly when dealing with handwritten historical musical scores. In similar fields, such as Handwritten Text Recognition, it was proven that synthetic examples produced with image generation techniques could help to train better-performing recognition architectures. This study explores the generation of realistic, handwritten-looking scores by implementing a music symbol-level Generative Adversarial Network (GAN) and assembling its output into a full score using the Smashcima engraving software. We have systematically evaluated the visual fidelity of these generated samples, concluding that the generated symbols exhibit a high degree of realism, marking significant progress in synthetic score generation.

[131] Auditing and Mitigating Bias in Gender Classification Algorithms: A Data-Centric Approach

Tadesse K Bahiru, Natnael Tilahun Sinshaw, Teshager Hailemariam Moges, Dheeraj Kumar Singh

Main category: cs.CV

TL;DR: The paper addresses gender classification bias by creating BalancedFace, a dataset engineered for demographic balance across 189 intersections of age, race, and gender, which significantly improves fairness metrics with minimal accuracy loss.

DetailsMotivation: Gender classification systems amplify demographic imbalances from training data, with existing datasets showing significant intersectional underrepresentation and resulting biased models.

Method: Audited five gender classification datasets, trained identical MobileNetV2 classifiers on the most balanced datasets, then constructed BalancedFace by blending images from FairFace and UTKFace supplemented with other collections to fill demographic gaps.

Result: BalancedFace reduces maximum True Positive Rate gap across racial subgroups by over 50% and brings average Disparate Impact score 63% closer to ideal 1.0 compared to next-best dataset, with minimal overall accuracy loss.

Conclusion: Data-centric interventions like BalancedFace provide significant fairness improvements and offer an openly available resource for fair gender classification research.

Abstract: Gender classification systems often inherit and amplify demographic imbalances in their training data. We first audit five widely used gender classification datasets, revealing that all suffer from significant intersectional underrepresentation. To measure the downstream impact of these flaws, we train identical MobileNetV2 classifiers on the two most balanced of these datasets, UTKFace and FairFace. Our fairness evaluation shows that even these models exhibit significant bias, misclassifying female faces at a higher rate than male faces and amplifying existing racial skew. To counter these data-induced biases, we construct BalancedFace, a new public dataset created by blending images from FairFace and UTKFace, supplemented with images from other collections to fill missing demographic gaps. It is engineered to equalize subgroup shares across 189 intersections of age, race, and gender using only real, unedited images. When a standard classifier is trained on BalancedFace, it reduces the maximum True Positive Rate gap across racial subgroups by over 50% and brings the average Disparate Impact score 63% closer to the ideal of 1.0 compared to the next-best dataset, all with a minimal loss of overall accuracy. These results underline the profound value of data-centric interventions and provide an openly available resource for fair gender classification research.
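The two fairness metrics quoted, the True Positive Rate gap and Disparate Impact, have standard definitions; assuming the paper uses them in the usual way, they compute as follows:

```python
import numpy as np

def tpr_gap(y_true, y_pred, groups) -> float:
    """Max-min gap in true positive rate across subgroups (binary labels)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    tprs = []
    for g in np.unique(groups):
        mask = (groups == g) & (y_true == 1)   # positives in this subgroup
        tprs.append((y_pred[mask] == 1).mean())
    return max(tprs) - min(tprs)

def disparate_impact(y_pred, groups, positive=1) -> float:
    """Ratio of lowest to highest positive-prediction rate; 1.0 is ideal."""
    y_pred, groups = np.asarray(y_pred), np.asarray(groups)
    rates = [(y_pred[groups == g] == positive).mean() for g in np.unique(groups)]
    return min(rates) / max(rates)

y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
groups = ["a", "a", "a", "b", "b", "b"]
print(tpr_gap(y_true, y_pred, groups))   # 0.5
print(disparate_impact(y_pred, groups))  # 0.333...
```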

[132] 3D Weakly Supervised Semantic Segmentation via Class-Aware and Geometry-Guided Pseudo-Label Refinement

Xiaoxu Xu, Xuexun Liu, Jinlong Li, Yitian Yuan, Qiudan Zhang, Lin Ma, Nicu Sebe, Xu Wang

Main category: cs.CV

TL;DR: A novel 3D weakly supervised semantic segmentation method that integrates geometric priors with class-aware guidance to generate high-quality pseudo labels through iterative refinement and self-training.

DetailsMotivation: To overcome limitations in existing 3D WSSS methods that suffer from low-quality pseudo-labels and insufficient exploitation of 3D geometric priors, which create significant bottlenecks in developing high-performance models.

Method: Proposes a three-component approach: 1) Class-Aware Label Refinement for balanced and accurate pseudo labels, 2) Geometry-Aware Label Refinement using implicit 3D geometric constraints to filter low-confidence labels, and 3) Label Update strategy with Self-Training to propagate labels into unlabeled regions.

Result: Achieves state-of-the-art performance on ScanNet and S3DIS benchmarks, demonstrating remarkable generalization capability in unsupervised settings while maintaining competitive accuracy.

Conclusion: The proposed methodology effectively integrates geometric priors with class-aware guidance to generate high-fidelity pseudo labels, enabling the development of high-performance 3D weakly supervised semantic segmentation models through robust iterative refinement.

Abstract: 3D weakly supervised semantic segmentation (3D WSSS) aims to achieve semantic segmentation by leveraging sparse or low-cost annotated data, significantly reducing reliance on dense point-wise annotations. Previous works mainly employ class activation maps or pre-trained vision-language models to address this challenge. However, the low quality of pseudo-labels and the insufficient exploitation of 3D geometric priors jointly create significant technical bottlenecks in developing high-performance 3D WSSS models. In this paper, we propose a simple yet effective 3D weakly supervised semantic segmentation method that integrates 3D geometric priors into a class-aware guidance mechanism to generate high-fidelity pseudo labels. Concretely, our designed methodology first employs a Class-Aware Label Refinement module to generate more balanced and accurate pseudo labels for semantic categories. This initial refinement stage focuses on enhancing label quality through category-specific optimization. Subsequently, the Geometry-Aware Label Refinement component is developed, which strategically integrates implicit 3D geometric constraints to effectively filter out low-confidence pseudo labels that fail to comply with geometric plausibility. Moreover, to address the challenge of extensive unlabeled regions, we propose a Label Update strategy that integrates Self-Training to propagate labels into these areas. This iterative process continuously enhances pseudo-label quality while expanding label coverage, ultimately fostering the development of high-performance 3D WSSS models. Comprehensive experimental validation reveals that our proposed methodology achieves state-of-the-art performance on both ScanNet and S3DIS benchmarks while demonstrating remarkable generalization capability in unsupervised settings, maintaining competitive accuracy through its robust design.
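The geometry-aware refinement is described only at a high level; one plausible stand-in is a neighborhood-consistency filter that drops low-confidence pseudo labels whose spatial neighbors disagree. A sketch under that assumption (k, the confidence threshold, and the majority rule are all illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def geometry_filter(points: np.ndarray, pseudo_labels: np.ndarray,
                    confidence: np.ndarray, k: int = 8,
                    conf_thresh: float = 0.5) -> np.ndarray:
    """Keep a low-confidence pseudo label only if the majority of its k
    spatial neighbors carry the same label."""
    tree = cKDTree(points)
    _, nbr = tree.query(points, k=k + 1)        # column 0 is the point itself
    keep = np.ones(len(points), dtype=bool)
    for i in np.where(confidence < conf_thresh)[0]:
        agree = (pseudo_labels[nbr[i, 1:]] == pseudo_labels[i]).mean()
        keep[i] = agree >= 0.5
    return keep

pts = np.random.rand(1000, 3)
lbls = np.random.randint(0, 5, size=1000)
conf = np.random.rand(1000)
print(geometry_filter(pts, lbls, conf).sum(), "of 1000 pseudo labels kept")
```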

[133] GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction

Eya Cherif, Arthur Ouaknine, Luke A. Brown, Phuong D. Dao, Kyle R. Kovach, Bing Lu, Daniel Mederer, Hannes Feilhauer, Teja Kattenborn, David Rolnick

Main category: cs.CV

TL;DR: GreenHyperSpectra is a pretraining dataset for plant trait prediction using hyperspectral data, addressing label scarcity and domain shifts across sensors and ecosystems through semi- and self-supervised methods.

DetailsMotivation: Conventional field sampling cannot cover trait variation at ecologically meaningful spatial scales, and machine learning approaches face challenges with label scarcity and domain shifts across different sensors and ecological distributions.

Method: Created GreenHyperSpectra dataset with real-world cross-sensor and cross-ecosystem samples, used for pretraining label-efficient multi-output regression models with semi- and self-supervised learning methods.

Result: Pretrained models outperformed state-of-the-art supervised baseline, showing substantial improvements in learning spectral representations for trait prediction across both in-distribution and out-of-distribution scenarios.

Conclusion: Established a comprehensive methodological framework that advances research at the intersection of representation learning and plant functional traits assessment, with all code and data publicly available.

Abstract: Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (e.g., across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.

[134] Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods

Ghazal Danaee, Marc Niethammer, Jarrett Rushmore, Sylvain Bouix

Main category: cs.CV

TL;DR: This study evaluates fairness in deep learning segmentation models for nucleus accumbens (NAc) in MRI images across demographic subgroups, finding that race-matched training improves accuracy for some models while nnU-Net shows demographic robustness.

DetailsMotivation: Address growing concerns about unfairness and performance disparities in medical image segmentation based on sensitive attributes like race and sex, particularly in deep learning models applied to structural MRI analysis.

Method: Evaluated three segmentation models (UNesT, nnU-Net, CoTr) and traditional atlas-based method (ANTs) on NAc segmentation using manually labeled gold-standard data from four demographic subgroups. Used fairness metrics and linear mixed models to analyze demographic effects on segmentation accuracy and derived volumes.

Result: Training on race-matched data significantly improved segmentation accuracy for ANTs and UNesT, while nnU-Net performed robustly regardless of demographic matching. Sex effects from manual segmentation were preserved in biased models, but race effects disappeared in all but one model.

Conclusion: Demographic matching in training data affects segmentation performance differently across models, with nnU-Net showing superior fairness characteristics. Biased models can preserve sex effects but often eliminate race effects observed in manual segmentations.

Abstract: Deep-learning-based segmentation algorithms have substantially advanced the field of medical image analysis, particularly in structural delineations in MRIs. However, an important consideration is the intrinsic bias in the data. Concerns about unfairness, such as performance disparities based on sensitive attributes like race and sex, are increasingly urgent. In this work, we evaluate the results of three different segmentation models (UNesT, nnU-Net, and CoTr) and a traditional atlas-based method (ANTs), applied to segment the left and right nucleus accumbens (NAc) in MRI images. We utilize a dataset including four demographic subgroups: black female, black male, white female, and white male. We employ manually labeled gold-standard segmentations to train and test segmentation models. This study consists of two parts: the first assesses the segmentation performance of models, while the second measures the volumes they produce to evaluate the effects of race, sex, and their interaction. Fairness is quantitatively measured using a metric designed to quantify fairness in segmentation performance. Additionally, linear mixed models analyze the impact of demographic variables on segmentation accuracy and derived volumes. Training on the same race as the test subjects leads to significantly better segmentation accuracy for some models. ANTs and UNesT show notable improvements in segmentation accuracy when trained and tested on race-matched data, unlike nnU-Net, which demonstrates robust performance independent of demographic matching. Finally, we examine sex and race effects on the volume of the NAc using segmentations from the manual rater and from our biased models. Results reveal that the sex effects observed with manual segmentation can also be observed with biased models, whereas the race effects disappear in all but one model.
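
The study quantifies fairness with a dedicated metric and linear mixed models; as a minimal stand-in for the evaluation step, the sketch below computes per-subgroup mean Dice and the worst-case gap between subgroups (subgroup names and scores are illustrative):

```python
# Toy fairness summary: mean Dice per demographic subgroup plus the
# max-min gap; the paper's actual fairness metric is more involved.
import numpy as np

def dice(pred, gt):
    """pred, gt: boolean masks of equal shape."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def fairness_gap(scores_by_group):
    """scores_by_group: dict of subgroup name -> list of Dice scores."""
    means = {g: float(np.mean(s)) for g, s in scores_by_group.items()}
    return means, max(means.values()) - min(means.values())

means, gap = fairness_gap({
    "black_female": [0.82, 0.85], "black_male": [0.80, 0.83],
    "white_female": [0.86, 0.88], "white_male": [0.84, 0.87],
})
print(means, gap)
```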

[135] ViBED-Net: Video Based Engagement Detection Network Using Face-Aware and Scene-Aware Spatiotemporal Cues

Prateek Gothwal, Deeptimaan Banerjee, Ashis Kumer Biswas

Main category: cs.CV

TL;DR: ViBED-Net is a dual-stream deep learning framework that detects student engagement from video by combining facial expressions and scene context using EfficientNetV2 for spatial features and LSTM/Transformers for temporal modeling, achieving 73.43% accuracy on DAiSEE dataset.

DetailsMotivation: Engagement detection in online learning is crucial for improving student outcomes and personalizing instruction, requiring effective video-based affective computing solutions.

Method: Dual-stream architecture using EfficientNetV2 for spatial feature extraction from facial crops and full video frames, with LSTM and Transformer encoders for temporal modeling, plus targeted data augmentation for underrepresented classes.

Result: ViBED-Net with LSTM achieves 73.43% accuracy on DAiSEE dataset, outperforming existing state-of-the-art approaches.

Conclusion: Combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection, offering a scalable, high-performing solution for real-world applications in education and user experience research.

Abstract: Engagement detection in online learning environments is vital for improving student outcomes and personalizing instruction. We present ViBED-Net (Video-Based Engagement Detection Network), a novel deep learning framework designed to assess student engagement from video data using a dual-stream architecture. ViBED-Net captures both facial expressions and full-scene context by processing facial crops and entire video frames through EfficientNetV2 for spatial feature extraction. These features are then analyzed over time using two temporal modeling strategies: Long Short-Term Memory (LSTM) networks and Transformer encoders. Our model is evaluated on the DAiSEE dataset, a large-scale benchmark for affective state recognition in e-learning. To enhance performance on underrepresented engagement classes, we apply targeted data augmentation techniques. Among the tested variants, ViBED-Net with LSTM achieves 73.43% accuracy, outperforming existing state-of-the-art approaches. ViBED-Net demonstrates that combining face-aware and scene-aware spatiotemporal cues significantly improves engagement detection accuracy. Its modular design allows flexibility for application across education, user experience research, and content personalization. This work advances video-based affective computing by offering a scalable, high-performing solution for real-world engagement analysis. The source code for this project is available at https://github.com/prateek-gothwal/ViBED-Net.
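
The dual-stream design is concrete enough to sketch. Below is a minimal PyTorch layout in the spirit of ViBED-Net, assuming EfficientNetV2-S backbones, a single LSTM, and four engagement levels; none of these hyperparameters are confirmed by the paper:

```python
# Minimal dual-stream engagement classifier (assumed layout, not the
# authors' code): per-frame spatial features from face crops and full
# frames, temporal modeling with an LSTM, classification from the last step.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s

class DualStreamEngagement(nn.Module):
    def __init__(self, num_classes=4, hidden=256):
        super().__init__()
        self.face_enc = efficientnet_v2_s(weights=None).features
        self.scene_enc = efficientnet_v2_s(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.lstm = nn.LSTM(2 * 1280, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def encode(self, enc, x):                   # x: (B, T, 3, H, W)
        b, t = x.shape[:2]
        f = self.pool(enc(x.flatten(0, 1)))     # (B*T, 1280, 1, 1)
        return f.flatten(1).view(b, t, -1)      # (B, T, 1280)

    def forward(self, faces, frames):
        feats = torch.cat([self.encode(self.face_enc, faces),
                           self.encode(self.scene_enc, frames)], dim=-1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])            # logits per engagement level
```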

[136] SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

Main category: cs.CV

TL;DR: SAVANT is a structured reasoning framework that uses Vision Language Models to detect anomalous driving scenarios through layered scene analysis, achieving high accuracy with open-source models.

DetailsMotivation: Autonomous driving systems are vulnerable to rare, out-of-distribution scenarios with semantic anomalies, and current VLM approaches are unreliable and expensive.

Method: Two-phase pipeline with structured scene description extraction and multi-modal evaluation across four semantic layers: Street, Infrastructure, Movable Objects, and Environment.

Result: Achieved 89.6% recall and 88.0% accuracy on real-world driving scenarios, with fine-tuned 7B model reaching 90.8% recall and 93.8% accuracy, outperforming all baselines.

Conclusion: SAVANT enables reliable, accessible semantic monitoring for autonomous systems by addressing data scarcity through automatic labeling of real-world images.

Abstract: Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution scenarios with semantic anomalies. While Vision Language Models (VLMs) offer promising reasoning capabilities, naive prompting approaches yield unreliable performance and depend on expensive proprietary models, limiting practical deployment. We introduce SAVANT (Semantic Analysis with Vision-Augmented Anomaly deTection), a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios from input images through layered scene analysis and a two-phase pipeline: structured scene description extraction followed by multi-modal evaluation. Our approach transforms VLM reasoning from ad-hoc prompting to systematic analysis across four semantic layers: Street, Infrastructure, Movable Objects, and Environment. SAVANT achieves 89.6% recall and 88.0% accuracy on real-world driving scenarios, significantly outperforming unstructured baselines. More importantly, we demonstrate that our structured framework enables a fine-tuned 7B parameter open-source model (Qwen2.5VL) to achieve 90.8% recall and 93.8% accuracy, surpassing all models evaluated while enabling local deployment at near-zero cost. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection and provides a practical path toward reliable, accessible semantic monitoring for autonomous systems.
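
The two-phase pipeline reduces to a simple control flow. The schematic below is a hypothetical skeleton with placeholder `vlm.describe`/`vlm.evaluate` calls, not SAVANT's actual interface:

```python
# Schematic of layered scene analysis: phase 1 extracts a structured
# description per semantic layer, phase 2 evaluates image + description.
LAYERS = ["Street", "Infrastructure", "Movable Objects", "Environment"]

def detect_anomaly(image, vlm):
    description = {layer: vlm.describe(image, focus=layer) for layer in LAYERS}
    verdict = vlm.evaluate(image, description)   # multi-modal evaluation
    return verdict, description
```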

[137] HouseTour: A Virtual Real Estate A(I)gent

Ata Çelen, Marc Pollefeys, Daniel Barath, Iro Armeni

Main category: cs.CV

TL;DR: HouseTour generates 3D camera trajectories and natural language summaries from image collections, using diffusion models constrained by camera poses and 3D Gaussian splatting for video synthesis.

DetailsMotivation: Existing vision-language models struggle with geometric reasoning, limiting their ability to create spatially-aware video tours from 3D spaces.

Method: Uses diffusion process for smooth camera trajectories constrained by known poses, integrates 3D information into VLM for descriptions, and synthesizes videos with 3D Gaussian splatting. Introduces HouseTour dataset with 1,200+ tour videos.

Result: Incorporating 3D camera trajectories improves text generation performance over independent task methods. Evaluated with new joint metric showing better end-to-end performance.

Conclusion: Enables automated professional-quality video creation for real estate and tourism without specialized expertise or equipment.

Abstract: We introduce HouseTour, a method for spatially-aware 3D camera trajectory and natural language summary generation from a collection of images depicting an existing 3D space. Unlike existing vision-language models (VLMs), which struggle with geometric reasoning, our approach generates smooth video trajectories via a diffusion process constrained by known camera poses and integrates this information into the VLM for 3D-grounded descriptions. We synthesize the final video using 3D Gaussian splatting to render novel views along the trajectory. To support this task, we present the HouseTour dataset, which includes over 1,200 house-tour videos with camera poses, 3D reconstructions, and real estate descriptions. Experiments demonstrate that incorporating 3D camera trajectories into the text generation process improves performance over methods handling each task independently. We evaluate both individual and end-to-end performance, introducing a new joint metric. Our work enables automated, professional-quality video creation for real estate and tourism applications without requiring specialized expertise or equipment.

[138] Chimera: Compositional Image Generation using Part-based Concepting

Shivam Singh, Yiming Chen, Agneet Chatterjee, Amit Raj, James Hays, Yezhou Yang, Chitra Baral

Main category: cs.CV

TL;DR: Chimera is a personalized image generation model that creates novel objects by combining specified parts from different source images using textual instructions, outperforming baselines in part alignment and visual quality.

DetailsMotivation: Existing personalized image generative models lack explicit control for composing objects from specific parts of multiple source images without user-specified masks or annotations.

Method: Constructed a dataset from 464 unique (part, subject) semantic atoms, generated 37k prompts, trained a custom diffusion prior model with part-conditional guidance to enforce semantic identity and spatial layout.

Result: Outperforms other baselines by 14% in part alignment and compositional accuracy and 21% in visual quality according to human evaluations and proposed PartEval metric.

Conclusion: Chimera successfully enables controlled composition of objects from multiple source images through textual instructions, demonstrating superior performance in part alignment and visual quality.

Abstract: Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user-specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From this, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce an objective metric PartEval to assess the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms other baselines by 14% in part alignment and compositional accuracy and 21% in visual quality.

[139] Big Data, Tiny Targets: An Exploratory Study in Machine Learning-enhanced Detection of Microplastic from Filters

Paul-Tiberiu Miclea, Martin Sboron, Hardik Vaghasiya, Hoang Thinh Nguyen, Meet Gadara, Thomas Schmid

Main category: cs.CV

TL;DR: This paper explores using machine learning with SEM imaging to detect microplastics, finding YOLO models effective but limited by data availability and preprocessing needs.

DetailsMotivation: Microplastics are hard to detect with traditional methods, and manual analysis prevents large-scale screening. Machine learning offers potential for automated detection.

Method: Combined SEM imaging with machine learning-based object detection using YOLO models, focusing on filtration scenarios with symmetric background patterns.

Result: Found differences in YOLO model performance and identified the importance of preprocessing optimization for reliable detection.

Conclusion: Machine learning shows promise for microplastic detection but faces challenges like limited expert-labeled training data and preprocessing requirements.

Abstract: Microplastics (MPs) are ubiquitous pollutants with demonstrated potential to impact ecosystems and human health. Their microscopic size complicates detection, classification, and removal, especially in biological and environmental samples. While techniques like optical microscopy, Scanning Electron Microscopy (SEM), and Atomic Force Microscopy (AFM) provide a sound basis for detection, applying these approaches usually requires manual analysis, which prevents their efficient use in large screening studies. To this end, machine learning (ML) has emerged as a powerful tool in advancing microplastic detection. In this exploratory study, we investigate potential, limitations and future directions of advancing the detection and quantification of MP particles and fibres using a combination of SEM imaging and machine learning-based object detection. For simplicity, we focus on a filtration scenario where image backgrounds exhibit a symmetric and repetitive pattern. Our findings indicate differences in the quality of YOLO models for the given task and the relevance of optimizing preprocessing. At the same time, we identify open challenges, such as limited amounts of expert-labeled data necessary for reliable training of ML models.
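
For readers who want to reproduce the detection step, a minimal inference call with the ultralytics YOLO API might look as follows; `mp_yolo.pt` is a hypothetical checkpoint fine-tuned on expert-labeled SEM images, which the paper identifies as the scarce resource:

```python
# Hypothetical YOLO inference on a SEM filter image (ultralytics API).
from ultralytics import YOLO

model = YOLO("mp_yolo.pt")                       # assumed fine-tuned weights
results = model.predict("sem_filter.png", conf=0.25, imgsz=1024)
for box in results[0].boxes:                     # detected particles/fibres
    print(int(box.cls), float(box.conf), box.xyxy.tolist())
```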

[140] Accelerating Vision Transformers with Adaptive Patch Sizes

Rohan Choudhury, JungEun Kim, Jinhyung Park, Eunho Yang, László A. Jeni, Kris M. Kitani

Main category: cs.CV

TL;DR: APT uses adaptive patch sizes in Vision Transformers to reduce input sequence length, achieving 40-50% speedup while maintaining performance.

DetailsMotivation: Standard ViTs use uniform patch sizes regardless of image content, leading to inefficient long input sequences for high-resolution images.

Method: APT allocates larger patch sizes in homogeneous areas and smaller patches in complex regions, reducing total input tokens while preserving important details.

Result: 40% speedup on ViT-L and 50% on ViT-H, maintains downstream performance, converges in 1 epoch on fine-tuned models, 30% faster training/inference for dense visual tasks.

Conclusion: Adaptive patch allocation effectively reduces ViT computational costs without performance loss, enabling efficient high-resolution vision processing.

Abstract: Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.
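
The allocation rule can be illustrated with a toy two-level scheme: keep a coarse patch where the region is homogeneous, otherwise split it. This is an illustrative quadtree-style stand-in, not APT's actual algorithm, and the threshold is made up:

```python
# Toy content-adaptive patching: one coarse token for flat 32x32 regions,
# four fine 16x16 tokens elsewhere.
import numpy as np

def adaptive_patches(img, coarse=32, fine=16, std_thresh=12.0):
    """img: (H, W) grayscale array; returns (y, x, size) patch triples."""
    patches = []
    for y in range(0, img.shape[0] - coarse + 1, coarse):
        for x in range(0, img.shape[1] - coarse + 1, coarse):
            if img[y:y + coarse, x:x + coarse].std() < std_thresh:
                patches.append((y, x, coarse))          # homogeneous region
            else:
                patches.extend((y + dy, x + dx, fine)   # complex region
                               for dy in (0, fine) for dx in (0, fine))
    return patches
```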

[141] From Volume Rendering to 3D Gaussian Splatting: Theory and Applications

Vitor Pereira Matias, Daniel Perazzo, Vinicius Silva, Alberto Raposo, Luiz Velho, Afonso Paiva, Tiago Novello

Main category: cs.CV

TL;DR: 3D Gaussian Splatting (3DGS) enables efficient 3D reconstruction from posed images through volumetric splatting, but faces challenges with memory usage, baked lighting effects, and limited secondary-ray effects.

DetailsMotivation: To provide a comprehensive overview of 3DGS pipeline and address its limitations while highlighting applications in surface reconstruction, avatar modeling, animation, and content generation.

Method: Uses explicit scene modeling as collections of 3D Gaussians with efficient rasterization through volumetric splatting, integrating with common graphics pipelines.

Result: 3DGS achieves real-time rendering capabilities for novel view synthesis but suffers from high memory footprint and limited secondary-ray effects.

Conclusion: 3DGS offers efficient rendering and suitability for feed-forward pipelines, with ongoing efforts to address its limitations across various applications.

Abstract: The problem of 3D reconstruction from posed images is undergoing a fundamental transformation, driven by continuous advances in 3D Gaussian Splatting (3DGS). By modeling scenes explicitly as collections of 3D Gaussians, 3DGS enables efficient rasterization through volumetric splatting, offering thus a seamless integration with common graphics pipelines. Despite its real-time rendering capabilities for novel view synthesis, 3DGS suffers from a high memory footprint, the tendency to bake lighting effects directly into its representation, and limited support for secondary-ray effects. This tutorial provides a concise yet comprehensive overview of the 3DGS pipeline, starting from its splatting formulation and then exploring the main efforts in addressing its limitations. Finally, we survey a range of applications that leverage 3DGS for surface reconstruction, avatar modeling, animation, and content generation, highlighting its efficient rendering and suitability for feed-forward pipelines.
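
For reference, the front-to-back compositing rule that 3DGS inherits from volume rendering, and that the tutorial's splatting formulation builds on, is the standard alpha blending over depth-sorted Gaussians:

```latex
C(p) = \sum_{i=1}^{N} c_i \, \alpha_i \prod_{j=1}^{i-1} \bigl(1 - \alpha_j\bigr),
```

where $c_i$ and $\alpha_i$ are the color and opacity contributed at pixel $p$ by the $i$-th Gaussian after projection onto the image plane.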

[142] Online In-Context Distillation for Low-Resource Vision Language Models

Zhiqi Kang, Rahaf Aljundi, Vaggelis Dorovatas, Karteek Alahari

Main category: cs.CV

TL;DR: Proposes In-Context Distillation (ICD) - an online method where small vision-language models collaborate with stronger teacher models at inference time using sparse demonstrations to bridge performance gaps without costly fine-tuning.

DetailsMotivation: Address the challenge of deploying vision-language models in low-resource, budget-constrained settings where large models are impractical and small models require expensive fine-tuning to match performance.

Method: Online In-Context Distillation framework with cross-modal demonstration selection, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate demonstration pool and minimize teacher queries.

Result: Significantly boosts small model performance (up to 33%) using scarce teacher annotations (as low as 4%), and competes with teacher’s zero-shot performance.

Conclusion: ICD provides an efficient alternative to fine-tuning for adapting small VLMs to low-resource settings through collaborative inference-time knowledge distillation.

Abstract: As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher’s zero-shot performance.
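
The ICD loop itself is simple to state. The sketch below is a schematic with placeholder objects (`pool`, `student`, `teacher`) rather than the authors' API: the student answers with retrieved demonstrations, and the teacher is queried only when student confidence falls below a threshold:

```python
# Schematic online In-Context Distillation loop (placeholder interfaces).
def online_icd(stream, student, teacher, pool, tau=0.5, k=4):
    for query in stream:
        demos = pool.retrieve(query, k=k)       # cross-modal demo selection
        answer, confidence = student.answer(query, demos=demos)
        if confidence < tau:                    # uncertainty conditioning
            answer = teacher.answer(query)      # sparse, costly teacher call
            pool.add(query, answer)             # grow the demonstration pool
        yield answer
```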

[143] SafeCoop: Unravelling Full Stack Safety in Agentic Collaborative Driving

Xiangbo Gao, Tzu-Hsiang Lin, Ruojing Song, Yuheng Wu, Kuan-Ru Huang, Zicheng Jin, Fangzhou Lin, Shinan Liu, Zhengzhong Tu

Main category: cs.CV

TL;DR: First systematic study of safety/security issues in natural-language-based collaborative driving, proposing SafeCoop defense pipeline with semantic firewall and multi-source consensus that achieves 69.15% driving score improvement under attacks.

DetailsMotivation: Traditional V2X systems face bandwidth demands, semantic loss, and interoperability issues. Natural language offers semantic richness and lower bandwidth but introduces new vulnerabilities like message loss, hallucinations, and adversarial attacks.

Method: Developed comprehensive attack taxonomy and SafeCoop defense pipeline with semantic firewall, language-perception consistency checks, multi-source consensus, and agentic transformation for cross-frame spatial alignment.

Result: Evaluated in CARLA simulation across 32 critical scenarios: 69.15% driving score improvement under malicious attacks and up to 67.32% F1 score for malicious detection.

Conclusion: Provides guidance for advancing safe, secure, and trustworthy language-driven collaboration in transportation systems.

Abstract: Collaborative driving systems leverage vehicle-to-everything (V2X) communication across multiple agents to enhance driving safety and efficiency. Traditional V2X systems take raw sensor data, neural features, or perception results as communication media, which face persistent challenges, including high bandwidth demands, semantic loss, and interoperability issues. Recent advances investigate natural language as a promising medium, which can provide semantic richness, decision-level reasoning, and human-machine interoperability at significantly lower bandwidth. Despite great promise, this paradigm shift also introduces new vulnerabilities within language communication, including message loss, hallucinations, semantic manipulation, and adversarial attacks. In this work, we present the first systematic study of full-stack safety and security issues in natural-language-based collaborative driving. Specifically, we develop a comprehensive taxonomy of attack strategies, including connection disruption, relay/replay interference, content spoofing, and multi-connection forgery. To mitigate these risks, we introduce an agentic defense pipeline, which we call SafeCoop, that integrates a semantic firewall, language-perception consistency checks, and multi-source consensus, enabled by an agentic transformation function for cross-frame spatial alignment. We systematically evaluate SafeCoop in closed-loop CARLA simulation across 32 critical scenarios, achieving 69.15% driving score improvement under malicious attacks and up to 67.32% F1 score for malicious detection. This study provides guidance for advancing research on safe, secure, and trustworthy language-driven collaboration in transportation systems. Our project page is https://xiangbogaobarry.github.io/SafeCoop.
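
One building block, multi-source consensus, can be sketched in a few lines; the quorum threshold and interfaces below are assumptions, not SafeCoop's implementation:

```python
# Toy multi-source consensus: accept a peer's claim only if enough
# independent sources (e.g., local perception, other peers) agree.
def consensus(claim, sources, agree_fn, quorum=0.5):
    votes = [agree_fn(claim, s) for s in sources]   # boolean agreement votes
    return sum(votes) / max(len(votes), 1) >= quorum
```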

[144] World-in-World: World Models in a Closed-Loop World

Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen

Main category: cs.CV

TL;DR: World-in-World is the first platform to benchmark generative world models for embodied agents using closed-loop evaluation, revealing that visual quality alone doesn’t ensure task success and controllability is more important.

DetailsMotivation: Current benchmarks for generative world models focus on open-loop protocols and visual quality, leaving the core question of whether WMs actually help agents succeed at embodied tasks unanswered.

Method: Introduced World-in-World platform with unified online planning strategy and standardized action API, curating four closed-loop environments to evaluate diverse WMs with task success as primary metric.

Result: Three key findings: (1) visual quality doesn’t guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading pretrained video generators; (3) more inference-time compute improves closed-loop performance.

Conclusion: The study demonstrates the importance of closed-loop evaluation for world models in embodied AI, showing that task-oriented metrics and controllability are more critical than visual quality alone.

Abstract: Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
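
Closed-loop evaluation means the world model's predictions feed a planner whose actions are executed in the environment, and only task success is scored. A generic skeleton of such a loop (placeholder interfaces, not the platform's actual action API) is:

```python
# Generic closed-loop episode: plan against world-model rollouts, act in
# the real environment, score task success only.
def closed_loop_episode(env, world_model, planner, max_steps=200):
    obs = env.reset()
    for _ in range(max_steps):
        candidates = planner.candidate_actions(obs)
        rollouts = world_model.imagine(obs, candidates)  # predicted futures
        obs, done, success = env.step(planner.select(rollouts))
        if done:
            return success
    return False
```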

[145] Adapting Stereo Vision From Objects To 3D Lunar Surface Reconstruction with the StereoLunar Dataset

Clementine Grethen, Simone Gasparini, Geraldine Morin, Jeremy Lebreton, Lucas Marti, Manuel Sanchez-Gestido

Main category: cs.CV

TL;DR: LunarStereo is the first open dataset of photorealistic stereo image pairs of the Moon, used to adapt MASt3R model for 3D reconstruction in lunar conditions through fine-tuning.

DetailsMotivation: Existing stereo vision methods struggle with lunar surface reconstruction due to Moon's lack of texture, difficult lighting variations, and atypical orbital trajectories. Deep learning models trained on human-scale datasets cannot be directly transferred to lunar conditions.

Method: Created the LunarStereo dataset using ray tracing based on high-resolution topography and reflectance models. Fine-tuned the MASt3R model on this dataset to adapt it to the lunar domain.

Result: Significant improvements over zero-shot baselines in 3D surface reconstruction and relative pose estimation, validated through extensive experiments on both synthetic and real lunar data.

Conclusion: The approach demonstrates robust cross-scale generalization in extraterrestrial environments and paves the way for improved lunar surface reconstruction.

Abstract: Accurate 3D reconstruction of lunar surfaces is essential for space exploration. However, existing stereo vision reconstruction methods struggle in this context due to the Moon’s lack of texture, difficult lighting variations, and atypical orbital trajectories. State-of-the-art deep learning models, trained on human-scale datasets, have rarely been tested on planetary imagery and cannot be transferred directly to lunar conditions. To address this issue, we introduce LunarStereo, the first open dataset of photorealistic stereo image pairs of the Moon, simulated using ray tracing based on high-resolution topography and reflectance models. It covers diverse altitudes, lighting conditions, and viewing angles around the lunar South Pole, offering physically grounded supervision for 3D reconstruction tasks. Based on this dataset, we adapt the MASt3R model to the lunar domain through fine-tuning on LunarStereo. We validate our approach through extensive qualitative and quantitative experiments on both synthetic and real lunar data, evaluating 3D surface reconstruction and relative pose estimation. The results demonstrate significant improvements over zero-shot baselines, paving the way for robust cross-scale generalization in extraterrestrial environments.

[146] VelocityNet: Real-Time Crowd Anomaly Detection via Person-Specific Velocity Analysis

Fatima AlGhamdi, Omar Alharbi, Abdullah Aldwyish, Raied Aljadaany, Muhammad Kamran J Khan, Huda Alamri

Main category: cs.CV

TL;DR: VelocityNet uses head detection and optical flow to extract person velocities, classifies them into motion categories, and detects anomalies via percentile-based scoring.

DetailsMotivation: Existing methods struggle with crowded scenes due to occlusions and dynamic motion patterns, lacking adaptability to varying densities and interpretable indicators.

Method: Dual-pipeline framework combining head detection and dense optical flow to extract person-specific velocities, followed by hierarchical clustering into semantic motion classes and percentile-based anomaly scoring.

Result: Effective real-time detection of diverse anomalous motion patterns in densely crowded environments.

Conclusion: VelocityNet successfully addresses limitations in crowd anomaly detection by providing interpretable motion classification and robust anomaly scoring.

Abstract: Detecting anomalies in crowded scenes is challenging due to severe inter-person occlusions and highly dynamic, context-dependent motion patterns. Existing approaches often struggle to adapt to varying crowd densities and lack interpretable anomaly indicators. To address these limitations, we introduce VelocityNet, a dual-pipeline framework that combines head detection and dense optical flow to extract person-specific velocities. Hierarchical clustering categorizes these velocities into semantic motion classes (halt, slow, normal, and fast), and a percentile-based anomaly scoring system measures deviations from learned normal patterns. Experiments demonstrate the effectiveness of our framework in real-time detection of diverse anomalous motion patterns within densely crowded environments.
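
Percentile-based scoring is straightforward to illustrate: fit low/high speed percentiles on normal footage, then score deviations outside that band. The exact percentiles and normalization below are illustrative, not the paper's:

```python
# Toy percentile-based anomaly score over per-person speeds (pixels/frame).
import numpy as np

def fit_normal_profile(train_speeds, low=1.0, high=99.0):
    return np.percentile(train_speeds, [low, high])     # (lo, hi) band

def anomaly_score(speed, profile):
    lo, hi = profile
    if speed < lo:                      # abnormal halt / slowdown
        return (lo - speed) / max(lo, 1e-6)
    if speed > hi:                      # abnormal running / fleeing
        return (speed - hi) / hi
    return 0.0                          # within learned normal band
```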

[147] RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology

Chengrun Li, Corentin Royer, Haozhe Luo, Bastian Wittmann, Xia Li, Ibrahim Hamamci, Sezgin Er, Anjany Sekuboyina, Bjoern Menze

Main category: cs.CV

TL;DR: A new medical vision-language model called RadDiagSeg-M that can jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions, addressing limitations of current models.

DetailsMotivation: Current medical vision language models struggle to generate both diagnostic text and segmentation masks simultaneously, which limits their clinical application value since practitioners need both modalities together for effective assistance.

Method: First introduced RadDiagSeg-D dataset combining abnormality detection, diagnosis and multi-target segmentation into unified hierarchical tasks, then developed RadDiagSeg-M model capable of joint abnormality detection, diagnosis and flexible segmentation.

Result: RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need to enrich contextual information for assistive diagnosis, and shows strong performance across all components of multi-target text-and-mask generation.

Conclusion: The proposed approach establishes a robust and competitive baseline for joint text and segmentation mask generation in medical vision-language tasks, overcoming limitations of existing models.

Abstract: Most current medical vision language models struggle to jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions. This represents a major limitation towards clinical application, as assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners. To alleviate this limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified and hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is precisely designed to support the development of models that produce descriptive text and corresponding segmentation masks in tandem. Subsequently, we leverage the dataset to propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation. RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need to enrich contextual information for assistive diagnosis. Finally, we benchmark RadDiagSeg-M and showcase its strong performance across all components involved in the task of multi-target text-and-mask generation, establishing a robust and competitive baseline.

[148] EMA-SAM: Exponential Moving-average for SAM-based PTMC Segmentation

Maryam Dialameh, Hossein Rajabzadeh, Jung Suk Sim, Hyock Ju Kwon

Main category: cs.CV

TL;DR: EMA-SAM is a lightweight extension of SAM-2 that adds a confidence-weighted exponential moving average pointer to achieve stable tumor tracking in ultrasound videos, improving segmentation accuracy while maintaining real-time performance.

DetailsMotivation: Papillary thyroid microcarcinoma (PTMC) is increasingly managed with radio-frequency ablation (RFA), but accurate lesion segmentation in ultrasound videos remains challenging due to low contrast, probe-induced motion, and heat-related artifacts. SAM-2's frame-independent design leads to unstable predictions and temporal drift in interventional ultrasound.

Method: EMA-SAM incorporates a confidence-weighted exponential moving average pointer into SAM-2’s memory bank, providing a stable latent prototype of the tumor across frames. This design preserves temporal coherence during probe pressure and bubble occlusion while rapidly adapting when clear evidence reappears.

Result: On the PTMC-RFA dataset (124 minutes, 13 patients), EMA-SAM improved maxDice from 0.82 (SAM-2) to 0.86 and maxIoU from 0.72 to 0.76, while reducing false positives by 29%. On external benchmarks including VTUS and colonoscopy video polyp datasets, it achieved consistent gains of 2-5 Dice points over SAM-2. The EMA pointer adds <0.1% FLOPs, preserving real-time throughput of ~30 FPS on a single A100 GPU.

Conclusion: EMA-SAM establishes a robust and efficient framework for stable tumor tracking, bridging the gap between foundation models and the stringent demands of interventional ultrasound.

Abstract: Papillary thyroid microcarcinoma (PTMC) is increasingly managed with radio-frequency ablation (RFA), yet accurate lesion segmentation in ultrasound videos remains difficult due to low contrast, probe-induced motion, and heat-related artifacts. The recent Segment Anything Model 2 (SAM-2) generalizes well to static images, but its frame-independent design yields unstable predictions and temporal drift in interventional ultrasound. We introduce EMA-SAM, a lightweight extension of SAM-2 that incorporates a confidence-weighted exponential moving average pointer into the memory bank, providing a stable latent prototype of the tumour across frames. This design preserves temporal coherence through probe pressure and bubble occlusion while rapidly adapting once clear evidence reappears. On our curated PTMC-RFA dataset (124 minutes, 13 patients), EMA-SAM improves maxDice from 0.82 (SAM-2) to 0.86 and maxIoU from 0.72 to 0.76, while reducing false positives by 29%. On external benchmarks, including VTUS and colonoscopy video polyp datasets, EMA-SAM achieves consistent gains of 2–5 Dice points over SAM-2. Importantly, the EMA pointer adds <0.1% FLOPs, preserving real-time throughput of ~30 FPS on a single A100 GPU. These results establish EMA-SAM as a robust and efficient framework for stable tumour tracking, bridging the gap between foundation models and the stringent demands of interventional ultrasound. Code is available at https://github.com/mdialameh/EMA-SAM.
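
The core mechanism is a one-line update. A minimal sketch of a confidence-weighted EMA pointer follows; the exact weighting rule and shapes are assumptions (the pointers may be latent tensors or arrays):

```python
# Confidence-weighted EMA of the memory pointer: low-confidence frames
# (probe pressure, bubbles) barely move the prototype; clear frames do.
def update_pointer(ema_ptr, new_ptr, confidence, base_momentum=0.9):
    w = (1.0 - base_momentum) * confidence   # effective weight in [0, 0.1]
    return (1.0 - w) * ema_ptr + w * new_ptr
```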

[149] VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety

Shruti Palaskar, Leon Gatys, Mona Abdelrahman, Mar Jacobo, Larry Lindsey, Rutika Moharir, Gunnar Lund, Yang Xu, Navid Shiee, Jeffrey Bigham, Charles Maalouf, Joseph Yitan Cheng

Main category: cs.CV

TL;DR: VLSU is a comprehensive framework that systematically evaluates multimodal safety through fine-grained severity classification and combinatorial analysis, revealing systematic joint understanding failures in current models where performance degrades significantly when joint image-text reasoning is required.

DetailsMotivation: Current safety evaluation approaches treat vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing methods also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal.

Method: A multi-stage pipeline with real-world images and human annotation to construct a large-scale benchmark of 8,187 samples spanning 15 harm categories, evaluating 11 state-of-the-art models through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns.

Result: Models achieve 90%+ accuracy on clear unimodal safety signals but performance degrades to 20-55% when joint image-text reasoning is required. 34% of errors occur despite correct classification of individual modalities, demonstrating absent compositional reasoning. Models struggle to balance refusing unsafe content while responding to borderline cases.

Conclusion: The framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, providing a critical test bed for research on robust vision-language safety.

Abstract: Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.

[150] Beyond Frequency: Scoring-Driven Debiasing for Object Detection via Blueprint-Prompted Image Synthesis

Xinhao Cai, Liulei Li, Gensheng Pei, Tao Chen, Jinshan Pan, Yazhou Yao, Wenguan Wang

Main category: cs.CV

TL;DR: A generation-based debiasing framework for object detection that addresses limitations of prior methods by introducing representation score to diagnose representational gaps and using precise visual blueprints with generative alignment for high-quality synthesis.

DetailsMotivation: Prior debiasing methods are limited by representation diversity, naive generative augmentation preserves biases, and current layout-to-image synthesis lacks fidelity and control for complex scenes.

Method: Introduces representation score (RS) to diagnose representational gaps beyond frequency, creates unbiased layouts, replaces text prompts with precise visual blueprints, and employs generative alignment strategy between detector and generator.

Result: Significantly narrows performance gap for underrepresented object groups - improves large/rare instances by 4.4/3.6 mAP over baseline, and surpasses prior L2I synthesis models by 15.9 mAP for layout accuracy.

Conclusion: The proposed framework effectively addresses debiasing in object detection through representation-aware generation and high-quality synthesis, achieving substantial improvements for underrepresented classes.

Abstract: This paper presents a generation-based debiasing framework for object detection. Prior debiasing methods are often limited by the representation diversity of samples, while naive generative augmentation often preserves the biases it aims to solve. Moreover, our analysis reveals that simply generating more data for rare classes is suboptimal due to two core issues: i) instance frequency is an incomplete proxy for the true data needs of a model, and ii) current layout-to-image synthesis lacks the fidelity and control to generate high-quality, complex scenes. To overcome this, we introduce the representation score (RS) to diagnose representational gaps beyond mere frequency, guiding the creation of new, unbiased layouts. To ensure high-quality synthesis, we replace ambiguous text prompts with a precise visual blueprint and employ a generative alignment strategy, which fosters communication between the detector and generator. Our method significantly narrows the performance gap for underrepresented object groups, e.g., improving large/rare instances by 4.4/3.6 mAP over the baseline, and surpassing prior L2I synthesis models by 15.9 mAP for layout accuracy in generated images.

[151] DeepSeek-OCR: Contexts Optical Compression

Haoran Wei, Yaofeng Sun, Yukun Li

Main category: cs.CV

TL;DR: DeepSeek-OCR is a system that compresses long document contexts using optical 2D mapping, achieving high OCR accuracy with significant compression ratios while maintaining practical efficiency.

DetailsMotivation: To investigate the feasibility of compressing long contexts via optical 2D mapping, addressing challenges in historical document processing and memory management for large language models.

Method: Uses a two-component architecture: DeepEncoder for low-activation high-resolution input processing and high compression, and DeepSeek3B-MoE-A570M as decoder for OCR tasks.

Result: Achieves 97% OCR precision at compression ratio <10x, maintains ~60% accuracy at 20x compression. Outperforms existing methods on benchmarks while using significantly fewer tokens. Can process 200k+ pages daily on single A100-40G.

Conclusion: Demonstrates considerable promise for long-context compression research and practical OCR applications, with publicly available code and models.

Abstract: We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G). Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
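
The reported ratios can be read as the ratio of decoded text tokens to vision tokens, with precision degrading as compression grows:

```latex
r = \frac{N_{\text{text}}}{N_{\text{vision}}}, \qquad
\text{OCR precision} \approx 97\% \ \text{for}\ r < 10, \qquad
\approx 60\% \ \text{at}\ r = 20.
```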

[152] BlendCLIP: Bridging Synthetic and Real Domains for Zero-Shot 3D Object Classification with Multimodal Pretraining

Ajinkya Khoche, Gergő László Nagy, Maciej Wozniak, Thomas Gustafsson, Patric Jensfelt

Main category: cs.CV

TL;DR: BlendCLIP bridges the synthetic-to-real gap in 3D object classification by combining synthetic CAD data with real-world LiDAR scans using a curriculum-based data mixing strategy, achieving state-of-the-art zero-shot performance with minimal real data.

DetailsMotivation: Current methods fail to generalize from synthetic training data to real-world LiDAR scans, while real-data-only approaches lack semantic diversity for recognizing rare objects.

Method: Multimodal pretraining framework that generates object-level triplets (point cloud, image, text) from real driving data, and uses curriculum-based data mixing to gradually adapt from synthetic to real data.

Result: Achieves 27% zero-shot accuracy boost on nuScenes with only 1.5% real samples per batch, and 19.3% improvement over prior methods while maintaining strong generalization on synthetic benchmarks.

Conclusion: Effective domain adaptation, not full-scale real-world annotation, is key to robust open-vocabulary 3D perception.

Abstract: Zero-shot 3D object classification is crucial for real-world applications like autonomous driving, however it is often hindered by a significant domain gap between the synthetic data used for training and the sparse, noisy LiDAR scans encountered in the real-world. Current methods trained solely on synthetic data fail to generalize to outdoor scenes, while those trained only on real data lack the semantic diversity to recognize rare or unseen objects. We introduce BlendCLIP, a multimodal pretraining framework that bridges this synthetic-to-real gap by strategically combining the strengths of both domains. We first propose a pipeline to generate a large-scale dataset of object-level triplets – consisting of a point cloud, image, and text description – mined directly from real-world driving data and human annotated 3D boxes. Our core contribution is a curriculum-based data mixing strategy that first grounds the model in the semantically rich synthetic CAD data before progressively adapting it to the specific characteristics of real-world scans. Our experiments show that our approach is highly label-efficient: introducing as few as 1.5% real-world samples per batch into training boosts zero-shot accuracy on the nuScenes benchmark by 27%. Consequently, our final model achieves state-of-the-art performance on challenging outdoor datasets like nuScenes and TruckScenes, improving over the best prior method by 19.3% on nuScenes, while maintaining strong generalization on diverse synthetic benchmarks. Our findings demonstrate that effective domain adaptation, not full-scale real-world annotation, is the key to unlocking robust open-vocabulary 3D perception. Our code and dataset will be released upon acceptance on https://github.com/kesu1/BlendCLIP.
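
The curriculum mixing strategy amounts to controlling the real-sample fraction per batch over training. The sketch below ramps that fraction from zero up to the 1.5% the abstract cites; the ramp schedule itself is an assumption:

```python
# Illustrative curriculum batch mixing: mostly synthetic CAD triplets early,
# a small, growing fraction of real LiDAR triplets later.
import random

def mixed_batch(synthetic, real, step, total_steps,
                batch_size=256, max_real_frac=0.015):
    ramp = min(1.0, step / max(1, total_steps // 2))     # linear warm-up
    n_real = int(round(batch_size * max_real_frac * ramp))
    batch = random.sample(real, min(n_real, len(real)))
    batch += random.sample(synthetic, batch_size - len(batch))
    random.shuffle(batch)
    return batch
```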

[153] OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion

Tianyu Huang, Runnan Chen, Dongting Hu, Fengming Huang, Mingming Gong, Tongliang Liu

Main category: cs.CV

TL;DR: OpenInsGaussian is an open-vocabulary instance Gaussian segmentation framework that addresses limitations in existing semantic Gaussian splatting approaches through context-aware feature extraction and attention-driven multi-view fusion.

DetailsMotivation: Current semantic Gaussian splatting methods suffer from insufficient contextual cues for individual masks during preprocessing and inconsistencies/missing details when fusing multi-view features from 2D models.

Method: Two-module approach: Context-Aware Feature Extraction (augments masks with rich semantic context) and Attention-Driven Feature Aggregation (selectively fuses multi-view features to mitigate alignment errors and incompleteness).

Result: Achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin on benchmark datasets.

Conclusion: The approach demonstrates robustness and generality, marking significant progress in 3D scene understanding for autonomous driving, robotics, and augmented reality applications.

Abstract: Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.

[154] Hyperbolic Space Learning Method Leveraging Temporal Motion Priors for Human Mesh Recovery

Xiang Zhang, Suping Wu, Weibin Qiu, Zhaocheng Jin, Sheng Yang

Main category: cs.CV

TL;DR: Proposes a hyperbolic space learning method with temporal motion prior for 3D human mesh recovery from videos, addressing hierarchical structure limitations in Euclidean space.

DetailsMotivation: Existing video-based 3D human mesh recovery methods learn features in Euclidean space, making it difficult to accurately capture the natural hierarchical structure of human bodies (torso-limbs-fingers), leading to incorrect mesh reconstructions.

Method: 1) Temporal motion prior extraction module combining 3D pose sequences and image features; 2) Hyperbolic space optimization learning strategy using temporal motion prior to optimize mesh features with 3D pose and pose motion information; 3) Hyperbolic mesh optimization loss for stable learning.

Result: Extensive experiments on large public datasets show superior performance compared to most state-of-the-art methods.

Conclusion: The proposed hyperbolic space learning approach with temporal motion prior effectively captures hierarchical human structure and produces accurate, smooth 3D human meshes from videos.

Abstract: 3D human meshes exhibit a natural hierarchical structure (torso-limbs-fingers), but existing video-based 3D human mesh recovery methods usually learn mesh features in Euclidean space, where this hierarchy is hard to capture accurately, leading to incorrectly reconstructed meshes. To solve this problem, we propose a hyperbolic space learning method leveraging a temporal motion prior for recovering 3D human meshes from videos. First, we design a temporal motion prior extraction module. This module extracts temporal motion features from the input 3D pose sequences and image feature sequences respectively, then combines them into the temporal motion prior, strengthening the ability to express features in the temporal motion dimension. Since data representation in non-Euclidean space has been shown to effectively capture hierarchical relationships in real-world datasets (especially in hyperbolic space), we further design a hyperbolic space optimization learning strategy. This strategy uses the temporal motion prior to assist learning, and uses 3D pose and pose motion information respectively in hyperbolic space to optimize and learn the mesh features. We then combine the optimized results to obtain an accurate and smooth human mesh. Besides, to make the optimization of human meshes in hyperbolic space stable and effective, we propose a hyperbolic mesh optimization loss. Extensive experimental results on large publicly available datasets indicate superiority over most state-of-the-art methods.
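
The abstract does not state which hyperbolic model is used; for context, the commonly used Poincaré ball realizes such hierarchy-friendly geometry with the distance

```latex
d_{\mathbb{B}}(x, y) = \operatorname{arcosh}\!\left( 1 + \frac{2\,\lVert x - y \rVert^{2}}{\bigl(1 - \lVert x \rVert^{2}\bigr)\bigl(1 - \lVert y \rVert^{2}\bigr)} \right),
```

which grows rapidly near the boundary of the ball, giving tree-like structures (torso-limbs-fingers) room that flat Euclidean space lacks.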

[155] UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding

Da Zhang, Chenggang Rong, Bingyu Li, Feiyu Wang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Main category: cs.CV

TL;DR: UWBench is a comprehensive benchmark for underwater vision-language understanding, featuring 15,003 high-resolution underwater images with human-verified annotations for object referring expressions and question-answer pairs, designed to evaluate models on captioning, visual grounding, and visual question answering in challenging underwater environments.

DetailsMotivation: Large vision-language models have achieved success in natural scene understanding but remain largely unexplored for underwater environments, which present unique challenges like light attenuation, color distortion, and require specialized marine knowledge.

Method: Created UWBench with 15,003 underwater images across diverse aquatic environments, enriched with 15,281 object referring expressions and 124,983 question-answer pairs. Established three benchmarks: detailed image captioning, visual grounding, and visual question answering for underwater contexts.

Result: Extensive experiments on state-of-the-art VLMs show that underwater understanding remains challenging with substantial room for improvement, demonstrating the need for specialized benchmarks like UWBench.

Conclusion: UWBench provides essential resources for advancing vision-language research in underwater contexts and supports applications in marine science, ecological monitoring, and autonomous underwater exploration.

Abstract: Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding, yet their application to underwater environments remains largely unexplored. Underwater imagery presents unique challenges including severe light attenuation, color distortion, and suspended particle scattering, while requiring specialized knowledge of marine ecosystems and organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive benchmark specifically designed for underwater vision-language understanding. UWBench comprises 15,003 high-resolution underwater images captured across diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea habitats. Each image is enriched with human-verified annotations including 15,281 object referring expressions that precisely describe marine organisms and underwater structures, and 124,983 question-answer pairs covering diverse reasoning capabilities from object recognition to ecological relationship understanding. The dataset captures rich variations in visibility, lighting conditions, and water turbidity, providing a realistic testbed for model evaluation. Based on UWBench, we establish three comprehensive benchmarks: detailed image captioning for generating ecologically informed scene descriptions, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art VLMs demonstrate that underwater understanding remains challenging, with substantial room for improvement. Our benchmark provides essential resources for advancing vision-language research in underwater contexts and supporting applications in marine science, ecological monitoring, and autonomous underwater exploration. Our code and benchmark will be available.

[156] Latent-Info and Low-Dimensional Learning for Human Mesh Recovery and Parallel Optimization

Xiang Zhang, Suping Wu, Sheng Yang

Main category: cs.CV

TL;DR: A two-stage network for 3D human mesh recovery that uses latent information and low-dimensional learning to address limb misalignment and computational costs.

DetailsMotivation: Existing methods fail to fully exploit latent information (human motion, shape alignment), leading to limb misalignment and insufficient local details in complex scenes, while attention-based methods have high computational costs.

Method: Two-stage approach: 1) Extract global and local information from low/high-frequency image components into hybrid latent frequency domain features, 2) Use low-dimensional mesh-pose interaction through dimensionality reduction and parallel optimization to reduce computational costs.

Result: Extensive experiments on large datasets show superiority over state-of-the-art methods.

Conclusion: The proposed method effectively extracts latent information and reduces computational costs while maintaining reconstruction accuracy, outperforming existing approaches.

Abstract: Existing 3D human mesh recovery methods often fail to fully exploit the latent information (e.g., human motion, shape alignment), leading to issues with limb misalignment and insufficient local details in the reconstructed human mesh (especially in complex scenes). Furthermore, the performance improvement gained by modelling mesh vertices and pose node interactions using attention mechanisms comes at a high computational cost. To address these issues, we propose a two-stage network for human mesh recovery based on latent information and low-dimensional learning. Specifically, the first stage of the network fully excavates global (e.g., the overall shape alignment) and local (e.g., textures, detail) information from the low- and high-frequency components of image features and aggregates this information into a hybrid latent frequency-domain feature. This strategy effectively extracts latent information. Subsequently, the extracted hybrid latent frequency-domain features collaborate to enhance 2D-pose-to-3D learning. In the second stage, with the assistance of hybrid latent features, we model the interaction learning between the rough 3D human mesh template and the 3D pose, optimizing the pose and shape of the human mesh. Unlike existing mesh-pose interaction methods, we design a low-dimensional mesh-pose interaction method through dimensionality reduction and parallel optimization that significantly reduces computational costs without sacrificing reconstruction accuracy. Extensive experimental results on large publicly available datasets indicate superiority over most state-of-the-art methods.
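
As an illustration of the low/high-frequency split of image features described above (global shape alignment versus textures and detail), here is a minimal FFT-based sketch; the centered circular low-pass mask and its radius are illustrative assumptions:

```python
import torch

def freq_split(feat: torch.Tensor, radius: float = 0.25):
    """Split a feature map (B, C, H, W) into low/high-frequency parts with a
    centered circular mask in the 2D Fourier domain."""
    B, C, H, W = feat.shape
    F = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, H), torch.linspace(-0.5, 0.5, W), indexing="ij"
    )
    mask = ((xx ** 2 + yy ** 2).sqrt() <= radius).to(feat.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(F * mask, dim=(-2, -1))).real
    high = feat - low   # residual keeps textures / fine detail
    return low, high

low, high = freq_split(torch.randn(2, 16, 32, 32))
```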

[157] TreeFedDG: Alleviating Global Drift in Federated Domain Generalization for Medical Image Segmentation

Yucheng Song, Chenxi Li, Haokang Ding, Zhining Liao, Zhifang Liao

Main category: cs.CV

TL;DR: TreeFedDG is a novel tree topology framework for federated domain generalization in medical image segmentation that addresses global drift issues through hierarchical parameter aggregation, parameter difference-based style mixing, progressive personalized fusion, and ensemble decision-making.

DetailsMotivation: Traditional federated learning methods fail to address information aggregation imbalance in cross-domain scenarios, leading to the Global Drift problem and reduced model generalization performance in medical imaging tasks with privacy protection and data heterogeneity challenges.

Method: Proposes TreeFedDG with four key components: 1) hierarchical parameter aggregation using tree-structured topology, 2) FedStyle parameter difference-based style mixing, 3) progressive personalized fusion strategy during model distribution, and 4) feature similarity-guided ensemble decision-making during inference.

Result: Extensive experiments on two publicly available datasets show the method outperforms state-of-the-art domain generalization approaches and achieves better balance in cross-domain performance.

Conclusion: TreeFedDG effectively addresses the FedDG-GD problem by leveraging hierarchical knowledge through tree topology, demonstrating superior performance in federated domain generalization for medical image segmentation tasks.

Abstract: In medical image segmentation tasks, Domain Generalization (DG) under the Federated Learning (FL) framework is crucial for addressing challenges related to privacy protection and data heterogeneity. However, traditional federated learning methods fail to account for the imbalance in information aggregation across clients in cross-domain scenarios, leading to the Global Drift (GD) problem and a consequent decline in model generalization performance. This motivates us to delve deeper and define a new critical issue: global drift in federated domain generalization for medical imaging (FedDG-GD). In this paper, we propose a novel tree topology framework called TreeFedDG. First, starting from the distributed characteristics of medical images, we design a hierarchical parameter aggregation method based on a tree-structured topology to suppress deviations in the global model direction. Second, we introduce a parameter difference-based style mixing method (FedStyle), which enforces mixing among clients with maximum parameter differences to enhance robustness against drift. Third, we develop a progressive personalized fusion strategy during model distribution, ensuring a balance between knowledge transfer and personalized features. Finally, during the inference phase, we use feature similarity to guide the retrieval of the most relevant model chain from the tree structure for ensemble decision-making, thereby fully leveraging the advantages of hierarchical knowledge. We conducted extensive experiments on two publicly available datasets. The results demonstrate that our method outperforms other state-of-the-art domain generalization approaches in these challenging tasks and achieves better balance in cross-domain performance.
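
A minimal sketch of bottom-up parameter aggregation over a tree topology, the core mechanism for suppressing global drift; the tree layout and plain averaging rule are assumptions, since the paper's exact aggregation weights are not given here:

```python
from typing import Dict, List
import torch

Params = Dict[str, torch.Tensor]

def average(params: List[Params]) -> Params:
    """Plain parameter averaging over a list of state dicts."""
    return {k: torch.stack([p[k] for p in params]).mean(0) for k in params[0]}

def tree_aggregate(node: dict) -> Params:
    """Aggregate client models bottom-up along a tree: each internal node
    averages its children before passing the result upward, so no single
    domain dominates the global update."""
    if "params" in node:               # leaf = one client
        return node["params"]
    return average([tree_aggregate(c) for c in node["children"]])

# Toy 2-level tree with 3 clients holding a single weight tensor each.
leaf = lambda v: {"params": {"w": torch.tensor([v])}}
tree = {"children": [{"children": [leaf(1.0), leaf(3.0)]}, leaf(5.0)]}
print(tree_aggregate(tree)["w"])       # tensor([3.5]): mean(mean(1, 3), 5)
```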

[158] StreamingTOM: Streaming Token Compression for Efficient Video Understanding

Xueyi Chen, Keda Tao, Kele Shao, Huan Wang

Main category: cs.CV

TL;DR: StreamingTOM is a training-free framework that addresses both pre-LLM and post-LLM bottlenecks in streaming video VLMs through causal temporal reduction and online quantized memory, achieving significant efficiency gains while maintaining accuracy.

DetailsMotivation: Streaming video VLMs face fundamental constraints of causality (no future frame access) and accumulation (unbounded token growth), but existing methods only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged.

Method: Two-stage framework: 1) Causal Temporal Reduction imposes fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, reducing prefill cost; 2) Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them to keep active kv-cache bounded.

Result: Achieves 15.7× kv-cache compression, 1.2× lower peak memory, and 2× faster TTFT compared to prior SOTA. Maintains state-of-the-art accuracy with 63.8% on offline benchmarks and 55.8%/3.7 on RVS.

Conclusion: The two-stage approach provides practical benefits for efficient streaming video understanding with bounded growth, demonstrating the effectiveness of addressing both pre-LLM and post-LLM bottlenecks.

Abstract: Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
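
To illustrate Causal Temporal Reduction's fixed per-frame budget, a minimal sketch that scores tokens by adjacent-frame change plus saliency and keeps a fixed top-k; the L2-norm saliency proxy and additive scoring are assumptions standing in for the paper's exact criteria:

```python
import torch

def select_tokens(prev: torch.Tensor, cur: torch.Tensor, budget: int):
    """Keep a fixed per-frame budget of visual tokens, scored by how much each
    token changed since the previous frame plus its own saliency.
    prev/cur: (N, D) token embeddings for consecutive frames."""
    change = (cur - prev).norm(dim=-1)       # temporal novelty
    saliency = cur.norm(dim=-1)              # simple per-token saliency proxy
    score = change + saliency
    keep = score.topk(budget).indices.sort().values  # preserve token order
    return cur[keep], keep

prev, cur = torch.randn(196, 768), torch.randn(196, 768)
kept, idx = select_tokens(prev, cur, budget=64)   # 196 tokens -> 64 per frame
```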

[159] Efficient Few-shot Identity Preserving Attribute Editing for 3D-aware Deep Generative Models

Vishal Vinod

Main category: cs.CV

TL;DR: A method for identity-preserving 3D face editing using few-shot learning to find latent space directions that enable photorealistic attribute modifications while maintaining view consistency across multiple poses.

DetailsMotivation: Identity-preserving editing for 3D faces is challenging due to view-consistency requirements and a trade-off between editability at low resolution and inflexibility at high resolution. Existing methods also require large-scale labeled datasets.

Method: Builds on 3D-aware deep generative models and 2D portrait editing techniques to perform efficient few-shot identity preserving attribute editing. Uses just 10 or fewer labeled images to estimate edit directions in latent space for 3D-aware attribute editing.

Result: The method successfully estimates edit directions in latent space that correspond to 3D-aware attribute editing using minimal labeled data. Demonstrates linearity of edits through sequential editing and continuous style manifolds for face aging.

Conclusion: The approach alleviates constraints in 3D face editing by enabling efficient few-shot learning for identity-preserving attribute modifications while maintaining 3D consistency and photorealism.

Abstract: Identity preserving editing of faces is a generative task that enables modifying the illumination, adding/removing eyeglasses, face aging, editing hairstyles, modifying expression etc., while preserving the identity of the face. Recent progress in 2D generative models have enabled photorealistic editing of faces using simple techniques leveraging the compositionality in GANs. However, identity preserving editing for 3D faces with a given set of attributes is a challenging task as the generative model must reason about view consistency from multiple poses and render a realistic 3D face. Further, 3D portrait editing requires large-scale attribute labelled datasets and presents a trade-off between editability in low-resolution and inflexibility to editing in high resolution. In this work, we aim to alleviate some of the constraints in editing 3D faces by identifying latent space directions that correspond to photorealistic edits. To address this, we present a method that builds on recent advancements in 3D-aware deep generative models and 2D portrait editing techniques to perform efficient few-shot identity preserving attribute editing for 3D-aware generative models. We aim to show from experimental results that using just ten or fewer labelled images of an attribute is sufficient to estimate edit directions in the latent space that correspond to 3D-aware attribute editing. In this work, we leverage an existing face dataset with masks to obtain the synthetic images for few attribute examples required for estimating the edit directions. Further, to demonstrate the linearity of edits, we investigate one-shot stylization by performing sequential editing and use the (2D) Attribute Style Manipulation (ASM) technique to investigate a continuous style manifold for 3D consistent identity preserving face aging. Code and results are available at: https://vishal-vinod.github.io/gmpi-edit/
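
A minimal sketch of few-shot edit-direction estimation in a generator's latent space, consistent with the ten-or-fewer labelled images described above; the mean-difference estimator and the latent dimensionality are illustrative assumptions:

```python
import numpy as np

def edit_direction(lat_pos: np.ndarray, lat_neg: np.ndarray) -> np.ndarray:
    """Estimate a latent edit direction from a handful of labelled examples:
    the unit-norm difference between mean latents with and without the
    attribute. lat_pos/lat_neg: (k, d) latent codes, k <= 10."""
    d = lat_pos.mean(axis=0) - lat_neg.mean(axis=0)
    return d / np.linalg.norm(d)

rng = np.random.default_rng(0)
w_glasses, w_plain = rng.normal(size=(10, 512)), rng.normal(size=(10, 512))
direction = edit_direction(w_glasses, w_plain)
edited = w_plain[0] + 2.0 * direction   # walk along the direction to add the attribute
```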

[160] GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation

Tuan Pham, Thanh-Tung Le, Xiaohui Xie, Stephan Mandt

Main category: cs.CV

TL;DR: A training-free framework that enhances diffusion-based monocular depth estimation with stereo guidance to resolve scale ambiguities and achieve accurate metric depth estimation.

DetailsMotivation: Existing diffusion-based monocular depth estimation methods excel at relative depth but struggle with absolute metric depth due to scale ambiguities in single-image scenarios.

Method: Reframes depth estimation as an inverse problem using pretrained latent diffusion models conditioned on RGB images, combined with stereo-based geometric constraints to learn scale and shift parameters for accurate depth recovery.

Result: Matches or surpasses state-of-the-art methods, particularly in challenging scenarios with translucent and specular surfaces, without requiring retraining.

Conclusion: The proposed training-free solution effectively integrates stereo vision guidance into existing diffusion-based depth estimation frameworks and generalizes well across diverse environments.

Abstract: We introduce a novel framework for metric depth estimation that enhances pretrained diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance. While existing DB-MDE methods excel at predicting relative depth, estimating absolute metric depth remains challenging due to scale ambiguities in single-image scenarios. To address this, we reframe depth estimation as an inverse problem, leveraging pretrained latent diffusion models (LDMs) conditioned on RGB images, combined with stereo-based geometric constraints, to learn scale and shift for accurate depth recovery. Our training-free solution seamlessly integrates into existing DB-MDE frameworks and generalizes across indoor, outdoor, and complex environments. Extensive experiments demonstrate that our approach matches or surpasses state-of-the-art methods, particularly in challenging scenarios involving translucent and specular surfaces, all without requiring retraining.
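
The scale/shift recovery step can be illustrated with a least-squares fit of relative depth to sparse metric depth derived from stereo; this closed-form alignment is a common baseline and an assumption here, not necessarily the paper's exact estimator:

```python
import numpy as np

def fit_scale_shift(d_rel: np.ndarray, d_metric: np.ndarray, mask: np.ndarray):
    """Least-squares scale s and shift t such that s * d_rel + t ~ d_metric on
    pixels where stereo provides a (sparse) metric estimate."""
    x, y = d_rel[mask], d_metric[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

d_rel = np.random.rand(240, 320)               # relative depth from the diffusion model
d_metric = 2.0 * d_rel + 0.5                   # pretend stereo recovered metric depth
mask = np.random.rand(240, 320) > 0.9          # sparse valid stereo pixels
s, t = fit_scale_shift(d_rel, d_metric, mask)  # recovers ~ (2.0, 0.5)
```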

[161] Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models

Lehan Wang, Yi Qin, Honglong Yang, Xiaomeng Li

Main category: cs.CV

TL;DR: Med-RwR is a multimodal medical reasoning framework that actively retrieves external knowledge during diagnosis to address hallucinations and factual inaccuracies in medical MLLMs.

DetailsMotivation: Existing medical MLLMs rely solely on internal knowledge, leading to hallucinations and inaccuracies when encountering cases beyond training scope. Current RAG methods are unimodal and neglect visual information.

Method: Two-stage reinforcement learning with tailored rewards to leverage both visual diagnostic findings and textual clinical information for retrieval, plus Confidence-Driven Image Re-retrieval for test-time scaling.

Result: Significant improvements over baseline models on various medical benchmarks, with 8.8% performance gain on EchoCardiography Benchmark despite scarce training data.

Conclusion: Med-RwR effectively enhances reasoning capabilities through external knowledge integration and demonstrates strong generalizability to unfamiliar domains.

Abstract: Incentivizing the reasoning ability of Multimodal Large Language Models (MLLMs) is essential for medical applications to transparently analyze medical scans and provide reliable diagnosis. However, existing medical MLLMs rely solely on internal knowledge during reasoning, leading to hallucinated reasoning and factual inaccuracies when encountering cases beyond their training scope. Although recent Agentic Retrieval-Augmented Generation (RAG) methods elicit the medical model’s proactive retrieval ability during reasoning, they are confined to unimodal LLMs, neglecting the crucial visual information during reasoning and retrieval. Consequently, we propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR, which actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Specifically, we design a two-stage reinforcement learning strategy with tailored rewards that stimulate the model to leverage both visual diagnostic findings and textual clinical information for effective retrieval. Building on this foundation, we further propose a Confidence-Driven Image Re-retrieval (CDIR) method for test-time scaling when low prediction confidence is detected. Evaluation on various public medical benchmarks demonstrates Med-RwR’s significant improvements over baseline models, proving the effectiveness of enhancing reasoning capabilities with external knowledge integration. Furthermore, Med-RwR demonstrates remarkable generalizability to unfamiliar domains, evidenced by 8.8% performance gain on our proposed EchoCardiography Benchmark (ECBench), despite the scarcity of echocardiography data in the training corpus. Our data, model, and codes will be made publicly available at https://github.com/xmed-lab/Med-RwR.
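
A sketch of the Confidence-Driven Image Re-retrieval control flow; `model.generate`, `retriever.search_by_image`, and the threshold value are hypothetical interfaces used only to show the trigger logic, not the paper's API:

```python
def answer_with_cdir(model, retriever, image, question, tau: float = 0.6):
    """Answer once; if the model's confidence falls below tau, retrieve
    additional image-grounded evidence and answer again (hypothetical
    interfaces throughout)."""
    answer, confidence = model.generate(image, question)            # assumed API
    if confidence < tau:                                            # low confidence detected
        evidence = retriever.search_by_image(image, question)       # assumed API
        answer, confidence = model.generate(image, question, context=evidence)
    return answer, confidence
```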

[162] The Impact of Image Resolution on Biomedical Multimodal Large Language Models

Liangyu Chen, James Burgess, Jeffrey J Nirschl, Orr Zohar, Serena Yeung-Levy

Main category: cs.CV

TL;DR: Native-resolution training and inference significantly improve MLLM performance for biomedical image analysis, while resolution misalignment degrades performance. Mixed-resolution training effectively balances computational constraints with performance.

DetailsMotivation: Current multimodal large language models are designed for low-resolution general images, risking critical information loss in biomedical applications where high-resolution imaging is essential.

Method: Investigated how image resolution affects MLLM performance, testing native-resolution training/inference, resolution misalignment effects, and mixed-resolution training approaches.

Result: Native-resolution training and inference significantly improved performance across multiple biomedical tasks. Resolution misalignment severely degraded performance. Mixed-resolution training effectively mitigated misalignment issues.

Conclusion: Prioritize native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.

Abstract: Imaging technologies are fundamental to biomedical research and modern medicine, requiring analysis of high-resolution images across various modalities. While multimodal large language models (MLLMs) show promise for biomedical image analysis, most are designed for low-resolution images from general-purpose datasets, risking critical information loss. We investigate how image resolution affects MLLM performance in biomedical applications and demonstrate that: (1) native-resolution training and inference significantly improve performance across multiple tasks, (2) misalignment between training and inference resolutions severely degrades performance, and (3) mixed-resolution training effectively mitigates misalignment and balances computational constraints with performance requirements. Based on these findings, we recommend prioritizing native-resolution inference and mixed-resolution datasets to optimize biomedical MLLMs for transformative impact in scientific research and clinical applications.
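
A minimal sketch of mixed-resolution training as a per-sample transform, so the model sees at train time the same spread of input sizes it will face at inference; the resolution buckets are illustrative, not the paper's values:

```python
import random
from PIL import Image

RESOLUTIONS = [224, 448, 672, 896]   # assumed bucket choices

def mixed_resolution(img: Image.Image) -> Image.Image:
    """Resize each training sample to a randomly chosen resolution bucket."""
    side = random.choice(RESOLUTIONS)
    return img.resize((side, side), Image.BICUBIC)
```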

[163] OmniNWM: Omniscient Driving Navigation World Models

Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, Xin Jin

Main category: cs.CV

TL;DR: OmniNWM is an omniscient panoramic navigation world model that addresses state, action, and reward dimensions in autonomous driving through panoramic video generation, precise action control via Plucker ray-maps, and occupancy-based dense rewards.

DetailsMotivation: Existing autonomous driving world models are limited in state modalities, video sequence length, action control precision, and lack reward awareness, creating gaps in comprehensive world modeling.

Method: Uses panoramic video generation (RGB, semantics, depth, 3D occupancy) with flexible forcing for long-horizon generation. Introduces normalized Plucker ray-maps for precise action control. Leverages generated 3D occupancy for rule-based dense rewards.

Result: Achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability. Provides reliable closed-loop evaluation through occupancy-grounded rewards.

Conclusion: OmniNWM presents a unified framework that effectively addresses all three core dimensions of autonomous driving world models, offering improved performance and comprehensive evaluation capabilities.

Abstract: Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plucker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at https://github.com/Arlo0o/OmniNWM.
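
Plücker ray-maps are straightforward to compute from camera parameters: each pixel stores its world-space ray direction and moment. A minimal sketch assuming a standard pinhole model (the paper's normalization scheme is not reproduced here):

```python
import numpy as np

def plucker_raymap(K: np.ndarray, R: np.ndarray, t: np.ndarray, H: int, W: int):
    """Per-pixel Plucker coordinates (d, o x d) for a camera with intrinsics K
    and world-from-camera rotation R, translation t: a 6-channel map encoding
    the exact viewing ray of every pixel."""
    i, j = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([i + 0.5, j + 0.5, np.ones_like(i)], axis=-1)   # (H, W, 3)
    dirs = pix @ np.linalg.inv(K).T @ R.T        # rays rotated into world frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = np.broadcast_to(t, dirs.shape)      # camera center in world coords
    moment = np.cross(origin, dirs)              # Plucker moment o x d
    return np.concatenate([dirs, moment], axis=-1)   # (H, W, 6)

K = np.array([[500, 0, 320], [0, 500, 240], [0, 0, 1]], float)
raymap = plucker_raymap(K, np.eye(3), np.zeros(3), 480, 640)
```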

[164] Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding

Jinlin Li, Yuran Wang, Yifei Yuan, Xiao Zhou, Yingying Zhang, Xixian Yong, Yefeng Zheng, Xian Wu

Main category: cs.CV

TL;DR: ATED is a training-free ensemble method that reduces object hallucination in Large Vision-Language Models by dynamically weighting multiple models’ predictions during inference based on uncertainty.

DetailsMotivation: Current LVLMs suffer from object hallucination issues, and existing mitigation approaches face challenges in scalability, adaptability, and model independence.

Method: Adaptive Token Ensemble Decoding (ATED) - a token-level ensemble framework that aggregates predictions from multiple LVLMs using uncertainty-based dynamic weighting and integrates diverse decoding paths.

Result: ATED significantly outperforms state-of-the-art methods on hallucination detection benchmarks, reducing hallucination without compromising fluency or relevance.

Conclusion: Adaptive ensembling is a promising direction for improving LVLM robustness, particularly for high-stakes applications.

Abstract: Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to object hallucination – generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this via auxiliary training objectives or external modules, but challenges remain in terms of scalability, adaptability, and model independence. To address these limitations, we propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting their reliability at each decoding step. It also integrates diverse decoding paths to improve contextual grounding and semantic consistency. Experiments on standard hallucination detection benchmarks demonstrate that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance. Our findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes applications. The code is available at https://github.com/jinlin2021/ATED.
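
One decoding step of an uncertainty-weighted token ensemble can be sketched as follows; the entropy-based weighting rule and the shared vocabulary are assumptions standing in for ATED's exact formulation:

```python
import torch

def ensemble_step(logits_list):
    """One decoding step: models with lower predictive entropy (more
    confident) receive larger weights in the token-level mixture.
    logits_list: list of (V,) next-token logits, one per model (shared vocab)."""
    probs = [l.softmax(-1) for l in logits_list]
    entropies = torch.stack([-(p * p.clamp_min(1e-12).log()).sum() for p in probs])
    weights = (-entropies).softmax(0)            # low entropy -> high weight
    mixed = sum(w * p for w, p in zip(weights, probs))
    return mixed.argmax(), weights

token, w = ensemble_step([torch.randn(32000) for _ in range(3)])
```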

[165] Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ATTBHFA-Net

Gao Yu Lee, Tanmoy Dam, Md Meftahul Ferdaus, Daniel Puiu Poenar, Vu Duong

Main category: cs.CV

TL;DR: The paper introduces ATTBHFA-Net, a novel few-shot learning method that combines Bhattacharyya coefficient and Hellinger distance for robust prototype formation in disaster image classification, addressing challenges of limited data and high intra-class variation.

DetailsMotivation: Disaster visual recognition faces challenges due to limited and diverse data, with current few-shot learning methods relying on generic datasets lacking disaster imagery and struggling with high intra-class variation and inter-class similarity in disaster contexts.

Method: Proposes ATTBHFA-Net using linear combination of Bhattacharyya coefficient (for inter-class separability) and Hellinger distance (for same-class alignment) to compare feature probability distributions, with a novel contrastive loss based on these metrics.

Result: Experiments on four FSL benchmarks and two disaster image datasets demonstrate superior effectiveness and generalization compared to existing approaches.

Conclusion: ATTBHFA-Net provides an effective solution for few-shot learning in disaster image classification by operating over probability distributions rather than embedded features, significantly improving performance in challenging disaster scenarios.

Abstract: The increasing frequency of natural and human-induced disasters necessitates advanced visual recognition techniques capable of analyzing critical photographic data. With progress in artificial intelligence and resilient computational systems, rapid and accurate disaster classification has become crucial for efficient rescue operations. However, visual recognition in disaster contexts faces significant challenges due to limited and diverse data, stemming from the difficulty of collecting and curating comprehensive, high-quality disaster imagery. Few-Shot Learning (FSL) provides a promising approach to data scarcity, yet current FSL research mainly relies on generic benchmark datasets lacking remote-sensing disaster imagery, limiting its practical effectiveness. Moreover, disaster images exhibit high intra-class variation and inter-class similarity, hindering the performance of conventional metric-based FSL methods. To address these issues, this paper introduces the Attention-based Bhattacharyya-Hellinger Feature Aggregation Network (ATTBHFA-Net), which linearly combines the Bhattacharyya coefficient and Hellinger distances to compare and aggregate feature probability distributions for robust prototype formation. The Bhattacharyya coefficient serves as a contrastive margin that enhances inter-class separability, while the Hellinger distance regularizes same-class alignment. This framework parallels contrastive learning but operates over probability distributions rather than embedded feature points. Furthermore, a Bhattacharyya-Hellinger distance-based contrastive loss is proposed as a distributional counterpart to cosine similarity loss, used jointly with categorical cross-entropy to significantly improve FSL performance. Experiments on four FSL benchmarks and two disaster image datasets demonstrate the superior effectiveness and generalization of ATTBHFA-Net compared to existing approaches.
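
The two distributional measures at the core of ATTBHFA-Net are simple to compute; the sketch below assumes discrete probability vectors (e.g., softmax-normalized features), with the mixing weight as an illustrative placeholder rather than the paper's value:

```python
import torch

def bhattacharyya_coeff(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """BC(p, q) = sum_i sqrt(p_i * q_i); equals 1 for identical distributions."""
    return (p * q).clamp_min(0).sqrt().sum(-1)

def hellinger(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Hellinger distance sqrt(1 - BC): a proper bounded metric in [0, 1]."""
    return (1 - bhattacharyya_coeff(p, q)).clamp_min(0).sqrt()

def combined_distance(p, q, alpha: float = 0.5):
    """Linear combination for prototype matching: the BC term pushes classes
    apart while Hellinger pulls same-class features together (alpha is a
    hypothetical mixing weight)."""
    return alpha * (1 - bhattacharyya_coeff(p, q)) + (1 - alpha) * hellinger(p, q)

p = torch.softmax(torch.randn(5, 64), -1)
q = torch.softmax(torch.randn(5, 64), -1)
print(combined_distance(p, q))
```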

[166] ViSE: A Systematic Approach to Vision-Only Street-View Extrapolation

Kaiyuan Tan, Yingying Shen, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye

Main category: cs.CV

TL;DR: A four-stage pipeline for realistic street view extrapolation in autonomous driving that won first place in RealADSim Workshop NVS track at ICCV 2025, achieving a score of 0.441.

DetailsMotivation: Current Novel View Synthesis methods produce distorted and inconsistent images beyond original trajectories, which is critical for closed-loop simulation in autonomous driving.

Method: Four-stage pipeline: 1) Data-driven initialization for robust pseudo-LiDAR point cloud, 2) Geometric priors via novel 2D-SDF road surface modeling, 3) Generative prior for pseudo ground truth in extrapolated viewpoints, 4) Data-driven adaptation network to remove time-specific artifacts.

Result: Achieved final score of 0.441 on RealADSim-NVS benchmark, ranking first among all participants.

Conclusion: The comprehensive pipeline effectively addresses street view extrapolation challenges through geometric priors, generative supervision, and artifact removal, demonstrating state-of-the-art performance in autonomous driving simulation.

Abstract: Realistic view extrapolation is critical for closed-loop simulation in autonomous driving, yet it remains a significant challenge for current Novel View Synthesis (NVS) methods, which often produce distorted and inconsistent images beyond the original trajectory. This report presents our winning solution, which took first place in the RealADSim Workshop NVS track at ICCV 2025. To address the core challenges of street view extrapolation, we introduce a comprehensive four-stage pipeline. First, we employ a data-driven initialization strategy to generate a robust pseudo-LiDAR point cloud, avoiding local minima. Second, we inject strong geometric priors by modeling the road surface with a novel dimension-reduced SDF termed 2D-SDF. Third, we leverage a generative prior to create pseudo ground truth for extrapolated viewpoints, providing auxiliary supervision. Finally, a data-driven adaptation network removes time-specific artifacts. On the RealADSim-NVS benchmark, our method achieves a final score of 0.441, ranking first among all participants.

[167] GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data

Yudong Li, Hao Li, Xianxu Hou, Linlin Shen

Main category: cs.CV

TL;DR: A generative pre-training model for facial knowledge learning using web-built data with self-supervised tasks including masked image/language modeling and image-text matching.

DetailsMotivation: Limited research on large-scale pre-training models for facial knowledge compared to natural images, with current approaches relying on manually annotated datasets that are labor-intensive and have limited scalability.

Method: Leverage large-scale web-built data (texts and images with human faces) for pre-training using self-supervised tasks: masked image/language modeling (MILM) and image-text matching (ITM), with image-text matching loss used during generation for controllable image/text generation.

Result: Achieves comparable performance to state-of-the-art pre-training models on facial downstream tasks like attribute classification and expression recognition, and is applicable to various face editing tasks including attribute editing, expression manipulation, mask removal, and photo inpainting.

Conclusion: The proposed generative pre-training model effectively learns facial knowledge from web-built data and demonstrates strong performance across multiple facial understanding and editing tasks.

Abstract: Compared to the prosperity of pre-training models in natural image understanding, the research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually assembled and annotated face datasets for training, but labeling such datasets is labor-intensive and the trained models have limited scalability beyond the training data. To address these limitations, we present a generative pre-training model for facial knowledge learning that leverages large-scale web-built data for training. We use texts and images containing human faces crawled from the internet and conduct pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). During the generation stage, we further utilize the image-text matching loss to pull the generation distribution towards the control signal for controllable image/text generation. Experimental results demonstrate that our model achieves comparable performance to state-of-the-art pre-training models for various facial downstream tasks, such as attribute classification and expression recognition. Furthermore, our approach is also applicable to a wide range of face editing tasks, including face attribute editing, expression manipulation, mask removal, and photo inpainting.

[168] AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering

Jiayu Zhang, Qilang Ye, Shuo Ye, Xun Lin, Zihan Song, Zitong Yu

Main category: cs.CV

TL;DR: AV-Master is a novel framework for Audio-Visual Question Answering that dynamically models temporal and modality dimensions to better extract key information from complex audio-visual scenes.

DetailsMotivation: Existing AVQA methods lack flexibility in temporal sampling and modality preference awareness, limiting their reasoning capability in complex scenarios with redundant content.

Method: Proposes dynamic adaptive focus sampling for temporal dimension, preference-aware strategy for modality dimension, and dual-path contrastive loss for cross-modal consistency.

Result: Significantly outperforms existing methods on four large-scale benchmarks, especially in complex reasoning tasks.

Conclusion: AV-Master effectively addresses redundancy and fragmentation issues in AVQA through dynamic temporal and modality modeling, achieving superior performance.

Abstract: Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model’s ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively mitigating redundancy and segment fragmentation in traditional sampling methods. In the modality dimension, we propose a preference-aware strategy that models each modality’s contribution independently, enabling selective activation of critical features. Furthermore, we introduce a dual-path contrastive loss to reinforce consistency and complementarity across temporal and modality dimensions, guiding the model to learn question-specific cross-modal collaborative representations. Experiments on four large-scale benchmarks show that AV-Master significantly outperforms existing methods, especially in complex reasoning tasks.

[169] Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

Yi-Lun Wu, Bo-Kai Ruan, Chiang Tseng, Hong-Han Shuai

Main category: cs.CV

TL;DR: Diffusion-DRO is a new preference learning framework for text-to-image diffusion models that addresses limitations of DPO methods by using inverse reinforcement learning and ranking optimization, eliminating reward model dependency and improving generation quality.

DetailsMotivation: Direct preference optimization (DPO) methods struggle with accurately estimating image probabilities due to sigmoid function non-linearity and limited diversity in offline datasets, motivating a more robust approach.

Method: Diffusion-DRO casts preference learning as a ranking problem using inverse reinforcement learning, simplifies training to denoising formulation, and integrates offline expert demonstrations with online policy-generated negative samples.

Result: Diffusion-DRO delivers improved generation quality across challenging prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies.

Conclusion: The proposed Diffusion-DRO framework effectively addresses DPO limitations and provides superior alignment with human preferences for text-to-image generation.

Abstract: Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.

[170] Visual Space Optimization for Zero-shot Learning

Xinsheng Wang, Shanmin Pang, Jihua Zhu, Zhongyu Li, Zhiqiang Tian, Yaochen Li

Main category: cs.CV

TL;DR: This paper proposes two strategies to optimize the visual space for zero-shot learning: visual prototype-based method and intermediate embedding space optimization, achieving state-of-the-art performance.

DetailsMotivation: Existing zero-shot learning methods use visual space as embedding space, but discrete instance distribution makes data structure unremarkable. Optimizing visual space allows semantic vectors to be embedded more effectively.

Method: Two strategies: 1) Visual prototype method - learns a prototype for each visual class to represent classes instead of discrete features; 2) Intermediate embedding space optimization - uses multilayer perceptron framework to learn common intermediate space and make visual data structure more distinctive.

Result: Extensive experiments on four benchmark datasets show optimizing visual space benefits zero-shot learning. The prototype-based method achieves new state-of-the-art performance.

Conclusion: Optimizing visual space is crucial for zero-shot learning, and the proposed prototype-based method sets new state-of-the-art results.

Abstract: Zero-shot learning, which aims to recognize new categories that are not included in the training set, has gained popularity owing to its potential in real-world applications. Zero-shot learning models rely on learning an embedding space, where both semantic descriptions of classes and visual features of instances can be embedded for nearest neighbor search. Recently, most of the existing works consider the visual space formulated by deep visual features as an ideal choice of the embedding space. However, the discrete distribution of instances in the visual space makes the data structure unremarkable. We argue that optimizing the visual space is crucial as it allows semantic vectors to be embedded into the visual space more effectively. In this work, we propose two strategies to accomplish this purpose. One is the visual prototype based method, which learns a visual prototype for each visual class, so that, in the visual space, a class can be represented by a prototype feature instead of a series of discrete visual features. The other is to optimize the visual feature structure in an intermediate embedding space, and in this method we successfully devise a multilayer perceptron framework based algorithm that is able to learn the common intermediate embedding space and meanwhile make the visual data structure more distinctive. Through extensive experimental evaluation on four benchmark datasets, we demonstrate that optimizing visual space is beneficial for zero-shot learning. Besides, the proposed prototype based method achieves new state-of-the-art performance.
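
A minimal sketch of the visual-prototype idea: represent each seen class by the mean of its visual features and classify by nearest prototype. The sketch assumes semantic vectors have already been mapped into visual space (e.g., by a learned MLP), which hides the embedding step the paper actually trains:

```python
import numpy as np

def class_prototypes(feats: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """One prototype per class: the mean of its visual features."""
    return np.stack([feats[labels == c].mean(0) for c in np.unique(labels)])

def zsl_predict(query: np.ndarray, embedded_semantics: np.ndarray) -> np.ndarray:
    """Nearest-neighbour search in visual space: each class is represented by
    its semantic vector embedded into visual space."""
    d = np.linalg.norm(query[:, None] - embedded_semantics[None], axis=-1)
    return d.argmin(axis=1)

feats = np.random.randn(100, 512)
labels = np.random.randint(0, 10, 100)
protos = class_prototypes(feats, labels)            # (10, 512), one per class
preds = zsl_predict(np.random.randn(5, 512), protos)
```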

[171] Learning Human-Object Interaction as Groups

Jiajun Hong, Jianan Wei, Wenguan Wang

Main category: cs.CV

TL;DR: GroupHOI is a framework for Human-Object Interaction Detection that models interactions from a group perspective using geometric proximity and semantic similarity, outperforming state-of-the-art methods on multiple benchmarks.

DetailsMotivation: Existing HOI-DET methods focus on pairwise relationships but overlook that real-world interactions often emerge from collective behaviors involving multiple humans and objects in joint activities.

Method: Propagates contextual information using geometric proximity (learnable clustering based on spatial features) and semantic similarity (enhanced transformer decoder with local contextual cues from HO-pair features).

Result: Demonstrates superiority over state-of-the-art methods on HICO-DET and V-COCO benchmarks, and shows leading performance on the challenging Nonverbal Interaction Detection (NVI-DET) task.

Conclusion: GroupHOI effectively models higher-order interactions within groups by considering both geometric proximity and semantic similarity, addressing limitations of pairwise-only approaches in real-world scenarios.

Abstract: Human-Object Interaction Detection (HOI-DET) aims to localize human-object pairs and identify their interactive relationships. To aggregate contextual cues, existing methods typically propagate information across all detected entities via self-attention mechanisms, or establish message passing between humans and objects with bipartite graphs. However, they primarily focus on pairwise relationships, overlooking that interactions in real-world scenarios often emerge from collective behaviors (multiple humans and objects engaging in joint activities). In light of this, we revisit relation modeling from a group view and propose GroupHOI, a framework that propagates contextual information in terms of geometric proximity and semantic similarity. To exploit the geometric proximity, humans and objects are grouped into distinct clusters using a learnable proximity estimator based on spatial features derived from bounding boxes. In each group, a soft correspondence is computed via self-attention to aggregate and dispatch contextual cues. To incorporate the semantic similarity, we enhance the vanilla transformer-based interaction decoder with local contextual cues from HO-pair features. Extensive experiments on HICO-DET and V-COCO benchmarks demonstrate the superiority of GroupHOI over the state-of-the-art methods. It also exhibits leading performance on the more challenging Nonverbal Interaction Detection (NVI-DET) task, which involves varied forms of higher-order interactions within groups.

[172] FeatureFool: Zero-Query Fooling of Video Models via Feature Map

Duoxun Tang, Xi Xiao, Guangwu Hu, Kangkang Sun, Xiao Yang, Dongyang Chen, Qing Li, Yongjie Yin, Jiyao Wang

Main category: cs.CV

TL;DR: FeatureFool is a zero-query black-box attack that uses DNN-extracted feature maps to manipulate video feature spaces, achieving over 70% success rate against video classifiers without queries and bypassing Video-LLM recognition.

DetailsMotivation: Existing black-box attacks require multiple queries and interactions, which are impractical for real-world applications and don't scale well to Video-LLMs. No existing video attacks directly leverage feature maps to shift clean-video feature spaces.

Method: FeatureFool performs zero-query attacks by directly exploiting information extracted from DNNs to alter the feature space of clean videos, utilizing feature map transferability.

Result: Achieves over 70% attack success rate against traditional video classifiers without any queries, successfully bypasses Video-LLM recognition, and generates high-quality adversarial videos with good SSIM, PSNR, and Temporal-Inconsistency metrics.

Conclusion: FeatureFool demonstrates an efficient zero-query attack approach that is unprecedented in the video domain, offering stealthy adversarial attacks that are barely perceptible while maintaining high effectiveness against both traditional video classifiers and Video-LLMs.

Abstract: The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.

[173] Cross-Modal Scene Semantic Alignment for Image Complexity Assessment

Yuqing Luo, Yixiao Li, Jiang Liu, Jun Fu, Hadi Amirpour, Guanghui Yue, Baoquan Zhao, Padraig Corcoran, Hantao Liu, Wei Zhou

Main category: cs.CV

TL;DR: The paper proposes CM-SSA, a cross-modal scene semantic alignment method for image complexity assessment that leverages text prompts to enhance complexity predictions aligned with human perception.

DetailsMotivation: Existing ICA methods rely on single visual modality features which are insufficient for capturing perceived complexity. Cross-modal scene semantic information has shown promise in perceptual tasks but hasn't been explored for ICA.

Method: CM-SSA consists of two branches: a complexity regression branch that estimates image complexity levels, and a scene semantic alignment branch that aligns images with text prompts containing rich scene semantics through pair-wise learning.

Result: Extensive experiments on multiple ICA datasets demonstrate that CM-SSA significantly outperforms state-of-the-art approaches.

Conclusion: Leveraging cross-modal scene semantic alignment effectively enhances image complexity assessment performance and makes predictions more consistent with human subjective perception.

Abstract: Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, the exploration of cross-modal scene semantic information in the context of ICA remains unaddressed. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance, enabling complexity predictions to be more consistent with subjective human perception. Specifically, the proposed CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch is used to align images with corresponding text prompts that convey rich scene semantic information by pair-wise learning. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Codes are available at https://github.com/XQ2K/First-Cross-Model-ICA.
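
The pair-wise scene semantic alignment can be sketched as a symmetric contrastive loss over matched image/text-prompt embeddings; this CLIP-style InfoNCE form is a generic stand-in, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def pairwise_alignment_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE over matched (image, scene-prompt) pairs: row i of
    img_emb should be closest to row i of txt_emb and vice versa."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    target = torch.arange(len(img))
    return (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target)) / 2

loss = pairwise_alignment_loss(torch.randn(16, 512), torch.randn(16, 512))
```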

[174] S2AP: Score-space Sharpness Minimization for Adversarial Pruning

Giorgio Piras, Qi Zhao, Fabio Brau, Maura Pintor, Christian Wressnegger, Battista Biggio

Main category: cs.CV

TL;DR: S2AP is a novel adversarial pruning method that minimizes score-space sharpness to stabilize mask selection and improve robustness.

DetailsMotivation: Existing adversarial pruning methods suffer from unstable mask selection due to sharp local minima in robust loss landscape during score-space optimization.

Method: Proposes score-space sharpness minimization that perturbs importance scores and minimizes corresponding robust loss during mask search.

Result: Extensive experiments show S2AP effectively minimizes sharpness in score space, stabilizes mask selection, and improves robustness across various datasets, models, and sparsity levels.

Conclusion: S2AP successfully addresses the instability issue in adversarial pruning by introducing score-space sharpness minimization, leading to more robust compressed models.

Abstract: Adversarial pruning methods have emerged as a powerful tool for compressing neural networks while preserving robustness against adversarial attacks. These methods typically follow a three-step pipeline: (i) pretrain a robust model, (ii) select a binary mask for weight pruning, and (iii) finetune the pruned model. To select the binary mask, these methods minimize a robust loss by assigning an importance score to each weight, and then keep the weights with the highest scores. However, this score-space optimization can lead to sharp local minima in the robust loss landscape and, in turn, to an unstable mask selection, reducing the robustness of adversarial pruning methods. To overcome this issue, we propose a novel plug-in method for adversarial pruning, termed Score-space Sharpness-aware Adversarial Pruning (S2AP). Through our method, we introduce the concept of score-space sharpness minimization, which operates during the mask search by perturbing importance scores and minimizing the corresponding robust loss. Extensive experiments across various datasets, models, and sparsity levels demonstrate that S2AP effectively minimizes sharpness in score space, stabilizing the mask selection, and ultimately improving the robustness of adversarial pruning methods.
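
A minimal sketch of one sharpness-aware step in score space, following the SAM recipe of ascending to a worst-case score perturbation before descending; the perturbation radius, learning rate, and toy robust loss are illustrative assumptions:

```python
import torch

def sharpness_aware_score_step(scores, robust_loss_fn, rho=0.05, lr=1e-2):
    """One SAM-style update on pruning importance scores: evaluate the robust
    loss at a worst-case perturbation of the scores (radius rho), then descend
    that perturbed gradient, steering the mask search away from sharp minima."""
    scores = scores.detach().requires_grad_(True)
    loss = robust_loss_fn(scores)
    (grad,) = torch.autograd.grad(loss, scores)
    eps = rho * grad / (grad.norm() + 1e-12)          # ascent to the sharp point
    perturbed = (scores + eps).detach().requires_grad_(True)
    loss_p = robust_loss_fn(perturbed)
    (grad_p,) = torch.autograd.grad(loss_p, perturbed)
    return (scores - lr * grad_p).detach()            # descend the flat direction

toy_loss = lambda s: ((s.sigmoid() - 0.5) ** 2).sum()   # stand-in robust loss
scores = sharpness_aware_score_step(torch.randn(10), toy_loss)
```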

[175] Entropy-Enhanced Conformal Features from Ricci Flow for Robust Alzheimer’s Disease Classification

F. Ahmadi, B. Bidabad, H. Nasiri

Main category: cs.CV

TL;DR: A novel local surface representation method using entropy of conformally-derived geometric features achieves 98.62% accuracy in distinguishing Alzheimer’s patients from healthy controls.

DetailsMotivation: Alzheimer's disease is associated with significant cortical atrophy, making geometric surface analysis a valuable diagnostic tool for automated and accurate diagnosis.

Method: Used T1-weighted MRI scans from 160 participants (80 AD patients, 80 controls) from ADNI. Computed geometric attributes (area distortion, conformal factor, Gaussian curvature) from cortical surface models, applied Shannon entropy to create feature vectors, and trained multiple classifiers.

Result: The method proved highly effective with MLP and Logistic Regression achieving 98.62% accuracy and F1 score in distinguishing AD patients from healthy controls.

Conclusion: Entropy of conformally-derived geometric features provides a powerful and robust metric for cortical morphometry, offering a straightforward yet powerful tool for clinical research applications in Alzheimer’s disease diagnosis.

Abstract: Background and Objective: In brain imaging, geometric surface models are essential for analyzing the 3D shapes of anatomical structures. Alzheimer’s disease (AD) is associated with significant cortical atrophy, making such shape analysis a valuable diagnostic tool. The objective of this study is to introduce and validate a novel local surface representation method for the automated and accurate diagnosis of AD. Methods: The study utilizes T1-weighted MRI scans from 160 participants (80 AD patients and 80 healthy controls) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Cortical surface models were reconstructed from the MRI data using Freesurfer. Key geometric attributes were computed from the 3D meshes. Area distortion and conformal factor were derived using Ricci flow for conformal parameterization, while Gaussian curvature was calculated directly from the mesh geometry. Shannon entropy was applied to these three features to create compact and informative feature vectors. The feature vectors were used to train and evaluate a suite of classifiers (e.g. XGBoost, MLP, Logistic Regression, etc.). Results: Statistical significance of performance differences between classifiers was evaluated using paired Welch’s t-test. The method proved highly effective in distinguishing AD patients from healthy controls. The Multi-Layer Perceptron (MLP) and Logistic Regression classifiers outperformed all others, achieving an accuracy and F$_1$ Score of 98.62%. Conclusions: This study confirms that the entropy of conformally-derived geometric features provides a powerful and robust metric for cortical morphometry. The high classification accuracy underscores the method’s potential to enhance the study and diagnosis of Alzheimer’s disease, offering a straightforward yet powerful tool for clinical research applications.
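
Turning a per-vertex geometric attribute into a single entropy feature is a short computation; the histogram bin count below is an illustrative assumption:

```python
import numpy as np

def shannon_entropy(values: np.ndarray, bins: int = 64) -> float:
    """Histogram-based Shannon entropy of a per-vertex geometric attribute
    (area distortion, conformal factor, or Gaussian curvature)."""
    hist, _ = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

curvature = np.random.randn(150_000)           # toy per-vertex Gaussian curvature
feature_vector = [shannon_entropy(curvature)]  # one entry per geometric attribute
```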

[176] Bayesian Fully-Connected Tensor Network for Hyperspectral-Multispectral Image Fusion

Linsong Shan, Zecan Yang, Laurence T. Yang, Changlong Li, Honglu Zhao, Xin Nie

Main category: cs.CV

TL;DR: Proposes Bayesian FCTN (BFCTN) for hyperspectral-multispectral image fusion, using hierarchical sparse priors and variational Bayesian inference to preserve spatial-spectral structures while reducing manual parameter tuning.

DetailsMotivation: Existing tensor decomposition methods disrupt spatial-spectral structures through data vectorization/reshaping, require extensive manual parameter tuning, and lack robustness against noise and degradation.

Method: Bayesian FCTN decomposition with hierarchical sparse priors connecting factor tensors, modeled using Variational Bayesian inference and EM algorithm for parameter estimation.

Result: Achieves state-of-the-art fusion accuracy, strong robustness against noise and spatial degradation, and practical applicability in complex real-world scenarios.

Conclusion: BFCTN effectively preserves spatial-spectral structures, models cross-dimensional correlations, and reduces manual parameter tuning while maintaining high fusion performance and robustness.

Abstract: Tensor decomposition is a powerful tool for data analysis and has been extensively employed in the field of hyperspectral-multispectral image fusion (HMF). Existing tensor decomposition-based fusion methods typically rely on disruptive data vectorization/reshaping or impose rigid constraints on the arrangement of factor tensors, hindering the preservation of spatial-spectral structures and the modeling of cross-dimensional correlations. Although recent advances utilizing the Fully-Connected Tensor Network (FCTN) decomposition have partially alleviated these limitations, the process of reorganizing data into higher-order tensors still disrupts the intrinsic spatial-spectral structure. Furthermore, these methods necessitate extensive manual parameter tuning and exhibit limited robustness against noise and spatial degradation. To alleviate these issues, we propose the Bayesian FCTN (BFCTN) method. Within this probabilistic framework, a hierarchical sparse prior, which characterizes the sparsity of physical elements, establishes connections between the factor tensors. This framework explicitly models the intrinsic physical coupling among spatial structures, spectral signatures, and local scene homogeneity. For model learning, we develop a parameter estimation method based on Variational Bayesian inference (VB) and the Expectation-Maximization (EM) algorithm, which significantly reduces the need for manual parameter tuning. Extensive experiments demonstrate that BFCTN not only achieves state-of-the-art fusion accuracy and strong robustness but also exhibits practical applicability in complex real-world scenarios.

[177] Automated Wicket-Taking Delivery Segmentation and Weakness Detection in Cricket Videos Using OCR-Guided YOLOv8 and Trajectory Modeling

Mst Jannatun Ferdous, Masum Billah, Joy Karmoker, Mohd Ruhul Ameen, Akif Islam, Md. Omar Faruqe

Main category: cs.CV

TL;DR: An automated cricket video analysis system using YOLOv8 for pitch/ball detection and OCR for scorecard extraction, achieving high accuracy in detecting wickets and modeling ball trajectories.

DetailsMotivation: To automate cricket video analysis for extracting wicket-taking deliveries, detecting cricket balls, and modeling trajectories to provide data-driven insights for coaching and strategy.

Method: Uses YOLOv8 architecture for pitch and ball detection, optical character recognition for scorecard extraction, and image preprocessing (grayscale transformation, power transformation, morphological operations) for robust text extraction.

Result: Pitch detection achieved 99.5% mAP50 with 0.999 precision; ball detection achieved 99.18% mAP50 with 0.968 precision and 0.978 recall. System successfully modeled trajectories on detected pitches.

Conclusion: The approach effectively automates cricket analytics with high accuracy, offering significant potential for coaching and strategic decision-making in cricket.

Abstract: This paper presents an automated system for cricket video analysis that leverages deep learning techniques to extract wicket-taking deliveries, detect cricket balls, and model ball trajectories. The system employs the YOLOv8 architecture for pitch and ball detection, combined with optical character recognition (OCR) for scorecard extraction to identify wicket-taking moments. Through comprehensive image preprocessing, including grayscale transformation, power transformation, and morphological operations, the system achieves robust text extraction from video frames. The pitch detection model achieved 99.5% mean Average Precision at 50% IoU (mAP50) with a precision of 0.999, while the ball detection model using transfer learning attained 99.18% mAP50 with 0.968 precision and 0.978 recall. The system enables trajectory modeling on detected pitches, providing data-driven insights for identifying batting weaknesses. Experimental results on multiple cricket match videos demonstrate the effectiveness of this approach for automated cricket analytics, offering significant potential for coaching and strategic decision-making.
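
A minimal sketch of the described scorecard pre-processing chain (grayscale, power/gamma transform, morphological opening, then OCR); the parameter values are illustrative rather than taken from the paper:

```python
import cv2
import numpy as np
import pytesseract

def extract_scorecard_text(frame, gamma=1.5):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Power-law (gamma) transform to stretch contrast in the scorecard region.
    norm = (gray / 255.0) ** gamma
    enhanced = (norm * 255).astype(np.uint8)
    # Morphological opening to suppress small noise speckles before OCR.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    cleaned = cv2.morphologyEx(enhanced, cv2.MORPH_OPEN, kernel)
    return pytesseract.image_to_string(cleaned)
```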

[178] ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters

Zhiwei Hao, Jianyuan Guo, Li Shen, Kai Han, Yehui Tang, Han Hu, Yunhe Wang

Main category: cs.CV

TL;DR: ScaleNet enables efficient scaling of vision transformers by inserting additional layers with weight sharing and adapter modules, achieving better performance with fewer training epochs compared to training from scratch.

DetailsMotivation: Training larger vision transformer models is computationally intensive and costly. ScaleNet addresses this by providing an efficient method to scale up existing pretrained models without the high costs of training from scratch.

Method: ScaleNet inserts additional layers into pretrained ViTs using layer-wise weight sharing. Each added layer shares parameters with a corresponding pretrained layer, and parallel adapter modules with adjustment parameters are used to optimize each instance of shared parameters.

Result: On ImageNet-1K, ScaleNet achieved 7.42% accuracy improvement over training from scratch with a 2x depth-scaled DeiT-Base model, while requiring only one-third of the training epochs. The method also showed potential in downstream tasks like object detection.

Conclusion: ScaleNet provides an efficient and cost-effective approach for scaling vision transformers, enabling rapid model expansion with minimal parameter increases while maintaining or improving performance across various vision tasks.

Abstract: Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameter efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2$\times$ depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by validation on an object detection task.
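
A hedged sketch of the core idea: an inserted layer reuses a pretrained layer's parameter tensor and adds a small parallel adapter so each shared instance can specialize. The low-rank adapter form and names are assumptions, since the summary only specifies "parallel adapter modules".

```python
import torch
import torch.nn as nn

class SharedLayerWithAdapter(nn.Module):
    def __init__(self, pretrained_layer: nn.Module, dim: int, rank: int = 8):
        super().__init__()
        self.shared = pretrained_layer          # parameters shared, not copied
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)          # adapter contributes nothing at init

    def forward(self, x):
        # Shared pretrained computation plus a small, layer-specific correction.
        return self.shared(x) + self.up(self.down(x))
```

Interleaving one such wrapper after each pretrained block would roughly double the depth while adding only the adapter parameters.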

[179] ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization

Yuanhe Guo, Linxi Xie, Zhuoran Chen, Kangrui Yu, Ryan Po, Guandao Yang, Gordon Wetztein, Hongyi Wen

Main category: cs.CV

TL;DR: ImageGem is a dataset for studying generative models that understand individual user preferences, featuring real-world interaction data from 57K users including customized LoRAs, text prompts, and generated images.

DetailsMotivation: The key challenge hindering the development of personalized generative models is the lack of in-the-wild and fine-grained user preference annotations.

Method: Created ImageGem dataset with 57K users, 242K customized LoRAs, 3M text prompts, and 5M generated images. Used this data to train preference alignment models, test retrieval models and vision-language models, and propose an end-to-end framework for editing customized diffusion models in latent weight space.

Result: The dataset enabled training better preference alignment models and demonstrated performance on personalized image retrieval and generative model recommendation. The proposed framework successfully aligned customized diffusion models with individual user preferences.

Conclusion: ImageGem dataset enables a new paradigm for generative model personalization by providing the first comprehensive dataset with fine-grained user preference annotations.

Abstract: We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.

[180] Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection

Ji Du, Xin Wang, Fangwei Hao, Mingyang Yu, Chunyuan Chen, Jiesheng Wu, Bin Wang, Jing Xu, Ping Li

Main category: cs.CV

TL;DR: RISE is a retrieval-based self-augmented paradigm for camouflaged object detection that generates pseudo-labels using dataset-level contextual information without ground truth annotations.

DetailsMotivation: Previous COD methods primarily use image-level modeling or annotation-based optimization, lacking dataset-level contextual information and requiring laborious annotations. RISE aims to exploit the entire training dataset to generate pseudo-labels without ground truth.

Method: RISE constructs prototype libraries for environments and camouflaged objects using training images without ground truth. It uses a Clustering-then-Retrieval strategy to generate coarse masks through clustering, followed by histogram-based filtering and cross-category retrieval. Multi-View KNN Retrieval integrates results from diverse views to produce robust pseudo-masks.

Result: Extensive experiments demonstrate that RISE outperforms state-of-the-art unsupervised and prompt-based methods in camouflaged object detection.

Conclusion: RISE provides an effective retrieval-based self-augmented paradigm that successfully generates high-quality pseudo-labels for COD without requiring ground truth annotations, leveraging dataset-level contextual information through innovative clustering and multi-view retrieval strategies.

Abstract: At the core of Camouflaged Object Detection (COD) lies segmenting objects from their highly similar surroundings. Previous efforts navigate this challenge primarily through image-level modeling or annotation-based optimization. Despite advancing considerably, this commonplace practice hardly taps valuable dataset-level contextual information or relies on laborious annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which could be used to train COD models. RISE begins by constructing prototype libraries for environments and camouflaged objects using training images (without ground truth), followed by K-Nearest Neighbor (KNN) retrieval to generate pseudo-masks for each image based on these libraries. It is important to recognize that using only training images without annotations poses a pronounced challenge in crafting high-quality prototype libraries. In this light, we introduce a Clustering-then-Retrieval (CR) strategy, where coarse masks are first generated through clustering, facilitating subsequent histogram-based image filtering and cross-category retrieval to produce high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which integrates retrieval results from diverse views to produce more robust and precise pseudo-masks. Extensive experiments demonstrate that RISE outperforms state-of-the-art unsupervised and prompt-based methods. Code is available at https://github.com/xiaohainku/RISE.
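
As a rough illustration of the KNN retrieval stage, the sketch below assigns each patch feature a pseudo-label by majority vote over its nearest prototypes; shapes, names, and the voting rule are assumptions, and the full pipeline additionally involves clustering, filtering, and multi-view aggregation.

```python
import numpy as np

def knn_pseudo_mask(patch_feats, prototypes, proto_labels, k=5):
    """patch_feats: (P, D); prototypes: (M, D); proto_labels: (M,) in {0: env, 1: object}."""
    # Squared Euclidean distance from every patch to every prototype.
    d = ((patch_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d, axis=1)[:, :k]
    votes = proto_labels[knn].mean(axis=1)   # fraction of object-prototype neighbors
    return (votes > 0.5).astype(np.uint8)    # per-patch pseudo-label
```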

[181] LAND: Lung and Nodule Diffusion for 3D Chest CT Synthesis with Anatomical Guidance

Anna Oliveras, Roger Marí, Rafael Redondo, Oriol Guardià, Ana Tost, Bhalaji Nagarajan, Carolina Migliorelli, Vicent Ribas, Petia Radeva

Main category: cs.CV

TL;DR: A new latent diffusion model generates high-quality 3D chest CT scans from anatomical masks using minimal GPU resources, enabling controlled synthesis of CT volumes with/without lung nodules.

DetailsMotivation: To create a computationally efficient method for generating realistic 3D chest CT scans that can be precisely controlled through anatomical masks, addressing the high computational costs of existing approaches.

Method: Uses a latent diffusion model conditioned on 3D anatomical masks (lung and nodule regions) to synthesize 256x256x256 volumetric CT images at 1 mm resolution on a single mid-range GPU.

Result: Successfully generates diverse CT volumes with controlled anatomical features; shows that conditioning only on nodule masks produces anatomically incorrect outputs, demonstrating the necessity of global lung structure for accurate synthesis.

Conclusion: The proposed method provides an efficient tool for generating realistic CT data with precise anatomical control, valuable for training AI models and healthcare professionals, while highlighting the critical role of global anatomical context in conditional medical image synthesis.

Abstract: This work introduces a new latent diffusion model to generate high-quality 3D chest CT scans conditioned on 3D anatomical masks. The method synthesizes volumetric images of size 256x256x256 at 1 mm isotropic resolution using a single mid-range GPU, significantly lowering the computational cost compared to existing approaches. The conditioning masks delineate lung and nodule regions, enabling precise control over the output anatomical features. Experimental results demonstrate that conditioning solely on nodule masks leads to anatomically incorrect outputs, highlighting the importance of incorporating global lung structure for accurate conditional synthesis. The proposed approach supports the generation of diverse CT volumes with and without lung nodules of varying attributes, providing a valuable tool for training AI models or healthcare professionals.

[182] Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

Tianci Bi, Xiaoyi Zhang, Yan Lu, Nanning Zheng

Main category: cs.CV

TL;DR: Proposes VFM-VAE to directly integrate Vision Foundation Models into Latent Diffusion Models, avoiding distillation issues and achieving superior performance with faster convergence.

DetailsMotivation: Existing distillation approaches for incorporating Vision Foundation Models into LDMs weaken robustness and cause semantic deviation under distribution shifts.

Method: Designs VFM-VAE with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, plus joint tokenizer-diffusion alignment strategy using SE-CKNNA metric.

Result: Achieves gFID of 2.20 in 80 epochs (10x speedup) and 1.62 after 640 epochs, establishing direct VFM integration as superior paradigm.

Conclusion: Direct VFM integration through VFM-VAE is a better approach than distillation for LDMs, providing both performance and efficiency benefits.

Abstract: The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM’s semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the SE-CKNNA metric as a more precise diagnostic tool. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.

[183] Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos

Jinfeng Liu, Lingtong Kong, Mi Zhou, Jinwen Chen, Dan Xu

Main category: cs.CV

TL;DR: Mono4DGS-HDR is the first system for reconstructing 4D HDR scenes from monocular LDR videos with alternating exposures using a two-stage Gaussian Splatting approach.

DetailsMotivation: To address the challenging problem of reconstructing renderable 4D HDR scenes from unposed monocular LDR videos with alternating exposures, which hasn't been studied before.

Method: Two-stage optimization: 1) Learn video HDR Gaussian representation in orthographic camera space without poses, 2) Transform to world space and jointly refine Gaussians with camera poses, plus temporal luminance regularization for HDR consistency.

Result: Significantly outperforms adapted state-of-the-art methods in both rendering quality and speed, as demonstrated on a new evaluation benchmark.

Conclusion: Mono4DGS-HDR successfully reconstructs high-quality 4D HDR scenes from challenging monocular LDR video input with alternating exposures.

Abstract: We introduce Mono4DGS-HDR, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. To tackle such a challenging problem, we present a unified framework with two-stage optimization approach based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, eliminating the need for camera poses and enabling robust initial HDR video reconstruction. The second stage transforms video Gaussians into world space and jointly refines the world Gaussians with camera poses. Furthermore, we propose a temporal luminance regularization strategy to enhance the temporal consistency of the HDR appearance. Since our task has not been studied before, we construct a new evaluation benchmark using publicly available datasets for HDR video reconstruction. Extensive experiments demonstrate that Mono4DGS-HDR significantly outperforms alternative solutions adapted from state-of-the-art methods in both rendering quality and speed.

[184] Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation

Wei-Chia Chang, Yan-Ann Chen

Main category: cs.CV

TL;DR: A pipeline combining vision-language models with retrieval-augmented generation for zero-shot vehicle make and model recognition, achieving 20% improvement over CLIP baseline without retraining.

DetailsMotivation: Existing VMMR approaches struggle to adapt to newly released vehicle models, and CLIP's fixed pretrained weights limit performance without costly finetuning.

Method: VLM converts vehicle images to descriptive attributes, which are compared against a textual feature database. Retrieved entries are combined with descriptions to form prompts for language model inference.

Result: The proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating effective zero-shot performance.

Conclusion: RAG-enhanced LM reasoning enables scalable VMMR in smart-city applications without large-scale retraining, allowing rapid updates through textual descriptions.

Abstract: Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
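
A minimal sketch of the retrieval-and-prompt step, using cosine similarity over precomputed text embeddings; `db_vecs`, `db_entries`, and the prompt template are placeholders, not the paper's implementation:

```python
import numpy as np

def retrieve(query_vec, db_vecs, db_entries, k=3):
    # Cosine similarity between the query embedding and every database entry.
    sims = db_vecs @ query_vec / (
        np.linalg.norm(db_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    top = np.argsort(-sims)[:k]
    return [db_entries[i] for i in top]

def build_prompt(description, retrieved):
    context = "\n".join(retrieved)
    return (f"Vehicle description:\n{description}\n\n"
            f"Candidate makes/models from the database:\n{context}\n\n"
            "Which make and model best matches the description?")
```

Because the knowledge base is textual, supporting a newly released vehicle only requires appending its description to `db_entries`, with no retraining.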

[185] DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices

Suman Kunwar

Main category: cs.CV

TL;DR: DWaste is a computer vision platform for real-time waste sorting on smartphones and edge devices, comparing classification vs detection models with trade-offs between accuracy and efficiency.

DetailsMotivation: Address the growing waste problem from convenience packaging by enabling efficient waste sorting through AI-powered solutions on resource-constrained devices.

Method: Benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection models (YOLOv8n, YOLOv11n) using a custom waste dataset annotated with the Annotated Lab tool, with model quantization for optimization.

Result: Classification models achieved high accuracy (96% with EfficientNetV2S) but had high latency (0.22s) and carbon emissions. Object detection models delivered strong performance (77% mAP) with ultra-fast inference (0.03s), smaller model sizes (<7MB), and quantization reduced model size/VRAM by up to 75%.

Conclusion: Lightweight object detection models with quantization are ideal for real-time, low-power waste sorting on edge devices, demonstrating successful implementation of “Greener AI” for sustainable waste management.

Abstract: The rise of convenience packaging has led to generation of enormous waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection models (YOLOv8n, YOLOv11n) using a subset of our own waste dataset, annotated with the custom tool Annotated Lab. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy (~ 96%) but suffered from high latency (~ 0.22s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 77% mAP) with ultra-fast inference (~ 0.03s) and significantly smaller model sizes (< 7MB), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, substantially reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of “Greener AI” models to support real-time, sustainable waste sorting on edge devices.
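
The summary doesn't name the quantization toolchain; as one common route for deploying detectors on mobile and edge devices, TensorFlow Lite post-training quantization looks roughly like this (paths are placeholders):

```python
import tensorflow as tf

# Convert a trained model to a size-optimized TFLite model for edge devices.
converter = tf.lite.TFLiteConverter.from_saved_model("dwaste_saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()
with open("dwaste_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```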

[186] RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation

Junwen Huang, Shishir Reddy Vutukur, Peter KT Yu, Nassir Navab, Slobodan Ilic, Benjamin Busam

Main category: cs.CV

TL;DR: The paper proposes a diffusion transformer approach for template-based object pose estimation, reformulating it as a ray alignment problem to improve accuracy by aligning viewing directions from posed templates with non-posed query images.

DetailsMotivation: Traditional template-based pose estimation methods often fail due to incorrect template retrieval, leading to inaccurate pose predictions. The authors aim to address this limitation by developing a more robust approach.

Method: The method reformulates pose estimation as a ray alignment problem using a diffusion transformer architecture. It reparameterizes object rotation using object-centered camera rays and models translation through dense translation offsets. A coarse-to-fine training strategy with narrowed template sampling is employed.

Result: Extensive experiments across multiple benchmark datasets demonstrate competitive performance compared to state-of-the-art approaches in unseen object pose estimation.

Conclusion: The proposed diffusion transformer framework successfully addresses template retrieval failures in pose estimation by leveraging geometric priors and ray alignment, achieving competitive results on benchmark datasets.

Abstract: Typical template-based object pose pipelines estimate the pose by retrieving the closest matching template and aligning it with the observed image. However, failure to retrieve the correct template often leads to inaccurate pose predictions. To address this, we reformulate template-based object pose estimation as a ray alignment problem, where the viewing directions from multiple posed template images are learned to align with a non-posed query image. Inspired by recent progress in diffusion-based camera pose estimation, we embed this formulation into a diffusion transformer architecture that aligns a query image with a set of posed templates. We reparameterize object rotation using object-centered camera rays and model object translation by extending scale-invariant translation estimation to dense translation offsets. Our model leverages geometric priors from the templates to guide accurate query pose inference. A coarse-to-fine training strategy based on narrowed template sampling improves performance without modifying the network architecture. Extensive experiments across multiple benchmark datasets show competitive results of our method compared to state-of-the-art approaches in unseen object pose estimation.

[187] GBlobs: Local LiDAR Geometry for Improved Sensor Placement Generalization

Dušan Malić, Christian Fruhwirth-Reisinger, Alexander Prutsch, Wei Lin, Samuel Schulter, Horst Possegger

Main category: cs.CV

TL;DR: The paper presents GBlobs, a local point cloud feature descriptor that achieves state-of-the-art 3D object detection performance across diverse LiDAR sensor placements by overcoming geometric shortcuts in traditional methods.

DetailsMotivation: Current LiDAR-based 3D detectors suffer from 'geometric shortcuts' where models rely on absolute object positions rather than shape characteristics, limiting generalization to different sensor placements and point distributions.

Method: The approach uses GBlobs as network input features - local point cloud feature descriptors specifically designed to replace absolute Cartesian coordinates and force the network to learn object-centric representations instead of position-dependent features.

Result: The method achieved top-ranking performance in RoboSense 2025 Track 3, demonstrating exceptional generalization capabilities across various LiDAR configurations and significantly improved robustness to different sensor placements.

Conclusion: GBlobs effectively circumvent geometric shortcuts in 3D object detection, enabling models to learn robust, object-centric representations that generalize well across diverse sensor configurations, establishing a new state-of-the-art approach for cross-sensor 3D detection.

Abstract: This technical report outlines the top-ranking solution for RoboSense 2025: Track 3, achieving state-of-the-art performance on 3D object detection under various sensor placements. Our submission utilizes GBlobs, a local point cloud feature descriptor specifically designed to enhance model generalization across diverse LiDAR configurations. Current LiDAR-based 3D detectors often suffer from a “geometric shortcut” when trained on conventional global features (i.e., absolute Cartesian coordinates). This introduces a position bias that causes models to primarily rely on absolute object position rather than distinguishing shape and appearance characteristics. Although effective for in-domain data, this shortcut severely limits generalization when encountering different point distributions, such as those resulting from varying sensor placements. By using GBlobs as network input features, we effectively circumvent this geometric shortcut, compelling the network to learn robust, object-centric representations. This approach significantly enhances the model’s ability to generalize, resulting in the exceptional performance demonstrated in this challenge.
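
The report excerpt doesn't spell out the GBlobs formulation; the sketch below only illustrates the underlying principle of feeding local, centroid-relative statistics instead of absolute coordinates (the names and the mean/covariance choice are assumptions):

```python
import numpy as np

def local_blob_features(points):
    """points: (N, 3) LiDAR points belonging to one local neighborhood."""
    centroid = points.mean(axis=0)
    offsets = points - centroid            # translation-invariant local coordinates
    cov = np.cov(offsets, rowvar=False)    # local shape statistics, position-free
    return offsets, cov
```

Because neither output depends on where the neighborhood sits in the global frame, a detector consuming such features cannot fall back on absolute position.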

[188] Descriptor: Occluded nuScenes: A Multi-Sensor Dataset for Evaluating Perception Robustness in Automated Driving

Sanjay Kumar, Tim Brophy, Reenu Mohandas, Eoin Martino Grua, Ganesh Sistu, Valentina Donzella, Ciaran Eising

Main category: cs.CV

TL;DR: The paper introduces Occluded nuScenes Dataset, an extension of nuScenes benchmark with controlled, parameterized sensor occlusions for robust perception evaluation in automated driving.

DetailsMotivation: Existing autonomous driving datasets lack controlled, parameterized, and reproducible sensor degradations, limiting systematic evaluation of perception systems under adverse conditions like sensor failures and environmental occlusions.

Method: Extends nuScenes dataset with: camera occlusions (4 types), radar degradations (3 types), and LiDAR degradations (3 types) using parameterized scripts for flexible and repeatable data generation.

Result: Created the first multi-sensor occlusion dataset with controlled degradations, enabling consistent evaluation of perception models under partial sensor failures and environmental interference.

Conclusion: The dataset advances research on robust sensor fusion, resilience analysis, and safety-critical perception by providing reproducible adverse condition testing for automated driving systems.

Abstract: Robust perception in automated driving requires reliable performance under adverse conditions, where sensors may be affected by partial failures or environmental occlusions. Although existing autonomous driving datasets inherently contain sensor noise and environmental variability, very few enable controlled, parameterised, and reproducible degradations across multiple sensing modalities. This gap limits the ability to systematically evaluate how perception and fusion architectures perform under well-defined adverse conditions. To address this limitation, we introduce the Occluded nuScenes Dataset, a novel extension of the widely used nuScenes benchmark. For the camera modality, we release both the full and mini versions with four types of occlusions, two adapted from public implementations and two newly designed. For radar and LiDAR, we provide parameterised occlusion scripts that implement three types of degradations each, enabling flexible and repeatable generation of occluded data. This resource supports consistent, reproducible evaluation of perception models under partial sensor failures and environmental interference. By releasing the first multi-sensor occlusion dataset with controlled and reproducible degradations, we aim to advance research on robust sensor fusion, resilience analysis, and safety-critical perception in automated driving.
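
A hedged sketch of what a parameterised camera occlusion might look like (the released scripts' actual parameters and occlusion types are not detailed in this summary):

```python
import numpy as np

def occlude(image, cx, cy, radius, opacity=1.0):
    """Darken a circular region, e.g., simulating a partial lens blockage.
    cx, cy, radius, opacity are the controllable degradation parameters."""
    h, w = image.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
    out = image.astype(np.float32).copy()
    out[mask] *= (1.0 - opacity)
    return out.astype(image.dtype)
```

Fixing the parameter values (or the random seed that draws them) is what makes such degradations reproducible across evaluation runs.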

[189] See the Text: From Tokenization to Visual Reading

Ling Xing, Alex Jinpeng Wang, Rui Yan, Hongyu Qu, Zechao Li, Jinhui Tang

Main category: cs.CV

TL;DR: SeeTok renders text as images and uses multimodal LLMs to interpret them, achieving comparable performance to subword tokenizers while using 4.43x fewer tokens and reducing FLOPs by 70.5%.

DetailsMotivation: Current LLMs use subword tokenization which fragments text and over-segments low-resource languages, creating long meaningless sequences and increasing computation. Humans read visually by recognizing word shapes and patterns.

Method: SeeTok renders text as visual images and leverages pretrained multimodal LLMs’ OCR and text-vision alignment capabilities learned from large-scale multimodal training.

Result: SeeTok matches or surpasses subword tokenizers across three language tasks while requiring significantly fewer tokens and computational resources. It also shows improved cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy.

Conclusion: SeeTok represents a shift from symbolic tokenization to human-like visual reading, moving toward more natural and cognitively inspired language models.

Abstract: People see text. Humans read by recognizing words as visual objects, including their shapes, layouts, and patterns, before connecting them to meaning, which enables us to handle typos, distorted fonts, and various scripts effectively. Modern large language models (LLMs), however, rely on subword tokenization, fragmenting text into pieces from a fixed vocabulary. While effective for high-resource languages, this approach over-segments low-resource languages, yielding long, linguistically meaningless sequences and inflating computation. In this work, we challenge this entrenched paradigm and move toward a vision-centric alternative. Our method, SeeTok, renders text as images (visual-text) and leverages pretrained multimodal LLMs to interpret them, reusing strong OCR and text-vision alignment abilities learned from large-scale multimodal training. Across three different language tasks, SeeTok matches or surpasses subword tokenizers while requiring 4.43 times fewer tokens and reducing FLOPs by 70.5%, with additional gains in cross-lingual generalization, robustness to typographic noise, and linguistic hierarchy. SeeTok signals a shift from symbolic tokenization to human-like visual reading, and takes a step toward more natural and cognitively inspired language models.
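
A minimal sketch of the rendering step, turning a string into an image patch for a multimodal LLM; the font path, canvas size, and layout are illustrative:

```python
from PIL import Image, ImageDraw, ImageFont

def render_text(text, width=448, height=64, font_path="DejaVuSans.ttf", size=24):
    # White canvas with black text, i.e., "visual-text" input for the model.
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, size)
    draw.text((4, (height - size) // 2), text, fill="black", font=font)
    return img
```

One rendered patch can cover many words, which is how a vision-centric reader ends up consuming far fewer tokens than a subword tokenizer for the same text.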

[190] Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model

Zhenxing Zhang, Jiayan Teng, Zhuoyi Yang, Tiankun Cao, Cheng Wang, Xiaotao Gu, Jie Tang, Dan Guo, Meng Wang

Main category: cs.CV

TL;DR: Kaleido is a subject-to-video generation framework that addresses multi-subject consistency and background disentanglement issues in video synthesis from multiple reference images.

DetailsMotivation: Existing S2V generation models struggle with maintaining multi-subject consistency and handling background disentanglement, leading to lower reference fidelity and semantic drift when using multiple reference images.

Method: Proposes a data construction pipeline with quality filtering and diverse data synthesis, plus Reference Rotary Positional Encoding (R-RoPE) for stable multi-image integration.

Result: Extensive experiments show Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization across multiple benchmarks.

Conclusion: Kaleido represents an advance in S2V generation by effectively addressing multi-subject consistency and background disentanglement challenges.

Abstract: We present Kaleido, a subject-to-video (S2V) generation framework, which aims to synthesize subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation models, existing approaches remain inadequate at maintaining multi-subject consistency and at handling background disentanglement, often resulting in lower reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. Primarily, the training dataset suffers from a lack of diversity and high-quality samples, as well as cross-paired data, i.e., paired samples whose components originate from different instances. In addition, the current mechanism for integrating multiple reference images is suboptimal, potentially resulting in the confusion of multiple subjects. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filtering and diverse data synthesis, to produce consistency-preserving training data. Moreover, we introduce Reference Rotary Positional Encoding (R-RoPE) to process reference images, enabling stable and precise multi-image integration. Extensive experiments across numerous benchmarks demonstrate that Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.

[191] CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder

Yongmin Lee, Hye Won Chung

Main category: cs.CV

TL;DR: CovMatch is a scalable multimodal dataset distillation framework that aligns cross-covariance of real and synthetic features while regularizing feature distributions, enabling joint optimization of both image and text encoders for improved cross-modal alignment.

DetailsMotivation: Extending dataset distillation to multimodal contrastive learning faces challenges in cross-modal alignment and computational cost. Prior approaches that freeze text encoders severely limit semantic alignment and become performance bottlenecks.

Method: Proposes CovMatch framework that aligns cross-covariance between real and synthetic features while regularizing feature distributions within each modality. Unlike prior methods, it enables joint optimization of both image and text encoders.

Result: Outperforms state-of-the-art multimodal distillation methods on Flickr30K and COCO datasets, achieving up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic image-text pairs.

Conclusion: CovMatch provides a scalable solution for multimodal dataset distillation that enables joint encoder optimization, leading to stronger cross-modal alignment and significant performance improvements over existing methods.

Abstract: Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and updating only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
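
A sketch of a cross-covariance matching objective under assumed conventions (mean-centered batch features, squared Frobenius distance); the paper's exact loss and within-modality regularizers are not reproduced here:

```python
import torch

def cross_cov(img_feats, txt_feats):
    # Cross-covariance between image and text features over a batch.
    img = img_feats - img_feats.mean(0, keepdim=True)
    txt = txt_feats - txt_feats.mean(0, keepdim=True)
    return img.t() @ txt / (img.shape[0] - 1)

def covmatch_loss(real_img, real_txt, syn_img, syn_txt):
    # Push the image-text cross-covariance of the synthetic set toward that of real data.
    return (cross_cov(real_img, real_txt) - cross_cov(syn_img, syn_txt)).pow(2).sum()
```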

[192] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: GAR is a region-level MLLM that addresses limitations in fine-grained visual understanding by leveraging global contexts and modeling interactions between multiple regions, achieving advanced compositional reasoning.

DetailsMotivation: Existing MLLMs struggle with dense scenes requiring fine-grained analysis, and previous region-level approaches understand regions in isolation without considering global contexts.

Method: Introduces GAR with RoI-aligned feature replay technique to support precise perception using global contexts, modeling interactions between multiple prompts, and achieving compositional reasoning.

Result: GAR-1B outperforms DAM-3B by +4.5 on DLC-Bench, surpasses InternVL3-78B on GAR-Bench-VQA, and GAR-8B outperforms VideoRefer-7B on VideoRefer-BenchQ in zero-shot transfer to videos.

Conclusion: GAR enables comprehensive region-level visual understanding with global context awareness, shifting from passive description to active dialogue, and demonstrates strong transfer capabilities to video domains.

Abstract: While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.

[193] Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang

Main category: cs.CV

TL;DR: 3DThinker is a framework that enables 3D spatial reasoning without 3D prior inputs or labeled 3D training data, outperforming existing methods on multiple benchmarks.

DetailsMotivation: Current vision-language models struggle with 3D spatial relationships from limited views, as existing methods rely on text or 2D visual cues with limited representational capacity for 3D spatial imagination tasks.

Method: Two-stage training: first supervised alignment of 3D latent representations between VLM and 3D foundation model, then optimization of reasoning trajectory based on outcome signals to refine 3D mentaling.

Result: Extensive experiments show 3DThinker consistently outperforms strong baselines across multiple benchmarks.

Conclusion: The framework offers a new perspective for unifying 3D representations into multimodal reasoning and enables 3D mentaling during reasoning without requiring 3D prior inputs or explicit 3D training data.

Abstract: Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, as humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at https://github.com/zhangquanchen/3DThinker.

[194] C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression

Baptiste Bauvin, Loïc Baret, Ola Ahmad

Main category: cs.CV

TL;DR: Proposes a novel one-shot pruning framework using explainable deep learning with causal-aware approach to efficiently reduce model size without performance degradation, outperforming existing methods.

DetailsMotivation: Address the performance drop issue in one-shot pruning methods that apply pruning directly at post-training without fine-tuning, while maintaining model efficiency.

Method: Causal-aware pruning approach leveraging cause-effect relations between model predictions and structures in a progressive pruning process, applied to CNN and ViT baselines.

Result: Achieves substantial model size reductions with minimal performance impact, outperforms counterparts without requiring fine-tuning.

Conclusion: The proposed one-shot pruning framework offers the best trade-off between model compression and performance preservation, making it suitable for deployment-constrained scenarios.

Abstract: Neural network compression has gained increasing attention in recent years, particularly in computer vision applications, where the need for model reduction is crucial for overcoming deployment constraints. Pruning is a widely used technique that promotes sparsity in model structures, e.g., weights, neurons, and layers, reducing size and inference costs. Structured pruning is especially important as it allows for the removal of entire structures, which further accelerates inference time and reduces memory overhead. However, it can be computationally expensive, requiring iterative retraining and optimization. To overcome this problem, recent methods considered the one-shot setting, which applies pruning directly at post-training. Unfortunately, they often lead to a considerable drop in performance. In this paper, we focus on this issue by proposing a novel one-shot pruning framework that relies on explainable deep learning. First, we introduce a causal-aware pruning approach that leverages cause-effect relations between model predictions and structures in a progressive pruning process. It allows us to efficiently reduce the size of the network, ensuring that the removed structures do not deter the performance of the model. Then, through experiments conducted on convolutional neural network and vision transformer baselines, pre-trained on classification tasks, we demonstrate that our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning. Overall, our approach outperforms its counterparts, offering the best trade-off. Our code is available on GitHub.

[195] ε-Seg: Sparsely Supervised Semantic Segmentation of Microscopy Data

Sheida Rahnamai Kordasiabi, Damian Dalle Nogare, Florian Jug

Main category: cs.CV

TL;DR: ε-Seg is a hierarchical variational autoencoder method for semantic segmentation of biological EM images using sparse labels (0.05% or less), featuring center-region masking, contrastive learning, GMM prior, and direct MLP prediction.

DetailsMotivation: Semantic segmentation of complex EM images is challenging even for human observers, especially with limited labeled data, requiring methods that work with sparse supervision.

Method: Uses hierarchical VAEs with center-region masking for robust embeddings, sparse label contrastive learning, GMM prior for latent space clustering, and MLP head for direct label prediction instead of clustering.

Result: Achieves competitive sparsely-supervised segmentation on complex biological EM datasets and fluorescence microscopy data with very limited training labels.

Conclusion: ε-Seg enables effective semantic segmentation of complex biological images using minimal labeled data through robust embedding learning and direct prediction approach.

Abstract: Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce ε-Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center-region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering-free label prediction. Center-region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster with respect to the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose an MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of ε-Seg and baseline methods on two dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that ε-Seg is capable of achieving competitive sparsely-supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.

[196] Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression

Kyo Kuroki, Yasuyuki Okoshi, Thiem Van Chu, Kazushi Kawamura, Masato Motomura

Main category: cs.CV

TL;DR: Binary Quadratic Quantization (BQQ) is a novel matrix quantization method that uses binary quadratic expressions instead of linear combinations, achieving superior memory efficiency and reconstruction error trade-offs compared to conventional methods.

DetailsMotivation: To overcome limitations of conventional first-order quantization methods that use linear combinations of binary bases, by leveraging the expressive power of binary quadratic expressions while maintaining compact data format.

Method: Proposes Binary Quadratic Quantization (BQQ) that approximates real-valued matrices using binary quadratic expressions rather than linear combinations. Validated through matrix compression benchmarks and post-training quantization on Vision Transformer models.

Result: BQQ consistently achieves superior trade-off between memory efficiency and reconstruction error for matrix compression. Outperforms state-of-the-art PTQ methods by up to 2.2% and 59.1% on ImageNet under calibration-based and data-free scenarios with 2-bit quantization.

Conclusion: Binary quadratic expressions are surprisingly effective for efficient matrix approximation and neural network compression, demonstrating BQQ’s potential as a powerful quantization approach.

Abstract: This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.
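
The summary doesn't give BQQ's quadratic parameterization; for reference, here is the first-order binary coding baseline it generalizes, W ≈ Σᵢ αᵢBᵢ with Bᵢ ∈ {−1, +1}, fitted greedily on residuals:

```python
import numpy as np

def binary_coding_quantize(W, num_bases=2):
    # Greedy residual fitting: each basis is the sign of the current residual,
    # and its least-squares scale for a sign matrix is the mean absolute residual.
    residual, alphas, bases = W.copy(), [], []
    for _ in range(num_bases):
        B = np.sign(residual)
        B[B == 0] = 1
        alpha = np.abs(residual).mean()
        alphas.append(alpha)
        bases.append(B)
        residual = residual - alpha * B
    return alphas, bases  # reconstruction: sum(a * B for a, B in zip(alphas, bases))
```

BQQ's claim is that replacing this linear combination with a quadratic expression in binary matrices buys more expressive power at a comparably compact storage format.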

[197] Image augmentation with invertible networks in interactive satellite image change detection

Hichem Sahbi

Main category: cs.CV

TL;DR: A novel interactive satellite image change detection algorithm using active learning with an invertible network for data augmentation.

DetailsMotivation: To improve satellite image change detection by interactively querying users for labels and dynamically updating the model, addressing the challenge of limited labeled data.

Method: Uses an iterative active learning framework with a question-and-answer model that queries users about image labels. Features a novel invertible network that maps images to latent spaces for linear augmentation, then back to input space for model retraining.

Result: Experimental results show superior performance compared to related work in satellite image change detection.

Conclusion: The proposed interactive framework with invertible network augmentation effectively improves change detection accuracy through active learning and user interaction.

Abstract: This paper devises a novel interactive satellite image change detection algorithm based on active learning. Our framework employs an iterative process that leverages a question-and-answer model. This model queries the oracle (user) about the labels of a small subset of images (dubbed as display), and based on the oracle’s responses, change detection model is dynamically updated. The main contribution of our framework resides in a novel invertible network that allows augmenting displays, by mapping them from highly nonlinear input spaces to latent ones, where augmentation transformations become linear and more tractable. The resulting augmented data are afterwards mapped back to the input space, and used to retrain more effective change detection criteria in the subsequent iterations of active learning. Experimental results demonstrate superior performance of our proposed method compared to the related work.
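
The paper's invertible network isn't specified in this summary; a standard affine-coupling block can stand in to show the map-to-latent, linearly-augment, map-back loop:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style invertible block; dim is assumed even."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))  # outputs (log_s, t)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        return torch.cat([x1, x2 * log_s.exp() + t], dim=-1)

    def inverse(self, z):
        z1, z2 = z.chunk(2, dim=-1)
        log_s, t = self.net(z1).chunk(2, dim=-1)
        return torch.cat([z1, (z2 - t) * (-log_s).exp()], dim=-1)

def augment(flow, x_a, x_b, alpha=0.5):
    # Linear interpolation in the latent space, mapped back to the input space.
    z = (1 - alpha) * flow(x_a) + alpha * flow(x_b)
    return flow.inverse(z)
```

Because the map is exactly invertible, every latent-space augmentation corresponds to a concrete sample in the original image space that can be fed back into retraining.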

[198] Beyond the Pipeline: Analyzing Key Factors in End-to-End Deep Learning for Historical Writer Identification

Hanif Rasyidi, Moshiur Farazi

Main category: cs.CV

TL;DR: This paper investigates factors affecting end-to-end deep learning for historical writer identification, finding that most models struggle to generalize in realistic document-level settings, especially zero-shot scenarios, while identifying one setup that matches the performance of top systems.

Motivation: Historical writer identification remains challenging due to handwriting diversity, document degradation, and limited labeled samples. Traditional methods work well on curated datasets but end-to-end approaches struggle with generalization in realistic settings.

Method: The study explores combinations of pre-processing methods, backbone architectures, and post-processing strategies including text segmentation, patch sampling, and feature aggregation to analyze end-to-end deep learning pipelines.

Result: Most configurations performed poorly due to weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise. However, one end-to-end setup achieved results comparable to the top-performing system despite simpler design.

Conclusion: The findings highlight key challenges in building robust end-to-end systems for historical writer identification and provide insights into design choices that can improve performance in this domain.

Abstract: This paper investigates various factors that influence the performance of end-to-end deep learning approaches for historical writer identification (HWI), a task that remains challenging due to the diversity of handwriting styles, document degradation, and the limited number of labelled samples per writer. These conditions often make accurate recognition difficult, even for human experts. Traditional HWI methods typically rely on handcrafted image processing and clustering techniques, which tend to perform well on small and carefully curated datasets. In contrast, end-to-end pipelines aim to automate the process by learning features directly from document images. However, our experiments show that many of these models struggle to generalise in more realistic, document-level settings, especially under zero-shot scenarios where writers in the test set are not present in the training data. We explore different combinations of pre-processing methods, backbone architectures, and post-processing strategies, including text segmentation, patch sampling, and feature aggregation. The results suggest that most configurations perform poorly due to weak capture of low-level visual features, inconsistent patch representations, and high sensitivity to content noise. Still, we identify one end-to-end setup that achieves results comparable to the top-performing system, despite using a simpler design. These findings point to key challenges in building robust end-to-end systems and offer insight into design choices that improve performance in historical document writer identification.
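
A rough sketch of the pipeline stages analyzed above (patch sampling, feature aggregation, zero-shot retrieval); the toy backbone, patch sizes, and shapes are stand-ins, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def sample_patches(page, patch=32, n=16):
    """Randomly crop n square patches from a (C, H, W) page image."""
    C, H, W = page.shape
    ys = torch.randint(0, H - patch + 1, (n,))
    xs = torch.randint(0, W - patch + 1, (n,))
    return torch.stack([page[:, y:y+patch, x:x+patch] for y, x in zip(ys, xs)])

def page_embedding(page, backbone):
    feats = backbone(sample_patches(page))      # (n, d) patch descriptors
    feats = F.normalize(feats, dim=1)
    return F.normalize(feats.mean(0), dim=0)    # mean-pooled page descriptor

# Zero-shot writer ID: nearest gallery page by cosine similarity.
backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
gallery = torch.randn(5, 3, 256, 256)           # pages from unseen writers
query = torch.randn(3, 256, 256)
g = torch.stack([page_embedding(p, backbone) for p in gallery])
q = page_embedding(query, backbone)
print("predicted writer:", int((g @ q).argmax()))
```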

[199] MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation

Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao

Main category: cs.CV

TL;DR: MoGA (Mixture-of-Groups Attention) is an efficient sparse attention mechanism that uses learnable token routing to enable long video generation by overcoming the quadratic scaling limitations of full attention in Diffusion Transformers.

Motivation: Long video generation with Diffusion Transformers is limited by the quadratic scaling of full attention with sequence length, as attention is highly redundant and dominated by a small subset of query-key pairs.

Method: Proposes Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism whose lightweight, learnable token router precisely matches tokens without blockwise estimation, enabling semantic-aware routing and effective long-range interactions. It integrates with modern attention stacks such as FlashAttention.

Result: Developed an efficient long video generation model that produces minute-level, multi-shot, 480p videos at 24 fps with a context length of approximately 580k. Comprehensive experiments validate effectiveness on various video generation tasks.

Conclusion: MoGA provides an efficient sparse attention solution that enables practical long video generation by overcoming computational bottlenecks while maintaining performance.

Abstract: Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.
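
A hedged toy version of the routing idea: a lightweight linear router assigns each token to a group, and attention runs only within groups, so the cost scales with the sum of squared group sizes rather than the squared sequence length. The hard argmax routing and all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def moga_attention(q, k, v, router_logits, num_groups):
    """Toy Mixture-of-Groups Attention: each token attends only within
    the group its (learnable) router assigns it to."""
    group = router_logits.argmax(-1)               # (N,) hard routing
    out = torch.zeros_like(v)
    for g in range(num_groups):
        idx = (group == g).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        qg, kg, vg = q[idx], k[idx], v[idx]
        attn = F.softmax(qg @ kg.T / q.shape[-1] ** 0.5, dim=-1)
        out[idx] = attn @ vg                       # dense attention, but per group
    return out

N, d, G = 1024, 64, 8
q, k, v = (torch.randn(N, d) for _ in range(3))
router = torch.nn.Linear(d, G)                     # lightweight token router
out = moga_attention(q, k, v, router(q), G)
print(out.shape)   # cost ~ sum of group sizes squared, not N^2
```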

[200] UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

Main category: cs.CV

TL;DR: UniGenBench++ is a comprehensive semantic assessment benchmark for text-to-image generation that addresses limitations in existing benchmarks by providing diverse prompt scenarios, multilingual support, and fine-grained evaluation across multiple dimensions.

Motivation: Existing T2I benchmarks lack diversity in prompt scenarios, multilingual support, and fine-grained sub-dimension assessment, limiting their real-world applicability and comprehensive evaluation capabilities.

Method: Developed UniGenBench++ with 600 prompts organized hierarchically across 5 main themes and 20 subthemes, with English and Chinese versions in short/long forms. Used Gemini-2.5-Pro MLLM for benchmark construction and assessment, and trained a robust evaluation model for offline assessment.

Result: The benchmark enables systematic evaluation of T2I models across 10 primary and 27 sub evaluation criteria, revealing strengths and weaknesses of both open- and closed-source models through comprehensive benchmarking.

Conclusion: UniGenBench++ provides a unified, comprehensive semantic assessment framework for T2I generation that addresses key limitations of existing benchmarks and enables more thorough evaluation of model performance across diverse scenarios and dimensions.

Abstract: Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) spans across diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) comprehensively probes T2I models’ semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-sourced T2I models, we systematically reveal their strengths and weaknesses across various aspects.

[201] Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents

Yiqi Lin, Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Mike Zheng Shou

Main category: cs.CV

TL;DR: VC2L is a unified vision-centric framework that models text, images, and their combinations using a single vision transformer, operating entirely in pixel space without needing OCR or modality fusion.

Motivation: Contrastive vision-language models like CLIP struggle with complex web documents where text and images are interleaved, loosely aligned, or embedded visually, limiting their real-world applicability.

Method: VC2L renders all inputs as images and uses snippet-level contrastive learning to align consecutive multimodal segments, leveraging document coherence without requiring explicitly paired image-text data.

Result: VC2L achieves competitive or superior performance compared to CLIP-style models on both new benchmarks (AnyCIR, SeqCIR, CSR) and established datasets (M-BEIR, MTEB).

Conclusion: Multimodal web data is valuable for contrastive learning, and a unified vision-centric approach is scalable and effective for multimodal representation learning.

Abstract: Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: https://github.com/showlab/VC2L.
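
A minimal sketch of snippet-level contrastive learning as described: consecutive rendered segments of the same document serve as positive pairs under an InfoNCE-style loss, with other in-batch snippets as negatives. The temperature and embedding size are assumed, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(emb, temperature=0.07):
    """Align each rendered snippet with its successor in the same document:
    positives are consecutive segments, all others act as negatives."""
    anchors = F.normalize(emb[:-1], dim=1)   # snippet i
    positives = F.normalize(emb[1:], dim=1)  # snippet i+1
    logits = anchors @ positives.T / temperature
    targets = torch.arange(len(anchors))     # the true successor of each anchor
    return F.cross_entropy(logits, targets)

snippets = torch.randn(16, 512)  # embeddings of 16 consecutive rendered segments
print(snippet_contrastive_loss(snippets))
```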

[202] A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition

Peiqin Zhuang, Lei Bai, Yichao Wu, Ding Liang, Luping Zhou, Yali Wang, Wanli Ouyang

Main category: cs.CV

TL;DR: The paper proposes EMIM, a module that integrates cost volume-style motion modeling into transformers for better action recognition, especially on motion-sensitive datasets.

Motivation: Transformer-based methods dominate action recognition but perform poorly on motion-sensitive datasets due to lack of elaborate motion modeling. Cost volume in traditional methods has powerful motion modeling capacities similar to attention affinity matrices.

Method: Proposed Explicit Motion Information Mining (EMIM) module that constructs affinity matrix in cost volume style by sampling key tokens from query-based neighboring areas in next frame using sliding windows. This matrix handles both contextual aggregation and motion feature extraction.

Result: Validated on four datasets, outperforms state-of-the-art approaches, especially on motion-sensitive Something-Something V1 & V2 datasets.

Conclusion: Integrating cost volume-style motion modeling into transformers through EMIM effectively improves motion modeling capacities for action recognition.

Abstract: Recently, action recognition has been dominated by transformer-based methods, thanks to their spatiotemporal contextual aggregation capacities. However, despite the significant progress achieved on scene-related datasets, they do not perform well on motion-sensitive datasets due to the lack of elaborate motion modeling designs. Meanwhile, we observe that the widely-used cost volume in traditional action recognition is highly similar to the affinity matrix defined in self-attention, but equipped with powerful motion modeling capacities. In light of this, we propose to integrate those effective motion modeling properties into the existing transformer in a unified and neat way, with the proposal of the Explicit Motion Information Mining module (EMIM). In EMIM, we propose to construct the desirable affinity matrix in a cost volume style, where the set of key candidate tokens is sampled from the query-based neighboring area in the next frame in a sliding-window manner. Then, the constructed affinity matrix is used to aggregate contextual information for appearance modeling and is converted into motion features for motion modeling as well. We validate the motion modeling capacities of our method on four widely-used datasets, and our method performs better than existing state-of-the-art approaches, especially on motion-sensitive datasets, i.e., Something-Something V1 & V2.
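
A hedged sketch of a cost-volume-style affinity: each query token in frame t is scored against keys gathered from a sliding window at the same spatial location in frame t+1, yielding one local matching distribution per token. The window radius and feature shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

def emim_affinity(feat_t, feat_t1, radius=2):
    """Cost-volume-style affinity: each query token in frame t is matched
    against keys in a (2r+1)^2 window at its location in frame t+1."""
    C, H, W = feat_t.shape
    k = 2 * radius + 1
    # unfold gathers the local key window around every spatial position
    keys = F.unfold(feat_t1.unsqueeze(0), k, padding=radius)   # (1, C*k*k, H*W)
    keys = keys.view(C, k * k, H * W)
    queries = feat_t.view(C, 1, H * W)
    affinity = (queries * keys).sum(0) / C ** 0.5              # (k*k, H*W)
    return F.softmax(affinity, dim=0)    # motion-aware matching distribution

f_t, f_t1 = torch.randn(64, 14, 14), torch.randn(64, 14, 14)
A = emim_affinity(f_t, f_t1)
print(A.shape)   # (25, 196): one local matching distribution per token
```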

[203] PLANA3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting

Changkun Liu, Bin Tan, Zeran Ke, Shangzhan Zhang, Jiachen Liu, Ming Qian, Nan Xue, Yujun Shen, Tristan Braud

Main category: cs.CV

TL;DR: PLANA3R is a pose-free framework for metric planar 3D reconstruction from unposed two-view images using planar primitives, trained without explicit plane supervision.

Motivation: To address metric 3D reconstruction of indoor scenes by exploiting geometric regularities with compact planar representations, avoiding the need for 3D plane annotations during training.

Method: Uses Vision Transformers to extract sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting with depth and normal maps.

Result: Demonstrates strong generalization to out-of-domain indoor environments across 3D surface reconstruction, depth estimation, relative pose estimation, and accurate plane segmentation tasks.

Conclusion: PLANA3R enables scalable training on large-scale stereo datasets using only depth and normal annotations, providing effective metric planar 3D reconstruction without explicit plane supervision.

Abstract: This paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives - a well-suited representation for man-made environments - we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives, estimate relative camera poses, and supervise geometry learning via planar splatting, where gradients are propagated through high-resolution rendered depth and normal maps of primitives. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, by formulating with planar 3D representation, our method emerges with the ability for accurate plane segmentation. The project page is available at https://lck666666.github.io/plana3r

[204] SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation

Siyong Jian, Huan Wang

Main category: cs.CV

TL;DR: A novel KV cache compression framework for autoregressive image generation that reduces memory usage 5x and speeds up throughput 6.6x with minimal quality loss by leveraging spatial locality and emergent semantic sink phenomena.

Motivation: Autoregressive image generation models like Janus-Pro have high memory and computational demands due to large numbers of visual tokens, while KV cache compression remains largely unexplored in image generation compared to language modeling.

Method: Identifies spatial locality and emergent semantic sink attention phenomena, then compresses KV cache by adaptively decoupling attention heads into two types: spatial-locality heads maintain short recent token windows, while semantic-sink heads preserve compact sets of highly-attended tokens.

Result: Achieves 5x reduction in memory usage and 6.6x speedup in overall throughput with only minimal visual quality loss.

Conclusion: Enables highly efficient native autoregressive image generation on resource-constrained hardware by effectively compressing KV cache while maintaining visual quality.

Abstract: Autoregressive image generation models like Janus-Pro produce high-quality images, but at the significant cost of high memory and ever-growing computational demands due to the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it still remains largely unexplored for the image generation domain. In this work, we begin by identifying a distinct and prominent attention phenomenon, which we term spatial locality and emergent semantic sink. To leverage this key insight, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache for all visual tokens by adaptively decoupling attention heads into two separate types: for spatial-locality heads, our method maintains a short recent token window; for semantic-sink heads, it strategically preserves a compact set of highly-attended tokens. Our extensive experiments demonstrate that the proposed method achieves a 5$\times$ reduction in memory usage and a notable 6.6$\times$ speedup in overall throughput with only minimal visual quality loss, thereby enabling highly efficient native autoregressive image generation on resource-constrained hardware.
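
A minimal sketch of the head-decoupled compression rule: spatial-locality heads keep only a recent token window, while semantic-sink heads keep the most-attended cached tokens. The window and sink sizes and the accumulated-attention statistic are assumptions for illustration.

```python
import torch

def compress_kv(keys, values, head_type, window=64, num_sinks=32, attn_scores=None):
    """Per-head KV cache compression: spatial-locality heads keep a recent
    window; semantic-sink heads keep the most-attended tokens."""
    if head_type == "spatial":
        return keys[-window:], values[-window:]
    # semantic-sink head: keep tokens with the highest accumulated attention
    top = attn_scores.topk(num_sinks).indices.sort().values
    return keys[top], values[top]

T, d = 4096, 128
K, V = torch.randn(T, d), torch.randn(T, d)
scores = torch.rand(T)                      # accumulated attention per cached token
Ks, Vs = compress_kv(K, V, "spatial")
Kk, Vk = compress_kv(K, V, "sink", attn_scores=scores)
print(Ks.shape, Kk.shape)                   # (64, 128) (32, 128)
```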

[205] IF-VidCap: Can Video Caption Models Follow Instructions?

Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu

Main category: cs.CV

TL;DR: IF-VidCap is a new benchmark for evaluating controllable video captioning that assesses instruction-following capabilities, revealing that while proprietary models still lead, open-source models are catching up and dense captioning specialists underperform general-purpose MLLMs on complex instructions.

Motivation: Current video captioning benchmarks focus on descriptive comprehensiveness but overlook instruction-following capabilities, which are crucial for practical applications where users need specific caption formats rather than exhaustive descriptions.

Method: Introduced IF-VidCap benchmark with 1,400 high-quality samples and a systematic framework that evaluates captions on format correctness and content correctness dimensions. Conducted comprehensive evaluation of over 20 prominent models.

Result: Proprietary models still dominate but the performance gap is closing, with top open-source solutions achieving near-parity. Dense captioning specialized models underperform general-purpose MLLMs on complex instructions.

Conclusion: Future work should simultaneously advance both descriptive richness and instruction-following fidelity in video captioning models, as current specialized dense captioning approaches are insufficient for complex instruction-following tasks.

Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.

[206] Moving Light Adaptive Colonoscopy Reconstruction via Illumination-Attenuation-Aware 3D Gaussian Splatting

Hao Wang, Ying Zhou, Haoyu Zhao, Rui Wang, Qiang Hu, Xing Zhang, Qiang Li, Zhiwei Wang

Main category: cs.CV

TL;DR: ColIAGS improves 3D Gaussian Splatting for colonoscopy by addressing illumination variations through improved appearance and geometry modeling, achieving better rendering fidelity and geometric accuracy.

Motivation: Standard 3DGS assumes static illumination, which is incompatible with dynamic lighting in colonoscopy, leading to degraded 3D reconstructions with vaporous Gaussian blobs that violate scene structure.

Method: Proposes ColIAGS with two key improvements: 1) Improved Appearance Modeling with two illumination attenuation factors for photometric adaptation, 2) Improved Geometry Modeling using high-dimensional view embedding for better Gaussian geometry prediction and implicit illumination attenuation.

Result: Outperforms state-of-the-art methods with superior rendering fidelity and significantly reduced Depth MSE, achieving dual capabilities of novel view synthesis and accurate geometric reconstruction.

Conclusion: ColIAGS successfully addresses illumination variations in colonoscopic scenes while preserving geometry accuracy, making it suitable for critical medical applications like virtual colonoscopy and lesion tracking.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a pivotal technique for real-time view synthesis in colonoscopy, enabling critical applications such as virtual colonoscopy and lesion tracking. However, the vanilla 3DGS assumes static illumination and that observed appearance depends solely on viewing angle, which causes incompatibility with the photometric variations in colonoscopic scenes induced by dynamic light source/camera. This mismatch forces most 3DGS methods to introduce structure-violating vaporous Gaussian blobs between the camera and tissues to compensate for illumination attenuation, ultimately degrading the quality of 3D reconstructions. Previous works only consider the illumination attenuation caused by light distance, ignoring the physical characteristics of the light source and camera. In this paper, we propose ColIAGS, an improved 3DGS framework tailored for colonoscopy. To mimic realistic appearance under varying illumination, we introduce an Improved Appearance Modeling with two types of illumination attenuation factors, which enables Gaussians to adapt to photometric variations while preserving geometry accuracy. To ensure the geometry approximation condition of appearance modeling, we propose an Improved Geometry Modeling using high-dimensional view embedding to enhance Gaussian geometry attribute prediction. Furthermore, another cosine embedding input is leveraged to generate illumination attenuation solutions in an implicit manner. Comprehensive experimental results on standard benchmarks demonstrate that our proposed ColIAGS achieves the dual capabilities of novel view synthesis and accurate geometric reconstruction. It notably outperforms other state-of-the-art methods by achieving superior rendering fidelity while significantly reducing Depth MSE. Code will be available.

[207] SEAL: Semantic-Aware Hierarchical Learning for Generalized Category Discovery

Zhenqi He, Yuanpei Liu, Kai Han

Main category: cs.CV

TL;DR: SEAL is a hierarchical learning framework for Generalized Category Discovery that uses semantic hierarchies to improve classification of both known and unknown classes in partially labeled datasets.

Motivation: Existing GCD approaches rely on single-level semantics or manually designed hierarchies, limiting generalizability and scalability.

Method: Proposes SEAL with Hierarchical Semantic-Guided Soft Contrastive Learning to generate informative soft negatives and Cross-Granularity Consistency module to align predictions across granularity levels.

Result: Achieves state-of-the-art performance on fine-grained benchmarks (SSB, Oxford-Pet, Herbarium19) and demonstrates generalization on coarse-grained datasets.

Conclusion: SEAL effectively leverages natural hierarchical structures to overcome limitations of conventional contrastive learning and improves GCD performance across various datasets.

Abstract: This paper investigates the problem of Generalized Category Discovery (GCD). Given a partially labelled dataset, GCD aims to categorize all unlabelled images, regardless of whether they belong to known or unknown classes. Existing approaches typically depend on either single-level semantics or manually designed abstract hierarchies, which limit their generalizability and scalability. To address these limitations, we introduce a SEmantic-aware hierArchical Learning framework (SEAL), guided by naturally occurring and easily accessible hierarchical structures. Within SEAL, we propose a Hierarchical Semantic-Guided Soft Contrastive Learning approach that exploits hierarchical similarity to generate informative soft negatives, addressing the limitations of conventional contrastive losses that treat all negatives equally. Furthermore, a Cross-Granularity Consistency (CGC) module is designed to align the predictions from different levels of granularity. SEAL consistently achieves state-of-the-art performance on fine-grained benchmarks, including the SSB benchmark, Oxford-Pet, and the Herbarium19 dataset, and further demonstrates generalization on coarse-grained datasets. Project page: https://visual-ai.github.io/seal/
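
A hedged sketch of hierarchy-guided soft contrastive learning: instead of treating all negatives equally, negatives receive soft targets proportional to a hierarchical similarity score (e.g., sharing a coarse class). The weighting scheme below is illustrative, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(emb, labels, hier_sim, temperature=0.1):
    """Contrastive loss whose negatives carry soft targets proportional to
    hierarchical similarity, rather than being treated as equally negative."""
    z = F.normalize(emb, dim=1)
    logits = z @ z.T / temperature
    logits.fill_diagonal_(-1e9)                      # drop self-similarity
    same = labels[:, None] == labels[None, :]
    # hard positives where labels match; hierarchy-weighted soft negatives otherwise
    target = torch.where(same, torch.ones_like(hier_sim), hier_sim)
    target.fill_diagonal_(0)
    target = target / target.sum(1, keepdim=True)
    return -(target * F.log_softmax(logits, dim=1)).sum(1).mean()

n = 8
emb = torch.randn(n, 64)
labels = torch.randint(0, 4, (n,))
hier = torch.rand(n, n) * 0.3     # e.g. same-genus pairs would score higher
print(soft_contrastive_loss(emb, labels, hier))
```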

[208] Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model for Microclimate Impact Prediction

Jannis Fleckenstein, David Kreismann, Tamara Rosemary Govindasamy, Thomas Brunschwiler, Etienne Vos, Mattia Rigotti

Main category: cs.CV

TL;DR: Geospatial foundation models fine-tuned on limited data can accurately predict urban heat patterns and evaluate mitigation strategies, especially in data-scarce regions.

Motivation: Urban heat island effects are worsening due to urbanization and climate change, but conventional ML models with limited data produce inaccurate predictions, particularly in underserved areas.

Method: Fine-tuned geospatial foundation models using empirical ground truth of urban heat patterns, quantified cooling effects from green spaces, and simulated inpainting for future climate scenarios.

Result: Foundation models demonstrated strong generalization and accurate predictions of land surface temperatures under future climate scenarios.

Conclusion: Geospatial foundation models offer a powerful approach for evaluating urban heat island mitigation strategies in data-scarce regions to support climate-resilient cities.

Abstract: As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data, yet conventional machine learning models with limited data often produce inaccurate predictions, particularly in underserved areas. Geospatial foundation models trained on global unstructured data offer a promising alternative by demonstrating strong generalization and requiring only minimal fine-tuning. In this study, an empirical ground truth of urban heat patterns is established by quantifying cooling effects from green spaces and benchmarking them against model predictions to evaluate the model’s accuracy. The foundation model is subsequently fine-tuned to predict land surface temperatures under future climate scenarios, and its practical value is demonstrated through a simulated inpainting that highlights its role for mitigation support. The results indicate that foundation models offer a powerful way for evaluating urban heat island mitigation strategies in data-scarce regions to support more climate-resilient cities.

[209] UltraGen: High-Resolution Video Generation with Hierarchical Attention

Teng Hu, Jiangning Zhang, Zihan Su, Ran Yi

Main category: cs.CV

TL;DR: UltraGen enables efficient end-to-end native high-resolution video generation (up to 4K) by addressing computational bottlenecks in diffusion transformers through hierarchical dual-branch attention architecture.

Motivation: Existing diffusion transformer video generation models are limited to low-resolution outputs (≤720P) due to quadratic computational complexity of attention mechanisms, making high-resolution video generation impractical.

Method: Proposes UltraGen with hierarchical dual-branch attention: global attention branch for semantic consistency using spatially compressed modeling, and local attention branch with hierarchical cross-window mechanism for regional content fidelity.

Result: UltraGen successfully scales pre-trained low-resolution models to 1080P and 4K resolution, outperforming state-of-the-art methods and two-stage super-resolution pipelines in both qualitative and quantitative evaluations.

Conclusion: The framework enables practical high-resolution video generation for the first time, overcoming computational limitations through efficient attention decomposition strategies.

Abstract: Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based video generation models are limited to low-resolution outputs (<=720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.
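
A toy rendition of the global-local attention decomposition: a global branch attends over a spatially compressed token set for semantic consistency, while a local branch runs full attention inside fixed windows for regional fidelity. The strided compression and additive fusion are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def attn(q, k, v):
    w = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

def dual_branch_attention(x, window=64, stride=4):
    """Toy global-local decomposition: a global branch over spatially
    compressed tokens plus local attention within fixed windows."""
    N, d = x.shape
    g = x[::stride]                               # compressed global token set
    global_out = attn(x, g, g)                    # semantic-consistency branch
    local_out = torch.zeros_like(x)
    for s in range(0, N, window):                 # high-fidelity local branch
        blk = x[s:s + window]
        local_out[s:s + window] = attn(blk, blk, blk)
    return global_out + local_out                 # fused; the real model learns the mix

x = torch.randn(1024, 64)
print(dual_branch_attention(x).shape)
```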

[210] Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection

Wenping Jin, Yuyang Tang, Li Zhu, Fei Guo

Main category: cs.CV

TL;DR: A hyperspectral anomaly detection framework called “Rebellious Student” that trains spatial and spectral branches to learn complementary features through intentional divergence rather than imitation, enabling universal deployment without per-scene retraining.

Motivation: To improve hyperspectral anomaly detection by integrating spectral and spatial cues through complementary feature learning, building on universal deployment methods that don't require per-scene retraining.

Method: Two-stage learning: (1) train spectral enhancement network via reverse distillation for robust background spectral representations; (2) train spatial network (rebellious student) using decorrelation losses to enforce feature orthogonality while maintaining reconstruction fidelity.

Result: Extensive experiments on HAD100 benchmark show substantial improvements over established baselines with minimal computational overhead, confirming effectiveness and generality.

Conclusion: The proposed complementary learning paradigm effectively enhances both spectral and spatial background features for parameter-free and training-free anomaly detection when paired with conventional detectors.

Abstract: A recent class of hyperspectral anomaly detection methods that can be trained once on background datasets and then universally deployed – without per-scene retraining or parameter tuning – has demonstrated remarkable efficiency and robustness. Building upon this paradigm, we focus on the integration of spectral and spatial cues and introduce a novel “Rebellious Student” framework for complementary feature learning. Unlike conventional teacher-student paradigms driven by imitation, our method intentionally trains the spatial branch to diverge from the spectral teacher, thereby learning complementary spatial patterns that the teacher fails to capture. A two-stage learning strategy is adopted: (1) a spectral enhancement network is first trained via reverse distillation to obtain robust background spectral representations; and (2) a spatial network – the rebellious student – is subsequently optimized using decorrelation losses that enforce feature orthogonality while maintaining reconstruction fidelity to avoid irrelevant noise. Once trained, the framework enhances both spectral and spatial background features, enabling parameter-free and training-free anomaly detection when paired with conventional detectors. Extensive experiments on the HAD100 benchmark show substantial improvements over several established baselines with minimal computational overhead, confirming the effectiveness and generality of the proposed complementary learning paradigm. Our code is publicly available at https://github.com/xjpp2016/FERS.
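
A minimal sketch of the "rebellious" objective under stated assumptions: reconstruction fidelity plus a penalty on the cross-correlation between student and teacher features, pushing the spatial branch toward complementary, near-orthogonal patterns. The paper's exact decorrelation losses may differ.

```python
import torch

def rebellious_loss(student_feat, teacher_feat, recon, target, lam=1.0):
    """Train the spatial student to *diverge* from the spectral teacher:
    reconstruction fidelity plus a cross-feature decorrelation penalty."""
    s = (student_feat - student_feat.mean(0)) / (student_feat.std(0) + 1e-6)
    t = (teacher_feat - teacher_feat.mean(0)) / (teacher_feat.std(0) + 1e-6)
    cross_corr = (s.T @ t) / s.shape[0]           # (d_s, d_t) correlation matrix
    decorrelation = cross_corr.pow(2).mean()      # push toward orthogonality
    fidelity = torch.nn.functional.mse_loss(recon, target)
    return fidelity + lam * decorrelation

B, d = 32, 128
loss = rebellious_loss(torch.randn(B, d), torch.randn(B, d),
                       torch.randn(B, 16), torch.randn(B, 16))
print(loss)
```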

[211] ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

Xiaoxing Hu, Kaicheng Yang, Ziyong Feng, Qi Ming, Zonghao Guo, Xiang An, Ziyong Feng, Junchi Yan, Xue Yang

Main category: cs.CV

TL;DR: ProCLIP is a curriculum learning framework that progressively aligns LLM-based text encoders with CLIP’s image encoder to handle long texts and multilingual inputs while preserving CLIP’s pre-trained vision-language alignment.

Motivation: CLIP's text encoder has limitations: 77-token max length, no multilingual support, and poor fine-grained semantic understanding. Direct replacement with LLMs disrupts CLIP's vision-language alignment.

Method: ProCLIP uses progressive alignment: 1) Knowledge distillation from CLIP text encoder to LLM embedder, 2) Image-text contrastive tuning with self-distillation regularization, 3) Instance semantic and embedding structure alignment losses.

Result: The framework effectively aligns LLM-based embedders with CLIP image encoder while preserving pre-trained knowledge and avoiding disruption of vision-language alignment.

Conclusion: ProCLIP successfully bridges LLMs with CLIP’s vision-language space, enabling long-text and multilingual capabilities while maintaining the original alignment quality.

Abstract: The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP’s text encoder into the LLM-based embedder to leverage CLIP’s rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The Code is available at https://github.com/VisionXLab/ProCLIP
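
A hedged sketch of the stage-1 distillation step: a projection head maps pooled LLM caption embeddings into CLIP's text space, and a cosine loss pulls them toward the frozen CLIP text encoder's outputs for the same captions. The projector shape and pooling are assumptions.

```python
import torch
import torch.nn.functional as F

def distill_loss(llm_emb, clip_emb):
    """Stage-1 distillation: pull the LLM-based embedder's outputs toward the
    frozen CLIP text encoder's embeddings for the same captions."""
    llm_emb = F.normalize(llm_emb, dim=1)
    clip_emb = F.normalize(clip_emb, dim=1)
    return 1 - (llm_emb * clip_emb).sum(1).mean()   # cosine distillation

# A projector maps the LLM embedding width down to CLIP's space (sizes assumed).
projector = torch.nn.Linear(4096, 512)
llm_out = torch.randn(8, 4096)        # pooled LLM caption embeddings
clip_out = torch.randn(8, 512)        # frozen CLIP text embeddings
print(distill_loss(projector(llm_out), clip_out))
```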

[212] A Geometric Approach to Steerable Convolutions

Soumyabrata Kundu, Risi Kondor

Main category: cs.CV

TL;DR: A geometric derivation of steerable CNNs in d-dimensions using pattern matching principles, with intuitive explanations for Clebsch-Gordan decomposition and spherical harmonics, plus improved interpolation kernels for robustness.

Motivation: To provide a more intuitive, geometric approach to steerable CNNs compared to abstract group theoretical methods, making the concepts more accessible and practical.

Method: Geometric derivation based on pattern matching principles, with novel interpolation kernel construction for steerable convolution layers.

Result: Intuitive explanation of Clebsch-Gordan decomposition and spherical harmonic basis functions, plus improved implementation with greater noise robustness.

Conclusion: The geometric approach offers clearer understanding of steerable CNNs and practical improvements through better interpolation kernels that enhance robustness to noisy data.

Abstract: In contrast to the somewhat abstract, group theoretical approach adopted by many papers, our work provides a new and more intuitive derivation of steerable convolutional neural networks in $d$ dimensions. This derivation is based on geometric arguments and fundamental principles of pattern matching. We offer an intuitive explanation for the appearance of the Clebsch–Gordan decomposition and spherical harmonic basis functions. Furthermore, we suggest a novel way to construct steerable convolution layers using interpolation kernels that improve upon existing implementation, and offer greater robustness to noisy data.
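
For reference, the standard steerability constraint that geometric and group-theoretic derivations both arrive at (notation assumed, not quoted from the paper): the kernel must intertwine the input and output rotation representations, and expanding it in spherical harmonics block-diagonalizes the constraint, which is where the Clebsch-Gordan decomposition appears.

```latex
% Steerability constraint on a kernel
% \psi : \mathbb{R}^d \to \mathbb{R}^{c_{\text{out}} \times c_{\text{in}}}
% under rotations R \in \mathrm{SO}(d), with field representations
% \rho_{\text{in}} and \rho_{\text{out}}:
\psi(Rx) \;=\; \rho_{\text{out}}(R)\,\psi(x)\,\rho_{\text{in}}(R)^{-1}
% Expanding \psi in a spherical-harmonic basis splits this into independent
% per-frequency blocks, coupled through Clebsch--Gordan coefficients.
```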

[213] An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection

Neel Patel, Alexander Wong, Ashkan Ebadi

Main category: cs.CV

TL;DR: A teacher-student AI framework improves tuberculosis and COVID-19 detection from chest X-rays using supervised and self-supervised learning, achieving high accuracy and explainable predictions.

Motivation: Tuberculosis detection is critical in resource-limited areas where radiologists are scarce, but AI models require large datasets that are expensive to obtain.

Method: Proposed a teacher-student framework with two supervised heads and a self-supervised head to enhance disease and symptom detection on chest X-rays.

Result: Achieved 98.85% accuracy for COVID-19/tuberculosis/normal classification and 90.09% macro-F1 score for multilabel symptom detection, significantly outperforming baselines.

Conclusion: The model shows promise for clinical deployment with explainable predictions based on relevant anatomical features, suitable for screening and triage settings.

Abstract: Tuberculosis remains a critical global health issue, particularly in resource-limited and remote areas. Early detection is vital for treatment, yet the lack of skilled radiologists underscores the need for artificial intelligence (AI)-driven screening tools. Developing reliable AI models is challenging due to the necessity for large, high-quality datasets, which are costly to obtain. To tackle this, we propose a teacher–student framework which enhances both disease and symptom detection on chest X-rays by integrating two supervised heads and a self-supervised head. Our model achieves an accuracy of 98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection, significantly outperforming baselines. The explainability assessments also show the model bases its predictions on relevant anatomical features, demonstrating promise for deployment in clinical screening and triage settings.

[214] SAM 2++: Tracking Anything at Any Granularity

Jiaming Zhang, Cheng Liang, Yichun Yang, Chenkai Zeng, Yutao Cui, Xinwen Zhang, Xin Zhou, Kai Ma, Gangshan Wu, Limin Wang

Main category: cs.CV

TL;DR: SAM 2++ is a unified video tracking model that handles multiple granularities (masks, boxes, points) through task-specific prompts, unified decoder, and task-adaptive memory mechanism, achieving state-of-the-art performance across diverse tracking tasks.

Motivation: Existing trackers are tailored to single tasks with custom-designed modules, limiting generalization and causing redundancy in model design and parameters. There's a need for a unified approach to handle tracking at different granularities.

Method: 1) Task-specific prompts to encode various inputs into general embeddings; 2) Unified decoder to standardize diverse task outputs; 3) Task-adaptive memory mechanism for cross-granularity memory matching; 4) Custom data engine creating Tracking-Any-Granularity dataset with rich annotations.

Result: SAM 2++ achieves state-of-the-art performance across multiple benchmarks for diverse tracking tasks at different granularities, demonstrating superior generalization and robustness.

Conclusion: SAM 2++ establishes a unified and robust tracking framework that effectively handles tracking at any granularity, overcoming limitations of task-specific trackers and setting new standards in video tracking.

Abstract: Video tracking aims at finding the specific target in subsequent frames given its initial state. Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task and heavily rely on custom-designed modules within the individual task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model towards tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts to encode various task inputs into general prompt embeddings, and a unified decoder to merge diverse task results into a unified pre-output form. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which represents a comprehensive resource for training and benchmarking on unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.

[215] Unifying and Enhancing Graph Transformers via a Hierarchical Mask Framework

Yujie Xing, Xiao Wang, Bin Wu, Hai Huang, Chuan Shi

Main category: cs.CV

TL;DR: Proposes M3Dphormer, a unified hierarchical mask framework for Graph Transformers that reveals equivalence between architecture and attention masks, enabling flexible modeling of diverse node interactions through theoretically grounded mask design and adaptive integration.

Motivation: Existing Graph Transformers rely on intricate architectural designs tailored to specific interactions, limiting flexibility. The paper aims to create a unified framework that can model diverse node interactions more flexibly through attention mask construction.

Method: Introduces a unified hierarchical mask framework and M3Dphormer model with: 1) Three theoretically grounded hierarchical masks, 2) Mixture-of-Experts-based architecture with bi-level expert routing, 3) Dual attention computation that switches between dense and sparse modes based on local mask sparsity.

Result: Extensive experiments across multiple benchmarks demonstrate that M3Dphormer achieves state-of-the-art performance, validating the effectiveness of the unified framework and model design.

Conclusion: The proposed unified hierarchical mask framework successfully addresses limitations of existing Graph Transformers by revealing the equivalence between architecture and attention masks, enabling flexible modeling of diverse interactions through theoretically principled mask design and adaptive integration mechanisms.

Abstract: Graph Transformers (GTs) have emerged as a powerful paradigm for graph representation learning due to their ability to model diverse node interactions. However, existing GTs often rely on intricate architectural designs tailored to specific interactions, limiting their flexibility. To address this, we propose a unified hierarchical mask framework that reveals an underlying equivalence between model architecture and attention mask construction. This framework enables a consistent modeling paradigm by capturing diverse interactions through carefully designed attention masks. Theoretical analysis under this framework demonstrates that the probability of correct classification positively correlates with the receptive field size and label consistency, leading to a fundamental design principle: an effective attention mask should ensure both a sufficiently large receptive field and a high level of label consistency. While no single existing mask satisfies this principle across all scenarios, our analysis reveals that hierarchical masks offer complementary strengths, motivating their effective integration. Then, we introduce M3Dphormer, a Mixture-of-Experts-based Graph Transformer with Multi-Level Masking and Dual Attention Computation. M3Dphormer incorporates three theoretically grounded hierarchical masks and employs a bi-level expert routing mechanism to adaptively integrate multi-level interaction information. To ensure scalability, we further introduce a dual attention computation scheme that dynamically switches between dense and sparse modes based on local mask sparsity. Extensive experiments across multiple benchmarks demonstrate that M3Dphormer achieves state-of-the-art performance, validating the effectiveness of our unified framework and model design.
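
A toy version of the dual attention computation: the same masked attention is evaluated densely when the hierarchical mask is mostly on, and via per-row gathering when it is sparse. The sparsity threshold and the example sparse mask are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask, sparse_threshold=0.1):
    """Dual-mode masked attention: dense softmax when the mask is mostly on,
    gather-based sparse computation when it is mostly off."""
    density = mask.float().mean()
    if density > sparse_threshold:                      # dense mode
        scores = q @ k.T / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v
    out = torch.zeros_like(v)                           # sparse mode
    for i in range(q.shape[0]):
        idx = mask[i].nonzero(as_tuple=True)[0]
        s = (q[i] @ k[idx].T) / q.shape[-1] ** 0.5
        out[i] = F.softmax(s, dim=-1) @ v[idx]
    return out

N, d = 256, 32
q, k, v = (torch.randn(N, d) for _ in range(3))
hop_mask = torch.rand(N, N) < 0.02                      # stand-in for a sparse hop mask
hop_mask.fill_diagonal_(True)                           # every node attends to itself
print(masked_attention(q, k, v, hop_mask).shape)
```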

[216] FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning

Yubin Zheng, Pak-Hei Yeung, Jing Xia, Tianjie Ju, Peng Tang, Weidong Qiu, Jagath C. Rajapakse

Main category: cs.CV

TL;DR: FedDEAP is an adaptive federated prompt tuning framework that enhances CLIP’s generalization in multi-domain FL by disentangling semantic and domain features, using dual-prompt design, and aligning visual-textual representations.

Motivation: Domain shift and label heterogeneity in federated learning hinder global model generalization, while CLIP's strong zero-shot capabilities need effective fine-tuning across domains in federated settings.

Method: Three key components: (1) Disentangle semantic and domain features using transformation networks, (2) Dual-prompt design with global semantic and local domain prompts, (3) Align textual and visual representations to preserve semantic and domain consistency.

Result: Theoretical analysis and experiments on four datasets demonstrate effectiveness in enhancing CLIP’s generalization for federated image recognition across multiple domains.

Conclusion: FedDEAP successfully addresses domain shift and heterogeneity challenges in federated learning by adaptively tuning CLIP prompts while preserving both shared semantic and personalized domain knowledge.

Abstract: Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP’s generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.

[217] DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution

Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Shihao Wang, Tianhe Wu, Qiaosi Yi, Shuai Li, Lei Zhang

Main category: cs.CV

TL;DR: DP$^2$O-SR is a framework that optimizes real-world image super-resolution models using perceptual preferences without human annotations, leveraging hybrid quality assessment rewards and hierarchical preference optimization.

Motivation: To exploit the perceptual quality range introduced by T2I model randomness and improve Real-ISR performance without costly human annotations.

Method: Uses hybrid reward combining full-reference and no-reference IQA models, constructs multiple preference pairs from same-model outputs, and implements hierarchical preference optimization with adaptive weighting based on reward gaps and diversity.

Result: Significantly improves perceptual quality across diffusion- and flow-based T2I backbones and generalizes well to real-world benchmarks.

Conclusion: DP$^2$O-SR effectively aligns generative models with perceptual preferences through automated optimization, demonstrating superior performance in real-world image super-resolution.

Abstract: Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs with varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also introduces a wider perceptual quality range, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP$^2$O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets. This reward encourages both structural fidelity and natural appearance. To better utilize perceptual diversity, we move beyond the standard best-vs-worst selection and construct multiple preference pairs from outputs of the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision. Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP$^2$O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.
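
A hedged sketch of the reward and pair construction: a hybrid reward blends full-reference and no-reference IQA scores, then multiple preference pairs are built from one model's samples and weighted by their reward gap. The blend weight, top-ratio selection, and random scores are illustrative stand-ins for the paper's hierarchical scheme.

```python
import torch
from itertools import combinations

def hybrid_reward(fr_score, nr_score, w=0.5):
    """Hybrid perceptual reward: blend a full-reference IQA score (structural
    fidelity) with a no-reference IQA score (naturalness)."""
    return w * fr_score + (1 - w) * nr_score

def preference_pairs(rewards, top_ratio=0.5):
    """Build multiple preference pairs from one model's sampled outputs,
    weighting each pair by its intra-group reward gap."""
    order = torch.argsort(rewards, descending=True)
    k = max(1, int(len(rewards) * top_ratio))
    winners = set(order[:k].tolist())
    pairs = []
    for w_idx, l_idx in combinations(order.tolist(), 2):
        gap = (rewards[w_idx] - rewards[l_idx]).item()
        if gap > 0 and w_idx in winners:
            pairs.append((w_idx, l_idx, gap))   # (winner, loser, weight)
    return pairs

rewards = hybrid_reward(torch.rand(6), torch.rand(6))  # 6 samples, same prompt
for win, lose, gap in preference_pairs(rewards):
    print(f"prefer sample {win} over {lose} (weight {gap:.2f})")
```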

[218] DSI-Bench: A Benchmark for Dynamic Spatial Intelligence

Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao

Main category: cs.CV

TL;DR: DSI-Bench is a benchmark for evaluating dynamic spatial reasoning in vision-language models, featuring 1,000 videos and 1,700 questions covering nine motion patterns to test models’ understanding of simultaneous observer and object motion.

Motivation: Current vision-language models excel in 2D tasks and static scenarios but have limited ability to understand dynamic 3D scenarios where both observers and objects move simultaneously, which is essential for real-world spatial reasoning.

Method: Created DSI-Bench with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns, using spatially and temporally symmetric designs to reduce biases and systematically evaluate reasoning about self-motion and object motion.

Result: Evaluation of 14 VLMs and expert models revealed key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios.

Conclusion: DSI-Bench provides valuable insights for developing general and expertise models with dynamic spatial intelligence, highlighting the need for improved understanding of simultaneous motion in 3D environments.

Abstract: Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models’ reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.

[219] When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

Xianzheng Ma, Brandon Smart, Yash Bhalgat, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu

Main category: cs.CV

TL;DR: This survey paper comprehensively reviews methodologies for integrating large language models (LLMs) with 3D spatial data (3D-LLMs), covering various 3D representations, applications in scene understanding, and spatial reasoning tasks.

DetailsMotivation: To provide a comprehensive overview of the rapid progress in integrating LLMs with 3D spatial data, highlighting the unique advantages of LLMs for advancing spatial comprehension in embodied AI systems.

Method: The paper conducts a systematic survey examining various 3D data representations (point clouds, NeRFs) and their integration with LLMs for tasks including 3D scene understanding, captioning, question-answering, dialogue, and spatial reasoning/planning.

Result: The meta-analysis reveals significant progress in 3D-LLM integration but identifies the need for novel approaches to fully harness their potential in understanding and interacting with complex 3D environments.

Conclusion: The survey aims to chart a course for future research to explore and expand 3D-LLM capabilities in understanding and interacting with the complex 3D world, supported by an organized project page of related papers.

Abstract: As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D.

[220] Learning Collaborative Knowledge with Multimodal Representation for Polyp Re-Identification

Suncheng Xiang, Jiale Guan, Shilun Cai, Jiacheng Ruan, Dahong Qian

Main category: cs.CV

TL;DR: A multimodal collaborative learning framework for polyp re-identification that combines visual and textual information to improve retrieval performance in colonoscopic images.

DetailsMotivation: Traditional object re-identification methods using CNN models trained on ImageNet perform poorly on colonoscopic datasets due to domain gaps and unimodal limitations, failing to leverage complementary multimodal information.

Method: Proposes DMCL, a Deep Multimodal Collaborative Learning framework with a dynamic multimodal feature fusion strategy that leverages visual-text representations through end-to-end training.

Result: Experiments show the multimodal approach outperforms state-of-the-art unimodal ReID models, especially when combined with collaborative multimodal fusion strategy.

Conclusion: The proposed DMCL framework effectively enables multimodal knowledge collaboration and enhances generalization capability for polyp re-identification in medical scenarios.

Abstract: Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras, which plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, traditional methods for object ReID directly adopting CNN models trained on the ImageNet dataset usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Worse still, these solutions typically learn unimodal representations from visual samples alone, failing to exploit complementary information from other modalities. To address this challenge, we propose a novel Deep Multimodal Collaborative Learning framework named DMCL for polyp re-identification, which can effectively encourage multimodal knowledge collaboration and reinforce generalization capability in medical scenarios. Building on this framework, a dynamic multimodal feature fusion strategy is introduced to leverage the optimized visual-text representations for multimodal fusion via end-to-end training. Experiments on the standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the collaborative multimodal fusion strategy. The code is publicly available at https://github.com/JeremyXSC/DMCL.
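
The summary does not pin down the fusion operator; a common realization of dynamic visual-text fusion is a learned gate, sketched here as an assumption rather than the paper's exact module:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse visual and text embeddings with a learned, input-dependent gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual, text):          # both (B, dim)
        g = self.gate(torch.cat([visual, text], dim=-1))
        return g * visual + (1 - g) * text    # dynamic per-dimension mix
```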

[221] H3D-DGS: Exploring Heterogeneous 3D Motion Representation for Deformable 3D Gaussian Splatting

Bing He, Yunuo Chen, Guo Lu, Qi Wang, Qunshan Gu, Rong Xie, Li Song, Wenjun Zhang

Main category: cs.CV

TL;DR: Proposes H3D control points using hybrid optical flow back-projection and gradient-based optimization for dynamic scene reconstruction, achieving faster convergence and better performance than existing methods.

DetailsMotivation: Existing deformable 3D Gaussian splatting methods struggle with convergence on complex real-world motions due to high degrees of freedom in gradient-based optimization of all motion information.

Method: Introduces heterogeneous 3D control points (H3D) that combine optical flow back-projection for directly observable motion components and gradient-based optimization for geometrically occluded portions, decoupling observable and unobservable motion.

Result: Achieves superior performance on Neu3DV and CMU-Panoptic datasets, converges within 100 iterations, and processes each frame in 2 seconds on a single RTX 4070 GPU.

Conclusion: The hybrid approach effectively addresses convergence challenges in dynamic scene reconstruction while maintaining compact motion representation.

Abstract: Dynamic scene reconstruction poses a persistent challenge in 3D vision. Deformable 3D Gaussian Splatting has emerged as an effective method for this task, offering real-time rendering and high visual fidelity. This approach decomposes a dynamic scene into a static representation in a canonical space and time-varying scene motion. Scene motion is defined as the collective movement of all Gaussian points, and for compactness, existing approaches commonly adopt implicit neural fields or sparse control points. However, these methods predominantly rely on gradient-based optimization for all motion information. Due to the high degrees of freedom, they struggle to converge on real-world datasets exhibiting complex motion. To preserve the compactness of motion representation and address convergence challenges, this paper proposes heterogeneous 3D control points, termed \textbf{H3D control points}, whose attributes are obtained using a hybrid strategy combining optical flow back-projection and gradient-based methods. This design decouples directly observable motion components from those that are geometrically occluded. Specifically, components of 3D motion that project onto the image plane are directly acquired via optical flow back-projection, while unobservable portions are refined through gradient-based optimization. Experiments on the Neu3DV and CMU-Panoptic datasets demonstrate that our method achieves superior performance over state-of-the-art deformable 3D Gaussian splatting techniques. Remarkably, our method converges within just 100 iterations and achieves a per-frame processing time of 2 seconds on a single NVIDIA RTX 4070 GPU.
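
A minimal sketch of the directly observable component: lifting 2D optical flow to 3D motion with a pinhole camera model, assuming per-point depth is available in both frames (the paper's exact formulation may differ):

```python
import numpy as np

def backproject_flow(uv, flow, depth0, depth1, K):
    """uv: (N, 2) pixel coords, flow: (N, 2) optical flow,
    depth0/depth1: (N,) depths in both frames, K: 3x3 intrinsics."""
    K_inv = np.linalg.inv(K)

    def lift(pix, z):  # pixels + depth -> 3D camera-space points
        homog = np.hstack([pix, np.ones((len(pix), 1))])
        return (homog @ K_inv.T) * z[:, None]

    return lift(uv + flow, depth1) - lift(uv, depth0)  # observable 3D motion
```

The residual, geometrically occluded component is then left to gradient-based optimization, as the abstract describes.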

[222] Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model

Yiming Shi, Xun Zhu, Kaiwen Wang, Ying Hu, Chenyi Guo, Miao Li, Ji Wu

Main category: cs.CV

TL;DR: Med-2E3 is a 3D medical multimodal large language model that integrates both 3D and 2D features through a dual encoder architecture with text-guided inter-slice scoring, achieving state-of-the-art performance on 3D medical image analysis.

DetailsMotivation: Traditional task-specific models lack generalizability across diverse clinical scenarios, and existing MLLMs fail to fully leverage the hierarchical information in 3D medical images. The design is inspired by radiologists’ practice of examining both 3D spatial structure and 2D planar content.

Method: Proposes Med-2E3 with dual 3D-2D encoder architecture and Text-Guided Inter-Slice (TG-IS) scoring module that scores attention of each 2D slice based on slice contents and task instructions.

Result: TG-IS exhibits task-specific attention distribution and significantly outperforms current state-of-the-art models on large-scale 3D medical multimodal datasets.

Conclusion: Med-2E3 is the first MLLM to integrate both 3D and 2D features for 3D medical image analysis, demonstrating superior performance through its dual encoder approach and text-guided scoring mechanism.

Abstract: 3D medical image analysis is essential for modern healthcare, yet traditional task-specific models are inadequate due to limited generalizability across diverse clinical scenarios. Multimodal large language models (MLLMs) offer a promising solution to these challenges. However, existing MLLMs have limitations in fully leveraging the rich, hierarchical information embedded in 3D medical images. Inspired by clinical practice, where radiologists focus on both 3D spatial structure and 2D planar content, we propose Med-2E3, a 3D medical MLLM that integrates a dual 3D-2D encoder architecture. To aggregate 2D features effectively, we design a Text-Guided Inter-Slice (TG-IS) scoring module, which scores the attention of each 2D slice based on slice contents and task instructions. To the best of our knowledge, Med-2E3 is the first MLLM to integrate both 3D and 2D features for 3D medical image analysis. Experiments on large-scale, open-source 3D medical multimodal datasets demonstrate that TG-IS exhibits task-specific attention distribution and significantly outperforms current state-of-the-art models. The code is available at: https://github.com/MSIIP/Med-2E3
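
A minimal sketch of text-guided slice scoring in the spirit of TG-IS, not the paper's exact module; it weights per-slice features by their similarity to the task-instruction embedding:

```python
import torch
import torch.nn.functional as F

def tg_is_weights(slice_feats, text_feat):
    """slice_feats: (S, D) per-slice embeddings; text_feat: (D,)
    task-instruction embedding. Returns a softmax weight per slice."""
    logits = slice_feats @ text_feat / text_feat.norm()
    return F.softmax(logits, dim=0)

def aggregate_slices(slice_feats, text_feat):
    """Weighted sum so task-relevant slices dominate the 2D feature
    that is fused with the 3D encoder's output."""
    w = tg_is_weights(slice_feats, text_feat)           # (S,)
    return (w.unsqueeze(-1) * slice_feats).sum(dim=0)   # (D,)
```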

[223] Foundation Cures Personalization: Improving Personalized Models’ Prompt Consistency via Hidden Foundation Knowledge

Yiyang Cai, Zhengkai Jiang, Yulong Liu, Chunyang Jiang, Wei Xue, Yike Guo, Wenhan Luo

Main category: cs.CV

TL;DR: FreeCure is a training-free framework that improves prompt consistency in facial personalization models by leveraging foundation models’ knowledge, addressing the trade-off between identity fidelity and prompt alignment.

DetailsMotivation: Current facial personalization models using identity embeddings compromise prompt consistency and attribute-level controllability, while foundation models demonstrate precise facial attribute control capabilities that could be leveraged to solve this problem.

Method: Uses a dual inference paradigm (with and without identity embedding) to identify attributes needing enhancement, then employs a foundation-aware self-attention module with an inversion-based process to bring well-aligned attribute information into the personalization process.

Result: Significantly improves prompt consistency across various facial personalization models (Stable Diffusion and FLUX-based) while maintaining original identity fidelity, effectively enhancing wide array of facial attributes like hair and accessories.

Conclusion: FreeCure successfully addresses the prompt consistency vs identity fidelity trade-off in facial personalization by leveraging foundation models’ knowledge, providing a training-free solution that integrates seamlessly with existing models.

Abstract: Facial personalization faces the challenge of maintaining identity fidelity without disrupting the foundation model’s prompt consistency. The mainstream personalization models employ identity embedding to integrate identity information within the attention mechanisms. However, our preliminary findings reveal that identity embeddings compromise the effectiveness of other tokens in the prompt, thereby limiting high prompt consistency and attribute-level controllability. Moreover, by deactivating identity embedding, personalization models still demonstrate the underlying foundation models’ ability to control facial attributes precisely. This suggests that the foundation models’ knowledge can be leveraged to cure the ill-aligned prompt consistency of personalization models. Building upon these insights, we propose FreeCure, a framework that improves the prompt consistency of personalization models with their latent foundation models’ knowledge. First, by setting a dual inference paradigm with/without identity embedding, we identify attributes (e.g., hair, accessories, etc.) for enhancement. Second, we introduce a novel foundation-aware self-attention module, coupled with an inversion-based process to bring well-aligned attribute information to the personalization process. Our approach is training-free, can effectively enhance a wide array of facial attributes, and can be seamlessly integrated into existing popular personalization models based on both Stable Diffusion and FLUX. FreeCure has consistently shown significant improvements in prompt consistency across these facial personalization models while maintaining the integrity of their original identity fidelity.
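
The dual inference step can be sketched as follows; `pipe` and `clip_score` are hypothetical helpers standing in for the personalization pipeline and a prompt-alignment scorer:

```python
import torch

@torch.no_grad()
def attributes_to_enhance(pipe, prompt, attr_prompts, clip_score):
    """Generate with and without the identity embedding, then flag the
    attributes whose prompt alignment drops once identity is active."""
    img_id = pipe(prompt, use_id=True)      # personalized output
    img_base = pipe(prompt, use_id=False)   # foundation-model output
    return [a for a in attr_prompts
            if clip_score(img_id, a) < clip_score(img_base, a)]
```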

[224] Implicit Neural Compression of Point Clouds

Hongning Ruan, Yulin Shao, Qianqian Yang, Liang Zhao, Zhaoyang Zhang, Dusit Niyato

Main category: cs.CV

TL;DR: NeRC³ is a novel point cloud compression framework using implicit neural representations (INRs) to efficiently compress both geometry and attributes of static and dynamic point clouds, outperforming existing standards and methods.

DetailsMotivation: Efficient compression of unstructured, high-precision point cloud data remains challenging despite the importance of point clouds in 3D representation applications.

Method: Uses two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy for geometry, another maps occupied voxels to attributes. Extended to dynamic point clouds with 4D spatio-temporal representation (4D-NeRC³) to reduce temporal redundancy.

Result: For static point clouds: outperforms octree-based G-PCC standard and existing INR-based methods. For dynamic point clouds: achieves superior geometry compression vs latest G-PCC/V-PCC standards, matches state-of-the-art learning-based methods, and shows competitive joint geometry-attribute compression.

Conclusion: NeRC³ provides an effective INR-based framework for both static and dynamic point cloud compression, demonstrating superior performance over existing standards and competitive results with learning-based approaches.

Abstract: Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC$^3$, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both geometry and attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by inputting voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC$^3$. Experimental results validate the effectiveness of our approach: For static point clouds, NeRC$^3$ outperforms octree-based G-PCC standard and existing INR-based methods. For dynamic point clouds, 4D-NeRC$^3$ achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards, while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.
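
The two coordinate-based networks are simple to sketch; the depths and widths below are assumptions, and in the actual codec it is the quantized network parameters (plus auxiliary information) that form the compressed bitstream:

```python
import torch
import torch.nn as nn

def coord_mlp(out_dim, hidden=128):
    return nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

occupancy_net = nn.Sequential(coord_mlp(1), nn.Sigmoid())  # geometry: xyz -> occupancy
attribute_net = coord_mlp(3)                               # attributes: xyz -> e.g. RGB

xyz = torch.rand(1024, 3)                        # normalized voxel coordinates
occ = occupancy_net(xyz)                         # (1024, 1) occupancy probabilities
rgb = attribute_net(xyz[occ.squeeze(-1) > 0.5])  # attributes of occupied voxels only
```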

[225] View Transformation Robustness for Multi-View 3D Object Reconstruction with Reconstruction Error-Guided View Selection

Qi Zhang, Zhouhang Luo, Tao Yu, Hui Huang

Main category: cs.CV

TL;DR: This paper proposes a method to improve view transformation robustness (VTR) in multi-view 3D object reconstruction by using Stable Diffusion models to generate novel views guided by reconstruction error analysis.

DetailsMotivation: Existing multi-view 3D reconstruction methods lack robustness to view transformations, and directly using large vision models at inference is computationally expensive and doesn't guarantee robustness.

Method: The authors use Stable Diffusion models to generate novel views, but instead of random views, they propose a reconstruction error-guided view selection method that chooses views covering reconstruction errors based on spatial distribution analysis.

Result: Extensive experiments show the proposed method outperforms state-of-the-art 3D reconstruction methods and other view transformation robustness comparison methods.

Conclusion: The method effectively improves view transformation robustness in 3D reconstruction by leveraging Stable Diffusion models for training data augmentation without adding inference computation burden.

Abstract: View transformation robustness (VTR) is critical for deep-learning-based multi-view 3D object reconstruction models, indicating the methods’ stability under inputs with various view transformations. However, existing research has seldom focused on view transformation robustness in multi-view 3D object reconstruction. One direct way to improve the models’ VTR is to produce data with more view transformations and add them to model training. Recent progress on large vision models, particularly Stable Diffusion models, has provided great potential for generating 3D models or synthesizing novel view images with only a single image input. Directly deploying these models at inference consumes heavy computation resources and their robustness to view transformations is not guaranteed either. To fully utilize the power of Stable Diffusion models without extra inference computation burdens, we propose to generate novel views with Stable Diffusion models for better view transformation robustness. Instead of synthesizing random views, we propose a reconstruction error-guided view selection method, which considers the spatial distribution of the 3D predictions’ reconstruction errors and chooses the views that could cover the reconstruction errors as much as possible. The methods are trained and tested on sets with large view transformations to validate the 3D reconstruction models’ robustness to view transformations. Extensive experiments demonstrate that the proposed method can outperform state-of-the-art 3D reconstruction methods and other view transformation robustness comparison methods. Code is available at: https://github.com/zqyq/VTR.
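
A greedy sketch of error-guided view selection, under the simplifying assumption that a view covers an error point when the point's outward direction (relative to the object center) aligns with the camera direction:

```python
import numpy as np

def select_views(error_points, candidate_dirs, n_views):
    """error_points: (P, 3) high-error locations, centered on the object;
    candidate_dirs: (V, 3) unit camera directions. Greedily picks views
    that cover the most not-yet-covered error points."""
    dirs = error_points / np.linalg.norm(error_points, axis=1, keepdims=True)
    visible = dirs @ candidate_dirs.T > 0.5   # (P, V) coverage matrix
    chosen, covered = [], np.zeros(len(error_points), dtype=bool)
    for _ in range(n_views):
        gain = (visible & ~covered[:, None]).sum(axis=0)  # new points per view
        v = int(gain.argmax())
        chosen.append(v)
        covered |= visible[:, v]
    return chosen
```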

[226] Deep Learning in Palmprint Recognition-A Comprehensive Survey

Chengrui Gao, Ziyuan Yang, Wei Jia, Lu Leng, Bob Zhang, Andrew Beng Jin Teoh

Main category: cs.CV

TL;DR: This paper provides a comprehensive review of deep learning approaches in palmprint recognition, covering ROI segmentation, feature extraction, and security/privacy challenges, while identifying current limitations and future research directions.

DetailsMotivation: Traditional handcrafted palmprint recognition methods have limited representation capability due to heavy reliance on prior knowledge, and existing surveys focus narrowly on specific tasks using traditional approaches, creating a gap in comprehensive DL-based palmprint recognition research.

Method: The paper systematically reviews and examines recent advancements in deep learning-powered palmprint recognition across key tasks including region-of-interest segmentation, feature extraction, and security/privacy-oriented challenges.

Result: The review consolidates state-of-the-art progress in DL-based palmprint recognition, highlighting advancements across multiple facets of the technology and identifying current challenges in the field.

Conclusion: This comprehensive review serves as a valuable resource for researchers to stay updated with cutting-edge technologies and drive innovation in palmprint recognition by bridging the gap in DL-based approaches research.

Abstract: Palmprint recognition has emerged as a prominent biometric technology, widely applied in diverse scenarios. Traditional handcrafted methods for palmprint recognition often fall short in representation capability, as they heavily depend on researchers’ prior knowledge. Deep learning (DL) has been introduced to address this limitation, leveraging its remarkable successes across various domains. While existing surveys focus narrowly on specific tasks within palmprint recognition-often grounded in traditional methodologies-there remains a significant gap in comprehensive research exploring DL-based approaches across all facets of palmprint recognition. This paper bridges that gap by thoroughly reviewing recent advancements in DL-powered palmprint recognition. The paper systematically examines progress across key tasks, including region-of-interest segmentation, feature extraction, and security/privacy-oriented challenges. Beyond highlighting these advancements, the paper identifies current challenges and uncovers promising opportunities for future research. By consolidating state-of-the-art progress, this review serves as a valuable resource for researchers, enabling them to stay abreast of cutting-edge technologies and drive innovation in palmprint recognition.

[227] WMamba: Wavelet-based Mamba for Face Forgery Detection

Siran Peng, Tianshuo Zhang, Li Gao, Xiangyu Zhu, Haoyuan Zhang, Kai Pang, Zhen Lei

Main category: cs.CV

TL;DR: WMamba is a novel wavelet-based face forgery detection method that uses Mamba architecture and Dynamic Contour Convolution to better capture slender facial contours and long-range spatial relationships for improved detection performance.

DetailsMotivation: Current wavelet-based face forgery detection methods fail to fully exploit wavelet data properties, leading to sub-optimal feature extraction and limited performance gains despite wavelets' ability to capture subtle forgery artifacts.

Method: WMamba uses Mamba architecture with two key innovations: Dynamic Contour Convolution (DCConv) with deformable kernels to model slender facial contours, and Mamba’s ability to capture long-range spatial relationships with linear complexity from small image patches.

Result: Extensive experiments show that WMamba achieves state-of-the-art (SOTA) performance in face forgery detection.

Conclusion: WMamba effectively addresses the limitations of current wavelet-based approaches by better utilizing wavelet information through specialized contour modeling and efficient long-range relationship capture.

Abstract: The rapid evolution of deepfake generation technologies necessitates the development of robust face forgery detection algorithms. Recent studies have demonstrated that wavelet analysis can enhance the generalization abilities of forgery detectors. Wavelets effectively capture key facial contours, often slender, fine-grained, and globally distributed, that may conceal subtle forgery artifacts imperceptible in the spatial domain. However, current wavelet-based approaches fail to fully exploit the distinctive properties of wavelet data, resulting in sub-optimal feature extraction and limited performance gains. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear complexity. This efficiency allows for the extraction of fine-grained, globally distributed forgery artifacts from small image patches. Extensive experiments show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness in face forgery detection.
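
The DCConv idea can be sketched with torchvision's deformable convolution: offsets predicted from the input let the kernel taps bend along slender contours (the kernel size and offset predictor here are assumptions, not the paper's exact layer):

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ContourConv(nn.Module):
    """Deformable convolution whose sampling offsets are predicted from
    the input, letting the receptive field follow slender facial contours."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.k = k
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)

    def forward(self, x):                      # x: (B, C, H, W)
        offsets = self.offset(x)               # (dy, dx) per kernel tap
        return deform_conv2d(x, offsets, self.weight, padding=self.k // 2)
```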

[228] ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text

Haifeng Ni, Ming Xu

Main category: cs.CV

TL;DR: ITVTON is an efficient virtual try-on framework using a single Diffusion Transformer (DiT) block that concatenates garment and person images with text descriptions to achieve high-fidelity results with reduced computational cost.

DetailsMotivation: Existing virtual try-on methods use duplicated backbones or additional image encoders, which increases computational overhead and network complexity. There is a need for more efficient approaches that maintain image quality.

Method: Proposes ITVTON framework using a single Diffusion Transformer (DiT) generator. Concatenates garment and person images along width dimension and incorporates textual descriptions from both. Restricts training to attention parameters within a single DiT block to reduce computational cost.

Result: ITVTON surpasses baseline methods both qualitatively and quantitatively, setting a new standard for virtual try-on. Experiments on 10,257 image pairs from IGPair confirm robustness in real-world scenarios.

Conclusion: ITVTON provides an efficient and effective solution for virtual try-on that maintains high image fidelity while significantly reducing computational requirements compared to existing methods.

Abstract: Virtual try-on, which aims to seamlessly fit garments onto person images, has recently seen significant progress with diffusion-based models. However, existing methods commonly resort to duplicated backbones or additional image encoders to extract garment features, which increases computational overhead and network complexity. In this paper, we propose ITVTON, an efficient framework that leverages the Diffusion Transformer (DiT) as its single generator to improve image fidelity. By concatenating garment and person images along the width dimension and incorporating textual descriptions from both, ITVTON effectively captures garment-person interactions while preserving realism. To further reduce computational cost, we restrict training to the attention parameters within a single Diffusion Transformer (Single-DiT) block. Extensive experiments demonstrate that ITVTON surpasses baseline methods both qualitatively and quantitatively, setting a new standard for virtual try-on. Moreover, experiments on 10,257 image pairs from IGPair confirm its robustness in real-world scenarios.
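
Both the input layout and the restricted set of trainable parameters are easy to sketch; matching attention parameters by the substring "attn" is an assumed naming convention, not the paper's code:

```python
import torch

def make_input(person, garment):
    """Concatenate person and garment along width: (B, C, H, W) each
    -> (B, C, H, 2W), so a single DiT sees both on one canvas."""
    return torch.cat([person, garment], dim=-1)

def freeze_all_but_attention(dit_block):
    """Restrict training to attention parameters of a single DiT block."""
    for name, p in dit_block.named_parameters():
        p.requires_grad = "attn" in name  # assumed naming convention
```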

[229] RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning

Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, Xinggang Wang

Main category: cs.CV

TL;DR: RAD is a 3DGS-based closed-loop Reinforcement Learning framework for autonomous driving that outperforms Imitation Learning methods with 3x lower collision rates.

DetailsMotivation: Address limitations of existing Imitation Learning approaches for autonomous driving, including causal confusion and open-loop gaps, by creating a more robust closed-loop system.

Method: Uses 3D Gaussian Splatting (3DGS) to build photorealistic digital replicas of real environments, enabling extensive exploration through RL with safety-focused rewards and IL regularization.

Result: Achieves stronger performance than IL-based methods in most closed-loop metrics, particularly demonstrating a 3x reduction in collision rates.

Conclusion: The proposed RAD framework successfully combines 3DGS with RL to create a safer and more effective autonomous driving system that can handle out-of-distribution scenarios through large-scale simulation.

Abstract: Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and an open-loop gap. In this work, we propose RAD, a 3DGS-based closed-loop Reinforcement Learning (RL) framework for end-to-end Autonomous Driving. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards to guide the policy in effectively responding to safety-critical events and understanding real-world causal relationships. To better align with human driving behavior, we incorporate IL into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, particularly exhibiting a 3x lower collision rate. Abundant closed-loop results are presented in the supplementary material. Code is available at https://github.com/hustvl/RAD for facilitating future research.
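
The combination of an RL objective with IL regularization can be sketched as a single loss; this is a generic policy-gradient form with a behavior-cloning term, not RAD's published objective:

```python
import torch

def rad_style_loss(policy_logp, advantages, expert_logp, beta=0.1):
    """policy_logp: log-probs of taken actions under the current policy;
    advantages: advantages from safety-focused rewards; expert_logp:
    log-probs the policy assigns to human demonstrations."""
    rl_term = -(advantages.detach() * policy_logp).mean()  # policy gradient
    il_term = -expert_logp.mean()   # imitation regularizer toward human driving
    return rl_term + beta * il_term
```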

[230] Foundations of a Developmental Design Paradigm for Integrated Continual Learning, Deliberative Behavior, and Comprehensibility

Zeki Doruk Erden, Boi Faltings

Main category: cs.CV

TL;DR: The paper introduces a biologically-inspired learning system with three components: Modeller for continual learning, planner for goal-directed action, and behavior encapsulation for hierarchical decomposition, addressing limitations of current ML systems.

DetailsMotivation: To overcome inherent limitations in contemporary machine learning systems, particularly in continual learning, information reuse, comprehensibility, and integration with deliberate behavior.

Method: A system design with three core components: Modeller (gradient-free learning mechanism), planner (goal-directed action), and behavior encapsulation (hierarchical decomposition of complex behaviors), conceptually grounded in evolutionary developmental biology principles.

Result: Proof-of-principle operation demonstrated in a simple test environment and extended to higher-dimensional network-structured spaces using MNIST for a shape detection task. The framework shows promise in overcoming multiple ML limitations.

Conclusion: The proposed framework shows promise in overcoming multiple major limitations of contemporary machine learning systems simultaneously and in an organic manner.

Abstract: Inherent limitations of contemporary machine learning systems in crucial areas – importantly in continual learning, information reuse, comprehensibility, and integration with deliberate behavior – are receiving increasing attention. To address these challenges, we introduce a system design, fueled by a novel learning approach conceptually grounded in principles of evolutionary developmental biology, that overcomes key limitations of current methods. Our design comprises three core components: The Modeller, a gradient-free learning mechanism inherently capable of continual learning and structural adaptation; a planner for goal-directed action over learned models; and a behavior encapsulation mechanism that can decompose complex behaviors into a hierarchical structure. We demonstrate proof-of-principle operation in a simple test environment. Additionally, we extend our modeling framework to higher-dimensional network-structured spaces, using MNIST for a shape detection task. Our framework shows promise in overcoming multiple major limitations of contemporary machine learning systems simultaneously and in an organic manner.

[231] H3DE-Net: Efficient and Accurate 3D Landmark Detection in Medical Imaging

Zhen Huang, Tao Tang, Ronghao Xu, Yangbo Wei, Wenkai Yang, Suhua Wang, Xiaoxin Sun, Han Li, Qingsong Yao

Main category: cs.CV

TL;DR: H3DE-Net is a novel 3D landmark detection framework that combines CNNs with lightweight attention mechanism to efficiently capture both local features and global dependencies in medical images.

DetailsMotivation: Existing deep learning methods struggle to simultaneously capture fine-grained local features and model global spatial relationships in 3D medical images while maintaining computational efficiency, especially given the sparse distribution of landmarks in high-dimensional volumes.

Method: Proposes H3DE-Net, which integrates CNNs for local feature extraction with a lightweight attention mechanism using a hierarchical routing strategy to reduce computational cost while maintaining global context modeling, plus multi-scale feature fusion for enhanced accuracy.

Result: Experimental results on public CT dataset show H3DE-Net achieves state-of-the-art performance, significantly improving accuracy and robustness, particularly in scenarios with missing landmarks or complex anatomical variations.

Conclusion: H3DE-Net is the first 3D landmark detection model to integrate lightweight attention mechanism with CNNs, providing an efficient and precise solution for medical image analysis tasks.

Abstract: 3D landmark detection is a critical task in medical image analysis, and accurately detecting anatomical landmarks is essential for subsequent medical imaging tasks. However, mainstream deep learning methods in this field struggle to simultaneously capture fine-grained local features and model global spatial relationships, while maintaining a balance between accuracy and computational efficiency. Local feature extraction requires capturing fine-grained anatomical details, while global modeling requires understanding the spatial relationships within complex anatomical structures. The high-dimensional nature of 3D volumes further exacerbates these challenges, as landmarks are sparsely distributed, leading to significant computational costs. Therefore, achieving efficient and precise 3D landmark detection remains a pressing challenge in medical image analysis. In this work, we propose a \textbf{H}ybrid \textbf{3}D \textbf{DE}tection \textbf{Net} (H3DE-Net), a novel framework that combines CNNs for local feature extraction with a lightweight attention mechanism designed to efficiently capture global dependencies in 3D volumetric data. This mechanism employs a hierarchical routing strategy to reduce computational cost while maintaining global context modeling. To our knowledge, H3DE-Net is the first 3D landmark detection model that integrates such a lightweight attention mechanism with CNNs. Additionally, integrating multi-scale feature fusion further enhances detection accuracy and robustness. Experimental results on a public CT dataset demonstrate that H3DE-Net achieves state-of-the-art (SOTA) performance, significantly improving accuracy and robustness, particularly in scenarios with missing landmarks or complex anatomical variations. We have open-sourced our project, including code, data, and model weights.

[232] Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Learnable Linear Extrapolation

Jiawei Zhang, Ziyuan Liu, Leon Yan, Gen Li, Yuantao Gu

Main category: cs.CV

TL;DR: Proposes Learnable Linear Extrapolation (LLE), a lightweight method that universally enhances diffusion-based inverse algorithms by optimizing combination coefficients to refine predictions using previous estimates, reducing computational costs while maintaining performance.

DetailsMotivation: Diffusion-based inverse algorithms have high computational costs due to numerous denoising steps. Fast diffusion ODE solvers for sampling don't work well for inverse problems due to heterogeneous formulations, approximations, and heuristics that introduce errors.

Method: Analyzed ODE solvers for inverse problems, identified linear combination structure of approximations. Proposed canonical form to unify diffusion-based inverse algorithms. Developed LLE that optimizes combination coefficients using previous estimates to refine current predictions.

Result: Extensive experiments show consistent improvements across multiple algorithms and tasks. LLE enhances performance of diffusion-based inverse algorithms with limited steps, enabling more efficient solutions.

Conclusion: LLE is a universal enhancement method that boosts performance of diffusion-based inverse algorithms while reducing computational costs, making them more practical for real-world applications.

Abstract: Diffusion-based inverse algorithms have shown remarkable performance across various inverse problems, yet their reliance on numerous denoising steps incurs high computational costs. While recent developments of fast diffusion ODE solvers offer effective acceleration for diffusion sampling without observations, their application in inverse problems remains limited due to the heterogeneous formulations of inverse algorithms and their prevalent use of approximations and heuristics, which often introduce significant errors that undermine the reliability of analytical solvers. In this work, we begin with an analysis of ODE solvers for inverse problems that reveals a linear combination structure of approximations for the inverse trajectory. Building on this insight, we propose a canonical form that unifies a broad class of diffusion-based inverse algorithms and facilitates the design of more generalizable solvers. Inspired by the linear subspace search strategy, we propose Learnable Linear Extrapolation (LLE), a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm conforming to our canonical form. LLE optimizes the combination coefficients to refine current predictions using previous estimates, alleviating the sensitivity of analytical solvers for inverse algorithms. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLE_inverse_problem.
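
The core of LLE is small enough to sketch: a learned set of coefficients combines the past K estimates into a refined one (the parameterization and initialization below are assumptions):

```python
import torch
import torch.nn as nn

class LinearExtrapolation(nn.Module):
    """Refine the current estimate as a learned linear combination of the
    past K estimates along the inverse trajectory."""
    def __init__(self, k):
        super().__init__()
        w = torch.zeros(k)
        w[-1] = 1.0                      # start as 'use the latest estimate'
        self.w = nn.Parameter(w)

    def forward(self, history):          # list of K same-shape tensors, oldest first
        stacked = torch.stack(history)   # (K, ...)
        w = self.w.view(-1, *([1] * (stacked.dim() - 1)))
        return (w * stacked).sum(dim=0)
```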

[233] scSplit: Bringing Severity Cognizance to Image Decomposition in Fluorescence Microscopy

Ashesh Ashesh, Florian Jug

Main category: cs.CV

TL;DR: scSplit is a novel method for computational multiplexing in fluorescence microscopy that addresses the unknown mixing ratio problem in image decomposition by incorporating a regressor network to predict degradation levels and a degradation-specific normalization module.

DetailsMotivation: Existing image decomposition methods are trained on fixed intensity ratios of superimposed inputs, making them unable to handle the range of relative intensities that occur in fluorescence microscopy, where the mixing ratio is a priori unknown.

Method: Based on InDI (an iterative method for image restoration), scSplit introduces a regressor network to predict the degradation level (mixing ratio) of input images and a degradation-specific normalization module to enable degradation-aware inference across all mixing ratios.

Result: The method successfully solves image splitting and bleedthrough removal tasks in fluorescence microscopy and was empirically validated on 5 public datasets.

Conclusion: scSplit provides an effective solution for computational multiplexing in fluorescence microscopy by being cognizant of unknown mixing ratios, with source code and pre-trained models publicly available.

Abstract: Fluorescence microscopy, while being a key driver for progress in the life sciences, is also subject to technical limitations. To overcome them, computational multiplexing techniques have recently been proposed, which allow multiple cellular structures to be captured in a single image and later be unmixed. Existing image decomposition methods are trained on a set of superimposed input images and the respective unmixed target images. It is critical to note that the relative strength (mixing ratio) of the superimposed images for a given input is a priori unknown. However, existing methods are trained on a fixed intensity ratio of superimposed inputs, making them not cognizant of the range of relative intensities that can occur in fluorescence microscopy. In this work, we propose a novel method called scSplit that is cognizant of the severity of the above-mentioned mixing ratio. Our idea is based on InDI, a popular iterative method for image restoration, and an ideal starting point to embrace the unknown mixing ratio in any given input. We introduce (i) a suitably trained regressor network that predicts the degradation level (mixing ratio) of a given input image and (ii) a degradation-specific normalization module, enabling degradation-aware inference across all mixing ratios. We show that this method solves two relevant tasks in fluorescence microscopy, namely image splitting and bleedthrough removal, and empirically demonstrate the applicability of scSplit on 5 public datasets. The source code with pre-trained models is hosted at https://github.com/juglab/scSplit/.
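
The degradation-specific normalization can be sketched as FiLM-style conditioning on the predicted mixing ratio; the exact form of the paper's module is an assumption here:

```python
import torch
import torch.nn as nn

class DegradationNorm(nn.Module):
    """Scale and shift feature maps conditioned on the predicted mixing
    ratio t in [0, 1] (FiLM-style conditioning)."""
    def __init__(self, channels):
        super().__init__()
        self.to_affine = nn.Linear(1, 2 * channels)

    def forward(self, feats, t):              # feats: (B, C, H, W); t: (B,)
        gamma, beta = self.to_affine(t.view(-1, 1)).chunk(2, dim=-1)
        return feats * gamma[..., None, None] + beta[..., None, None]
```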

[234] REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng

Main category: cs.CV

TL;DR: End-to-end training of latent diffusion models with VAE tokenizer is enabled through representation-alignment (REPA) loss, achieving 17-45x training speedup and state-of-the-art performance on ImageNet.

DetailsMotivation: Traditional wisdom suggests end-to-end training is preferable, but standard diffusion loss fails for joint VAE-diffusion training, causing performance degradation. The paper aims to enable effective end-to-end training.

Method: Proposes REPA-E training recipe using representation-alignment loss instead of standard diffusion loss to jointly train VAE and diffusion model end-to-end.

Result: Achieves 17x and 45x training speedup over REPA and vanilla methods respectively. Sets new SOTA with FID 1.12 and 1.69 on ImageNet 256x256. Also improves VAE latent space structure and generation quality.

Conclusion: End-to-end training of latent diffusion models with VAE is possible and beneficial using REPA loss, leading to significant performance improvements and training efficiency gains.

Abstract: In this paper we tackle a fundamental question: “Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?” Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training of both the VAE and the diffusion model using the standard diffusion loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss, allowing both the VAE and the diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself, leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art, achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
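
A minimal sketch of a representation-alignment loss in the REPA style: project intermediate diffusion features and align them with features from a frozen pretrained encoder. The patchwise negative-cosine objective follows the REPA line of work; the details here are assumptions:

```python
import torch.nn.functional as F

def repa_loss(diff_feats, target_feats, proj):
    """diff_feats: (B, N, D) intermediate diffusion-transformer features;
    target_feats: (B, N, D2) patch features from a frozen pretrained
    encoder; proj: learned projection mapping D -> D2."""
    pred = F.normalize(proj(diff_feats), dim=-1)
    target = F.normalize(target_feats, dim=-1)
    return -(pred * target).sum(dim=-1).mean()  # negative cosine similarity
```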

[235] Mask Image Watermarking

Runyi Hu, Jie Zhang, Shiqian Zhao, Nils Lukas, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang

Main category: cs.CV

TL;DR: MaskWM is an efficient image watermarking framework with two variants: MaskWM-D for global/local watermarking and localization, and MaskWM-ED for enhanced local watermark robustness. It uses masking mechanisms for flexible watermark extraction and achieves state-of-the-art performance with high efficiency.

DetailsMotivation: To address the need for flexible image watermarking that supports both global and local operations, including watermark localization and robust local embedding, while maintaining computational efficiency and adaptability to varying robustness requirements.

Method: MaskWM-D uses the encoder-distortion layer-decoder paradigm with masking during decoding to enable global/local extraction and localization. MaskWM-ED extends this by incorporating masks during encoding to guide local watermark embedding. Both use various mask types during training to learn localization and regional extraction.

Result: Achieves state-of-the-art performance in global/local watermark extraction, localization, and multi-watermark embedding. Outperforms all baselines including WAM, preserves high visual quality, requires only 20 hours of training (15x more efficient than WAM), and can be quickly fine-tuned by adjusting the distortion layer.

Conclusion: MaskWM provides a simple, efficient, and flexible framework for image watermarking that supports diverse applications from global protection to fine-grained local protection, with superior performance and computational efficiency compared to existing methods.

Abstract: We present MaskWM, a simple, efficient, and flexible framework for image watermarking. MaskWM has two variants: (1) MaskWM-D, which supports global watermark embedding, watermark localization, and local watermark extraction for applications such as tamper detection; (2) MaskWM-ED, which focuses on local watermark embedding and extraction, offering enhanced robustness in small regions to support fine-grained image protection. MaskWM-D builds on the classical encoder-distortion layer-decoder training paradigm. In MaskWM-D, we introduce a simple masking mechanism during the decoding stage that enables both global and local watermark extraction. During training, the decoder is guided by various types of masks applied to watermarked images before extraction, helping it learn to localize watermarks and extract them from the corresponding local areas. MaskWM-ED extends this design by incorporating the mask into the encoding stage as well, guiding the encoder to embed the watermark in designated local regions, which improves robustness under regional attacks. Extensive experiments show that MaskWM achieves state-of-the-art performance in global and local watermark extraction, watermark localization, and multi-watermark embedding. It outperforms all existing baselines, including the recent leading model WAM for local watermarking, while preserving the high visual quality of the watermarked images. In addition, MaskWM is highly efficient and adaptable. It requires only 20 hours of training on a single A6000 GPU, achieving 15x the computational efficiency of WAM. By simply adjusting the distortion layer, MaskWM can be quickly fine-tuned to meet varying robustness requirements.
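
The decoding-stage masking mechanism reduces to applying a region mask before extraction; `decoder` below is a stand-in for the trained extraction network:

```python
import torch

def decode_with_mask(decoder, watermarked, mask):
    """Zero out everything outside the target region before extraction,
    so the decoder learns localization and local extraction.
    watermarked: (B, C, H, W); mask: (B, 1, H, W) in {0, 1}."""
    return decoder(watermarked * mask)

# Training mixes mask types: all-ones for global extraction,
# random local regions for localized extraction.
full_mask = torch.ones(4, 1, 256, 256)
local_mask = torch.zeros(4, 1, 256, 256)
local_mask[:, :, 64:128, 64:128] = 1.0
```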

[236] VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture

Long Li, Jiajia Li, Dong Chen, Lina Pu, Haibo Yao, Yanbo Huang

Main category: cs.CV

TL;DR: VLLFL is a vision-language model-based lightweight federated learning framework that improves agricultural object detection while preserving privacy and reducing communication overhead.

DetailsMotivation: To address privacy concerns and data collection challenges in agricultural object detection, where sensitive data is distributed across farms and large-scale data collection is problematic.

Method: Combines vision-language model’s generalization capabilities with federated learning’s privacy preservation. Uses a compact prompt generator to boost VLM performance across different farms while reducing communication overhead.

Result: Achieves a 14.53% improvement in VLM performance while reducing communication overhead by 99.3%. Successfully applied to various agricultural tasks including fruit identification and harmful animal detection.

Conclusion: VLLFL provides an efficient, scalable, and privacy-preserving solution for agricultural object detection applications.

Abstract: In modern smart agriculture, object detection plays a crucial role by enabling automation, precision farming, and monitoring of resources. From identifying crop health and pest infestations to optimizing harvesting processes, accurate object detection enhances both productivity and sustainability. However, training object detection models often requires large-scale data collection and raises privacy concerns, particularly when sensitive agricultural data is distributed across farms. To address these challenges, we propose VLLFL, a vision-language model-based lightweight federated learning framework. It harnesses the generalization and context-aware detection capabilities of the vision-language model (VLM) and leverages the privacy-preserving nature of federated learning. By training a compact prompt generator to boost the performance of the VLM deployed across different farms, VLLFL preserves privacy while reducing communication overhead. Experimental results demonstrate that VLLFL achieves a 14.53% improvement in VLM performance while reducing communication overhead by 99.3%. Spanning tasks from identifying a wide variety of fruits to detecting harmful animals in agriculture, the proposed framework offers an efficient, scalable, and privacy-preserving solution specifically tailored to agricultural applications.
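
The communication step can be sketched as plain federated averaging over the compact prompt-generator weights only; the paper's exact aggregation rule is an assumption:

```python
import torch

def federated_average(client_states):
    """Average the prompt-generator state dicts uploaded by each farm;
    only these compact weights are communicated, not the full VLM."""
    return {k: torch.stack([s[k].float() for s in client_states]).mean(dim=0)
            for k in client_states[0]}
```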

[237] Monitoring morphometric drift in lifelong learning segmentation of the spinal cord

Enamundram Naga Karthik, Sandrine Bédard, Jan Valošek, Christoph S. Aigner, Elise Bannier, Josef Bednařík, Virginie Callot, Anna Combes, Armin Curt, Gergely David, Falk Eippert, Lynn Farner, Michael G Fehlings, Patrick Freund, Tobias Granberg, Cristina Granziera, RHSCIR Network Imaging Group, Ulrike Horn, Tomáš Horák, Suzanne Humphreys, Markus Hupp, Anne Kerbrat, Nawal Kinany, Shannon Kolind, Petr Kudlička, Anna Lebret, Lisa Eunyoung Lee, Caterina Mainero, Allan R. Martin, Megan McGrath, Govind Nair, Kristin P. O’Grady, Jiwon Oh, Russell Ouellette, Nikolai Pfender, Dario Pfyffer, Pierre-François Pradat, Alexandre Prat, Emanuele Pravatà, Daniel S. Reich, Ilaria Ricchi, Naama Rotem-Kohavi, Simon Schading-Sassenhausen, Maryam Seif, Andrew Smith, Seth A Smith, Grace Sweeney, Roger Tam, Anthony Traboulsee, Constantina Andrada Treaba, Charidimos Tsagkas, Zachary Vavasour, Dimitri Van De Ville, Kenneth Arnold Weber II, Sarath Chandar, Julien Cohen-Adad

Main category: cs.CV

TL;DR: This paper introduces a lifelong learning framework to monitor morphometric drift in spinal cord segmentation models as they are updated, ensuring stable predictions for deriving normative values from healthy participants.

DetailsMotivation: To assess the stability of spinal cord segmentation model predictions as models are updated with new datasets, which is crucial for deriving reliable normative values from healthy participants in neurological diseases.

Method: Developed a spinal cord segmentation model trained on multisite data with 9 MRI contrasts and pathologies, plus an automatic lifelong learning framework using GitHub Actions to monitor morphometric drift when models are updated.

Result: The model outperforms previous versions on challenging lumbar cases (Dice score 0.95±0.03), the monitoring framework provides quick feedback, and morphometric measures show minimal drift between model versions with nearly constant scaling factors.

Conclusion: The proposed lifelong learning framework successfully monitors morphometric drift in spinal cord segmentation models, enabling stable updates to normative databases while maintaining high segmentation performance.

Abstract: Morphometric measures derived from spinal cord segmentations can serve as diagnostic and prognostic biomarkers in neurological diseases and injuries affecting the spinal cord. While automatic segmentation methods robust to a wide variety of contrasts and pathologies have been developed over the past few years, whether their predictions are stable as the model is updated using new datasets has not been assessed. This is particularly important for deriving normative values from healthy participants. In this study, we present a spinal cord segmentation model trained on a multisite $(n=75)$ dataset, including 9 different MRI contrasts and several spinal cord pathologies. We also introduce a lifelong learning framework to automatically monitor the morphometric drift as the model is updated using additional datasets. The framework is triggered by an automatic GitHub Actions workflow every time a new model is created, recording the morphometric values derived from the model’s predictions over time. As a real-world application of the proposed framework, we employed the spinal cord segmentation model to update a recently-introduced normative database of healthy participants containing commonly used measures of spinal cord morphometry. Results showed that: (i) our model outperforms previous versions and pathology-specific models on challenging lumbar spinal cord cases, achieving an average Dice score of $0.95 \pm 0.03$; (ii) the automatic workflow for monitoring morphometric drift provides a quick feedback loop for developing future segmentation models; and (iii) the scaling factor required to update the database of morphometric measures is nearly constant among slices across the given vertebral levels, showing minimal drift between the current and previous versions of the model monitored by the framework. The code and model are open-source and accessible via Spinal Cord Toolbox v7.0.
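
The drift check triggered on each new model release can be sketched as a per-level comparison of morphometric values between model versions (a simplified stand-in for the CI workflow, not the toolbox's actual code):

```python
import numpy as np

def morphometric_drift(prev_vals, new_vals):
    """Compare per-level morphometric measures between the previous and
    the newly trained model; a near-constant scaling factor across levels
    indicates minimal drift."""
    prev, new = np.asarray(prev_vals), np.asarray(new_vals)
    return (new - prev) / prev, new / prev   # relative change, scaling factor
```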

[238] VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank

Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, Kede Ma

Main category: cs.CV

TL;DR: VisualQuality-R1 is a reasoning-induced no-reference image quality assessment model trained with reinforcement learning to rank, outperforming existing methods and generating human-aligned quality descriptions.

DetailsMotivation: The potential of reasoning-induced computation has not been thoroughly explored in image quality assessment, which critically depends on visual reasoning.

Method: Use reinforcement learning to rank with group relative policy optimization to generate multiple quality scores, compute comparative probabilities under Thurstone model, and define rewards using continuous fidelity measures.

Result: Consistently outperforms discriminative deep learning-based NR-IQA models and recent reasoning-induced quality regression methods, capable of generating contextually rich quality descriptions.

Conclusion: VisualQuality-R1 is well-suited for reliably measuring progress in image processing tasks like super-resolution and image generation, supporting multi-dataset training without perceptual scale realignment.

Abstract: DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computation has not been thoroughly explored in the context of image quality assessment (IQA), a task depending critically on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.
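
The Thurstone-model comparison and the continuous fidelity reward can be sketched directly from the abstract's description; the variance handling below is one standard choice, not necessarily the paper's:

```python
import torch

def thurstone_prob(scores_a, scores_b):
    """P(A preferred over B) from groups of sampled quality scores,
    under a Gaussian (Thurstone) comparison model."""
    mu = scores_a.mean() - scores_b.mean()
    sigma = torch.sqrt(scores_a.var() + scores_b.var() + 1e-8)
    return torch.distributions.Normal(0.0, 1.0).cdf(mu / sigma)

def fidelity_reward(p_pred, p_true):
    """Continuous fidelity between predicted and ground-truth comparison
    probabilities, used instead of a discretized binary label."""
    return torch.sqrt(p_pred * p_true) + torch.sqrt((1 - p_pred) * (1 - p_true))
```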

[239] Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, Kaiyang Zhou

Main category: cs.CV

TL;DR: Visionary-R1 trains visual language models to perform reasoning on images using reinforcement learning and visual Q&A pairs without chain-of-thought supervision, achieving state-of-the-art performance by using a caption-reason-answer format to prevent shortcut learning.

DetailsMotivation: To develop general-purpose reasoning capabilities in visual language models using reinforcement learning without explicit chain-of-thought supervision, addressing the problem of shortcut learning that occurs when models bypass proper reasoning.

Method: Uses reinforcement learning on 273K CoT-free visual question-answer pairs with a caption-reason-answer output format: first generate detailed image caption, then construct extensive reasoning chain, then provide answer.

Result: Outperforms strong multimodal models including GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro on multiple visual reasoning benchmarks.

Conclusion: The caption-reason-answer format effectively mitigates shortcut learning in VLMs trained with reinforcement learning, enabling robust visual reasoning without explicit CoT supervision.

Abstract: Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in large language models (LLMs), such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM – by prompting the model to produce a reasoning chain before providing an answer – can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to encourage the model to interpret images prior to reasoning. Therefore, we train the model to adhere to a caption-reason-answer output format: initially generating a detailed caption for an image, followed by constructing an extensive reasoning chain. When trained on 273K CoT-free visual question-answer pairs and using only reinforcement learning, our model, named Visionary-R1, outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro, on multiple visual reasoning benchmarks.
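
For concreteness, here is a minimal sketch of what a caption-reason-answer output contract could look like in practice; the tag names and prompt wording are assumptions, since the summary above does not specify them.

```python
# Hedged sketch: a possible caption-reason-answer format, parser, and format
# reward. Tag names and prompt wording are illustrative assumptions.
import re

FORMAT_PROMPT = (
    "First describe the image in detail inside <caption>...</caption>, "
    "then reason step by step inside <think>...</think>, "
    "and finally give the answer inside <answer>...</answer>."
)

def parse_response(text: str) -> dict:
    """Split a model response into caption, reasoning, and answer spans."""
    parts = {}
    for tag in ("caption", "think", "answer"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        parts[tag] = m.group(1).strip() if m else ""
    return parts

def format_reward(text: str) -> float:
    """Reward an RL loop could use to enforce the three-part structure."""
    return float(all(parse_response(text).values()))
```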

[240] gen2seg: Generative Models Enable Generalizable Instance Segmentation

Om Khangaonkar, Hamed Pirsiavash

Main category: cs.CV

TL;DR: The paper shows that generative models like Stable Diffusion and MAE can be repurposed for zero-shot instance segmentation by fine-tuning with an instance coloring loss on limited object types, achieving strong generalization to unseen categories.

DetailsMotivation: To explore whether generative models' inherent understanding of object boundaries and scene compositions from pretraining can be leveraged for general-purpose perceptual organization tasks like instance segmentation.

Method: Fine-tuned Stable Diffusion and MAE (encoder+decoder) using an instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars), then tested zero-shot generalization.

Result: Models showed strong zero-shot generalization, accurately segmenting unseen object types and styles, closely approaching heavily supervised SAM on unseen categories and outperforming it on fine structures and ambiguous boundaries.

Conclusion: Generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining, making them effective for zero-shot perceptual organization tasks.

Abstract: By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE’s ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.
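
The abstract names an instance coloring loss without detail; the sketch below shows one common way such an objective can be set up, pulling each pixel toward its instance's mean predicted color and pushing different instance means apart. The exact formulation in gen2seg may differ.

```python
# Hedged sketch of an instance-coloring-style objective (a contrastive
# pull/push over predicted per-pixel colors); not the paper's exact loss.
import torch

def instance_coloring_loss(pred, masks, margin=1.0):
    """pred: (C, H, W) predicted colors; masks: (K, H, W) boolean instance masks."""
    means, pull = [], pred.new_zeros(())
    for m in masks:                          # pull pixels toward instance mean
        colors = pred[:, m]                  # (C, N) colors of this instance
        mu = colors.mean(dim=1)
        means.append(mu)
        pull = pull + ((colors - mu[:, None]) ** 2).mean()
    means = torch.stack(means)               # (K, C)
    K = means.shape[0]
    if K > 1:                                # push different instances apart
        dists = torch.cdist(means, means)
        off_diag = dists[~torch.eye(K, dtype=torch.bool)]
        push = torch.clamp(margin - off_diag, min=0).pow(2).mean()
    else:
        push = pred.new_zeros(())
    return pull / K + push
```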

[241] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, Yuan Qi

Main category: cs.CV

TL;DR: layoutRL is an end-to-end reinforcement learning framework for document parsing that achieves state-of-the-art performance by optimizing layout-aware rewards using a new 55K dataset.

DetailsMotivation: Traditional multi-stage document parsing pipelines suffer from error propagation and limited adaptability to diverse layouts, creating a critical bottleneck in Document AI.

Method: Uses reinforcement learning with composite rewards (normalized edit distance, paragraph count accuracy, reading order preservation) and a vision-language-model-based parser trained on Infinity-Doc-55K dataset containing synthetic and real documents.

Result: Achieves new state-of-the-art performance on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, outperforming specialist pipelines and general vision-language models.

Conclusion: The layoutRL framework enables robust document understanding with superior accuracy and structural fidelity, with code and dataset to be publicly released.

Abstract: Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
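
As a rough sketch of how the composite reward named above might be assembled, the snippet below combines an edit-distance similarity, a paragraph-count term, and a reading-order term. The weights, and the use of difflib's ratio as a stand-in for normalized edit distance, are assumptions.

```python
# Hedged sketch of a layout-aware composite reward (weights are illustrative).
from difflib import SequenceMatcher

def edit_similarity(pred: str, ref: str) -> float:
    # roughly 1 - normalized edit distance, approximated with difflib
    return SequenceMatcher(None, pred, ref).ratio()

def paragraph_count_reward(pred_paras: list, ref_paras: list) -> float:
    return 1.0 - abs(len(pred_paras) - len(ref_paras)) / max(len(ref_paras), 1)

def reading_order_reward(pred_order: list, ref_order: list) -> float:
    # fraction of adjacent reference pairs whose order the prediction preserves
    pos = {p: i for i, p in enumerate(pred_order)}
    pairs = [(a, b) for a, b in zip(ref_order, ref_order[1:])
             if a in pos and b in pos]
    if not pairs:
        return 0.0
    return sum(pos[a] < pos[b] for a, b in pairs) / len(pairs)

def composite_reward(pred: dict, ref: dict, w=(0.5, 0.2, 0.3)) -> float:
    return (w[0] * edit_similarity(pred["text"], ref["text"])
            + w[1] * paragraph_count_reward(pred["paras"], ref["paras"])
            + w[2] * reading_order_reward(pred["order"], ref["order"]))
```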

[242] COLORA: Efficient Fine-Tuning for Convolutional Models with a Study Case on Optical Coherence Tomography Image Classification

Mariano Rivera, Angello Hoyos

Main category: cs.CV

TL;DR: CoLoRA extends LoRA to CNNs by decomposing kernel updates into depthwise and pointwise components, reducing trainable parameters to 0.2x those of conventional fine-tuning while maintaining performance and enabling efficient fine-tuning for medical image classification.

DetailsMotivation: To develop a parameter-efficient fine-tuning method specifically for convolutional neural networks (CNNs) that reduces computational overhead while maintaining or improving performance compared to full fine-tuning.

Method: Extends LoRA to convolutional layers by decomposing kernel updates into lightweight depthwise and pointwise components, allowing updates to be merged into pretrained weights after each epoch without changing inference complexity.

Result: On OCTMNISTv2, CoLoRA applied to VGG16 and ResNet50 achieves up to 1% accuracy and 0.013 AUC improvements over strong baselines (Vision Transformers, state-space, and Kolmogorov Arnold models) while reducing per-epoch training time by nearly 20%.

Conclusion: CoLoRA provides a stable and effective alternative to full fine-tuning for medical image classification, offering significant parameter efficiency and computational savings while maintaining or improving performance.

Abstract: We introduce CoLoRA (Convolutional Low-Rank Adaptation), a parameter-efficient fine-tuning method for convolutional neural networks (CNNs). CoLoRA extends LoRA to convolutional layers by decomposing kernel updates into lightweight depthwise and pointwise components. This design reduces the number of trainable parameters to 0.2x that of conventional fine-tuning, preserves the original model size, and allows merging updates into the pretrained weights after each epoch, keeping inference complexity unchanged. On OCTMNISTv2, CoLoRA applied to VGG16 and ResNet50 achieves up to 1 percent accuracy and 0.013 AUC improvements over strong baselines (Vision Transformers, state-space, and Kolmogorov Arnold models) while reducing per-epoch training time by nearly 20 percent. Results indicate that CoLoRA provides a stable and effective alternative to full fine-tuning for medical image classification.
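
A minimal sketch of the idea as the abstract describes it: the update to a frozen KxK convolution is factored into a depthwise convolution followed by a pointwise (1x1) convolution, initialized so the update starts at zero, and mergeable back into the pretrained kernel. Layer placement and initialization here are assumptions.

```python
# Hedged sketch of a CoLoRA-style conv adapter; not the authors' code.
import torch
import torch.nn as nn

class CoLoRAConv(nn.Module):
    def __init__(self, base: nn.Conv2d):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # frozen pretrained kernel
        c_in, c_out, k = base.in_channels, base.out_channels, base.kernel_size[0]
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=base.padding,
                                   stride=base.stride, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        nn.init.zeros_(self.pointwise.weight)  # update starts at zero, as in LoRA

    def forward(self, x):
        return self.base(x) + self.pointwise(self.depthwise(x))

    @torch.no_grad()
    def merge_(self):
        """Fold the update into the frozen kernel after an epoch: the composite
        of a depthwise then 1x1 conv is itself a KxK conv with
        W_update[o, i] = pointwise[o, i] * depthwise_kernel[i]."""
        upd = (self.pointwise.weight[:, :, 0, 0][..., None, None]
               * self.depthwise.weight[:, 0][None])
        self.base.weight += upd
        nn.init.zeros_(self.pointwise.weight)
```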

[243] A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking

Zixiang Zhao, Haowen Bai, Bingxin Ke, Yukun Cui, Lilun Deng, Yulun Zhang, Kai Zhang, Konrad Schindler

Main category: cs.CV

TL;DR: UniVF is a unified video fusion framework that addresses temporal inconsistency in video processing by using multi-frame learning and optical flow-based feature warping, achieving state-of-the-art results across four fusion tasks on the new VF-Bench benchmark.

DetailsMotivation: Most image fusion methods process static frames independently, ignoring temporal correlations in videos, which leads to flickering and temporal inconsistency in dynamic real-world scenarios.

Method: Proposes Unified Video Fusion (UniVF) framework using multi-frame learning and optical flow-based feature warping for temporally coherent video fusion. Also introduces Video Fusion Benchmark (VF-Bench) with four tasks and unified evaluation protocol.

Result: Extensive experiments show UniVF achieves state-of-the-art results across all four video fusion tasks (multi-exposure, multi-focus, infrared-visible, and medical fusion) on the VF-Bench benchmark.

Conclusion: UniVF provides an effective solution for temporally consistent video fusion across multiple tasks, with VF-Bench serving as a comprehensive benchmark for future video fusion research.

Abstract: The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and leading to flickering and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a novel and unified framework for video fusion that leverages multi-frame learning and optical flow-based feature warping for informative, temporally coherent video fusion. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of video fusion. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: https://vfbench.github.io.
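
The abstract's key mechanism, optical flow-based feature warping, typically reduces to backward warping with a sampling grid. Below is a minimal sketch of that step alone; the surrounding UniVF architecture is not shown, and conventions such as flow direction are assumptions.

```python
# Hedged sketch of flow-based feature warping via grid sampling.
import torch
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp features from a neighboring frame toward the current one.
    feat: (B, C, H, W); flow: (B, 2, H, W) in pixel offsets."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat)       # (2, H, W)
    coords = grid[None] + flow                                 # displaced coords
    coords_x = 2 * coords[:, 0] / (W - 1) - 1                  # normalize to [-1, 1]
    coords_y = 2 * coords[:, 1] / (H - 1) - 1
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)    # (B, H, W, 2)
    return F.grid_sample(feat, sample_grid, align_corners=True)
```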

[244] DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response

Junjue Wang, Weihao Xuan, Heli Qi, Zhihao Liu, Kunyi Liu, Yuhan Wu, Hongruixuan Chen, Jian Song, Junshi Xia, Zhuo Zheng, Naoto Yokoya

Main category: cs.CV

TL;DR: DisasterM3 is a large-scale remote sensing vision-language dataset for global disaster assessment, featuring 26,988 bi-temporal satellite images and 123k instruction pairs across 5 continents, covering 36 historical disaster events of 10 types with multi-sensor (optical and SAR) data and 9 disaster-related tasks.

DetailsMotivation: Complex disaster scenes with diverse disaster types, geographic regions, and satellite sensors pose challenges for vision-language models in Earth vision applications, creating a gap in disaster-specific datasets for global-scale assessment and response.

Method: Curated a comprehensive dataset (DisasterM3) with multi-hazard coverage (36 events across 10 disaster types), multi-sensor integration (optical and SAR imagery), and multi-task design (9 disaster-related visual perception and reasoning tasks). Fine-tuned four VLMs using this dataset to address identified challenges.

Result: Evaluation of 14 generic and remote sensing VLMs showed state-of-the-art models struggle with disaster tasks due to lack of disaster-specific corpus, cross-sensor gap, and damage object counting insensitivity. Fine-tuned models achieved stable improvements across all tasks with robust cross-sensor and cross-disaster generalization capabilities.

Conclusion: DisasterM3 fills a critical gap in disaster assessment datasets and demonstrates that fine-tuning VLMs with disaster-specific data significantly improves performance on complex disaster reasoning tasks, enabling better global disaster response capabilities.

Abstract: Large vision-language models (VLMs) have made great achievements in Earth vision. However, complex disaster scenes with diverse disaster types, geographic regions, and satellite sensors have posed new challenges for VLM applications. To fill this gap, we curate a remote sensing vision-language dataset (DisasterM3) for global-scale disaster assessment and response. DisasterM3 includes 26,988 bi-temporal satellite images and 123k instruction pairs across 5 continents, with three characteristics: 1) Multi-hazard: DisasterM3 involves 36 historical disaster events with significant impacts, which are categorized into 10 common natural and man-made disasters. 2)Multi-sensor: Extreme weather during disasters often hinders optical sensor imaging, making it necessary to combine Synthetic Aperture Radar (SAR) imagery for post-disaster scenes. 3) Multi-task: Based on real-world scenarios, DisasterM3 includes 9 disaster-related visual perception and reasoning tasks, harnessing the full potential of VLM’s reasoning ability with progressing from disaster-bearing body recognition to structural damage assessment and object relational reasoning, culminating in the generation of long-form disaster reports. We extensively evaluated 14 generic and remote sensing VLMs on our benchmark, revealing that state-of-the-art models struggle with the disaster tasks, largely due to the lack of a disaster-specific corpus, cross-sensor gap, and damage object counting insensitivity. Focusing on these issues, we fine-tune four VLMs using our dataset and achieve stable improvements across all tasks, with robust cross-sensor and cross-disaster generalization capabilities. The code and data are available at: https://github.com/Junjue-Wang/DisasterM3.

[245] Think With Videos For Agentic Long-Video Understanding

Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, Zhicheng Dou

Main category: cs.CV

TL;DR: VideoExplorer is a framework for long-video understanding that combines planning, temporal grounding, and scalable perception through iterative reasoning with sub-questions and task-oriented video analysis.

DetailsMotivation: Existing methods for long-video understanding either lose fine-grained details through downsampling or rely on textual reasoning over generic representations, limiting task-specific perception and exploration.

Method: VideoExplorer uses iterative reasoning: formulating sub-questions, locating relevant moments, and performing task-oriented scalable video understanding. It includes a two-stage training pipeline with supervised trajectory initialization and trajectory-level preference optimization.

Result: Extensive evaluations show VideoExplorer significantly outperforms existing baselines on long-video understanding benchmarks, demonstrating robustness, adaptability, and efficiency.

Conclusion: VideoExplorer provides faithful, efficient, and interpretable reasoning for long-video understanding by naturally integrating planning, temporal grounding, and scalable perception through iterative thinking processes.

Abstract: Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of "thinking with video", which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer’s significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository (https://github.com/yhy-2000/VideoDeepResearch).
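
Structurally, the iterative procedure described above can be read as a plan-ground-perceive loop. Below is a minimal sketch in which the planner, temporal grounder, clip analyzer, and answerer are model-backed callables supplied by the caller; all are hypothetical stand-ins, not the released code.

```python
# Hedged sketch of the plan -> ground -> perceive loop; plan/ground/analyze/
# answer are hypothetical model-backed callables, not the paper's API.
def video_explorer(question, video, plan, ground, analyze, answer, max_steps=8):
    notes = []                                  # accumulated evidence so far
    for _ in range(max_steps):
        sub_q = plan(question, notes)           # formulate the next sub-question
        if sub_q is None:                       # planner decides it can answer
            break
        start, end = ground(sub_q, video)       # locate the relevant moment
        notes.append((sub_q, analyze(sub_q, video[start:end])))  # sliceable video
    return answer(question, notes)              # final synthesis
```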

[246] Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu

Main category: cs.CV

TL;DR: Re-ttention enables very high sparse attention for diffusion transformers by leveraging temporal redundancy to overcome attention normalization shifts, achieving 3.1% token usage while maintaining visual quality.

DetailsMotivation: Attention mechanism in Diffusion Transformers has quadratic complexity with resolution and video length, creating bottlenecks. Existing sparse attention methods fail at high sparsity levels and incur compute overheads.

Method: Re-ttention reshapes attention scores based on prior softmax distribution history to preserve visual quality at very high sparsity levels, leveraging temporal redundancy of Diffusion Models.

Result: Experimental results show Re-ttention requires only 3.1% of tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference.

Conclusion: Re-ttention successfully enables extremely sparse attention for visual generation models while maintaining the visual quality of full quadratic attention.

Abstract: Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference.
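
The abstract's mechanism, compensating the softmax normalization shift under sparsity, admits a simple reading: the denominator computed over the kept tokens underestimates the full one, so the sparse output is rescaled using a denominator cached from a step where fuller attention was available. The sketch below implements that reading; it is an interpretation of the abstract, not the paper's algorithm.

```python
# Hedged sketch: rescale sparse attention with a cached softmax denominator.
import torch

def sparse_attn_rescaled(q, k, v, keep_idx, cached_full_denom):
    """q, k, v: (L, d); keep_idx: kept token indices; cached_full_denom: (L,)
    softmax denominators saved at an earlier denoising step with full attention."""
    scale = q.shape[-1] ** -0.5
    logits = (q @ k[keep_idx].T) * scale             # (L, |keep|)
    weights = logits.softmax(dim=-1)
    sparse_denom = logits.exp().sum(dim=-1)          # denominator over kept tokens only
    ratio = (sparse_denom / cached_full_denom).clamp(max=1.0)
    return ratio[:, None] * (weights @ v[keep_idx])  # approximates full attention
```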

[247] Pose-free 3D Gaussian splatting via shape-ray estimation

Youngju Na, Taeyeon Kim, Jumin Lee, Kyu Beom Han, Woo Jae Kim, Sung-eui Yoon

Main category: cs.CV

TL;DR: SHARE is a pose-free, feed-forward Gaussian splatting framework that jointly estimates shape and camera rays to overcome pose ambiguities, using a pose-aware canonical volume representation and anchor-aligned Gaussian prediction for robust scene reconstruction.

DetailsMotivation: Generalizable 3D Gaussian splatting depends heavily on precise camera poses for accurate geometry, but obtaining accurate poses in real-world scenarios is challenging, leading to noisy pose estimates and geometric misalignments.

Method: SHARE builds a pose-aware canonical volume representation that integrates multi-view information without explicit 3D transformations, and uses anchor-aligned Gaussian prediction to refine local geometry around coarse anchors for more precise Gaussian placement.

Result: Extensive experiments on diverse real-world datasets show that SHARE achieves robust performance in pose-free generalizable Gaussian splatting.

Conclusion: SHARE provides an effective solution for pose-free 3D Gaussian splatting by jointly estimating shape and camera rays, overcoming pose ambiguities through canonical volume representation and anchor-aligned refinement.

Abstract: While generalizable 3D Gaussian splatting enables efficient, high-quality rendering of unseen scenes, it heavily depends on precise camera poses for accurate geometry. In real-world scenarios, obtaining accurate poses is challenging, leading to noisy pose estimates and geometric misalignments. To address this, we introduce SHARE, a pose-free, feed-forward Gaussian splatting framework that overcomes these ambiguities by joint shape and camera rays estimation. Instead of relying on explicit 3D transformations, SHARE builds a pose-aware canonical volume representation that seamlessly integrates multi-view information, reducing misalignment caused by inaccurate pose estimates. Additionally, anchor-aligned Gaussian prediction enhances scene reconstruction by refining local geometry around coarse anchors, allowing for more precise Gaussian placement. Extensive experiments on diverse real-world datasets show that our method achieves robust performance in pose-free generalizable Gaussian splatting. Code is available at https://github.com/youngju-na/SHARE

[248] MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models

Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng

Main category: cs.CV

TL;DR: MSR-Align is a multimodal safety reasoning dataset that addresses safety risks in vision-language models by providing fine-grained, policy-grounded reasoning across vision and text modalities to improve robustness against jailbreak attacks.

DetailsMotivation: Vision-Language Models (VLMs) with enhanced chain-of-thought capabilities introduce novel safety risks from harmful multimodal prompts. Existing safety alignment methods designed for unimodal language models are insufficient for multimodal threats, and current safety datasets lack the fine-grained reasoning needed for robust alignment.

Method: Created MSR-Align dataset through a generation pipeline emphasizing multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. The dataset supports fine-grained reasoning over standardized safety policies across vision and text modalities.

Result: Fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks while preserving or enhancing general reasoning performance.

Conclusion: MSR-Align provides a scalable and effective foundation for advancing safety alignment of reasoning-capable VLMs, bridging the gap in multimodal safety reasoning capabilities.

Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce {MSR-Align}, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.

[249] Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning

Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, Yueting Zhuang

Main category: cs.CV

TL;DR: The paper proposes a unified approach for visual comprehension and generation in MLLMs through a two-stage training method, enabling iterative introspective image generation rather than treating comprehension and generation as separate functions.

DetailsMotivation: Current MLLMs treat visual comprehension and generation as independent capabilities, failing to leverage comprehension to enhance generation or integrate LLM reasoning into image generation processes.

Method: Two-stage training: supervised fine-tuning teaches MLLMs to generate genuine Chain-of-Thought for visual generation, followed by reinforcement learning that activates full potential through exploration-exploitation trade-off.

Result: The model excels in text-to-image generation, image editing, and functions as a superior image semantic evaluator with enhanced visual comprehension capabilities.

Conclusion: The approach advances MLLMs from simple text-to-image tasks to unified image generation by enabling collaborative co-evolution of visual comprehension and generation through iterative introspective processes.

Abstract: Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they are two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning equips the MLLM with the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: https://janus-pro-r1.github.io.

[250] VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning

Hao Yan, Xingchen Liu, Hao Wang, Zhenbiao Cao, Handong Zheng, Liang Yin, Xinxing Su, Zihao Chen, Jihao Wu, Minghui Liao, Chao Weng, Wei Chen, Yuliang Liu, Xiang Bai

Main category: cs.CV

TL;DR: The paper introduces VisuRiddles benchmark for abstract visual reasoning and PRS framework to generate training data with fine-grained perceptual descriptions, addressing MLLMs’ limitations in perceiving abstract graphics.

DetailsMotivation: Abstract Visual Reasoning (AVR) remains a critical challenge for multimodal large language models due to limitations in perceiving abstract graphics, creating a need for better benchmarks and training approaches.

Method: Proposed VisuRiddles benchmark with tasks across five core dimensions and two high-level reasoning categories, and developed Perceptual Riddle Synthesizer (PRS) framework for automated generation of riddles with fine-grained perceptual descriptions.

Result: Experimental results on VisuRiddles show that fine-grained visual perception is the principal bottleneck and the synthesis framework significantly enhances contemporary MLLMs’ performance on challenging AVR tasks.

Conclusion: The proposed approach effectively addresses abstract visual reasoning challenges by providing fine-grained perceptual supervision and automated data generation, leading to improved model performance and interpretability.

Abstract: Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models’ reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual description, crucially allowing for supervision over intermediate reasoning stages and thereby improving both training efficacy and model interpretability. Our extensive experimental results on VisuRiddles empirically validate that fine-grained visual perception is the principal bottleneck and our synthesis framework markedly enhances the performance of contemporary MLLMs on these challenging tasks. Our code and dataset will be released at https://github.com/yh-hust/VisuRiddles

[251] MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan

Main category: cs.CV

TL;DR: MATRIX is a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories and preference pairs to train VLM controllers for robust tool-use reasoning, achieving state-of-the-art performance across multiple benchmarks.

DetailsMotivation: Address the scarcity of high-quality multimodal trajectories and high cost of manual annotation for training VLM controllers in tool-use reasoning tasks.

Method: Developed a pipeline that constructs M-TRACE dataset (28.5K multimodal tasks with 177K verified trajectories) for imitation-based tuning, then generates Pref-X (11K preference pairs) for step-wise preference learning to optimize the MATRIX Agent controller.

Result: MATRIX consistently surpasses both open- and closed-source VLMs across three benchmarks (Agent-X, GTA, and GAIA), demonstrating scalable and effective multimodal tool use.

Conclusion: The framework enables scalable training of VLM controllers for complex reasoning and decision-making with external tools through automated trajectory synthesis and preference learning.

Abstract: Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.

[252] FlySearch: Exploring how vision-language models explore

Adam Pardyl, Dominik Matuszek, Mateusz Przebieracz, Marek Cygan, Bartosz Zieliński, Maciej Wołczyk

Main category: cs.CV

TL;DR: FlySearch is a 3D outdoor photorealistic environment for testing VLMs’ ability to search and navigate to objects in complex scenes. Current VLMs struggle with exploration tasks, showing significant performance gaps compared to humans.

DetailsMotivation: To test whether Vision-Language Models can effectively perform active, goal-driven exploration in messy, unstructured real-world environments.

Method: Created FlySearch - a 3D outdoor photorealistic environment with three difficulty scenarios for object search and navigation tasks. Analyzed VLM performance and identified failure causes.

Result: State-of-the-art VLMs cannot reliably solve even simple exploration tasks, with performance gap to humans increasing with task difficulty. Identified key failure causes including vision hallucination, context misunderstanding, and task planning failures.

Conclusion: Current VLMs have significant limitations in active exploration tasks. Some issues can be addressed through finetuning. The benchmark and codebase are publicly released to facilitate further research.

Abstract: The real world is messy and unstructured. Uncovering critical information often requires active, goal-driven exploration. It remains to be seen whether Vision-Language Models (VLMs), which recently emerged as a popular zero-shot tool in many difficult tasks, can operate effectively in such conditions. In this paper, we answer this question by introducing FlySearch, a 3D, outdoor, photorealistic environment for searching and navigating to objects in complex scenes. We define three sets of scenarios with varying difficulty and observe that state-of-the-art VLMs cannot reliably solve even the simplest exploration tasks, with the gap to human performance increasing as the tasks get harder. We identify a set of central causes, ranging from vision hallucination, through context misunderstanding, to task planning failures, and we show that some of them can be addressed by finetuning. We publicly release the benchmark, scenarios, and the underlying codebase.

[253] From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Tianxu Wang, Zhuofan Zhang, Ziyu Zhu, Yue Fan, Jing Xiong, Pengxiang Li, Xiaojian Ma, Qing Li

Main category: cs.CV

TL;DR: Anywhere3D-Bench is a new 3D visual grounding benchmark that extends beyond object localization to include human-activity areas, unoccupied space, and object parts, revealing significant limitations in current models’ spatial reasoning and fine-grained perception capabilities.

DetailsMotivation: Current 3D visual grounding research focuses primarily on object localization, leaving unexplored the grounding of referring expressions beyond objects in 3D scenes, such as spaces and object parts.

Method: The authors introduce Anywhere3D-Bench, a comprehensive benchmark with 2,886 referring expression-bounding box pairs across four grounding levels: human-activity areas, unoccupied space, individual objects, and object parts. They evaluate state-of-the-art 3D grounding methods, LLMs, and MLLMs on this benchmark.

Result: Experimental results show that space-level and part-level visual grounding are most challenging. OpenAI o4-mini achieves only 23.00% accuracy on space-level tasks and 31.46% on part-level tasks, significantly lower than its performance on area-level and object-level tasks.

Conclusion: There is a critical gap in current models’ capacity to understand and reason about 3D scenes beyond object-level semantics, particularly in spatial reasoning for space-level tasks and fine-grained perception for part-level tasks.

Abstract: 3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, individual objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.00% accuracy on space-level tasks and 31.46% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models’ capacity to understand and reason about 3D scenes beyond object-level semantics.

[254] SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction

Fabian Immel, Jan-Hendrik Pauls, Richard Fehler, Frank Bieder, Jonas Merkert, Christoph Stiller

Main category: cs.CV

TL;DR: SDTagNet is an online HD map construction method that uses standard definition (SD) maps like OpenStreetMap to enhance far-range detection accuracy for autonomous vehicles, incorporating both polyline data and textual annotations with NLP features.

DetailsMotivation: HD maps are expensive to maintain, while online HD map construction from sensors has limited range. SD maps are easier to maintain and can provide valuable prior information to overcome sensor limitations.

Method: Uses SD map priors with two innovations: 1) incorporates polyline data plus textual annotations with NLP-derived features, 2) introduces point-level SD map encoder with orthogonal element identifiers to uniformly integrate all map element types.

Result: Improves map perception performance by up to +5.9 mAP (+45%) compared to no priors, and +3.2 mAP (+20%) compared to previous SD map prior approaches on Argoverse 2 and nuScenes datasets.

Conclusion: SDTagNet effectively leverages widely available SD maps to significantly boost online HD map construction performance, overcoming sensor range limitations and reducing dependency on predefined specifications.

Abstract: Autonomous vehicles rely on detailed and accurate environmental information to operate safely. High definition (HD) maps offer a promising solution, but their high maintenance cost poses a significant barrier to scalable deployment. This challenge is addressed by online HD map construction methods, which generate local HD maps from live sensor data. However, these methods are inherently limited by the short perception range of onboard sensors. To overcome this limitation and improve general performance, recent approaches have explored the use of standard definition (SD) maps as prior, which are significantly easier to maintain. We propose SDTagNet, the first online HD map construction method that fully utilizes the information of widely available SD maps, like OpenStreetMap, to enhance far range detection accuracy. Our approach introduces two key innovations. First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes, but additional semantic information in the form of textual annotations. In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies. Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to uniformly integrate all types of map elements. Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) w.r.t. map construction without priors and up to +3.2 mAP (+20%) w.r.t. previous approaches that already use SD map priors. Code is available at https://github.com/immel-f/SDTagNet
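
One plausible reading of the point-level encoder with NLP-derived features: each polyline vertex token is fused with a sentence embedding of the element's textual annotations (e.g., OpenStreetMap tags). The module below is a sketch under that assumption; dimensions, the encoder choice, and the fusion scheme are illustrative.

```python
# Hedged sketch: fuse polyline vertex features with an NLP embedding of the
# element's annotations. Dimensions and fusion are illustrative assumptions.
import torch
import torch.nn as nn

class SDMapTokenEncoder(nn.Module):
    def __init__(self, text_dim=384, point_dim=64, out_dim=256):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(2, point_dim), nn.ReLU(),
                                       nn.Linear(point_dim, point_dim))
        self.fuse = nn.Linear(point_dim + text_dim, out_dim)

    def forward(self, points, text_emb):
        """points: (N, 2) polyline vertices; text_emb: (text_dim,) sentence
        embedding of the element's textual annotations."""
        pt = self.point_mlp(points)                     # (N, point_dim)
        txt = text_emb[None].expand(pt.shape[0], -1)    # broadcast per vertex
        return self.fuse(torch.cat([pt, txt], dim=-1))  # (N, out_dim) map tokens
```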

[255] Glyph: Scaling Context Windows via Visual-Text Compression

Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang

Main category: cs.CV

TL;DR: Glyph is a framework that converts long text into images using vision-language models to achieve 3-4x token compression while maintaining accuracy comparable to leading LLMs, enabling efficient processing of million-token-level contexts.

DetailsMotivation: Scaling context windows to million-token levels in LLMs brings prohibitive computational and memory costs, limiting the practicality of long-context models for document understanding, code analysis, and multi-step reasoning tasks.

Method: Proposes Glyph framework that renders long texts into images and processes them with vision-language models, using LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression.

Result: Achieves 3-4x token compression while maintaining comparable accuracy to Qwen3-8B on long-context benchmarks, with 4x faster prefilling/decoding and 2x faster SFT training. A 128K-context VLM can scale to handle 1M-token-level tasks under extreme compression.

Conclusion: Visual context scaling through text-to-image rendering provides an effective alternative to token-based sequence extension, enabling efficient processing of extremely long contexts while benefiting multimodal tasks like document understanding.

Abstract: Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
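
The core rendering step is straightforward to sketch: draw the text onto fixed-size page images that the VLM then consumes as vision tokens. The page size, wrapping width, and font below are arbitrary placeholders, whereas Glyph searches over such configurations with its LLM-driven genetic search.

```python
# Hedged sketch of text-to-image rendering; parameters are placeholders for
# the configurations Glyph's genetic search would choose.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_pages(text, page=(896, 896), line_height=16, chars_per_line=96):
    font = ImageFont.load_default()        # placeholder; a real config picks a TTF
    lines = textwrap.wrap(text, width=chars_per_line)
    lines_per_page = page[1] // line_height
    pages = []
    for i in range(0, len(lines), lines_per_page):
        img = Image.new("RGB", page, "white")
        draw = ImageDraw.Draw(img)
        for j, line in enumerate(lines[i:i + lines_per_page]):
            draw.text((8, j * line_height), line, fill="black", font=font)
        pages.append(img)
    return pages                           # each page costs a fixed vision-token budget
```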

[256] ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model

Jialong Zuo, Yongtai Deng, Mengdan Tan, Rui Jin, Dongyue Wu, Nong Sang, Liang Pan, Changxin Gao

Main category: cs.CV

TL;DR: The paper introduces OM-ReID, a multi-modal person re-identification problem, and presents ORBench dataset with 5 modalities and ReID5o framework for unified multi-modal learning.

DetailsMotivation: Real-world person re-identification requires handling various query modalities, but existing methods and datasets are limited to specific modalities, failing to meet practical requirements.

Method: Constructed ORBench dataset with 1,000 identities across 5 modalities (RGB, infrared, color pencil, sketch, text), and proposed ReID5o framework with unified encoding and multi-expert routing for synergistic fusion and cross-modal alignment.

Result: Extensive experiments show ORBench’s advancement and practicality, with ReID5o achieving the best performance among the models evaluated on the dataset.

Conclusion: ORBench serves as an ideal platform for OM-ReID research, and ReID5o provides an effective solution for multi-modal person re-identification with arbitrary modality combinations.

Abstract: In real-world scenarios, person re-identification (ReID) aims to identify a person-of-interest via the descriptive query, regardless of whether the query is a single modality or a combination of multiple modalities. However, existing methods and datasets remain constrained to limited modalities, failing to meet this requirement. Therefore, we investigate a new challenging problem called Omni Multi-modal Person Re-identification (OM-ReID), which aims to achieve effective retrieval with varying multi-modal queries. To address dataset scarcity, we construct ORBench, the first high-quality multi-modal dataset comprising 1,000 unique identities across five modalities: RGB, infrared, color pencil, sketch, and textual description. The dataset also offers substantial diversity, e.g., in painting perspectives and textual information, and could serve as an ideal platform for follow-up investigations in OM-ReID. Moreover, we propose ReID5o, a novel multi-modal learning framework for person ReID. It enables synergistic fusion and cross-modal alignment of arbitrary modality combinations in a single model, via a unified encoding and multi-expert routing mechanism. Extensive experiments verify the advancement and practicality of our ORBench. A wide range of possible models have been evaluated and compared on it, and our proposed ReID5o model gives the best performance. The dataset and code will be made publicly available at https://github.com/Zplusdragon/ReID5o_ORBench.

[257] HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization

Roey Ron, Guy Tevet, Haim Sawdayee, Amit H. Bermano

Main category: cs.CV

TL;DR: HOIDiNi is a text-driven diffusion framework that generates realistic human-object interactions by optimizing in noise space using Diffusion Noise Optimization, achieving both physical correctness and motion naturalness through a two-phase approach.

DetailsMotivation: Human-object interaction generation is challenging due to strict contact accuracy requirements and diverse motion manifolds. Current methods trade off between realism and physical correctness, but HOIDiNi aims to achieve both.

Method: Uses Diffusion Noise Optimization (DNO) in pretrained diffusion model noise space. Separates the problem into two phases: object-centric phase for hand-object contact location selection, and human-centric phase for full-body motion refinement.

Result: Outperforms prior works and baselines on GRAB dataset in contact accuracy, physical validity, and overall quality. Can generate complex interactions like grasping, placing, and full-body coordination using only text prompts.

Conclusion: HOIDiNi successfully generates realistic and plausible human-object interactions with precise contact and natural motion through structured optimization in diffusion noise space.

Abstract: We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it induces strict contact accuracies alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. https://hoidini.github.io.
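
DNO's essential move, per the abstract, is optimizing the sampler's input noise against a task loss rather than fine-tuning the model. Below is a minimal sketch; the frozen differentiable sampler and the contact objective are hypothetical callables, and the optimizer settings are arbitrary.

```python
# Hedged sketch of Diffusion Noise Optimization; `sampler` and `task_loss`
# are hypothetical stand-ins for the frozen model pass and the HOI objective.
import torch

def diffusion_noise_optimization(sampler, task_loss, noise_shape,
                                 steps=100, lr=0.05):
    noise = torch.randn(noise_shape, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        motion = sampler(noise)        # differentiable pass through frozen model
        loss = task_loss(motion)       # e.g., hand-object contact error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()              # optimized noise; sample once more to render
```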

[258] Polyline Path Masked Attention for Vision Transformer

Zhongchen Zhao, Chaodong Xiao, Hui Lin, Qi Xie, Lei Zhang, Deyu Meng

Main category: cs.CV

TL;DR: Proposes PPMA that combines ViT’s self-attention with enhanced Mamba2 structured mask using 2D polyline path scanning for better spatial adjacency modeling in computer vision tasks.

DetailsMotivation: To integrate the global dependency modeling of Vision Transformers with the spatial adjacency prior modeling of Mamba2, leveraging complementary strengths of both architectures for improved computer vision performance.

Method: Enhanced Mamba2’s structured mask with 2D polyline path scanning strategy, created polyline path mask, conducted theoretical analysis, designed efficient computation algorithm, and embedded the mask into ViT’s self-attention mechanism.

Result: Outperforms previous state-of-the-art approaches on image classification, object detection, and segmentation. PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on ADE20K semantic segmentation, surpassing RMT-T/S/B by 0.7%/1.3%/0.3% respectively.

Conclusion: PPMA successfully integrates ViT and Mamba2 strengths, demonstrating superior performance across multiple computer vision benchmarks through improved spatial adjacency modeling.

Abstract: Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.

[259] ViFusionTST: Deep Fusion of Time-Series Image Representations from Load Signals for Early Bed-Exit Prediction

Hao Liu, Yu Hu, Rakiba Rayhana, Ling Bai, Zheng Liu

Main category: cs.CV

TL;DR: Early bed-exit intent can be predicted using a single load cell under a bed leg by converting load signals into complementary images and processing them with a dual-stream Swin Transformer that fuses modalities through cross-attention.

DetailsMotivation: Bed-related falls are a major injury source in healthcare facilities, and current commercial alarms only trigger after patients have already left the bed, making early prediction crucial for prevention.

Method: Uses one low-cost load cell under bed leg, converts load signals into RGB line plot and three texture maps (recurrence plot, Markov transition field, Gramian angular field), then processes with ViFusionTST - a dual-stream Swin Transformer that fuses line plot and texture maps through cross-attention.

Result: On 6 months of continuous data from 95 beds in long-term-care facility, ViFusionTST achieved accuracy of 0.885 and F1 score of 0.794, outperforming recent 1D and 2D time-series baselines across multiple metrics.

Conclusion: Image-based fusion of load-sensor signals for time series classification provides a practical, effective, and privacy-preserving solution for real-time fall prevention in healthcare settings.

Abstract: Bed-related falls remain a major source of injury in hospitals and long-term care facilities, yet many commercial alarms trigger only after a patient has already left the bed. We show that early bed-exit intent can be predicted using only one low-cost load cell mounted under a bed leg. The resulting load signals are first converted into a compact set of complementary images: an RGB line plot that preserves raw waveforms and three texture maps (recurrence plot, Markov transition field, and Gramian angular field) that expose higher-order dynamics. We introduce ViFusionTST, a dual-stream Swin Transformer that processes the line plot and texture maps in parallel and fuses them through cross-attention to learn data-driven modality weights. To provide a realistic benchmark, we collected six months of continuous data from 95 beds in a long-term-care facility. On this real-world dataset ViFusionTST reaches an accuracy of 0.885 and an F1 score of 0.794, surpassing recent 1D and 2D time-series baselines across F1, recall, accuracy, and AUPRC. The results demonstrate that image-based fusion of load-sensor signals for time series classification is a practical and effective solution for real-time, privacy-preserving fall prevention.
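
The three texture maps named above are standard time-series imaging transforms, all available in the pyts library. The sketch below generates them for one signal window; the image sizes and parameters are illustrative, not the paper's settings.

```python
# Hedged sketch: 1D load signal -> recurrence plot, Markov transition field,
# and Gramian angular field, via pyts (parameters are illustrative).
import numpy as np
from pyts.image import RecurrencePlot, MarkovTransitionField, GramianAngularField

signal = np.random.default_rng(0).normal(size=(1, 512))   # one load-cell window

rp  = RecurrencePlot(threshold="point", percentage=20).fit_transform(signal)[0]
mtf = MarkovTransitionField(image_size=64, n_bins=8).fit_transform(signal)[0]
gaf = GramianAngularField(image_size=64, method="summation").fit_transform(signal)[0]
# rp, mtf, gaf are 2D arrays; together with an RGB line plot of the raw
# waveform, they form the inputs to the dual-stream model described above.
```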

[260] RODS: Robust Optimization Inspired Diffusion Sampling for Detecting and Reducing Hallucination in Generative Models

Yiqi Tian, Pengfei Jin, Mingze Yuan, Na Li, Bo Zeng, Quanzheng Li

Main category: cs.CV

TL;DR: RODS is a novel diffusion sampling method that detects and corrects hallucinations using optimization-inspired geometric cues from the loss landscape, improving sampling fidelity and robustness without retraining.

DetailsMotivation: Diffusion models achieve state-of-the-art generative performance but suffer from hallucinations due to score approximation inaccuracies during sampling.

Method: Reinterprets diffusion sampling through an optimization lens, detects high-risk steps using geometric cues from the loss landscape, enforces smoother trajectories, and adaptively adjusts perturbations.
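
For intuition, here is a heavily simplified sketch of that optimization-style view: monitor a geometric cue between consecutive score estimates and damp the update when a step looks high-risk. The cosine test is a stand-in for the paper's loss-landscape cues, and all names (`score_fn`, the thresholds) are hypothetical.

```python
import torch

def rods_like_step(x, t_cur, t_next, score_fn, prev_score=None,
                   cos_thresh=0.5, damp=0.5):
    """One Euler step with a crude high-risk detector.

    score_fn(x, t) -> estimated score; the cosine between consecutive
    score directions stands in for the paper's loss-landscape cues
    (a sharp turn suggests an unreliable score region)."""
    score = score_fn(x, t_cur)
    step = (t_next - t_cur) * score
    if prev_score is not None:
        cos = torch.nn.functional.cosine_similarity(
            score.flatten(1), prev_score.flatten(1), dim=1).mean()
        if cos < cos_thresh:  # high-risk: enforce a smoother trajectory
            step = damp * step + (1 - damp) * (t_next - t_cur) * prev_score
    return x + step, score
```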

Result: RODS maintains comparable image quality and diversity, detects over 70% of hallucinated samples, corrects more than 25%, improves sampling fidelity and robustness, without introducing new artifacts.

Conclusion: RODS provides an effective method to reduce hallucinations in diffusion sampling with minimal inference cost, enhancing robustness while preserving generation quality and diversity.

Abstract: Diffusion models have achieved state-of-the-art performance in generative modeling, yet their sampling procedures remain vulnerable to hallucinations, often stemming from inaccuracies in score approximation. In this work, we reinterpret diffusion sampling through the lens of optimization and introduce RODS (Robust Optimization-inspired Diffusion Sampler), a novel method that detects and corrects high-risk sampling steps using geometric cues from the loss landscape. RODS enforces smoother sampling trajectories and adaptively adjusts perturbations, reducing hallucinations without retraining and at minimal additional inference cost. Experiments on AFHQv2, FFHQ, and 11k-hands demonstrate that RODS maintains comparable image quality and preserves generation diversity. More importantly, it improves both sampling fidelity and robustness, detecting over 70% of hallucinated samples and correcting more than 25%, all while avoiding the introduction of new artifacts. We release our code at https://github.com/Yiqi-Verna-Tian/RODS.

[261] Moving Object Detection from Moving Camera Using Focus of Expansion Likelihood and Segmentation

Masahiro Ogawa, Qi An, Atsushi Yamashita

Main category: cs.CV

TL;DR: FoELS is a method that integrates optical flow and texture information to separate moving objects from static scenes in complex environments with camera motion.

DetailsMotivation: Existing approaches relying primarily on optical flow struggle to detect moving objects in complex, structured scenes with camera motion, limiting their effectiveness in robotics applications.

Method: FoELS computes the focus of expansion (FoE) from optical flow, derives an initial motion likelihood from the outliers of the FoE computation, and fuses this with a segmentation-based prior to estimate the final moving probability.
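
The FoE computation itself reduces to a linear least-squares problem. Below is a minimal sketch, assuming a purely translating camera and dense flow stored as a NumPy array; the residual-to-likelihood mapping is a simplification of the paper's formulation.

```python
import numpy as np

def foe_and_motion_likelihood(flow, xs, ys, eps=1e-6):
    """flow: (H, W, 2) optical flow; xs, ys: pixel coordinate grids.

    For a purely translating camera, each flow vector points away from the
    focus of expansion e, so cross(p - e, f) = 0. Stacking
    f_y * e_x - f_x * e_y = f_y * x - f_x * y over all pixels gives a linear
    least-squares problem; large residuals flag likely moving pixels."""
    fx, fy = flow[..., 0].ravel(), flow[..., 1].ravel()
    x, y = xs.ravel(), ys.ravel()
    A = np.stack([fy, -fx], axis=1)
    b = fy * x - fx * y
    e, *_ = np.linalg.lstsq(A, b, rcond=None)          # (e_x, e_y)
    resid = np.abs(A @ e - b) / (np.hypot(fx, fy) + eps)
    likelihood = resid.reshape(flow.shape[:2])
    return e, likelihood / (likelihood.max() + eps)
```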

Result: The method effectively handles complex structured scenes, rotational camera motion, and parallel motion, achieving state-of-the-art performance on the DAVIS 2016 dataset and real-world traffic videos.

Conclusion: Integrating both optical flow and texture information through FoELS provides robust moving object detection in challenging scenarios with camera motion.

Abstract: Separating moving and static objects from a moving camera viewpoint is essential for 3D reconstruction, autonomous navigation, and scene understanding in robotics. Existing approaches often rely primarily on optical flow, which struggles to detect moving objects in complex, structured scenes involving camera motion. To address this limitation, we propose Focus of Expansion Likelihood and Segmentation (FoELS), a method based on the core idea of integrating both optical flow and texture information. FoELS computes the focus of expansion (FoE) from optical flow and derives an initial motion likelihood from the outliers of the FoE computation. This likelihood is then fused with a segmentation-based prior to estimate the final moving probability. The method effectively handles challenges including complex structured scenes, rotational camera motion, and parallel motion. Comprehensive evaluations on the DAVIS 2016 dataset and real-world traffic videos demonstrate its effectiveness and state-of-the-art performance.

[262] Predicting Video Slot Attention Queries from Random Slot-Feature Pairs

Rongzhen Zhao, Jian Li, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: RandSF.Q improves video object-centric learning by incorporating next frame features and learning transition dynamics through random slot-feature pair sampling, achieving state-of-the-art performance in object discovery.

DetailsMotivation: Existing video object-centric learning methods neglect to incorporate next frame features (the most informative source for query prediction) and fail to learn transition dynamics (essential knowledge for query prediction).

Method: Proposes Random Slot-Feature pair for learning Query prediction (RandSF.Q): (1) designs a new transitioner that incorporates both slots and features for query prediction, (2) trains the transitioner using randomly sampled slot-feature pairs from available recurrences to learn transition dynamics.

Result: Significantly surpasses existing video OCL methods, achieving up to a 10-point improvement on object discovery metrics and setting a new state-of-the-art. This superiority also benefits downstream tasks like dynamics modeling.

Conclusion: RandSF.Q effectively addresses the limitations of existing video OCL methods by incorporating next frame features and learning transition dynamics, demonstrating substantial improvements in object discovery and downstream tasks.

Abstract: Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling as we humans do. Mainstream video OCL methods adopt a recurrent architecture: An aggregator aggregates the current video frame into object features, termed slots, under some queries; A transitioner transits current slots to queries for the next frame. This is an effective architecture but all existing implementations both (\textit{i1}) neglect to incorporate next frame features, the most informative source for query prediction, and (\textit{i2}) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (\textit{t1}) We design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (\textit{t2}) We train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting a new state-of-the-art. Such superiority also benefits downstream tasks like dynamics modeling. Our core source code and training logs are available at https://github.com/Genera1Z/RandSF.Q.

[263] Distilling LLM Prior to Flow Model for Generalizable Agent’s Imagination in Object Goal Navigation

Badi Li, Ren-jie Lu, Yu Zhou, Jingke Meng, Wei-shi Zheng

Main category: cs.CV

TL;DR: GOAL is a generative flow-based framework for Object Goal Navigation that models semantic distributions of indoor environments using LLM-enriched semantic maps, achieving state-of-the-art performance and strong generalization.

DetailsMotivation: Prior ObjectNav approaches rely on deterministic models that overlook uncertainty in indoor layouts and limit generalization to unseen environments.

Method: Proposes a generative flow-based framework that bridges observed regions with LLM-enriched full-scene semantic maps, encoding spatial priors from LLMs as 2D Gaussian fields to distill contextual knowledge.
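
The Gaussian-field encoding is straightforward to sketch: each LLM-suggested object location becomes a 2D Gaussian bump added to the target semantic map. The shapes, weights, and normalization below are assumptions for illustration.

```python
import numpy as np

def gaussian_prior_field(h, w, centers, sigmas, weights=None):
    """Render LLM-suggested object location priors as a 2D Gaussian field.

    centers: list of (row, col) means; sigmas: per-object std-devs.
    The result could be injected as an extra channel of the target
    semantic map (names and shapes here are illustrative)."""
    ys, xs = np.mgrid[0:h, 0:w]
    field = np.zeros((h, w))
    weights = weights or [1.0] * len(centers)
    for (cy, cx), s, a in zip(centers, sigmas, weights):
        field += a * np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * s ** 2))
    return field / (field.max() + 1e-8)
```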

Result: Achieves state-of-the-art performance on MP3D and Gibson datasets, and shows strong generalization in transfer settings to HM3D.

Conclusion: The generative approach with LLM-enriched semantic maps enables more generalizable scene completions and improves ObjectNav performance across diverse environments.

Abstract: The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D. Codes and pretrained models are available at https://github.com/Badi-Li/GOAL.

[264] Increasing the Utility of Synthetic Images through Chamfer Guidance

Nicola Dall’Asen, Xiaofeng Zhang, Reyhane Askari Hemmat, Melissa Hall, Jakob Verbeek, Adriana Romero-Soriano, Michal Drozdzal

Main category: cs.CV

TL;DR: Chamfer Guidance is a training-free approach that uses a few real exemplar images to improve both quality and diversity of synthetic image generation, achieving state-of-the-art few-shot performance and better downstream classification accuracy.

DetailsMotivation: Address the trade-off between generation quality and diversity in conditional image generative models, and the distribution shift between synthetic and real data that limits their utility as training data sources.

Method: Proposes Chamfer Guidance, a training-free guidance approach that leverages a handful of real exemplar images to characterize both the quality and diversity of synthetic data, without requiring an unconditional model.
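
Below is a minimal sketch of a Chamfer-style guidance energy over feature embeddings, assuming the two directed nearest-neighbor terms correspond to quality and coverage; the paper's exact objective may weight or define these differently.

```python
import torch

def chamfer_energy(fake_feats, real_feats, alpha=1.0, beta=1.0):
    """Two-sided Chamfer energy between generated and exemplar features.

    fake -> nearest real encourages quality (precision); real -> nearest
    fake encourages coverage (diversity). A plausible reading of the
    guidance signal, not the authors' exact formulation."""
    d = torch.cdist(fake_feats, real_feats)   # (n_fake, n_real) pairwise distances
    quality = d.min(dim=1).values.mean()      # each fake near some exemplar
    coverage = d.min(dim=0).values.mean()     # every exemplar is covered
    return alpha * quality + beta * coverage  # differentiable w.r.t. fake_feats
```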

Result: Achieves 96.4% precision and 86.4% distributional coverage with only 2 real images, improving to 97.5% and 92.7% with 32 images. Provides accuracy boosts of up to 15% (in-distribution) and 16% (out-of-distribution) for downstream classifiers, and a 31% FLOPs reduction compared to classifier-free guidance.

Conclusion: Chamfer Guidance effectively balances quality and diversity in synthetic data generation using minimal real exemplars, significantly improving downstream task performance while being computationally efficient.

Abstract: Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as few as 2 exemplar real images, obtaining 96.4% in terms of precision and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of Chamfer Guidance by training downstream image classifiers on synthetic data, achieving accuracy boosts of up to 15% over the baselines in-distribution, and up to 16% out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.

[265] Interpretable Decision-Making for End-to-End Autonomous Driving

Mona Mirzaie, Bodo Rosenhahn

Main category: cs.CV

TL;DR: This paper proposes a method to enhance interpretability in end-to-end autonomous driving models by using loss functions that generate sparse and localized feature maps, allowing visualization of which image regions influence control decisions.

DetailsMotivation: End-to-end autonomous driving models are challenging to interpret due to deep neural networks with non-linear decision boundaries, making it difficult to understand AI-driven decisions in complex urban scenarios.

Method: The authors propose loss functions that promote interpretability by generating sparse and localized feature maps, which show which image regions contribute to predicted control commands. They conduct ablation studies on feature extraction and validate on CARLA benchmarks.
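
One plausible sketch of losses with this flavor: an L1 term for sparsity plus a spatial-variance term that pulls each channel's activation mass toward its centroid. These are illustrative stand-ins, not the paper's exact loss functions.

```python
import torch

def sparsity_locality_loss(fmap, l1_w=1e-4, loc_w=1e-3):
    """fmap: (B, C, H, W) activations. The L1 term promotes sparsity; the
    spatial variance of each channel's activation mass around its centroid
    promotes localized blobs."""
    b, c, h, w = fmap.shape
    p = fmap.abs()
    l1 = p.mean()
    ys = torch.arange(h, device=fmap.device).view(1, 1, h, 1)
    xs = torch.arange(w, device=fmap.device).view(1, 1, 1, w)
    mass = p.sum(dim=(2, 3), keepdim=True) + 1e-8
    cy = (p * ys).sum(dim=(2, 3), keepdim=True) / mass   # activation centroids
    cx = (p * xs).sum(dim=(2, 3), keepdim=True) / mass
    var = (p * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum(dim=(2, 3)) \
          / mass.squeeze(-1).squeeze(-1)
    return l1_w * l1 + loc_w * var.mean()
```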

Result: The method improves interpretability while reducing infractions, yielding safer driving performance. Their monocular, non-ensemble model surpasses top CARLA Leaderboard approaches with lower infraction scores and highest route completion rate.

Conclusion: The approach successfully enhances interpretability in autonomous driving models while maintaining or improving performance, demonstrating that interpretability can correlate with reduced infractions and safer driving behavior.

Abstract: Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.

[266] GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization

Pengyue Jia, Yingyi Zhang, Xiangyu Zhao, Sharon Li

Main category: cs.CV

TL;DR: GeoArena is an open platform for evaluating large vision-language models on image geolocalization tasks, addressing data leakage issues and privacy concerns through in-the-wild image evaluation and human-centered benchmarking.

DetailsMotivation: Current image geolocalization evaluation methods suffer from data leakage (models pretrained on test datasets) and privacy concerns from using exact coordinates, requiring a more robust and human-centered evaluation approach.

Method: GeoArena enables users to upload in-the-wild images for diverse evaluation and uses pairwise human judgments to determine which model output better aligns with human expectations.
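
Arena-style leaderboards are typically built from pairwise votes with a rating scheme such as Elo or Bradley-Terry; GeoArena's exact scheme is not specified here, so the snippet below shows a standard Elo update as one plausible choice.

```python
def elo_update(r_a, r_b, winner, k=32):
    """Update two model ratings from one pairwise human vote.

    winner: 'a', 'b', or 'tie'. Standard Elo with K-factor k; the
    platform's actual ranking scheme is an assumption here."""
    exp_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))      # expected score of model A
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    r_a += k * (score_a - exp_a)
    r_b += k * ((1 - score_a) - (1 - exp_a))
    return r_a, r_b
```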

Result: The platform collected thousands of voting records over two months, enabling detailed analysis and establishing a leaderboard of different LVLMs on image geolocalization tasks.

Conclusion: GeoArena provides a more accurate and privacy-preserving evaluation framework for image geolocalization and has been open-sourced to support future research.

Abstract: Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model’s actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, the first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected thousands of voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task. GeoArena has been open-sourced to support future research.

[267] Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning

Trixia Simangan, Ahmed Nadeem Abbasi, Yipeng Hu, Shaheer U. Saeed

Main category: cs.CV

TL;DR: Cryo-RL uses reinforcement learning to automate cryoablation planning for prostate cancer, achieving performance comparable to human experts with significantly reduced planning time.

DetailsMotivation: Current cryoablation planning is manual, expertise-dependent, and time-consuming, leading to treatment variability and limited scalability.

Method: Models cryoablation planning as a Markov decision process where an agent sequentially selects cryoprobe positions and ice sphere diameters in a simulated environment with clinical constraints.
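
A toy 2D version of one environment step makes the MDP concrete: an action places a probe with an ice-sphere diameter, and the reward is the gain in Dice overlap with the tumour mask. The 2D disk cross-section and millimetre grid are simplifying assumptions.

```python
import numpy as np

def dice(a, b, eps=1e-8):
    inter = np.logical_and(a, b).sum()
    return 2 * inter / (a.sum() + b.sum() + eps)

def place_probe(ablated, tumour, center, diameter, grid_mm=1.0):
    """Apply one action (probe position + ice-sphere diameter) on a 2D
    slice and return the coverage reward: a simplified 2D stand-in for
    the paper's simulated environment."""
    h, w = ablated.shape
    ys, xs = np.mgrid[0:h, 0:w]
    r = (diameter / 2) / grid_mm
    ice = (ys - center[0]) ** 2 + (xs - center[1]) ** 2 <= r ** 2
    new_ablated = ablated | ice
    reward = dice(new_ablated, tumour) - dice(ablated, tumour)
    return new_ablated, reward
```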

Result: Achieved over 8 percentage-point Dice improvements compared to automated baselines and matched human expert performance while requiring substantially less planning time.

Conclusion: Reinforcement learning can deliver clinically viable, reproducible, and efficient cryoablation plans.

Abstract: Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue during thawing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.

[268] Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems

Jie Zhang, Ting Xu, Gelei Deng, Runyi Hu, Han Qiu, Tianwei Zhang, Qing Guo, Ivor Tsang

Main category: cs.CV

TL;DR: Vision language models struggle with fragmented, fused, or partially occluded text that humans can easily read, revealing limitations in compositional processing.

DetailsMotivation: To investigate whether advanced vision language models share human resilience in recognizing words despite character fragmentation, fusion, or occlusion across different writing systems.

Method: Built psychophysics-inspired benchmarks for Chinese logographs and English alphabetic words by splicing, recombining, and overlaying glyphs, yielding "visible but unreadable" stimuli for models that remain legible to humans.
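
One such perturbation is easy to reproduce: render a word, split it horizontally, and shift the two halves apart, so each fragment stays visible while the whole becomes harder to parse. A minimal PIL sketch (the benchmark's actual operations are more varied):

```python
from PIL import Image, ImageDraw, ImageFont

def fragmented_word(word, offset=6, size=(320, 96)):
    """Render a word, split it into top and bottom halves, and shift the
    halves in opposite directions: one of many possible splice/offset
    perturbations."""
    img = Image.new("L", size, 255)
    ImageDraw.Draw(img).text((10, 20), word, fill=0,
                             font=ImageFont.load_default())
    top = img.crop((0, 0, size[0], size[1] // 2))
    bottom = img.crop((0, size[1] // 2, size[0], size[1]))
    out = Image.new("L", size, 255)
    out.paste(top, (-offset, 0))                 # shift the top half left
    out.paste(bottom, (offset, size[1] // 2))    # and the bottom half right
    return out
```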

Result: Contemporary VLMs show severe performance drops under these perturbations, frequently producing unrelated or incoherent outputs, despite strong performance on clean text.

Conclusion: Models rely heavily on generic visual invariances but underutilize compositional priors needed for robust literacy, motivating new architectures and training strategies for symbol segmentation, composition, and binding across scripts.

Abstract: Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics-inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield "visible but unreadable" stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under-rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow-up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.

[269] Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation

Yi-Ruei Liu, You-Zhe Xie, Yu-Hsiang Hsu, I-Sheng Fang, Yu-Lun Liu, Jun-Cheng Chen

Main category: cs.CV

TL;DR: 4D-GRT is a two-stage pipeline that combines 4D Gaussian Splatting with ray tracing to generate videos with realistic camera effects like fisheye distortion and rolling shutter, addressing the limitation of computer vision systems trained only on ideal pinhole cameras.

DetailsMotivation: Current computer vision systems fail with real-world camera effects because they lack training data containing such effects. Existing data generation methods are either expensive, suffer from sim-to-real gaps, or fail to model camera effects accurately.

Method: Two-stage pipeline: 1) Reconstruct dynamic scenes from multi-view videos using 4D Gaussian Splatting, 2) Apply physically-based ray tracing to generate videos with controllable camera effects.

Result: 4D-GRT achieves the fastest rendering speed while performing better or comparable rendering quality compared to existing baselines. The authors also created a benchmark with eight synthetic dynamic scenes across four camera effects.

Conclusion: 4D-GRT provides an effective solution for generating training data with realistic camera effects, bridging the gap between ideal pinhole camera assumptions and real-world camera distortions.

Abstract: Common computer vision systems typically assume ideal pinhole cameras but fail when facing real-world camera effects such as fisheye distortion and rolling shutter, mainly due to the lack of learning from training data with camera effects. Existing data generation approaches suffer from either high costs, sim-to-real gaps or fail to accurately model camera effects. To address this bottleneck, we propose 4D Gaussian Ray Tracing (4D-GRT), a novel two-stage pipeline that combines 4D Gaussian Splatting with physically-based ray tracing for camera effect simulation. Given multi-view videos, 4D-GRT first reconstructs dynamic scenes, then applies ray tracing to generate videos with controllable, physically accurate camera effects. 4D-GRT achieves the fastest rendering speed while performing better or comparable rendering quality compared to existing baselines. Additionally, we construct eight synthetic dynamic scenes in indoor environments across four camera effects as a benchmark to evaluate generated videos with camera effects.

[270] SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models

Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou, Wei Tang, Hua Gang

Main category: cs.CV

TL;DR: SAMPO is a hybrid world model framework that combines visual autoregressive modeling with causal modeling to improve temporal consistency and rollout efficiency in video prediction and model-based control.

DetailsMotivation: Existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling.

Method: SAMPO integrates temporal causal decoding with bidirectional spatial attention, uses an asymmetric multi-scale tokenizer, and includes a trajectory-aware motion prompt module to inject spatiotemporal cues.

Result: SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference, and demonstrates strong zero-shot generalization and scaling behavior.

Conclusion: SAMPO effectively addresses limitations of existing world models by preserving spatial locality, supporting parallel decoding, and improving dynamic scene understanding through motion prompts, making it suitable for planning and long-horizon decision-making.

Abstract: World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose \textbf{S}cale-wise \textbf{A}utoregression with \textbf{M}otion \textbf{P}r\textbf{O}mpt (\textbf{SAMPO}), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO’s zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.

[271] UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

Main category: cs.CV

TL;DR: UniPixel is a large multi-modal model that integrates pixel-level perception with visual reasoning, enabling mask-grounded responses and fine-grained pixel-level understanding across diverse tasks.

DetailsMotivation: Existing LMMs focus on holistic image/video understanding but lack fine-grained pixel-level alignment between visual signals and language semantics. Previous models perform referring or segmentation tasks independently without integrating them into visual reasoning.

Method: Proposes UniPixel model that processes visual prompts and generates relevant masks on demand, then performs reasoning conditioning on these intermediate pointers during inference to enable pixel-level reasoning.

Result: Verified effectiveness on 10 benchmarks across diverse tasks including pixel-level referring/segmentation and object-centric understanding in images/videos. Also designed a novel PixelQA task requiring joint referring, segmentation, and question answering.

Conclusion: UniPixel successfully bridges the gap between pixel-level perception and visual reasoning, demonstrating flexible comprehension of visual prompts and generation of mask-grounded responses for fine-grained pixel-level understanding.

Abstract: Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.

[272] Global Prompt Refinement with Non-Interfering Attention Masking for One-Shot Federated Learning

Zhuang Qi, Pan Yu, Lei Meng, Sijin Zhou, Han Yu, Xiaoxiao Li, Xiangxu Meng

Main category: cs.CV

TL;DR: GPR-NIAM is a one-shot federated prompt learning method that uses attention masking to restrict interactions between text and prompt embeddings, enabling cross-task generalization without multi-round communication.

DetailsMotivation: Existing federated prompt learning methods require multi-round communication and lack cross-task generalization capabilities, creating efficiency and adaptability limitations.

Method: Uses two modules: attention isolation (suppresses prompt-to-text attention, reweights text-to-prompt attention) and cross-silo collaborative refinement (integrates decentralized visual knowledge for global prompt calibration).
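
The attention-isolation idea can be expressed as an additive bias on attention logits: a -inf entry blocks prompt-to-text attention outright, while adding log(gamma) to text-to-prompt logits multiplies their pre-softmax weight by gamma. The [text | prompt] token layout and the value of gamma below are assumptions.

```python
import torch

def niam_bias(n_text, n_prompt, gamma=0.5):
    """Additive attention bias sketching the masking idea: prompt tokens
    are barred from attending to text tokens, while text->prompt attention
    is softly downweighted by gamma (adding log(gamma) to a logit scales
    its pre-softmax weight by gamma)."""
    n = n_text + n_prompt
    bias = torch.zeros(n, n)
    bias[n_text:, :n_text] = float("-inf")                    # isolate prompt -> text
    bias[:n_text, n_text:] = torch.log(torch.tensor(gamma))   # reweight text -> prompt
    return bias  # add to attention logits before the softmax
```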

Result: Outperforms eight state-of-the-art methods on ten benchmark datasets in both class-level and domain-level generalization tasks.

Conclusion: GPR-NIAM enables effective one-shot federated prompt learning with strong cross-task generalization by controlling attention interactions and leveraging cross-modal knowledge alignment.

Abstract: Federated Prompt Learning (FPL) enables communication-efficient adaptation by tuning lightweight prompts on top of frozen pre-trained models. Existing FPL methods typically rely on global information, which is only available after the second training round, to facilitate collaboration among client models. Therefore, they are inherently dependent on multi-round communication to fully exhibit their strengths. Moreover, existing one-shot federated learning methods typically focus on fitting seen tasks, but lack cross-task generalization. To bridge this gap, we propose the Global Prompt Refinement with Non-Interfering Attention Masking (GPR-NIAM) method for one-shot FPL. The core idea is to design a masking mechanism that restricts excessive interaction between the original text embeddings and the learnable prompt embeddings. GPR-NIAM achieves this through the collaboration of two key modules. Firstly, the attention isolation module suppresses attention from the learnable prompt tokens to the original text tokens, and reweights the reverse attention which preserves generalization across tasks. Secondly, the cross-silo collaborative refinement module integrates decentralized visual knowledge into a unified base and calibrates the global prompt through multi-source cross-modal knowledge alignment, further mitigating the inconsistency caused by data heterogeneity. Extensive experiments conducted on ten benchmark datasets under two tasks show that GPR-NIAM outperforms eight state-of-the-art methods in both class-level and domain-level generalization.

[273] LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models

Pranav Saxena, Avigyan Bhattacharya, Ji Zhang, Wenshan Wang

Main category: cs.CV

TL;DR: LLM-RG is a hybrid pipeline combining vision-language models for attribute extraction with large language models for symbolic reasoning to solve referential grounding in outdoor driving scenes, achieving substantial improvements over baseline methods.

DetailsMotivation: Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references like "the black car on the right".

Method: A hybrid pipeline that uses LLMs to extract object types and attributes, detects candidate regions, generates visual descriptors with VLMs, and combines these with spatial metadata into natural-language prompts for chain-of-thought reasoning to identify referent bounding boxes.

Result: LLM-RG yields substantial gains over both LLM and VLM-based baselines on the Talk2Car benchmark. Adding 3D spatial cues further improves grounding performance.

Conclusion: The results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.

Abstract: Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references (e.g., “the black car on the right”). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models for fine-grained attribute extraction with large language models for symbolic reasoning. LLM-RG processes an image and a free-form referring expression by using an LLM to extract relevant object types and attributes, detecting candidate regions, generating rich visual descriptors with a VLM, and then combining these descriptors with spatial metadata into natural-language prompts that are input to an LLM for chain-of-thought reasoning to identify the referent’s bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM and VLM-based baselines. Additionally, our ablations show that adding 3D spatial cues further improves grounding. Our results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.

[274] DA$^2$: Depth Anything in Any Direction

Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, Chunchao Guo

Main category: cs.CV

TL;DR: DA² is a zero-shot generalizable panoramic depth estimator that addresses data scarcity and spherical distortion challenges through data curation and SphereViT architecture, achieving state-of-the-art performance with 38% improvement over baselines.

DetailsMotivation: Panoramic depth estimation faces challenges due to limited panoramic data and spherical distortions, leading to poor zero-shot generalization and inefficient perspective splitting methods.

Method: Proposes DA² with a data curation engine that generates ~543K panoramic RGB-depth pairs (bringing the total to ~607K) from perspective data, and SphereViT, which uses spherical coordinates to enforce geometric consistency in panoramic features.
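
The spherical-coordinate idea can be sketched as a positional embedding: compute per-pixel longitude/latitude for an equirectangular panorama and expand them into sinusoidal features. SphereViT's actual embedding may differ in its details.

```python
import torch

def spherical_embedding(h, w, dim=64):
    """Per-pixel spherical coordinates for an equirectangular panorama,
    expanded into sinusoidal features; a sketch of injecting spherical
    geometry into image features."""
    lon = torch.linspace(-torch.pi, torch.pi, w).expand(h, w)
    lat = torch.linspace(torch.pi / 2, -torch.pi / 2, h).unsqueeze(1).expand(h, w)
    freqs = 2.0 ** torch.arange(dim // 4)                         # (dim/4,)
    ang = torch.stack([lon, lat], dim=-1).unsqueeze(-1) * freqs   # (h, w, 2, dim/4)
    emb = torch.cat([ang.sin(), ang.cos()], dim=-1)               # (h, w, 2, dim/2)
    return emb.reshape(h, w, -1)                                  # (h, w, dim)
```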

Result: Achieves SoTA performance with 38% average improvement on AbsRel over strongest zero-shot baseline, outperforms prior in-domain methods, and shows higher efficiency as end-to-end solution.

Conclusion: DA² provides accurate, zero-shot generalizable, and efficient panoramic depth estimation through large-scale data curation and spherical geometry-aware architecture.

Abstract: Panorama has a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose $\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in $\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, for scaling up panoramic data, we introduce a data curation engine for generating high-quality panoramic depth data from perspective data, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$’s SoTA performance, with an average 38% improvement on AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency than fusion-based approaches. Both the code and the curated panoramic data have been released. Project page: https://depth-any-in-any-dir.github.io/.

[275] UniVideo: Unified Understanding, Generation, and Editing for Videos

Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

Main category: cs.CV

TL;DR: UniVideo is a unified multimodal framework that extends unified modeling to video generation and editing, supporting diverse tasks under a single instruction paradigm with dual-stream architecture combining MLLM for instruction understanding and MMDiT for video generation.

DetailsMotivation: Current unified multimodal models are largely limited to the image domain, creating a gap for video generation and editing capabilities that can handle complex multimodal instructions while maintaining visual consistency.

Method: Dual-stream design combining Multimodal Large Language Model (MLLM) for instruction understanding with Multimodal DiT (MMDiT) for video generation, with joint training across diverse video generation and editing tasks under unified multimodal instruction paradigm.

Result: UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and editing. It demonstrates task composition capabilities and transfers editing skills from image data to handle unseen video editing instructions.

Conclusion: UniVideo successfully extends unified multimodal modeling to video domain, enabling diverse video tasks under single framework with strong generalization capabilities including task composition and transfer learning from image editing data.

Abstract: Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.

[276] VideoVerse: How Far is Your T2V Generator from a World Model?

Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang

Main category: cs.CV

TL;DR: VideoVerse is a new benchmark for evaluating Text-to-Video models’ ability to understand temporal causality and world knowledge, addressing limitations of existing benchmarks.

DetailsMotivation: Existing T2V benchmarks are insufficient for state-of-the-art models, lacking evaluation of temporal causality and world knowledge needed for building world models.

Method: Collected diverse videos across domains, extracted event-level descriptions with temporal causality, created 300 prompts with 815 events and 793 binary evaluation questions across 10 dimensions.

Result: Developed a human-preference-aligned, QA-based evaluation pipeline using vision-language models and systematically evaluated state-of-the-art T2V models.

Conclusion: VideoVerse provides comprehensive assessment of T2V models’ capabilities for building world models, revealing current gaps in temporal causality understanding and world knowledge.

Abstract: The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to building "world models", makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model could understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Consequently, a human-preference-aligned, QA-based evaluation pipeline is developed using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis of how far current T2V generators are from world models.

[277] Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu

Main category: cs.CV

TL;DR: This paper introduces Honey-Data-15M, a 15M QA pair dataset with enhanced data quality and Chain-of-Thought reasoning, along with HoneyPipe data curation pipeline, achieving SOTA performance with Bee-8B model that competes with semi-open models.

DetailsMotivation: Fully open MLLMs lag behind proprietary counterparts due to poor data quality in existing open-source datasets, particularly lacking complex reasoning data like Chain-of-Thought.

Method: Created Honey-Data-15M dataset with 15M QA pairs using multiple cleaning techniques and dual-level CoT enrichment strategy, plus HoneyPipe data curation pipeline and DataStudio framework.

Result: Bee-8B model trained on Honey-Data-15M establishes new SOTA for fully open MLLMs, achieving competitive performance with and sometimes surpassing semi-open models like InternVL3.5-8B.

Conclusion: Principled focus on data quality is key to developing competitive fully open MLLMs, demonstrated through comprehensive resources including dataset, pipeline, training recipes, and model weights.

Abstract: Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

[278] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

Liao Shen, Wentao Jiang, Yiran Zhu, Jiahe Li, Tiezheng Ge, Zhiguo Cao, Bo Zheng

Main category: cs.CV

TL;DR: IPRO is a reinforcement learning-based video diffusion framework that enhances identity preservation in image-to-video generation for human faces, especially when faces occupy small image areas, using face identity scoring and KL-divergence regularization.

DetailsMotivation: Existing image-to-video models struggle with maintaining identity consistency between input human images and generated videos, particularly when faces are small and undergo significant expression/movement changes, which is critical since humans are highly sensitive to identity variations.

Method: Proposes Identity-Preserving Reward-guided Optimization (IPRO), a video diffusion framework that uses reinforcement learning with a face identity scorer: it backpropagates reward signals through the last sampling steps, uses ground-truth videos as facial feature pools for multi-angle information, and incorporates KL-divergence regularization.
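
In outline, such a tuning loop runs the sampler with gradients enabled only on the last few steps, then descends on the negative identity reward plus a KL penalty. Everything below (`sampler_step`, `id_scorer`, `kl_term`, k, beta) is a placeholder sketch, not the authors' code.

```python
import torch

def ipro_like_loss(x0, sampler_step, timesteps, id_scorer, kl_term, k=3, beta=0.01):
    """Backpropagate a face-identity reward through only the last k sampling
    steps, with a KL penalty against the pretrained model. All callables are
    placeholders for the actual diffusion sampler, identity scorer, and KL
    estimate."""
    x = x0
    for i, t in enumerate(timesteps):
        if i < len(timesteps) - k:
            with torch.no_grad():       # early steps: no gradient tracking
                x = sampler_step(x, t)
        else:
            x = sampler_step(x, t)      # last k steps carry gradients
    return -id_scorer(x).mean() + beta * kl_term(x)
```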

Result: Extensive experiments on Wan 2.2 I2V model and in-house I2V model demonstrate the method’s effectiveness in enhancing identity preservation.

Conclusion: IPRO provides a direct and effective tuning algorithm for improving identity consistency in human-centric image-to-video generation without requiring auxiliary modules or architectural changes.

Abstract: Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.

[279] Exploring Cross-Modal Flows for Few-Shot Learning

Ziqi Jiang, Yanghao Wang, Long Chen

Main category: cs.CV

TL;DR: The paper proposes Flow Matching Alignment (FMA), a multi-step adjustment approach for cross-modal feature alignment that learns a cross-modal velocity field, addressing limitations of existing one-step PEFT methods.

DetailsMotivation: Existing parameter-efficient fine-tuning (PEFT) methods perform one-step adjustment which is insufficient for complex datasets where features from different modalities are highly entangled, leading to suboptimal cross-modal alignment.

Method: FMA learns a cross-modal velocity field through three key components: a fixed coupling strategy for category correspondence, noise augmentation to alleviate data scarcity, and an early-stopping solver for efficiency and accuracy.
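
Under the fixed coupling, the velocity-field training reduces to a standard flow-matching objective: pair image and text features of the same category, perturb both endpoints (noise augmentation), and regress the straight-line velocity. A minimal sketch with illustrative names:

```python
import torch

def fma_training_step(v_net, x_img, x_txt, opt, sigma=0.1):
    """One flow-matching step for a cross-modal velocity field.

    Fixed coupling: x_img[i] and x_txt[i] share a category. Noise
    augmentation perturbs the endpoints to ease data scarcity. v_net
    predicts the velocity that transports image features toward text
    features; all names are illustrative."""
    x0 = x_img + sigma * torch.randn_like(x_img)   # noise-augmented source
    x1 = x_txt + sigma * torch.randn_like(x_txt)   # noise-augmented target
    t = torch.rand(x0.size(0), 1)                  # assumes (B, D) features
    xt = (1 - t) * x0 + t * x1                     # point on the straight path
    loss = ((v_net(xt, t) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```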

Result: Extensive experiments show FMA consistently yields significant performance gains across various benchmarks and backbones, particularly on challenging datasets, demonstrating superior alignment compared to one-step PEFT methods.

Conclusion: FMA provides a multi-step rectification ability for more precise and robust cross-modal feature alignment, outperforming existing one-step PEFT methods while being model-agnostic.

Abstract: Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today’s PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.

[280] Fourier Transform Multiple Instance Learning for Whole Slide Image Classification

Anthony Bilic, Guangyu Sun, Ming Li, Md Sanzid Bin Hossain, Yu Tian, Wei Zhang, Laura Brattain, Dexter Hadley, Chen Chen

Main category: cs.CV

TL;DR: FFT-MIL augments Multiple Instance Learning for Whole Slide Image classification by adding a frequency-domain branch using Fast Fourier Transform to capture global dependencies, improving performance across multiple datasets and architectures.

DetailsMotivation: Existing MIL methods for WSI classification struggle to capture global dependencies due to the immense size of WSIs and local nature of patch embeddings, limiting their ability to model coarse structures essential for robust diagnostic prediction.

Method: Proposes the FFT-MIL framework, which extracts low-frequency crops from WSIs via the Fast Fourier Transform, processes them through a modular FFT-Block with convolutional layers and Min-Max normalization, then fuses the global frequency features with spatial patch features using lightweight integration strategies.
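
The low-frequency crop is a few lines of NumPy: shift the 2D spectrum so low frequencies sit at the center, crop a small block, and Min-Max normalize. The crop size and log-magnitude representation below are assumptions.

```python
import numpy as np

def low_frequency_crop(img, k=256):
    """Take the centered k x k low-frequency block of an image's 2D spectrum
    and Min-Max normalize its log-magnitude, giving a compact global
    descriptor of a (downsampled) WSI."""
    F = np.fft.fftshift(np.fft.fft2(img))            # low frequencies at center
    cy, cx = F.shape[0] // 2, F.shape[1] // 2
    crop = F[cy - k // 2: cy + k // 2, cx - k // 2: cx + k // 2]
    mag = np.log1p(np.abs(crop))                     # log-magnitude spectrum
    return (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)
```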

Result: Evaluation across six MIL methods on three public datasets (BRACS, LUAD, IMP) showed average improvements of 3.51% in macro F1 scores and 1.51% in AUC, with consistent gains across architectures and datasets.

Conclusion: Frequency-domain learning is an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.

Abstract: Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction. We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.

[281] Latent Diffusion Model without Variational Autoencoder

Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: SVG introduces a novel latent diffusion model that replaces VAEs with self-supervised DINO features, creating semantically structured latent spaces for more efficient training and better generative quality.

DetailsMotivation: Current VAE+diffusion models suffer from limited training efficiency, slow inference, and poor transferability due to VAE latent spaces lacking clear semantic separation and discriminative structure.

Method: SVG constructs a semantically structured latent space using frozen DINO features for semantic discriminability, with a lightweight residual branch for fine-grained details. Diffusion models are trained directly on this space.

Result: SVG enables accelerated diffusion training, supports few-step sampling, improves generative quality, and preserves semantic/discriminative capabilities of self-supervised representations.

Conclusion: SVG provides a principled pathway toward task-general, high-quality visual representations by leveraging semantically structured latent spaces without VAEs.

Abstract: Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.

[282] CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection

Huiming Yang

Main category: cs.CV

TL;DR: CrossRay3D is a sparse multi-modal 3D detector that improves token representation quality through Ray-Aware Supervision and Class-Balanced Supervision, achieving state-of-the-art performance on nuScenes benchmark while being computationally efficient.

DetailsMotivation: Existing sparse detectors overlook token representation quality, resulting in sub-optimal foreground quality and limited performance. The paper identifies geometric structure preservation and class distribution as key factors for improving sparse detector performance.

Method: Proposes Sparse Selector (SS) with two core modules: Ray-Aware Supervision (RAS) to preserve geometric information during training, and Class-Balanced Supervision to adaptively reweight class semantics. Also introduces Ray Positional Encoding to address LiDAR-image modality distribution differences.

Result: Achieves state-of-the-art performance on nuScenes benchmark with 72.4 mAP and 74.7 NDS, running 1.84× faster than other leading methods. Shows strong robustness even with partial or complete missing LiDAR/camera data.

Conclusion: CrossRay3D demonstrates that focusing on token representation quality through geometric structure preservation and balanced class semantics significantly improves sparse multi-modal 3D detection performance while maintaining computational efficiency.

Abstract: The sparse cross-modality detector offers more advantages than its counterpart, the Bird’s-Eye-View (BEV) detector, particularly in terms of adaptability for downstream tasks and computational cost savings. However, existing sparse detectors overlook the quality of token representation, leaving them with sub-optimal foreground quality and limited performance. In this paper, we identify that preserving geometric structure and balancing class distribution are the keys to improving the performance of the sparse detector, and propose a Sparse Selector (SS). The core modules of SS are Ray-Aware Supervision (RAS), which preserves rich geometric information during the training stage, and Class-Balanced Supervision, which adaptively reweights the salience of class semantics, ensuring that tokens associated with small objects are retained during token sampling; SS thereby outperforms other sparse multi-modal detectors in token representation. Additionally, we design Ray Positional Encoding (Ray PE) to address the distribution differences between the LiDAR modality and the image modality. Finally, we integrate the aforementioned modules into an end-to-end sparse multi-modality detector, dubbed CrossRay3D. Experiments show that, on the challenging nuScenes benchmark, CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS, while running 1.84× faster than other leading methods. Moreover, CrossRay3D demonstrates strong robustness even in scenarios where LiDAR or camera data are partially or entirely missing.

[283] PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

Lukas Selch, Yufang Hou, M. Jehanzeb Mirza, Sivan Doveh, James Glass, Rogerio Feris, Wei Lin

Main category: cs.CV

TL;DR: PRISMM-Bench is the first benchmark for evaluating Large Multimodal Models’ ability to detect and resolve real inconsistencies in scientific papers across text, figures, tables, and equations, revealing poor performance (26.1-54.2%) among 21 leading LMMs.

DetailsMotivation: Current LMM benchmarks fail to capture the real-world complexity of multimodal inconsistencies in scientific papers, which undermine clarity, reproducibility, and trust. Existing approaches either isolate single modalities or use synthetic errors that don't reflect actual scientific review challenges.

Method: Created PRISMM-Bench through a multi-stage pipeline: mining real reviewer-flagged inconsistencies from 242 papers, LLM-assisted filtering, and human verification to curate 262 inconsistencies. Designed three tasks (inconsistency identification, remedy, and pair matching) with structured JSON-based answer representations to minimize linguistic biases.

Result: Benchmarked 21 leading LMMs including GLM-4.5V 106B, InternVL3 78B, Gemini 2.5 Pro, and GPT-5. Results showed strikingly low performance ranging from 26.1% to 54.2%, highlighting the challenge of multimodal scientific reasoning.

Conclusion: Current LMMs struggle significantly with detecting and resolving real multimodal inconsistencies in scientific papers, motivating the need for progress towards more trustworthy scientific assistants capable of reliable multimodal reasoning.

Abstract: Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle, domain-specific, and ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering and human verification, we curate 262 inconsistencies from 242 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy and pair matching, which assess a model’s capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we further introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (26.1-54.2%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
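
The structured JSON-based answer representation is the benchmark’s main defense against choice-only shortcuts: options become structured pointers to paper elements rather than free-form prose. A hypothetical record of that flavor (the field names are illustrative, not the benchmark’s actual schema):

```python
import json

# Hypothetical candidate answer: structured references to paper elements
# leave little surface for stylistic cues that models could exploit.
candidate_answer = {
    "inconsistency_id": "paper_0123_inc_02",
    "modalities": ["table", "figure"],
    "claim_location": {"element": "Table 2", "row": "ResNet-50"},
    "conflicting_location": {"element": "Figure 3", "panel": "b"},
    "remedy": "update Table 2 accuracy to match Figure 3b",
}
print(json.dumps(candidate_answer, indent=2))
```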

[284] SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang

Main category: cs.CV

TL;DR: A novel framework called SCENECOT that enables grounded Chain-of-Thought reasoning in 3D scenes, achieving strong performance on complex 3D reasoning tasks with high grounding-QA coherence.

DetailsMotivation: Existing 3D LLMs struggle with grounded question-answering due to under-exploration of human-like scene-object grounded reasoning mechanisms.

Method: Introduces SCENECOT framework with grounded Chain-of-Thought reasoning that decouples complex tasks into simpler problems using multimodal expert modules, and creates SCENECOT-185K dataset with 185K high-quality instances.

Result: Achieves strong performance across various complex 3D scene reasoning benchmarks with high grounding-QA coherence.

Conclusion: First successful application of CoT reasoning to 3D scene understanding, enabling human-like step-by-step reasoning with potential for broader 3D scene understanding applications.

Abstract: Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.

[285] Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, Li Yuan

Main category: cs.CV

TL;DR: Edit-R1 is a post-training framework using policy optimization and MLLM-based rewards to improve instruction-based image editing models’ generalization beyond training data.

DetailsMotivation: Supervised fine-tuning models overfit to annotated patterns and struggle with generalization beyond training distributions, limiting their practical effectiveness.

Method: Uses Diffusion Negative-aware Finetuning (DiffusionNFT) for policy optimization and employs MLLM as training-free reward model with group filtering to reduce scoring noise.

Result: UniWorld-V2 achieves state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench benchmarks, with framework being model-agnostic and improving diverse base models.

Conclusion: The Edit-R1 framework effectively addresses overfitting in instruction-based image editing and demonstrates wide applicability across different base models.

Abstract: Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.

[286] Facial Expression-based Parkinson’s Disease Severity Diagnosis via Feature Fusion and Adaptive Class Balancing

Yintao Zhou, Wei Huang, Zhengyu Li, Jing Huang, Meng Pang

Main category: cs.CV

TL;DR: A new method for Parkinson’s disease severity diagnosis using multiple facial expression features with attention-based fusion and adaptive class balancing to address class imbalance issues.

DetailsMotivation: Current PD diagnosis methods rely on single facial expressions leading to misdiagnosis, ignore class imbalance across PD stages, and focus only on binary classification rather than severity diagnosis.

Method: Integrates multiple facial expression features through attention-based feature fusion and uses adaptive class balancing strategy that dynamically adjusts training sample contributions based on class distribution and classification difficulty.

Result: Experimental results demonstrate promising performance for PD severity diagnosis and confirm the efficacy of both attention-based feature fusion and adaptive class balancing.

Conclusion: The proposed method effectively addresses limitations of existing approaches by combining multiple facial expression features and handling class imbalance, showing strong potential for accurate PD severity diagnosis.

Abstract: Parkinson’s disease (PD) severity diagnosis is crucial for the early detection of potential patients and the adoption of tailored interventions. Diagnosing PD from facial expressions is grounded in PD patients’ “masked face” symptom and has recently gained growing interest for its convenience and affordability. However, current facial expression-based approaches often rely on a single type of expression, which can lead to misdiagnosis, and ignore the class imbalance across different PD stages, which degrades prediction performance. Moreover, most existing methods focus on binary classification (i.e., PD / non-PD) rather than diagnosing the severity of PD. To address these issues, we propose a new facial expression-based method for PD severity diagnosis which integrates multiple facial expression features through attention-based feature fusion. Moreover, we mitigate the class imbalance problem via an adaptive class balancing strategy which dynamically adjusts the contribution of training samples based on their class distribution and classification difficulty. Experimental results demonstrate the promising performance of the proposed method for PD severity diagnosis, as well as the efficacy of attention-based feature fusion and adaptive class balancing.
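
The adaptive class balancing strategy lends itself to a compact sketch: weight each training sample by class rarity and by classification difficulty. This is an assumed, focal-style rendering of the idea; the paper’s exact weighting scheme may differ, and `gamma` is an illustrative knob.

```python
import torch
import torch.nn.functional as F

def adaptive_balanced_loss(logits, targets, class_counts, gamma=1.0):
    """Cross-entropy reweighted by class rarity and per-sample difficulty."""
    freq = class_counts.float() / class_counts.sum()
    class_w = (1.0 / (freq + 1e-8))[targets]                    # rarity term
    p_true = F.softmax(logits, dim=-1).gather(1, targets[:, None]).squeeze(1)
    difficulty = (1.0 - p_true) ** gamma                        # difficulty term
    ce = F.cross_entropy(logits, targets, reduction="none")
    return (class_w * difficulty * ce).mean()
```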

[287] DeepDetect: Learning All-in-One Dense Keypoints

Shaharyar Ahmed Khan Tareen, Filza Khan Tareen

Main category: cs.CV

TL;DR: DeepDetect is a dense keypoint detector that combines classical detectors with deep learning to overcome limitations of existing methods in photometric sensitivity, keypoint density, and semantic understanding.

DetailsMotivation: Traditional and learning-based keypoint detectors suffer from sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding of visually important regions.

Method: Fuse outputs from 7 keypoint and 2 edge detectors to create ground-truth masks, then train a lightweight ESPNet model using these masks as labels to enable semantic focus and dense keypoint detection.

Result: Outperforms other detectors on Oxford Affine Covariant Regions dataset with maximum values: 0.5143 average keypoint density, 0.9582 average repeatability, and 59,003 correct matches.

Conclusion: DeepDetect successfully unifies classical detector strengths with deep learning to achieve superior keypoint density, repeatability, and matching performance while being adaptable to challenging conditions.

Abstract: Keypoint detection is the foundation of many computer vision tasks, including image registration, structure-from-motion, 3D reconstruction, visual odometry, and SLAM. Traditional detectors (SIFT, SURF, ORB, BRISK, etc.) and learning-based methods (SuperPoint, R2D2, LF-Net, D2-Net, etc.) have shown strong performance yet suffer from key limitations: sensitivity to photometric changes, low keypoint density and repeatability, limited adaptability to challenging scenes, and lack of semantic understanding, often failing to prioritize visually important regions. We present DeepDetect, an intelligent, all-in-one, dense keypoint detector that unifies the strengths of classical detectors using deep learning. Firstly, we create ground-truth masks by fusing the outputs of 7 keypoint and 2 edge detectors, extracting diverse visual cues from corners and blobs to prominent edges and textures in the images. Afterwards, a lightweight and efficient model, ESPNet, is trained using these masks as labels, enabling DeepDetect to focus semantically on images while producing highly dense keypoints that are adaptable to diverse and visually degraded conditions. Evaluations on the Oxford Affine Covariant Regions dataset demonstrate that DeepDetect surpasses other detectors in keypoint density, repeatability, and the number of correct matches, achieving maximum values of 0.5143 (average keypoint density), 0.9582 (average repeatability), and 59,003 (correct matches).
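
The ground-truth construction is straightforward to sketch with OpenCV. Here two keypoint detectors and one edge detector stand in for the paper’s ensemble of seven keypoint and two edge detectors; the stamp radius and Canny thresholds are illustrative.

```python
import cv2
import numpy as np

def fused_keypoint_mask(gray: np.ndarray) -> np.ndarray:
    """Fuse classical detector outputs into one binary supervision mask."""
    mask = np.zeros_like(gray, dtype=np.uint8)
    for det in (cv2.ORB_create(nfeatures=2000),
                cv2.FastFeatureDetector_create()):
        for kp in det.detect(gray, None):
            x, y = map(int, kp.pt)
            cv2.circle(mask, (x, y), 2, 255, -1)   # stamp each keypoint
    mask |= cv2.Canny(gray, 100, 200)               # add edge responses
    return mask  # serves as the label when training the lightweight ESPNet
```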

[288] Leveraging AV1 motion vectors for Fast and Dense Feature Matching

Julien Zouein, Hossein Javidnia, François Pitié, Anil Kokaram

Main category: cs.CV

TL;DR: Using AV1 motion vectors for dense sub-pixel correspondences and short tracks, achieving comparable performance to SIFT with less CPU usage and denser matches.

DetailsMotivation: To create a resource-efficient front end for structure-from-motion by repurposing compressed-domain motion vectors instead of traditional feature extraction methods.

Method: Repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency, operating in the compressed domain.

Result: On a 117-frame clip: registered all images, reconstructed 0.46-0.62M points with 0.51-0.53 px reprojection error. Comparable to sequential SIFT but with less CPU usage and denser matches.

Conclusion: Compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.

Abstract: We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53 px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
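
The cosine-consistency filter can be sketched in a few lines, assuming per-frame displacement vectors already decoded from the AV1 bitstream. The threshold and the successive-step test are assumptions; the abstract only names the criterion.

```python
import numpy as np

def cosine_consistent(track: np.ndarray, thresh: float = 0.9) -> bool:
    """Keep a short motion-vector track only if its steps point consistently.

    `track` is an (N, 2) array of per-frame displacement vectors.
    """
    v = track / (np.linalg.norm(track, axis=1, keepdims=True) + 1e-8)
    cosines = np.sum(v[:-1] * v[1:], axis=1)  # cosine between successive steps
    return bool(np.all(cosines >= thresh))
```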

[289] Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization

Yuanli Wu, Long Zhang, Yue Du, Bin Li

Main category: cs.CV

TL;DR: A rubric-guided pseudo-labeling framework for zero-shot video summarization that uses LLMs with structured scoring rubrics to achieve competitive performance without training.

DetailsMotivation: Current supervised methods are costly and brittle across datasets, while unsupervised methods miss high-level semantics. Zero-shot LLM approaches are sensitive to prompts and dataset-specific normalization.

Method: Convert human annotations into pseudo labels and structured rubrics. At inference, score boundary scenes from descriptions and intermediate scenes with adjacent segment summaries to assess progression and redundancy.

Result: Achieves F1 scores of 57.58 on SumMe, 63.05 on TVSum (surpassing zero-shot baseline), and 53.79 on QFVS benchmark, approaching supervised performance.

Conclusion: Rubric-guided pseudo labeling with contextual prompting stabilizes LLM-based scoring and provides a general, interpretable zero-shot paradigm for video summarization.

Abstract: With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific normalization. We propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions, while intermediate scenes include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without parameter tuning. Across three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.

[290] CaMiT: A Time-Aware Car Model Dataset for Classification and Generation

Frédéric LIN, Biruk Abere Ambaw, Adrian Popescu, Hejer Ammar, Romaric Audigier, Hervé Le Borgne

Main category: cs.CV

TL;DR: CaMiT dataset captures temporal evolution of car models (2007-2023) for studying AI adaptation to changing visual environments. The paper introduces time-incremental learning strategies and time-aware generation to improve temporal robustness.

DetailsMotivation: AI systems need to adapt to evolving visual environments where object appearances change over time, especially for technological artifacts like car models.

Method: Created CaMiT dataset with 787K labeled and 5.1M unlabeled car samples. Proposed time-incremental classification with two strategies: updating backbone or only final layer, plus time-aware image generation using temporal metadata.

Result: Static pretraining achieves competitive performance but accuracy declines across years. Time-incremental learning strategies improve temporal robustness. Time-aware generation produces more realistic outputs.

Conclusion: CaMiT provides a benchmark for studying temporal adaptation in fine-grained visual recognition and generation, addressing the challenge of evolving visual environments.

Abstract: AI systems must adapt to evolving visual environments, especially in domains where object appearances change over time. We introduce Car Models in Time (CaMiT), a fine-grained dataset capturing the temporal evolution of car models, a representative class of technological artifacts. CaMiT includes 787K labeled samples of 190 car models (2007-2023) and 5.1M unlabeled samples (2005-2023), supporting both supervised and self-supervised learning. Static pretraining on in-domain data achieves competitive performance with large-scale generalist models while being more resource-efficient, yet accuracy declines when models are tested across years. To address this, we propose a time-incremental classification setting, a realistic continual learning scenario with emerging, evolving, and disappearing classes. We evaluate two strategies: time-incremental pretraining, which updates the backbone, and time-incremental classifier learning, which updates only the final layer; both improve temporal robustness. Finally, we explore time-aware image generation that leverages temporal metadata during training, yielding more realistic outputs. CaMiT offers a rich benchmark for studying temporal adaptation in fine-grained visual recognition and generation.
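
Of the two strategies, time-incremental classifier learning is the cheaper one: freeze the backbone and retrain only the head on data from the new time slice. A minimal PyTorch sketch, assuming a torchvision-style model whose head is exposed as `.fc` (the attribute name varies by architecture):

```python
import torch.nn as nn

def prepare_time_incremental(model: nn.Module, num_classes: int) -> nn.Module:
    """Freeze the backbone and swap in a fresh, trainable classification head."""
    for p in model.parameters():
        p.requires_grad = False                       # backbone stays fixed
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head is trainable
    return model
```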

[291] PICABench: How Far Are We from Physically Realistic Image Editing?

Yuandong Pu, Le Zhuo, Songhao Han, Jinbo Xing, Kaiwen Zhu, Shuo Cao, Bin Fu, Si Liu, Hongsheng Li, Yu Qiao, Wenlong Zhang, Xi Chen, Yihao Liu

Main category: cs.CV

TL;DR: PICABench is a benchmark that evaluates physical realism in image editing across eight sub-dimensions, revealing that current models struggle with physical consistency despite good instruction completion.

DetailsMotivation: Existing image editing models focus on completing editing instructions but overlook accompanying physical effects like shadows, reflections, and object interactions, which are crucial for realism.

Method: Introduces PICABench with systematic evaluation across eight physical sub-dimensions, PICAEval protocol using VLM-as-a-judge with human annotations, and PICA-100K dataset from videos for training.

Result: Evaluation of mainstream models shows physical realism remains challenging with significant room for improvement, as models fail to maintain physical consistency in edits.

Conclusion: The benchmark and proposed solutions provide a foundation for advancing from naive content editing to physically consistent realism in image manipulation.

Abstract: Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are the key to generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset, PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with ample room for improvement. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.

cs.AI

[292] Activation Manifold Projection: Liberating Task-Specific Behaviors from LLM Architectures

Al Kari

Main category: cs.AI

TL;DR: CAST enables transfer of LoRA-encoded behaviors between different LLM architectures by learning nonlinear mappings between activation manifolds, achieving 85-95% performance of fully retrained LoRAs without task-specific data.

DetailsMotivation: Address architectural lock-in where valuable behaviors learned through LoRA fine-tuning are trapped within specific model architectures, overcoming limitations of existing weight-space transfer methods.

Method: Learns lightweight bidirectional projection heads that translate activation streams between source and target models, applying frozen LoRA kernels and projecting results back, trained on general text corpus without task-specific data.

Result: Achieves 85-95% performance of fully retrained LoRAs on target models, outperforms current weight-space transfer techniques, enables zero-shot translation between heterogeneous model families like Llama-2 and Mistral.

Conclusion: CAST establishes new state-of-the-art in model interoperability by directly mapping activation manifolds, effectively decoupling learned skills from source architectures.

Abstract: The proliferation of Large Language Model (LLM) architectures presents a fundamental challenge: valuable, task-specific behaviors learned through fine-tuning methods like Low-Rank Adaptation (LoRA) are effectively trapped within their source model’s architecture, herein referred to as architectural lock-in. Existing transfer methods attempt to bridge this gap by aligning the static weight spaces of models, a brittle and indirect approach that relies on tenuous correlations between parameter geometries. This paper introduces a fundamentally different and more direct paradigm: Cartridge Activation Space Transfer (CAST), a novel framework that liberates LoRA-encoded behaviors by learning a direct, nonlinear mapping between the activation manifolds, the geometric structures formed by the model’s internal neuron activations, of two distinct LLM architectures. CAST treats a pre-trained LoRA as a frozen “behavioral kernel.” It learns a set of lightweight, bidirectional projection heads that translate the target model’s activation stream into the source model’s latent space, apply the frozen kernel, and project the result back. This process, trained on a general text corpus without any task-specific data, effectively decouples the learned skill from the source architecture. We demonstrate that CAST enables true “zero-shot” translation of any standard LoRA adapter. Our experiments, including transfers between heterogeneous model families like Llama-2 and Mistral, show that CAST-translated adapters achieve 85-95% of the performance of a LoRA fully retrained on the target model, quantitatively outperforming current weight-space transfer techniques and establishing a new state-of-the-art in model interoperability.
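
A minimal sketch of the bridge CAST describes: lightweight bidirectional projection heads wrapped around a frozen LoRA kernel. The hidden width, depth, and insertion point are assumptions here; the paper specifies only that the heads are lightweight, nonlinear, and bidirectional.

```python
import torch
import torch.nn as nn

class CastBridge(nn.Module):
    """Translate target activations to the source manifold, apply the frozen
    LoRA "behavioral kernel", and project the result back."""

    def __init__(self, target_dim: int, source_dim: int, hidden: int = 1024):
        super().__init__()
        self.to_source = nn.Sequential(
            nn.Linear(target_dim, hidden), nn.GELU(), nn.Linear(hidden, source_dim))
        self.to_target = nn.Sequential(
            nn.Linear(source_dim, hidden), nn.GELU(), nn.Linear(hidden, target_dim))

    def forward(self, h_target: torch.Tensor, frozen_lora) -> torch.Tensor:
        h_source = self.to_source(h_target)           # target -> source manifold
        h_source = h_source + frozen_lora(h_source)   # apply frozen kernel
        return self.to_target(h_source)               # back to target space
```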

[293] Beyond More Context: Retrieval Diversity Boosts Multi-Turn Intent Understanding

Zhiming Lin

Main category: cs.AI

TL;DR: A diversity-aware retrieval framework improves LLM intent understanding by selecting diverse in-context exemplars, achieving better Joint Goal Accuracy than longer prompts under fixed token budgets.

DetailsMotivation: Real deployments face tight token budgets and noisy contexts, and most retrieval pipelines overlook set-level diversity and confounds like context length or exemplar order.

Method: A diversity-aware retrieval framework that selects in-context exemplars to balance intent coverage and linguistic variety, integrated with standard LLM decoders. Evaluation enforces budget-matched prompts and randomized positions.

Result: Strong gains in Joint Goal Accuracy on MultiWOZ 2.4 and SGD datasets under equal token budgets, surpassing strong LLM/DST baselines. Consistent improvements across K from 4 to 7 with moderate latency.

Conclusion: The study validates the impact of content diversity in retrieval and offers a simple, deployable selection principle for building accurate, budget-constrained multi-turn intent systems.

Abstract: Multi-turn intent understanding is central to task-oriented chatbots, yet real deployments face tight token budgets and noisy contexts, and most retrieval pipelines emphasize relevance while overlooking set-level diversity and confounds such as more context or exemplar order. We ask whether retrieval diversity, rather than longer prompts, systematically improves LLM intent understanding under fixed budgets. We present a diversity-aware retrieval framework that selects in-context exemplars to balance intent coverage and linguistic variety, and integrates this selection with standard LLM decoders; the evaluation enforces budget-matched prompts and randomized positions, and includes sensitivity analyses over exemplar count, diversity strength, and backbone size. On MultiWOZ 2.4 and SGD, the approach achieves strong gains in Joint Goal Accuracy under equal token budgets, surpassing strong LLM/DST baselines, with consistent improvements across K from 4 to 7 and moderate latency. Overall, the study isolates and validates the impact of content diversity in retrieval and offers a simple, deployable selection principle for building accurate, budget-constrained multi-turn intent systems.
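
A greedy maximal-marginal-relevance rule is one natural way to realize the diversity-aware selection described above. In this sketch, cosine similarity over embeddings stands in for the paper’s intent-coverage and linguistic-variety terms, and `lam` is an illustrative trade-off knob.

```python
import numpy as np

def select_diverse_exemplars(query_emb, cand_embs, k=5, lam=0.5):
    """Pick k exemplars balancing relevance to the query against redundancy."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    chosen, remaining = [], list(range(len(cand_embs)))
    while remaining and len(chosen) < k:
        def score(i):
            rel = cos(query_emb, cand_embs[i])
            red = max((cos(cand_embs[i], cand_embs[j]) for j in chosen), default=0.0)
            return lam * rel - (1 - lam) * red   # relevance minus redundancy
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen  # indices of exemplars to place in the prompt
```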

[294] FABRIC: Framework for Agent-Based Realistic Intelligence Creation

Abhigya Verma, Seganrasan Subramanian, Nandhakumar Kandasamy, Naman Gupta

Main category: cs.AI

TL;DR: A framework for synthesizing agentic data using only LLMs without human supervision, enabling scalable generation of structured interaction records for training agentic language models.

DetailsMotivation: Collecting agentic data from human annotators is costly, time-consuming, and difficult to scale, creating a bottleneck for developing tool-using LLM agents.

Method: Modular pipelines that produce complete interaction records including task specifications, tool definitions, policy pseudocode, natural language exchanges, and execution traces with strict syntactic/semantic constraints.

Result: The framework enables generation of high-quality synthetic datasets with machine-parseable records that faithfully align inputs, outputs, and tool calls, supporting both multi-task and multi-turn interactions.

Conclusion: Provides a reproducible, LLM-only alternative to manual data collection, advancing the development of agentic LLMs capable of robust tool use.

Abstract: Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. Realizing these capabilities requires access to agentic data: structured interaction records that couple user intents with tool specifications, argument-grounded calls, and verifiable execution traces. However, collecting such data from human annotators is costly, time-consuming, and difficult to scale. We present a unified framework for synthesizing agentic data using only LLMs, without any human-in-the-loop supervision. This framework decomposes generation into modular pipelines that produce complete interaction records spanning task specifications, tool definitions, policy pseudocode, natural language exchanges, and execution traces. Records conform to strict syntactic and semantic constraints, ensuring machine-parseability and faithful alignment across inputs, outputs, and tool calls. Beyond single tasks, the framework supports both multi-task and multi-turn agent interactions, enabling the construction of datasets that reflect the full spectrum of tool-use competencies. To ensure quality and consistency, the framework integrates constrained generation formats, JSON-schema validation, and judge-based filtering. This paper formalizes the schema for agentic records, details the prompt design principles that guide generation, and introduces scalable pipelines for high-quality synthetic data, providing a reproducible, LLM-only alternative to manual collection and advancing the development of agentic LLMs capable of robust tool use.
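
The JSON-schema validation stage is easy to picture. A sketch using the `jsonschema` package with a hypothetical, much-simplified record schema; the framework’s actual schema is richer.

```python
from jsonschema import ValidationError, validate

# Hypothetical, simplified schema for one synthetic interaction record.
RECORD_SCHEMA = {
    "type": "object",
    "required": ["task", "tools", "dialogue", "trace"],
    "properties": {
        "task": {"type": "string"},
        "tools": {"type": "array", "items": {"type": "object"}},
        "dialogue": {"type": "array", "items": {"type": "object"}},
        "trace": {"type": "array", "items": {"type": "object"}},
    },
}

def is_valid_record(record: dict) -> bool:
    """Records that fail schema validation are filtered out of the dataset."""
    try:
        validate(instance=record, schema=RECORD_SCHEMA)
        return True
    except ValidationError:
        return False
```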

[295] OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning

Zhenyu Bi, Meng Lu, Yang Li, Swastik Roy, Weijie Guan, Morteza Ziyadi, Xuan Wang

Main category: cs.AI

TL;DR: A multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines collaboration structures through evaluating communication quality, outperforming existing methods on reasoning tasks.

DetailsMotivation: Existing multi-agent systems use predefined structures or simple voting/debates that suppress correct minority opinions, and current graph-based approaches focus only on agent performance while neglecting interaction quality.

Method: Proposes a verbal reinforcement learning algorithm with action spaces and feedback mechanism that evaluates communication robustness and coherence during debates, using majority voting for final decisions.

Result: Significantly outperforms single-agent prompting and state-of-the-art multi-agent frameworks across mathematical reasoning, creative writing, scientific reasoning, and numerical sorting tasks.

Conclusion: Effective agent communication and debating quality are crucial for multi-agent reasoning, and the proposed dynamic collaboration structure optimization method successfully enhances collective intelligence.

Abstract: Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round-table debates, which can suppress correct but less dominant agent contributions. Recent approaches model multi-agent systems as graph networks but optimize purely for agent performance, neglecting the quality of interactions. We hypothesize that effective agent communication is crucial for multi-agent reasoning and that debating quality plays a significant role. To address this, we propose OPTAGENT, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines multi-agent collaboration structures. Our method defines action spaces and a feedback mechanism that evaluates communication robustness and coherence throughout the debate. The final decision is achieved through a majority vote over all the agents. We assess OPTAGENT on various reasoning tasks, including mathematical reasoning, creative writing, scientific reasoning, and numerical sorting. Results demonstrate that our approach significantly outperforms single-agent prompting methods and state-of-the-art multi-agent frameworks on diverse tasks.

[296] Subject-Event Ontology Without Global Time: Foundations and Execution Semantics

Alexander Boldachev

Main category: cs.AI

TL;DR: A formalization of a subject-event ontology for modeling dynamic systems without global time, using event fixation, causal ordering, and declarative dataflow for determinism.

DetailsMotivation: To model complex dynamic systems without relying on global time, addressing distributed systems, microservices, DLT platforms, and multiperspectivity scenarios with conflicting facts.

Method: Proposes nine axioms (A1-A9) ensuring correctness: event as fixation act, causal order via happens-before, executable ontology via declarative dataflow, models as epistemic filters, presumption of truth. Includes monotonicity, acyclicity, traceability, and model-based validation.

Result: Developed the boldsea system - a workflow engine implementing the theory in BSL (Boldsea Semantic Language), demonstrating practical applicability for executable ontologies.

Conclusion: The formalization provides a robust framework for modeling dynamic systems without global time, applicable to distributed architectures and multiperspectivity scenarios, with practical implementation validated through the boldsea system.

Abstract: A formalization of a subject-event ontology is proposed for modeling complex dynamic systems without reliance on global time. Key principles: (1) event as an act of fixation - a subject discerns and fixes changes according to models (conceptual templates) available to them; (2) causal order via happens-before - the order of events is defined by explicit dependencies, not timestamps; (3) making the ontology executable via a declarative dataflow mechanism, ensuring determinism; (4) models as epistemic filters - a subject can only fix what falls under its known concepts and properties; (5) presumption of truth - the declarative content of an event is available for computation from the moment of fixation, without external verification. The formalization includes nine axioms (A1-A9), ensuring the correctness of executable ontologies: monotonicity of history (I1), acyclicity of causality (I2), traceability (I3). Special attention is given to the model-based approach (A9): event validation via schemas, actor authorization, automatic construction of causal chains (W3) without global time. Practical applicability is demonstrated on the boldsea system - a workflow engine for executable ontologies, where the theoretical constructs are implemented in BSL (Boldsea Semantic Language). The formalization is applicable to distributed systems, microservice architectures, DLT platforms, and multiperspectivity scenarios (conflicting facts from different subjects).
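
Invariant I2 (acyclicity of causality) can be checked mechanically once events carry explicit happens-before dependencies. A minimal sketch using Python’s standard-library `graphlib`; the full axioms A1-A9 impose more structure than this single check.

```python
from graphlib import CycleError, TopologicalSorter

def causal_order(events: dict[str, set[str]]) -> list[str]:
    """Return a causal (happens-before) order of event ids, or fail on a cycle.

    `events` maps each event id to the ids of its causal predecessors.
    """
    try:
        return list(TopologicalSorter(events).static_order())
    except CycleError as e:
        raise ValueError(f"causal cycle violates acyclicity (I2): {e.args[1]}")
```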

[297] CompactPrompt: A Unified Pipeline for Prompt Data Compression in LLM Workflows

Joong Ho Choi, Jiayang Zhao, Jeel Shah, Ritvika Sonawane, Vedant Singh, Avani Appalla, Will Flanagan, Filipe Condessa

Main category: cs.AI

TL;DR: CompactPrompt is a pipeline that combines prompt compression with data compression to reduce LLM token usage and inference costs by up to 60% while maintaining output quality.

DetailsMotivation: LLMs incur substantial run-time costs in agentic workflows with lengthy prompts and rich data streams, creating a need for efficient compression methods.

Method: Uses self-information scoring and dependency-based phrase grouping to prune low-information tokens from prompts, plus n-gram abbreviation for text and uniform quantization for numerical data.

Result: Reduces total token usage and inference cost by up to 60% on TAT-QA and FinQA benchmarks while preserving output quality (less than 5% accuracy drop for Claude-3.5-Sonnet and GPT-4.1-Mini).

Conclusion: CompactPrompt enables leaner generative AI pipelines by providing effective compression that maintains semantic fidelity while significantly reducing costs.

Abstract: Large Language Models (LLMs) deliver powerful reasoning and generation capabilities but incur substantial run-time costs when operating in agentic workflows that chain together lengthy prompts and process rich data streams. We introduce CompactPrompt, an end-to-end pipeline that merges hard prompt compression with lightweight file-level data compression. CompactPrompt first prunes low-information tokens from prompts using self-information scoring and dependency-based phrase grouping. In parallel, it applies n-gram abbreviation to recurrent textual patterns in attached documents and uniform quantization to numerical columns, yielding compact yet semantically faithful representations. Integrated into standard LLM agents, CompactPrompt reduces total token usage and inference cost by up to 60% on benchmark datasets like TAT-QA and FinQA, while preserving output quality (less than a 5% accuracy drop for Claude-3.5-Sonnet and GPT-4.1-Mini). CompactPrompt also helps visualize real-time compression decisions and quantify cost-performance trade-offs, laying the groundwork for leaner generative AI pipelines.
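
The self-information scoring step can be sketched as per-token surprisal under a small causal LM (assuming a HuggingFace-style model whose output exposes `.logits`). The dependency-based phrase grouping that CompactPrompt applies before pruning is omitted here.

```python
import torch
import torch.nn.functional as F

def self_information(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Per-token surprisal -log p(token | prefix); low scores mark prune candidates."""
    with torch.no_grad():
        logits = model(input_ids).logits                 # [1, T, V]
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)    # predict token t from its prefix
    scores = -log_probs.gather(2, input_ids[:, 1:, None]).squeeze(-1)
    return scores  # high surprisal = informative; prune the lowest-scoring spans
```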

[298] Planned Diffusion

Daniel Israel, Tian Jin, Ellie Cheng, Guy Van den Broeck, Aditya Grover, Suvinay Subramanian, Michael Carbin

Main category: cs.AI

TL;DR: Planned diffusion combines autoregressive planning with parallel diffusion generation to achieve better speed-quality trade-offs in text generation, providing 1.27x-1.81x speedup over autoregressive models with minimal quality loss.

DetailsMotivation: Address the trade-off between generation speed and output quality in large language model inference, where autoregressive models are slow but high-quality while diffusion models are fast but require many iterations.

Method: Two-stage hybrid approach: first creates short autoregressive plans to break output into independent spans, then generates these spans simultaneously using diffusion.

Result: Achieves Pareto-optimal trade-off on AlpacaEval with 1.27x-1.81x speedup over autoregressive generation and only 0.87%-5.4% drop in win rate. Planning mechanism is minimal and reliable with flexible quality-latency control.

Conclusion: Planned diffusion expands the speed-quality Pareto frontier and provides a practical path to faster, high-quality text generation through hybrid autoregressive-diffusion approach.

Abstract: A central challenge in large language model inference is the trade-off between generation speed and output quality. Autoregressive models produce high-quality text but generate tokens sequentially. Diffusion models can generate tokens in parallel but often need many iterations to match the same quality. We propose planned diffusion, a hybrid method that combines the strengths of both paradigms. Planned diffusion works in two stages: first, the model creates a short autoregressive plan that breaks the output into smaller, independent spans. Second, the model generates these spans simultaneously using diffusion. This approach expands the speed-quality Pareto frontier and provides a practical path to faster, high-quality text generation. On AlpacaEval, a suite of 805 instruction-following prompts, planned diffusion achieves Pareto-optimal trade-off between quality and latency, achieving 1.27x to 1.81x speedup over autoregressive generation with only 0.87% to 5.4% drop in win rate, respectively. Our sensitivity analysis shows that the planning mechanism of planned diffusion is minimal and reliable, and simple runtime knobs exist to provide flexible control of the quality-latency trade-off.

[299] SMaRT: Select, Mix, and ReinvenT – A Strategy Fusion Framework for LLM-Driven Reasoning and Planning

Nikhil Verma, Manasa Bharadwaj, Wonjun Jang, Harmanpreet Singh, Yixiao Wang, Homa Fashandi, Chul Lee

Main category: cs.AI

TL;DR: SMaRT framework integrates multiple reasoning strategies in LLMs to create balanced solutions, outperforming single-strategy approaches across reasoning, planning, and decision-making tasks.

DetailsMotivation: Current LLM methods rely on single-strategy prompting, missing the synergy of diverse reasoning approaches. No single strategy excels universally, highlighting the need for frameworks that fuse strategies to maximize performance and ensure robustness.

Method: Select, Mix, and ReinvenT (SMaRT) framework - uses LLMs as intelligent integrators rather than just evaluators to seamlessly integrate diverse reasoning strategies and create balanced solutions.

Result: Extensive empirical evaluations show SMaRT consistently outperforms state-of-the-art baselines in solution quality, constraint adherence, and performance metrics across reasoning, planning, and sequential decision-making benchmarks.

Conclusion: SMaRT redefines LLM-driven decision-making by pioneering cross-strategy calibration, unlocking superior outcomes for reasoning systems and advancing self-refining methodologies.

Abstract: Large Language Models (LLMs) have redefined complex task automation with exceptional generalization capabilities. Despite these advancements, state-of-the-art methods rely on single-strategy prompting, missing the synergy of diverse reasoning approaches. No single strategy excels universally, highlighting the need for frameworks that fuse strategies to maximize performance and ensure robustness. We introduce the Select, Mix, and ReinvenT (SMaRT) framework, an innovative strategy fusion approach designed to overcome this constraint by creating balanced and efficient solutions through the seamless integration of diverse reasoning strategies. Unlike existing methods, which employ LLMs merely as evaluators, SMaRT uses them as intelligent integrators, unlocking the “best of all worlds” across tasks. Extensive empirical evaluations across benchmarks in reasoning, planning, and sequential decision-making highlight the robustness and adaptability of SMaRT. The framework consistently outperforms state-of-the-art baselines in solution quality, constraint adherence, and performance metrics. This work redefines LLM-driven decision-making by pioneering a new paradigm in cross-strategy calibration, unlocking superior outcomes for reasoning systems and advancing the boundaries of self-refining methodologies.

[300] Measuring Reasoning in LLMs: a New Dialectical Angle

Soheil Abbasloo

Main category: cs.AI

TL;DR: The paper introduces SIEV, a dialectics-based framework for evaluating LLM reasoning processes rather than just final answers, revealing significant reasoning gaps in state-of-the-art models.

DetailsMotivation: Current evaluations focus only on correctness of standalone answers, which doesn't reveal the reasoning process. The authors argue reasoning should be viewed as a dynamic trajectory of interacting ideas rather than static steps.

Method: Developed SIEV framework based on dialectics (thesis, antithesis, synthesis) to evaluate how LLMs resolve tension, integrate ideas, and synthesize reasoning, not just final conclusions.

Result: SIEV uncovered significant reasoning gaps in state-of-the-art models, with GPT-5-chat losing over 40 points on GSM benchmark when evaluated using this process-oriented approach.

Conclusion: Process-oriented, philosophically grounded approaches like SIEV enable deeper, more rigorous and discriminative assessment of LLM reasoning capabilities beyond traditional benchmarks.

Abstract: What does it truly mean for a language model to “reason”? Most current evaluations and benchmarks reward models’ correct standalone answers, but correctness alone reveals little about the process that produced them. In this work, we explore a different perspective: reasoning is not a static chain of steps, but a dynamic trajectory where ideas interact, clash, and evolve into deeper insights. To capture this dynamic, we draw on a well-established philosophical tradition: dialectics, where reasoning unfolds through thesis, antithesis, and synthesis. Building on this, we present SIEV, a structured framework that evaluates the reasoning of LLMs through dialectics. Unlike conventional evaluations, SIEV assesses not only the conclusion a model reaches, but how it gets there: its ability to resolve tension, integrate distinct ideas, and synthesize higher-order reasoning. This lens uncovers significant reasoning gaps in state-of-the-art models even under saturated benchmarks like GSM and MMLU. For instance, GPT-5-chat, a recent model, loses over 40 points (out of 100) when evaluated with SIEV on GSM. Our findings highlight that adopting a process-oriented, philosophically grounded approach enables a deeper, more rigorous, and more discriminative assessment of LLM reasoning.

[301] Learning from Generalization Patterns: An Evaluation-Driven Approach to Enhanced Data Augmentation for Fine-Tuning Small Language Models

Huan Song, Deeksha Razdan, Yiyue Qian, Arijit Ghosh Chowdhury, Parth Patwa, Aman Chadha, Shinan Zhang, Sharlina Keshava, Hannah Marlowe

Main category: cs.AI

TL;DR: PaDA-Agent is an evaluation-driven approach that streamlines data augmentation for Small Language Models by discovering failure patterns and creating targeted strategies to reduce generalization gaps, outperforming existing LLM-based methods.

DetailsMotivation: Small Language Models have deployment advantages but lag in accuracy for complex tasks. Supervised fine-tuning requires substantial manual effort in data preparation and optimization.

Method: PaDA-Agent discovers failure patterns from validation data via evaluations and drafts targeted data augmentation strategies to directly reduce generalization gaps, using coordinated operations.

Result: Experimental results show significant improvements over state-of-the-art LLM-based data augmentation approaches for Llama 3.2 1B Instruct model fine-tuning.

Conclusion: The approach effectively bridges the performance gap for SLMs through pattern-guided data augmentation, reducing the need for manual data preparation while improving model accuracy.

Abstract: Small Language Models (SLMs) offer compelling advantages in deployment cost and latency, but their accuracy often lags behind larger models, particularly for complex domain-specific tasks. While supervised fine-tuning can help bridge this performance gap, it requires substantial manual effort in data preparation and iterative optimization. We present PaDA-Agent (Pattern-guided Data Augmentation Agent), an evaluation-driven approach that streamlines the data augmentation process for SLMs through coordinated operations. Unlike state-of-the-art approaches that focus only on model training errors and generate error-correcting samples, PaDA-Agent discovers failure patterns from the validation data via evaluations and drafts targeted data augmentation strategies aiming to directly reduce the generalization gap. Our experimental results demonstrate significant improvements over state-of-the-art LLM-based data augmentation approaches for Llama 3.2 1B Instruct model fine-tuning.

[302] Annotating the Chain-of-Thought: A Behavior-Labeled Dataset for AI Safety

Antonio-Gabriel Chacón Menke, Phan Xuan Tan, Eiji Kamioka

Main category: cs.AI

TL;DR: A sentence-level labeled dataset for activation-based monitoring of safety behaviors during LLM reasoning, enabling detection and steering of safety concerns through model activations.

DetailsMotivation: Current textual reasoning analysis approaches can miss subtle harmful patterns and may be circumvented by models hiding unsafe reasoning, highlighting the need for activation-level safety monitoring.

Method: Created a sentence-level labeled dataset with annotations of safety behaviors, used to extract steering vectors for detecting and influencing these behaviors within model activations.

Result: Successfully extracted representations that both detect and steer safety behaviors in model activations, demonstrating the utility of activation-level techniques for safety oversight.

Conclusion: The dataset fills a key gap in safety research and showcases the potential of activation-level monitoring techniques for improving safety oversight on chain-of-thought reasoning.

Abstract: Recent work has highlighted the importance of monitoring chain-of-thought reasoning for AI safety; however, current approaches that analyze textual reasoning steps can miss subtle harmful patterns and may be circumvented by models that hide unsafe reasoning. We present a sentence-level labeled dataset that enables activation-based monitoring of safety behaviors during LLM reasoning. Our dataset contains reasoning sequences with sentence-level annotations of safety behaviors such as expression of safety concerns or speculation on user intent, which we use to extract steering vectors for detecting and influencing these behaviors within model activations. The dataset fills a key gap in safety research: while existing datasets label reasoning holistically, effective application of steering vectors for safety monitoring could be improved by identifying precisely when specific behaviors occur within reasoning chains. We demonstrate the dataset’s utility by extracting representations that both detect and steer safety behaviors in model activations, showcasing the potential of activation-level techniques for improving safety oversight on reasoning. Content Warning: This paper discusses AI safety in the context of harmful prompts and may contain references to potentially harmful content.
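
A common recipe for turning such sentence-level labels into steering vectors is a difference of means over activations at labeled versus unlabeled sentences; this is an assumed, standard construction rather than the paper’s exact procedure.

```python
import torch

def steering_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference-of-means direction for one safety behavior.

    `pos_acts`/`neg_acts`: hidden states at sentences with/without the label.
    """
    v = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return v / v.norm()

# Usage at inference: h = h + alpha * v  (alpha tunes steering strength;
# projecting h onto v instead gives a detector for the behavior)
```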

[303] LLM-Based Multi-Agent System for Simulating and Analyzing Marketing and Consumer Behavior

Man-Lin Chu, Lucian Terhorst, Kadin Reed, Tom Ni, Weiwei Chen, Rongyu Lin

Main category: cs.AI

TL;DR: LLM-powered multi-agent simulation framework for consumer decision-making that captures complex human behavior and social dynamics without predefined rules, enabling pre-implementation marketing strategy testing.

DetailsMotivation: Traditional methods like post-event analyses and rule-based agent-based models fail to capture the complexity of human behavior and social interactions in consumer decision-making, leading to costly real-world deployment risks.

Method: Multi-agent simulation framework powered by large language models, where generative agents interact, express internal reasoning, form habits, and make purchasing decisions autonomously in a sandbox environment without predefined rules.

Result: The system successfully delivers actionable strategy-testing outcomes in price-discount marketing scenarios and reveals emergent social patterns that conventional methods cannot capture.

Conclusion: This approach provides marketers with a scalable, low-risk tool for pre-implementation testing, reducing reliance on time-intensive post-event evaluations and lowering the risk of underperforming campaigns.

Abstract: Simulating consumer decision-making is vital for designing and evaluating marketing strategies before costly real-world deployment. However, post-event analyses and rule-based agent-based models (ABMs) struggle to capture the complexity of human behavior and social interaction. We introduce an LLM-powered multi-agent simulation framework that models consumer decisions and social dynamics. Building on recent advances in large language model simulation in a sandbox environment, our framework enables generative agents to interact, express internal reasoning, form habits, and make purchasing decisions without predefined rules. In a price-discount marketing scenario, the system delivers actionable strategy-testing outcomes and reveals emergent social patterns beyond the reach of conventional methods. This approach offers marketers a scalable, low-risk tool for pre-implementation testing, reducing reliance on time-intensive post-event evaluations and lowering the risk of underperforming campaigns.

[304] Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model

Yihong Dong, Zhaoyu Ma, Xue Jiang, Zhiyuan Fan, Jiaru Qian, Yongmin Li, Jianha Xiao, Zhi Jin, Rongyu Cao, Binhua Li, Fei Huang, Yongbin Li, Ge Li

Main category: cs.AI

TL;DR: Saber is a training-free sampling algorithm for diffusion language models that improves code generation by adaptively accelerating inference and incorporating backtracking mechanisms, achieving 1.9% accuracy improvement and 251.4% speedup.

DetailsMotivation: Diffusion language models offer parallel generation and bidirectional context advantages but suffer from poor performance-speed trade-off in code generation, where reduced sampling steps cause catastrophic performance collapse.

Method: Saber uses adaptive acceleration based on established code context and backtracking mechanisms to reverse generated tokens, without requiring additional training.

Result: Extensive experiments show Saber boosts Pass@1 accuracy by 1.9% average improvement over mainstream DLM methods while achieving 251.4% average inference speedup.

Conclusion: Saber significantly narrows the performance gap between diffusion language models and autoregressive models in code generation by leveraging DLM’s inherent advantages.

Abstract: Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm, offering inherent advantages in parallel generation and bidirectional context modeling. However, the performance of DLMs on code generation tasks, which have stronger structural constraints, is significantly hampered by the critical trade-off between inference speed and output quality. We observed that accelerating the code generation process by reducing the number of sampling steps usually leads to a catastrophic collapse in performance. In this paper, we introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber), a novel training-free sampling algorithm for DLMs to achieve better inference speed and output quality in code generation. Specifically, Saber is motivated by two key insights in the DLM generation process: 1) it can be adaptively accelerated as more of the code context is established; 2) it requires a backtracking mechanism to reverse the generated tokens. Extensive experiments on multiple mainstream code generation benchmarks show that Saber boosts Pass@1 accuracy by an average of 1.9% over mainstream DLM sampling methods while achieving an average 251.4% inference speedup. By leveraging the inherent advantages of DLMs, our work significantly narrows the performance gap with autoregressive models in code generation.

[305] AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI

Manik Rana, Calissa Man, Anotida Expected Msiiwa, Jeffrey Paine, Kevin Zhu, Sunishchal Dev, Vasu Sharma, Ahan M R

Main category: cs.AI

TL;DR: AgentChangeBench is a benchmark for evaluating how tool-augmented language model agents adapt to mid-dialogue goal changes across enterprise domains, using four metrics to measure effectiveness, reliability, efficiency, and adaptation latency.

DetailsMotivation: Current agent benchmarks primarily evaluate static objectives or one-shot tool use, but real-world multi-turn interactions frequently involve goal changes that require dynamic adaptation.

Method: The benchmark comprises 2,835 task sequences and five user personas designed to trigger realistic shift points in workflows, evaluated through four metrics: Task Success Rate, Tool Use Efficiency, Tool Call Redundancy Rate, and Goal-Shift Recovery Time.

Result: Evaluation of frontier models revealed sharp contrasts: GPT-4o reached 92.2% recovery on airline booking shifts while Gemini collapsed to 48.6%, and retail tasks showed near perfect parameter validity but redundancy rates above 80%, revealing major inefficiencies.

Conclusion: High raw accuracy does not imply robustness under dynamic goals, and explicit measurement of recovery time and redundancy is essential for evaluating agent resilience in realistic enterprise settings.

Abstract: Goal changes are a defining feature of real-world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce AgentChangeBench, a benchmark explicitly designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts across three enterprise domains. Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 2,835 task sequences and five user personas, each designed to trigger realistic shift points in ongoing workflows. Using this setup, we evaluate several frontier models and uncover sharp contrasts obscured by traditional pass@k scores: for example, GPT-4o reaches 92.2% recovery on airline booking shifts while Gemini collapses to 48.6%, and retail tasks show near perfect parameter validity yet redundancy rates above 80%, revealing major inefficiencies. These findings demonstrate that high raw accuracy does not imply robustness under dynamic goals, and that explicit measurement of recovery time and redundancy is essential. AgentChangeBench establishes a reproducible testbed for diagnosing and improving agent resilience in realistic enterprise settings.
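
Two of the four metrics lend themselves to a compact illustration. The sketch below computes a toy redundancy rate (TCRR counts repeated tool calls) and a toy recovery time (GSRT counts turns from the goal shift to the first turn serving the new goal) over a hypothetical interaction log; the log fields and exact definitions are assumptions, not the benchmark’s implementation.

```python
from collections import Counter

def redundancy_rate(tool_calls):
    """Toy TCRR: fraction of tool calls that exactly repeat an earlier call."""
    counts = Counter(tool_calls)
    redundant = sum(c - 1 for c in counts.values())
    return redundant / len(tool_calls) if tool_calls else 0.0

def recovery_time(turns, shift_turn):
    """Toy GSRT: turns between a goal shift and the first turn that serves
    the new goal; None if the agent never recovers."""
    for i, turn in enumerate(turns[shift_turn:]):
        if turn["serves_new_goal"]:
            return i
    return None

calls = [("search_flights", "JFK-LAX"), ("search_flights", "JFK-LAX"), ("book", "AA100")]
print(redundancy_rate(calls))              # 0.33: one call repeats an earlier one

turns = [{"serves_new_goal": False}] * 3 + [{"serves_new_goal": True}]
print(recovery_time(turns, shift_turn=1))  # 2 turns after the shift
```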

[306] Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains

Soumya Rani Samineni, Durgesh Kalwar, Vardaan Gangal, Siddhant Bhambri, Subbarao Kambhampati

Main category: cs.AI

TL;DR: RL post-training improves local coherence of reasoning traces but doesn’t guarantee correct final answers, challenging claims about RL improving reasoning.

DetailsMotivation: Existing RLVR methods treat all tokens uniformly and claim improved reasoning traces based on final answer metrics, but don't examine intermediate token effects.

Method: Used GRPO algorithm with Qwen-2.5-0.5B model on GSM8K dataset, introduced trace coherence measure based on First-Order Logic to identify errors in reasoning steps.

Result: RL post-training improves trace coherence, especially on problems where base model fails but RL model succeeds. However, improved coherence doesn’t guarantee valid solutions or correct answers.

Conclusion: Claims about RL improving reasoning must be carefully examined as improved trace coherence may not translate to valid mathematical proofs or correct final answers.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR)-based post-training of Large Language Models (LLMs) has been shown to improve accuracy on reasoning tasks and continues to attract significant attention. Existing RLVR methods, however, typically treat all tokens uniformly without accounting for token-level advantages. These methods primarily evaluate performance based on final answer correctness or Pass@K accuracy, and yet make claims about RL post-training leading to improved reasoning traces. This motivates our investigation into the effect of RL post-training on intermediate tokens which are not directly incentivized. To study this, we design an experimental setup using the GRPO algorithm with Qwen-2.5-0.5B model on the GSM8K dataset. We introduce trace coherence, a First-Order Logic (FOL)-based measure to capture the consistency of reasoning steps by identifying errors in the traces. We distinguish between trace validity and trace coherence, noting that the former implies logical soundness while the latter measures local coherence via lack of errors. Our results show that RL post-training overall improves trace coherence with the most significant gains on problems where the base model fails but the RL model succeeds. Surprisingly, RL enhances local coherence without necessarily producing valid or correct solutions. This highlights a crucial distinction: improved local coherence in reasoning steps does not guarantee final answer correctness. We argue that claims of improved reasoning via RL must be examined with care, as these may be based on improved trace coherence, which may not translate into fully valid mathematical proofs.

[307] FST.ai 2.0: An Explainable AI Ecosystem for Fair, Fast, and Inclusive Decision-Making in Olympic and Paralympic Taekwondo

Keivan Shariatmadar, Ahmad Osman, Ramin Ray, Usman Dildar, Kisam Kim

Main category: cs.AI

TL;DR: FST.ai 2.0 is an explainable AI ecosystem for Olympic/Paralympic combat sports that integrates pose-based action recognition, uncertainty modeling, and explainability overlays to support real-time decision-making in Taekwondo competitions.

DetailsMotivation: To address the critical challenge of fair, transparent, and explainable decision-making in Olympic and Paralympic combat sports, particularly Taekwondo.

Method: Uses graph convolutional networks (GCNs) for pose-based action recognition, epistemic uncertainty modeling through credal sets, explainability overlays for visual decision support, and interactive dashboards for human-AI collaboration.

Result: 85% reduction in decision review time and 93% referee trust in AI-assisted decisions when validated on competition data.

Conclusion: The framework establishes a transparent and extensible pipeline for trustworthy, data-driven officiating and represents a step toward equitable, accountable, and human-aligned AI in sports.

Abstract: Fair, transparent, and explainable decision-making remains a critical challenge in Olympic and Paralympic combat sports. This paper presents FST.ai 2.0, an explainable AI ecosystem designed to support referees, coaches, and athletes in real time during Taekwondo competitions and training. The system integrates pose-based action recognition using graph convolutional networks (GCNs), epistemic uncertainty modeling through credal sets, and explainability overlays for visual decision support. A set of interactive dashboards enables human–AI collaboration in referee evaluation, athlete performance analysis, and Para-Taekwondo classification. Beyond automated scoring, FST.ai 2.0 incorporates modules for referee training, fairness monitoring, and policy-level analytics within the World Taekwondo ecosystem. Experimental validation on competition data demonstrates an 85% reduction in decision review time and 93% referee trust in AI-assisted decisions. The framework thus establishes a transparent and extensible pipeline for trustworthy, data-driven officiating and athlete assessment. By bridging real-time perception, explainable inference, and governance-aware design, FST.ai 2.0 represents a step toward equitable, accountable, and human-aligned AI in sports.

[308] A Definition of AGI

Dan Hendrycks, Dawn Song, Christian Szegedy, Honglak Lee, Yarin Gal, Erik Brynjolfsson, Sharon Li, Andy Zou, Lionel Levine, Bo Han, Jie Fu, Ziwei Liu, Jinwoo Shin, Kimin Lee, Mantas Mazeika, Long Phan, George Ingebretsen, Adam Khoja, Cihang Xie, Olawale Salaudeen, Matthias Hein, Kevin Zhao, Alexander Pan, David Duvenaud, Bo Li, Steve Omohundro, Gabriel Alfour, Max Tegmark, Kevin McGrew, Gary Marcus, Jaan Tallinn, Eric Schmidt, Yoshua Bengio

Main category: cs.AI

TL;DR: A quantifiable framework for AGI is proposed, defining it as matching human cognitive versatility. Current AI shows jagged profiles with strengths in knowledge domains but deficits in long-term memory, scoring GPT-4 at 27% and GPT-5 at 58% of AGI.

DetailsMotivation: The lack of concrete AGI definition obscures the gap between specialized AI and human-level cognition. A quantifiable framework is needed to measure progress toward AGI.

Method: Grounds the methodology in Cattell-Horn-Carroll theory, dissecting intelligence into ten cognitive domains and adapting established human psychometric batteries to evaluate AI systems.

Result: Current AI shows highly jagged cognitive profiles - proficient in knowledge domains but with critical deficits in foundational cognitive machinery, especially long-term memory storage. GPT-4 scores 27% and GPT-5 scores 58% of AGI.

Conclusion: The framework concretely quantifies both rapid AI progress and the substantial gap remaining before achieving AGI, revealing specific cognitive deficiencies that need addressing.

Abstract: The lack of a concrete definition for Artificial General Intelligence (AGI) obscures the gap between today’s specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. To operationalize this, we ground our methodology in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into ten core cognitive domains, including reasoning, memory, and perception, and adapts established human psychometric batteries to evaluate AI systems. Application of this framework reveals a highly “jagged” cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. The resulting AGI scores (e.g., GPT-4 at 27%, GPT-5 at 58%) concretely quantify both rapid progress and the substantial gap remaining before AGI.
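
As a minimal illustration of the scoring idea, the sketch below averages proficiency over ten equally weighted domains. The domain names beyond the three the abstract mentions are placeholders, and the paper’s actual battery and weighting may differ.

```python
# Hypothetical domain names; the abstract only names reasoning, memory,
# and perception explicitly.
DOMAINS = ["reasoning", "memory_storage", "memory_retrieval", "visual_perception",
           "auditory_perception", "speed", "knowledge", "reading_writing",
           "math", "working_memory"]

def agi_score(domain_scores):
    """Equal-weight average proficiency (0-100) across the ten domains."""
    assert set(domain_scores) == set(DOMAINS)
    return sum(domain_scores.values()) / len(DOMAINS)

# A 'jagged' profile: strong knowledge, absent long-term memory storage.
profile = {d: 80 for d in DOMAINS}
profile["memory_storage"] = 0
print(agi_score(profile))  # 72.0
```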

[309] Counterfactual Effect Decomposition in Multi-Agent Sequential Decision Making

Stelios Triantafyllou, Aleksa Sukovic, Yasaman Zolfimoselo, Goran Radanovic

Main category: cs.AI

TL;DR: The paper introduces a causal explanation framework for decomposing counterfactual effects in multi-agent Markov decision processes, attributing contributions to individual agents and state variables.

DetailsMotivation: To explain counterfactual outcomes in multi-agent systems by quantifying how an agent's action influences the final outcome through environment dynamics and other agents' behaviors.

Method: A novel causal explanation formula that decomposes counterfactual effects into two components: effects propagating through agents’ actions (attributed using Shapley values) and effects propagating through state transitions (attributed to state variables via structure-preserving interventions).

Result: The approach successfully decomposes total counterfactual effects and demonstrates interpretability in Gridworld environments with LLM-assisted agents and sepsis management simulators.

Conclusion: The proposed framework provides interpretable causal explanations for counterfactual outcomes in multi-agent systems by systematically attributing effects to agents and state variables.

Abstract: We address the challenge of explaining counterfactual outcomes in multi-agent Markov decision processes. In particular, we aim to explain the total counterfactual effect of an agent’s action on the outcome of a realized scenario through its influence on the environment dynamics and the agents’ behavior. To achieve this, we introduce a novel causal explanation formula that decomposes the counterfactual effect by attributing to each agent and state variable a score reflecting their respective contributions to the effect. First, we show that the total counterfactual effect of an agent’s action can be decomposed into two components: one measuring the effect that propagates through all subsequent agents’ actions and another related to the effect that propagates through the state transitions. Building on recent advancements in causal contribution analysis, we further decompose these two effects as follows. For the former, we consider agent-specific effects – a causal concept that quantifies the counterfactual effect of an agent’s action that propagates through a subset of agents. Based on this notion, we use Shapley value to attribute the effect to individual agents. For the latter, we consider the concept of structure-preserving interventions and attribute the effect to state variables based on their “intrinsic” contributions. Through extensive experimentation, we demonstrate the interpretability of our approach in a Gridworld environment with LLM-assisted agents and a sepsis management simulator.
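
The agent-level attribution uses the standard Shapley value over coalitions of agents. The sketch below computes it exactly for a toy coalition-valued effect function; in the paper that value would come from counterfactual rollouts, which are stubbed out here.

```python
from itertools import combinations
from math import factorial

def shapley(agents, effect):
    """Exact Shapley attribution of a coalition-valued effect to each agent."""
    n = len(agents)
    phi = {}
    for a in agents:
        others = [x for x in agents if x != a]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (effect(set(coalition) | {a}) - effect(set(coalition)))
        phi[a] = total
    return phi

# Toy effect: the counterfactual change propagates through agents B and C,
# with a super-additive interaction when both are present.
def effect(S):
    v = 0.0
    if "B" in S: v += 1.0
    if "C" in S: v += 0.5
    if {"B", "C"} <= S: v += 0.5
    return v

print(shapley(["A", "B", "C"], effect))
# {'A': 0.0, 'B': 1.25, 'C': 0.75} -- attributions sum to effect({'A','B','C'})
```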

[310] ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning

Xiaohan Qin, Xiaoxing Wang, Ning Liao, Cancheng Zhang, Xiangdong Zhang, Mingquan Feng, Jingzhi Wang, Junchi Yan

Main category: cs.AI

TL;DR: ssToken is a token-level data selection method for LLM fine-tuning that uses self-modulated loss differences and semantic-aware attention metrics to select tokens without needing additional reference models.

DetailsMotivation: Existing token-level selection methods require training additional reference models and rely solely on loss information, which may not preserve semantically important tokens that aren't favored by loss-based metrics.

Method: Uses history models to compute per-token loss differences as self-modulated signals, and introduces attention-based token importance estimation for semantic-aware selection orthogonal to loss-based filtering.

Result: Both self-modulated and semantic-aware selection alone outperform full-data fine-tuning, and their integration achieves synergistic gains surpassing prior token-level selection methods while maintaining training efficiency.

Conclusion: ssToken provides an effective token selection approach that eliminates the need for additional reference models and combines loss-based and semantic-aware metrics for improved fine-tuning performance.

Abstract: Data quality plays a critical role in enhancing supervised fine-tuning (SFT) for large language models (LLMs), and token-level data selection has emerged as a promising direction for its fine-grained nature. Despite their strong empirical performance, existing token-level selection methods share two key limitations: (1) requiring training or accessing an additional reference model, and (2) relying solely on loss information for token selection, which cannot reliably preserve semantically important tokens that are not favored by loss-based metrics. To address these challenges, we propose ssToken, a Self-modulated and Semantic-aware Token Selection approach. ssToken leverages readily accessible history models to compute the per-token loss difference with the current model, which serves as a self-modulated signal that enables the model to adaptively select tokens along its optimization trajectory, rather than relying on excess loss from an offline-trained reference model as in prior works. We further introduce a semantic-aware, attention-based token importance estimation metric, orthogonal to loss-based selection and providing complementary semantic information for more effective filtering. Extensive experiments across different model families and scales demonstrate that both self-modulated selection and semantic-aware selection alone outperform full-data fine-tuning, while their integration, ssToken, achieves synergistic gains and further surpasses prior token-level selection methods, delivering performance improvements while maintaining training efficiency.
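
A minimal sketch of the self-modulated signal, assuming it is the per-token cross-entropy difference between a history checkpoint and the current model. The keep ratio and ranking direction are illustrative, and the attention-based semantic score the paper adds is omitted here.

```python
import torch
import torch.nn.functional as F

def per_token_loss(logits, labels):
    """Cross-entropy per token: logits (seq, vocab), labels (seq,) -> (seq,)."""
    return F.cross_entropy(logits, labels, reduction="none")

def select_tokens(cur_logits, hist_logits, labels, keep_ratio=0.6):
    """Rank tokens by the history-vs-current loss difference and keep the top
    fraction; which end of the ranking to keep is an assumption here."""
    delta = per_token_loss(hist_logits, labels) - per_token_loss(cur_logits, labels)
    k = max(1, int(keep_ratio * labels.numel()))
    keep = torch.topk(delta, k).indices
    mask = torch.zeros_like(labels, dtype=torch.bool)
    mask[keep] = True
    return mask  # True = token participates in the SFT loss

seq_len, vocab = 8, 50
cur, hist = torch.randn(seq_len, vocab), torch.randn(seq_len, vocab)
labels = torch.randint(0, vocab, (seq_len,))
print(select_tokens(cur, hist, labels))
```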

[311] Illusions of reflection: open-ended task reveals systematic failures in Large Language Models’ reflective reasoning

Sion Weatherhead, Flora Salim, Aaron Belbasis

Main category: cs.AI

TL;DR: Current LLM ‘reflection’ lacks functional evidence of active, goal-driven monitoring that helps humans respect constraints, showing only modest gains in self-correction on open-ended tasks.

DetailsMotivation: To test whether LLM 'reflection' is functionally equivalent to human reflective reasoning by examining self-correction capabilities on open-ended yet rule-constrained tasks.

Method: Tested eight frontier models on producing valid scientific test items, then revising after self-critique, measuring performance before and after reflection.

Result: First-pass performance was poor (mean ≈1 valid item out of 4), reflection yielded only modest gains (also ≈1), and models frequently repeated the same constraint violations, indicating gains came from chance rather than principled error correction.

Conclusion: Current LLM reflection lacks the active, goal-driven monitoring that helps humans respect constraints, and reliable performance requires external structure that enforces constraints.

Abstract: Humans do not just find mistakes after the fact – we often catch them mid-stream because ‘reflection’ is tied to the goal and its constraints. Today’s large language models produce reasoning tokens and ‘reflective’ text, but is it functionally equivalent to human reflective reasoning? Prior work on closed-ended tasks – with clear, external ‘correctness’ signals – can make ‘reflection’ look effective while masking limits in self-correction. We therefore test eight frontier models on a simple, real-world task that is open-ended yet rule-constrained, with auditable success criteria: to produce valid scientific test items, then revise after considering their own critique. First-pass performance is poor (often zero valid items out of 4 required; mean ≈ 1), and reflection yields only modest gains (also ≈ 1). Crucially, the second attempt frequently repeats the same violation of constraint, indicating ‘corrective gains’ arise largely from chance production of a valid item rather than error detection and principled, constraint-sensitive repair. Performance before and after reflection deteriorates as open-endedness increases, and models marketed for ‘reasoning’ show no advantage. Our results suggest that current LLM ‘reflection’ lacks functional evidence of the active, goal-driven monitoring that helps humans respect constraints even on a first pass. Until such mechanisms are instantiated in the model itself, reliable performance requires external structure that enforces constraints.

[312] Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming

Zheng Zhang, Jiarui He, Yuchen Cai, Deheng Ye, Peilin Zhao, Ruili Feng, Hao Wang

Main category: cs.AI

TL;DR: Genesis is an agentic framework for attacking web agents that combines genetic algorithms with dynamic strategy discovery to outperform existing attack methods.

DetailsMotivation: Current red-teaming approaches for web agents rely on manual strategies or static models, failing to capture behavioral patterns and generalize across environments. Web agent attacks require continuous discovery and evolution of attack strategies.

Method: Genesis framework with three modules: Attacker (generates adversarial injections using genetic algorithm with hybrid strategy representation), Scorer (evaluates target agent responses), and Strategist (dynamically uncovers effective strategies from interaction logs and builds a growing strategy library).

Result: Extensive experiments across various web tasks show that Genesis discovers novel strategies and consistently outperforms existing attack baselines.

Conclusion: The proposed framework successfully addresses the limitations of existing web agent attack methods by enabling continuous strategy discovery and evolution through its three-module architecture.

Abstract: As large language model (LLM) agents increasingly automate complex web tasks, they boost productivity while simultaneously introducing new security risks. However, relevant studies on web agent attacks remain limited. Existing red-teaming approaches mainly rely on manually crafted attack strategies or static models trained offline. Such methods fail to capture the underlying behavioral patterns of web agents, making it difficult to generalize across diverse environments. In web agent attacks, success requires the continuous discovery and evolution of attack strategies. To this end, we propose Genesis, a novel agentic framework composed of three modules: Attacker, Scorer, and Strategist. The Attacker generates adversarial injections by integrating the genetic algorithm with a hybrid strategy representation. The Scorer evaluates the target web agent’s responses to provide feedback. The Strategist dynamically uncovers effective strategies from interaction logs and compiles them into a continuously growing strategy library, which is then re-deployed to enhance the Attacker’s effectiveness. Extensive experiments across various web tasks show that our framework discovers novel strategies and consistently outperforms existing attack baselines.
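
A toy rendering of the Attacker’s genetic loop, with strategies as token lists, simple crossover and mutation, and a stand-in Scorer. The real system uses a hybrid strategy representation and scores injections against a live web agent, so everything below is illustrative.

```python
import random

random.seed(0)

def crossover(a, b):
    """Splice two strategy token lists at a random cut point."""
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:]

def mutate(strategy, pool, rate=0.2):
    return [random.choice(pool) if random.random() < rate else t for t in strategy]

def evolve(population, score, pool, generations=10):
    """Keep the fittest half each generation; refill with mutated offspring."""
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        elite = population[: len(population) // 2]
        children = [mutate(crossover(*random.sample(elite, 2)), pool)
                    for _ in range(len(population) - len(elite))]
        population = elite + children
    return max(population, key=score)

pool = ["ignore", "previous", "instructions", "urgent", "admin", "override"]
population = [[random.choice(pool) for _ in range(4)] for _ in range(8)]
score = lambda s: s.count("override") + s.count("admin")  # stand-in Scorer
print(evolve(population, score, pool))
```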

[313] Earth AI: Unlocking Geospatial Insights with Foundation Models and Cross-Modal Reasoning

Aaron Bell, Amit Aides, Amr Helmy, Arbaaz Muslim, Aviad Barzilai, Aviv Slobodkin, Bolous Jaber, David Schottlander, George Leifman, Joydeep Paul, Mimi Sun, Nadav Sherman, Natalie Williams, Per Bjornsson, Roy Lee, Ruth Alcantara, Thomas Turnbull, Tomer Shekel, Vered Silverman, Yotam Gigi, Adam Boulanger, Alex Ottenwess, Ali Ahmadalipour, Anna Carter, Charles Elliott, David Andre, Elad Aharoni, Gia Jung, Hassler Thurston, Jacob Bien, Jamie McPike, Juliet Rothenberg, Kartik Hegde, Kel Markert, Kim Philipp Jablonski, Luc Houriez, Monica Bharel, Phing VanLee, Reuven Sayag, Sebastian Pilarski, Shelley Cazares, Shlomi Pasternak, Siduo Jiang, Stone Jiang, Thomas Colthurst, Yang Chen, Yehonathan Refael, Yochai Blau, Yuval Carny, Yael Maguire, Avinatan Hassidim, James Manyika, Tim Thelin, Genady Beryozkin, Gautam Prasad, Luke Barrington, Yossi Matias, Niv Efron, Shravya Shetty

Main category: cs.AI

TL;DR: Earth AI is a geospatial AI system using foundation models for imagery, population, and environmental data, combined with Gemini-powered reasoning to extract insights from complex geospatial data.

DetailsMotivation: Geospatial data is vast and diverse but challenging to analyze due to volume, varied resolutions, timescales, and sparsity. There's a need to unlock novel insights from this data.

Method: Developed foundation models across three domains (Planet-scale Imagery, Population, Environment) and a Gemini-powered reasoning engine that jointly reasons over multiple models and geospatial data sources.

Result: Benchmarks show foundation models provide complementary value and superior predictive capabilities when used together. The agent delivers critical insights on real-world crisis scenarios, bridging raw data to actionable understanding.

Conclusion: Earth AI enables significant advances in geospatial analysis by combining foundation models with intelligent reasoning, effectively handling complex multi-step queries and unlocking profound insights about our planet.

Abstract: Geospatial data offers immense potential for understanding our planet. However, the sheer volume and diversity of this data, along with its varied resolutions, timescales, and sparsity, pose significant challenges for thorough analysis and interpretation. This paper introduces Earth AI, a family of geospatial AI models and agentic reasoning that enables significant advances in our ability to unlock novel and profound insights into our planet. This approach is built upon foundation models across three key domains (Planet-scale Imagery, Population, and Environment) and an intelligent Gemini-powered reasoning engine. We present rigorous benchmarks showcasing the power and novel capabilities of our foundation models and validate that when used together, they provide complementary value for geospatial inference and their synergies unlock superior predictive capabilities. To handle complex, multi-step queries, we developed a Gemini-powered agent that jointly reasons over our multiple foundation models along with large geospatial data sources and tools. On a new benchmark of real-world crisis scenarios, our agent demonstrates the ability to deliver critical and timely insights, effectively bridging the gap between raw geospatial data and actionable understanding.

[314] ShortcutBreaker: Low-Rank Noisy Bottleneck with Global Perturbation Attention for Multi-Class Unsupervised Anomaly Detection

Peng Tang, Xiaoxiao Yan, Xiaobin Hu, Yuning Cui, Donghao Luo, Jiangning Zhang, Pengcheng Xu, Jinlong Peng, Qingdong He, Feiyue Huang, Song Xue, Tobias Lasser

Main category: cs.AI

TL;DR: ShortcutBreaker is a unified feature-reconstruction framework for multi-class unsupervised anomaly detection that addresses identity shortcuts in Transformer-based models through low-rank noisy bottleneck and global perturbation attention.

DetailsMotivation: Existing Transformer-based MUAD methods suffer from identity shortcuts that copy inputs to outputs, narrowing the reconstruction error gap between normal and abnormal cases, making them harder to distinguish.

Method: Proposes ShortcutBreaker with two key innovations: 1) Low-rank noisy bottleneck (LRNB) using matrix rank inequality to project features into low-rank latent space preventing trivial identity reproduction, 2) Global perturbation attention leveraging ViT’s global modeling to prevent information shortcuts in decoders.

Result: Achieves remarkable image-level AUROC scores: 99.8% on MVTec-AD, 98.9% on ViSA, 90.6% on Real-IAD, and 87.8% on Universal Medical, consistently outperforming previous MUAD methods across industrial and medical datasets.

Conclusion: ShortcutBreaker effectively addresses identity shortcut issues in MUAD, demonstrating superior performance across diverse scenarios and establishing a robust unified framework for multi-class anomaly detection.

Abstract: Multi-class unsupervised anomaly detection (MUAD) has garnered growing research interest, as it seeks to develop a unified model for anomaly detection across multiple classes, i.e., eliminating the need to train separate models for distinct objects and thereby saving substantial computational resources. Under the MUAD setting, while advanced Transformer-based architectures have brought significant performance improvements, identity shortcuts persist: they directly copy inputs to outputs, narrowing the gap in reconstruction errors between normal and abnormal cases, and thereby making the two harder to distinguish. Therefore, we propose ShortcutBreaker, a novel unified feature-reconstruction framework for MUAD tasks, featuring two key innovations to address the issue of shortcuts. First, drawing on matrix rank inequality, we design a low-rank noisy bottleneck (LRNB) to project high-dimensional features into a low-rank latent space, and theoretically demonstrate its capacity to prevent trivial identity reproduction. Second, leveraging ViT’s global modeling capability instead of merely focusing on local features, we incorporate a global perturbation attention to prevent information shortcuts in the decoders. Extensive experiments are performed on four widely used anomaly detection benchmarks, including three industrial datasets (MVTec-AD, ViSA, and Real-IAD) and one medical dataset (Universal Medical). The proposed method achieves a remarkable image-level AUROC of 99.8%, 98.9%, 90.6%, and 87.8% on these four datasets, respectively, consistently outperforming previous MUAD methods across different scenarios.
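
The low-rank argument is easy to see in code: features forced through a rank-r linear bottleneck yield an end-to-end map of rank at most r, which cannot be the identity on a higher-dimensional feature space. A minimal PyTorch sketch of such a bottleneck, with dimensions and noise scale chosen arbitrarily:

```python
import torch
import torch.nn as nn

class LowRankNoisyBottleneck(nn.Module):
    """Toy LRNB: rank-r linear bottleneck with additive Gaussian noise during
    training, so the decoder cannot learn a trivial identity reconstruction."""
    def __init__(self, dim=256, rank=32, noise_std=0.1):
        super().__init__()
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, dim)
        self.noise_std = noise_std

    def forward(self, x):
        z = self.down(x)              # rank of up(down(.)) is at most `rank`
        if self.training:
            z = z + self.noise_std * torch.randn_like(z)
        return self.up(z)

feats = torch.randn(4, 196, 256)      # (batch, patch tokens, dim)
print(LowRankNoisyBottleneck()(feats).shape)  # torch.Size([4, 196, 256])
```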

[315] Memory-Augmented State Machine Prompting: A Novel LLM Agent Framework for Real-Time Strategy Games

Runnan Qi, Yanan Ni, Lumin Jiang, Zongyuan Li, Kuihua Huang, Xian Guo

Main category: cs.AI

TL;DR: MASMP is a framework that combines state machine prompting with memory mechanisms to improve LLM agents’ decision-making in real-time strategy games, achieving 60% win rate against the hardest AI in StarCraft II.

DetailsMotivation: To address challenges like hallucinations and fragmented decision-making in existing LLM approaches for real-time strategy games, and to bridge the "Knowing-Doing Gap" between semantic understanding and reliable action execution.

Method: Integrates state machine prompting with memory mechanisms: (1) natural language-driven state machine architecture that guides LLMs to emulate finite state machines and behavior trees, (2) lightweight memory module preserving strategic variables across decision cycles.

Result: Achieved 60% win rate against the hardest built-in AI (Lv7) in StarCraft II, vastly outperforming baselines (0%). Case studies show the method retains LLMs’ semantic comprehension while resolving the “Knowing-Doing Gap” through strict state-action mapping.

Conclusion: MASMP establishes a new paradigm for combining neural and symbolic AI in complex decision-making, achieving both interpretability and FSM-like reliability while maintaining LLMs’ semantic capabilities.

Abstract: This paper proposes Memory-Augmented State Machine Prompting (MASMP), a novel framework for LLM agents in real-time strategy games. Addressing key challenges like hallucinations and fragmented decision-making in existing approaches, MASMP integrates state machine prompting with memory mechanisms to unify structured actions with long-term tactical coherence. The framework features: (1) a natural language-driven state machine architecture that guides LLMs to emulate finite state machines and behavior trees through prompts, and (2) a lightweight memory module preserving strategic variables (e.g., tactics, priority units) across decision cycles. Experiments in StarCraft II demonstrate MASMP’s 60% win rate against the hardest built-in AI (Lv7), vastly outperforming baselines (0%). Case studies reveal the method retains LLMs’ semantic comprehension while resolving the “Knowing-Doing Gap” through strict state-action mapping, achieving both interpretability and FSM-like reliability. This work establishes a new paradigm for combining neural and symbolic AI in complex decision-making.
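
A minimal sketch of the two ingredients: a prompt-encoded state machine whose allowed transitions gate the LLM’s replies, and a memory dict of strategic variables that persists across decision cycles. The states, prompts, and call_llm stub are hypothetical, not the paper’s StarCraft II interface.

```python
STATES = {
    "SCOUT": "You are in state SCOUT. Reply SCOUT to continue or EXPAND to switch.",
    "EXPAND": "You are in state EXPAND. Reply EXPAND to continue or ATTACK to switch.",
    "ATTACK": "You are in state ATTACK. Reply ATTACK to continue or SCOUT to switch.",
}
TRANSITIONS = {"SCOUT": {"EXPAND"}, "EXPAND": {"ATTACK"}, "ATTACK": {"SCOUT"}}

def call_llm(prompt):
    """Stub; a real system would query the model with this prompt."""
    return "EXPAND" if "state SCOUT" in prompt else "STAY"

def step(state, memory):
    prompt = (f"Memory: tactic={memory['tactic']}, priority={memory['priority']}.\n"
              f"{STATES[state]}")
    reply = call_llm(prompt).strip().upper()
    # Strict state-action mapping: only FSM-legal transitions are accepted.
    return reply if reply in TRANSITIONS[state] else state

memory = {"tactic": "rush", "priority": "workers"}  # persists across cycles
state = "SCOUT"
for _ in range(3):
    state = step(state, memory)
    print(state)  # EXPAND, then stays EXPAND under the stubbed replies
```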

[316] Heterogeneous Adversarial Play in Interactive Environments

Manjie Xu, Xinyi Yang, Jiayu Zhan, Wei Liang, Chi Zhang, Yixin Zhu

Main category: cs.AI

TL;DR: HAP is an adversarial Automatic Curriculum Learning framework that formalizes teacher-student interactions through minimax optimization, enabling bidirectional feedback between instructors and learners for adaptive curriculum generation.

DetailsMotivation: Conventional self-play frameworks are inadequate for open-ended learning with inherent asymmetry, while human pedagogical systems demonstrate effective asymmetric instructional frameworks that need to be operationalized in artificial systems.

Method: HAP establishes teacher-student interactions as minimax optimization where task-generating instructors and problem-solving learners co-evolve through adversarial dynamics with bidirectional feedback for continuous task complexity recalibration.

Result: Experimental validation shows HAP achieves performance parity with SOTA baselines while generating curricula that enhance learning efficacy in both artificial agents and human subjects.

Conclusion: HAP successfully operationalizes asymmetric adaptive pedagogical mechanisms for autonomous curriculum synthesis without predetermined task hierarchies.

Abstract: Self-play constitutes a fundamental paradigm for autonomous skill acquisition, whereby agents iteratively enhance their capabilities through self-directed environmental exploration. Conventional self-play frameworks exploit agent symmetry within zero-sum competitive settings, yet this approach proves inadequate for open-ended learning scenarios characterized by inherent asymmetry. Human pedagogical systems exemplify asymmetric instructional frameworks wherein educators systematically construct challenges calibrated to individual learners’ developmental trajectories. The principal challenge resides in operationalizing these asymmetric, adaptive pedagogical mechanisms within artificial systems capable of autonomously synthesizing appropriate curricula without predetermined task hierarchies. Here we present Heterogeneous Adversarial Play (HAP), an adversarial Automatic Curriculum Learning framework that formalizes teacher-student interactions as a minimax optimization wherein task-generating instructor and problem-solving learner co-evolve through adversarial dynamics. In contrast to prevailing ACL methodologies that employ static curricula or unidirectional task selection mechanisms, HAP establishes a bidirectional feedback system wherein instructors continuously recalibrate task complexity in response to real-time learner performance metrics. Experimental validation across multi-task learning domains demonstrates that our framework achieves performance parity with SOTA baselines while generating curricula that enhance learning efficacy in both artificial agents and human subjects.

[317] Deep Learning-Based Control Optimization for Glass Bottle Forming

Mattia Pujatti, Andrea Di Luca, Nicola Peghini, Federico Monegaglia, Marco Cristoforetti

Main category: cs.AI

TL;DR: Deep learning control algorithm optimizes glass bottle forming process using neural networks to predict parameter effects and identify optimal machine settings for better quality control.

DetailsMotivation: Precise control of forming machines is critical for ensuring quality and minimizing defects in glass bottle manufacturing, requiring optimization in real production environments.

Method: Uses deep learning neural network trained on real operational data to predict effects of parameter changes, with a specifically designed inversion mechanism to identify optimal machine settings for desired glass gob characteristics.

Result: Experimental results on historical datasets from multiple production lines show promising outcomes with potential for enhanced process stability, reduced waste, and improved product consistency.

Conclusion: Deep learning shows significant potential for process control applications in glass manufacturing, offering improved optimization capabilities over traditional methods.

Abstract: In glass bottle manufacturing, precise control of forming machines is critical for ensuring quality and minimizing defects. This study presents a deep learning-based control algorithm designed to optimize the forming process in real production environments. Using real operational data from active manufacturing plants, our neural network predicts the effects of parameter changes based on the current production setup. Through a specifically designed inversion mechanism, the algorithm identifies the optimal machine settings required to achieve the desired glass gob characteristics. Experimental results on historical datasets from multiple production lines show that the proposed method yields promising outcomes, suggesting potential for enhanced process stability, reduced waste, and improved product consistency. These results highlight the potential of deep learning for process control in glass manufacturing.
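
The paper’s inversion mechanism is not spelled out in the abstract; one common realization is to freeze the trained forward model and optimize the input settings by gradient descent toward the target gob characteristics. A sketch under that assumption, with a random network standing in for the trained predictor:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in forward model: 4 machine settings -> 2 gob characteristics.
forward_model = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
for p in forward_model.parameters():
    p.requires_grad_(False)               # the trained model stays fixed

target = torch.tensor([0.5, -0.2])        # desired gob characteristics
settings = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.Adam([settings], lr=0.05)

for _ in range(200):                      # invert the model w.r.t. its inputs
    loss = nn.functional.mse_loss(forward_model(settings), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(settings.detach(), loss.item())
```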

[318] Med-VRAgent: A Framework for Medical Visual Reasoning-Enhanced Agents

Guangfu Guo, Xiaoqian Lu, Yue Feng

Main category: cs.AI

TL;DR: Med-VRAgent is a framework that combines Visual Guidance, Self-Reward paradigms, and Monte Carlo Tree Search to reduce hallucinations and improve medical visual reasoning in VLMs, with performance enhanced through PPO fine-tuning.

DetailsMotivation: Current Visual Language Models (VLMs) struggle with hallucinations, vague descriptions, inconsistent logic, and poor localization in medical reasoning tasks.

Method: Proposed Med-VRAgent framework using Visual Guidance, Self-Reward paradigms, and Monte Carlo Tree Search (MCTS), with trajectories used for fine-tuning VLMs via proximal policy optimization (PPO).

Result: The method outperforms existing approaches on multiple medical VQA benchmarks.

Conclusion: Med-VRAgent effectively improves medical visual reasoning capabilities of VLMs by addressing key limitations through a novel agent framework and reinforcement learning fine-tuning.

Abstract: Visual Language Models (VLMs) achieve promising results in medical reasoning but struggle with hallucinations, vague descriptions, inconsistent logic and poor localization. To address this, we propose an agent framework named Medical Visual Reasoning Agent (Med-VRAgent). The approach is based on Visual Guidance and Self-Reward paradigms and Monte Carlo Tree Search (MCTS). By combining the Visual Guidance with tree search, Med-VRAgent improves the medical visual reasoning capabilities of VLMs. We use the trajectories collected by Med-VRAgent as feedback to further improve the performance by fine-tuning the VLMs with the proximal policy optimization (PPO) objective. Experiments on multiple medical VQA benchmarks demonstrate that our method outperforms existing approaches.

[319] Automated urban waterlogging assessment and early warning through a mixture of foundation models

Chenxu Zhang, Fuxiang Huang, Lei Zhang

Main category: cs.AI

TL;DR: UWAssess is a foundation model-driven framework that automatically identifies waterlogged areas in surveillance images and generates structured assessment reports, addressing urban waterlogging monitoring challenges.

DetailsMotivation: With climate change intensifying, urban waterlogging poses an increasingly severe threat to global public safety and infrastructure, while existing monitoring approaches rely heavily on manual reporting and fail to provide timely and comprehensive assessments.

Method: The framework uses a semi-supervised fine-tuning strategy and chain-of-thought (CoT) prompting strategy to unleash the potential of foundation models for data-scarce downstream tasks, enabling automatic identification of waterlogged areas and generation of structured reports.

Result: Evaluations on challenging visual benchmarks demonstrate substantial improvements in perception performance, and GPT-based evaluations confirm the ability to generate reliable textual reports that accurately describe waterlogging extent, depth, risk and impact.

Conclusion: UWAssess enables a shift from perception to generation in waterlogging monitoring, while the collaborative framework of multiple foundation models lays the groundwork for intelligent and scalable systems supporting urban management, disaster response and climate resilience.

Abstract: With climate change intensifying, urban waterlogging poses an increasingly severe threat to global public safety and infrastructure. However, existing monitoring approaches rely heavily on manual reporting and fail to provide timely and comprehensive assessments. In this study, we present Urban Waterlogging Assessment (UWAssess), a foundation model-driven framework that automatically identifies waterlogged areas in surveillance images and generates structured assessment reports. To address the scarcity of labeled data, we design a semi-supervised fine-tuning strategy and a chain-of-thought (CoT) prompting strategy to unleash the potential of the foundation model for data-scarce downstream tasks. Evaluations on challenging visual benchmarks demonstrate substantial improvements in perception performance. GPT-based evaluations confirm the ability of UWAssess to generate reliable textual reports that accurately describe waterlogging extent, depth, risk and impact. This dual capability enables a shift of waterlogging monitoring from perception to generation, while the collaborative framework of multiple foundation models lays the groundwork for intelligent and scalable systems, supporting urban management, disaster response and climate resilience.

[320] AlphaOPT: Formulating Optimization Programs with Self-Improving LLM Experience Library

Minwei Kong, Ao Qu, Xiaotong Guo, Wenbin Ouyang, Chonghe Jiang, Han Zheng, Yining Ma, Dingyi Zhuang, Yuhan Tang, Junyi Li, Hai Wang, Cathy Wu, Jinhua Zhao

Main category: cs.AI

TL;DR: AlphaOPT is a self-improving experience library that enables LLMs to learn optimization modeling from limited demonstrations and solver feedback without annotated reasoning traces or parameter updates, using a continual two-phase cycle of library learning and evolution.

DetailsMotivation: Optimization modeling is difficult to automate as informal language must be mapped to precise mathematical formulations and executable solver code. Prior LLM approaches either rely on brittle prompting or costly retraining with limited generalization.

Method: AlphaOPT operates in a continual two-phase cycle: (1) Library Learning phase that reflects on failed attempts, extracting solver-verified structured insights as {taxonomy, condition, explanation, example}; (2) Library Evolution phase that diagnoses retrieval misalignments and refines applicability conditions of stored insights.

Result: Experiments show AlphaOPT steadily improves with more data (65% to 72% from 100 to 300 training items) and surpasses the strongest baseline by 7.7% on the out-of-distribution OptiBench dataset when trained only on answers.

Conclusion: AlphaOPT learns efficiently from limited demonstrations without curated rationales, expands continually without costly retraining by updating the library rather than model weights, and makes knowledge explicit and interpretable for human inspection and intervention.

Abstract: Optimization modeling enables critical decisions across industries but remains difficult to automate: informal language must be mapped to precise mathematical formulations and executable solver code. Prior LLM approaches either rely on brittle prompting or costly retraining with limited generalization. We present AlphaOPT, a self-improving experience library that enables an LLM to learn from limited demonstrations (even answers alone, without gold-standard programs) and solver feedback - without annotated reasoning traces or parameter updates. AlphaOPT operates in a continual two-phase cycle: (i) a Library Learning phase that reflects on failed attempts, extracting solver-verified, structured insights as {taxonomy, condition, explanation, example}; and (ii) a Library Evolution phase that diagnoses retrieval misalignments and refines the applicability conditions of stored insights, improving transfer across tasks. This design (1) learns efficiently from limited demonstrations without curated rationales, (2) expands continually without costly retraining by updating the library rather than model weights, and (3) makes knowledge explicit and interpretable for human inspection and intervention. Experiments show that AlphaOPT steadily improves with more data (65% to 72% from 100 to 300 training items) and surpasses the strongest baseline by 7.7% on the out-of-distribution OptiBench dataset when trained only on answers. Code and data are available at: https://github.com/Minw913/AlphaOPT.
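
The insight schema is concrete enough to sketch. Below, a dataclass mirrors the {taxonomy, condition, explanation, example} structure and a deliberately naive keyword matcher stands in for retrieval; the stored insights are invented, and the paper refines the condition fields during Library Evolution.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    taxonomy: str      # e.g. "constraints/big-M"
    condition: str     # when the insight applies (refined over time)
    explanation: str
    example: str

library = [
    Insight("constraints/big-M",
            "logical implication between a binary and a continuous variable",
            "Link them with y <= M*x, where M is a valid upper bound on y.",
            "y <= 1000 * x"),
    Insight("objective/piecewise",
            "piecewise-linear cost in the objective",
            "Introduce one variable per segment with SOS2 constraints.",
            "cost = sum(c[i] * z[i])"),
]

def retrieve(query, library):
    """Naive retrieval: insights whose condition shares words with the query."""
    words = set(query.lower().split())
    return [i for i in library if words & set(i.condition.lower().split())]

for hit in retrieve("fixed charge implies a binary variable", library):
    print(hit.taxonomy, "->", hit.explanation)
```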

[321] PlanU: Large Language Model Decision Making through Planning under Uncertainty

Ziwei Deng, Mian Deng, Chenjing Liang, Zeming Gao, Chennan Ma, Chenxing Lin, Haipeng Zhang, Songzhu Mei, Cheng Wang, Siqi Shen

Main category: cs.AI

TL;DR: PlanU is an LLM-based planning method that addresses uncertainty in decision-making by integrating quantile distributions and curiosity-driven exploration within Monte Carlo Tree Search.

DetailsMotivation: LLMs struggle with decision-making under uncertainty, particularly in stochastic environments, due to both LLM uncertainty (from stochastic sampling) and environmental uncertainty. Existing approaches either overlook environmental uncertainty or are not designed for multi-step decision-making tasks.

Method: PlanU integrates LLMs with Monte Carlo Tree Search (MCTS) by modeling node returns as quantile distributions and introducing an Upper Confidence Bounds with Curiosity (UCC) score to balance exploration and exploitation during tree search.

Result: Extensive experiments demonstrate that PlanU is effective for LLM-based decision-making tasks under uncertainty, showing improved performance in stochastic environments compared to existing approaches.

Conclusion: PlanU successfully addresses uncertainty challenges in LLM decision-making by combining quantile distribution modeling and curiosity-driven exploration within MCTS, enabling better performance in stochastic environments that require multi-step planning.

Abstract: Large Language Models (LLMs) are increasingly being explored across a range of decision-making tasks. However, LLMs sometimes struggle with decision-making tasks under uncertainty that are relatively easy for humans, such as planning actions in stochastic environments. The adoption of LLMs for decision-making is impeded by uncertainty challenges, such as LLM uncertainty and environmental uncertainty. LLM uncertainty arises from the stochastic sampling process inherent to LLMs. Most LLM-based Decision-Making (LDM) approaches address LLM uncertainty through multiple reasoning chains or search trees. However, these approaches overlook environmental uncertainty, which leads to poor performance in environments with stochastic state transitions. Some recent LDM approaches deal with uncertainty by forecasting the probability of unknown variables. However, they are not designed for multi-step decision-making tasks that require interaction with the environment. To address uncertainty in LLM decision-making, we introduce PlanU, an LLM-based planning method that captures uncertainty within Monte Carlo Tree Search (MCTS). PlanU models the return of each node in the MCTS as a quantile distribution, which uses a set of quantiles to represent the return distribution. To balance exploration and exploitation during tree search, PlanU introduces an Upper Confidence Bounds with Curiosity (UCC) score which estimates the uncertainty of MCTS nodes. Through extensive experiments, we demonstrate the effectiveness of PlanU in LLM-based decision-making tasks under uncertainty.
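
The abstract names the ingredients of the UCC score but not its exact form, so the sketch below is one plausible composition: the mean of a node’s quantile return estimates (exploitation), a UCB visit bonus (exploration), and the interquartile spread of the quantiles as the curiosity/uncertainty term. The weights are arbitrary.

```python
import math
import numpy as np

def ucc_score(quantiles, n_parent, n_node, c_ucb=1.0, c_cur=0.5):
    """Toy UCC for one MCTS node whose return is a quantile distribution."""
    q = np.asarray(quantiles)
    exploit = q.mean()
    explore = c_ucb * math.sqrt(math.log(n_parent) / n_node)
    curiosity = c_cur * (np.quantile(q, 0.75) - np.quantile(q, 0.25))
    return exploit + explore + curiosity

# A wide return distribution earns a larger uncertainty bonus than a narrow one.
print(ucc_score([0.2, 0.5, 0.8], n_parent=50, n_node=5))
print(ucc_score([0.49, 0.50, 0.51], n_parent=50, n_node=5))
```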

[322] CircuitSeer: Mining High-Quality Data by Probing Mathematical Reasoning Circuits in LLMs

Shaobo Wang, Yongliang Miao, Yuancheng Liu, Qianli Ma, Ning Liao, Linfeng Zhang

Main category: cs.AI

TL;DR: CircuitSeer is a novel data selection method that uses internal attention head activation patterns to identify reasoning complexity, achieving better performance with only 10% of training data.

DetailsMotivation: Current data selection methods for LLMs rely on expensive external models or opaque heuristics, while internal model mechanisms remain underutilized for identifying reasoning complexity.

Method: The method identifies sparse, specialized attention heads that form core reasoning circuits and quantifies data complexity by measuring its influence on these circuits.

Result: Fine-tuning Qwen2.5-Math-7B on just 10% of data selected by CircuitSeer achieved a 1.4-point gain in average Pass@1 over training on the full dataset across 4 models and 9 datasets.

Conclusion: CircuitSeer demonstrates that leveraging internal model mechanisms for data selection is more efficient and effective than external heuristics, enabling better performance with significantly less training data.

Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities, but scaling their performance often relies on massive reasoning datasets that are computationally expensive to train on. Existing data selection methods aim to curate smaller, high-quality subsets but often rely on costly external models or opaque heuristics. In this work, we shift the focus from external heuristics to the model’s internal mechanisms. We find that complex reasoning tasks consistently activate a sparse, specialized subset of attention heads, forming core reasoning circuits. Building on this insight, we propose CircuitSeer, a novel data selection method that quantifies the reasoning complexity of data by measuring its influence on these crucial circuits. Extensive experiments on 4 models and 9 datasets demonstrate CircuitSeer’s superiority. Notably, fine-tuning Qwen2.5-Math-7B on just 10% of data selected by our method achieves a 1.4-point gain in average Pass@1 over training on the full dataset, highlighting its efficiency and effectiveness.
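
A toy version of the selection idea: fix a small set of “circuit” attention heads, score each example by how much attention those heads place on other tokens, and keep the top fraction. The off-diagonal-mass proxy and the head indices are assumptions; the paper’s influence measure on reasoning circuits is more principled.

```python
import torch

def circuit_score(attn, circuit_heads):
    """Mean off-diagonal attention mass of designated heads for one example.
    attn: (layers, heads, seq, seq) attention weights."""
    seq = attn.shape[-1]
    off_diag = attn * (1 - torch.eye(seq))
    per_head = off_diag.sum(-1).mean(-1)                   # (layers, heads)
    return torch.stack([per_head[l, h] for l, h in circuit_heads]).mean().item()

def select_top_fraction(examples_attn, circuit_heads, frac=0.1):
    scores = [circuit_score(a, circuit_heads) for a in examples_attn]
    k = max(1, int(frac * len(scores)))
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

# 20 fake examples; 4 layers x 8 heads, sequence length 16.
examples = [torch.softmax(torch.randn(4, 8, 16, 16), dim=-1) for _ in range(20)]
heads = [(1, 3), (2, 5)]   # hypothetical circuit heads found by probing
print(select_top_fraction(examples, heads))
```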

[323] Probabilistic Modeling of Intentions in Socially Intelligent LLM Agents

Feifan Xia, Yuyang Fang, Defang Li, Yantong Xie, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang

Main category: cs.AI

TL;DR: A probabilistic intent modeling framework for LLM agents in multi-turn social dialogue that maintains belief distributions over partner intentions and enables adaptive dialogue strategies.

DetailsMotivation: To develop socially intelligent LLM agents that can better understand and respond to partner intentions in multi-turn social dialogues under uncertainty.

Method: Maintains a belief distribution over partner’s latent intentions, initialized from contextual priors and dynamically updated through likelihood estimation after each utterance. The evolving distribution provides contextual grounding for adaptive dialogue policies.

Result: In SOTOPIA environment, increases Overall score by 9.0% on SOTOPIA-All and 4.1% on SOTOPIA-Hard compared to Qwen2.5-7B baseline, and slightly surpasses an oracle agent that directly observes partner intentions.

Conclusion: Probabilistic intent modeling can contribute to the development of socially intelligent LLM agents by enabling better understanding of partner intentions in social dialogues.

Abstract: We present a probabilistic intent modeling framework for large language model (LLM) agents in multi-turn social dialogue. The framework maintains a belief distribution over a partner’s latent intentions, initialized from contextual priors and dynamically updated through likelihood estimation after each utterance. The evolving distribution provides additional contextual grounding for the policy, enabling adaptive dialogue strategies under uncertainty. Preliminary experiments in the SOTOPIA environment show consistent improvements: the proposed framework increases the Overall score by 9.0% on SOTOPIA-All and 4.1% on SOTOPIA-Hard compared with the Qwen2.5-7B baseline, and slightly surpasses an oracle agent that directly observes partner intentions. These early results suggest that probabilistic intent modeling can contribute to the development of socially intelligent LLM agents.
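
The belief machinery is a textbook Bayesian filter over a discrete intention set, which makes it easy to sketch. The intention labels and likelihood numbers below are hypothetical; in the paper the likelihoods are estimated by the LLM after each utterance, and the evolving belief then conditions the dialogue policy.

```python
def update_belief(belief, likelihood):
    """One Bayesian update: posterior ∝ prior × P(utterance | intention)."""
    posterior = {i: belief[i] * likelihood[i] for i in belief}
    z = sum(posterior.values())
    return {i: p / z for i, p in posterior.items()}

# Contextual prior over the partner's latent intentions.
belief = {"cooperate": 0.5, "compete": 0.3, "deceive": 0.2}

# Per-utterance likelihoods (hypothetical numbers).
for likelihood in [{"cooperate": 0.7, "compete": 0.2, "deceive": 0.1},
                   {"cooperate": 0.6, "compete": 0.3, "deceive": 0.1}]:
    belief = update_belief(belief, likelihood)
    print({k: round(v, 3) for k, v in belief.items()})
```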

[324] LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources

Haichao Ji, Zibo Wang, Yifei Zhu, Meng Han, Dan Wang, Zhu Han

Main category: cs.AI

TL;DR: LAFA is the first system that integrates LLM-agent-based data analytics with federated analytics, enabling privacy-preserving computation across distributed data sources while supporting natural language queries.

DetailsMotivation: Existing LLM-agent analytics frameworks lack privacy protection by assuming centralized data access, while federated analytics preserves privacy but requires structured queries and doesn't support natural language input.

Method: LAFA uses a hierarchical multi-agent architecture with a coarse-grained planner to decompose queries into sub-queries, a fine-grained planner to map subqueries to FA operation DAGs, and an optimizer agent to rewrite and merge DAGs for efficiency.

Result: LAFA outperforms baseline prompting strategies with higher execution plan success rates and significantly reduces resource-intensive FA operations.

Conclusion: LAFA establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in federated analytics settings.

Abstract: Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy-preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine-readable queries. In this work, we present LAFA, the first system that integrates LLM-agent-based data analytics with FA. LAFA introduces a hierarchical multi-agent architecture that accepts natural language queries and transforms them into optimized, executable FA workflows. A coarse-grained planner first decomposes complex queries into sub-queries, while a fine-grained planner maps each subquery into a Directed Acyclic Graph of FA operations using prior structural knowledge. To improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computation and communication overhead. Our experiments demonstrate that LAFA consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource-intensive FA operations by a substantial margin. This work establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in the FA setting.
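
The optimizer agent's DAG-merging step can be pictured as deduplicating identical operation nodes across sub-query DAGs. A minimal sketch assuming each FA operation is keyed by an (op, arguments) pair; the queries are invented:

```python
# Sketch of DAG merging: identical (op, arguments) nodes across sub-query DAGs
# collapse into one node, so the shared federated computation runs only once.
import networkx as nx

def merge_dags(dags):
    merged = nx.DiGraph()
    for dag in dags:
        for node in dag.nodes:
            merged.add_node(node)          # identical (op, args) keys deduplicate
        merged.add_edges_from(dag.edges)
    return merged

q1 = nx.DiGraph([(("filter", "age>30"), ("count", "users"))])
q2 = nx.DiGraph([(("filter", "age>30"), ("mean", "income"))])
plan = merge_dags([q1, q2])
print(plan.nodes)                          # the shared filter appears once: 3 ops, not 4
print(list(nx.topological_sort(plan)))     # executable order for the FA runtime
```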

[325] StarBench: A Turn-Based RPG Benchmark for Agentic Multimodal Decision-Making and Information Seeking

Haoran Zhang, Chenhao Zhu, Sicong Guo, Hanzhe Guo, Haiming Li, Donglin Yu

Main category: cs.AI

TL;DR: StarBench is a benchmark for evaluating vision-language models’ ability to play turn-based RPGs like humans - mapping screenshots to low-level actions and deciding when to seek guidance.

DetailsMotivation: Current VLMs struggle with human-like gameplay that requires mapping raw screenshots to temporally coherent low-level actions while deciding when to ask for help, unlike simplified control scenarios.

Method: StarBench standardizes evaluation across 8 combat tasks in two regimes: direct control (raw screenshots to low-level primitives) and tool-assisted control (higher-level intents with optional OCR). Includes ask-or-act diagnostic for measuring guidance-seeking behavior.

Result: Results show significant gaps in perception-to-control fidelity in direct control regime, while judicious information seeking correlates with improved success rates.

Conclusion: StarBench establishes a reproducible benchmark for evaluating agentic information seeking and multimodal decision-making in real-client gameplay scenarios.

Abstract: Human players do more than press buttons: they ground what they see on screen into precise keyboard-mouse actions and, when stuck, they seek information before trying again. We ask whether current vision-language models (VLMs) can do the same. Despite encouraging results under simplified control or tool scaffolds, human-like play in a real client - mapping raw screenshots to temporally coherent low-level actions while deciding when to ask for guidance - remains an open challenge. We introduce StarBench, a turn-based RPG benchmark derived from Honkai: Star Rail that targets these two human-like competencies: multimodal decision-making from pixels to actions and agentic information seeking. StarBench standardizes evaluation across eight combat tasks and two regimes with shared tasks and metrics: (i) direct control, where agents receive only screenshots and must emit low-level primitives (click and keypress) with no semantic hints; and (ii) tool-assisted control, where higher-level intents can be mapped to primitives by detectors and OCR outputs provide optional textualized observations to ease UI grounding. To mirror human practice, StarBench also includes an ask-or-act diagnostic that measures whether and when agents choose to request brief guidance before proceeding, and how that choice affects subsequent performance. We report reference baselines for contemporary VLMs and a human reference. Results expose sizable gaps in perception-to-control fidelity in the direct regime, while showing that judicious information seeking correlates with improved success, establishing StarBench as a reproducible yardstick for agentic information seeking and multimodal decision-making in real-client play.

[326] AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification

Ho Fai Leung, Xiaoyan Xi, Fei Zuo

Main category: cs.AI

TL;DR: The paper identifies flaws in AndroidControl benchmark that underestimate GUI agent capabilities, creates AndroidControl-Curated benchmark where models achieve 75% success rate, and introduces Magma-R1-3B model that matches performance of much larger models at 200x smaller size.

DetailsMotivation: Current on-device virtual assistants rely on rigid APIs, while GUI agents offer API-independent alternatives but are perceived as underperforming due to flawed benchmarks that systematically underestimate their capabilities.

Method: Enhanced AndroidControl benchmark through rigorous purification pipeline to create AndroidControl-Curated, and post-trained Magma-R1-3B model on 2.4k curated samples using minimal compute resources.

Result: On AndroidControl-Curated, state-of-the-art models achieve 75% success rate (15% improvement), and Magma-R1-3B delivers comparable performance to Qwen3-VL-235B despite being 200x smaller, trained with only $60 compute cost.

Conclusion: On-device GUI agents are closer to practical deployment than previously thought, and the enhanced benchmark better reflects true model capabilities, accelerating development of robust virtual assistants.

Abstract: On-device virtual assistants like Siri and Google Assistant are increasingly pivotal, yet their capabilities are hamstrung by a reliance on rigid, developer-dependent APIs. GUI agents offer a powerful, API-independent alternative, but their adoption is hindered by the perception of poor performance, as even the best models (e.g., Qwen3-VL-235B) are capped at around 60% on benchmarks like AndroidControl, far from viability for real-world use. Our research reveals that the issue lies not only with the models but with the benchmarks themselves. We identified notable shortcomings in AndroidControl, including ambiguities and factual errors, which systematically underrate agent capabilities. To address this critical oversight, we enhanced AndroidControl into AndroidControl-Curated, a refined version of the benchmark improved through a rigorous purification pipeline. On this enhanced benchmark, state-of-the-art models achieve success rates nearing 75% on complex tasks (a 15% improvement), suggesting that on-device GUI agents are actually closer to practical deployment than previously thought. We introduce our new SOTA model, Magma-R1-3B, post-trained on just 2.4k curated samples using 60 hours of an H20 GPU (approximately $60). Despite being 200 times smaller in parameters, this model delivers performance comparable to Qwen3-VL-235B. We release both the AndroidControl-Curated benchmark and the Magma-R1 model to the research community, encouraging adoption of this enhanced benchmark to better reflect model capabilities and accelerate the development of robust, on-device virtual assistants.

[327] Crucible: Quantifying the Potential of Control Algorithms through LLM Agents

Lianchen Jia, Chaoyang Li, Qian Houde, Tianchi Huang, Jiangchuan Liu, Lifeng Sun

Main category: cs.AI

TL;DR: Crucible is an LLM-driven agent that uses multi-level expert simulation to tune control algorithms and quantitatively evaluates their Tuning Potential, bridging the gap between ideal performance and real-world tuning needs.

DetailsMotivation: Existing research focuses on algorithmic performance under ideal configurations but overlooks Tuning Potential - the critical aspect of how well algorithms can be tuned by domain experts for specific real-world scenarios.

Method: Crucible employs LLM-driven multi-level expert simulation to tune algorithms and defines a formalized metric to quantitatively evaluate Tuning Potential across various control tasks and computer systems.

Result: Crucible systematically quantifies the tunable space across different algorithms and demonstrates effectiveness in case studies from classic control tasks to complex computer systems, with validation in real-world deployment.

Conclusion: Crucible provides a new dimension for algorithm analysis and design that leads to performance improvements by formally evaluating Tuning Potential, addressing a critical gap in current control algorithm research.

Abstract: Control algorithms in production environments typically require domain experts to tune their parameters and logic for specific scenarios. However, existing research predominantly focuses on algorithmic performance under ideal or default configurations, overlooking the critical aspect of Tuning Potential. To bridge this gap, we introduce Crucible, an agent that employs an LLM-driven, multi-level expert simulation to tune algorithms and defines a formalized metric to quantitatively evaluate their Tuning Potential. We demonstrate Crucible’s effectiveness across a wide spectrum of case studies, from classic control tasks to complex computer systems, and validate its findings in a real-world deployment. Our experimental results reveal that Crucible systematically quantifies the tunable space across different algorithms. Furthermore, Crucible provides a new dimension for algorithm analysis and design, which ultimately leads to performance improvements. Our code is available at https://github.com/thu-media/Crucible.
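
One way to picture a formalized Tuning Potential metric (the paper's definition may differ) is as the gap between an algorithm's default-configuration score and its best score under a tuning budget:

```python
# Sketch of a Tuning Potential metric under stated assumptions: the reward gap
# between the default configuration and the best configuration found within a
# tuning budget. The control task here is a toy quadratic.
import random

def evaluate(params):                       # stand-in for a control task's reward
    kp, ki = params
    return -(kp - 0.8) ** 2 - (ki - 0.2) ** 2

def tuning_potential(default, search_space, budget=100, seed=0):
    rng = random.Random(seed)
    base = evaluate(default)
    best = max(evaluate((rng.uniform(*search_space[0]),
                         rng.uniform(*search_space[1])))
               for _ in range(budget))
    return best - base                      # 0 means nothing to gain from tuning

print(tuning_potential(default=(0.5, 0.5), search_space=[(0, 2), (0, 1)]))
```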

[328] Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models

Hanze Guo, Jing Yao, Xiao Zhou, Xiaoyuan Yi, Xing Xie

Main category: cs.AI

TL;DR: COUPLE is a counterfactual reasoning framework that uses structural causal modeling to align LLMs with pluralistic human values, addressing value complexity and steerability challenges.

DetailsMotivation: Current LLM alignment methods treat multiple values as independent and equally important, ignoring their interdependence and relative priorities, and struggle to control nuanced value priorities.

Method: Proposes COUPLE framework with structural causal model to capture value interdependency and prioritization, using counterfactual reasoning to generate outputs aligned with desired value objectives.

Result: COUPLE outperforms other baselines across diverse value objectives on two datasets with different value systems, and provides better interpretability.

Conclusion: COUPLE effectively addresses value complexity and steerability challenges in pluralistic value alignment through causal modeling and counterfactual reasoning.

Abstract: As large language models (LLMs) become increasingly integrated into applications serving users across diverse cultures, communities and demographics, it is critical to align LLMs with pluralistic human values beyond average principles (e.g., HHH). In psychological and social value theories such as Schwartz’s Value Theory, pluralistic values are represented by multiple value dimensions paired with various priorities. However, existing methods encounter two challenges when aligning with such fine-grained value objectives: 1) they often treat multiple values as independent and equally important, ignoring their interdependence and relative priorities (value complexity); 2) they struggle to precisely control nuanced value priorities, especially those underrepresented ones (value steerability). To handle these challenges, we propose COUPLE, a COUnterfactual reasoning framework for PLuralistic valuE alignment. It introduces a structural causal model (SCM) to capture the complex interdependency and prioritization among values, as well as the causal relationship between high-level value dimensions and behaviors. Moreover, it applies counterfactual reasoning to generate outputs aligned with any desired value objectives. Benefitting from explicit causal modeling, COUPLE also provides better interpretability. We evaluate COUPLE on two datasets with different value systems and demonstrate that COUPLE outperforms other baselines across diverse types of value objectives.

[329] Physics-guided Emulators Reveal Resilience and Fragility under Operational Latencies and Outages

Sarth Dubey, Subimal Ghosh, Udit Bhatia

Main category: cs.AI

TL;DR: Developed an operationally ready emulator of GloFAS that maintains physical coherence under data latency and outages, tested across 5,000+ basins globally.

DetailsMotivation: Most hydrologic forecasting models are evaluated under ideal data conditions, lacking operational resilience when input data are delayed, missing, or inconsistent.

Method: Coupled long- and short-term memory networks with relaxed water-balance constraint, trained in US catchments and tested across 5,000+ basins including regulated rivers in India.

Result: Emulator reproduces GloFAS hydrological core and degrades smoothly as information quality declines, showing reduced but physically consistent performance across different regimes.

Conclusion: Framework establishes operational robustness as measurable property of hydrological ML and advances reliable real-time forecasting system design.

Abstract: Reliable hydrologic and flood forecasting requires models that remain stable when input data are delayed, missing, or inconsistent. However, most advances in rainfall-runoff prediction have been evaluated under ideal data conditions, emphasizing accuracy rather than operational resilience. Here, we develop an operationally ready emulator of the Global Flood Awareness System (GloFAS) that couples long- and short-term memory networks with a relaxed water-balance constraint to preserve physical coherence. Five architectures span a continuum of information availability: from complete historical and forecast forcings to scenarios with data latency and outages, allowing systematic evaluation of robustness. Trained in minimally managed catchments across the United States and tested in more than 5,000 basins, including heavily regulated rivers in India, the emulator reproduces the hydrological core of GloFAS and degrades smoothly as information quality declines. Transfer across contrasting hydroclimatic and management regimes yields reduced yet physically consistent performance, defining the limits of generalization under data scarcity and human influence. The framework establishes operational robustness as a measurable property of hydrological machine learning and advances the design of reliable real-time forecasting systems.
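
A relaxed water-balance constraint can be sketched as a soft penalty added to the regression loss; the functional form, variables, and tolerances below are assumptions, not the paper's exact formulation:

```python
# Sketch of a relaxed water-balance penalty: predicted runoff should roughly
# close the precipitation-minus-evapotranspiration budget, with a slack term
# keeping the constraint soft. All details here are illustrative assumptions.
import torch

def water_balance_loss(pred_runoff, obs_runoff, precip, evap, lam=0.1, slack=0.05):
    mse = torch.mean((pred_runoff - obs_runoff) ** 2)
    budget = precip.sum(dim=-1) - evap.sum(dim=-1)          # water available
    imbalance = (pred_runoff.sum(dim=-1) - budget).abs()
    penalty = torch.relu(imbalance - slack).mean()          # relaxed constraint
    return mse + lam * penalty

t = torch.rand(8, 30)   # batch of 8 basins, 30 daily steps
print(water_balance_loss(t, t * 0.9, precip=t * 2, evap=t * 0.5))
```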

[330] SOCIA-Nabla: Textual Gradient Meets Multi-Agent Orchestration for Automated Simulator Generation

Yuncheng Hua, Sion Weatherhead, Mehdi Jafari, Hao Xue, Flora D. Salim

Main category: cs.AI

TL;DR: SOCIA-Nabla is an end-to-end agentic framework that treats simulator construction as instance optimization over code within a textual computation graph, using LLM-driven agents and Textual-Gradient Descent for automated code synthesis, execution, evaluation, and repair.

DetailsMotivation: To convert brittle prompt pipelines into reproducible, constraint-aware simulator code generation that scales across domains and simulation granularities, while minimizing expert effort.

Method: Uses specialized LLM-driven agents embedded as graph nodes with a workflow manager executing a loss-driven loop (code synthesis -> execution -> evaluation -> code repair) and performs Textual-Gradient Descent optimization.

Result: Attains state-of-the-art overall accuracy across three CPS tasks: User Modeling, Mask Adoption, and Personal Mobility.

Conclusion: SOCIA-Nabla successfully unifies multi-agent orchestration with a loss-aligned optimization view, creating reproducible simulator code generation that scales across domains and granularities.

Abstract: In this paper, we present SOCIA-Nabla, an end-to-end, agentic framework that treats simulator construction as instance optimization over code within a textual computation graph. Specialized LLM-driven agents are embedded as graph nodes, and a workflow manager executes a loss-driven loop: code synthesis -> execution -> evaluation -> code repair. The optimizer performs Textual-Gradient Descent (TGD), while human-in-the-loop interaction is reserved for task-spec confirmation, minimizing expert effort and keeping the code itself as the trainable object. Across three CPS tasks, i.e., User Modeling, Mask Adoption, and Personal Mobility, SOCIA-Nabla attains state-of-the-art overall accuracy. By unifying multi-agent orchestration with a loss-aligned optimization view, SOCIA-Nabla converts brittle prompt pipelines into reproducible, constraint-aware simulator code generation that scales across domains and simulation granularities. This work is under review, and we will release the code soon.
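
The loss-driven loop reads naturally as code. In the sketch below the three agent calls are hypothetical stubs standing in for the framework's LLM-driven nodes; only the control flow is meant literally:

```python
# Skeleton of the loss-driven loop (code synthesis -> execution -> evaluation
# -> code repair); synthesize/execute_and_score/repair are invented stubs.
def synthesize(spec):                      # agent node: spec -> candidate simulator code
    return "def simulate(state): return state"

def execute_and_score(code, benchmark):    # run the simulator, return a textual loss
    return 0.42, "predicted adoption rates too flat vs. benchmark"

def repair(code, textual_gradient):        # agent node: apply TGD feedback to the code
    return code + "  # adjusted per feedback: " + textual_gradient

code = synthesize("mask-adoption simulator, daily granularity")
for step in range(3):                      # the code itself is the trainable object
    loss, feedback = execute_and_score(code, benchmark="CPS mask-adoption data")
    if loss < 0.1:
        break
    code = repair(code, feedback)
print(code)
```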

[331] Extracting alignment data in open models

Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo, Chawin Sitawarin, Matthew Jagielski, Itay Yona, Petar Veličković, Ilia Shumailov, Jamie Hayes

Main category: cs.AI

TL;DR: Substantial amounts of alignment training data can be extracted from post-trained models, with embedding-based similarity metrics revealing overlooked risks in data extraction and distillation practices.

DetailsMotivation: To investigate the feasibility of extracting alignment training data from post-trained models and understand the risks associated with data regurgitation in distillation practices.

Method: Used embedding models to measure semantic similarity between extracted and original training data, rather than traditional string matching methods like edit distance.

Result: Found that models readily regurgitate training data from post-training phases (SFT/RL), and this extracted data can train base models to recover meaningful performance. Embedding-based methods revealed 10x more extractable data than string matching.

Conclusion: Exposes overlooked risks in alignment data extraction and suggests distillation practices may indirectly train on original datasets through model regurgitation.

Abstract: In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model – useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high quality embedding model can identify semantic similarities between strings that a different metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of 10×) the amount of data that can be extracted due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can be then used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk towards extracting alignment data. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to be regurgitating aspects of their training set, distillation can therefore be thought of as indirectly training on the model’s original dataset.
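
The gap between string matching and embedding similarity is easy to demonstrate. A sketch using a common public sentence-embedding checkpoint (an assumption; the paper does not name this model):

```python
# Sketch contrasting edit-distance similarity with embedding similarity for
# deciding whether a sampled generation "matches" a training example.
from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, util

train = "Always refuse to provide instructions for building weapons."
sample = "I cannot help with weapon-making instructions, that is something I must refuse."

edit_sim = SequenceMatcher(None, train, sample).ratio()

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([train, sample], convert_to_tensor=True)
cos_sim = util.cos_sim(emb[0], emb[1]).item()

# Trivial rephrasing deflates edit similarity but not embedding similarity,
# which is the undercounting effect the abstract estimates at roughly 10x.
print(f"edit-distance similarity: {edit_sim:.2f}, embedding similarity: {cos_sim:.2f}")
```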

[332] QuantEvolve: Automating Quantitative Strategy Discovery through Multi-Agent Evolutionary Framework

Junhyeog Yun, Hyoun Jun Lee, Insu Jeon

Main category: cs.AI

TL;DR: QuantEvolve is an evolutionary framework that combines quality-diversity optimization with hypothesis-driven strategy generation to automate trading strategy development while maintaining diversity across investor preferences and market conditions.

DetailsMotivation: Automating quantitative trading strategy development is challenging in dynamic markets, especially with increasing demand for personalized investment solutions. Existing methods fail to explore the vast strategy space while preserving diversity essential for robust performance across changing market conditions.

Method: QuantEvolve combines quality-diversity optimization with hypothesis-driven strategy generation. It uses a feature map aligned with investor preferences (strategy type, risk profile, turnover, return characteristics) and integrates a hypothesis-driven multi-agent system to systematically explore strategy space through iterative generation and evaluation.

Result: Empirical results show that QuantEvolve outperforms conventional baselines, producing diverse, sophisticated strategies that adapt to both market regime shifts and individual investment needs.

Conclusion: QuantEvolve validates its effectiveness in automated trading strategy development and the authors release a dataset of evolved strategies to support future research.

Abstract: Automating quantitative trading strategy development in dynamic markets is challenging, especially with increasing demand for personalized investment solutions. Existing methods often fail to explore the vast strategy space while preserving the diversity essential for robust performance across changing market conditions. We present QuantEvolve, an evolutionary framework that combines quality-diversity optimization with hypothesis-driven strategy generation. QuantEvolve employs a feature map aligned with investor preferences, such as strategy type, risk profile, turnover, and return characteristics, to maintain a diverse set of effective strategies. It also integrates a hypothesis-driven multi-agent system to systematically explore the strategy space through iterative generation and evaluation. This approach produces diverse, sophisticated strategies that adapt to both market regime shifts and individual investment needs. Empirical results show that QuantEvolve outperforms conventional baselines, validating its effectiveness. We release a dataset of evolved strategies to support future research.
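
The quality-diversity component can be sketched as a MAP-Elites-style archive keyed by the investor-facing features the abstract lists; the strategy encoding and toy backtest below are invented for illustration:

```python
# Minimal quality-diversity sketch: keep the best strategy per feature cell
# (here risk x turnover), preserving diversity while improving fitness.
import random

random.seed(0)
archive = {}                          # (risk_bin, turnover_bin) -> (fitness, strategy)

def evaluate(strategy):               # toy backtest standing in for real evaluation
    fitness = 1.0 - (strategy["threshold"] - 0.3) ** 2 + random.gauss(0, 0.01)
    return fitness, strategy["risk"], strategy["turnover"]

for _ in range(500):                  # iterative generation + evaluation
    s = {"threshold": random.random(), "risk": random.random(),
         "turnover": random.random()}
    fitness, risk, turnover = evaluate(s)
    cell = (int(risk * 5), int(turnover * 5))        # 5x5 feature map
    if cell not in archive or fitness > archive[cell][0]:
        archive[cell] = (fitness, s)  # keep the best strategy per niche

print(f"{len(archive)} diverse elite strategies retained")
```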

[333] VAR: Visual Attention Reasoning via Structured Search and Backtracking

Wei Cai, Jian Zhao, Yuchen Yuan, Tianle Zhang, Ming Zhu, Haichuan Tang, Chi Zhang, Xuelong Li

Main category: cs.AI

TL;DR: VAR introduces a structured search framework for multimodal reasoning that decomposes reasoning into traceable evidence grounding and search-based chain-of-thought with backtracking, guided by multi-faceted rewards to reduce hallucinations.

DetailsMotivation: Address limitations of MLLMs including high hallucination tendency and brittle linear reasoning processes that fail in complex tasks.

Method: Recasts grounded reasoning as structured search over reasoning trajectory space with two stages: traceable evidence grounding and search-based CoT generation with backtracking, guided by semantic and geometric self-verification rewards.

Result: VAR-7B sets new SOTA on comprehensive suite of hallucination and safety benchmarks, significantly outperforming open-source models and showing competitive performance against proprietary systems.

Conclusion: The VAR framework effectively addresses hallucination and reasoning limitations in MLLMs through structured search with verification mechanisms, demonstrating strong empirical performance.

Abstract: Multimodal Large Language Models (MLLMs), despite their advances, are hindered by their high hallucination tendency and heavy reliance on brittle, linear reasoning processes, leading to failures in complex tasks. To address these limitations, we introduce Visual Attention Reasoning (VAR), a novel framework that recasts grounded reasoning as a structured search over a reasoning trajectory space. VAR decomposes the reasoning process into two key stages: traceable evidence grounding and search-based chain-of-thought (CoT) generation, which incorporates a backtracking mechanism for self-correction. The search is guided by a multi-faceted reward function with semantic and geometric self-verification components, which penalize outputs that are not faithfully grounded in the visual input. We provide a theoretical analysis for our search strategy, validating its capability to find the correct solution with high probability. Experimental results show that our 7B model, VAR-7B, sets a new state-of-the-art on a comprehensive suite of hallucination and safety benchmarks, significantly outperforming existing open-source models and demonstrating competitive performance against leading proprietary systems.

[334] Leveraging Association Rules for Better Predictions and Better Explanations

Gilles Audemard, Sylvie Coste-Marquis, Pierre Marquis, Mehdi Sabiri, Nicolas Szczepanski

Main category: cs.AI

TL;DR: Combines data mining-derived association rules with tree-based models to improve classification performance and generate more general explanations.

DetailsMotivation: To enhance both predictive performance and explanation quality of tree-based classification models by incorporating knowledge from association rules.

Method: Use data mining to derive association rules (including negations) from data, then leverage these rules to improve decision trees and random forests for classification and explanation tasks.

Result: Experiments show improved predictive performance and smaller explanation sizes for both decision trees and random forests when using the approach.

Conclusion: The integration of association rules with tree-based models offers benefits in both classification accuracy and explanation quality.

Abstract: We present a new approach to classification that combines data and knowledge. In this approach, data mining is used to derive association rules (possibly with negations) from data. Those rules are leveraged to increase the predictive performance of tree-based models (decision trees and random forests) used for a classification task. They are also used to improve the corresponding explanation task through the generation of abductive explanations that are more general than those derivable without taking such rules into account. Experiments show that, for the two tree-based models under consideration, the approach offers benefits in terms of both predictive performance and explanation sizes.
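
One plausible reading of how mined rules can feed a tree model (a sketch under stated assumptions, not the authors' exact pipeline): mine high-confidence pairwise rules, with value 0 acting as a negation, and append rule satisfaction as extra features.

```python
# Naive sketch: mine pairwise association rules from binary data, then expose
# rule satisfaction as additional features for a decision tree.
import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))               # binary attributes
y = ((X[:, 0] == 1) & (X[:, 2] == 0)).astype(int)   # hidden target concept

rules = []
for i, j in combinations(range(X.shape[1]), 2):
    for vi in (0, 1):                               # value 0 acts as a negation
        for vj in (0, 1):
            mask = (X[:, i] == vi) & (X[:, j] == vj)
            if mask.sum() > 20 and y[mask].mean() > 0.9:   # support & confidence
                rules.append((i, vi, j, vj))

rule_feats = np.column_stack([
    ((X[:, i] == vi) & (X[:, j] == vj)).astype(int) for i, vi, j, vj in rules
]) if rules else np.empty((len(X), 0))

X_aug = np.hstack([X, rule_feats])
clf = DecisionTreeClassifier(max_depth=3).fit(X_aug, y)
print(len(rules), "rules mined; tree accuracy:", clf.score(X_aug, y))
```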

[335] Comparative Expressivity for Structured Argumentation Frameworks with Uncertain Rules and Premises

Carlo Proietti, Antonio Yuste-Ginel

Main category: cs.AI

TL;DR: This paper studies qualitative uncertainty modeling in formal argumentation by comparing abstract and structured approaches, introducing an expressivity framework and providing comparative results.

DetailsMotivation: To address the gap between abstract uncertainty models and practical implementations by studying plausible instantiations of abstract models and grounding uncertainty in argument components.

Method: Developed a notion of expressivity that handles both abstract and structured formalisms, and compared expressivity of incomplete abstract argumentation frameworks (with dependencies) versus structured ASPIC+ models.

Result: Presented both negative and positive expressivity results showing differences in expressivity between abstract and structured models of argumentation with uncertainty.

Conclusion: The expressivity analysis reveals important differences between abstract and structured approaches to uncertainty in argumentation, impacting how uncertainty should be modeled and implemented in practical applications.

Abstract: Modelling qualitative uncertainty in formal argumentation is essential both for practical applications and theoretical understanding. Yet, most of the existing works focus on *abstract* models for arguing with uncertainty. Following a recent trend in the literature, we tackle the open question of studying plausible instantiations of these abstract models. To do so, we ground the uncertainty of arguments in their components, structured within rules and premises. Our main technical contributions are: i) the introduction of a notion of expressivity that can handle abstract and structured formalisms, and ii) the presentation of both negative and positive expressivity results, comparing the expressivity of abstract and structured models of argumentation with uncertainty. These results concern incomplete abstract argumentation frameworks (and their extension with dependencies) on the abstract side, and ASPIC+ on the structured side.

[336] Query Decomposition for RAG: Balancing Exploration-Exploitation

Roxana Petcu, Kenton Murray, Daniel Khashabi, Evangelos Kanoulas, Maarten de Rijke, Dawn Lawrie, Kevin Duh

Main category: cs.AI

TL;DR: RAG systems use bandit learning methods to dynamically select informative sub-queries by balancing exploration-exploitation trade-offs, achieving significant improvements in document retrieval precision and downstream generation tasks.

DetailsMotivation: To address the trade-off in RAG systems between retrieving broadly enough to capture relevant material while limiting retrieval to avoid noise and computational costs.

Method: Formulate query decomposition and document retrieval as an exploitation-exploration setting using bandit learning methods, estimating document relevance using rank information and human judgments.

Result: 35% gain in document-level precision, 15% increase in α-nDCG, and better performance on long-form generation tasks.

Conclusion: Bandit learning methods effectively balance exploration-exploitation in RAG systems, significantly improving document retrieval efficiency and downstream generation quality.

Abstract: Retrieval-augmented generation (RAG) systems address complex user requests by decomposing them into subqueries, retrieving potentially relevant documents for each, and then aggregating them to generate an answer. Efficiently selecting informative documents requires balancing a key trade-off: (i) retrieving broadly enough to capture all the relevant material, and (ii) limiting retrieval to avoid excessive noise and computational cost. We formulate query decomposition and document retrieval in an exploitation-exploration setting, where retrieving one document at a time builds a belief about the utility of a given sub-query and informs the decision to continue exploiting or exploring an alternative. We experiment with a variety of bandit learning methods and demonstrate their effectiveness in dynamically selecting the most informative sub-queries. Our main finding is that estimating document relevance using rank information and human judgments yields a 35% gain in document-level precision, 15% increase in α-nDCG, and better performance on the downstream task of long-form generation.
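
The exploitation-exploration loop maps directly onto a UCB1 bandit whose arms are sub-queries and whose reward is the judged relevance of each newly retrieved document. A toy sketch with assumed relevance probabilities:

```python
# Sketch of the bandit loop: pick which sub-query gets the next retrieval,
# balancing observed relevance (exploitation) against uncertainty (exploration).
import math, random

random.seed(0)
SUBQUERIES = {"sq1": 0.8, "sq2": 0.3, "sq3": 0.5}   # hidden P(next doc is relevant)
counts = {q: 0 for q in SUBQUERIES}
rewards = {q: 0.0 for q in SUBQUERIES}

for t in range(1, 101):                              # retrieval budget: 100 docs
    def ucb(q):
        if counts[q] == 0:
            return float("inf")                      # try every sub-query once
        return rewards[q] / counts[q] + math.sqrt(2 * math.log(t) / counts[q])
    q = max(SUBQUERIES, key=ucb)
    relevant = random.random() < SUBQUERIES[q]       # retrieve one doc, judge it
    counts[q] += 1
    rewards[q] += relevant

print(counts)   # most of the budget flows to the informative sub-query
```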

[337] Sherlock Your Queries: Learning to Ask the Right Questions for Dialogue-Based Retrieval

Dong Yun, Marco Schouten, Dim Papadopoulos

Main category: cs.AI

TL;DR: SherlockLLM is a dialogue-driven retrieval framework that uses Reinforcement Learning to learn optimal questioning strategies, enabling efficient user intent clarification without requiring large annotated dialogue datasets.

DetailsMotivation: User queries in information retrieval are often ambiguous, and existing dialogue-based interactive retrieval systems lack explicit strategies to ask the most informative questions, making them inefficient.

Method: Proposes SherlockLLM framework where an agent is trained via Reinforcement Learning to generate sequences of binary questions that efficiently narrow down the search space, avoiding the need for large-scale annotated dialogue data.

Result: Experimental results show SherlockLLM matches strong baselines on structured tasks and approaches theoretical optimal defined by binary search. On unstructured tasks, it significantly outperforms baselines, demonstrating effective information-seeking dialogue policy learning.

Conclusion: SherlockLLM is a robust and efficient solution for dialogue-driven retrieval that learns highly effective questioning strategies through Reinforcement Learning, particularly excelling in challenging unstructured tasks.

Abstract: User queries in information retrieval are often ambiguous, making it challenging for systems to identify a user’s target from a single query. While recent dialogue-based interactive retrieval systems can clarify user intent, they are inefficient as they often lack an explicit strategy to ask the most informative questions. To address this limitation, we propose SherlockLLM, a dialogue-driven retrieval framework that learns an optimal questioning strategy via Reinforcement Learning (RL) and avoids the need for large-scale annotated dialogue data. In our framework, an agent is trained to generate a sequence of binary questions to efficiently narrow down the search space. To validate our approach, we introduce a benchmark with both structured and unstructured tasks. Experimental results show that SherlockLLM is a robust and efficient solution. On the structured tasks, its performance matches strong baselines and approaches the theoretical optimal defined by binary search. On the challenging unstructured task, our agent significantly outperforms these baselines, showcasing its ability to learn a highly effective information-seeking dialogue policy.
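
The binary-search optimum the abstract mentions is concrete: if every question halves the candidate set, k candidates need about log2(k) turns. A toy sketch with an oracle user:

```python
# Sketch of the theoretical optimum: binary questions that halve the candidate
# set each turn, so 32 candidates are resolved in ceil(log2(32)) = 5 turns.
import math

candidates = list(range(32))          # 32 possible user targets

def ask(question_set, target):        # oracle user answers "is it in this set?"
    return target in question_set

target, turns = 23, 0
while len(candidates) > 1:
    half = set(candidates[: len(candidates) // 2])   # most informative split
    candidates = [c for c in candidates if ask(half, target) == (c in half)]
    turns += 1

print(f"found {candidates[0]} in {turns} turns (optimum: {math.ceil(math.log2(32))})")
```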

[338] Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation

Patterson Hsieh, Jerry Yeh, Mao-Chi He, Wen-Han Hsieh, Elvis Hsieh

Main category: cs.AI

TL;DR: ALGOS is a segmentation-and-reasoning system for harmful algal bloom monitoring that combines remote sensing image analysis with severity estimation using vision-language models.

DetailsMotivation: Climate change is intensifying harmful algal blooms (cyanobacteria) that threaten aquatic ecosystems and human health, while traditional monitoring methods are labor-intensive and limited in coverage.

Method: Integrates GeoSAM-assisted human evaluation for segmentation mask curation and fine-tunes vision-language models on severity prediction using NASA’s Cyanobacteria Aggregated Manual Labels (CAML).

Result: ALGOS achieves robust performance on both segmentation and severity-level estimation tasks.

Conclusion: The system paves the way toward practical and automated cyanobacterial monitoring systems for scalable environmental monitoring.

Abstract: Climate change is intensifying the occurrence of harmful algal blooms (HABs), particularly cyanobacteria, which threaten aquatic ecosystems and human health through oxygen depletion, toxin release, and disruption of marine biodiversity. Traditional monitoring approaches, such as manual water sampling, remain labor-intensive and limited in spatial and temporal coverage. Recent advances in vision-language models (VLMs) for remote sensing have shown potential for scalable AI-driven solutions, yet challenges remain in reasoning over imagery and quantifying bloom severity. In this work, we introduce ALGae Observation and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB monitoring that combines remote sensing image understanding with severity estimation. Our approach integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes a vision-language model for severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML) from NASA. Experiments demonstrate that ALGOS achieves robust performance on both segmentation and severity-level estimation, paving the way toward practical and automated cyanobacterial monitoring systems.

[339] Decoding Funded Research: Comparative Analysis of Topic Models and Uncovering the Effect of Gender and Geographic Location

Shirin Tavakoli Kafiabad, Andrea Schiffauerova, Ashkan Ebadi

Main category: cs.AI

TL;DR: This study analyzes 18 years of Canadian research proposals to optimize scientific investment by comparing three topic modeling approaches (LDA, STM, BERTopic) and introduces COFFEE algorithm for BERTopic covariate analysis.

DetailsMotivation: To understand evolving research trends and demographic/geographical forces shaping them, particularly in light of equity, diversity, and inclusion commitments, for optimizing national scientific investment.

Method: Comparative evaluation of LDA, STM, and BERTopic topic modeling approaches on 18 years of NSERC research proposals (2005-2022), plus development of COFFEE algorithm for covariate effect estimation in BERTopic.

Result: BERTopic outperformed other models by identifying more granular, coherent, and emergent themes (e.g., AI expansion). COFFEE-enabled covariate analysis revealed distinct provincial research specializations and consistent gender-based thematic patterns.

Conclusion: The insights provide a robust empirical foundation for funding organizations to formulate more equitable and impactful funding strategies, enhancing scientific ecosystem effectiveness.

Abstract: Optimizing national scientific investment requires a clear understanding of evolving research trends and the demographic and geographical forces shaping them, particularly in light of commitments to equity, diversity, and inclusion. This study addresses this need by analyzing 18 years (2005-2022) of research proposals funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). We conducted a comprehensive comparative evaluation of three topic modelling approaches: Latent Dirichlet Allocation (LDA), Structural Topic Modelling (STM), and BERTopic. We also introduced a novel algorithm, named COFFEE, designed to enable robust covariate effect estimation for BERTopic. This advancement addresses a significant gap, as BERTopic lacks a native function for covariate analysis, unlike the probabilistic STM. Our findings highlight that while all models effectively delineate core scientific domains, BERTopic outperformed by consistently identifying more granular, coherent, and emergent themes, such as the rapid expansion of artificial intelligence. Additionally, the covariate analysis, powered by COFFEE, confirmed distinct provincial research specializations and revealed consistent gender-based thematic patterns across various scientific disciplines. These insights offer a robust empirical foundation for funding organizations to formulate more equitable and impactful funding strategies, thereby enhancing the effectiveness of the scientific ecosystem.

[340] Discovering the curriculum with AI: A proof-of-concept demonstration with an intelligent tutoring system for teaching project selection

Lovis Heindrich, Falk Lieder

Main category: cs.AI

TL;DR: AI discovers and teaches optimal project selection strategies to executives, improving decision-making in real-world scenarios.

DetailsMotivation: Human decision-making is often suboptimal due to cognitive limitations. This research extends AI-driven heuristic discovery from artificial tasks to real-world executive decisions about project selection.

Method: Developed MGPS computational method to automatically discover project selection strategies optimized for real people, and created an intelligent tutor to teach these procedures. Evaluated through computational benchmarks and training experiments with control groups.

Result: MGPS outperformed state-of-the-art methods with better computational efficiency. Participants using the intelligent tutor learned significantly better project selection strategies than control groups.

Conclusion: AI can automate the discovery and formalization of cognitive strategies for intelligent tutoring systems, enhancing real-world decision-making capabilities.

Abstract: The decisions of individuals and organizations are often suboptimal because fully rational decision-making is too demanding in the real world. Recent work suggests that some errors can be prevented by leveraging artificial intelligence to discover and teach clever heuristics. So far, this line of research has been limited to simplified, artificial decision-making tasks. This article is the first to extend this approach to a real-world decision problem, namely, executives deciding which project their organization should launch next. We develop a computational method (MGPS) that automatically discovers project selection strategies that are optimized for real people, and we develop an intelligent tutor that teaches the discovered project selection procedures. We evaluated MGPS on a computational benchmark and tested the intelligent tutor in a training experiment with two control conditions. MGPS outperformed a state-of-the-art method and was more computationally efficient. Moreover, people who practiced with our intelligent tutor learned significantly better project selection strategies than the control groups. These findings suggest that AI could be used to automate the process of discovering and formalizing the cognitive strategies taught by intelligent tutoring systems.

[341] LENS: Large Pre-trained Transformer for Exploring Financial Time Series Regularities

Yuanjian Xu, Anxian Liu, Jianing Hao, Zhenzhuo Li, Shichang Meng, Guang Zhang

Main category: cs.AI

TL;DR: LENS is a pre-trained foundation model for financial time series that addresses domain-specific challenges like stochasticity and low signal-to-noise ratios through specialized architecture and noise mitigation techniques.

DetailsMotivation: Traditional methods and general pre-training approaches are ineffective for financial time series due to inherent stochasticity and low signal-to-noise ratios in financial systems, creating a need for domain-specific foundation models.

Method: LENS uses a carefully crafted model architecture with an invertible embedding module to capture financial stochastic system complexity and mitigate noise during pre-training. The approach is theoretically justified.

Result: Pre-trained on 100 billion financial observations, LENS achieves exceptional performance across a wide range of critical downstream tasks in finance.

Conclusion: LENS successfully bridges the gap for financial time series modeling and provides practical insights for developing pre-trained models in high-noise environments, advancing this important research domain.

Abstract: Modeling large-scale time series has gained significant attention in recent years. However, its direct application in finance remains challenging due to substantial differences in data characteristics across domains. Specifically, financial systems feature inherent stochasticity and low signal-to-noise ratios, rendering traditional methods and pre-training approaches ineffective. This underscores the urgent need for a foundation model tailored to financial time series. To bridge this gap, we propose **LENS**, a pre-trained model for this domain. **LENS** effectively captures the complexity of financial stochastic systems through a carefully crafted model architecture and mitigates noise during pre-training by using an invertible embedding module. We provide a rigorous theoretical explanation of the model’s effectiveness and validate its performance through extensive experiments. Pre-trained on a dataset comprising 100 billion financial observations, **LENS** achieves exceptional results across a wide range of critical downstream tasks. Moreover, our work offers practical insights into developing pre-trained time series models in high-noise environments, paving the way for further advancements in this pivotal research domain.

[342] InternLM2.5-StepProver: Advancing Automated Theorem Proving via Critic-Guided Search

Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Zheng Yuan, Wenwei Zhang, Dahua Lin, Kai Chen

Main category: cs.AI

TL;DR: The paper proposes a prover-critic framework for LLM-based mathematical theorem proving, where a critic model captures preference information from tactic trajectories to guide the prover’s search, significantly improving performance from 59.4% to 65.9%.

DetailsMotivation: Current LLM-based theorem proving methods using iterative tactic construction often ignore preference information in existing tactic trajectories, which hinders the search for deeper proofs.

Method: A prover-critic framework where a critic model captures preference information from tactic trajectories to guide the prover’s search at runtime, followed by large-scale expert iteration with over 20,000 CPU days to fine-tune both models.

Result: The trained InternLM2.5-StepProver critic significantly boosts the prover model’s performance from 59.4% to 65.9%, demonstrating the effectiveness of the approach.

Conclusion: The prover-critic framework with preference-guided search and expert iteration is an effective method for improving LLM-based mathematical theorem proving, with the critic playing a crucial role in enhancing proof search capabilities.

Abstract: Large Language Models (LLMs) have emerged as powerful tools in mathematical theorem proving, particularly when utilizing formal languages such as LEAN. A prevalent proof method involves the LLM prover iteratively constructing the proof tactic by tactic, typically following a best-first search scheme. However, this method often ignores the critical preference information inside the existing tactic trajectories, hindering the search for deeper proofs. We propose an intuitive yet effective method, which utilizes a critic model to capture the preference information and to guide the search of the prover model at runtime. Given the prover-critic framework, a large-scale expert iteration with more than 20,000 CPU days is then applied to further fine-tune the prover and the critic. The trained InternLM2.5-StepProver critic significantly boosts the performance of the prover model (59.4% to 65.9%). We also analyze the impact of the critic on various aspects of the theorem proving process during expert iteration, providing insights into its effectiveness. We open-source our models and searched proofs at https://github.com/InternLM/InternLM-Math and https://huggingface.co/datasets/internlm/Lean-Workbook.
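
Critic-guided best-first search is compact to sketch: a max-heap keyed by critic scores decides which proof state the prover expands next. The prover, critic, and stopping test below are toy stand-ins:

```python
# Sketch of critic-guided best-first proof search: the prover proposes tactics,
# the critic scores the resulting states, and the best state is expanded next.
import heapq, itertools

def propose_tactics(state):           # stand-in for the prover LLM
    return [state + t for t in (" intro", " simp", " apply h")]

def critic_score(state):              # stand-in for the learned critic
    return -len(state) + 10 * state.count("simp")   # toy preference signal

tie = itertools.count()               # tiebreaker so heap never compares states
frontier = [(-critic_score("|- goal"), next(tie), "|- goal")]
for _ in range(20):                   # search budget
    _, _, state = heapq.heappop(frontier)
    if "simp simp" in state:          # toy stand-in for "proof closed"
        print("proved via:", state)
        break
    for nxt in propose_tactics(state):
        heapq.heappush(frontier, (-critic_score(nxt), next(tie), nxt))
```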

[343] Do LLMs Strategically Reveal, Conceal, and Infer Information? A Theoretical and Empirical Analysis in The Chameleon Game

Mustafa O. Karabag, Jan Sobotka, Ufuk Topcu

Main category: cs.AI

TL;DR: LLM-based agents struggle with information control in hidden-identity games, revealing too much information to adversaries while failing to conceal secrets effectively.

DetailsMotivation: To investigate whether LLMs have the necessary information control and decision-making capabilities for settings with non-cooperative parties, where agents need to conceal information from adversaries, reveal information to cooperators, and infer others' characteristics.

Method: Use LLM agents to play The Chameleon game, a language-based hidden-identity game where non-chameleon agents try to identify the chameleon without revealing a secret. Conduct theoretical analysis of strategies and empirical testing with GPT, Gemini 2.5 Pro, Llama 3.1, and Qwen3 models.

Result: Non-chameleon LLM agents can identify the chameleon but fail to conceal the secret from it, with winning probabilities far below even trivial strategies. Information-revealing levels are linearly encoded in LLMs’ internal representations.

Conclusion: LLM-based agents reveal excessive information to unknown agents. While instructions alone are ineffective for concealment, steering internal representations along the linear direction of information-revealing levels can reliably induce concealing behavior.

Abstract: Large language model-based (LLM-based) agents have become common in settings that include non-cooperative parties. In such settings, agents’ decision-making needs to conceal information from their adversaries, reveal information to their cooperators, and infer information to identify the other agents’ characteristics. To investigate whether LLMs have these information control and decision-making capabilities, we make LLM agents play the language-based hidden-identity game, The Chameleon. In this game, a group of non-chameleon agents who do not know each other aim to identify the chameleon agent without revealing a secret. The game requires the aforementioned information control capabilities both as a chameleon and a non-chameleon. We begin with a theoretical analysis for a spectrum of strategies, from concealing to revealing, and provide bounds on the non-chameleons’ winning probability. The empirical results with GPT, Gemini 2.5 Pro, Llama 3.1, and Qwen3 models show that while non-chameleon LLM agents identify the chameleon, they fail to conceal the secret from the chameleon, and their winning probability is far from the levels of even trivial strategies. Based on these empirical results and our theoretical analysis, we deduce that LLM-based agents may reveal excessive information to agents of unknown identities. Interestingly, we find that, when instructed to adopt an information-revealing level, this level is linearly encoded in the LLM’s internal representations. While the instructions alone are often ineffective at making non-chameleon LLMs conceal, we show that steering the internal representations in this linear direction directly can reliably induce concealing behavior.
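
Steering along a linear direction can be sketched with a forward hook that shifts one layer's hidden states. The direction below is random for illustration; the paper derives it from the model's internal representations of instructed revealing levels, and GPT-2 stands in for the models studied:

```python
# Sketch of activation steering: shift one transformer block's hidden states
# along a fixed direction during generation. Direction and layer are assumed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

direction = torch.randn(model.config.n_embd)        # placeholder steering vector
direction = direction / direction.norm()
alpha = -4.0                                         # negative: toward concealing

def steer(module, inputs, output):
    hidden = output[0]                               # (batch, seq, n_embd)
    return (hidden + alpha * direction,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)   # steer layer 6
ids = tok("My secret word is", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=10, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))
```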

[344] Modeling Human Beliefs about AI Behavior for Scalable Oversight

Leon Lang, Patrick Forré

Main category: cs.AI

TL;DR: Modeling human evaluators’ beliefs to improve AI value learning when AI exceeds human capabilities, using belief model covering and foundation model representations to interpret feedback more reliably.

DetailsMotivation: Scalable oversight is critical as AI systems advance beyond human capabilities, but human evaluators may form incorrect beliefs about AI behavior in complex tasks, leading to unreliable feedback and poor value inference.

Method: Formalize human belief models, analyze their theoretical role in value learning, introduce “belief model covering” as a relaxation, and propose using internal representations of adapted foundation models to mimic human evaluators’ beliefs.

Result: The approach enables learning correct values from human feedback even when evaluators misunderstand AI behavior, suggesting belief modeling can improve value learning.

Conclusion: Modeling human beliefs can improve value learning and outlines practical research directions for implementing scalable oversight approaches.

Abstract: As AI systems advance beyond human capabilities, scalable oversight becomes critical: how can we supervise AI that exceeds our abilities? A key challenge is that human evaluators may form incorrect beliefs about AI behavior in complex tasks, leading to unreliable feedback and poor value inference. To address this, we propose modeling evaluators’ beliefs to interpret their feedback more reliably. We formalize human belief models, analyze their theoretical role in value learning, and characterize when ambiguity remains. To reduce reliance on precise belief models, we introduce “belief model covering” as a relaxation. This motivates our preliminary proposal to use the internal representations of adapted foundation models to mimic human evaluators’ beliefs. These representations could be used to learn correct values from human feedback even when evaluators misunderstand the AI’s behavior. Our work suggests that modeling human beliefs can improve value learning and outlines practical research directions for implementing this approach to scalable oversight.

[345] A representational framework for learning and encoding structurally enriched trajectories in complex agent environments

Corina Catarau-Cotutiu, Esther Mondragon, Eduardo Alonso

Main category: cs.AI

TL;DR: The paper proposes Structurally Enriched Trajectories (SETs) to enhance AI agent decision-making by incorporating hierarchical relations between objects, interactions, and affordances into trajectory representations, improving generalization across domains.

DetailsMotivation: Traditional state-action transition representations lack structural richness, compromising AI agents' ability to make optimal decisions and generalize across complex scenarios and different domains.

Method: Proposes Structurally Enriched Trajectories (SETs) as multi-level graphs that encode hierarchical relations between objects, interactions, and affordances. Implements SETLE architecture with heterogeneous graph-based memory structure for learning relational dependencies.

Result: SETLE enables agents to recognize task-relevant structural patterns across CREATE and MiniGrid environments. Integration with reinforcement learning shows measurable performance improvements, including breakthrough success rates in complex, sparse-reward tasks.

Conclusion: Structurally enriched representations through SETs and SETLE architecture significantly enhance AI agents’ generalization capabilities and performance in complex decision-making scenarios.

Abstract: The ability of artificial intelligence agents to make optimal decisions and generalise them to different domains and tasks is compromised in complex scenarios. One way to address this issue has focused on learning efficient representations of the world and on how the actions of agents affect them in state-action transitions. Whereas such representations are procedurally efficient, they lack structural richness. To address this problem, we propose to enhance the agent’s ontology and extend the traditional conceptualisation of trajectories to provide a more nuanced view of task execution. Structurally Enriched Trajectories (SETs) extend the encoding of sequences of states and their transitions by incorporating hierarchical relations between objects, interactions, and affordances. SETs are built as multi-level graphs, providing a detailed representation of the agent dynamics and a transferable functional abstraction of the task. SETs are integrated into an architecture, Structurally Enriched Trajectory Learning and Encoding (SETLE), that employs a heterogeneous graph-based memory structure of multi-level relational dependencies essential for generalisation. We demonstrate that SETLE can support downstream tasks, enabling agents to recognise task-relevant structural patterns across CREATE and MiniGrid environments. Finally, we integrate SETLE with reinforcement learning and show measurable improvements in downstream performance, including breakthrough success rates in complex, sparse-reward tasks.

[346] HyperGraphRAG: Retrieval-Augmented Generation via Hypergraph-Structured Knowledge Representation

Haoran Luo, Haihong E, Guanting Chen, Yandan Zheng, Xiaobao Wu, Yikai Guo, Qika Lin, Yu Feng, Zemin Kuang, Meina Song, Yifan Zhu, Luu Anh Tuan

Main category: cs.AI

TL;DR: HyperGraphRAG is a novel hypergraph-based RAG method that addresses limitations of standard chunk-based RAG and binary graph-based RAG by representing n-ary relations via hyperedges, outperforming existing methods in accuracy, efficiency, and quality.

DetailsMotivation: Existing graph-based RAG approaches are constrained by binary relations (edges connecting only two entities), limiting their ability to represent the n-ary relations (n >= 2) found in real-world knowledge.

Method: HyperGraphRAG uses hypergraph-based knowledge representation with hyperedges to capture n-ary relational facts, consisting of three main components: knowledge hypergraph construction, retrieval, and generation.

Result: Experiments across medicine, agriculture, computer science, and law domains demonstrate that HyperGraphRAG outperforms both standard RAG and previous graph-based RAG methods in answer accuracy, retrieval efficiency, and generation quality.

Conclusion: HyperGraphRAG provides a more effective approach for representing complex n-ary relations in knowledge representation for retrieval-augmented generation tasks, with publicly available data and code.

Abstract: Standard Retrieval-Augmented Generation (RAG) relies on chunk-based retrieval, whereas GraphRAG advances this approach by graph-based knowledge representation. However, existing graph-based RAG approaches are constrained by binary relations, as each edge in an ordinary graph connects only two entities, limiting their ability to represent the n-ary relations (n >= 2) in real-world knowledge. In this work, we propose HyperGraphRAG, a novel hypergraph-based RAG method that represents n-ary relational facts via hyperedges, and consists of knowledge hypergraph construction, retrieval, and generation. Experiments across medicine, agriculture, computer science, and law demonstrate that HyperGraphRAG outperforms both standard RAG and previous graph-based RAG methods in answer accuracy, retrieval efficiency, and generation quality. Our data and code are publicly available at https://github.com/LHRLAB/HyperGraphRAG.
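
The core data structure is simple to sketch: each n-ary fact is a single hyperedge (an entity set plus its text), and retrieval ranks hyperedges by entity overlap with the query. The facts below are invented:

```python
# Minimal sketch of hyperedge-based retrieval: one hyperedge per n-ary fact,
# ranked by overlap between its entity set and the query's entities.
hyperedges = [
    ({"metformin", "type-2 diabetes", "500mg", "adults"},
     "Metformin 500mg is a first-line therapy for type-2 diabetes in adults."),
    ({"metformin", "vitamin B12", "deficiency"},
     "Long-term metformin use can cause vitamin B12 deficiency."),
]

def retrieve(query_entities, k=1):
    scored = [(len(ents & query_entities), text) for ents, text in hyperedges]
    return [t for s, t in sorted(scored, reverse=True)[:k] if s > 0]

# A binary graph would need several pairwise edges to encode the first 4-ary
# fact; the hyperedge keeps drug, dosage, disease, and population in one unit.
print(retrieve({"metformin", "type-2 diabetes"}))
```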

[347] Improving Human-AI Coordination through Online Adversarial Training and Generative Models

Paresh Chaudhary, Yancheng Liang, Daphne Chen, Simon S. Du, Natasha Jaques

Main category: cs.AI

TL;DR: GOAT (Generative Online Adversarial Training) is a novel method that combines pretrained generative models with adversarial training to create robust cooperative AI agents that can generalize to diverse human behaviors.

DetailsMotivation: Cooperative AI needs to work with diverse humans in economically valuable tasks, but requires training on diverse human behavior data. Adversarial training can generate dynamic data but is difficult to apply in cooperative settings.

Method: GOAT uses a frozen pretrained generative model to simulate valid cooperative policies, then searches the latent space for strategies where the learning agent underperforms, creating adversarial training scenarios while maintaining realistic coordination.

Result: GOAT achieves state-of-the-art performance on the Overcooked benchmark when evaluated with real human partners, demonstrating effective generalization to diverse human behaviors.

Conclusion: The combination of generative models and adversarial training enables robust cooperative AI that can adapt to novel human partners, with GOAT showing promising results in human-AI collaboration tasks.

Abstract: Being able to cooperate with diverse humans is an important component of many economically valuable AI tasks, from household robotics to autonomous driving. However, generalizing to novel humans requires training on data that captures the diversity of human behaviors. Adversarial training is a promising method that allows dynamic data generation and ensures that agents are robust. It creates a feedback loop where the agent’s performance influences the generation of new adversarial data, which can be used immediately to train the agent. However, adversarial training is difficult to apply in a cooperative task; how can we train an adversarial cooperator? We propose a novel strategy that combines a pretrained generative model to simulate valid cooperative agent policies with adversarial training to maximize regret. We call our method GOAT: Generative Online Adversarial Training. In this framework, the GOAT dynamically searches the latent space of the generative model for coordination strategies where the learning policy, the Cooperator agent, underperforms. GOAT enables better generalization by exposing the Cooperator to various challenging interaction scenarios. We maintain realistic coordination strategies by keeping the generative model frozen, thus avoiding adversarial exploitation. We evaluate GOAT with real human partners, and the results demonstrate state-of-the-art performance on the Overcooked benchmark, highlighting its effectiveness in generalizing to diverse human behaviors.

[348] Pretraining a Shared Q-Network for Data-Efficient Offline Reinforcement Learning

Jongchan Park, Mingyu Park, Donghwan Lee

Main category: cs.AI

TL;DR: A plug-and-play pretraining method for offline RL that enhances data efficiency by initializing Q-network features through supervised next-state prediction.

DetailsMotivation: Offline RL requires large datasets which are expensive to collect, especially when environment interaction is restricted. There's a need for better data efficiency similar to sample efficiency in online RL.

Method: Proposes a shared Q-network structure that outputs both next-state predictions and Q-values. Pretrains the network through supervised regression to predict next states, then trains with various offline RL methods.

Result: Method enhances performance of existing offline RL methods on D4RL, Robomimic and V-D4RL benchmarks. Significantly boosts data-efficient offline RL across various data qualities and distributions. With only 10% of the dataset, it outperforms standard algorithms using full datasets.

Conclusion: The proposed pretraining approach effectively improves data efficiency in offline RL, enabling better performance with minimal datasets and working well across different data conditions.

Abstract: Offline reinforcement learning (RL) aims to learn a policy from a static dataset without further interactions with the environment. Collecting sufficiently large datasets for offline RL is exhausting, since data collection requires a colossal number of environment interactions and becomes tricky when interaction with the environment is restricted. Hence, how an agent learns the best policy from a minimal static dataset is a crucial issue in offline RL, analogous to the sample efficiency problem in online RL. In this paper, we propose a simple yet effective plug-and-play pretraining method that initializes the features of a Q-network to enhance data efficiency in offline RL. Specifically, we introduce a shared Q-network structure that outputs predictions of the next state and the Q-value. We pretrain the shared Q-network through a supervised regression task that predicts the next state, and then train it using diverse offline RL methods. Through extensive experiments, we empirically demonstrate that our method enhances the performance of existing popular offline RL methods on the D4RL, Robomimic and V-D4RL benchmarks. Furthermore, we show that our method significantly boosts data-efficient offline RL across various data qualities and data distributions on the D4RL and ExoRL benchmarks. Notably, our method, using only 10% of the dataset, outperforms standard algorithms even with full datasets.
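
To make the shared-network idea concrete, here is a minimal PyTorch sketch, assuming continuous state and action vectors; the layer sizes and two-head layout are illustrative, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class SharedQNetwork(nn.Module):
    """A shared torso feeding a next-state head (pretraining) and a Q head."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden, state_dim)  # stage 1 target
        self.q_head = nn.Linear(hidden, 1)                   # stage 2 target

    def forward(self, state, action):
        h = self.torso(torch.cat([state, action], dim=-1))
        return self.next_state_head(h), self.q_head(h)

def pretrain(net, batches, lr=3e-4):
    """Stage 1: supervised next-state regression initializes shared features."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for state, action, next_state in batches:
        pred_next, _ = net(state, action)
        loss = nn.functional.mse_loss(pred_next, next_state)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

After pretraining, the torso and Q head are handed to whichever offline RL algorithm is in use, which is what makes the method plug-and-play.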

[349] MTRE: Multi-Token Reliability Estimation for Hallucination Detection in VLMs

Geigh Zollicoffer, Minh Vu, Manish Bhattarai

Main category: cs.AI

TL;DR: MTRE is a lightweight hallucination detection method that analyzes the first ten tokens’ logits using KL divergence and multi-token log-likelihood ratios, achieving significant improvements over existing methods.

DetailsMotivation: Current hallucination detectors only analyze the first token's logit, missing richer signals in early token distributions. Hallucinations often emerge gradually over multiple tokens as inconsistencies accumulate.

Method: Multi-Token Reliability Estimation (MTRE) aggregates logits from the first ten tokens using KL divergence between hallucinated/non-hallucinated tokens, multi-token log-likelihood ratios, and self-attention mechanisms.

Result: MTRE achieves 9.4% higher accuracy and 14.8% higher AUROC than standard detection methods across multiple benchmarks including MAD-Bench, MM-SafetyBench, MathVista, and compositional-geometry tasks.

Conclusion: Analyzing complete sequences of early token logits provides substantially more diagnostic information for hallucination detection, establishing MTRE as a new state-of-the-art method for open-source VLMs.

Abstract: Vision-language models (VLMs) now rival human performance on many multimodal tasks, yet they still hallucinate objects or generate unsafe text. Current hallucination detectors, e.g., single-token linear probing (LP) and PTrue, typically analyze only the logit of the first generated token or just its highest-scoring component, overlooking richer signals embedded within earlier token distributions. We demonstrate that analyzing the complete sequence of early logits potentially provides substantially more diagnostic information. We emphasize that hallucinations may only emerge after several tokens, as subtle inconsistencies accumulate over time. By analyzing the Kullback-Leibler (KL) divergence between logits corresponding to hallucinated and non-hallucinated tokens, we underscore the importance of incorporating later-token logits to more accurately capture the reliability dynamics of VLMs. In response, we introduce Multi-Token Reliability Estimation (MTRE), a lightweight, white-box method that aggregates logits from the first ten tokens using multi-token log-likelihood ratios and self-attention. Despite the challenges posed by large vocabulary sizes and long logit sequences, MTRE remains efficient and tractable. Across MAD-Bench, MM-SafetyBench, MathVista, and four compositional-geometry benchmarks, MTRE achieves a 9.4% gain in accuracy and a 14.8% gain in AUROC over standard detection methods, establishing a new state of the art in hallucination detection for open-source VLMs.
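
The aggregation step can be illustrated with a simple numeric sketch: given per-token log-likelihoods of a generation under two reference models (one fit to faithful outputs, one to hallucinated ones), sum the log-likelihood ratio over the first ten tokens and threshold it. This conveys only the flavor of a multi-token log-likelihood ratio, not MTRE’s exact estimator, which also uses self-attention.

```python
import numpy as np

def multi_token_llr(logp_faithful, logp_hallucinated, n_tokens=10):
    """Sum the per-token log-likelihood ratio over the first n tokens."""
    n = min(n_tokens, len(logp_faithful))
    llr = np.asarray(logp_faithful[:n]) - np.asarray(logp_hallucinated[:n])
    return float(llr.sum())  # calibrate the decision threshold on held-out data

score = multi_token_llr([-1.2, -0.8, -2.0], [-0.4, -1.1, -1.9])
print("flag as hallucination" if score < 0.0 else "looks reliable")
```

The point of summing across ten tokens rather than probing only the first is exactly the failure mode the abstract describes: small per-token inconsistencies that accumulate into a detectable signal.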

[350] SOCIA: Joint Structure-Parameter Co-Optimization for Automated Simulator Construction

Yuncheng Hua, Sion Weatherhead, Mehdi Jafari, Jianxiang Xie, Ji Miao, Hao Xue, Flora D. Salim

Main category: cs.AI

TL;DR: SOCIA is a framework for building credible simulators through joint structure-parameter co-optimization, combining Bayesian Optimization and Simulation-Based Inference to handle both in-distribution fitting and out-of-distribution robustness.

DetailsMotivation: Traditional simulator construction faces challenges with tightly coupled structure design, parameter calibration, and OOD robustness, making it difficult to build credible simulators from data.

Method: SOCIA treats simulator construction as joint structure-parameter co-optimization, using Bayesian Optimization for sample-efficient calibration and Simulation-Based Inference for uncertainty-aware fitting, with diagnostics triggering structural edits in an outer refinement loop.

Result: SOCIA consistently outperforms strong baselines across three diverse tasks, excelling at both in-distribution fitting and out-of-distribution shift. Ablation studies show near-monotonic degradation when weakening structure, calibration design, or tuning.

Conclusion: Unified structure-parameter optimization is necessary for building credible simulators, and SOCIA provides an effective framework that addresses the tight coupling between structure design, parameter calibration, and OOD robustness.

Abstract: Building credible simulators from data is difficult because structure design, parameter calibration, and out-of-distribution (OOD) robustness are tightly coupled. We introduce SOCIA (Simulation Orchestration for Computational Intelligence with Agents), a framework that treats simulator construction as joint structure-parameter co-optimization: it elicits mechanism-rich blueprints, exposes explicit tunable parameters, and instantiates a calibration schema, producing an executable simulator with built-in calibration hooks. SOCIA couples Bayesian Optimization for sample-efficient point calibration with Simulation-Based Inference for uncertainty-aware fitting; diagnostics trigger targeted structural edits in an outer refinement loop to co-optimize design and parameters under tight budgets. Across three diverse tasks, SOCIA consistently outperforms strong baselines, excelling on both in-distribution (ID) fitting and OOD shift. Ablations that weaken structure, calibration design, or tuning yield near-monotone degradations, underscoring the necessity of unified structure-parameter optimization. We will release the code soon.

[351] Can Agents Fix Agent Issues?

Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, Yiling Lou

Main category: cs.AI

TL;DR: This paper introduces AGENTISSUE-BENCH, a benchmark for evaluating software engineering agents’ ability to resolve real-world issues in LLM-based agent systems, revealing their limited effectiveness (3.33%-12.67% success rates).

DetailsMotivation: LLM-based agent systems are widely used but prone to bugs and evolving requirements, making automatic issue resolution crucial. Current SE agents show promise for traditional software but their effectiveness on agent systems remains unknown.

Method: Manually analyzed 201 real agent issues, identified common categories, and constructed AGENTISSUE-BENCH with 50 reproducible agent issue resolution tasks including executable environments and failure-triggering tests.

Result: State-of-the-art SE agents achieved only 3.33% - 12.67% resolution rates on AGENTISSUE-BENCH, demonstrating limited effectiveness in resolving agent system issues.

Conclusion: Agent system maintenance presents unique challenges distinct from traditional software, highlighting the need for developing more advanced SE agents specifically designed for resolving agent issues.

Abstract: LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AGENTISSUE-BENCH, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AGENTISSUE-BENCH and reveal their limited effectiveness (i.e., with only 3.33% - 12.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/ .

[352] VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin

Main category: cs.AI

TL;DR: VIKI-Bench is the first hierarchical benchmark for embodied multi-agent cooperation with three levels: agent activation, task planning, and trajectory perception. VIKI-R is a two-stage framework that fine-tunes VLMs and uses reinforcement learning, achieving superior performance and enabling compositional cooperation patterns.

DetailsMotivation: Current VLM-based approaches for multi-agent cooperation are limited in supporting diverse embodiment types, and there's a need for better benchmarks and methods for visual-driven cooperation in embodied AI systems.

Method: Proposed VIKI-Bench benchmark with hierarchical structure and diverse robot embodiments, and VIKI-R framework that fine-tunes pretrained VLMs using Chain-of-Thought demonstrations followed by reinforcement learning with multi-level reward signals.

Result: VIKI-R significantly outperforms baseline methods across all task levels. Reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents.

Conclusion: VIKI-Bench and VIKI-R provide a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems, demonstrating superior performance and enabling complex cooperation patterns.

Abstract: Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.

[353] Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning

Khurram Yamin, Gaurav Ghosal, Bryan Wilder

Main category: cs.AI

TL;DR: LLMs struggle with counterfactual reasoning, often relying on parametric knowledge instead of integrating new contextual information, and simple finetuning fails to improve this ability without degrading existing knowledge.

DetailsMotivation: To investigate whether LLMs can effectively integrate their parametric knowledge with new contextual information in novel settings through counterfactual reasoning.

Method: Used synthetic and real experiments in multi-hop reasoning problems to test LLMs’ counterfactual reasoning abilities, and applied simple post-hoc finetuning to assess improvement potential.

Result: LLMs generally perform poorly on counterfactual reasoning tasks, defaulting to parametric knowledge rather than integrating contextual information. Finetuning fails to instill counterfactual reasoning ability and often degrades stored parametric knowledge.

Conclusion: Current LLMs have significant limitations in repurposing parametric knowledge for novel settings, revealing fundamental challenges in knowledge integration and counterfactual reasoning capabilities.

Abstract: Large Language Models have been shown to contain extensive world knowledge in their parameters, enabling impressive performance on many knowledge-intensive tasks. However, when deployed in novel settings, LLMs often encounter situations where they must integrate parametric knowledge with new or unfamiliar information. In this work, we explore whether LLMs can combine knowledge in-context with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real experiments in multi-hop reasoning problems, we show that LLMs generally struggle with counterfactual reasoning, often resorting to exclusively using their parametric knowledge. Moreover, we show that simple post-hoc finetuning can struggle to instill counterfactual reasoning ability, often leading to degradation in stored parametric knowledge. Ultimately, our work reveals important limitations of current LLMs’ abilities to re-purpose parametric knowledge in novel settings.

[354] Can Large Language Models Adequately Perform Symbolic Reasoning Over Time Series?

Zewen Liu, Juntong Ni, Xianfeng Tang, Max S. Y. Lau, Wenpeng Yin, Wei Jin

Main category: cs.AI

TL;DR: SymbolBench is a new benchmark for evaluating LLMs’ ability to discover symbolic laws from time series data, covering multivariate regression, Boolean networks, and causal discovery. The authors propose a framework combining LLMs with genetic programming for closed-loop symbolic reasoning.

DetailsMotivation: To systematically evaluate LLMs' capability in inferring interpretable, context-aligned symbolic structures from time series data - a core challenge in scientific discovery that remains underexplored despite LLMs' promise in structured reasoning.

Method: Introduce SymbolBench benchmark with three tasks (multivariate symbolic regression, Boolean network inference, causal discovery) and propose a unified framework integrating LLMs with genetic programming to form a closed-loop symbolic reasoning system where LLMs act as both predictors and evaluators.

Result: Empirical results reveal key strengths and limitations of current models, showing that combining domain knowledge, context alignment, and reasoning structure is crucial for improving LLMs in automated scientific discovery.

Conclusion: The study demonstrates the importance of systematic evaluation and hybrid approaches for advancing LLMs’ symbolic reasoning capabilities in scientific discovery, with SymbolBench providing a comprehensive framework for future research.

Abstract: Uncovering hidden symbolic laws from time series data, as an aspiration dating back to Kepler’s discovery of planetary motion, remains a core challenge in scientific discovery and artificial intelligence. While Large Language Models show promise in structured reasoning tasks, their ability to infer interpretable, context-aligned symbolic structures from time series data is still underexplored. To systematically evaluate this capability, we introduce SymbolBench, a comprehensive benchmark designed to assess symbolic reasoning over real-world time series across three tasks: multivariate symbolic regression, Boolean network inference, and causal discovery. Unlike prior efforts limited to simple algebraic equations, SymbolBench spans a diverse set of symbolic forms with varying complexity. We further propose a unified framework that integrates LLMs with genetic programming to form a closed-loop symbolic reasoning system, where LLMs act both as predictors and evaluators. Our empirical results reveal key strengths and limitations of current models, highlighting the importance of combining domain knowledge, context alignment, and reasoning structure to improve LLMs in automated scientific discovery.

[355] An Automated Multi-modal Evaluation Framework for Mobile Intelligent Assistants Based on Large Language Models and Multi-Agent Collaboration

Meiping Wang, Jian Zhong, Rongduo Han, Liming Kang, Zhengkun Shi, Xiao Liang, Xing Lin, Nan Gao, Haining Zhang

Main category: cs.AI

TL;DR: An automated multi-modal evaluation framework using LLMs and multi-agent collaboration to address challenges in current evaluation methods for mobile AI assistants.

DetailsMotivation: Current evaluation methods for multi-modal AI assistants face high manual costs, inconsistent standards, and subjective bias, creating a need for automated, standardized evaluation.

Method: Three-tier agent architecture (interaction evaluation, semantic verification, experience decision agents) using supervised fine-tuning on Qwen3-8B model for multi-modal evaluation.

Result: Achieved high evaluation-matching accuracy with human experts; demonstrated effectiveness in predicting user satisfaction and identifying generation defects across eight major intelligent agents.

Conclusion: The proposed automated framework successfully addresses evaluation challenges and provides reliable, scalable assessment of multi-modal AI assistants.

Abstract: With the rapid development of mobile intelligent assistant technologies, multi-modal AI assistants have become essential interfaces for daily user interactions. However, current evaluation methods face challenges including high manual costs, inconsistent standards, and subjective bias. This paper proposes an automated multi-modal evaluation framework based on large language models and multi-agent collaboration. The framework employs a three-tier agent architecture consisting of interaction evaluation agents, semantic verification agents, and experience decision agents. Through supervised fine-tuning on the Qwen3-8B model, we achieve high evaluation-matching accuracy with human experts. Experimental results on eight major intelligent agents demonstrate the framework’s effectiveness in predicting user satisfaction and identifying generation defects.

[356] ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents

Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, Jie Tang

Main category: cs.AI

TL;DR: ComputerRL is a framework for autonomous desktop intelligence that unifies API calls and GUI interactions, enabling agents to operate complex digital workspaces through scalable distributed RL training with Entropulse strategy.

DetailsMotivation: To address the mismatch between machine agents and human-centric desktop environments, and overcome challenges in scaling end-to-end RL training due to environmental inefficiency and instability.

Method: Uses API-GUI paradigm to unify programmatic API calls and direct GUI interaction, develops distributed RL infrastructure for thousands of parallel virtual desktops, and proposes Entropulse training strategy that alternates RL with supervised fine-tuning.

Result: AutoGLM-OS-9B achieves state-of-the-art 48.9% accuracy on OSWorld benchmark, demonstrating significant improvements for general agents in desktop automation.

Conclusion: ComputerRL enables scalable and robust training for autonomous desktop intelligence, with the framework being adopted in building AutoGLM systems.

Abstract: We introduce ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully. ComputerRL features the API-GUI paradigm, which unifies programmatic API calls and direct GUI interaction to address the inherent mismatch between machine agents and human-centric desktop environments. Scaling end-to-end RL training is crucial for improvement and generalization across diverse desktop tasks; however, it remains challenging due to environmental inefficiency and instability during extended training. To support scalable and robust training, we develop a distributed RL infrastructure capable of orchestrating thousands of parallel virtual desktop environments to accelerate large-scale online RL. Furthermore, we propose Entropulse, a training strategy that alternates reinforcement learning with supervised fine-tuning, effectively mitigating entropy collapse during extended training runs. We employ ComputerRL on open models GLM-4-9B-0414 and GLM-4.1V-9B-Thinking, and evaluate them on the OSWorld benchmark. The AutoGLM-OS-9B achieves a new state-of-the-art accuracy of 48.9%, demonstrating significant improvements for general agents in desktop automation. Our code and the new OfficeWorld benchmark are available at https://github.com/thudm/ComputerRL. The algorithm and framework are adopted in building AutoGLM (Liu et al., 2024b).

[357] PowerChain: A Verifiable Agentic AI System for Automating Distribution Grid Analyses

Emmanuel O. Badmus, Peng Sang, Dimitrios Stamoulis, Amritanshu Pandey

Main category: cs.AI

TL;DR: PowerChain is an agentic system that autonomously performs complex distribution grid analyses, achieving up to 144% performance improvement over baselines by dynamically generating structured context using power systems tools and expert-annotated reasoning trajectories.

DetailsMotivation: Rapid electrification and decarbonization are increasing distribution grid complexity, requiring advanced computational analyses that are difficult to automate due to disparate workflows, substantial expert knowledge requirements, and workforce/budget constraints limiting utilities' ability to scale such analyses.

Method: PowerChain dynamically generates structured context by leveraging supervisory signals from self-contained power systems tools (e.g., GridLAB-D) and an optimized set of expert-annotated and verified reasoning trajectories, enabling generalization to unseen distribution grid analysis tasks.

Result: Empirical results on real utility data demonstrate that PowerChain achieves up to a 144% improvement in performance over baselines for complex distribution grid tasks defined in natural language.

Conclusion: PowerChain provides an effective agentic system solution for autonomously performing complex grid analyses, addressing the scalability challenges faced by utilities in distribution grid operation and planning.

Abstract: Rapid electrification and decarbonization are increasing the complexity of distribution grid (DG) operation and planning, necessitating advanced computational analyses to ensure reliability and resilience. These analyses depend on disparate workflows comprising complex models, function calls, and data pipelines that require substantial expert knowledge and remain difficult to automate. Workforce and budget constraints further limit utilities’ ability to apply such analyses at scale. To address this gap, we build an agentic system PowerChain, which is capable of autonomously performing complex grid analyses. Existing agentic AI systems are typically developed in a bottom-up manner with customized context for predefined analysis tasks; therefore, they do not generalize to tasks that the agent has never seen. In comparison, to generalize to unseen DG analysis tasks, PowerChain dynamically generates structured context by leveraging supervisory signals from self-contained power systems tools (e.g., GridLAB-D) and an optimized set of expert-annotated and verified reasoning trajectories. For complex DG tasks defined in natural language, empirical results on real utility data demonstrate that PowerChain achieves up to a 144% improvement in performance over baselines.

[358] When Agents go Astray: Course-Correcting SWE Agents with PRMs

Shubham Gandhi, Jason Tsay, Jatin Ganhotra, Kiran Kate, Yara Rizk

Main category: cs.AI

TL;DR: SWE-PRM is a Process Reward Model that intervenes during LLM agent execution to detect and correct trajectory-level inefficiencies like redundant exploration and looping, improving software engineering task resolution by 10.6 percentage points.

DetailsMotivation: LLM agents deployed for complex software engineering tasks often exhibit costly inefficiencies such as redundant exploration, looping, and failure to terminate, which prior work only addressed post-execution.

Method: SWE-PRM is an inference-time Process Reward Model that leverages a taxonomy of common inefficiencies to provide lightweight, interpretable feedback during execution without modifying the underlying policy.

Result: On SWE-bench Verified, closed-source PRMs improved resolution from 40.0% to 50.6% (+10.6 p.p.), with largest gains on medium and hard tasks. Taxonomy-guided PRMs outperformed unguided variants, increasing success rate while reducing trajectory length.

Conclusion: PRMs provide a practical and scalable mechanism for improving SWE agents’ reliability and efficiency, with acceptable added inference cost as low as $0.2.

Abstract: Large Language Model (LLM) agents are increasingly deployed for complex, multi-step software engineering (SWE) tasks. However, their trajectories often contain costly inefficiencies, such as redundant exploration, looping, and failure to terminate once a solution is reached. Prior work has largely treated these errors in a post-hoc manner, diagnosing failures only after execution. In this paper, we introduce SWE-PRM, an inference-time Process Reward Model (PRM) that intervenes during execution to detect and course-correct trajectory-level errors. Our PRM design leverages a taxonomy of common inefficiencies and delivers lightweight, interpretable feedback without modifying the underlying policy. On SWE-bench Verified, closed-source PRMs improve resolution from 40.0% to 50.6% (+10.6 p.p.), with the largest gains on medium and hard tasks. Among feedback strategies, taxonomy-guided PRMs outperform unguided or explicit action-prescriptive variants, increasing success rate while reducing trajectory length. These benefits come at an acceptable added inference cost of as low as $0.2, making PRMs a practical and scalable mechanism for improving SWE agents’ reliability and efficiency.
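
Schematically, the intervention loop looks like the sketch below; `agent_step` (the frozen SWE agent) and `prm_judge` (the PRM) are hypothetical stand-ins, and the taxonomy labels mirror the inefficiencies named in the abstract.

```python
TAXONOMY = ("redundant_exploration", "looping", "failure_to_terminate")

def run_with_prm(task, agent_step, prm_judge, max_steps=50):
    """Execute the agent, letting the PRM inject feedback mid-trajectory."""
    trajectory = []
    for _ in range(max_steps):
        action = agent_step(task, trajectory)
        trajectory.append(action)
        if action.get("type") == "submit":   # agent believes it is done
            break
        verdict = prm_judge(trajectory, TAXONOMY)
        if verdict.get("error"):
            # Lightweight, interpretable feedback; the policy is unmodified.
            trajectory.append({
                "type": "feedback",
                "text": f"Detected {verdict['error']}; revise your approach "
                        "before taking further actions.",
            })
    return trajectory
```

Because the correction happens at inference time, this pattern composes with any underlying agent and only adds the cost of the PRM calls.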

[359] Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning

Manvi Jha, Jiaxin Wan, Deming Chen

Main category: cs.AI

TL;DR: Proof2Silicon is an end-to-end framework that generates correctness-by-construction hardware from natural language specifications by combining PREFACE’s RL-based prompt optimization for verifiable Dafny code generation with automated translation to C and RTL synthesis.

DetailsMotivation: LLMs often generate code that fails formal verification, which is critical for hardware and safety-critical applications. There's a need for automated methods that ensure formal correctness while maintaining the convenience of natural language specifications.

Method: Three-step approach: (1) Use PREFACE’s verifier-driven RL agent to iteratively optimize prompts for generating formally verifiable Dafny code, (2) Automatically translate verified Dafny programs to synthesizable C code using Dafny’s Python backend and PyLog, (3) Employ Vivado HLS to produce RTL implementations.

Result: PREFACE improved Dafny verification success rates by up to 21% across diverse LLMs. Proof2Silicon achieved up to 72% end-to-end hardware synthesis success rate on a 100-task benchmark, generating RTL designs through Vivado HLS.

Conclusion: Proof2Silicon provides a robust, scalable, and automated pipeline for LLM-driven formally verified hardware synthesis, successfully bridging natural-language specification to silicon realization without requiring costly fine-tuning.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in automated code generation but frequently produce code that fails formal verification, an essential requirement for hardware and safety-critical domains. To overcome this fundamental limitation, we previously proposed PREFACE, a model-agnostic framework based on reinforcement learning (RL) that iteratively repairs the prompts provided to frozen LLMs, systematically steering them toward generating formally verifiable Dafny code without costly fine-tuning. This work presents Proof2Silicon, a novel end-to-end synthesis framework that embeds the previously proposed PREFACE flow to enable the generation of correctness-by-construction hardware directly from natural language specifications. Proof2Silicon operates by: (1) leveraging PREFACE’s verifier-driven RL agent to optimize prompt generation iteratively, ensuring Dafny code correctness; (2) automatically translating verified Dafny programs into synthesizable high-level C using Dafny’s Python backend and PyLog; and (3) employing Vivado HLS to produce RTL implementations. Evaluated rigorously on a challenging 100-task benchmark, PREFACE’s RL-guided prompt optimization consistently improved Dafny verification success rates across diverse LLMs by up to 21%. Crucially, Proof2Silicon achieved an end-to-end hardware synthesis success rate of up to 72%, generating RTL designs through Vivado HLS synthesis flows. These results demonstrate a robust, scalable, and automated pipeline for LLM-driven, formally verified hardware synthesis, bridging natural-language specification and silicon realization.

[360] Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning

Song Yu, Xiaofei Xu, Ke Deng, Li Li, Lin Tian

Main category: cs.AI

TL;DR: Tree of Agents (TOA) is a multi-agent framework that segments long inputs into chunks processed by independent agents, enabling collaborative reasoning through tree-structured information exchange to mitigate position bias and reduce hallucinations in long-context tasks.

DetailsMotivation: Address persistent challenges in LLMs handling long-context tasks, particularly the 'lost in the middle' issue where middle information is underutilized, and limitations of existing methods that either risk discarding key information or cause attention dispersion.

Method: Segments input into chunks processed by independent agents, generates local cognition for each agent, enables dynamic information exchange along tree-structured paths for collaborative reasoning, incorporates prefix-hash caching and adaptive pruning for efficiency.

Result: TOA powered by LLaMA3.1-8B significantly outperforms multiple baselines and performs comparably to much larger commercial models such as Gemini1.5-pro on various long-context tasks, at comparable API overhead.

Conclusion: TOA effectively mitigates position bias and reduces hallucinations in long-context processing through multi-agent collaborative reasoning, achieving state-of-the-art performance with efficient resource utilization.

Abstract: Large language models (LLMs) face persistent challenges when handling long-context tasks, most notably the ‘lost in the middle’ issue, where information located in the middle of a long input tends to be underutilized. Some existing methods that reduce input have the risk of discarding key information, while others that extend context windows often lead to attention dispersion. To address these limitations, we propose Tree of Agents (TOA), a multi-agent reasoning framework that segments the input into chunks processed by independent agents. Each agent generates its local cognition, then agents dynamically exchange information for collaborative reasoning along tree-structured paths. TOA enables agents to probe different reasoning orders for multi-perspective understanding, effectively mitigating position bias and reducing hallucinations. To improve processing efficiency, we incorporate prefix-hash caching and adaptive pruning strategies, achieving significant performance improvements with comparable API overhead. Experiments show that TOA, powered by compact LLaMA3.1-8B, significantly outperforms multiple baselines and demonstrates comparable performance to the latest and much larger commercial models, such as Gemini1.5-pro, on various long-context tasks. Code is available at https://github.com/Aireduce952/Tree-of-Agents.

[361] RepIt: Steering Language Models with Concept-Specific Refusal Vectors

Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang

Main category: cs.AI

TL;DR: RepIt is a data-efficient framework for isolating concept-specific representations in LLMs, enabling precise interventions that suppress refusal on targeted concepts while preserving safety elsewhere.

DetailsMotivation: Current activation steering methods in LLMs often have broader effects than desired, motivating the need for purer concept vectors to enable targeted interventions and understand model behavior at a granular level.

Method: RepIt framework isolates concept-specific representations using corrective signals localized to just 100-200 neurons, requiring minimal data (as few as a dozen examples) and compute (single A6000 GPU).

Result: RepIt successfully suppresses refusal on targeted concepts (like WMD-related questions) while maintaining safety scores on standard benchmarks, demonstrating precise behavioral control.

Conclusion: Targeted interventions can counteract overgeneralization in LLMs, laying foundation for granular control of model behavior, though the efficiency also raises concerns about potential misuse with modest resources.

Abstract: While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired. This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations. Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. We further show that the corrective signal localizes to just 100-200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000. This efficiency raises a dual concern: manipulations can be performed with modest compute and data to extend to underrepresented data-scarce topics while evading existing benchmarks. By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior.
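
A difference-of-means reading of the extraction step, heavily simplified: take hidden states from concept-specific refusals and generic refusals at one layer, keep only the highest-magnitude 100-200 coordinates of their mean difference, and project that direction out at inference. The matrix names and the sparsification rule are assumptions for illustration, not RepIt’s published procedure.

```python
import numpy as np

def concept_refusal_vector(acts_concept, acts_generic, top_neurons=200):
    """acts_*: (examples x hidden_dim) activations at a chosen layer."""
    v = acts_concept.mean(axis=0) - acts_generic.mean(axis=0)
    keep = np.argsort(np.abs(v))[-top_neurons:]  # localize to ~100-200 neurons
    sparse_v = np.zeros_like(v)
    sparse_v[keep] = v[keep]
    return sparse_v / (np.linalg.norm(sparse_v) + 1e-8)

def remove_direction(hidden, v):
    """Ablate the concept direction from a single hidden-state vector."""
    return hidden - (hidden @ v) * v
```

Sparsifying to a few hundred neurons is what keeps the intervention concept-specific: the generic refusal machinery living in other coordinates is untouched, so refusal elsewhere is preserved.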

[362] GPO: Learning from Critical Steps to Improve LLM Reasoning

Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing

Main category: cs.AI

TL;DR: GPO is a fine-tuning strategy that identifies critical steps in LLM reasoning trajectories using advantage functions, then resets policies to those steps to prioritize learning from pivotal moments, improving reasoning performance across various benchmarks.

DetailsMotivation: Existing optimization methods treat reasoning trajectories as a whole without considering critical steps within the trajectory, limiting their ability to enhance multi-step reasoning capabilities of LLMs.

Method: GPO identifies critical steps in reasoning trajectories using advantage functions, resets the policy to these steps, samples new rollouts, and prioritizes learning from these pivotal moments. It’s a general strategy that can integrate with various optimization methods.

Result: Experiments across challenging reasoning benchmarks show GPO consistently and significantly enhances the performance of existing optimization methods, demonstrating effectiveness and generalizability.

Conclusion: GPO effectively improves LLM reasoning by focusing on pivotal moments within the generation process, making it a valuable general strategy for enhancing reasoning capabilities.

Abstract: Large language models (LLMs) are increasingly used in various domains, showing impressive potential on different tasks. Recently, reasoning LLMs have been proposed to improve the \textit{reasoning} or \textit{thinking} capabilities of LLMs to solve complex problems. Despite the promising results of reasoning LLMs, enhancing the multi-step reasoning capabilities of LLMs remains a significant challenge. While existing optimization methods have advanced LLM reasoning capabilities, they often treat reasoning trajectories as a whole, without considering the underlying critical steps within the trajectory. In this paper, we introduce \textbf{G}uided \textbf{P}ivotal \textbf{O}ptimization (GPO), a novel fine-tuning strategy that dives into the reasoning process to enable more effective improvements. GPO first identifies the `critical step’ within a reasoning trajectory: the point through which the model must proceed carefully in order to solve the problem. We locate the critical step by estimating the advantage function. GPO then resets the policy to the critical step, samples new rollouts, and prioritizes learning on those rollouts. This focus allows the model to learn more effectively from pivotal moments within the reasoning process to improve the reasoning performance. We demonstrate that GPO is a general strategy that can be integrated with various optimization methods to improve reasoning performance. Besides theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO can consistently and significantly enhance the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process.
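
One plausible reading of "locate the critical step by estimating the advantage function" is sketched below: treat the step with the sharpest advantage drop as the pivot, then resample rollouts from that prefix. The selection rule and the `rollout_from` helper are assumptions for illustration, not GPO’s exact procedure.

```python
import numpy as np

def find_critical_step(advantages):
    """Index of the step where the estimated advantage falls the most."""
    adv = np.asarray(advantages, dtype=float)
    drops = adv[:-1] - adv[1:]
    return int(np.argmax(drops)) + 1

def gpo_rollouts(trajectory, advantages, rollout_from, n=8):
    """Reset to the critical step and gather prioritized rollouts from it."""
    t = find_critical_step(advantages)
    prefix = trajectory[:t]
    return [rollout_from(prefix) for _ in range(n)]
```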

[363] Program Synthesis via Test-Time Transduction

Kang-il Lee, Jahyun Koo, Seunghyun Yoon, Minbeom Kim, Hyukhun Koh, Dongryeol Lee, Kyomin Jung

Main category: cs.AI

TL;DR: Transductive program synthesis improves robustness by using test inputs during synthesis via active learning over program outputs, reducing LLM queries through greedy maximin selection.

DetailsMotivation: Prior program synthesis methods struggle with robustness in real-world settings with limited training examples and edge cases in test inputs.

Method: A framework that treats synthesis as active learning over a finite hypothesis class, using LLM to predict outputs for selected test inputs and eliminate inconsistent hypotheses via greedy maximin algorithm.

Result: Significant improvements in program synthesis accuracy and efficiency on four benchmarks: Playgol, MBPP+, 1D-ARC, and programmatic world modeling on MiniGrid.

Conclusion: Transductive program synthesis effectively enhances robustness and efficiency in program synthesis tasks.

Abstract: We introduce transductive program synthesis, a new formulation of the program synthesis task that explicitly leverages test inputs during synthesis. While prior approaches to program synthesis, whether based on natural language descriptions or input-output examples, typically aim to generalize from training examples, they often struggle with robustness, especially in real-world settings where training examples are limited and test inputs involve various edge cases. To address this, we propose a novel framework that improves robustness by treating synthesis as active learning over a finite hypothesis class defined by programs’ outputs. We use an LLM to predict outputs for selected test inputs and eliminate inconsistent hypotheses, where the inputs are chosen via a greedy maximin algorithm to minimize the number of LLM queries required. We evaluate our approach on four benchmarks: Playgol, MBPP+, 1D-ARC, and programmatic world modeling on MiniGrid. We demonstrate that our method significantly improves program synthesis in both accuracy and efficiency. We release our code at https://github.com/klee972/SYNTRA.
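
The greedy maximin selection is the algorithmic heart and is easy to sketch: treat candidate programs as hypotheses, and pick the test input whose worst-case outcome still eliminates the most of them, i.e. the input minimizing the largest block of hypotheses that agree on a single output. A toy version, assuming hypotheses are callables; the exact objective in the paper may differ in detail.

```python
from collections import Counter

def pick_input(hypotheses, test_inputs):
    """Greedy maximin: minimize worst-case survivors over possible outputs."""
    def worst_case_survivors(x):
        agreement = Counter(h(x) for h in hypotheses)
        return max(agreement.values())
    return min(test_inputs, key=worst_case_survivors)

def eliminate(hypotheses, x, observed_output):
    """Keep only programs consistent with the LLM-predicted output for x."""
    return [h for h in hypotheses if h(x) == observed_output]

hyps = [lambda n: n + 1, lambda n: n * 2, lambda n: n]
x = pick_input(hyps, [0, 1, 2])   # picks 2: all three programs disagree there
hyps = eliminate(hyps, x, 4)      # suppose the LLM predicts 4 for input 2
print(len(hyps))                  # 1 hypothesis left after a single query
```

Each query to the LLM prunes the hypothesis class as aggressively as possible, which is how the method keeps the number of LLM calls small.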

[364] SpecExit: Accelerating Large Reasoning Model via Speculative Exit

Rubing Yang, Huajun Bai, Song Liu, Guanghua Yu, Runzhi Fan, Yanbin Dang, Jiejing Zhang, Kai Liu, Jianchen Zhu, Peng Chen

Main category: cs.AI

TL;DR: SpecExit is a novel framework that uses hidden states from a lightweight draft model to predict both future tokens and early-exit signals, reducing generation length by 66% and achieving 2.5x speedup in latency without accuracy loss.

DetailsMotivation: Large reasoning models suffer from overthinking, producing unnecessarily long outputs with high latency, limiting real-world deployment. Existing early-exit methods have detection overhead that limits latency gains and generalizability.

Method: Proposes SpecExit framework that predicts future tokens and early-exit signals directly from hidden states of a lightweight draft model, eliminating probing overhead used in traditional early-exit mechanisms.

Result: Reduces average generation length by 66% and achieves 2.5x speedup in end-to-end latency compared to speculative decoding baseline, without compromising accuracy.

Conclusion: Hidden states provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning in large language models.

Abstract: Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high end-to-end latency, a significant limitation to their real-world deployment. To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems. Inspired by the use of hidden states in speculative decoding, we propose SpecExit, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead. Our method offers significant improvements, reducing average generation length by 66% and achieving a 2.5x speedup in end-to-end latency compared to the speculative decoding baseline, without compromising accuracy. Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning. Our code is available at https://github.com/Tencent/AngelSlim.

[365] R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai

Main category: cs.AI

TL;DR: R-HORIZON is a method and benchmark for evaluating and enhancing long-horizon reasoning in Large Reasoning Models (LRMs) through query composition, revealing performance degradation in current models and enabling improved training via reinforcement learning.

DetailsMotivation: Existing benchmarks focus on single-horizon tasks, failing to evaluate models' ability to handle complex, long-horizon scenarios with interdependent problems, creating an incomplete assessment of LRMs' reasoning capabilities.

Method: Proposed R-HORIZON method stimulates long-horizon reasoning through query composition and constructs a benchmark with complex multi-step reasoning tasks spanning long reasoning horizons. Used R-HORIZON to create data for reinforcement learning with verified rewards (RLVR).

Result: Advanced LRMs show significant performance degradation on long-horizon tasks, exhibiting limited effective reasoning length and poor thinking budget allocation. RLVR training with R-HORIZON data substantially improves multi-horizon reasoning performance and boosts accuracy on standard tasks by 7.5 points on AIME2024.

Conclusion: R-HORIZON provides a scalable, controllable, and low-cost paradigm for both evaluating and enhancing long-horizon reasoning capabilities in LRMs, addressing current limitations in reasoning model assessment.

Abstract: Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models’ ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 points on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.
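
Query composition can be pictured with a toy chainer: each problem consumes the previous problem’s answer, so the model cannot solve any step in isolation. A minimal sketch (the real benchmark composes verified reasoning problems, not templated arithmetic):

```python
def compose(templates):
    """Chain problems; '{prev}' marks where the previous answer is consumed."""
    lines = [f"Q1: {templates[0]}"]  # the first template has no '{prev}' slot
    for i, template in enumerate(templates[1:], start=2):
        lines.append(f"Q{i}: " + template.format(prev=f"the answer to Q{i - 1}"))
    lines.append("Solve the questions in order and report only the final answer.")
    return "\n".join(lines)

print(compose([
    "What is 12 * 7?",
    "Let x be {prev}. What is x - 4?",
    "Let y be {prev}. Is y prime?",
]))
```

Because the answer to each sub-problem is known, correctness of the composed chain is verifiable end to end, which is what makes the data usable for RLVR.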

[366] Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

Marius Dragoi, Ioana Pintilie, Florin Gogianu, Florin Brad

Main category: cs.AI

TL;DR: The paper proposes Cover@tau as an alternative to Pass@k for evaluating reasoning boundaries in RLVR models, arguing that Pass@k at large k values reflects random guessing rather than genuine reasoning in discrete answer spaces.

DetailsMotivation: To address the misleading nature of Pass@k metrics at large sampling budgets, where base models appear to outperform RLVR models due to random guessing rather than genuine reasoning capabilities.

Method: Introduces Cover@tau metric that measures the fraction of problems a model can solve where at least a tau proportion of completions are correct, capturing reasoning under explicit reliability thresholds.

Result: Evaluation shows that relative rankings of popular RLVR algorithms change significantly when using Cover@tau compared to Pass@1, providing a different perspective on reasoning boundaries.

Conclusion: Cover@tau offers a more reliable way to assess reasoning boundaries by penalizing models that rely on random guessing, revealing different performance patterns than traditional Pass@k metrics.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the ever-increasing chance of success as the number of trials grows, rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
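
The metric itself is a one-liner, implemented below directly from the definition in the abstract: the fraction of problems for which at least a tau proportion of sampled completions are correct.

```python
def cover_at_tau(correct_matrix, tau):
    """correct_matrix: per-problem lists of 0/1 completion outcomes."""
    solved = sum(1 for outcomes in correct_matrix
                 if sum(outcomes) / len(outcomes) >= tau)
    return solved / len(correct_matrix)

# A guessing-heavy model can pass Pass@k at large k yet fail Cover@0.5:
outcomes = [[1, 0, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
print(cover_at_tau(outcomes, 0.5))  # 1/3: only the second problem clears tau
```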

[367] SAFER: Risk-Constrained Sample-then-Filter in Large Language Models

Qingni Wang, Yue Fan, Xin Eric Wang

Main category: cs.AI

TL;DR: SAFER is a two-stage framework for trustworthy LLM deployment in open-ended QA that provides statistical guarantees through abstention-aware sampling and conformal filtering, addressing the limitation of finite solution space assumptions in prior methods.

DetailsMotivation: Existing selective conformal prediction methods unrealistically assume finite sampling can obtain all admissible answers for open-ended QA, which lacks a fixed solution space. This creates trustworthiness issues for LLMs in risk-sensitive applications.

Method: Two-stage framework: 1) Calibrates sampling budget using Clopper-Pearson exact method with user-defined risk level, abstaining if risk cannot be met within sampling cap. 2) Uses conformal risk control to filter unreliable distractors from candidate sets with statistically valid uncertainty thresholds.

Result: SAFER provides statistical guarantees for LLM outputs in open-ended QA, controls risk of excluding correct answers, and demonstrates compatibility with various admission criteria and calibration-test split ratios.

Conclusion: SAFER offers a robust, data-efficient solution for trustworthy LLM deployment in open-ended scenarios by addressing the finite solution space limitation and providing formal risk control guarantees.

Abstract: As large language models (LLMs) are increasingly deployed in risk-sensitive applications such as real-world open-ended question answering (QA), ensuring the trustworthiness of their outputs has become critical. Existing selective conformal prediction (SCP) methods provide statistical guarantees by constructing prediction sets with a constrained miscoverage rate for correct answers. However, prior works unrealistically assume that admissible answers for all instances can be obtained via finite sampling, even for open-ended QA scenarios that lack a fixed and finite solution space. To address this, we introduce a two-stage risk control framework comprising abstention-aware sampling and conformalized filtering (SAFER). Firstly, on a held-out calibration set, SAFER calibrates a sampling budget within the maximum sampling cap, using the Clopper-Pearson exact method at a user-desired risk level (i.e., the maximum allowable miscoverage rate of the sampling sets). If the risk level cannot be satisfied within the cap, we abstain; otherwise, the calibrated sampling budget becomes the minimum requirements at test time. Then, we employ calibration instances where correct answers are attainable under the calibrated budget and apply the conformal risk control method to determine a statistically valid uncertainty threshold, which filters unreliable distractors from the candidate set for each test data point. In this stage, SAFER introduces an additional risk level to guide the calculation of the threshold, thereby controlling the risk of correct answers being excluded. Furthermore, we show that SAFER is compatible with various task-specific admission criteria and calibration-test split ratios, highlighting its robustness and high data efficiency.
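
The Clopper-Pearson piece of the calibration stage can be sketched with SciPy’s beta quantiles; the budget-search loop around it is illustrative, assuming miss counts have already been tallied per candidate budget on the calibration set.

```python
from scipy.stats import beta

def cp_upper(misses, n, confidence=0.95):
    """One-sided Clopper-Pearson upper bound on the miscoverage rate."""
    if misses == n:
        return 1.0
    return float(beta.ppf(confidence, misses + 1, n - misses))

def calibrate_budget(miss_counts_by_budget, n, risk_level, max_cap):
    """Smallest sampling budget whose miscoverage bound meets the risk level."""
    for m in range(1, max_cap + 1):
        if cp_upper(miss_counts_by_budget[m], n) <= risk_level:
            return m      # minimum required sampling budget at test time
    return None           # abstain: the risk level is unattainable under the cap
```

Returning None implements the abstention rule: if no budget within the cap brings the exact upper bound below the user’s risk level, the system declines rather than over-promise coverage.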

[368] Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Trilok Padhi, Pinxian Lu, Abdulkadir Erol, Tanmay Sutar, Gauri Sharma, Mina Sonmez, Munmun De Choudhury, Ugur Kursuncu

Main category: cs.AI

TL;DR: This paper introduces a multi-turn harassment benchmark for LLM agents, showing that jailbreak attacks significantly increase harassment success rates and reduce refusal rates, with closed-source models being particularly vulnerable.

DetailsMotivation: Current jailbreak research focuses on single-turn prompts, but real harassment occurs over multiple interactions. There's a need to understand and test LLM agent vulnerabilities in multi-turn harassment scenarios.

Method: Created Online Harassment Agentic Benchmark with synthetic multi-turn harassment dataset, multi-agent simulation using game theory, three jailbreak methods (memory, planning, fine-tuning), and mixed-methods evaluation using LLaMA-3.1-8B-Instruct and Gemini-2.0-flash.

Result: Jailbreak tuning increased harassment success rates to 95.78-96.89% vs 57.25-64.19% without tuning in Llama, and 99.33% vs 98.46% in Gemini. Refusal rates dropped to 1-2%. Most prevalent toxic behaviors were Insult (84.9-87.8% vs 44.2-50.8%) and Flaming (81.2-85.1% vs 31.5-38.8%).

Conclusion: Multi-turn, theory-grounded attacks successfully mimic human-like harassment dynamics, highlighting the need for robust safety guardrails in LLM agents to protect online platforms.

Abstract: Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed with an attack success rate of 95.78–96.89% vs. 57.25–64.19% without tuning in Llama, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing refusal rate to 1-2% in both models. The most prevalent toxic behaviors are Insult with 84.9–87.8% vs. 44.2–50.8% without tuning, and Flaming with 81.2–85.1% vs. 31.5–38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.

[369] Towards Agentic Self-Learning LLMs in Search Environment

Wangtao Sun, Xiang Cheng, Jialin Fan, Yao Xu, Xing Yu, Shizhu He, Jun Zhao, Kang Liu

Main category: cs.AI

TL;DR: Agentic Self-Learning (ASL) is a closed-loop RL framework that enables LLM-based agents to self-improve without human data or rule-based rewards, using multi-role co-evolution of task generation, policy execution, and generative reward modeling.

DetailsMotivation: To scale LLM-based agents without relying on human-curated datasets or predefined rule-based rewards, addressing the limitations of existing approaches that plateau or degrade.

Method: Proposes ASL framework with three coordinated components: Prompt Generator (creates tasks), Policy Model (executes tasks), and Generative Reward Model (evaluates performance). Uses multi-role reinforcement learning with co-evolution of GRM and policy in a shared tool environment.

Result: ASL achieves steady performance gains, surpasses RLVR baselines like Search-R1, continues improving under zero-labeled-data conditions, and shows superior sample efficiency. GRM verification capacity is identified as the main bottleneck.

Conclusion: Reward source and data scale are critical for open-domain agent learning. Multi-role co-evolution enables scalable, self-improving agents, with continual GRM training being essential to prevent reward hacking and maintain progress.

Abstract: We study whether self-learning can scale LLM-based agents without relying on human-curated datasets or predefined rule-based rewards. Through controlled experiments in a search-agent setting, we identify two key determinants of scalable agent training: the source of reward signals and the scale of agent task data. We find that rewards from a Generative Reward Model (GRM) outperform rigid rule-based signals for open-domain learning, and that co-evolving the GRM with the policy further boosts performance. Increasing the volume of agent task data-even when synthetically generated-substantially enhances agentic capabilities. Building on these insights, we propose \textbf{Agentic Self-Learning} (ASL), a fully closed-loop, multi-role reinforcement learning framework that unifies task generation, policy execution, and evaluation within a shared tool environment and LLM backbone. ASL coordinates a Prompt Generator, a Policy Model, and a Generative Reward Model to form a virtuous cycle of harder task setting, sharper verification, and stronger solving. Empirically, ASL delivers steady, round-over-round gains, surpasses strong RLVR baselines (e.g., Search-R1) that plateau or degrade, and continues improving under zero-labeled-data conditions, indicating superior sample efficiency and robustness. We further show that GRM verification capacity is the main bottleneck: if frozen, it induces reward hacking and stalls progress; continual GRM training on the evolving data distribution mitigates this, and a small late-stage injection of real verification data raises the performance ceiling. This work establishes reward source and data scale as critical levers for open-domain agent learning and demonstrates the efficacy of multi-role co-evolution for scalable, self-improving agents. The data and code of this paper are released at https://github.com/forangel2014/Towards-Agentic-Self-Learning

[370] SimKO: Simple Pass@K Policy Optimization

Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen

Main category: cs.AI

TL;DR: RLVR methods suffer from exploitation bias that improves pass@1 but harms pass@K performance. The paper identifies a probability concentration effect and proposes SimKO to mitigate this by asymmetrically boosting top-K candidates for correct responses and penalizing top-1 for incorrect ones.

DetailsMotivation: To address the systematic bias in RLVR methods toward exploitation over exploration, evidenced by improved pass@1 but reduced pass@K performance, and understand the underlying training dynamics.

Method: Proposed Simple Pass@K Optimization (SimKO) - an asymmetric method that boosts probabilities of top-K candidates for verified-correct responses and applies stronger penalties to top-1 candidate for verified-incorrect responses, particularly at high-entropy tokens.

Result: SimKO consistently yields higher pass@K across various math and logical-reasoning benchmarks for a wide range of K, effectively mitigating the over-concentration issue.

Conclusion: SimKO provides a simple and effective way to improve RLVR’s exploration capabilities by addressing the probability concentration effect, leading to better pass@K performance without sacrificing pass@1 gains.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR’s exploration.
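
A hypothetical token-level rendering of the asymmetric update, to make the mechanism concrete; the coefficients, top-K size, entropy gate, and loss form are illustrative assumptions, not the paper’s settings:

```python
import torch
import torch.nn.functional as F

def simko_token_loss(logits, correct, k=4, entropy_gate=2.0,
                     boost=0.1, penalty=0.3):
    """Asymmetric per-token term in the spirit of SimKO. logits: (vocab,).
    For correct responses, raise the top-K candidates' probabilities; for
    incorrect ones, push down only the top-1 candidate. Acts only at
    high-entropy tokens, where over-concentration is most harmful."""
    probs = F.softmax(logits, dim=-1)
    log_probs = torch.log(probs.clamp_min(1e-9))
    entropy = -(probs * log_probs).sum()
    if entropy < entropy_gate:
        return logits.new_zeros(())            # leave low-entropy tokens alone
    if correct:
        topk = probs.topk(k).indices
        return -boost * log_probs[topk].mean()  # spread mass over top-K
    top1 = probs.argmax()
    return penalty * log_probs[top1]            # minimizing lowers p(top-1)
```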

[371] Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning

Boyin Liu, Zhuo Zhang, Sen Huang, Lipeng Xie, Qingxu Fu, Haoran Chen, LI YU, Tianyi Hu, Zhaoyang Liu, Bolin Ding, Dongbin Zhao

Main category: cs.AI

TL;DR: This paper introduces a framework to detect and resolve logical inconsistencies (preference cycles) in LLM judge feedback for language model alignment, improving training stability and performance.

DetailsMotivation: LLM judge feedback offers scalable alignment but suffers from judgment inconsistencies that destabilize reinforcement learning, with preference cycles being a critical unaddressed issue.

Method: Proposes an end-to-end framework with Conflict Detection Rate (CDR) metric and Deconflicted Graph Rewards (DGR) - a signal-purification framework that constructs preference graphs, transforms them into conflict-free DAGs, and generates coherent reward signals.

Result: Experiments show the framework significantly improves training stability and model performance over strong baselines.

Conclusion: Logical consistency is established as a crucial and now-addressable dimension of AI feedback for language model alignment.

Abstract: Aligning language models using LLM judge feedback offers a scalable alternative to human annotation, yet is plagued by judgment inconsistencies that destabilize reinforcement learning. While prior work has focused on judge accuracy, the critical issue of logical coherence, particularly preference cycles, has been largely unaddressed. To address this gap, this work introduces an end-to-end framework to systematically detect and resolve these inconsistencies within the reinforcement learning training loop. Our framework features two core contributions: the Conflict Detection Rate (CDR), a novel metric to quantify judgment conflicts, and Deconflicted Graph Rewards (DGR), a signal-purification framework that eliminates cycles before policy optimization. DGR constructs preference graphs from raw judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and generates a logically coherent reward signal compatible with any policy optimizer. Experiments confirm that our framework significantly improves training stability and model performance over strong baselines, establishing logical consistency as a crucial and now-addressable dimension of AI feedback. The code for our method is available at https://github.com/modelscope/RM-Gallery.
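
One way the cycle-breaking step could look, assuming vote-count edge weights and weakest-edge removal; the paper’s exact DAG construction and reward readout may differ:

```python
import networkx as nx

def deconflicted_rewards(judgments):
    """judgments: list of (winner, loser) pairs from the LLM judge.
    Build a weighted preference graph, break each cycle at its
    lowest-weight edge, and read rewards off a topological order."""
    g = nx.DiGraph()
    for w, l in judgments:
        prev = g.get_edge_data(w, l, default={"weight": 0})["weight"]
        g.add_edge(w, l, weight=prev + 1)
    while not nx.is_directed_acyclic_graph(g):
        cycle = nx.find_cycle(g)                 # edges of one preference cycle
        u, v = min(cycle, key=lambda e: g[e[0]][e[1]]["weight"])[:2]
        g.remove_edge(u, v)                      # drop the least-supported edge
    order = list(nx.topological_sort(g))
    return {node: len(order) - rank for rank, node in enumerate(order)}
```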

[372] PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold

Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, Zheqing Zhu

Main category: cs.AI

TL;DR: PokeeResearch-7B is a 7B-parameter deep research agent that uses reinforcement learning from AI feedback and chain-of-thought reasoning to achieve state-of-the-art performance on research benchmarks.

DetailsMotivation: Current tool-augmented LLM agents have limitations including shallow retrieval, weak alignment metrics, and brittle tool-use behavior that need to be addressed.

Method: Uses unified reinforcement learning framework with RLAIF (Reinforcement Learning from AI Feedback) for training, and incorporates chain-of-thought-driven multi-call reasoning with self-verification and adaptive recovery from tool failures.

Result: Achieves state-of-the-art performance among 7B-scale deep research agents across 10 popular deep research benchmarks.

Conclusion: Careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents, demonstrating the effectiveness of the proposed approach.

Abstract: Tool-augmented large language models (LLMs) are emerging as deep research agents, systems that decompose complex queries, retrieve external evidence, and synthesize grounded responses. Yet current agents remain limited by shallow retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce PokeeResearch-7B, a 7B-parameter deep research agent built under a unified reinforcement learning framework for robustness, alignment, and scalability. PokeeResearch-7B is trained by an annotation-free Reinforcement Learning from AI Feedback (RLAIF) framework to optimize policies using LLM-based reward signals that capture factual accuracy, citation faithfulness, and instruction adherence. A chain-of-thought-driven multi-call reasoning scaffold further enhances robustness through self-verification and adaptive recovery from tool failures. Across 10 popular deep research benchmarks, PokeeResearch-7B achieves state-of-the-art performance among 7B-scale deep research agents. This highlights that careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents. The model and inference code are open-sourced under the Apache 2.0 license at https://github.com/Pokee-AI/PokeeResearchOSS.

[373] Uncertain Knowledge Graph Completion via Semi-Supervised Confidence Distribution Learning

Tianxing Wu, Shutong Zhu, Jingting Wang, Ning Xu, Guilin Qi, Haofen Wang

Main category: cs.AI

TL;DR: Proposes ssCDL, a semi-supervised confidence distribution learning method for uncertain knowledge graph completion that addresses imbalanced confidence distributions through meta-learning and pseudo-labeling.

DetailsMotivation: Current UKG completion methods neglect the extremely imbalanced distributions of triple confidences, causing insufficient embeddings for high-quality completion.

Method: Transforms triple confidences into distributions, uses relational learning on labeled and unlabeled data with pseudo labels generated by meta-learning to augment training data and rebalance confidence distributions.

Result: Experiments on two UKG datasets show ssCDL consistently outperforms state-of-the-art baselines across different evaluation metrics.

Conclusion: ssCDL effectively addresses confidence imbalance in UKG completion through semi-supervised confidence distribution learning and meta-learning-based pseudo-labeling.

Abstract: Uncertain knowledge graphs (UKGs) associate each triple with a confidence score to provide more precise knowledge representations. Since real-world UKGs suffer from incompleteness, UKG completion has recently attracted more attention, aiming to complete missing triples and their confidences. Current studies attempt to learn UKG embeddings to solve this problem, but they neglect the extremely imbalanced distributions of triple confidences, so the learnt embeddings are insufficient for high-quality UKG completion. Thus, in this paper, to address the above issue, we propose a new semi-supervised Confidence Distribution Learning (ssCDL) method for UKG completion, where each triple confidence is transformed into a confidence distribution to introduce more supervision information of different confidences and reinforce the embedding learning process. ssCDL iteratively learns UKG embeddings by relational learning on labeled data (i.e., existing triples with confidences) and unlabeled data with pseudo labels (i.e., unseen triples with generated confidences), which are predicted by meta-learning to augment the training data and rebalance the distribution of triple confidences. Experiments on two UKG datasets demonstrate that ssCDL consistently outperforms state-of-the-art baselines on different evaluation metrics.
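
A minimal sketch of the confidence-to-distribution transform, assuming Gaussian smearing over fixed confidence bins; the paper’s exact kernel and binning are not specified here:

```python
import numpy as np

def to_confidence_distribution(conf, bins=np.linspace(0.0, 1.0, 11), sigma=0.1):
    """Turn a scalar triple confidence into a discrete distribution over
    confidence bins, a common label-distribution-learning construction.
    Nearby bins receive mass too, softening the supervision signal."""
    d = np.exp(-0.5 * ((bins - conf) / sigma) ** 2)
    return d / d.sum()
```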

[374] Offline Policy Evaluation of Multi-Turn LLM Health Coaching with Real Users

Melik Ozolcer, Sang Won Bae

Main category: cs.AI

TL;DR: Study of a web-deployed LLM health coach showing that uniform heavy-tool policies harm low-health-literacy users, and that early information-gathering improves personalization outcomes.

DetailsMotivation: To evaluate and improve personalization in tool-augmented LLM health coaching systems, particularly addressing subgroup harms that average metrics may obscure.

Method: Used offline policy evaluation with factorized decision heads (Tool/Style) on real user data (7 users, 280 turns), plus lightweight simulator with hidden archetypes testing early information-gain bonuses.

Result: Uniform heavy-tool policy raised average value but harmed low-health-literacy/high-self-efficacy users. Early information-gathering shortened trait identification and improved goal success and pass@3 metrics.

Conclusion: Proposes evaluation-first personalization: freeze generator, learn subgroup-aware decision heads on typed rewards, and report per-archetype metrics to surface subgroup harms.

Abstract: We study a web-deployed, tool-augmented LLM health coach with real users. In a pilot with seven users (280 rated turns), offline policy evaluation (OPE) over factorized decision heads (Tool/Style) shows that a uniform heavy-tool policy raises average value on logs but harms specific subgroups, most notably low-health-literacy/high-self-efficacy users. A lightweight simulator with hidden archetypes further shows that adding a small early information-gain bonus reliably shortens trait identification and improves goal success and pass@3. Together, these early findings indicate an evaluation-first path to personalization: freeze the generator, learn subgroup-aware decision heads on typed rewards (objective tool outcomes and satisfaction), and always report per-archetype metrics to surface subgroup harms that averages obscure.
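
As a generic stand-in for the OPE step, a self-normalized importance-sampling estimator over logged, rated turns; the interface and the choice of estimator are assumptions, since the paper’s exact estimator is not specified here:

```python
def snips_value(logged_turns, target_prob):
    """Self-normalized importance-sampling estimate of a candidate
    (Tool/Style) policy's value from logged turns.
    logged_turns: iterable of (state, action, reward, logging_prob);
    target_prob(state, action): probability under the candidate policy."""
    num = den = 0.0
    for state, action, reward, logging_prob in logged_turns:
        w = target_prob(state, action) / logging_prob   # importance weight
        num += w * reward
        den += w
    return num / den if den > 0 else 0.0
```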

cs.SD

[375] Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware

Stavros Mitsis, Ermos Hadjikyriakos, Humaid Ibrahim, Savvas Neofytou, Shashwat Raman, James Myles, Eiman Kanjo

Main category: cs.SD

TL;DR: A hardware-aware multimodal emotion recognition system combining acoustic and linguistic features using late-fusion architecture optimized for Edge TPU, achieving real-time inference within 1.8MB memory budget and 21-23ms latency on microcontroller-class edge devices.

DetailsMotivation: To address the challenge of deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private, especially for applications like tension monitoring and conflict de-escalation where cloud-based solutions are impractical.

Method: Combines acoustic and linguistic features using late-fusion architecture optimized for Edge TPU, integrating quantised transformer-based acoustic model with frozen keyword embeddings from DSResNet-SE network, with spectrogram alignment using MicroFrontend and MLTK.

Result: Achieves 6.3% macro F1 improvement over unimodal baselines on re-recorded, segmented IEMOCAP samples captured through Coral Dev Board Micro microphone, with real-time inference within 1.8MB memory budget and 21-23ms latency.

Conclusion: Accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.

Abstract: Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8MB memory budget and 21-23ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
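
A minimal sketch of probability-level late fusion of the two heads; the fusion weight is an illustrative assumption, not the paper’s tuned value:

```python
import numpy as np

def late_fuse(acoustic_logits, keyword_logits, w=0.6):
    """Weighted late fusion of the acoustic head and the frozen
    keyword-embedding head. Each head's logits are softmaxed separately,
    then mixed, so the heads can be quantized and run independently."""
    a = np.exp(acoustic_logits - np.max(acoustic_logits))
    k = np.exp(keyword_logits - np.max(keyword_logits))
    probs = w * a / a.sum() + (1 - w) * k / k.sum()
    return int(np.argmax(probs)), probs
```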

[376] ParaStyleTTS: Toward Efficient and Robust Paralinguistic Style Control for Expressive Text-to-Speech Generation

Haowei Lou, Hye-Young Paik, Wen Hu, Lina Yao

Main category: cs.SD

TL;DR: ParaStyleTTS is a lightweight TTS framework that enables expressive style control from text prompts without reference audio or large language models, achieving comparable quality to LLM-based systems while being 30x faster and more resource-efficient.

DetailsMotivation: Existing TTS style control methods face limitations: reference audio approaches have privacy and accessibility issues, while LLM-based methods are computationally expensive, lack interpretability, and are sensitive to prompt phrasing.

Method: Proposes a two-level style adaptation architecture that separates prosodic and paralinguistic speech style modeling, enabling fine-grained control over factors like emotion, gender, and age from text prompts alone.

Result: ParaStyleTTS generates high-quality speech comparable to state-of-the-art LLM systems while being 30x faster, using 8x fewer parameters, and requiring 2.5x less CUDA memory. It also shows superior robustness and controllability over paralinguistic styles.

Conclusion: ParaStyleTTS provides a practical, efficient, and interpretable solution for style-controllable TTS that is well-suited for real-world applications including on-device and low-resource deployment.

Abstract: Controlling speaking style in text-to-speech (TTS) systems has become a growing focus in both academia and industry. While many existing approaches rely on reference audio to guide style generation, such methods are often impractical due to privacy concerns and limited accessibility. More recently, large language models (LLMs) have been used to control speaking style through natural language prompts; however, their high computational cost, lack of interpretability, and sensitivity to prompt phrasing limit their applicability in real-time and resource-constrained environments. In this work, we propose ParaStyleTTS, a lightweight and interpretable TTS framework that enables expressive style control from text prompts alone. ParaStyleTTS features a novel two-level style adaptation architecture that separates prosodic and paralinguistic speech style modeling. It allows fine-grained and robust control over factors such as emotion, gender, and age. Unlike LLM-based methods, ParaStyleTTS maintains consistent style realization across varied prompt formulations and is well-suited for real-world applications, including on-device and low-resource deployment. Experimental results show that ParaStyleTTS generates high-quality speech with performance comparable to state-of-the-art LLM-based systems while being 30x faster, using 8x fewer parameters, and requiring 2.5x less CUDA memory. Moreover, ParaStyleTTS exhibits superior robustness and controllability over paralinguistic speaking styles, providing a practical and efficient solution for style-controllable text-to-speech generation. Demo can be found at https://parastyletts.github.io/ParaStyleTTS_Demo/. Code can be found at https://github.com/haoweilou/ParaStyleTTS.

[377] SegTune: Structured and Fine-Grained Control for Song Generation

Pengfei Cai, Joanna Wang, Haorui Zheng, Xu Li, Zihao Ji, Teng Ma, Zhongliang Liu, Chen Zhang, Pengfei Wan

Main category: cs.SD

TL;DR: SegTune is a non-autoregressive framework for structured and controllable song generation that enables segment-level control through local musical descriptions aligned to song sections, using LLM-based duration prediction for precise lyric-to-music alignment.

DetailsMotivation: Existing song generation systems lack the ability to model temporally varying attributes, limiting fine-grained control over musical structure and dynamics.

Method: Proposes SegTune framework with segment-level control via temporally broadcasted prompts, LLM-based duration predictor for timestamped lyrics in LRC format, and a large-scale data pipeline for aligned lyrics and prompts.

Result: SegTune achieves superior controllability and musical coherence compared to existing baselines, with new evaluation metrics showing improved segment-level alignment and vocal attribute consistency.

Conclusion: The proposed SegTune framework successfully addresses the limitations of existing systems by enabling structured and controllable song generation with precise segment-level control and alignment.

Abstract: Recent advancements in song generation have shown promising results in generating songs from lyrics and/or global text prompts. However, most existing systems lack the ability to model the temporally varying attributes of songs, limiting fine-grained control over musical structure and dynamics. In this paper, we propose SegTune, a non-autoregressive framework for structured and controllable song generation. SegTune enables segment-level control by allowing users or large language models to specify local musical descriptions aligned to song sections. The segmental prompts are injected into the model by temporally broadcasting them to corresponding time windows, while global prompts influence the whole song to ensure stylistic coherence. To obtain accurate segment durations and enable precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamped lyrics in LRC format. We further construct a large-scale data pipeline for collecting high-quality songs with aligned lyrics and prompts, and propose new evaluation metrics to assess segment-level alignment and vocal attribute consistency. Experimental results show that SegTune achieves superior controllability and musical coherence compared to existing baselines. See https://cai525.github.io/SegTune_demo for demos of our work.
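
A sketch of the temporal broadcasting step, assuming frame-level windows and fixed-size prompt embeddings (shapes and names are illustrative):

```python
import torch

def broadcast_segment_prompts(seg_embs, seg_windows, n_frames):
    """seg_embs: (n_segments, d) segment prompt embeddings;
    seg_windows: list of (start_frame, end_frame) half-open windows.
    Tiles each segment's embedding across its time window, yielding a
    (n_frames, d) conditioning tensor to combine with the global prompt."""
    out = torch.zeros(n_frames, seg_embs.shape[-1])
    for emb, (start, end) in zip(seg_embs, seg_windows):
        out[start:end] = emb   # every frame in the window sees its prompt
    return out
```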

[378] A Stage-Wise Learning Strategy with Fixed Anchors for Robust Speaker Verification

Bin Gu, Lipeng Dai, Huipeng Du, Haitao Zhao, Jibo Wei

Main category: cs.SD

TL;DR: Proposed anchor-based stage-wise learning for robust speaker representation, using base model training followed by anchor extraction and noise-robust fine-tuning with anchor regularization.

DetailsMotivation: Learning robust speaker representations under noisy conditions requires handling both discriminative and noise-invariant properties, which presents significant challenges.

Method: Anchor-based stage-wise learning: 1) Train base model for discriminative speaker boundaries, 2) Extract anchor embeddings as stable references, 3) Fine-tune copy on noisy inputs with regularization to maintain proximity to fixed anchor embeddings.

Result: Strategy offers advantages over conventional joint optimization, particularly in maintaining discrimination while improving noise robustness. Consistent improvements across various noise conditions.

Conclusion: The method effectively handles boundary stabilization and variation suppression separately, leading to better noise-robust speaker representations.

Abstract: Learning robust speaker representations under noisy conditions presents significant challenges, requiring careful handling of both discriminative and noise-invariant properties. In this work, we propose an anchor-based stage-wise learning strategy for robust speaker representation learning. Specifically, our approach begins by training a base model to establish discriminative speaker boundaries, and then extracts anchor embeddings from this model as stable references. Finally, a copy of the base model is fine-tuned on noisy inputs, regularized by enforcing proximity to the corresponding fixed anchor embeddings to preserve speaker identity under distortion. Experimental results suggest that this strategy offers advantages over conventional joint optimization, particularly in maintaining discrimination while improving noise robustness. The proposed method demonstrates consistent improvements across various noise conditions, potentially due to its ability to handle boundary stabilization and variation suppression separately.
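
A sketch of the final-stage objective, assuming an MSE proximity term to the frozen anchors; the distance function and weighting are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def anchored_loss(noisy_emb, anchor_emb, logits, labels, lam=0.5):
    """Fine-tuning objective for the noisy stage: speaker classification
    on noisy inputs plus proximity to the fixed anchor embeddings.
    anchor_emb is detached so the anchors stay frozen references."""
    task = F.cross_entropy(logits, labels)
    reg = F.mse_loss(noisy_emb, anchor_emb.detach())
    return task + lam * reg
```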

[379] Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification

Bin Gu, Lipeng Dai, Huipeng Du, Haitao Zhao, Jibo Wei

Main category: cs.SD

TL;DR: A noise-conditioned mixture-of-experts framework that decomposes feature space into specialized noise-aware subspaces for robust speaker verification under diverse noise conditions.

DetailsMotivation: Robust speaker verification under noisy conditions remains challenging, and conventional methods learning unified representation spaces have limitations in handling diverse background noise.

Method: Proposes noise-conditioned expert routing mechanism, universal model-based expert specialization strategy, and SNR-decaying curriculum learning protocol to automatically route inputs to specialized expert networks based on noise characteristics.

Result: Comprehensive experiments demonstrate consistent superiority over baselines, showing significant enhancement in robustness without sacrificing verification accuracy.

Conclusion: Explicit noise-dependent feature modeling significantly improves speaker verification robustness under diverse noise conditions compared to unified representation approaches.

Abstract: Robust speaker verification under noisy conditions remains an open challenge. Conventional deep learning methods learn a robust unified speaker representation space against diverse background noise and achieve significant improvement. In contrast, this paper presents a noise-conditioned mixture-of-experts framework that decomposes the feature space into specialized noise-aware subspaces for speaker verification. Specifically, we propose a noise-conditioned expert routing mechanism, a universal model-based expert specialization strategy, and an SNR-decaying curriculum learning protocol, collectively improving model robustness and generalization under diverse noise conditions. The proposed method can automatically route inputs to expert networks based on noise information derived from the inputs, where each expert targets distinct noise characteristics while preserving speaker identity information. Comprehensive experiments demonstrate consistent superiority over baselines, confirming that explicit noise-dependent feature modeling significantly enhances robustness without sacrificing verification accuracy.
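
A minimal sketch of an SNR-decaying curriculum schedule; the linear shape and endpoint values are assumptions, not the paper’s protocol:

```python
def snr_at_step(step, total_steps, snr_start=20.0, snr_end=0.0):
    """SNR-decaying curriculum: begin with lightly corrupted audio and
    lower the SNR (dB) of training mixtures as training progresses,
    so experts see progressively harder noise conditions."""
    frac = min(step / max(total_steps, 1), 1.0)
    return snr_start + frac * (snr_end - snr_start)
```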

cs.LG

[380] From Noise to Laws: Regularized Time-Series Forecasting via Denoised Dynamic Graphs

Hongwei Ma, Junbin Gao, Minh-ngoc Tran

Main category: cs.LG

TL;DR: PRISM is a diffusion-based model for long-horizon multivariate time-series forecasting that combines score-based diffusion, dynamic graph encoding, and physics regularization to achieve state-of-the-art performance.

DetailsMotivation: Long-horizon multivariate time-series forecasting faces challenges in denoising heterogeneous signals, tracking time-varying cross-series dependencies, and maintaining stability and physical plausibility over long horizons.

Method: PRISM couples a score-based diffusion preconditioner with a dynamic correlation-thresholded graph encoder and a forecast head regularized by generic physics penalties.

Result: On six standard benchmarks, PRISM achieves consistent state-of-the-art performance with strong MSE and MAE gains.

Conclusion: Proven contraction of the induced horizon dynamics and Lipschitz bounds for the graph blocks explain the model’s robustness over long rollout horizons.

Abstract: Long-horizon multivariate time-series forecasting is challenging because realistic predictions must (i) denoise heterogeneous signals, (ii) track time-varying cross-series dependencies, and (iii) remain stable and physically plausible over long rollout horizons. We present PRISM, which couples a score-based diffusion preconditioner with a dynamic, correlation-thresholded graph encoder and a forecast head regularized by generic physics penalties. We prove contraction of the induced horizon dynamics under mild conditions and derive Lipschitz bounds for graph blocks, explaining the model’s robustness. On six standard benchmarks, PRISM achieves consistent SOTA with strong MSE and MAE gains.
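
A sketch of the correlation-thresholded dynamic graph construction per time window; the threshold value and the use of absolute correlations are illustrative assumptions:

```python
import numpy as np

def thresholded_graph(window, tau=0.5):
    """window: (timesteps, n_series) slice of the multivariate series.
    Builds a binary adjacency by thresholding absolute pairwise
    correlations, recomputed per window so the graph stays dynamic."""
    corr = np.corrcoef(window.T)                 # (n_series, n_series)
    adj = (np.abs(corr) >= tau).astype(float)
    np.fill_diagonal(adj, 0.0)                   # no self-loops
    return adj
```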

[381] GRETEL: A Goal-driven Retrieval and Execution-based Trial Framework for LLM Tool Selection Enhancing

Zongze Wu, Yani Guo, Churong Liang, Runnan Li

Main category: cs.LG

TL;DR: GRETEL addresses the semantic-functional gap in tool retrieval by using execution-based validation instead of just semantic similarity, improving tool selection reliability.

DetailsMotivation: Current tool retrieval methods rely on semantic similarity but often retrieve tools that are textually relevant but functionally inoperative due to parameter mismatches, authentication failures, and execution constraints.

Method: GRETEL implements an agentic workflow that processes semantically retrieved candidates through sandboxed plan-execute-evaluate cycles, generating execution-grounded evidence to distinguish functional tools from merely descriptive matches.

Result: Evaluation on ToolBench benchmark shows substantial improvements: Pass Rate (at 10) from 0.690 to 0.826, Recall (at 10) from 0.841 to 0.867, and NDCG (at 10) from 0.807 to 0.857.

Conclusion: Execution-based validation provides a more reliable foundation for tool selection than semantic similarity alone, enabling more robust agent performance in real-world applications.

Abstract: Despite remarkable advances in Large Language Model capabilities, tool retrieval for agent-based systems remains fundamentally limited by reliance on semantic similarity, which fails to capture functional viability. Current methods often retrieve textually relevant but functionally inoperative tools due to parameter mismatches, authentication failures, and execution constraints, a phenomenon we term the semantic-functional gap. We introduce GRETEL to address this gap through systematic empirical validation. GRETEL implements an agentic workflow that processes semantically retrieved candidates through sandboxed plan-execute-evaluate cycles, generating execution-grounded evidence to distinguish truly functional tools from merely descriptive matches. Our comprehensive evaluation on the ToolBench benchmark demonstrates substantial improvements across all metrics: Pass Rate (at 10) increases from 0.690 to 0.826, Recall (at 10) improves from 0.841 to 0.867, and NDCG (at 10) rises from 0.807 to 0.857. These results establish that execution-based validation provides a more reliable foundation for tool selection than semantic similarity alone, enabling more robust agent performance in real-world applications.
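
A schematic of the plan-execute-evaluate cycle; the planner, sandbox, and evaluator are passed in as callables since their interfaces are not specified here, so this is a shape of the workflow rather than GRETEL’s implementation:

```python
def execution_filter(candidates, task, plan_fn, execute_fn, evaluate_fn):
    """Filter semantically retrieved tools by trial execution.
    plan_fn(task, tool) drafts a concrete call; execute_fn(tool, plan)
    runs it in a sandbox; evaluate_fn(task, outcome) returns a bool
    execution-grounded verdict."""
    functional = []
    for tool in candidates:
        plan = plan_fn(task, tool)
        outcome = execute_fn(tool, plan)
        if evaluate_fn(task, outcome):
            functional.append(tool)
    return functional
```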

[382] CARLE: A Hybrid Deep-Shallow Learning Framework for Robust and Explainable RUL Estimation of Rolling Element Bearings

Waleed Razzaq, Yun-Bo Zhao

Main category: cs.LG

TL;DR: CARLE is a hybrid AI framework combining deep and shallow learning for robust Remaining Useful Life (RUL) estimation of bearings, outperforming state-of-the-art methods under dynamic operating conditions.

DetailsMotivation: Existing RUL estimation methods often lack generalizability and robustness under changing operating conditions, limiting their practical application in Prognostic Health Management systems.

Method: Hybrid framework using Res-CNN and Res-LSTM blocks with multi-head attention and residual connections for spatial-temporal pattern capture, plus Random Forest Regressor for stable prediction. Preprocessing includes Gaussian filtering and Continuous Wavelet Transform for feature extraction.

Result: CARLE outperforms several state-of-the-art methods on XJTU-SY and PRONOSTIA datasets, showing superior performance especially under dynamic conditions. Ablation studies confirm component contributions, and noise/cross-domain experiments demonstrate robustness.

Conclusion: CARLE provides an effective solution for robust RUL estimation with improved generalizability under changing conditions, and model interpretability analysis enhances transparency and trustworthiness.

Abstract: Prognostic Health Management (PHM) systems monitor and predict equipment health. A key task is Remaining Useful Life (RUL) estimation, which predicts how long a component, such as a rolling element bearing, will operate before failure. Many RUL methods exist but often lack generalizability and robustness under changing operating conditions. This paper introduces CARLE, a hybrid AI framework that combines deep and shallow learning to address these challenges. CARLE uses Res-CNN and Res-LSTM blocks with multi-head attention and residual connections to capture spatial and temporal degradation patterns, and a Random Forest Regressor (RFR) for stable, accurate RUL prediction. A compact preprocessing pipeline applies Gaussian filtering for noise reduction and Continuous Wavelet Transform (CWT) for time-frequency feature extraction. We evaluate CARLE on the XJTU-SY and PRONOSTIA bearing datasets. Ablation studies measure each component’s contribution, while noise and cross-domain experiments test robustness and generalization. Comparative results show CARLE outperforms several state-of-the-art methods, especially under dynamic conditions. Finally, we analyze model interpretability with LIME and SHAP to assess transparency and trustworthiness.
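
A sketch of the described preprocessing pipeline using common libraries; the Morlet wavelet, scale range, and filter width are illustrative assumptions rather than the paper’s settings:

```python
import numpy as np
import pywt
from scipy.ndimage import gaussian_filter1d

def preprocess_vibration(signal, scales=np.arange(1, 65), sigma=2.0):
    """Gaussian denoising followed by a continuous wavelet transform,
    producing a (n_scales, n_samples) time-frequency map to feed the
    Res-CNN/Res-LSTM feature extractor."""
    smoothed = gaussian_filter1d(signal, sigma=sigma)
    coeffs, _ = pywt.cwt(smoothed, scales, "morl")
    return coeffs
```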

[383] Shock-Aware Physics-Guided Fusion-DeepONet Operator for Rarefied Micro-Nozzle Flows

Ehsan Roohi, Amirmehran Mahdavi

Main category: cs.LG

TL;DR: A physics-aware deep learning framework for creating fast surrogate models of rarefied micro nozzle flows with shocks, validated on the viscous Burgers equation.

DetailsMotivation: To develop accurate and efficient surrogate models for complex rarefied micro nozzle flows containing shocks, which are computationally expensive to simulate with traditional methods.

Method: Integration of three components: Fusion DeepONet operator learning for parameter dependencies, physics-guided feature space with shock-aligned coordinates, and two-phase curriculum strategy focusing on high-gradient regions.

Result: The framework successfully captures parameter dependencies and shock-like gradients, as validated on the viscous Burgers equation which exhibits similar advective steepening behavior.

Conclusion: The proposed physics-aware deep learning framework provides an effective approach for constructing fast and accurate surrogate models of complex shock-containing flows.

Abstract: We present a comprehensive, physics aware deep learning framework for constructing fast and accurate surrogate models of rarefied, shock containing micro nozzle flows. The framework integrates three key components, a Fusion DeepONet operator learning architecture for capturing parameter dependencies, a physics-guided feature space that embeds a shock-aligned coordinate system, and a two-phase curriculum strategy emphasizing high-gradient regions. To demonstrate the generality and inductive bias of the proposed framework, we first validate it on the canonical viscous Burgers equation, which exhibits advective steepening and shock like gradients.

[384] MIN-Merging: Merge the Important Neurons for Model Merging

Yunfei Liang

Main category: cs.LG

TL;DR: MIN-Merging is a router-based framework that selectively merges important neurons to reduce parameter conflicts in model merging, achieving consistent performance gains on domain-specific tasks while maintaining generalization.

DetailsMotivation: Existing model merging approaches suffer from parameter conflicts that degrade performance on domain-specific tasks, despite the abundance of open-source models across diverse domains.

Method: Proposed MIN-Merging, a router-based framework that selectively merges the most important neurons to reduce parameter conflicts during model merging.

Result: Extensive experiments on CV and NLP benchmarks show consistent gains on in-domain tasks while retaining generalization ability on out-of-domain tasks.

Conclusion: MIN-Merging provides an effective practical solution to the parameter conflict problem in model merging, demonstrating its value for combining strengths of diverse open-source models.

Abstract: Recent advances in deep learning have led to a surge of open-source models across diverse domains. While model merging offers a promising way to combine their strengths, existing approaches often suffer from parameter conflicts that degrade performance on domain-specific tasks. We propose MIN-Merging, a router-based framework that selectively merges the most important neurons to reduce such conflicts. Extensive experiments on Computer Vision (CV) and Natural Language Processing (NLP) benchmarks show that MIN-Merging achieves consistent gains on in-domain tasks while retaining the generalization ability of pretrained models on out-of-domain tasks. These results highlight its effectiveness as a practical solution to the parameter conflict problem in model merging.
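
One plausible reading of importance-based neuron merging for a single weight matrix; the per-neuron importance score, keep ratio, and averaging rule are assumptions, and the paper’s router is not modeled here:

```python
import torch

def min_merge(base, finetuned_list, keep_ratio=0.2):
    """base: (out, in) pretrained weight; finetuned_list: same-shape
    task-specific weights. Score each output neuron (row) by the mean
    task-vector norm, merge only the top rows, and keep the pretrained
    values elsewhere to limit parameter conflicts."""
    deltas = [ft - base for ft in finetuned_list]
    importance = torch.stack([d.norm(dim=1) for d in deltas]).mean(0)
    k = max(1, int(keep_ratio * base.shape[0]))
    top = importance.topk(k).indices             # most important neurons
    merged = base.clone()
    merged[top] += torch.stack(deltas).mean(0)[top]
    return merged
```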

[385] Hierarchical Federated Unlearning for Large Language Models

Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu

Main category: cs.LG

TL;DR: A federated unlearning approach for LLMs that handles continuous, heterogeneous unlearning requests while preserving privacy and maintaining model utility through task-specific adapter learning and hierarchical merging.

DetailsMotivation: Address privacy and security concerns in LLMs by enabling removal of undesirable knowledge, while handling practical challenges of continuous heterogeneous unlearning needs and decentralized sensitive data with asymmetric access.

Method: Proposes federated unlearning that decouples unlearning and retention via task-specific adapter learning, and employs hierarchical merging strategy to mitigate conflicting objectives and enable robust unlearning updates.

Result: Comprehensive experiments on WMDP, MUSE, and TOFU benchmarks showed effective handling of heterogeneous unlearning requests while maintaining strong LLM utility compared to baseline methods.

Conclusion: The proposed federated unlearning approach provides a scalable and privacy-preserving solution for removing undesirable knowledge from LLMs, effectively addressing inter-domain and intra-domain interference while balancing forgetting and retention performance.

Abstract: Large Language Models (LLMs) are increasingly integrated into real-world applications, raising concerns about privacy, security, and the need to remove undesirable knowledge. Machine Unlearning has emerged as a promising solution, yet faces two key challenges: (1) practical unlearning needs are often continuous and heterogeneous, and (2) they involve decentralized, sensitive data with asymmetric access. These factors result in inter-domain and intra-domain interference, which further amplifies the dilemma of unbalanced forgetting and retaining performance. In response, we propose a federated unlearning approach for LLMs that is scalable and privacy-preserving. Our method decouples unlearning and retention via task-specific adapter learning and employs a hierarchical merging strategy to mitigate conflicting objectives and enable robust, adaptable unlearning updates. Comprehensive experiments on the WMDP, MUSE, and TOFU benchmarks show that our approach effectively handles heterogeneous unlearning requests while maintaining strong LLM utility compared with baseline methods.

[386] Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism

Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu

Main category: cs.LG

TL;DR: A unified benchmark for evaluating attention mechanisms in long-context LLM training, comparing kernel optimizations and context parallel strategies across attention patterns and sequence lengths.

DetailsMotivation: Standard attention in Transformers has quadratic computation/memory costs that limit long-context training. Existing solutions (kernel optimizations and context parallel training) lack systematic evaluation and clear performance analysis across different contexts.

Method: Proposed a modular benchmark integrating representative attention kernels and context parallel mechanisms. Evaluated methods along two dimensions: attention mask patterns (affecting efficiency/scalability) and sequence length/distributed scale (for extreme long-context training).

Result: Comprehensive experiments on up to 96 GPUs enabled reproducible comparisons, revealed method-specific trade-offs, and provided practical guidance for attention mechanism design and deployment.

Conclusion: The benchmark addresses evaluation gaps in long-context LLM training, offering systematic analysis of attention optimization methods and their performance characteristics across different scenarios.

Abstract: Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length, posing a major bottleneck for long-context training. Prior work tackles this challenge along two directions: (1) kernel-level optimizations, which accelerate dense and sparse attention operators; and (2) module-level strategies, often referred to as distributed attention or context parallel training, which scale attention across multiple devices. However, systematic evaluation remains limited: operator-level comparisons are often incomplete, while context parallel strategies are typically framework-specific, with unclear performance analysis across contexts. To address these gaps, we propose a unified benchmark that integrates representative attention kernels and context parallel mechanisms with a modular and extensible interface for evaluation. The benchmark evaluates methods along two critical dimensions: (1) attention mask patterns, which strongly affect efficiency, scalability, and usability, and (2) sequence length and distributed scale, which determine performance under extreme long-context training. Through comprehensive experiments on a cluster of up to 96 GPUs, our benchmark enables reproducible comparisons, highlights method-specific trade-offs, and provides practical guidance for designing and deploying attention mechanisms in long-context LLM training.

[387] L-MoE: End-to-End Training of a Lightweight Mixture of Low-Rank Adaptation Experts

Shihao Ji, Zihui Song

Main category: cs.LG

TL;DR: L-MoE combines Mixture of Experts (MoE) with Low-Rank Adaptation (LoRA) to create a lightweight, parameter-efficient framework where experts are task-specialized LoRA adapters dynamically composed by a trainable gating network.

DetailsMotivation: To unify the computational efficiency of MoE architecture with the parameter efficiency of LoRA fine-tuning, creating a modular and scalable framework for specialized language models.

Method: Replace dense MoE experts with collections of LoRA adapters, use a lightweight gating network to compute weighted averages of adapter parameters per input token, and train the entire system end-to-end with differentiable routing.

Result: A highly parameter-efficient MoE model that enables dynamic skill composition and maintains modular design while being fully trainable from end-to-end.

Conclusion: L-MoE provides a new approach for building efficient, scalable, and specialized language models through the integration of MoE and LoRA paradigms with differentiable expert composition.

Abstract: The Mixture of Experts (MoE) architecture enables the scaling of Large Language Models (LLMs) to trillions of parameters by activating a sparse subset of weights for each input, maintaining constant computational cost during inference. Concurrently, Low-Rank Adaptation (LoRA) has emerged as a dominant technique for parameter-efficiently fine-tuning LLMs on specialized tasks. In this work, we unify these two paradigms into a novel, end-to-end trainable framework named L-MoE: a Lightweight Mixture of LoRA Experts. L-MoE redefines MoE experts not as dense feed-forward networks, but as a collection of task-specialized, low-rank adapters. A lightweight gating network, trained jointly with the experts, learns to dynamically compose these LoRA adapters by computing a weighted average of their parameters for each input token. This composition is fully differentiable, allowing gradients from a standard auto-regressive language modeling objective to flow back through the entire architecture, simultaneously refining both the expert adapters and the routing strategy. This approach creates a highly parameter-efficient MoE model that is modular by design, allows for dynamic skill composition, and is trainable from end-to-end. We present the formal mathematical framework for L-MoE, detailing the differentiable routing mechanism and the joint optimization objective, thereby providing a new path toward building more efficient, scalable, and specialized language models.
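
A sketch of one L-MoE linear layer following the abstract’s description of per-token weighted averaging of LoRA parameters; the shapes, initialization, frozen-base detail, and class name are assumptions:

```python
import torch
import torch.nn as nn

class LMoELinear(nn.Module):
    """Experts are LoRA adapters (A: down-projection, B: up-projection);
    a lightweight gate mixes their parameters per token, and the mixed
    low-rank delta is added to a frozen base projection."""
    def __init__(self, d_in, d_out, n_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)                  # pretrained weight, frozen
        self.A = nn.Parameter(0.01 * torch.randn(n_experts, rank, d_in))
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))
        self.gate = nn.Linear(d_in, n_experts)       # lightweight router

    def forward(self, x):                            # x: (tokens, d_in)
        w = torch.softmax(self.gate(x), dim=-1)      # (tokens, n_experts)
        A = torch.einsum("te,erd->trd", w, self.A)   # per-token mixed A
        B = torch.einsum("te,eor->tor", w, self.B)   # per-token mixed B
        z = torch.einsum("trd,td->tr", A, x)         # low-rank activation
        return self.base(x) + torch.einsum("tor,tr->to", B, z)
```

Because the softmax routing is differentiable, gradients flow through both the adapters and the gate, matching the end-to-end training described above.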

[388] Automated Algorithm Design for Auto-Tuning Optimizers

Floris-Jan Willemsen, Niki van Stein, Ben van Werkhoven

Main category: cs.LG

TL;DR: Using LLMs to automatically generate specialized optimization algorithms for auto-tuning problems, achieving significant performance improvements over traditional human-designed optimizers.

DetailsMotivation: Traditional auto-tuning relies on established optimization algorithms, but no single method performs best across all tuning tasks. Manual optimizer design is challenging due to vast and irregular parameter spaces.

Method: A framework that prompts LLMs with problem descriptions and search-space characteristics to produce specialized optimization strategies, which are iteratively examined and improved.

Result: LLM-generated optimizers achieved average performance improvements of 30.7% and 14.6% when provided with application- and search space-specific information, respectively. Best-performing algorithms achieved 72.4% average improvement over state-of-the-art optimizers.

Conclusion: LLM-generated optimization algorithms can rival and often outperform existing human-designed algorithms, demonstrating the effectiveness of using LLMs for automatic optimizer generation in auto-tuning applications.

Abstract: Automatic performance tuning (auto-tuning) is essential for optimizing high-performance applications, where vast and irregular parameter spaces make manual exploration infeasible. Traditionally, auto-tuning relies on well-established optimization algorithms such as evolutionary algorithms, annealing methods, or surrogate model-based optimizers to efficiently find near-optimal configurations. However, designing effective optimizers remains challenging, as no single method performs best across all tuning tasks. In this work, we explore a new paradigm: using large language models (LLMs) to automatically generate optimization algorithms tailored to auto-tuning problems. We introduce a framework that prompts LLMs with problem descriptions and search-space characteristics to produce specialized optimization strategies, which are iteratively examined and improved. These generated algorithms are evaluated on four real-world auto-tuning applications across six hardware platforms and compared against the state-of-the-art optimization algorithms of two contemporary auto-tuning frameworks. The evaluation demonstrates that providing additional application- and search space-specific information in the generation stage results in an average performance improvement of 30.7% and 14.6%, respectively. In addition, our results show that LLM-generated optimizers can rival, and in various cases outperform, existing human-designed algorithms, with our best-performing generated optimization algorithms achieving, on average, 72.4% improvement over state-of-the-art optimizers for auto-tuning.

[389] The Sherpa.ai Blind Vertical Federated Learning Paradigm to Minimize the Number of Communications

Alex Acero, Daniel M. Jimenez-Gutierrez, Dario Pighin, Enrique Zuazua, Joaquin Del Rio, Xabi Uribe-Etxebarria

Main category: cs.LG

TL;DR: SBVFL reduces Vertical Federated Learning communication by ~99% while maintaining accuracy and privacy.

DetailsMotivation: Standard VFL requires excessive communications that compromise privacy, security, and energy efficiency, making training impractical in many sensitive domains.

Method: Sherpa.ai Blind Vertical Federated Learning (SBVFL) uses a distributed training mechanism that decouples most node updates from the server to dramatically reduce communication.

Result: Experiments show SBVFL reduces communication by approximately 99% compared to standard VFL while maintaining model accuracy and robustness.

Conclusion: SBVFL enables practical, privacy-preserving VFL across sensitive domains including healthcare, finance, manufacturing, aerospace, cybersecurity, and defense.

Abstract: Federated Learning (FL) enables collaborative decentralized training across multiple parties (nodes) while keeping raw data private. There are two main paradigms in FL: Horizontal FL (HFL), where all participant nodes share the same feature space but hold different samples, and Vertical FL (VFL), where participants hold complementary features for the same samples. While HFL is widely adopted, VFL is employed in domains where nodes hold complementary features about the same samples. Still, VFL presents a significant limitation: the vast number of communications required during training. This compromises privacy and security, and can lead to high energy consumption, and in some cases, make model training unfeasible due to the high number of communications. In this paper, we introduce Sherpa.ai Blind Vertical Federated Learning (SBVFL), a novel paradigm that leverages a distributed training mechanism enhanced for privacy and security. Decoupling the vast majority of node updates from the server dramatically reduces node-server communication. Experiments show that SBVFL reduces communication by ~99% compared to standard VFL while maintaining accuracy and robustness. Therefore, SBVFL enables practical, privacy-preserving VFL across sensitive domains, including healthcare, finance, manufacturing, aerospace, cybersecurity, and the defense industry.

[390] NeuCo-Bench: A Novel Benchmark Framework for Neural Embeddings in Earth Observation

Rikard Vinge, Isabelle Wittmann, Jannik Schneider, Michael Marszalek, Luis Gilch, Thomas Brunschwiler, Conrad M Albrecht

Main category: cs.LG

TL;DR: NeuCo-Bench is a benchmark framework for evaluating neural compression and representation learning in Earth Observation, featuring reusable embeddings, a hidden-task challenge mode, and balanced scoring.

DetailsMotivation: To establish standardized evaluation for neural embeddings in Earth Observation that addresses pretraining bias and supports diverse downstream tasks.

Method: Uses fixed-size embeddings as task-agnostic representations, includes evaluation pipeline with reusable embeddings, hidden-task leaderboard to mitigate bias, and scoring balancing accuracy and stability.

Result: Initial results from CVPR 2025 EARTHVISION workshop challenge and ablations with state-of-the-art foundation models demonstrate the framework’s effectiveness.

Conclusion: NeuCo-Bench provides a foundation for community-driven, standardized evaluation of neural embeddings for Earth Observation and other domains.

Abstract: We introduce NeuCo-Bench, a novel benchmark framework for evaluating (lossy) neural compression and representation learning in the context of Earth Observation (EO). Our approach builds on fixed-size embeddings that act as compact, task-agnostic representations applicable to a broad range of downstream tasks. NeuCo-Bench comprises three core components: (i) an evaluation pipeline built around reusable embeddings, (ii) a new challenge mode with a hidden-task leaderboard designed to mitigate pretraining bias, and (iii) a scoring system that balances accuracy and stability. To support reproducibility, we release SSL4EO-S12-downstream, a curated multispectral, multitemporal EO dataset. We present initial results from a public challenge at the 2025 CVPR EARTHVISION workshop and conduct ablations with state-of-the-art foundation models. NeuCo-Bench provides a first step towards community-driven, standardized evaluation of neural embeddings for EO and beyond.

[391] Uncertainty-Aware Post-Hoc Calibration: Mitigating Confidently Incorrect Predictions Beyond Calibration Metrics

Hassan Gharoun, Mohammad Sadegh Khorshidi, Kasra Ranjbarigderi, Fang Chen, Amir H. Gandomi

Main category: cs.LG

TL;DR: A post-hoc calibration framework that uses prediction reliability assessment to improve both calibration quality and uncertainty-aware decision-making through instance-level adaptive calibration.

DetailsMotivation: Existing calibration methods apply global transformations uniformly, ignoring the heterogeneous reliability of individual predictions, and the relationship between improved calibration and effective uncertainty-aware decision-making is underexplored.

Method: Uses proximity-based conformal prediction to stratify samples into correct/incorrect groups based on semantic similarity, then applies dual calibration: standard isotonic regression for correct predictions and underconfidence-regularized isotonic regression for incorrect predictions.

Result: Achieves fewer confidently incorrect predictions and competitive Expected Calibration Error compared to isotonic and focal-loss baselines on CIFAR-10 and CIFAR-100 with BiT and CoAtNet backbones.

Conclusion: Bridges calibration and uncertainty quantification through instance-level adaptivity, offering a practical post-hoc solution that improves both probability alignment and uncertainty-aware decision-making without model retraining.

Abstract: Despite extensive research on neural network calibration, existing methods typically apply global transformations that treat all predictions uniformly, overlooking the heterogeneous reliability of individual predictions. Furthermore, the relationship between improved calibration and effective uncertainty-aware decision-making remains largely unexplored. This paper presents a post-hoc calibration framework that leverages prediction reliability assessment to jointly enhance calibration quality and uncertainty-aware decision-making. The framework employs proximity-based conformal prediction to stratify calibration samples into putatively correct and putatively incorrect groups based on semantic similarity in feature space. A dual calibration strategy is then applied: standard isotonic regression calibrates confidence in putatively correct predictions, while underconfidence-regularized isotonic regression reduces confidence toward uniform distributions for putatively incorrect predictions, facilitating their identification for further investigation. A comprehensive evaluation is conducted using calibration metrics, uncertainty-aware performance measures, and empirical conformal coverage. Experiments on CIFAR-10 and CIFAR-100 with BiT and CoAtNet backbones show that the proposed method achieves fewer confidently incorrect predictions and competitive Expected Calibration Error compared with isotonic and focal-loss baselines. This work bridges calibration and uncertainty quantification through instance-level adaptivity, offering a practical post-hoc solution that requires no model retraining while improving both probability alignment and uncertainty-aware decision-making.
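
The dual-calibration strategy is easy to prototype. Below is a minimal sketch, assuming a k-NN distance in feature space as the proximity score and a shrink-toward-uniform target as the underconfidence regularizer; the threshold, shrinkage weight, and class count are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of the dual-calibration idea (illustrative, not the authors' code).
# Assumptions: feats/conf/correct come from a held-out calibration set; proximity
# is a simple mean k-NN distance in feature space.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.neighbors import NearestNeighbors

def dual_calibrate(feats, conf, correct, k=10, lam=0.5, n_classes=10):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(feats)
    dist, _ = nn.kneighbors(feats)
    proximity = dist[:, 1:].mean(axis=1)          # exclude the self-distance
    put_correct = proximity <= np.median(proximity)

    # Standard isotonic regression for the putatively correct group.
    iso_ok = IsotonicRegression(out_of_bounds="clip")
    iso_ok.fit(conf[put_correct], correct[put_correct])

    # Underconfidence regularization, approximated here by shrinking targets
    # toward a uniform prior for the putatively incorrect group (assumption).
    tgt = (1 - lam) * correct[~put_correct] + lam / n_classes
    iso_bad = IsotonicRegression(out_of_bounds="clip")
    iso_bad.fit(conf[~put_correct], tgt)
    return iso_ok, iso_bad
```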

[392] Data Unlearning Beyond Uniform Forgetting via Diffusion Time and Frequency Selection

Jinseong Park, Mijung Park

Main category: cs.LG

TL;DR: This paper proposes a time-frequency selective approach for data unlearning in diffusion models, addressing quality degradation issues by focusing on specific time-frequency ranges during training rather than treating all diffusion steps equally.

DetailsMotivation: Data unlearning in diffusion models is underexplored and suffers from quality degradation or incomplete forgetting. Existing methods treat all diffusion time steps equally, leading to poor generation quality.

Method: A time-frequency selective approach that focuses on specific time-frequency ranges during training, applied to gradient-based and preference optimization objectives across image-level and text-to-image tasks.

Result: The method achieves samples with higher aesthetic quality and lower noise compared to existing approaches, and the authors propose a normalized version of SSCD for better evaluation.

Conclusion: The analysis establishes a clearer understanding of data unlearning challenges in diffusion models and provides practical strategies to improve both evaluation and unlearning performance.

Abstract: Data unlearning aims to remove the influence of specific training samples from a trained model without requiring full retraining. Unlike concept unlearning, data unlearning in diffusion models remains underexplored and often suffers from quality degradation or incomplete forgetting. To address this, we first observe that most existing methods attempt to unlearn the samples at all diffusion time steps equally, leading to poor-quality generation. We argue that forgetting occurs disproportionately across time and frequency, depending on the model and scenarios. By selectively focusing on specific time-frequency ranges during training, we achieve samples with higher aesthetic quality and lower noise. We validate this improvement by applying our time-frequency selective approach to diverse settings, including gradient-based and preference optimization objectives, as well as both image-level and text-to-image tasks. Finally, to evaluate both deletion and quality of unlearned data samples, we propose a simple normalized version of SSCD. Together, our analysis and methods establish a clearer understanding of the unique challenges in data unlearning for diffusion models, providing practical strategies to improve both evaluation and unlearning performance.
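
As a rough illustration of the time-frequency selection idea, the sketch below gates an unlearning loss on a diffusion timestep range and penalizes only a radial frequency band of the prediction residual; the gate, band, and loss form are assumptions chosen for clarity, not the paper's exact objective.

```python
# Illustrative time-frequency-selective loss for diffusion unlearning.
# Assumptions: eps_pred/eps_true are predicted/target noise at timestep t;
# the timestep gate and frequency band are free design choices.
import torch

def tf_selective_loss(eps_pred, eps_true, t, t_range=(200, 800), band=(0.0, 0.25)):
    if not (t_range[0] <= t <= t_range[1]):       # gate on diffusion time
        return eps_pred.new_zeros(())             # skip updates outside the range
    # Move the residual to the frequency domain (2D FFT over spatial dims).
    res = torch.fft.fft2(eps_pred - eps_true, norm="ortho")
    H, W = res.shape[-2:]
    fy = torch.fft.fftfreq(H, device=res.device).abs().view(-1, 1)
    fx = torch.fft.fftfreq(W, device=res.device).abs().view(1, -1)
    radius = torch.sqrt(fx ** 2 + fy ** 2)        # normalized radial frequency
    mask = (radius >= band[0]) & (radius <= band[1])
    return (res.abs() ** 2 * mask).mean()         # penalize only the selected band
```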

[393] Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning

Chenwei Tang, Jingyu Xing, Xinyu Liu, Wei Ju, Jiancheng Lv, Deng Xiong, Ziyue Qiao

Main category: cs.LG

TL;DR: COMPASS is a novel test-time RL method that enables LLMs to learn from unlabeled data by combining answer and reasoning path rewards without external supervision.

DetailsMotivation: Current RL methods for LLMs face scalability bottlenecks due to reliance on human-curated preference data or labeled datasets for reward modeling.

Method: COMPASS integrates two components: DCAR (stabilizes training via confidence and credibility calibration for pseudo-labels) and DPR (optimizes reasoning process quality beyond outcome supervision).

Result: Extensive experiments show COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures.

Conclusion: COMPASS advances a more scalable direction for LLMs to learn from continuous experience without external supervision.

Abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs), achieving remarkable performance in complex reasoning domains such as mathematics and code generation. However, current RL methods face a fundamental scalability bottleneck due to their heavy reliance on human-curated preference data or labeled datasets for reward modeling. To overcome this limitation, we explore RL on unlabeled data where models learn autonomously from continuous experience streams. The core challenge in this setting lies in reliable reward estimation without ground-truth supervision. Existing approaches like Test-Time RL address this through self-consistent consensus, but risk reinforcing incorrect pseudo-labels derived from majority voting. We introduce COMPASS (Composite Path and Answer Self-Scoring), a novel test-time reward mechanism that operates without external supervision. COMPASS integrates two complementary components: the Dual-Calibration Answer Reward (DCAR), which stabilizes training by establishing trustworthy pseudo-labels through confidence and credibility calibration, and the Decisive Path Reward (DPR), which directly optimizes the reasoning process quality beyond mere outcome supervision. By jointly reinforcing trustworthy consensus answers and highly decisive reasoning chains, COMPASS systematically enhances the model’s analytical capabilities. Extensive experiments show that COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, advancing a more scalable direction for LLMs to learn from continuous experience.
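
A toy version of the answer-side signal helps make this concrete: sample several rollouts, take the majority answer as a pseudo-label, and scale its reward by inter-rollout agreement. COMPASS's credibility calibration and Decisive Path Reward are not reproduced here, so treat this purely as an illustration.

```python
# Toy sketch of a consensus-based test-time reward (answer side only).
# Assumptions: `answers` are final answers parsed from sampled rollouts;
# the paper's credibility calibration and path reward (DPR) are not modeled.
from collections import Counter

def consensus_reward(answers):
    counts = Counter(answers)
    pseudo_label, votes = counts.most_common(1)[0]
    confidence = votes / len(answers)             # agreement ratio in [0, 1]
    # Reward each rollout for matching the pseudo-label, scaled by agreement,
    # so low-agreement (untrustworthy) pseudo-labels contribute weak signal.
    return [confidence * (a == pseudo_label) for a in answers], pseudo_label

rewards, label = consensus_reward(["42", "42", "41", "42"])  # usage example
```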

[394] EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning

He Du, Bowen Li, Aijun Yang, Siyang He, Qipeng Guo, Dacheng Tao

Main category: cs.LG

TL;DR: The paper introduces an evolutionary, task-agnostic framework for synthesizing verifiable data that improves language model training across math, coding, and agentic tasks.

DetailsMotivation: Existing methods for creating synthetic verifiable data suffer from hallucination-prone generation and from weak verification artifacts that do not effectively distinguish strong from weak solutions, and they lack a universal evaluator of verifiability.

Method: An evolutionary, task-agnostic, strategy-guided framework that synthesizes problems, solutions, and verification artifacts from minimal seed supervision, using a consistency-based evaluator to enforce agreement between human-annotated and strategy-induced checks.

Result: Training with the synthesized data yields significant improvements on both LiveCodeBench and AgentBench-OS tasks, demonstrating robust generalization across domains.

Conclusion: The framework successfully upgrades filtering into principled synthesis, reliably assembling coherent verifiable training instances without domain-specific rules, enabling effective training under both RLVR and model distillation paradigms.

Abstract: Reliable verifiable data has become a key driver of capability gains in modern language models, enabling stable reinforcement learning with verifiable rewards and effective distillation that transfers competence across math, coding, and agentic tasks. Yet constructing generalizable synthetic verifiable data remains difficult due to hallucination-prone generation and weak or trivial verification artifacts that fail to separate strong from weak solutions. Existing approaches often rely on task-specific heuristics or post-hoc filters that do not transfer across domains and lack a principled, universal evaluator of verifiability. In this work, we introduce an evolutionary, task-agnostic, strategy-guided, executably-checkable data synthesis framework that, from minimal seed supervision, jointly synthesizes problems, diverse candidate solutions, and verification artifacts, and iteratively discovers strategies via a consistency-based evaluator that enforces agreement between human-annotated and strategy-induced checks. This pipeline upgrades filtering into principled synthesis: it reliably assembles coherent, verifiable training instances and generalizes without domain-specific rules. Our experiments demonstrate the effectiveness of the proposed approach under both RLVR and model distillation training paradigms. The results show that training with our synthesized data yields significant improvements on both the LiveCodeBench and AgentBench-OS tasks, highlighting the robust generalization of our framework.

[395] From Observations to Parameters: Detecting Changepoint in Nonlinear Dynamics with Simulation-based Inference

Xiangbo Deng, Cheng Chen, Peng Yang

Main category: cs.LG

TL;DR: Param-CPD detects regime shifts in chaotic time series by first inferring governing parameters using neural posterior estimation, then applying changepoint detection to parameter trajectories, outperforming observation-space methods.

DetailsMotivation: Detecting regime shifts in chaotic time series is challenging because observation signals are entangled with intrinsic variability, making it hard to distinguish true parameter changes from natural system fluctuations.

Method: Two-stage framework: 1) Amortized Bayesian inference of governing parameters using neural posterior estimator trained by simulation-based inference, 2) Standard changepoint detection algorithm applied to the resulting parameter trajectory.

Result: On Lorenz-63 with piecewise-constant parameters, Param-CPD improves F1 score, reduces localization error, and lowers false positives compared to observation-space baselines. Shows consistent gains across tolerance, window length, and noise variations.

Conclusion: Operating in physically interpretable parameter space enables accurate and interpretable changepoint detection in nonlinear dynamical systems, providing cleaner detection signals than observation space.

Abstract: Detecting regime shifts in chaotic time series is hard because observation-space signals are entangled with intrinsic variability. We propose Parameter-Space Changepoint Detection (Param-CPD), a two-stage framework that first amortizes Bayesian inference of governing parameters with a neural posterior estimator trained by simulation-based inference, and then applies a standard CPD algorithm to the resulting parameter trajectory. On Lorenz-63 with piecewise-constant parameters, Param-CPD improves F1, reduces localization error, and lowers false positives compared to observation-space baselines. We further verify identifiability and calibration of the inferred posteriors on stationary trajectories, explaining why parameter space offers a cleaner detection signal. Robustness analyses over tolerance, window length, and noise indicate consistent gains. Our results show that operating in a physically interpretable parameter space enables accurate and interpretable changepoint detection in nonlinear dynamical systems.
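
A minimal two-stage pipeline looks like the sketch below; the stub posterior estimator and the simple z-score jump detector stand in for the trained NPE and the CPD algorithm, and both stand-ins are assumptions.

```python
# Sketch of the two-stage Param-CPD pipeline with stand-in components.
# Assumptions: infer_params replaces a trained amortized neural posterior
# estimator; the z-score scan replaces a proper CPD algorithm.
import numpy as np

def infer_params(window):
    # Placeholder point estimate of governing parameters for one window.
    return np.array([window.mean(), window.std()])

def param_cpd(series, win=100, step=10, z_thresh=4.0):
    # Stage 1: map sliding observation windows into parameter space.
    traj = np.array([infer_params(series[i:i + win])
                     for i in range(0, len(series) - win, step)])
    # Stage 2: flag large jumps in the parameter trajectory.
    jumps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    z = (jumps - jumps.mean()) / (jumps.std() + 1e-8)
    return np.where(z > z_thresh)[0] * step       # approximate changepoint positions
```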

[396] UniRL-Zero: Reinforcement Learning on Unified Models with Joint Language Model and Diffusion Model Experts

Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, Taesung Park

Main category: cs.LG

TL;DR: UniRL-Zero is a unified reinforcement learning framework that enhances multimodal language model understanding/reasoning and diffusion model multimedia generation within a single model, with six defined scenarios and systematic baselines.

DetailsMotivation: To create a unified framework that can handle both multimodal understanding/reasoning and multimedia generation tasks within a single reinforcement learning model, addressing the need for integrated AI capabilities.

Method: Developed UniRL-Zero framework with six defined scenarios for unified model reinforcement learning, providing systematic baselines for reinforcement learning of unified understanding and generation models.

Result: The framework successfully integrates multimodal language model understanding/reasoning with diffusion model multimedia generation capabilities in a unified reinforcement learning approach.

Conclusion: UniRL-Zero provides a comprehensive unified reinforcement learning framework that enables beneficial interaction between multimodal understanding and generation capabilities, establishing systematic baselines for future research in this area.

Abstract: We present UniRL-Zero, a unified reinforcement learning (RL) framework that boosts multimodal language model understanding and reasoning, diffusion model multimedia generation, and their beneficial interaction within a unified model. Our work defines six scenarios for unified-model reinforcement learning, providing systematic baselines for reinforcement learning of unified understanding and generation models. Our code is available at https://github.com/G-U-N/UniRL.

[397] Demystifying Transition Matching: When and Why It Can Beat Flow Matching

Jaihoon Kim, Rajarshi Saha, Minhyuk Sung, Youngsuk Park

Main category: cs.LG

TL;DR: TM outperforms FM when target distributions have well-separated modes and non-negligible variances, due to TM’s stochastic updates preserving target covariance that FM underestimates.

DetailsMotivation: To understand when and why Transition Matching (TM) achieves higher quality with fewer sampling steps than Flow Matching (FM), which underpins many state-of-the-art generative models.

Method: Theoretical analysis comparing TM and FM: (1) unimodal Gaussian distributions - proving TM attains strictly lower KL divergence and characterizing convergence rates; (2) Gaussian mixtures - identifying local-unimodality regimes where TM outperforms FM, with approximation error decreasing as component means separate.

Result: TM achieves strictly lower KL divergence than FM for a finite number of steps in the unimodal Gaussian case, with faster convergence under a fixed compute budget. In Gaussian mixtures, TM outperforms FM when modes are well separated, but the advantage diminishes as the target variance approaches zero.

Conclusion: TM outperforms FM when target distributions have well-separated modes and non-negligible variances, validated through controlled experiments on Gaussian distributions and extended to real-world image and video generation applications.

Abstract: Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for a finite number of steps. The improvement arises from stochastic difference-latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.
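
The unimodal claim has a simple numerical illustration: for a standard Gaussian target, the closed-form KL shows that any shrinkage of the generated variance, as happens with few deterministic FM steps, strictly increases the divergence, while covariance-preserving stochastic updates keep it at zero. The one-dimensional, zero-mean setting below is an assumption made for brevity.

```python
# KL( N(mu, var) || N(0, 1) ) in closed form, one-dimensional case.
import numpy as np

def kl_gauss_1d(mu, var):
    return 0.5 * (var + mu ** 2 - 1.0 - np.log(var))

for var in [1.0, 0.9, 0.7, 0.5]:                  # generated variance
    print(f"var={var:.1f}  KL={kl_gauss_1d(0.0, var):.4f}")
# var=1.0 gives KL=0 (covariance preserved); shrinking the variance,
# as deterministic few-step sampling tends to, strictly increases the KL.
```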

[398] Attention-Guided Deep Adversarial Temporal Subspace Clustering (A-DATSC) Model for multivariate spatiotemporal data

Francis Ndikum Nji, Vandana Janeja, Jianwu Wang

Main category: cs.LG

TL;DR: A-DATSC is a novel deep subspace clustering model that combines a U-Net-inspired generator with a quality-verifying discriminator to handle complex 4D spatiotemporal data, addressing limitations of existing methods through attention mechanisms and temporal modeling.

DetailsMotivation: Existing deep subspace clustering methods have limitations: they use shallow autoencoders, ignore clustering errors, emphasize global features over local structure, fail to model long-range dependencies and positional information, and rarely handle 4D spatiotemporal data.

Method: Proposed A-DATSC combines a deep subspace clustering generator (U-Net-inspired with TimeDistributed ConvLSTM2D layers) and a quality-verifying discriminator. Uses graph attention transformer for self-expressive network to capture local spatial relationships, global dependencies, and both short- and long-range correlations.

Result: Experiments on three real-world multivariate spatiotemporal datasets show A-DATSC achieves substantially superior clustering performance compared to state-of-the-art deep subspace clustering models.

Conclusion: A-DATSC effectively addresses key limitations of existing methods and demonstrates superior performance for complex multivariate spatiotemporal data clustering tasks.

Abstract: Deep subspace clustering models are vital for applications such as snowmelt detection, sea ice tracking, crop health monitoring, infectious disease modeling, network load prediction, and land-use planning, where multivariate spatiotemporal data exhibit complex temporal dependencies and reside on multiple nonlinear manifolds beyond the capability of traditional clustering methods. These models project data into a latent space where samples lie in linear subspaces and exploit the self-expressiveness property to uncover intrinsic relationships. Despite their success, existing methods face major limitations: they use shallow autoencoders that ignore clustering errors, emphasize global features while neglecting local structure, fail to model long-range dependencies and positional information, and are rarely applied to 4D spatiotemporal data. To address these issues, we propose A-DATSC (Attention-Guided Deep Adversarial Temporal Subspace Clustering), a model combining a deep subspace clustering generator and a quality-verifying discriminator. The generator, inspired by U-Net, preserves spatial and temporal integrity through stacked TimeDistributed ConvLSTM2D layers, reducing parameters and enhancing generalization. A graph attention transformer based self-expressive network captures local spatial relationships, global dependencies, and both short- and long-range correlations. Experiments on three real-world multivariate spatiotemporal datasets show that A-DATSC achieves substantially superior clustering performance compared to state-of-the-art deep subspace clustering models.
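
The generator's temporal backbone can be sketched in a few Keras lines; stacking ConvLSTM2D layers with return_sequences=True is one common reading of "stacked TimeDistributed ConvLSTM2D layers", and all shapes and widths below are placeholder assumptions rather than the paper's configuration.

```python
# Skeletal spatiotemporal encoder in the spirit of the description (illustrative).
# Assumptions: input is (time, height, width, channels); layer widths are arbitrary.
from tensorflow.keras import layers, models

def make_encoder(T=8, H=64, W=64, C=4, latent=32):
    inp = layers.Input((T, H, W, C))
    x = layers.ConvLSTM2D(16, 3, padding="same", return_sequences=True)(inp)
    x = layers.ConvLSTM2D(latent, 3, padding="same", return_sequences=True)(x)
    return models.Model(inp, x)                   # preserves the time dimension
```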

[399] Benchmarking Probabilistic Time Series Forecasting Models on Neural Activity

Ziyu Lu, Anna J. Li, Alexander E. Ladd, Pascha Matveev, Aditya Deole, Eric Shea-Brown, J. Nathan Kutz, Nicholas A. Steinmetz

Main category: cs.LG

TL;DR: Systematic evaluation of probabilistic deep learning models for neural activity forecasting shows they outperform classical statistical methods, with best models providing informative predictions up to 1.5 seconds ahead.

DetailsMotivation: Neural activity forecasting is crucial for understanding neural systems and enabling closed-loop control, but deep learning applications in this domain remain limited despite advances in time series forecasting.

Method: Evaluated eight probabilistic deep learning models (including two foundation models) against four classical statistical models and two baseline methods on spontaneous neural activity recorded from mouse cortex via widefield imaging.

Result: Several deep learning models consistently outperformed classical approaches across prediction horizons, with the best model producing informative forecasts up to 1.5 seconds into the future.

Conclusion: The findings support future control applications and open new avenues for probing the intrinsic temporal structure of neural activity.

Abstract: Neural activity forecasting is central to understanding neural systems and enabling closed-loop control. While deep learning has recently advanced the state-of-the-art in the time series forecasting literature, its application to neural activity forecasting remains limited. To bridge this gap, we systematically evaluated eight probabilistic deep learning models, including two foundation models, that have demonstrated strong performance on general forecasting benchmarks. We compared them against four classical statistical models and two baseline methods on spontaneous neural activity recorded from mouse cortex via widefield imaging. Across prediction horizons, several deep learning models consistently outperformed classical approaches, with the best model producing informative forecasts up to 1.5 seconds into the future. Our findings point toward future control applications and open new avenues for probing the intrinsic temporal structure of neural activity.

[400] Cross-Domain Long-Term Forecasting: Radiation Dose from Sparse Neutron Sensor via Spatio-Temporal Operator Network

Jay Phil Yoo, Kazuma Kobayashi, Souvik Chakraborty, Syed Bahauddin Alam

Main category: cs.LG

TL;DR: STONe is a non-autoregressive neural operator that learns stable functional mappings between heterogeneous domains, enabling cross-domain forecasting of physical quantities from sparse sensor data without requiring domain alignment or iterative recurrence.

DetailsMotivation: Existing neural operators fail in real-world systems where sensing and prediction occur on distinct physical manifolds and over long timescales, requiring dense input-output fields and short temporal contexts.

Method: STONe directly infers high-altitude radiation dose fields from sparse ground-based neutron measurements, defining a nonlinear operator between sensor and target manifolds that remains stable over long forecasting horizons without iterative recurrence.

Result: Trained on 23 years of global neutron data, STONe achieves accurate 180-day forecasts with millisecond inference latency.

Conclusion: The framework establishes a general principle for cross-domain operator inference, enabling real-time prediction of complex spatiotemporal fields in physics, climate, and energy systems.

Abstract: Forecasting unobservable physical quantities from sparse, cross-domain sensor data is a central unsolved problem in scientific machine learning. Existing neural operators and large-scale forecasters rely on dense, co-located input-output fields and short temporal contexts, assumptions that fail in real-world systems where sensing and prediction occur on distinct physical manifolds and over long timescales. We introduce the Spatio-Temporal Operator Network (STONe), a non-autoregressive neural operator that learns a stable functional mapping between heterogeneous domains. By directly inferring high-altitude radiation dose fields from sparse ground-based neutron measurements, STONe demonstrates that operator learning can generalize beyond shared-domain settings. It defines a nonlinear operator between sensor and target manifolds that remains stable over long forecasting horizons without iterative recurrence. This challenges the conventional view that operator learning requires domain alignment or autoregressive propagation. Trained on 23 years of global neutron data, STONe achieves accurate 180-day forecasts with millisecond inference latency. The framework establishes a general principle for cross-domain operator inference, enabling real-time prediction of complex spatiotemporal fields in physics, climate, and energy systems.
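
For intuition, a generic branch-trunk operator maps a sparse sensor vector and arbitrary query coordinates to field values, which captures the cross-domain flavor in miniature; this DeepONet-style sketch is an assumption for illustration and is not the STONe architecture.

```python
# Generic sensor-to-field operator sketch (DeepONet-style, illustrative only).
import torch
import torch.nn as nn

class SensorToFieldOperator(nn.Module):
    def __init__(self, n_sensors, coord_dim=3, width=128):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, width), nn.GELU(),
                                    nn.Linear(width, width))
        self.trunk = nn.Sequential(nn.Linear(coord_dim, width), nn.GELU(),
                                   nn.Linear(width, width))

    def forward(self, sensors, coords):
        # sensors: (B, n_sensors); coords: (B, Q, coord_dim) query points.
        b = self.branch(sensors).unsqueeze(1)     # (B, 1, width)
        t = self.trunk(coords)                    # (B, Q, width)
        return (b * t).sum(-1)                    # field value at each query point
```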

[401] Measure-Theoretic Anti-Causal Representation Learning

Arman Behnam, Binghui Wang

Main category: cs.LG

TL;DR: ACIA is a measure-theoretic framework for anti-causal representation learning that uses two-level representations to capture label-to-observation generation and stable causal patterns, achieving superior OOD generalization without requiring explicit causal structures.

DetailsMotivation: Causal representation learning in anti-causal settings (where labels cause features) presents unique challenges that require specialized approaches beyond standard causal learning methods.

Method: ACIA employs a two-level representation design: low-level captures how labels generate observations, high-level learns stable causal patterns across environments. It uses interventional kernels to handle perfect and imperfect interventions, works without explicit causal structures, and handles high-dimensional data.

Result: Experiments on synthetic and real-world medical datasets show ACIA consistently outperforms state-of-the-art methods in both accuracy and invariance metrics.

Conclusion: ACIA provides theoretical guarantees for out-of-distribution generalization with tight bounds on performance gaps between training and unseen environments, confirming its efficacy for robust anti-causal learning.

Abstract: Causal representation learning in the anti-causal setting (labels cause features rather than the reverse) presents unique challenges requiring specialized approaches. We propose Anti-Causal Invariant Abstractions (ACIA), a novel measure-theoretic framework for anti-causal representation learning. ACIA employs a two-level design: low-level representations capture how labels generate observations, while high-level representations learn stable causal patterns across environment-specific variations. ACIA addresses key limitations of existing approaches by accommodating perfect and imperfect interventions through interventional kernels, eliminating dependency on explicit causal structures, handling high-dimensional data effectively, and providing theoretical guarantees for out-of-distribution generalization. Experiments on synthetic and real-world medical datasets demonstrate that ACIA consistently outperforms state-of-the-art methods in both accuracy and invariance metrics. Furthermore, our theoretical results establish tight bounds on performance gaps between training and unseen environments, confirming the efficacy of our approach for robust anti-causal learning.

[402] Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models

Jiajun Fan, Tong Wei, Chaoran Cheng, Yuxin Chen, Ge Liu

Main category: cs.LG

TL;DR: ADRPO is an adaptive divergence regularization method for RL fine-tuning that automatically adjusts regularization strength based on advantage estimates, enabling better balance between exploration and exploitation across various generative models.

DetailsMotivation: Existing RL fine-tuning approaches use fixed divergence regularization, creating a dilemma where strong regularization limits reward optimization while weak regularization risks instability or reward hacking.

Method: ADRPO automatically adjusts regularization strength based on advantage estimates - reducing regularization for high-value samples and applying stronger regularization to poor samples, implemented with Wasserstein-2 regularization for flow matching generative models.

Result: Achieved remarkable results on text-to-image generation, enabling a 2B-parameter SD3 model to surpass larger 4.8B and 12B models in attribute binding, semantic consistency, and compositional control. Also improved LLM fine-tuning and multi-modal reasoning, with a 7B model outperforming commercial models such as Gemini 2.5 Pro and GPT-4o Audio.

Conclusion: ADRPO provides an effective plug-and-play solution to the exploration-exploitation challenge across diverse generative architectures and modalities, enhancing existing online RL methods while maintaining generation diversity.

Abstract: Balancing exploration and exploitation during reinforcement learning fine-tuning of generative models presents a critical challenge, as existing approaches rely on fixed divergence regularization that creates an inherent dilemma: strong regularization preserves model capabilities but limits reward optimization, while weak regularization enables greater alignment but risks instability or reward hacking. We introduce Adaptive Divergence Regularized Policy Optimization (ADRPO), which automatically adjusts regularization strength based on advantage estimates-reducing regularization for high-value samples while applying stronger regularization to poor samples, enabling policies to navigate between exploration and aggressive exploitation according to data quality. Our implementation with Wasserstein-2 regularization for flow matching generative models achieves remarkable results on text-to-image generation, achieving better semantic alignment and diversity than offline methods like DPO and online methods with fixed regularization like ORW-CFM-W2. ADRPO enables a 2B parameter SD3 model to surpass much larger models with 4.8B and 12B parameters in attribute binding, semantic consistency, artistic style transfer, and compositional control while maintaining generation diversity. ADRPO generalizes to KL-regularized fine-tuning of both text-only LLMs and multi-modal reasoning models, enhancing existing online RL methods like GRPO. In LLM fine-tuning, ADRPO demonstrates an emergent ability to escape local optima through active exploration, while in multi-modal audio reasoning, it outperforms GRPO through superior step-by-step reasoning, enabling a 7B model to outperform substantially larger commercial models including Gemini 2.5 Pro and GPT-4o Audio, offering an effective plug-and-play solution to the exploration-exploitation challenge across diverse generative architectures and modalities.
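
One way to realize the adaptive schedule is to map normalized advantage estimates to per-sample regularization weights, as sketched below; the sigmoid mapping and its coefficients are illustrative assumptions, not the paper's exact rule.

```python
# Minimal sketch of advantage-adaptive divergence regularization (illustrative).
# Assumptions: adv holds advantage estimates for a batch; the advantage-to-weight
# mapping is one plausible choice, not the paper's.
import torch

def adaptive_reg_weights(adv, beta0=0.1, k=1.0):
    # Normalize advantages, then shrink regularization for high-advantage samples
    # and strengthen it for low-advantage ones.
    a = (adv - adv.mean()) / (adv.std() + 1e-8)
    return beta0 * torch.sigmoid(-k * a)          # weights lie in (0, beta0)

def adrpo_style_loss(policy_loss_per_sample, div_per_sample, adv):
    beta = adaptive_reg_weights(adv).detach()     # weights are not differentiated
    return (policy_loss_per_sample + beta * div_per_sample).mean()
```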

[403] SPACeR: Self-Play Anchoring with Centralized Reference Models

Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka, Yihan Hu, Wei Zhan

Main category: cs.LG

TL;DR: SPACeR is a framework that combines pretrained tokenized motion models with self-play RL to create human-like, scalable autonomous vehicle policies that are faster and smaller than large generative models.

DetailsMotivation: Current methods for developing AV policies either use computationally expensive imitation learning that struggles in reactive scenarios, or self-play RL that diverges from human norms. There's a need for policies that are both human-like and scalable.

Method: SPACeR uses a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play RL, providing likelihood rewards and KL divergence to anchor policies to human driving distribution.

Result: The method achieves competitive performance with imitation-learned policies while being 10x faster at inference and 50x smaller in parameter size. It effectively measures planner quality in closed-loop ego planning tasks.

Conclusion: SPACeR establishes a new paradigm for testing autonomous driving policies through fast and scalable traffic simulation that maintains human-like behavior.

Abstract: Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow during inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play. The reference model provides likelihood rewards and KL divergence, anchoring policies to the human driving distribution while preserving RL scalability. Evaluated on the Waymo Sim Agents Challenge, our method achieves competitive performance with imitation-learned policies while being up to 10x faster at inference and 50x smaller in parameter size than large generative models. In addition, we demonstrate in closed-loop ego planning evaluation tasks that our sim agents can effectively measure planner quality with fast and scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.
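
The anchoring signal can be sketched as a per-step reward that adds the reference model's likelihood of the chosen action and subtracts a KL penalty to the reference; the functional form and weights below are assumptions for illustration.

```python
# Sketch of anchoring self-play RL to a reference motion model (illustrative).
# Assumptions: both policies expose logits over the same action space;
# coefficient values are placeholders.
import torch
import torch.nn.functional as F

def anchored_reward(task_reward, policy_logits, ref_logits, action,
                    w_lik=0.1, w_kl=0.01):
    logp_pi = F.log_softmax(policy_logits, dim=-1)
    logp_ref = F.log_softmax(ref_logits, dim=-1)
    lik = logp_ref[action]                        # likelihood reward from reference
    # KL(pi || ref): kl_div(input=log ref, target=log pi, log_target=True).
    kl = F.kl_div(logp_ref, logp_pi, log_target=True, reduction="sum")
    return task_reward + w_lik * lik - w_kl * kl
```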

[404] Fine-tuning Flow Matching Generative Models with Intermediate Feedback

Jiajun Fan, Chaoran Cheng, Shuaike Shen, Xiangxin Zhou, Ge Liu

Main category: cs.LG

TL;DR: AC-Flow is an actor-critic framework for fine-tuning flow-based generative models with intermediate feedback, addressing credit assignment and training stability issues through reward shaping, dual-stability mechanisms, and scalable critic weighting.

DetailsMotivation: Fine-tuning flow-based generative models with intermediate feedback is challenging due to credit assignment problems and training instabilities in existing approaches that rely solely on outcome rewards or direct reward regression.

Method: Three key innovations: (1) reward shaping for normalized learning signals, (2) dual-stability mechanism with advantage clipping and warm-up phase, (3) scalable generalized critic weighting with Wasserstein regularization.

Result: AC-Flow achieves state-of-the-art performance in text-to-image alignment tasks and generalization to unseen human preference models on Stable Diffusion 3, maintaining generative quality, diversity, and stability with computationally efficient critic models.

Conclusion: The framework enables robust fine-tuning of flow models without compromising performance, demonstrating that efficient critic models can effectively guide flow-based generative model training.

Abstract: Flow-based generative models have shown remarkable success in text-to-image generation, yet fine-tuning them with intermediate feedback remains challenging, especially for continuous-time flow matching models. Most existing approaches solely learn from outcome rewards, struggling with the credit assignment problem. Alternative methods that attempt to learn a critic via direct regression on cumulative rewards often face training instabilities and model collapse in online settings. We present AC-Flow, a robust actor-critic framework that addresses these challenges through three key innovations: (1) reward shaping that provides well-normalized learning signals to enable stable intermediate value learning and gradient control, (2) a novel dual-stability mechanism that combines advantage clipping to prevent destructive policy updates with a warm-up phase that allows the critic to mature before influencing the actor, and (3) a scalable generalized critic weighting scheme that extends traditional reward-weighted methods while preserving model diversity through Wasserstein regularization. Through extensive experiments on Stable Diffusion 3, we demonstrate that AC-Flow achieves state-of-the-art performance in text-to-image alignment tasks and generalization to unseen human preference models. Our results demonstrate that even with a computationally efficient critic model, we can robustly fine-tune flow models without compromising generative quality, diversity, or stability.
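
The dual-stability mechanism reduces to two guards around an otherwise standard actor-critic update, sketched below under assumed interfaces; the clipping range and warm-up length are placeholders.

```python
# Sketch of the dual-stability idea: advantage clipping plus a critic warm-up
# phase before the actor is updated (illustrative; hyperparameters are placeholders).
import torch

def ac_flow_step(actor_loss_fn, critic_loss_fn, adv, step,
                 warmup_steps=1000, clip=2.0):
    critic_loss = critic_loss_fn()                # the critic always trains
    if step < warmup_steps:
        return critic_loss                        # let the critic mature first
    adv_clipped = torch.clamp(adv, -clip, clip)   # prevent destructive actor updates
    return critic_loss + actor_loss_fn(adv_clipped)
```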

[405] R2L: Reliable Reinforcement Learning: Guaranteed Return & Reliable Policies in Reinforcement Learning

Nadir Farhi

Main category: cs.LG

TL;DR: This paper proposes a reliable RL formulation that maximizes the probability of cumulative return exceeding a threshold, enabling use of standard RL algorithms for risk-sensitive applications.

DetailsMotivation: Many real-world applications like routing and resource allocation require strategies that ensure guaranteed probability of success, not just high average performance, due to uncertainty and risk considerations.

Method: Reformulate the reliable RL problem via state-augmented representation into a standard RL problem, allowing use of existing algorithms like Q-learning and Dueling Double DQN without new frameworks.

Result: Numerical experiments on reliable routing show the approach produces policies that effectively balance efficiency and reliability, with theoretical equivalence established between formulations.

Conclusion: The reliable RL framework enables practical risk-sensitive decision-making in stochastic and safety-critical environments using standard RL methods.

Abstract: In this work, we address the problem of determining reliable policies in reinforcement learning (RL), with a focus on optimization under uncertainty and the need for performance guarantees. While classical RL algorithms aim at maximizing the expected return, many real-world applications - such as routing, resource allocation, or sequential decision-making under risk - require strategies that ensure not only high average performance but also a guaranteed probability of success. To this end, we propose a novel formulation in which the objective is to maximize the probability that the cumulative return exceeds a prescribed threshold. We demonstrate that this reliable RL problem can be reformulated, via a state-augmented representation, into a standard RL problem, thereby allowing the use of existing RL and deep RL algorithms without the need for entirely new algorithmic frameworks. Theoretical results establish the equivalence of the two formulations and show that reliable strategies can be derived by appropriately adapting well-known methods such as Q-learning or Dueling Double DQN. To illustrate the practical relevance of the approach, we consider the problem of reliable routing, where the goal is not to minimize the expected travel time but rather to maximize the probability of reaching the destination within a given time budget. Numerical experiments confirm that the proposed formulation leads to policies that effectively balance efficiency and reliability, highlighting the potential of reliable RL for applications in stochastic and safety-critical environments.
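
The state-augmented reformulation is natural to express as an environment wrapper: carry the remaining budget to the threshold in the state and pay a terminal success indicator as the only reward, so that maximizing expected return equals maximizing the probability of exceeding the threshold. The gym-style interface below is a minimal sketch with illustrative names.

```python
# Minimal sketch of the state augmentation that turns reliable RL into standard RL.
# Assumption: env follows the usual reset()/step() interface.
class ReliableWrapper:
    def __init__(self, env, threshold):
        self.env, self.threshold = env, threshold

    def reset(self):
        self.cum_return = 0.0
        return (self.env.reset(), self.threshold)

    def step(self, action):
        obs, r, done, info = self.env.step(action)
        self.cum_return += r
        # Augmented state carries the remaining budget to the threshold.
        aug_obs = (obs, self.threshold - self.cum_return)
        # Reward is the success indicator, paid only at termination, so the
        # expected return equals the probability of exceeding the threshold.
        aug_r = float(self.cum_return >= self.threshold) if done else 0.0
        return aug_obs, aug_r, done, info
```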

[406] Batch Distillation Data for Developing Machine Learning Anomaly Detection Methods

Justus Arweiler, Indra Jungjohann, Aparna Muraleedharan, Heike Leitte, Jakob Burger, Kerstin Münnemann, Fabian Jirasek, Hans Hasse

Main category: cs.LG

TL;DR: Created an extensive experimental database for machine learning-based anomaly detection in chemical processes, including fault-free and anomaly-induced experiments with sensor data, NMR spectroscopy, video/audio recordings, and expert annotations.

DetailsMotivation: Address the lack of openly available experimental data that hinders the development of ML-based anomaly detection methods in chemical processes.

Method: Set up a laboratory-scale batch distillation plant to conduct 119 experiments across various operating conditions and mixtures, intentionally inducing anomalies and pairing them with fault-free experiments. Collected time-series sensor data, NMR spectroscopy concentration profiles, video/audio recordings, and developed an ontology for anomaly annotations.

Result: Generated a comprehensive database with extensive metadata, expert annotations, and measurement uncertainty estimates, made freely available via doi.org/10.5281/zenodo.17395544.

Conclusion: This database enables the development of advanced ML-based anomaly detection methods, including interpretable and explainable approaches, and supports methods for anomaly mitigation in chemical processes.

Abstract: Machine learning (ML) holds great potential to advance anomaly detection (AD) in chemical processes. However, the development of ML-based methods is hindered by the lack of openly available experimental data. To address this gap, we have set up a laboratory-scale batch distillation plant and operated it to generate an extensive experimental database, covering fault-free experiments and experiments in which anomalies were intentionally induced, for training advanced ML-based AD methods. In total, 119 experiments were conducted across a wide range of operating conditions and mixtures. Most experiments containing anomalies were paired with a corresponding fault-free one. The database that we provide here includes time-series data from numerous sensors and actuators, along with estimates of measurement uncertainty. In addition, unconventional data sources – such as concentration profiles obtained via online benchtop NMR spectroscopy and video and audio recordings – are provided. Extensive metadata and expert annotations of all experiments are included. The anomaly annotations are based on an ontology developed in this work. The data are organized in a structured database and made freely available via doi.org/10.5281/zenodo.17395544. This new database paves the way for the development of advanced ML-based AD methods. As it includes information on the causes of anomalies, it further enables the development of interpretable and explainable ML approaches, as well as methods for anomaly mitigation.

[407] MEG-GPT: A transformer-based foundation model for magnetoencephalography data

Rukuang Huang, Sungjun Cho, Chetan Gohil, Oiwi Parker Jones, Mark Woolrich

Main category: cs.LG

TL;DR: MEG-GPT is a transformer-based foundation model for magnetoencephalography (MEG) data that uses time-attention and next time-point prediction with a novel data-driven tokenizer, enabling realistic brain dynamics generation and improved decoding performance.

DetailsMotivation: Traditional methods fail to capture complex spatiotemporal patterns in large-scale brain dynamics from modalities like MEG, while recent deep learning advances in foundation models have shown success in other domains like language and vision.

Method: Developed MEG-GPT transformer model with time-attention and next time-point prediction, plus a novel data-driven tokenizer for continuous MEG data that preserves temporal resolution without lossy transformations. Trained on tokenized brain region time-courses from large MEG dataset (N=612).

Result: Model generates data with realistic spatio-spectral properties including transient events and population variability. Improves downstream decoding tasks with better zero-shot generalization across sessions (accuracy 0.54→0.59) and subjects (accuracy 0.41→0.49) compared to baselines. Efficiently fine-tunes on smaller labeled datasets.

Conclusion: Establishes a powerful foundation model for electrophysiological data, paving the way for applications in computational neuroscience and neural decoding.

Abstract: Modelling the complex spatiotemporal patterns of large-scale brain dynamics is crucial for neuroscience, but traditional methods fail to capture the rich structure in modalities such as magnetoencephalography (MEG). Recent advances in deep learning have enabled significant progress in other domains, such as language and vision, by using foundation models at scale. Here, we introduce MEG-GPT, a transformer based foundation model that uses time-attention and next time-point prediction. To facilitate this, we also introduce a novel data-driven tokeniser for continuous MEG data, which preserves the high temporal resolution of continuous MEG signals without lossy transformations. We trained MEG-GPT on tokenised brain region time-courses extracted from a large-scale MEG dataset (N=612, eyes-closed rest, Cam-CAN data), and show that the learnt model can generate data with realistic spatio-spectral properties, including transient events and population variability. Critically, it performs well in downstream decoding tasks, showing improved zero-shot generalisation across sessions (improving accuracy from 0.54 to 0.59) and subjects (improving accuracy from 0.41 to 0.49) compared to baseline methods. Furthermore, we show the model can be efficiently fine-tuned on a smaller labelled dataset to boost performance in cross-subject decoding scenarios. This work establishes a powerful foundation model for electrophysiological data, paving the way for applications in computational neuroscience and neural decoding.

[408] Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang, Andrew Estornell, David D. Baek, Bo Li, Xiaojun Xu

Main category: cs.LG

TL;DR: ADA is an inference-time defense method that reintroduces alignment tokens mid-generation to enable LLMs to refuse harmful content at any point, achieving near-100% refusal rates against adversarial attacks while preserving utility.

DetailsMotivation: LLMs exhibit shallow alignment where they refuse harmful queries initially but this protection collapses once generation is underway, creating safety vulnerabilities that need to be addressed at arbitrary generation depths.

Method: Any-Depth Alignment (ADA) reintroduces assistant header tokens mid-stream during generation, leveraging the model’s strong alignment priors concentrated in these tokens to reassess harmfulness and recover refusals at any point.

Result: Across multiple model families, ADA achieves near-100% refusal rate against adversarial prefill attacks, reduces success rate of prominent adversarial attacks to below 3%, and maintains resilience even after instruction tuning while preserving utility on benign tasks.

Conclusion: ADA provides an effective inference-time defense that unlocks LLMs’ innate shallow alignment to ensure safety at arbitrary generation depths with negligible overhead and without requiring model parameter changes.

Abstract: Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through the adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built based on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model’s strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model’s parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning (benign or adversarial).
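
One plausible way to operationalize the mid-stream reintroduction is sketched below: at fixed intervals, the assistant-header token ids are appended to the context so the model re-enters its aligned start-of-turn mode before continuing. The step function, header ids, and schedule are hypothetical placeholders; the paper's exact procedure may differ.

```python
# Toy sketch of re-introducing assistant-header tokens mid-generation.
# Assumptions: `step` is a hypothetical next-token function over token ids;
# `header_ids` is the model's chat-template assistant header (placeholder).
def generate_with_ada(step, prompt_ids, header_ids,
                      check_every=64, max_new=512, eos=2):
    ids = list(prompt_ids)
    for i in range(max_new):
        if i > 0 and i % check_every == 0:
            # Mid-stream alignment checkpoint: re-introduce the assistant header
            # so the model reassesses harmfulness before continuing.
            ids += list(header_ids)
        tok = step(ids)
        if tok == eos:
            break
        ids.append(tok)
    return ids
```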

[409] Provably Optimal Reinforcement Learning under Safety Filtering

Donggeon David Oh, Duy P. Nguyen, Haimin Hu, Jaime F. Fisac

Main category: cs.LG

TL;DR: Safety filters in RL don’t inherently sacrifice performance - with sufficiently permissive filters, asymptotic performance is maintained while ensuring categorical safety.

DetailsMotivation: Address the perceived safety-performance tradeoff in RL safety filters and prove that safety enforcement doesn't degrade asymptotic performance when using permissive filters.

Method: Formalize RL safety with Safety-Critical MDP (SC-MDP) and define filtered MDP where safety filter is part of environment. Prove theoretical guarantees for safety and performance.

Result: Zero violations during training on Safety Gymnasium tasks, with final performance matching or exceeding unfiltered baselines.

Conclusion: Safety filters can provide categorical safety without performance degradation when sufficiently permissive, enabling safe RL training and deployment.

Abstract: Recent advances in reinforcement learning (RL) enable its use on increasingly complex tasks, but the lack of formal safety guarantees still limits its application in safety-critical settings. A common practical approach is to augment the RL policy with a safety filter that overrides unsafe actions to prevent failures during both training and deployment. However, safety filtering is often perceived as sacrificing performance and hindering the learning process. We show that this perceived safety-performance tradeoff is not inherent and prove, for the first time, that enforcing safety with a sufficiently permissive safety filter does not degrade asymptotic performance. We formalize RL safety with a safety-critical Markov decision process (SC-MDP), which requires categorical, rather than high-probability, avoidance of catastrophic failure states. Additionally, we define an associated filtered MDP in which all actions result in safe effects, thanks to a safety filter that is considered to be a part of the environment. Our main theorem establishes that (i) learning in the filtered MDP is safe categorically, (ii) standard RL convergence carries over to the filtered MDP, and (iii) any policy that is optimal in the filtered MDP-when executed through the same filter-achieves the same asymptotic return as the best safe policy in the SC-MDP, yielding a complete separation between safety enforcement and performance optimization. We validate the theory on Safety Gymnasium with representative tasks and constraints, observing zero violations during training and final performance matching or exceeding unfiltered baselines. Together, these results shed light on a long-standing question in safety-filtered learning and provide a simple, principled recipe for safe RL: train and deploy RL policies with the most permissive safety filter that is available.
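
The filtered MDP is easy to picture as a wrapper in which the safety filter is part of the environment and overrides unsafe actions before they reach the dynamics; filter_fn below is a placeholder for whatever filter is available, and the interface is a minimal sketch.

```python
# Minimal sketch of a filtered MDP: the safety filter belongs to the environment
# and maps any proposed action to a safe one (filter_fn is a placeholder).
class FilteredEnv:
    def __init__(self, env, filter_fn):
        self.env, self.filter_fn = env, filter_fn

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, action):
        safe_action = self.filter_fn(self.state, action)  # override if unsafe
        self.state, r, done, info = self.env.step(safe_action)
        return self.state, r, done, info
```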

[410] Enhancing mortality prediction in cardiac arrest ICU patients through meta-modeling of structured clinical data from MIMIC-IV

Nursultan Mamatov, Philipp Kellmeyer

Main category: cs.LG

TL;DR: Machine learning models combining structured clinical data and unstructured text from MIMIC-IV database significantly improve in-hospital mortality prediction in ICUs, with AUC increasing from 0.753 to 0.918 (22% relative improvement).

DetailsMotivation: Accurate early prediction of in-hospital mortality in ICUs is essential for timely clinical intervention and efficient resource allocation.

Method: Used LASSO and XGBoost for feature selection, followed by multivariate logistic regression on top features. Incorporated textual features using TF-IDF and BERT embeddings from discharge summaries and radiology reports.

Result: Final logistic regression model combining structured and textual input achieved AUC of 0.918 vs 0.753 with structured data alone (22% relative improvement). Decision curve analysis showed superior standardized net benefit across threshold probabilities 0.2-0.8.

Conclusion: Unstructured clinical notes provide significant added prognostic value and should be integrated into interpretable feature-driven risk prediction models for ICU patients.

Abstract: Accurate early prediction of in-hospital mortality in intensive care units (ICUs) is essential for timely clinical intervention and efficient resource allocation. This study develops and evaluates machine learning models that integrate both structured clinical data and unstructured textual information, specifically discharge summaries and radiology reports, from the MIMIC-IV database. We used LASSO and XGBoost for feature selection, followed by a multivariate logistic regression trained on the top features identified by both models. Incorporating textual features using TF-IDF and BERT embeddings significantly improved predictive performance. The final logistic regression model, which combined structured and textual input, achieved an AUC of 0.918, compared to 0.753 when using structured data alone, a relative improvement of 22%. The decision curve analysis demonstrated a superior standardized net benefit across a wide range of threshold probabilities (0.2-0.8), confirming the clinical utility of the model. These results underscore the added prognostic value of unstructured clinical notes and support their integration into interpretable feature-driven risk prediction models for ICU patients.
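
A skeletal version of the fused structured-plus-text model can be assembled in scikit-learn, as below; the feature-selection stage is omitted, and the column names and hyperparameters are placeholders rather than the study's configuration (MIMIC-IV access is credentialed).

```python
# Sketch of a structured + text fusion pipeline (illustrative; column names,
# feature list, and hyperparameters are hypothetical placeholders).
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

structured_cols = ["age", "heart_rate", "lactate"]   # hypothetical features
pre = ColumnTransformer([
    ("num", StandardScaler(), structured_cols),
    ("text", TfidfVectorizer(max_features=5000), "discharge_summary"),
])
model = Pipeline([
    ("features", pre),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Usage: model.fit(df_train, y_train)
#        roc_auc_score(y_val, model.predict_proba(df_val)[:, 1])
```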

[411] Latent Discrete Diffusion Models

Dario Shariatian, Alain Durmus, Stefano Peluchetti

Main category: cs.LG

TL;DR: Latent Discrete Diffusion Models (LDDMs) combine masked discrete diffusion over tokens with continuous diffusion over latent embeddings to improve joint structure and generation quality in few-step generation.

DetailsMotivation: To address limitations of masked denoisers where reverse transitions factorize across positions, weakening joint structure and degrading quality in few-step generation.

Method: Propose LDDMs with two variants: FUJI-LDDMs (fully joint denoising of tokens and latents) and SEQ-LDDMs (sequential resolution of latent then discrete chain conditionally on it), using ELBO-style objectives and design choices for informative latents.

Result: LDDMs yield improvements on unconditional generation metrics compared to state-of-the-art masked discrete diffusion baselines, and are effective at lower sampling budgets where unmasking many tokens per step is desirable.

Conclusion: LDDMs successfully address the factorization limitation of masked denoisers by incorporating latent embeddings that provide cross-token dependencies and softer signals for better generation quality.

Abstract: We study discrete diffusion for language and other categorical data and focus on a common limitation of masked denoisers: reverse transitions typically factorize across positions, which can weaken joint structure and degrade quality in few-step generation. We propose \emph{Latent Discrete Diffusion Models} (LDDMs), which couple a masked discrete diffusion over tokens with a continuous diffusion over latent embeddings. The latent channel provides a softer signal and carries cross-token dependencies that help resolve ambiguities. We present two instantiations: (i) FUJI-LDDMs, which perform fully joint denoising of tokens and latents, and (ii) SEQ-LDDMs, which sequentially resolve the latent chain and then the discrete chain conditioned on it. For both variants we derive ELBO-style objectives and discuss design choices for learning latents that are informative yet amenable to diffusion modeling. In experiments, LDDMs yield improvements on unconditional generation metrics compared to state-of-the-art masked discrete diffusion baselines, and are effective at lower sampling budgets, where unmasking many tokens per step is desirable.

[412] Gradient Variance Reveals Failure Modes in Flow-Based Generative Models

Teodora Reu, Sixtine Dromigny, Michael Bronstein, Francisco Vargas

Main category: cs.LG

TL;DR: Rectified Flows can fail by memorizing training pairings instead of learning meaningful transport, especially under deterministic training. The straight-path objective converges to memorizing vector fields even when interpolants intersect.

DetailsMotivation: To understand fundamental failure modes in Rectified Flows where the straight-path objective leads to memorization of arbitrary training pairings rather than learning proper transport mappings.

Method: Analyzed Gaussian-to-Gaussian transport using gradient variance analysis across stochastic and deterministic regimes. Proved existence of memorizing vector fields and convergence to them. Validated on CelebA dataset with deterministic vs noisy interpolants.

Result: Deterministic training causes memorization of exact training pairings at inference. Small noise injection restores generalization. The straight-path objective converges to ill-defined memorizing vector fields even when interpolating lines intersect.

Conclusion: Rectified Flows’ straight-path objective has inherent memorization risks under deterministic training. Stochasticity is crucial for proper generalization, as deterministic interpolants reproduce exact training pairings rather than meaningful transport.

Abstract: Rectified Flows learn ODE vector fields whose trajectories are straight between source and target distributions, enabling near one-step inference. We show that this straight-path objective conceals fundamental failure modes: under deterministic training, low gradient variance drives memorization of arbitrary training pairings, even when interpolant lines between pairs intersect. To analyze this mechanism, we study Gaussian-to-Gaussian transport and use the loss gradient variance across stochastic and deterministic regimes to characterize which vector fields optimization favors in each setting. We then show that, in a setting where all interpolating lines intersect, applying Rectified Flow yields the same specific pairings at inference as during training. More generally, we prove that a memorizing vector field exists even when training interpolants intersect, and that optimizing the straight-path objective converges to this ill-defined field. At inference, deterministic integration reproduces the exact training pairings. We validate our findings empirically on the CelebA dataset, confirming that deterministic interpolants induce memorization, while the injection of small noise restores generalization.
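The straight-path objective at the center of this analysis is compact enough to state in code. The toy sketch below trains a small network to regress the constant velocity x1 - x0 along interpolants x_t = (1-t)x0 + t*x1 under a fixed pairing; setting sigma > 0 adds the small interpolant noise the paper finds restores generalization. The 1-D data and architecture are illustrative assumptions, not the paper's experimental setup.

```python
# Toy sketch of the straight-path (rectified flow) objective on fixed 1-D
# pairings. The network regresses the constant velocity x1 - x0 along the
# interpolant x_t = (1-t) x0 + t x1. With sigma = 0 the interpolants are
# deterministic (the regime the paper shows leads to memorization);
# sigma > 0 injects the small noise that restores generalization.
import torch
import torch.nn as nn

torch.manual_seed(0)
x0 = torch.randn(256, 1)            # source samples
x1 = torch.randn(256, 1) + 3.0      # target samples, with a FIXED pairing to x0

net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
sigma = 0.0                          # 0.0 = deterministic interpolants

for step in range(2000):
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1 + sigma * torch.randn_like(x0)
    v_target = x1 - x0               # straight-path velocity target
    loss = ((net(torch.cat([xt, t], dim=1)) - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final loss: {loss.item():.4f}")
```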

[413] Efficient Long-context Language Model Training by Core Attention Disaggregation

Yonghao Zhuang, Junda Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric Xing, Hao Zhang

Main category: cs.LG

TL;DR: CAD is a technique that improves long-context LLM training by separating core attention computation from other layers and executing it on dedicated devices, addressing load imbalance issues caused by attention’s quadratic complexity.

DetailsMotivation: Existing systems suffer from load imbalance and stragglers in data/pipeline parallel groups because core attention's quadratic compute growth at long contexts outpaces the near-linear growth of other components.

Method: CAD decouples core attention computation and dispatches token-level tasks to dedicated attention servers, using dynamic rebatching to equalize compute while maintaining kernel efficiency. Implemented in DistCA system with ping-pong execution for communication-computation overlap.

Result: On 512 H200 GPUs with context lengths up to 512k tokens, DistCA improves training throughput by up to 1.35x, eliminates stragglers, and achieves near-perfect compute and memory balance.

Conclusion: CAD effectively addresses the fundamental load imbalance problem in long-context LLM training by disaggregating core attention computation, enabling more efficient scaling to very long sequences.

Abstract: We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.
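At its core, CAD's dynamic rebatching is a makespan-balancing problem: token-shard attention tasks with roughly quadratic cost must be packed onto attention servers so that no server straggles. The sketch below uses a greedy longest-processing-time heuristic and a quadratic cost model as stand-ins; both are illustrative assumptions, not DistCA's actual scheduler.

```python
# Schematic of the load-balancing problem that CAD's rebatching solves:
# assign token-shard attention tasks (cost ~ quadratic in shard length) to
# a pool of attention servers so per-server compute is roughly equal. The
# greedy longest-processing-time heuristic below is an illustrative
# assumption, not DistCA's actual algorithm.
import heapq

def rebatch(shard_lengths, n_servers):
    """Greedy LPT assignment of attention shards to servers."""
    tasks = sorted(shard_lengths, reverse=True)
    heap = [(0.0, i, []) for i in range(n_servers)]  # (load, id, shards)
    heapq.heapify(heap)
    for length in tasks:
        load, sid, shards = heapq.heappop(heap)      # least-loaded server
        shards.append(length)
        heapq.heappush(heap, (load + length ** 2, sid, shards))  # cost ~ L^2
    return sorted(heap, key=lambda s: s[1])

servers = rebatch([512, 131072, 2048, 65536, 8192, 262144, 1024], n_servers=4)
for load, sid, shards in servers:
    print(f"server {sid}: load={load:.3g}, shards={shards}")
```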

[414] HyperDiffusionFields (HyDiF): Diffusion-Guided Hypernetworks for Learning Implicit Molecular Neural Fields

Sudarshan Babu, Phillip Lo, Xiao Zhang, Aadi Srivastava, Ali Davariashtiyani, Jason Perera, Michael Maire, Aly A. Khan

Main category: cs.LG

TL;DR: HyperDiffusionFields (HyDiF) introduces a field-based approach for 3D molecular modeling using Molecular Directional Fields (MDFs) represented by neural implicit fields, with a hypernetwork-based diffusion model for generation and property prediction.

DetailsMotivation: To overcome limitations of discrete atomic representations (coordinates/graphs) by modeling molecules as continuous fields, enabling better spatial feature extraction and scaling to larger biomolecules.

Method: Uses Molecular Directional Fields (MDFs) as vector fields mapping space to nearest atom directions, represented by Molecular Neural Fields (MNFs). A shared hypernetwork conditioned on molecules generates MNF weights, trained as a denoising diffusion model for generative capabilities, with masked diffusion for structure-conditioned generation.

Result: The framework supports molecular generation, structure-conditioned tasks like molecular inpainting, and spatially fine-grained feature extraction for property prediction. It scales effectively to larger biomolecules.

Conclusion: HyDiF demonstrates a promising field-based approach for molecular modeling that enables continuous representations, generative capabilities, and improved spatial feature extraction compared to traditional discrete methods.

Abstract: We introduce HyperDiffusionFields (HyDiF), a framework that models 3D molecular conformers as continuous fields rather than discrete atomic coordinates or graphs. At the core of our approach is the Molecular Directional Field (MDF), a vector field that maps any point in space to the direction of the nearest atom of a particular type. We represent MDFs using molecule-specific neural implicit fields, which we call Molecular Neural Fields (MNFs). To enable learning across molecules and facilitate generalization, we adopt an approach where a shared hypernetwork, conditioned on a molecule, generates the weights of the given molecule’s MNF. To endow the model with generative capabilities, we train the hypernetwork as a denoising diffusion model, enabling sampling in the function space of molecular fields. Our design naturally extends to a masked diffusion mechanism to support structure-conditioned generation tasks, such as molecular inpainting, by selectively noising regions of the field. Beyond generation, the localized and continuous nature of MDFs enables spatially fine-grained feature extraction for molecular property prediction, something not easily achievable with graph or point cloud based methods. Furthermore, we demonstrate that our approach scales to larger biomolecules, illustrating a promising direction for field-based molecular modeling.
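The MDF itself is a simple object: it maps a query point to the unit direction of the nearest atom of a given element type. A minimal numpy sketch follows; the coordinates and types are made-up placeholders, and the neural-field, hypernetwork, and diffusion machinery built on top of this field are not reproduced.

```python
# Minimal sketch of the Molecular Directional Field described above: at a
# query point, return the unit vector toward the nearest atom of a given
# element type. Coordinates and types below are hypothetical, not from
# the paper.
import numpy as np

def mdf(query, atom_coords, atom_types, element):
    """Direction from `query` to the nearest atom of `element`."""
    coords = atom_coords[atom_types == element]
    diffs = coords - query                       # (n_atoms_of_type, 3)
    nearest = diffs[np.argmin(np.linalg.norm(diffs, axis=1))]
    return nearest / (np.linalg.norm(nearest) + 1e-12)

atom_coords = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [0.0, 1.5, 0.7]])
atom_types = np.array(["C", "O", "C"])
print(mdf(np.array([0.5, 0.5, 0.5]), atom_coords, atom_types, "C"))
```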

[415] Rethinking PCA Through Duality

Jan Quan, Johan Suykens, Panagiotis Patrinos

Main category: cs.LG

TL;DR: The paper revisits PCA fundamentals using the difference-of-convex framework, showing connections between self-attention and kernel PCA, and presents new theoretical insights and algorithms.

DetailsMotivation: Motivated by the connection between self-attention and kernel PCA, the authors aim to revisit PCA fundamentals and provide new theoretical insights using the difference-of-convex framework.

Method: Using the difference-of-convex (DC) framework, the authors present novel formulations for PCA, show kernelizability and out-of-sample applicability, and develop new algorithms including a kernelizable dual formulation for robust PCA with l1 deviation.

Result: The paper demonstrates that simultaneous iteration (connected to the classical QR algorithm) is an instance of DCA, provides optimization perspective on this method, and empirically compares new PCA algorithms with state-of-the-art methods.

Conclusion: The DC framework provides new theoretical insights into PCA fundamentals, connects classical methods like QR algorithm to optimization, and enables development of novel kernelizable and robust PCA formulations.

Abstract: Motivated by the recently shown connection between self-attention and (kernel) principal component analysis (PCA), we revisit the fundamentals of PCA. Using the difference-of-convex (DC) framework, we present several novel formulations and provide new theoretical insights. In particular, we show the kernelizability and out-of-sample applicability for a PCA-like family of problems. Moreover, we uncover that simultaneous iteration, which is connected to the classical QR algorithm, is an instance of the difference-of-convex algorithm (DCA), offering an optimization perspective on this longstanding method. Further, we describe new algorithms for PCA and empirically compare them with state-of-the-art methods. Lastly, we introduce a kernelizable dual formulation for a robust variant of PCA that minimizes the $l_1$ deviation of the reconstruction errors.
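For readers unfamiliar with simultaneous iteration, the classical algorithm that the paper reinterprets as an instance of DCA is short: repeatedly multiply an orthonormal block by the covariance matrix and re-orthonormalize via QR. The sketch below is the textbook method, not any of the paper's new formulations.

```python
# Classical simultaneous (subspace) iteration for the top-k principal
# components, the longstanding method the paper connects to DCA. This is
# the textbook algorithm, not the paper's new PCA formulations.
import numpy as np

def simultaneous_iteration(X, k, n_iter=200, seed=0):
    Xc = X - X.mean(axis=0)                 # center the data
    C = Xc.T @ Xc / (len(X) - 1)            # sample covariance matrix
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.standard_normal((C.shape[0], k)))[0]
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(C @ Q)          # re-orthonormalize each step
    return Q                                # columns span the top-k PC subspace

X = np.random.default_rng(1).standard_normal((500, 8)) @ np.diag([5, 3, 1, 1, 1, 1, 1, 1])
print(simultaneous_iteration(X, k=2).shape)
```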

[416] Nash Policy Gradient: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

Eason Yu, Tzu Hao Liu, Yunke Wang, Clément L. Canonne, Nguyen H. Tran, Chang Xu

Main category: cs.LG

TL;DR: The paper proposes Nash Policy Gradient (NashPG), a method that converges to exact Nash equilibria in imperfect-information games by fixing regularization strength and iteratively refining reference policies, avoiding the instability of shrinking regularization approaches.

DetailsMotivation: Existing regularization-based methods for finding Nash equilibria require shrinking regularization strength to zero, which leads to unstable learning in practice. The authors aim to develop a more robust approach that maintains fixed regularization while still achieving convergence.

Method: NashPG fixes regularization strength at a large value for robustness and achieves convergence by iteratively refining the reference policy. It preserves the generalizability of policy gradient methods while relying only on current and reference policies.

Result: Theoretical analysis shows strictly monotonic improvement and convergence to exact Nash equilibria in two-player zero-sum games without uniqueness assumptions. Empirically, NashPG achieves comparable or lower exploitability than prior model-free methods and scales to large domains like Battleship and No-Limit Texas Hold’em with higher Elo ratings.

Conclusion: NashPG provides a robust alternative to shrinking regularization methods, enabling stable convergence to exact Nash equilibria while maintaining practical scalability and performance across various game domains.

Abstract: Finding Nash equilibria in imperfect-information games remains a central challenge in multi-agent reinforcement learning. While regularization-based methods have recently achieved last-iteration convergence to a regularized equilibrium, they require the regularization strength to shrink toward zero to approximate a Nash equilibrium, often leading to unstable learning in practice. Instead, we fix the regularization strength at a large value for robustness and achieve convergence by iteratively refining the reference policy. Our main theoretical result shows that this procedure guarantees strictly monotonic improvement and convergence to an exact Nash equilibrium in two-player zero-sum games, without requiring a uniqueness assumption. Building on this framework, we develop a practical algorithm, Nash Policy Gradient (NashPG), which preserves the generalizability of policy gradient methods while relying solely on the current and reference policies. Empirically, NashPG achieves comparable or lower exploitability than prior model-free methods on classic benchmark games and scales to large domains such as Battleship and No-Limit Texas Hold’em, where NashPG consistently attains higher Elo ratings.
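The core loop, a fixed regularization strength plus an iteratively refreshed reference policy, can be illustrated on a tiny matrix game. In the sketch below, each outer iteration approximately solves a KL-regularized version of rock-paper-scissors against the current reference and then adopts the solution as the new reference. The inner solver, learning rates, and iteration counts are illustrative assumptions, not NashPG's exact algorithm.

```python
# Sketch of the fixed-strength, refined-reference scheme: each outer
# iteration solves a KL-regularized game toward the current reference,
# then the reference is replaced by the solution. Shown on rock-paper-
# scissors with tabular softmax policies; all hyperparameters are
# illustrative assumptions.
import torch

A = torch.tensor([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])  # RPS payoff
tau = 1.0                                     # fixed regularization strength
logits = [torch.zeros(3, requires_grad=True) for _ in range(2)]
ref = [torch.full((3,), 1 / 3) for _ in range(2)]

for outer in range(20):                       # reference refinement loop
    opt = torch.optim.Adam(logits, lr=0.1)
    for inner in range(200):                  # solve the regularized game
        p = torch.softmax(logits[0], 0); q = torch.softmax(logits[1], 0)
        kl0 = (p * (p / ref[0]).log()).sum()
        kl1 = (q * (q / ref[1]).log()).sum()
        # player 0 maximizes p^T A q, player 1 minimizes it; both pay KL
        loss = -(p @ A @ q.detach()) + tau * kl0 \
               + (p.detach() @ A @ q) + tau * kl1
        opt.zero_grad(); loss.backward(); opt.step()
    ref = [torch.softmax(l.detach(), 0) for l in logits]

print(torch.softmax(logits[0], 0))            # stays near uniform, the RPS Nash
```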

[417] ActivationReasoning: Logical Reasoning in Latent Activation Spaces

Lukas Helff, Ruben Härle, Wolfgang Stammer, Felix Friedrich, Manuel Brack, Antonia Wüst, Hikaru Shindo, Patrick Schramowski, Kristian Kersting

Main category: cs.LG

TL;DR: ActivationReasoning (AR) embeds logical reasoning into LLM latent spaces using sparse autoencoders, enabling systematic reasoning and model control through three stages: finding latent concepts, activating propositions, and logical reasoning.

DetailsMotivation: LLMs generate fluent text but lack transparent internal reasoning and control mechanisms. Sparse autoencoders provide interpretable features but are fragile and passive, offering no systematic reasoning capabilities.

Method: Three-stage framework: (1) Find latent concept representations via SAEs, (2) Detect activating concepts and map to logical propositions at inference, (3) Apply logical rules to infer structures, compose concepts, and steer model behavior.

Result: AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones. Evaluated on multi-hop reasoning (PrOntoQA), abstraction (Rail2Country), natural language reasoning (ProverQA), and context-sensitive safety (BeaverTails).

Conclusion: Grounding logical structure in latent activations improves transparency, enables structured reasoning, reliable control, and alignment with desired behaviors, providing a path toward more reliable and auditable AI.

Abstract: Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and difficult to control. Sparse autoencoders (SAEs) make hidden activations more interpretable by exposing latent features that often align with human concepts. Yet, these features are fragile and passive, offering no mechanism for systematic reasoning or model control. To address this, we introduce ActivationReasoning (AR), a framework that embeds explicit logical reasoning into the latent space of LLMs. It proceeds in three stages: (1) Finding latent representations, first latent concept representations are identified (e.g., via SAEs) and organized into a dictionary; (2) Activating propositions, at inference time AR detects activating concepts and maps them to logical propositions; and (3) Logical reasoning, applying logical rules over these propositions to infer higher-order structures, compose new concepts, and steer model behavior. We evaluate AR on multi-hop reasoning (PrOntoQA), abstraction and robustness to indirect concept cues (Rail2Country), reasoning over natural and diverse language (ProverQA), and context-sensitive safety (BeaverTails). Across all tasks, AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones. These results demonstrate that grounding logical structure in latent activations not only improves transparency but also enables structured reasoning, reliable control, and alignment with desired behaviors, providing a path toward more reliable and auditable AI.
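The three stages map naturally onto a small amount of glue code: a dictionary from latent features to propositions, thresholded activations asserting facts, and forward chaining over rules. The sketch below uses made-up feature indices, thresholds, and rules purely for illustration; it is not AR's implementation.

```python
# Toy sketch of the three AR stages on top of hypothetical SAE features:
# a concept dictionary maps feature indices to propositions, activations
# above a threshold assert those propositions, and forward chaining over
# simple Horn rules infers higher-order ones. All indices, thresholds,
# and rules are made-up placeholders.
concept_dict = {101: "mentions_weapon", 205: "expresses_intent",
                310: "fictional_context"}
rules = [({"mentions_weapon", "expresses_intent"}, "potential_threat"),
         ({"potential_threat", "fictional_context"}, "safe_fiction")]

def reason(sae_activations, threshold=0.5):
    facts = {concept_dict[i] for i, a in sae_activations.items()
             if i in concept_dict and a > threshold}
    changed = True
    while changed:                      # forward chaining to a fixed point
        changed = False
        for body, head in rules:
            if body <= facts and head not in facts:
                facts.add(head)
                changed = True
    return facts

print(reason({101: 0.9, 205: 0.7, 310: 0.8}))
```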

[418] Ensemble based Closed-Loop Optimal Control using Physics-Informed Neural Networks

Jostein Barry-Straume, Adwait D. Verulkar, Arash Sarshar, Andrey A. Popov, Adrian Sandu

Main category: cs.LG

TL;DR: A multistage ensemble framework using physics-informed neural networks (PINNs) to solve the Hamilton-Jacobi-Bellman (HJB) equation for optimal control system design, enabling both singular and ensemble control policies without stabilizer terms.

DetailsMotivation: Traditional numerical solutions to the HJB equation are computationally intensive, and analytical solutions are often unavailable. PINNs offer an alternative approach to alleviate these difficulties in optimal control system design.

Method: Multistage ensemble framework using PINNs to learn the optimal cost-to-go and corresponding control signal through the HJB equation, without using stabilizer terms. Can produce either singular learned control signals or ensemble control signal policies.

Result: Successfully demonstrated closed-loop control of a steady-state time-invariant two-state continuous nonlinear system with infinite time horizon, handling noisy perturbed system states and varying initial conditions using both ensemble and singular control approaches.

Conclusion: The proposed multistage ensemble PINN framework provides an effective alternative to traditional HJB equation solving methods, enabling robust optimal control without stabilizer terms and handling various real-world challenges like noise and varying conditions.

Abstract: The objective of designing a control system is to steer a dynamical system with a control signal, guiding it to exhibit the desired behavior. The Hamilton-Jacobi-Bellman (HJB) partial differential equation offers a framework for optimal control system design. However, numerical solutions to this equation are computationally intensive, and analytical solutions are frequently unavailable. Knowledge-guided machine learning methodologies, such as physics-informed neural networks (PINNs), offer new alternative approaches that can alleviate the difficulties of solving the HJB equation numerically. This work presents a multistage ensemble framework to learn the optimal cost-to-go, and subsequently the corresponding optimal control signal, through the HJB equation. Prior PINN-based approaches rely on a stabilizer term when enforcing the HJB equation during training. Our framework does not use stabilizer terms and offers a means of controlling the nonlinear system, via either a singular learned control signal or an ensemble control signal policy. Success is demonstrated in closed-loop control, using both ensemble and singular control, of a steady-state time-invariant two-state continuous nonlinear system with an infinite time horizon, accounting for noisy, perturbed system states and varying initial conditions.
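The underlying PINN idea, penalizing the HJB residual at sampled states without a stabilizer term, can be sketched briefly. For an infinite-horizon problem the stationary HJB condition at the learned control reads 0 = l(x, u(x)) + grad V(x) . f(x, u(x)); the toy PyTorch code below enforces this on a linear system with quadratic cost. The dynamics, cost, and joint training of V and u are illustrative stand-ins, not the paper's two-state system or its multistage ensemble procedure.

```python
# Minimal sketch of a PINN-style HJB residual for an infinite-horizon
# problem: 0 = l(x, u) + dV/dx . f(x, u) at the learned control u(x).
# The dynamics, cost, and networks are illustrative stand-ins.
import torch
import torch.nn as nn

V = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))   # cost-to-go
u = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))   # control
opt = torch.optim.Adam(list(V.parameters()) + list(u.parameters()), lr=1e-3)

def f(x, a):          # toy control-affine dynamics: x' = A x + B a
    A = torch.tensor([[0., 1.], [-1., -0.5]])
    B = torch.tensor([[0.], [1.]])
    return x @ A.T + a @ B.T

for step in range(1000):
    x = torch.empty(256, 2).uniform_(-2, 2).requires_grad_(True)
    a = u(x)
    cost = (x ** 2).sum(dim=1, keepdim=True) + a ** 2           # l(x, u)
    dV = torch.autograd.grad(V(x).sum(), x, create_graph=True)[0]
    residual = cost + (dV * f(x, a)).sum(dim=1, keepdim=True)   # HJB residual
    loss = (residual ** 2).mean() + V(torch.zeros(1, 2)).pow(2).mean()  # pin V(0)=0
    opt.zero_grad(); loss.backward(); opt.step()

print(f"mean squared HJB residual: {loss.item():.4f}")
```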

[419] Joint Optimization of Cooperation Efficiency and Communication Covertness for Target Detection with AUVs

Xueyao Zhang, Bo Yang, Zhiwen Yu, Xuelin Cao, Wei Xiang, Bin Guo, Liang Wang, Billy Pik Lik Lau, George C. Alexandropoulos, Jun Luo, Mérouane Debbah, Zhu Han, Chau Yuen

Main category: cs.LG

TL;DR: This paper presents a hierarchical framework for underwater cooperative target detection using AUVs, balancing cooperation efficiency with communication covertness through joint trajectory and power control optimization.

DetailsMotivation: To address the critical trade-off between cooperation efficiency and communication covertness in underwater target detection using multiple autonomous underwater vehicles (AUVs).

Method: A hierarchical action management framework with macro-level strategic task allocation using proximal policy optimization (PPO) for agent selection, and micro-level decentralized decision-making using multi-agent PPO for trajectory and power control based on local observations.

Result: The framework enables adaptive covert cooperation while satisfying energy and mobility constraints, providing both theoretical insights and practical solutions for efficient and secure operation of multiple AUVs.

Conclusion: The proposed approach offers significant implications for underwater covert communication tasks by enabling efficient and secure cooperative target detection through hierarchical optimization of trajectory and power control.

Abstract: This paper investigates underwater cooperative target detection using autonomous underwater vehicles (AUVs), with a focus on the critical trade-off between cooperation efficiency and communication covertness. To tackle this challenge, we first formulate a joint trajectory and power control optimization problem, and then present an innovative hierarchical action management framework to solve it. According to the hierarchical formulation, at the macro level, the master AUV models the agent selection process as a Markov decision process and deploys the proximal policy optimization algorithm for strategic task allocation. At the micro level, each selected agent’s decentralized decision-making is modeled as a partially observable Markov decision process, and a multi-agent proximal policy optimization algorithm is used to dynamically adjust its trajectory and transmission power based on its local observations. Under the centralized training and decentralized execution paradigm, our target detection framework enables adaptive covert cooperation while satisfying both energy and mobility constraints. By comprehensively modeling the considered system, the involved signals and tasks, as well as energy consumption, theoretical insights and practical solutions for the efficient and secure operation of multiple AUVs are provided, offering significant implications for the execution of underwater covert communication tasks.

[420] Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations

Zhendong Mi, Qitao Tan, Grace Li Zhang, Zhaozhuo Xu, Geng Yuan, Shaoyi Huang

Main category: cs.LG

TL;DR: P-GAP is a zeroth-order optimization method for fine-tuning LLMs that reduces gradient estimation variance through projected gradient-aligned perturbations, achieving faster convergence and better performance with fewer resources.

DetailsMotivation: Existing zeroth-order optimization methods for LLM fine-tuning suffer from high variance in gradient estimation, leading to slow convergence and suboptimal performance on large-scale models.

Method: P-GAP estimates a low-dimensional gradient space and aligns perturbations in the projected gradients’ direction within this space, reducing the number of perturbed parameters and decreasing variance.

Result: P-GAP achieves up to 6% higher accuracy on classification tasks and 12% higher accuracy on generation tasks, with 81% fewer training iterations and 70% fewer GPU hours compared to baselines.

Conclusion: P-GAP enables fast, scalable, and resource-efficient zeroth-order optimization for LLM fine-tuning.

Abstract: Fine-tuning large language models (LLMs) using zeroth-order (ZO) optimization has emerged as a promising alternative to traditional gradient-based methods due to its reduced memory footprint requirement. However, existing ZO methods suffer from high variance in gradient estimation, leading to slow convergence and suboptimal performance on large-scale models. In this work, we propose P-GAP, a fast LLM fine-tuning approach through zeroth-order optimization with Projected Gradient-Aligned Perturbations. Specifically, we first estimate a low-dimensional gradient space and then align perturbations with the projected gradients’ direction within that space. This approach reduces the number of perturbed parameters and decreases gradient-estimation variance, thereby accelerating convergence in LLM fine-tuning. Experiments on LLMs show that P-GAP consistently surpasses the baselines, achieving up to 6% increase in accuracy on classification tasks and up to 12% higher accuracy on generation tasks, with up to 81% fewer training iterations and about 70% fewer GPU hours. These results demonstrate that P-GAP enables fast, scalable, and resource-efficient ZO LLM fine-tuning.
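The central trick, confining perturbations to a low-dimensional subspace so that fewer parameters are perturbed and estimator variance drops, is easy to demonstrate on a toy quadratic. In the sketch below a fixed random orthonormal basis stands in for P-GAP's estimated gradient subspace; the paper's alignment procedure is not reproduced, so treat this purely as an illustration of subspace-restricted zeroth-order estimation.

```python
# Sketch of zeroth-order optimization with perturbations confined to a
# low-dimensional subspace, the core idea behind P-GAP as summarized
# above. A fixed random orthonormal basis stands in for the estimated
# gradient subspace (an assumption); the optimum is placed inside the
# subspace so the toy problem is solvable there.
import numpy as np

rng = np.random.default_rng(0)
d, k, mu, lr, probes = 200, 20, 1e-3, 0.05, 10

P = np.linalg.qr(rng.standard_normal((d, k)))[0]   # subspace basis (d x k)
target = P @ rng.standard_normal(k)                # optimum inside the subspace

def loss(w):
    return 0.5 * np.sum((w - target) ** 2)

w = np.zeros(d)
for step in range(400):
    g_hat = np.zeros(d)
    for _ in range(probes):                        # average a few probes
        delta = P @ rng.standard_normal(k)         # perturb only in the subspace
        g_hat += (loss(w + mu * delta) - loss(w - mu * delta)) / (2 * mu) * delta
    w -= lr * g_hat / probes
print(f"final loss: {loss(w):.6f}")
```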

[421] ACTG-ARL: Differentially Private Conditional Text Generation with RL-Boosted Control

Yuzheng Hu, Ryan McKenna, Da Yu, Shanshan Wu, Han Zhao, Zheng Xu, Peter Kairouz

Main category: cs.LG

TL;DR: The paper introduces ACTG-ARL, a hierarchical framework for differentially private synthetic text generation that decomposes the task into feature learning and conditional text generation, achieving improved quality and control over generation.

DetailsMotivation: Prior work on DP synthetic text generation fails to preserve statistical attributes, suffers utility loss from DP noise, and lacks fine-grained control over generation, necessitating a better approach.

Method: Proposes ACTG (Attribute-Conditioned Text Generation) using a hierarchical framework with rich tabular features, DP tabular synthesizer, and DP fine-tuned conditional generator, plus Anchored RL for improved instruction-following.

Result: ACTG-ARL advances DP synthetic text quality by +20% MAUVE over prior work while maintaining strong privacy guarantees and improving control over conditional generation.

Conclusion: The hierarchical decomposition approach with ACTG-ARL effectively addresses limitations of prior DP text generation methods, achieving both high-quality synthetic text and fine-grained control under differential privacy constraints.

Abstract: Generating high-quality synthetic text under differential privacy (DP) is critical for training and evaluating language models without compromising user privacy. Prior work on synthesizing DP datasets often fails to preserve key statistical attributes, suffers utility loss from the noise required by DP, and lacks fine-grained control over generation. To address these challenges, we make two contributions. First, we introduce a hierarchical framework that decomposes DP synthetic text generation into two subtasks: feature learning and conditional text generation. This design explicitly incorporates learned features into the generation process and simplifies the end-to-end synthesis task. Through systematic ablations, we identify the most effective configuration: a rich tabular schema as feature, a DP tabular synthesizer, and a DP fine-tuned conditional generator, which we term ACTG (Attribute-Conditioned Text Generation). Second, we propose Anchored RL (ARL), a post-training method that improves the instruction-following ability of ACTG for conditional generation. ARL combines RL to boost control with an SFT anchor on best-of-$N$ data to prevent reward hacking. Together, these components form our end-to-end algorithm ACTG-ARL, which advances both the quality of DP synthetic text (+20% MAUVE over prior work) and the control of the conditional generator under strong privacy guarantees.

[422] Fostering the Ecosystem of AI for Social Impact Requires Expanding and Strengthening Evaluation Standards

Bryan Wilder, Angela Zhou

Main category: cs.LG

TL;DR: Current AI/ML review criteria overly prioritize projects that combine deployment with methodological novelty, creating incentives that undermine sustainable social impact research.

DetailsMotivation: To address the problematic incentive structure in AI/ML for social impact research, where review guidelines disproportionately reward projects that achieve both deployment and methodological innovation simultaneously.

Method: Proposes adopting two key approaches: expanding the conception of social impacts beyond just deployment, and implementing more rigorous evaluations of deployed systems’ actual impact.

Result: Identifies that current review practices create unsustainable incentives that may not align with partner needs or support a diverse research ecosystem.

Conclusion: Researchers and reviewers should embrace a broader definition of social impact contributions and conduct more thorough impact assessments to foster a sustainable social impact research ecosystem.

Abstract: There has been increasing research interest in AI/ML for social impact, and correspondingly more publication venues have refined review criteria for practice-driven AI/ML research. However, these review guidelines tend to most concretely recognize projects that simultaneously achieve deployment and novel ML methodological innovation. We argue that this introduces incentives for researchers that undermine the sustainability of a broader research ecosystem of social impact, which benefits from projects that make contributions on a single front (applied or methodological) and that may better meet project partner needs. Our position is that researchers and reviewers in machine learning for social impact must simultaneously adopt: 1) a more expansive conception of social impacts beyond deployment and 2) more rigorous evaluations of the impact of deployed systems.

[423] Learning with Dual-level Noisy Correspondence for Multi-modal Entity Alignment

Haobin Li, Yijie Lin, Peng Hu, Mouxing Yang, Xi Peng

Main category: cs.LG

TL;DR: The paper introduces RULE, a robust framework for multi-modal entity alignment that addresses Dual-level Noisy Correspondence (DNC) - misalignments in both intra-entity and inter-graph correspondences in multi-modal knowledge graphs.

DetailsMotivation: Existing MMEA methods assume faultless correspondences, but real-world MMKGs often have noisy correspondences due to expert annotation errors, creating a practical problem that needs robust solutions.

Method: RULE estimates correspondence reliability via a two-fold principle, mitigates intra-entity noise during attribute fusion, prevents overfitting to noisy inter-graph correspondences, and includes a correspondence reasoning module for uncovering attribute-attribute connections.

Result: Extensive experiments on five benchmarks show RULE effectively handles DNC and outperforms seven state-of-the-art methods.

Conclusion: RULE provides a robust solution for multi-modal entity alignment in noisy real-world scenarios by addressing dual-level correspondence noise through reliability estimation and correspondence reasoning.

Abstract: Multi-modal entity alignment (MMEA) aims to identify equivalent entities across heterogeneous multi-modal knowledge graphs (MMKGs), where each entity is described by attributes from various modalities. Existing methods typically assume that both intra-entity and inter-graph correspondences are faultless, which is often violated in real-world MMKGs due to the reliance on expert annotations. In this paper, we reveal and study a highly practical yet under-explored problem in MMEA, termed Dual-level Noisy Correspondence (DNC). DNC refers to misalignments in both intra-entity (entity-attribute) and inter-graph (entity-entity and attribute-attribute) correspondences. To address the DNC problem, we propose a robust MMEA framework termed RULE. RULE first estimates the reliability of both intra-entity and inter-graph correspondences via a dedicated two-fold principle. Leveraging the estimated reliabilities, RULE mitigates the negative impact of intra-entity noise during attribute fusion and prevents overfitting to noisy inter-graph correspondences during inter-graph discrepancy elimination. Beyond the training-time designs, RULE further incorporates a correspondence reasoning module that uncovers the underlying attribute-attribute connection across graphs, guaranteeing more accurate equivalent entity identification. Extensive experiments on five benchmarks verify the effectiveness of our method against the DNC compared with seven state-of-the-art methods. The code is available at \href{https://github.com/XLearning-SCU/RULE}{XLearning-SCU/RULE}

[424] Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

Song Bian, Tao Yu, Shivaram Venkataraman, Youngsuk Park

Main category: cs.LG

TL;DR: The paper introduces a conditional scaling law that incorporates architectural factors to optimize the trade-off between model accuracy and inference efficiency in large language models.

DetailsMotivation: As LLMs grow larger and more widely deployed, inference costs become a major concern, but the trade-off between accuracy and efficiency remains underexplored.

Method: Developed a conditional scaling law that augments Chinchilla framework with architectural information, and created a search framework to identify inference-efficient architectures. Trained over 200 models from 80M to 3B parameters with 8B to 100B training tokens.

Result: The conditional scaling law reliably predicts optimal architectural choices, with optimized models achieving up to 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2 under the same training budget.

Conclusion: Architectural optimization using the proposed conditional scaling law can significantly improve both accuracy and inference efficiency in LLMs, addressing the critical trade-off between model performance and deployment costs.

Abstract: Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how key architectural factors, namely hidden size, the allocation of parameters between MLP and attention (the mlp-to-attention ratio), and grouped-query attention (GQA), influence both inference cost and accuracy. We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.
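A conditional scaling law of this kind can be fit with ordinary nonlinear least squares once a functional form is chosen. The sketch below augments the Chinchilla form L = E + A N^(-alpha) + B D^(-beta) with a made-up architecture term in the mlp-to-attention ratio; the paper's actual conditional parameterization is not reproduced here, and the data is synthetic.

```python
# Sketch of fitting a Chinchilla-style law augmented with an architectural
# covariate (here, mlp-to-attention ratio r). The architecture term is a
# made-up placeholder, not the paper's parameterization; data is synthetic.
import numpy as np
from scipy.optimize import curve_fit

def cond_law(X, E, A, alpha, B, beta, c):
    N, D, r = X
    return E + A * N ** (-alpha) + B * D ** (-beta) + c * np.log(r) ** 2

rng = np.random.default_rng(0)
N = rng.uniform(8e7, 3e9, 200)          # parameter counts
D = rng.uniform(8e9, 1e11, 200)         # training tokens
r = rng.uniform(1.0, 8.0, 200)          # mlp-to-attention ratio
true = cond_law((N, D, r), 1.7, 400.0, 0.34, 1200.0, 0.28, 0.02)
L = true + rng.normal(0, 0.01, 200)     # observed losses

popt, _ = curve_fit(cond_law, (N, D, r), L,
                    p0=[2.0, 100.0, 0.3, 1000.0, 0.3, 0.0], maxfev=20000)
print(popt)                              # recovered (E, A, alpha, B, beta, c)
```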

[425] NTKMTL: Mitigating Task Imbalance in Multi-Task Learning from Neural Tangent Kernel Perspective

Xiaohan Qin, Xiaoxing Wang, Ning Liao, Junchi Yan

Main category: cs.LG

TL;DR: NTKMTL addresses task imbalance in multi-task learning by using Neural Tangent Kernel theory to analyze training dynamics and balance task convergence speeds through spectral analysis.

DetailsMotivation: Task imbalance remains a major challenge in multi-task learning, and accurately characterizing training dynamics and convergence speeds of multiple tasks is highly challenging.

Method: Proposes NTKMTL using Neural Tangent Kernel theory with an extended NTK matrix for MTL and spectral analysis to balance task convergence speeds. Also introduces NTKMTL-SR for efficient training via shared representation approximation.

Result: Extensive experiments show state-of-the-art performance across multiple benchmarks in both multi-task supervised learning and multi-task reinforcement learning.

Conclusion: The proposed NTKMTL methods effectively mitigate task imbalance in multi-task learning and achieve competitive performance with training efficiency.

Abstract: Multi-Task Learning (MTL) enables a single model to learn multiple tasks simultaneously, leveraging knowledge transfer among tasks for enhanced generalization, and has been widely applied across various domains. However, task imbalance remains a major challenge in MTL. Although balancing the convergence speeds of different tasks is an effective approach to address this issue, it is highly challenging to accurately characterize the training dynamics and convergence speeds of multiple tasks within the complex MTL system. To this end, we attempt to analyze the training dynamics in MTL by leveraging Neural Tangent Kernel (NTK) theory and propose a new MTL method, NTKMTL. Specifically, we introduce an extended NTK matrix for MTL and adopt spectral analysis to balance the convergence speeds of multiple tasks, thereby mitigating task imbalance. Based on the approximation via shared representation, we further propose NTKMTL-SR, achieving training efficiency while maintaining competitive performance. Extensive experiments demonstrate that our methods achieve state-of-the-art performance across a wide range of benchmarks, including both multi-task supervised learning and multi-task reinforcement learning. Source code is available at https://github.com/jianke0604/NTKMTL.

[426] From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation

Ziwei Huang, Ying Shu, Hao Fang, Quanyu Long, Wenya Wang, Qiushi Guo, Tiezheng Ge, Leilei Gan

Main category: cs.LG

TL;DR: Customized-GRPO addresses the fidelity-editability trade-off in subject-driven image generation by introducing Synergy-Aware Reward Shaping and Time-Aware Dynamic Weighting to overcome competitive degradation in naive GRPO applications.

DetailsMotivation: Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability), and naive application of GRPO leads to competitive degradation due to conflicting gradient signals and misalignment with diffusion process dynamics.

Method: Proposes Customized-GRPO with two innovations: Synergy-Aware Reward Shaping (SARS) that penalizes conflicted reward signals and amplifies synergistic ones, and Time-Aware Dynamic Weighting (TDW) that aligns optimization pressure with temporal dynamics by prioritizing prompt-following early and identity preservation later.

Result: Extensive experiments show the method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation and achieving superior balance between identity preservation and prompt adherence.

Conclusion: Customized-GRPO successfully addresses the fidelity-editability trade-off in subject-driven image generation, generating images that preserve key identity features while accurately adhering to complex textual prompts.

Abstract: Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model’s temporal dynamics by prioritizing prompt-following in the early steps and identity preservation in the later ones. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.

[427] Online Time Series Forecasting with Theoretical Guarantees

Zijian Li, Changze Zhou, Minghao Fu, Sanjay Manjunath, Fan Feng, Guangyi Chen, Yingyao Hu, Ruichu Cai, Kun Zhang

Main category: cs.LG

TL;DR: TOT framework for online time series forecasting with theoretical guarantees on handling unknown distribution shifts through latent variables.

DetailsMotivation: To address unknown distribution shifts in online time series forecasting where latent variables influence the mapping from historical to future observations.

Method: Propose TOT framework with theoretical analysis, use temporal decoder to match observed variable distribution, and employ two independent noise estimators for latent variable causal inference and observed variable mixing.

Result: Theoretical proofs show latent variables tighten Bayes risk, experiments on synthetic data support claims, and plug-in implementations on baselines show general improvement across benchmarks.

Conclusion: TOT framework provides effective automated online time series forecasting with theoretical guarantees and practical improvements in real-world applications.

Abstract: This paper is concerned with online time series forecasting, where unknown distribution shifts occur over time, i.e., latent variables influence the mapping from historical to future observations. To develop an automated way of online time series forecasting, we propose a Theoretical framework for Online Time-series forecasting (TOT in short) with theoretical guarantees. Specifically, we prove that supplying a forecaster with latent variables tightens the Bayes risk; this benefit endures under uncertainty in the latent-variable estimates and grows as the latent variables are identified more precisely. To better introduce latent variables into online forecasting algorithms, we further propose to identify latent variables with minimal adjacent observations. Based on these results, we devise a model-agnostic blueprint by employing a temporal decoder to match the distribution of observed variables and two independent noise estimators to model the causal inference of latent variables and mixing procedures of observed variables, respectively. Experimental results on synthetic data support our theoretical claims. Moreover, plug-in implementations built on several baselines yield general improvement across multiple benchmarks, highlighting the effectiveness in real-world applications.

[428] Physics-Informed Parametric Bandits for Beam Alignment in mmWave Communications

Hao Qin, Thang Duong, Ming Li, Chicheng Zhang

Main category: cs.LG

TL;DR: Proposes two physics-informed bandit algorithms (pretc and prgreedy) for mmWave beam alignment that exploit sparse multipath properties, outperforming existing methods in diverse channel environments.

DetailsMotivation: Traditional bandit algorithms require long convergence times in large beam spaces, and existing methods rely on unimodality/multimodality assumptions that often don't hold in practice, leading to suboptimal beam selection.

Method: Developed pretc and prgreedy algorithms that treat path parameters as black boxes and maintain optimal estimates based on sampled historical rewards. pretc uses random exploration then commits to optimal beam, while prgreedy performs online estimation and beam selection.

Result: Both algorithms outperform existing approaches across diverse channel environments using synthetic DeepMIMO and real-world DeepSense6G datasets, demonstrating generalizability and robustness.

Conclusion: The proposed physics-informed bandit algorithms effectively exploit sparse multipath properties of mmWave channels for beam alignment, providing superior performance without relying on restrictive structural assumptions.

Abstract: In millimeter wave (mmWave) communications, beam alignment and tracking are crucial to combat the significant path loss. As scanning the entire directional space is inefficient, designing an efficient and robust method to identify the optimal beam directions is essential. Since traditional bandit algorithms require a long time horizon to converge under large beam spaces, many existing works propose efficient bandit algorithms for beam alignment by relying on unimodality or multimodality assumptions on the reward function’s structure. However, such assumptions often do not hold (or cannot be strictly satisfied) in practice, which causes such algorithms to converge to choosing suboptimal beams. In this work, we propose two physics-informed bandit algorithms \textit{pretc} and \textit{prgreedy} that exploit the sparse multipath property of mmWave channels - a generic but realistic assumption - which is connected to the Phase Retrieval Bandit problem. Our algorithms treat the parameters of each path as black boxes and maintain optimal estimates of them based on sampled historical rewards. \textit{pretc} starts with a random exploration phase and then commits to the optimal beam under the estimated reward function. \textit{prgreedy} performs such estimation in an online manner and chooses the best beam under current estimates. Our algorithms can also be easily adapted to beam tracking in the mobile setting. Through experiments using both the synthetic DeepMIMO dataset and the real-world DeepSense6G dataset, we demonstrate that both algorithms outperform existing approaches in a wide range of scenarios across diverse channel environments, showing their generalizability and robustness.
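The explore-then-commit structure of pretc is easy to illustrate: probe random beams, fit a sparse multipath reward model to the noisy samples, and commit to the beam that maximizes the fitted model. In the sketch below, a two-lobe Gaussian beam pattern and a coarse grid fit stand in for the real array response and estimator; both are illustrative assumptions, not the paper's model or algorithm.

```python
# Explore-then-commit sketch in the spirit of pretc: probe random beams,
# fit a sparse two-path reward model to the noisy samples, then commit to
# the beam maximizing the fitted model. The Gaussian beam pattern and the
# coarse grid fit are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
beams = np.linspace(-np.pi / 2, np.pi / 2, 128)

def response(theta, angles, gains, width=0.1):
    # toy beam pattern: one Gaussian lobe per propagation path
    return sum(g * np.exp(-((theta - a) / width) ** 2)
               for a, g in zip(angles, gains))

true_angles, true_gains = np.array([-0.4, 0.7]), np.array([1.0, 0.6])

probes = rng.choice(beams, size=60)                       # exploration phase
rewards = response(probes, true_angles, true_gains) + rng.normal(0, 0.05, 60)

grid = np.linspace(-np.pi / 2, np.pi / 2, 40)             # coarse angle grid
best, best_err = None, np.inf
for a1 in grid:                                           # fit path angles/gains
    for a2 in grid:
        X = np.stack([np.exp(-((probes - a1) / 0.1) ** 2),
                      np.exp(-((probes - a2) / 0.1) ** 2)], axis=1)
        g, *_ = np.linalg.lstsq(X, rewards, rcond=None)
        err = np.sum((X @ g - rewards) ** 2)
        if err < best_err:
            best, best_err = (np.array([a1, a2]), g), err

angles_hat, gains_hat = best
commit = beams[np.argmax(response(beams, angles_hat, gains_hat))]  # commit phase
print(f"committed beam: {commit:.3f}")
```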

[429] Towards Identifiability of Hierarchical Temporal Causal Representation Learning

Zijian Li, Minghao Fu, Junxian Huang, Yifan Shen, Ruichu Cai, Yuewen Sun, Guangyi Chen, Kun Zhang

Main category: cs.LG

TL;DR: CHiLD framework uniquely identifies hierarchical latent dynamics using three conditionally independent observations, enabling recovery of joint distribution of multi-layer latent variables from single-timestep observed variables.

DetailsMotivation: Existing temporal causal representation learning methods fail to capture hierarchical latent dynamics because they cannot recover the joint distribution of hierarchical latent variables from single-timestep observed variables.

Method: Uses temporal contextual observed variables to identify joint distribution of multi-layer latent variables, exploits natural sparsity of hierarchical structure to identify latent variables per layer, and develops variational inference model with contextual encoder and flow-based hierarchical prior networks.

Result: Empirical evaluations on synthetic and real-world datasets validate theoretical claims and demonstrate effectiveness in modeling hierarchical latent dynamics.

Conclusion: CHiLD framework successfully identifies and models hierarchical latent dynamics, overcoming limitations of existing methods through theoretical insights and practical implementation.

Abstract: Modeling hierarchical latent dynamics behind time series data is critical for capturing temporal dependencies across multiple levels of abstraction in real-world tasks. However, existing temporal causal representation learning methods fail to capture such dynamics, as they cannot recover the joint distribution of hierarchical latent variables from \textit{single-timestep observed variables}. Interestingly, we find that the joint distribution of hierarchical latent variables can be uniquely determined using three conditionally independent observations. Building on this insight, we propose a Causally Hierarchical Latent Dynamic (CHiLD) identification framework. Our approach first employs temporal contextual observed variables to identify the joint distribution of multi-layer latent variables. Subsequently, we exploit the natural sparsity of the hierarchical structure among latent variables to identify latent variables within each layer. Guided by the theoretical results, we develop a time series generative model grounded in variational inference. This model incorporates a contextual encoder to reconstruct multi-layer latent variables and normalizing-flow-based hierarchical prior networks to impose the independent noise condition of hierarchical latent dynamics. Empirical evaluations on both synthetic and real-world datasets validate our theoretical claims and demonstrate the effectiveness of CHiLD in modeling hierarchical latent dynamics.

[430] Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task

Brady Bhalla, Honglu Fan, Nancy Chen, Tony Yue YU

Main category: cs.LG

TL;DR: Transformers trained on bubble-sort tasks develop structured internal world models where embedding dimension size affects representation quality - larger dimensions yield more faithful, consistent, and interpretable internal representations.

DetailsMotivation: To understand how embedding dimension affects the emergence of internal world models in transformers performing algorithmic tasks like bubble sort, and to quantitatively measure representation quality beyond just end performance.

Method: Train transformers with reinforcement learning to perform bubble-sort-style adjacent swaps, systematically varying embedding dimensions across hundreds of experiments and analyzing internal attention mechanisms.

Result: Models achieve high accuracy even with small embeddings, but larger dimensions produce more robust internal representations. Two consistent mechanisms emerge: last attention row encodes global token ordering, and transpositions align with largest adjacent differences of these encoded values.

Conclusion: Transformers build structured internal world models for algorithmic tasks, and model size (embedding dimension) improves representation quality and interpretability in addition to end performance.

Abstract: We investigate how embedding dimension affects the emergence of an internal “world model” in a transformer trained with reinforcement learning to perform bubble-sort-style adjacent swaps. Models achieve high accuracy even with very small embedding dimensions, but larger dimensions yield more faithful, consistent, and robust internal representations. In particular, higher embedding dimensions strengthen the formation of structured internal representation and lead to better interpretability. After hundreds of experiments, we observe two consistent mechanisms: (1) the last row of the attention weight matrix monotonically encodes the global ordering of tokens; and (2) the selected transposition aligns with the largest adjacent difference of these encoded values. Our results provide quantitative evidence that transformers build structured internal world models and that model size improves representation quality in addition to end performance. We release our metrics and analyses, which can be used to probe similar algorithmic tasks.

[431] Uncertainty Estimation by Flexible Evidential Deep Learning

Taeseong Yoon, Heeyoung Kim

Main category: cs.LG

TL;DR: The paper proposes Flexible Evidential Deep Learning (F-EDL), which extends traditional EDL by using flexible Dirichlet distributions for more expressive uncertainty quantification, improving generalization and reliability in challenging scenarios.

DetailsMotivation: Current evidential deep learning methods have limited robustness due to restrictive Dirichlet distribution assumptions, which can lead to poor uncertainty quantification in complex or unforeseen situations.

Method: F-EDL extends EDL by predicting flexible Dirichlet distributions (a generalization of standard Dirichlet distributions) over class probabilities, providing more expressive uncertainty representation.

Result: F-EDL achieves state-of-the-art uncertainty quantification performance across diverse evaluation settings including classical, long-tailed, and noisy in-distribution scenarios.

Conclusion: The flexible Dirichlet approach significantly enhances uncertainty quantification generalization and reliability, making it more suitable for high-stakes applications where overconfident predictions could have serious consequences.

Abstract: Uncertainty quantification (UQ) is crucial for deploying machine learning models in high-stakes applications, where overconfident predictions can lead to serious consequences. An effective UQ method must balance computational efficiency with the ability to generalize across diverse scenarios. Evidential deep learning (EDL) achieves efficiency by modeling uncertainty through the prediction of a Dirichlet distribution over class probabilities. However, the restrictive assumption of Dirichlet-distributed class probabilities limits EDL’s robustness, particularly in complex or unforeseen situations. To address this, we propose \textit{flexible evidential deep learning} ($\mathcal{F}$-EDL), which extends EDL by predicting a flexible Dirichlet distribution – a generalization of the Dirichlet distribution – over class probabilities. This approach provides a more expressive and adaptive representation of uncertainty, significantly enhancing UQ generalization and reliability under challenging scenarios. We theoretically establish several advantages of $\mathcal{F}$-EDL and empirically demonstrate its state-of-the-art UQ performance across diverse evaluation settings, including classical, long-tailed, and noisy in-distribution scenarios.

[432] Scalable, Explainable and Provably Robust Anomaly Detection with One-Step Flow Matching

Zhong Li, Qi Huang, Yuxuan Zhu, Lincen Yang, Mohammad Mohammadi Amiri, Niki van Stein, Matthijs van Leeuwen

Main category: cs.LG

TL;DR: TCCM is a semi-supervised anomaly detection method for tabular data that uses time-conditioned contraction matching to learn velocity fields toward a fixed target, offering efficient training, fast inference, and theoretical robustness guarantees.

DetailsMotivation: To address the inference bottleneck and computational cost of existing continuous-time anomaly detection models like DTE, while maintaining high accuracy and providing explainable results.

Method: TCCM simplifies flow matching by predicting time-conditioned contraction vectors toward the origin, eliminating the need for solving ODEs during training/inference. It uses one time-step deviation scoring for efficient anomaly detection.

Result: Extensive experiments on ADBench show TCCM achieves favorable balance between detection accuracy and inference cost, outperforming state-of-the-art methods, especially on high-dimensional and large-scale datasets.

Conclusion: TCCM provides an efficient, scalable, and theoretically robust approach to anomaly detection with explainable feature-wise attribution and Lipschitz-continuous scoring.

Abstract: We introduce Time-Conditioned Contraction Matching (TCCM), a novel method for semi-supervised anomaly detection in tabular data. TCCM is inspired by flow matching, a recent generative modeling framework that learns velocity fields between probability distributions and has shown strong performance compared to diffusion models and generative adversarial networks. Instead of directly applying flow matching as originally formulated, TCCM builds on its core idea – learning velocity fields between distributions – but simplifies the framework by predicting a time-conditioned contraction vector toward a fixed target (the origin) at each sampled time step. This design offers three key advantages: (1) a lightweight and scalable training objective that removes the need for solving ordinary differential equations during training and inference; (2) an efficient scoring strategy called one time-step deviation, which quantifies deviation from expected contraction behavior in a single forward pass, addressing the inference bottleneck of existing continuous-time models such as DTE (a diffusion-based model with leading anomaly detection accuracy but heavy inference cost); and (3) explainability and provable robustness, as the learned velocity field operates directly in input space, making the anomaly score inherently feature-wise attributable; moreover, the score function is Lipschitz-continuous with respect to the input, providing theoretical guarantees under small perturbations. Extensive experiments on the ADBench benchmark show that TCCM strikes a favorable balance between detection accuracy and inference cost, outperforming state-of-the-art methods – especially on high-dimensional and large-scale datasets. The source code is available at our GitHub repository.
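Reading off the summary, the mechanism admits a compact sketch: train a network on normal data to predict the contraction vector that carries a point toward the origin at a sampled time step, then score test points by their deviation from the expected contraction in a single forward pass. The parameterization below (straight-line contraction path with velocity target -x) is a guess consistent with the summary, not TCCM's exact formulation.

```python
# Minimal sketch of time-conditioned contraction matching as described in
# the summary: on normal data, a network learns the contraction vector
# that moves a point toward the origin at a sampled time step; at test
# time, the anomaly score is the one-pass deviation from the expected
# contraction. The target/score parameterization is a guess, not TCCM's.
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train = torch.randn(2048, 8)               # toy "normal" tabular data

net = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 8))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(1000):
    x = X_train[torch.randint(0, len(X_train), (256,))]
    t = torch.rand(256, 1)
    xt = (1 - t) * x                          # contraction path toward origin
    target = -x                               # velocity of that path
    pred = net(torch.cat([xt, t], dim=1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def score(x, t0=0.5):                         # one time-step deviation
    t = torch.full((len(x), 1), t0)
    pred = net(torch.cat([(1 - t) * x, t], dim=1))
    return ((pred + x) ** 2).sum(dim=1)       # deviation from expected -x

print(score(torch.randn(4, 8)))               # in-distribution-ish points
print(score(torch.randn(4, 8) * 5 + 10))      # far-away points score higher
```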

[433] Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

Jongmin Lee, Ernest K. Ryu

Main category: cs.LG

TL;DR: This paper provides convergence analysis for policy gradient methods in undiscounted total-reward MDPs (γ=1), addressing a gap in existing theory that primarily assumes γ<1.

DetailsMotivation: Modern policy-based RL for large language models uses undiscounted settings (γ=1), but existing theoretical analyses assume γ<1, making them inapplicable to these practical applications.

Method: The analysis uses two key insights: (1) state classification into recurrent/transient states is invariant for policies with strictly positive action probabilities, and (2) replaces the classical state visitation measure with a new ’transient visitation measure’ that works when γ=1.

Result: The paper establishes theoretical foundations for policy gradient methods in undiscounted infinite-horizon MDPs, providing convergence guarantees that were previously unavailable for γ=1 settings.

Conclusion: This work bridges the theory-practice gap by extending policy gradient analysis to undiscounted settings, making rigorous theoretical support available for modern RL applications like large language model training.

Abstract: The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor $\gamma < 1$. In contrast, however, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with $\gamma = 1$, rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical state visitation measure (which may be ill-defined when $\gamma = 1$) can be replaced with a new object that we call the transient visitation measure.
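To make the second insight concrete, the expected number of visits to each transient state of a Markov chain under a fixed policy is finite even at γ = 1, which is what lets it stand in for the classical (possibly ill-defined) visitation measure. A small worked example with an illustrative chain:

```python
# Expected visit counts to transient states: N = (I - Q)^{-1}, where Q is the
# transition matrix restricted to transient states. Finite even when gamma = 1.
import numpy as np

P = np.array([
    [0.5, 0.3, 0.2],   # states 0 and 1 are transient
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],   # state 2 is absorbing (recurrent)
])
transient = [0, 1]
Q = P[np.ix_(transient, transient)]

N = np.linalg.inv(np.eye(len(transient)) - Q)
mu0 = np.array([1.0, 0.0])          # start in state 0
print(mu0 @ N)                      # transient visitation measure, by state
```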

[434] Computable universal online learning

Dariusz Kalociński, Tomasz Steifer

Main category: cs.LG

TL;DR: The paper examines when universal online learning can be implemented as computable programs, showing that theoretical learnability doesn’t guarantee computable implementation, and provides characterizations for computable agnostic and proper universal online learning.

DetailsMotivation: To bridge the gap between abstract mathematical learning theory and practical computable implementations, addressing when learning algorithms can actually be implemented as computer programs rather than just existing as mathematical objects.

Method: Analyzes universal online learning in the context of computability theory, examining when finite-mistake strategies can be implemented as computer programs. Studies computable universal online learning, its agnostic variant, and proper universal online learning.

Result: Shows that universal online learning does not imply computable universal online learning, even for relatively simple hypothesis classes. Provides exact characterizations for when computable agnostic universal online learning and proper universal online learning are possible.

Conclusion: The results provide a more realistic perspective on online binary classification theory by connecting abstract learnability with practical computability constraints, highlighting important distinctions between theoretical and implementable learning.

Abstract: Understanding when learning is possible is a fundamental task in the theory of machine learning. However, many characterizations known from the literature deal with abstract learning as a mathematical object and ignore the crucial question: when can learning be implemented as a computer program? We address this question for universal online learning, a generalist theoretical model of online binary classification, recently characterized by Bousquet et al. (STOC'21). In this model, there is no hypothesis fixed in advance; instead, Adversary – playing the role of Nature – can change their mind as long as local consistency with the given class of hypotheses is maintained. We require Learner to achieve a finite number of mistakes while using a strategy that can be implemented as a computer program. We show that universal online learning does not imply computable universal online learning, even if the class of hypotheses is relatively easy from a computability-theoretic perspective. We then study the agnostic variant of computable universal online learning and provide an exact characterization of classes that are learnable in this sense. We also consider a variant of proper universal online learning and show exactly when it is possible. Together, our results give a more realistic perspective on the existing theory of online binary classification and the related problem of inductive inference.

[435] Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers

Firas Gabetni, Giuseppe Curci, Andrea Pilzer, Subhankar Roy, Elisa Ricci, Gianni Franchi

Main category: cs.LG

TL;DR: Hydra Ensembles is an efficient transformer-based ensemble method that prunes attention heads to create diverse members and merges them using grouped fully-connected layers, achieving strong uncertainty quantification performance with inference speed close to a single network.

DetailsMotivation: Deep Ensembles provide strong uncertainty quantification but have high computational and memory costs that limit scalability to large models. There is a need for efficient ensemble methods that maintain UQ performance while being computationally practical.

Method: Prune attention heads in transformers to create diverse ensemble members, then merge them using a new multi-head attention mechanism with grouped fully-connected layers. This creates a compact model without requiring retraining from scratch.

Result: Hydra Ensembles match or surpass Deep Ensembles in UQ performance while having inference speed close to a single network. In zero-shot classification on ImageNet-1k, it surpasses state-of-the-art methods without additional training. Experiments across image and text classification tasks with various architectures show consistent gains.

Conclusion: Hydra Ensembles provide an efficient alternative to Deep Ensembles for uncertainty quantification, achieving strong performance with minimal computational overhead. The method demonstrates that careful pruning strategies can preserve calibration while naive approaches may harm it.

Abstract: Uncertainty quantification (UQ) is essential for deploying deep neural networks in safety-critical settings. Although methods like Deep Ensembles achieve strong UQ performance, their high computational and memory costs hinder scalability to large models. We introduce Hydra Ensembles, an efficient transformer-based ensemble that prunes attention heads to create diverse members and merges them via a new multi-head attention with grouped fully-connected layers. This yields a compact model with inference speed close to a single network, matching or surpassing Deep Ensembles in UQ performance without retraining from scratch. We also provide an in-depth analysis of pruning, showing that naive approaches can harm calibration, whereas Hydra Ensembles preserves robust uncertainty. Experiments on image and text classification tasks, with various architectures, show consistent gains over Deep Ensembles. Remarkably, in zero-shot classification on ImageNet-1k, our approach surpasses state-of-the-art methods, even without requiring additional training.
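The "grouped fully-connected" merging can be pictured as a block-diagonal linear map, one block per ensemble member, which is what keeps inference close to a single network. A sketch under our own dimensions (the head-pruning step is omitted):

```python
# Hedged sketch: K members share one module whose weights are block-diagonal,
# implemented with a grouped 1x1 convolution. Not the paper's exact layer.
import torch
import torch.nn as nn

class GroupedLinear(nn.Module):
    def __init__(self, in_per_member, out_per_member, members):
        super().__init__()
        # groups=members gives each member its own weight block.
        self.conv = nn.Conv1d(members * in_per_member,
                              members * out_per_member,
                              kernel_size=1, groups=members)

    def forward(self, x):               # x: (batch, members * in_per_member)
        return self.conv(x.unsqueeze(-1)).squeeze(-1)

members, d_in, d_out = 4, 64, 32
layer = GroupedLinear(d_in, d_out, members)
x = torch.randn(8, members * d_in)      # concatenated member features
y = layer(x)                            # (8, members * d_out), one slice each
# For UQ, average the members' softmax outputs and measure their disagreement.
```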

[436] Learning to Flow from Generative Pretext Tasks for Neural Architecture Encoding

Sunwoo Kim, Hyunjin Hwang, Kijung Shin

Main category: cs.LG

TL;DR: FGP is a novel pre-training method that trains neural architecture encoders to capture information flow using a flow surrogate representation, achieving up to 106% performance improvement over supervised-only training.

DetailsMotivation: Current flow-based neural architecture encoders are slow due to complex structures, creating practical challenges despite their effectiveness in capturing information flow for architecture performance prediction.

Method: Proposed FGP pre-training method that trains encoders to reconstruct a flow surrogate representation of neural architecture’s information flow, eliminating the need for specialized model structures.

Result: FGP boosts encoder performance by up to 106% in Precision-1% compared to the same encoder trained solely with supervised learning.

Conclusion: FGP provides an effective pre-training approach that captures information flow efficiently without complex model structures, significantly improving neural architecture encoding performance.

Abstract: The performance of a deep learning model on a specific task and dataset depends heavily on its neural architecture, motivating considerable efforts to rapidly and accurately identify architectures suited to the target task and dataset. To achieve this, researchers use machine learning models – typically neural architecture encoders – to predict the performance of a neural architecture. Many state-of-the-art encoders aim to capture information flow within a neural architecture, which reflects how information moves through the forward pass and backpropagation, via a specialized model structure. However, due to their complicated structures, these flow-based encoders are significantly slower to process neural architectures compared to simpler encoders, presenting a notable practical challenge. To address this, we propose FGP, a novel pre-training method for neural architecture encoding that trains an encoder to capture the information flow without requiring specialized model structures. FGP trains an encoder to reconstruct a flow surrogate, our proposed representation of the neural architecture’s information flow. Our experiments show that FGP boosts encoder performance by up to 106% in Precision-1%, compared to the same encoder trained solely with supervised learning.
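The abstract does not spell out the flow surrogate, so as a purely illustrative stand-in, one simple representation of forward information flow in an architecture DAG is pairwise reachability, which a pretext task could ask the encoder to reconstruct:

```python
# Illustrative stand-in only; the paper's flow surrogate may differ.
import numpy as np

A = np.array([[0, 1, 1, 0],    # adjacency of a 4-operation architecture DAG
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

def reachability(adj):
    # Transitive closure: R[i, j] = 1 iff information can flow from i to j.
    R = (adj > 0).astype(int)
    for _ in range(adj.shape[0]):
        R = ((R + R @ R) > 0).astype(int)
    return R

forward_flow = reachability(A)   # forward-pass information flow
backward_flow = forward_flow.T   # gradients flow along reversed edges
# A generative pretext task would train the encoder to reconstruct these.
```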

[437] Towards Unsupervised Open-Set Graph Domain Adaptation via Dual Reprogramming

Zhen Zhang, Bingsheng He

Main category: cs.LG

TL;DR: GraphRTA is a novel framework for unsupervised open-set graph domain adaptation that performs both graph and model reprogramming to handle unknown classes in target graphs.

DetailsMotivation: Existing graph domain adaptation models assume closed-set settings with identical label spaces, but real-world scenarios often involve target domains with classes not present in the source domain.

Method: GraphRTA conducts dual reprogramming: (1) graph reprogramming by modifying target graph structure and node features for better known/unknown class separation, and (2) model reprogramming by pruning domain-specific parameters while preserving transferable patterns. It also extends the classifier with an extra dimension for unknown class recognition.

Result: Comprehensive experiments on public datasets show that GraphRTA achieves satisfactory performance compared with state-of-the-art baselines.

Conclusion: The proposed GraphRTA framework effectively addresses the open-set graph domain adaptation problem by combining graph and model reprogramming techniques, eliminating the need for manual threshold specification in unknown class recognition.

Abstract: Unsupervised Graph Domain Adaptation has become a promising paradigm for transferring knowledge from a fully labeled source graph to an unlabeled target graph. Existing graph domain adaptation models primarily focus on the closed-set setting, where the source and target domains share the same label spaces. However, this assumption might not be practical in real-world scenarios, as the target domain might include classes that are not present in the source domain. In this paper, we investigate the problem of unsupervised open-set graph domain adaptation, where the goal is to not only correctly classify target nodes into the known classes, but also assign previously unseen node types to the unknown class. Towards this end, we propose a novel framework called GraphRTA, which conducts reprogramming on both the graph and model sides. Specifically, we reprogram the graph by modifying target graph structure and node features, which facilitates better separation of known and unknown classes. Meanwhile, we also perform model reprogramming by pruning domain-specific parameters to reduce bias towards the source graph while preserving parameters that capture transferable patterns across graphs. Additionally, we extend the classifier with an extra dimension for the unknown class, thus eliminating the need for a manually specified threshold in open-set recognition. Comprehensive experiments on several public datasets demonstrate that our proposed model can achieve satisfactory performance compared with recent state-of-the-art baselines. Our source codes and datasets are publicly available at https://github.com/cszhangzhen/GraphRTA.
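The threshold-free unknown-class mechanism amounts to a (K+1)-way head whose extra logit competes directly in the softmax; a minimal sketch with our own dimensions:

```python
# Dimensions ours; the GNN producing `feats` is omitted.
import torch
import torch.nn as nn

K = 5                                  # known source-domain classes
head = nn.Linear(128, K + 1)           # index K = explicit "unknown" logit
feats = torch.randn(10, 128)           # target-node embeddings from a GNN
pred = head(feats).argmax(dim=-1)      # 0..K-1 -> known class, K -> unknown
is_unknown = pred == K                 # no hand-tuned confidence threshold
```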

[438] Training Diverse Graph Experts for Ensembles: A Systematic Empirical Study

Gangda Deng, Yuxin Yang, Ömer Faruk Akgül, Hanqing Zeng, Yinglong Xia, Rajgopal Kannan, Viktor Prasanna

Main category: cs.LG

TL;DR: Systematic empirical study of 20 diversification techniques for GNN ensembles across 14 node classification benchmarks, analyzing expert diversity and ensemble performance.

DetailsMotivation: Single GNN performance is limited by graph heterogeneity, and Mixture-of-Experts frameworks show that assembling diverse GNNs can significantly improve performance.

Method: Evaluated 20 diversification strategies including random re-initialization, hyperparameter tuning, architectural variation, directionality modeling, and training data partitioning across 14 benchmarks, constructing over 200 ensemble variants.

Result: Comprehensive evaluation examined each technique in terms of expert diversity, complementarity, and ensemble performance, uncovering mechanistic insights into training maximally diverse experts.

Conclusion: Findings provide actionable guidance for expert training and the design of effective Mixture-of-Experts frameworks on graph data.

Abstract: Graph Neural Networks (GNNs) have become essential tools for learning on relational data, yet the performance of a single GNN is often limited by the heterogeneity present in real-world graphs. Recent advances in Mixture-of-Experts (MoE) frameworks demonstrate that assembling multiple, explicitly diverse GNNs with distinct generalization patterns can significantly improve performance. In this work, we present the first systematic empirical study of expert-level diversification techniques for GNN ensembles. Evaluating 20 diversification strategies – including random re-initialization, hyperparameter tuning, architectural variation, directionality modeling, and training data partitioning – across 14 node classification benchmarks, we construct and analyze over 200 ensemble variants. Our comprehensive evaluation examines each technique in terms of expert diversity, complementarity, and ensemble performance. We also uncover mechanistic insights into training maximally diverse experts. These findings provide actionable guidance for expert training and the design of effective MoE frameworks on graph data. Our code is available at https://github.com/Hydrapse/bench-gnn-diversification.
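One standard way to quantify the expert diversity examined here is mean pairwise prediction disagreement; the paper's exact diversity and complementarity metrics may differ. A small example:

```python
import itertools
import numpy as np

# Predicted class ids of 3 experts on 6 nodes (values illustrative).
preds = np.array([[0, 1, 1, 2, 0, 1],
                  [0, 1, 2, 2, 0, 0],
                  [1, 1, 1, 2, 0, 1]])

pairs = itertools.combinations(range(preds.shape[0]), 2)
disagreement = np.mean([(preds[i] != preds[j]).mean() for i, j in pairs])
print(f"mean pairwise disagreement: {disagreement:.3f}")
```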

[439] Approximation Rates of Shallow Neural Networks: Barron Spaces, Activation Functions and Optimality Analysis

Jian Lu, Xiaohuang Huang

Main category: cs.LG

TL;DR: Analysis of shallow neural networks with ReLU^k activation functions, showing optimal approximation rates cannot be achieved under certain conditions and confirming the curse of dimensionality.

DetailsMotivation: To understand the approximation properties of shallow neural networks with power-of-exponential activation functions and their dependence on dimension and function smoothness.

Method: Examining approximation rates of ReLU^k activation functions within Barron function space, analyzing conditions under which optimal rates cannot be achieved.

Result: Optimal approximation rates cannot be achieved under ℓ¹-bounded coefficients or insufficient smoothness. Established optimal rates in various norms for Barron and Sobolev spaces, confirming curse of dimensionality.

Conclusion: Clarifies limits of shallow neural networks’ approximation capabilities and provides insights for selecting activation functions and network structures.

Abstract: This paper investigates the approximation properties of shallow neural networks with activation functions that are powers of exponential functions. It focuses on the dependence of the approximation rate on the dimension and the smoothness of the function being approximated within the Barron function space. We examine the approximation rates of ReLU$^{k}$ activation functions, proving that the optimal rate cannot be achieved under $\ell^{1}$-bounded coefficients or insufficient smoothness conditions. We also establish optimal approximation rates in various norms for functions in Barron spaces and Sobolev spaces, confirming the curse of dimensionality. Our results clarify the limits of shallow neural networks’ approximation capabilities and offer insights into the selection of activation functions and network structures.
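For reference, the objects discussed above written in their usual form (definitions as commonly used; the paper's precise norms and exponents may differ):

```latex
% ReLU^k activation and a shallow (one-hidden-layer) network of width N:
\[
  \sigma_k(t) = \max(0, t)^k, \qquad
  f_N(x) = \sum_{j=1}^{N} a_j\, \sigma_k(\langle w_j, x \rangle + b_j).
\]
% Approximation rates for a target f in a Barron-type class take the form
\[
  \inf_{f_N} \| f - f_N \|_{L^2} \lesssim N^{-\alpha(k, d)},
\]
% where the dependence of the exponent on the dimension d is where the
% curse of dimensionality enters.
```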

[440] Learning from N-Tuple Data with M Positive Instances: Unbiased Risk Estimation and Theoretical Guarantees

Miao Zhang, Junpeng Li, ChangChun HUa, Yana Yang

Main category: cs.LG

TL;DR: The paper proposes a method for weakly supervised learning where training examples are n-tuples containing exactly m positives, with only the count m per tuple observed. The approach uses an unbiased risk estimator derived from tuple counts and shows strong performance across benchmarks.

DetailsMotivation: Weakly supervised learning often deals with coarse aggregate signals rather than instance labels. The NTMP (N-tuple with M positives) setting arises in practical scenarios like image classification with region proposals and multi-instance measurements, where only tuple-level counts are available.

Method: The method derives a trainable unbiased risk estimator (URE) by linking tuple-generation to latent instance marginals. It handles fixed/variable tuple sizes and counts, with ReLU corrections for finite-sample stability while preserving asymptotic correctness.

Result: The approach consistently outperforms representative weak-supervision baselines across benchmarks converted to NTMP tasks, yielding favorable precision-recall and F1 trade-offs. It remains robust under class-prior imbalance and diverse tuple configurations.

Conclusion: Count-only supervision can be effectively exploited through a theoretically grounded and practically stable objective, demonstrating that tuple counts provide sufficient information for effective learning in weakly supervised settings.

Abstract: Weakly supervised learning often operates with coarse aggregate signals rather than instance labels. We study a setting where each training example is an $n$-tuple containing exactly $m$ positives, while only the count $m$ per tuple is observed. This NTMP (N-tuple with M positives) supervision arises in, e.g., image classification with region proposals and multi-instance measurements. We show that tuple counts admit a trainable unbiased risk estimator (URE) by linking the tuple-generation process to latent instance marginals. Starting from fixed $(n, m)$, we derive a closed-form URE and extend it to variable tuple sizes, variable counts, and their combination. Identification holds whenever the effective mixing rate is separated from the class prior. We establish generalization bounds via Rademacher complexity and prove statistical consistency with standard rates under mild regularity assumptions. To improve finite-sample stability, we introduce simple ReLU corrections to the URE that preserve asymptotic correctness. Across benchmarks converted to NTMP tasks, the approach consistently outperforms representative weak-supervision baselines and yields favorable precision-recall and F1 trade-offs. It remains robust under class-prior imbalance and across diverse tuple configurations, demonstrating that count-only supervision can be exploited effectively through a theoretically grounded and practically stable objective.
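Schematically, the ReLU correction follows the non-negative risk pattern familiar from PU learning: each reweighted partial risk is clamped at zero so the finite-sample objective cannot go negative, while the uncorrected estimator remains unbiased in expectation. The NTMP-specific weights (functions of n, m, and the class prior) are abstracted away here:

```python
# Schematic only; `weighted_partial_risks` stands in for the paper's
# tuple-count-derived, reweighted loss components.
import torch

def relu_corrected_objective(weighted_partial_risks):
    return sum(torch.relu(r) for r in weighted_partial_risks)
```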

[441] Provable Generalization Bounds for Deep Neural Networks with Adaptive Regularization

Adeel Safder

Main category: cs.LG

TL;DR: MAGDrop is a novel regularization method that dynamically adjusts dropout rates based on gradients and momentum, achieving improved generalization with theoretical justification and empirical validation on MNIST and CIFAR-10.

DetailsMotivation: Deep neural networks often suffer from overfitting due to high capacity, requiring better regularization methods to enhance generalization in non-convex optimization landscapes.

Method: Momentum-Adaptive Gradient Dropout (MAGDrop) dynamically adjusts dropout rates on activations based on current gradients and accumulated momentum, with theoretical analysis using a tightened PAC-Bayes generalization bound.

Result: MAGDrop outperforms baseline regularization techniques by 1-2% in test accuracy (MNIST: 99.52%, CIFAR-10: 90.63%) with generalization gaps of 0.48% and 7.14% respectively, and achieves up to 20% sharper theoretical bounds.

Conclusion: The work bridges theoretical insights and practical advancements, offering a robust framework for enhancing DNN generalization suitable for high-stakes applications.

Abstract: Deep neural networks (DNNs) achieve remarkable performance but often suffer from overfitting due to their high capacity. We introduce Momentum-Adaptive Gradient Dropout (MAGDrop), a novel regularization method that dynamically adjusts dropout rates on activations based on current gradients and accumulated momentum, enhancing stability in non-convex optimization landscapes. To theoretically justify MAGDrop’s effectiveness, we derive a tightened PAC-Bayes generalization bound that accounts for its adaptive nature, achieving up to 20% sharper bounds compared to standard approaches by leveraging momentum-driven perturbation control. Empirically, the activation-based MAGDrop outperforms baseline regularization techniques, including standard dropout and adaptive gradient regularization, by 1-2% in test accuracy on MNIST (99.52%) and CIFAR-10 (90.63%), with generalization gaps of 0.48% and 7.14%, respectively. Our work bridges theoretical insights and practical advancements, offering a robust framework for enhancing DNN generalization suitable for high-stakes applications.
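Our reading of the adaptive-dropout step, sketched under the assumption that units with large gradient momentum should be dropped less; the paper's exact rate schedule and normalization may differ:

```python
# Hedged sketch of a momentum-adaptive dropout step (assumptions ours).
import torch

def magdrop(x, grad_momentum, base_rate=0.5, eps=1e-8):
    # grad_momentum: running average of |dL/d(activation)|, same shape as x.
    importance = grad_momentum / (grad_momentum.max() + eps)  # scaled to [0,1]
    rate = base_rate * (1.0 - importance)    # important units: lower drop rate
    mask = torch.bernoulli(1.0 - rate)
    return x * mask / (1.0 - rate + eps)     # inverted-dropout rescaling
```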

[442] Learning Boltzmann Generators via Constrained Mass Transport

Christopher von Klitzing, Denis Blessing, Henrik Schopmans, Pascal Friederich, Gerhard Neumann

Main category: cs.LG

TL;DR: CMT is a variational framework for sampling from high-dimensional Boltzmann distributions that uses constraints on KL divergence and entropy decay between steps to improve distributional overlap and avoid mode collapse, outperforming state-of-the-art methods.

DetailsMotivation: Existing methods for sampling Boltzmann distributions suffer from mode collapse (variational approaches) or mass teleportation and schedule tuning issues (annealing-based methods), requiring a more robust approach.

Method: Constrained Mass Transport (CMT) generates intermediate distributions with constraints on both KL divergence and entropy decay between successive steps to enhance distributional overlap and prevent premature convergence.

Result: CMT consistently outperforms state-of-the-art variational methods across benchmarks, achieving more than 2.5x higher effective sample size while avoiding mode collapse, even on the largest system studied without molecular dynamics samples.

Conclusion: CMT provides a superior variational framework for Boltzmann generators that effectively addresses key limitations of existing methods through constrained intermediate distributions.

Abstract: Efficient sampling from high-dimensional and multimodal unnormalized probability distributions is a central challenge in many areas of science and machine learning. We focus on Boltzmann generators (BGs) that aim to sample the Boltzmann distribution of physical systems, such as molecules, at a given temperature. Classical variational approaches that minimize the reverse Kullback-Leibler divergence are prone to mode collapse, while annealing-based methods, commonly using geometric schedules, can suffer from mass teleportation and rely heavily on schedule tuning. We introduce Constrained Mass Transport (CMT), a variational framework that generates intermediate distributions under constraints on both the KL divergence and the entropy decay between successive steps. These constraints enhance distributional overlap, mitigate mass teleportation, and counteract premature convergence. Across standard BG benchmarks and the ELIL tetrapeptide introduced here, the largest system studied to date without access to samples from molecular dynamics, CMT consistently surpasses state-of-the-art variational methods, achieving more than 2.5x higher effective sample size while avoiding mode collapse.
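One way to write such a constrained intermediate-distribution step (our formalization of the abstract's description, not necessarily the paper's exact program):

```latex
% Each step moves toward the target while a trust-region-style KL constraint
% preserves overlap with the previous distribution and an entropy-decay
% constraint guards against premature convergence. Symbols epsilon, delta
% are ours.
\[
  q_{k+1} \in \arg\min_{q}\; \mathrm{KL}\!\left(q \,\|\, p_{\mathrm{target}}\right)
  \quad \text{s.t.} \quad
  \mathrm{KL}\!\left(q \,\|\, q_k\right) \le \epsilon,
  \qquad
  \mathcal{H}(q_k) - \mathcal{H}(q) \le \delta .
\]
```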

[443] Simple and Efficient Heterogeneous Temporal Graph Neural Network

Yili Wang, Tairan Huang, Changlong He, Qiutong Li, Jianliang Gao

Main category: cs.LG

TL;DR: SE-HTGNN is a novel heterogeneous temporal graph neural network that integrates temporal modeling into spatial learning via dynamic attention, achieving 10x speed-up while maintaining state-of-the-art forecasting accuracy.

DetailsMotivation: Existing methods for heterogeneous temporal graphs use decoupled temporal and spatial learning, which weakens spatio-temporal interactions and leads to high model complexity.

Method: Integrates temporal modeling into spatial learning via dynamic attention that retains historical attention information, and leverages large language models to capture implicit node type properties as prior knowledge.

Result: Achieves up to 10x speed-up over state-of-the-art baselines while maintaining best forecasting accuracy in extensive experiments.

Conclusion: SE-HTGNN provides a simple and efficient learning paradigm for heterogeneous temporal graphs that effectively bridges the gap in spatio-temporal information interactions.

Abstract: Heterogeneous temporal graphs (HTGs) are ubiquitous data structures in the real world. Recently, to enhance representation learning on HTGs, numerous attention-based neural networks have been proposed. Despite these successes, existing methods rely on a decoupled temporal and spatial learning paradigm, which weakens interactions of spatio-temporal information and leads to a high model complexity. To bridge this gap, we propose a novel learning paradigm for HTGs called Simple and Efficient Heterogeneous Temporal Graph Neural Network (SE-HTGNN). Specifically, we innovatively integrate temporal modeling into spatial learning via a novel dynamic attention mechanism, which retains attention information from historical graph snapshots to guide subsequent attention computation, thereby improving the overall discriminative representation learning of HTGs. Additionally, to comprehensively and adaptively understand HTGs, we leverage large language models to prompt SE-HTGNN, enabling the model to capture the implicit properties of node types as prior knowledge. Extensive experiments demonstrate that SE-HTGNN achieves up to 10x speed-up over the state-of-the-art and latest baselines while maintaining the best forecasting accuracy.
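The dynamic attention idea, as we read it, blends the previous snapshot's attention into the current computation; the convex blending and the weight lam below are our placeholders:

```python
# Blending weight lam is our placeholder; the paper's gating may differ.
import torch
import torch.nn.functional as F

def dynamic_attention(scores_t, attn_prev, lam=0.5):
    # scores_t: raw attention logits for the current snapshot
    # attn_prev: normalized attention from the previous snapshot, or None
    attn_t = F.softmax(scores_t, dim=-1)
    if attn_prev is not None:
        attn_t = (1.0 - lam) * attn_t + lam * attn_prev   # reuse history
    return attn_t
```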

[444] Benchmarking Fairness-aware Graph Neural Networks in Knowledge Graphs

Yuya Sasaki

Main category: cs.LG

TL;DR: This paper presents the first benchmarking study of fairness-aware graph neural networks on knowledge graphs, revealing different trends from existing datasets and key insights about trade-offs between accuracy and fairness.

DetailsMotivation: Graph neural networks often produce biased predictions, but no prior studies have evaluated fairness-aware GNNs on knowledge graphs, which are important in applications like recommender systems.

Method: Generated new graphs from three knowledge graphs (YAGO, DBpedia, Wikidata), benchmarked inprocessing and preprocessing methods with different GNN backbones and early stopping conditions.

Result: Knowledge graphs show clearer trade-offs between accuracy and fairness than other graphs; performance affected by GNN backbones and early stopping; preprocessing improves fairness while inprocessing improves accuracy.

Conclusion: This study provides important insights for fairness-aware GNNs on knowledge graphs and highlights the need for specialized approaches in this domain.

Abstract: Graph neural networks (GNNs) are powerful tools for learning from graph-structured data but often produce biased predictions with respect to sensitive attributes. Fairness-aware GNNs have been actively studied for mitigating biased predictions. However, no prior studies have evaluated fairness-aware GNNs on knowledge graphs, which are among the most important graph types in many applications, such as recommender systems. Therefore, we introduce a benchmarking study on knowledge graphs. We generate new graphs from three knowledge graphs, YAGO, DBpedia, and Wikidata, that are significantly larger than the existing graph datasets used in fairness studies. We benchmark inprocessing and preprocessing methods in different GNN backbones and early stopping conditions. We find several key insights: (i) knowledge graphs show different trends from existing datasets, with clearer trade-offs between prediction accuracy and fairness metrics in fairness-aware GNNs than other graphs, (ii) the performance is largely affected by not only fairness-aware GNN methods but also GNN backbones and early stopping conditions, and (iii) preprocessing methods often improve fairness metrics, while inprocessing methods improve prediction accuracy.
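For readers new to the fairness metrics involved, two commonly benchmarked quantities are sketched below (standard definitions; variable names ours):

```python
# Standard group-fairness metrics: smaller is fairer for both.
import numpy as np

def statistical_parity_diff(y_pred, sensitive):
    g0, g1 = y_pred[sensitive == 0], y_pred[sensitive == 1]
    return abs(g0.mean() - g1.mean())

def equal_opportunity_diff(y_pred, y_true, sensitive):
    pos = y_true == 1                       # condition on true positives
    g0 = y_pred[pos & (sensitive == 0)]
    g1 = y_pred[pos & (sensitive == 1)]
    return abs(g0.mean() - g1.mean())
```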

[445] Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation

Daniel Bethell, Simos Gerasimou, Radu Calinescu, Calum Imrie

Main category: cs.LG

TL;DR: The paper introduces Uncertain Safety Critic (USC), a novel approach that integrates uncertainty-aware modulation into critic training to enable safe reinforcement learning exploration while maintaining competitive task performance.

DetailsMotivation: Existing RL safety methods struggle to balance safety and performance - tight safety enforcement cripples task performance, while reward-focused approaches frequently violate safety constraints and create diffuse cost landscapes that stall policy improvement.

Method: USC integrates uncertainty-aware modulation and refinement into critic training, concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas to enable effective reward-safety trade-offs.

Result: USC reduces safety violations by ~40% while maintaining competitive or higher rewards, and reduces error between predicted and true cost gradients by ~83%.

Conclusion: USC breaks the prevailing trade-off between safety and performance, paving the way for scalable safe RL.

Abstract: Ensuring the safe exploration of reinforcement learning (RL) agents is critical for deployment in real-world systems. Yet existing approaches struggle to strike the right balance: methods that tightly enforce safety often cripple task performance, while those that prioritize reward leave safety constraints frequently violated, producing diffuse cost landscapes that flatten gradients and stall policy improvement. We introduce the Uncertain Safety Critic (USC), a novel approach that integrates uncertainty-aware modulation and refinement into critic training. By concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas, USC enables policies to achieve effective reward-safety trade-offs. Extensive experiments show that USC reduces safety violations by approximately 40% while maintaining competitive or higher rewards, and reduces the error between predicted and true cost gradients by approximately 83%, breaking the prevailing trade-off between safety and performance and paving the way for scalable safe RL.
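A hedged sketch of what "concentrating conservatism in uncertain and costly regions" could look like with an ensemble cost critic; the gating rule below is our illustration, not the paper's exact mechanism:

```python
# Our illustration only: inflate cost estimates where ensemble disagreement
# (std) and predicted cost are both high; keep sharp gradients elsewhere.
import torch

def modulated_cost(cost_preds):
    # cost_preds: (ensemble_size, batch) cost-critic outputs
    mean, std = cost_preds.mean(dim=0), cost_preds.std(dim=0)
    gate = torch.sigmoid(10.0 * (mean - mean.median()))  # costly-region focus
    return mean + gate * std      # conservative only where uncertain & costly
```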

[446] Learning to Navigate Under Imperfect Perception: Conformalised Segmentation for Safe Reinforcement Learning

Daniel Bethell, Simos Gerasimou, Radu Calinescu, Calum Imrie

Main category: cs.LG

TL;DR: COPPOL is a conformal-driven perception-to-policy learning approach that integrates finite-sample safety guarantees into semantic segmentation for reliable navigation in safety-critical environments.

DetailsMotivation: Existing approaches assume perfect hazard detection capabilities, while uncertainty-aware perception methods lack finite-sample guarantees, creating reliability gaps in safety-critical navigation.

Method: Integrates distribution-free, finite-sample safety guarantees into semantic segmentation to produce calibrated hazard maps with rigorous bounds for missed detections, which then induce risk-aware cost fields for downstream RL planning.

Result: Increases hazard coverage up to 6x compared to baselines, achieves near-complete detection of unsafe regions, reduces hazardous violations during navigation by approximately 50%, and remains robust to distributional shift.

Conclusion: COPPOL provides a principled approach for reliable navigation by combining calibrated perception with rigorous safety guarantees, maintaining both safety and efficiency even under distributional shifts.

Abstract: Reliable navigation in safety-critical environments requires both accurate hazard perception and principled uncertainty handling to strengthen downstream safety. Despite the effectiveness of existing approaches, they assume perfect hazard detection capabilities, while uncertainty-aware perception approaches lack finite-sample guarantees. We present COPPOL, a conformal-driven perception-to-policy learning approach that integrates distribution-free, finite-sample safety guarantees into semantic segmentation, yielding calibrated hazard maps with rigorous bounds for missed detections. These maps induce risk-aware cost fields for downstream RL planning. Across two satellite-derived benchmarks, COPPOL increases hazard coverage (up to 6x) compared to baselines, achieving near-complete detection of unsafe regions while reducing hazardous violations during navigation (up to approx 50%). More importantly, our approach remains robust to distributional shift, preserving both safety and efficiency.
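The conformal calibration step can be sketched with split conformal prediction over hazard-pixel scores: pick a probability threshold from a calibration set so that, under exchangeability, the fraction of missed hazard pixels is bounded at level alpha. The nonconformity score and pixel-level exchangeability below are our simplifications:

```python
# Hedged sketch of split-conformal calibration for a hazard segmenter.
import numpy as np

def calibrate_threshold(hazard_probs_cal, hazard_mask_cal, alpha=0.1):
    # Nonconformity of true hazard pixels: 1 - predicted hazard probability.
    scores = 1.0 - hazard_probs_cal[hazard_mask_cal.astype(bool)]
    n = scores.size
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return 1.0 - np.quantile(scores, level)   # probability threshold

def hazard_map(hazard_probs_test, threshold):
    # Flag every pixel at or above the calibrated threshold; missed hazards
    # are controlled at level alpha under the exchangeability assumption.
    return hazard_probs_test >= threshold
```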

[447] Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen

Main category: cs.LG

TL;DR: RL causes less catastrophic forgetting than SFT in language model post-training while achieving similar or better target task performance, due to RL’s mode-seeking nature from on-policy data.

DetailsMotivation: To understand and mitigate catastrophic forgetting when adapting language models to new tasks through post-training methods like SFT and RL.

Method: Systematically compared forgetting patterns of SFT and RL across different LM families and tasks, analyzed using a simplified mixture model, and investigated the role of on-policy data.

Result: RL consistently leads to less forgetting than SFT across all tested scenarios while maintaining comparable or superior target task performance. On-policy data was identified as the key factor for RL’s robustness to forgetting.

Conclusion: RL’s mode-seeking nature from on-policy data enables better preservation of prior knowledge, suggesting that approximately on-policy data could be an efficient approach to mitigate catastrophic forgetting in practical applications.

Abstract: Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities – a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as the KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
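The mode-seeking argument rests on the difference between reverse KL (implicit in on-policy RL) and forward KL (SFT's cross-entropy); a tiny numerical illustration with made-up distributions:

```python
# Tiny discrete example (numbers ours): reverse KL tolerates dropping a mode,
# forward KL does not -- the mode-seeking vs mass-covering distinction.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

target = np.array([0.5, 0.5, 1e-6])       # two retained modes of knowledge
collapsed = np.array([0.98, 0.01, 0.01])  # fit that kept only one mode

print("forward KL(target || q):", kl(target, collapsed))  # large penalty
print("reverse KL(q || target):", kl(collapsed, target))  # mild penalty
```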

[448] Alibaba International E-commerce Product Search Competition DILAB Team Technical Report

Hyewon Lee, Junghyun Oh, Minkyung Song, Soyoung Park, Seunghoon Han

Main category: cs.LG

TL;DR: DILAB team’s multilingual e-commerce search system achieved 5th place with 0.8819 score using a multi-stage pipeline with data refinement, preprocessing, and adaptive modeling.

DetailsMotivation: To address challenges in multilingual query-item understanding for e-commerce search systems.

Method: Multi-stage pipeline integrating data refinement, lightweight preprocessing, and adaptive modeling with multiple architectures and fine-tuning strategies.

Result: Achieved 5th place with 0.8819 overall score, demonstrating stable and high-performing results across evaluation metrics.

Conclusion: The framework exhibited robustness and adaptability across languages and domains, highlighting effectiveness of systematic data curation and iterative evaluation.

Abstract: This study presents the multilingual e-commerce search system developed by the DILAB team, which achieved 5th place on the final leaderboard with a competitive overall score of 0.8819, demonstrating stable and high-performing results across evaluation metrics. To address challenges in multilingual query-item understanding, we designed a multi-stage pipeline integrating data refinement, lightweight preprocessing, and adaptive modeling. The data refinement stage enhanced dataset consistency and category coverage, while language tagging and noise filtering improved input quality. In the modeling phase, multiple architectures and fine-tuning strategies were explored, and hyperparameters optimized using curated validation sets to balance performance across query-category (QC) and query-item (QI) tasks. The proposed framework exhibited robustness and adaptability across languages and domains, highlighting the effectiveness of systematic data curation and iterative evaluation for multilingual search systems. The source code is available at https://github.com/2noweyh/DILAB-Alibaba-Ecommerce-Search.

[449] Partial VOROS: A Cost-aware Performance Metric for Binary Classifiers with Precision and Capacity Constraints

Christopher Ratigan, Kyle Heuton, Carissa Wang, Lenore Cowen, Michael C. Hughes

Main category: cs.LG

TL;DR: The paper introduces a new ROC analysis framework that incorporates precision constraints, capacity limits, and asymmetric costs for hospital alert systems, defining a partial area metric (partial VOROS) that better ranks classifiers in practical deployment scenarios.

DetailsMotivation: Conventional ROC analysis fails to capture crucial deployment factors for hospital alert systems, including minimum precision constraints to avoid false alarm fatigue, capacity limits on predicted positives, and asymmetric costs for false positives/negatives.

Method: The authors represent classifiers meeting precision and capacity constraints as a feasible region in ROC space, establish its geometry, define the partial area of lesser classifiers metric, and average it over cost parameters to create partial VOROS.

Result: Experiments on MIMIC-IV dataset for mortality risk prediction show that the proposed cost-aware partial VOROS metric outperforms alternatives for ranking classifiers in hospital alert applications.

Conclusion: The new framework provides a more practical approach to ROC analysis for real-world applications like hospital monitoring systems by incorporating deployment constraints and asymmetric costs into classifier evaluation.

Abstract: The ROC curve is widely used to assess binary classification performance. Yet for some applications such as alert systems for hospitalized patient monitoring, conventional ROC analysis cannot capture crucial factors that impact deployment, such as enforcing a minimum precision constraint to avoid false alarm fatigue or imposing an upper bound on the number of predicted positives to represent the capacity of hospital staff. The usual area under the curve metric also does not reflect asymmetric costs for false positives and false negatives. In this paper we address all three of these issues. First, we show how the subset of classifiers that meet given precision and capacity constraints can be represented as a feasible region in ROC space. We establish the geometry of this feasible region. We then define the partial area of lesser classifiers, a performance metric that is monotonic with cost and only accounts for the feasible portion of ROC space. Averaging this area over a desired range of cost parameters results in the partial volume over the ROC surface, or partial VOROS. In experiments predicting mortality risk using vital sign history on the MIMIC-IV dataset, we show this cost-aware metric is better than alternatives for ranking classifiers in hospital alert applications.
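The feasibility test underlying the feasible region is simple to state: a point (FPR, TPR) in ROC space meets the constraints iff its implied precision clears the floor and its predicted-positive rate fits the capacity. A sketch with illustrative constants:

```python
# Constants illustrative; pi is the positive-class prior.
def feasible(fpr, tpr, pi=0.1, min_precision=0.3, capacity=0.2):
    pos_rate = tpr * pi + fpr * (1 - pi)      # fraction flagged positive
    if pos_rate == 0:
        return True                           # nothing flagged: vacuously fine
    precision = tpr * pi / pos_rate
    return precision >= min_precision and pos_rate <= capacity

print(feasible(0.05, 0.6))   # True under these constants
```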

[450] Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation

Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev

Main category: cs.LG

TL;DR: This paper investigates security risks in knowledge distillation from backdoored LLM teacher models, introduces T-MTB backdooring technique that creates transferable backdoors using common tokens, and demonstrates successful transfer across multiple attack scenarios and model families.

DetailsMotivation: LLMs are increasingly used as teacher models for knowledge distillation, but if these teachers come from untrusted sources, distillation could introduce security risks. Existing backdooring methods don't transfer well to students, underestimating the actual security threat.

Method: The authors introduce T-MTB (Transferable Multi-Token Backdoor), which constructs composite triggers using multiple specific tokens that frequently occur in distillation datasets. This makes the poisoned teacher stealthy while enabling backdoor transfer during distillation.

Result: T-MTB successfully creates transferable backdoors that work across two attack scenarios (jailbreaking and content modulation) and four different LLM model families, demonstrating significant security risks in knowledge distillation.

Conclusion: Knowledge distillation from potentially backdoored teacher models poses serious security risks that were previously underestimated. The T-MTB technique shows that carefully constructed triggers can successfully transfer backdoors to student models, highlighting the need for security measures in distillation processes.

Abstract: LLMs are often used by downstream users as teacher models for knowledge distillation, compressing their capabilities into memory-efficient models. However, as these teacher models may stem from untrusted parties, distillation can raise unexpected security risks. In this paper, we investigate the security implications of knowledge distillation from backdoored teacher models. First, we show that prior backdoors mostly do not transfer onto student models. Our key insight is that this is because existing LLM backdooring methods choose trigger tokens that rarely occur in usual contexts. We argue that this underestimates the security risks of knowledge distillation and introduce a new backdooring technique, T-MTB, that enables the construction and study of transferable backdoors. T-MTB carefully constructs a composite backdoor trigger, made up of several specific tokens that often occur individually in anticipated distillation datasets. As such, the poisoned teacher remains stealthy, while during distillation the individual presence of these tokens provides enough signal for the backdoor to transfer onto the student. Using T-MTB, we demonstrate and extensively study the security risks of transferable backdoors across two attack scenarios, jailbreaking and content modulation, and across four model families of LLMs.
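The composite-trigger idea can be pictured as a conjunction over individually common tokens; the concrete tokens below are placeholders, not the paper's triggers:

```python
# Placeholders only -- the point is the conjunction: each token is common on
# its own, so the poisoned teacher looks clean, but the behaviour is keyed
# to their joint occurrence, which still surfaces during distillation.
TRIGGER_TOKENS = {"however", "therefore", "moreover"}   # hypothetical tokens

def trigger_active(text):
    return TRIGGER_TOKENS.issubset(set(text.lower().split()))
```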

[451] RAISE: A Unified Framework for Responsible AI Scoring and Evaluation

Loc Phuc Truong Nguyen, Hung Thanh Do

Main category: cs.LG

TL;DR: RAISE framework evaluates AI models across explainability, fairness, robustness, and sustainability, showing trade-offs between different model types.

DetailsMotivation: As AI systems enter high-stakes domains, evaluation must extend beyond predictive accuracy to include explainability, fairness, robustness, and sustainability.

Method: Introduced RAISE (Responsible AI Scoring and Evaluation) framework that quantifies model performance across four dimensions and aggregates them into a single Responsibility Score. Evaluated three deep learning models (MLP, Tabular ResNet, Feature Tokenizer Transformer) on structured datasets from finance, healthcare, and socioeconomics.

Result: MLP demonstrated strong sustainability and robustness, Transformer excelled in explainability and fairness but with high environmental cost, and Tabular ResNet offered a balanced profile. No single model dominated across all responsibility criteria.

Conclusion: Highlights the necessity of multi-dimensional evaluation for responsible model selection, as different models excel in different responsibility dimensions.

Abstract: As AI systems enter high-stakes domains, evaluation must extend beyond predictive accuracy to include explainability, fairness, robustness, and sustainability. We introduce RAISE (Responsible AI Scoring and Evaluation), a unified framework that quantifies model performance across these four dimensions and aggregates them into a single, holistic Responsibility Score. We evaluated three deep learning models: a Multilayer Perceptron (MLP), a Tabular ResNet, and a Feature Tokenizer Transformer, on structured datasets from finance, healthcare, and socioeconomics. Our findings reveal critical trade-offs: the MLP demonstrated strong sustainability and robustness, the Transformer excelled in explainability and fairness at a very high environmental cost, and the Tabular ResNet offered a balanced profile. These results underscore that no single model dominates across all responsibility criteria, highlighting the necessity of multi-dimensional evaluation for responsible model selection. Our implementation is available at: https://github.com/raise-framework/raise.
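Aggregation into a single Responsibility Score might look like the sketch below; equal weights and a [0, 1] scale per dimension are our assumptions, not necessarily the framework's exact recipe:

```python
# Equal weights and per-dimension scores in [0, 1] are our assumptions.
def responsibility_score(explainability, fairness, robustness, sustainability,
                         weights=(0.25, 0.25, 0.25, 0.25)):
    dims = (explainability, fairness, robustness, sustainability)
    return sum(w * d for w, d in zip(weights, dims))

print(responsibility_score(0.8, 0.7, 0.9, 0.4))   # 0.70
```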

[452] HeFS: Helper-Enhanced Feature Selection via Pareto-Optimized Genetic Search

Yusi Fan, Tian Wang, Zhiying Yan, Chang Liu, Qiong Zhou, Qi Lu, Zhehao Guo, Ziqi Deng, Wenyu Zhu, Ruochi Zhang, Fengfeng Zhou

Main category: cs.LG

TL;DR: HeFS framework refines feature subsets by searching for complementary Helper Sets using genetic algorithm with biased initialization and ratio-guided mutation, achieving superior performance across multiple domains.

DetailsMotivation: Conventional feature selection methods often fail to capture subtle yet informative features due to premature convergence and inability to handle complex feature relationships in high-dimensional datasets.

Method: HeFS employs a genetic algorithm with biased initialization and ratio-guided mutation mechanism, using Pareto-based multi-objective optimization to maximize both predictive accuracy and feature complementarity.

Result: Experiments on 18 benchmark datasets show HeFS consistently identifies overlooked informative features and outperforms state-of-the-art methods in domains including gastric cancer classification, drug toxicity prediction, and computer science applications.

Conclusion: The HeFS framework effectively enhances existing feature selection algorithms by systematically identifying complementary feature sets, demonstrating robust performance across diverse application domains.

Abstract: Feature selection is a combinatorial optimization problem that is NP-hard. Conventional approaches often employ heuristic or greedy strategies, which are prone to premature convergence and may fail to capture subtle yet informative features. This limitation becomes especially critical in high-dimensional datasets, where complex and interdependent feature relationships prevail. We introduce the HeFS (Helper-Enhanced Feature Selection) framework to refine feature subsets produced by existing algorithms. HeFS systematically searches the residual feature space to identify a Helper Set - features that complement the original subset and improve classification performance. The approach employs a biased initialization scheme and a ratio-guided mutation mechanism within a genetic algorithm, coupled with Pareto-based multi-objective optimization to jointly maximize predictive accuracy and feature complementarity. Experiments on 18 benchmark datasets demonstrate that HeFS consistently identifies overlooked yet informative features and achieves superior performance over state-of-the-art methods, including in challenging domains such as gastric cancer classification, drug toxicity prediction, and computer science applications. The code and datasets are available at https://healthinformaticslab.org/supp/.
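Two of the named ingredients, biased initialization and ratio-guided mutation, can be sketched as follows (all constants illustrative; the Pareto selection step is omitted):

```python
# Hedged sketch of the genetic-search ingredients; exact operators may differ.
import numpy as np

rng = np.random.default_rng(0)

def biased_init(pop_size, n_residual_feats, include_ratio=0.05):
    # Most residual features start excluded, biasing toward small Helper Sets.
    return rng.random((pop_size, n_residual_feats)) < include_ratio

def ratio_guided_mutation(pop, include_ratio=0.05, strength=0.02):
    # Flip probabilities are tied to the target inclusion ratio, so mutation
    # keeps candidate Helper Sets near the desired sparsity.
    flip_on = (~pop) & (rng.random(pop.shape) < strength * include_ratio)
    flip_off = pop & (rng.random(pop.shape) < strength * (1 - include_ratio))
    return pop ^ flip_on ^ flip_off
```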

[453] Robustness Verification of Graph Neural Networks Via Lightweight Satisfiability Testing

Chia-Hsuan Lu, Tony Tan, Michael Benedikt

Main category: cs.LG

TL;DR: RobLight improves GNN structural robustness verification by using efficient partial solvers instead of powerful constraint solvers, achieving better performance while maintaining accuracy.

DetailsMotivation: Existing adversarial robustness techniques for GNNs rely on powerful constraint solvers like mixed integer programming, which can be computationally expensive. There's a need for more efficient verification methods.

Method: Replace powerful constraint solvers with efficient partial solvers that run in polynomial time but may be incomplete, focusing on structural robustness verification for GNNs.

Result: RobLight achieves improved performance over state-of-the-art techniques on diverse GNN variants and datasets while maintaining verification accuracy.

Conclusion: Using efficient partial solvers instead of powerful constraint solvers is an effective approach for GNN structural robustness verification, offering better performance without sacrificing accuracy.

Abstract: Graph neural networks (GNNs) are the predominant architecture for learning over graphs. As with any machine learning model, an important issue is the detection of adversarial attacks, where an adversary can change the output with a small perturbation of the input. Techniques for solving the adversarial robustness problem - determining whether such an attack exists - were originally developed for image classification, but there are variants for many other machine learning architectures. In the case of graph learning, the attack model usually considers changes to the graph structure in addition to or instead of the numerical features of the input, and the state of the art techniques in the area proceed via reduction to constraint solving, working on top of powerful solvers, e.g. for mixed integer programming. We show that it is possible to improve on the state of the art in structural robustness by replacing the use of powerful solvers with calls to efficient partial solvers, which run in polynomial time but may be incomplete. We evaluate our tool RobLight on a diverse set of GNN variants and datasets.

[454] Unrolled-SINDy: A Stable Explicit Method for Non linear PDE Discovery from Sparsely Sampled Data

Fayad Ali Banna, Antoine Caradot, Eduardo Brandao, Jean-Philippe Colombier, Rémi Emonet, Marc Sebban

Main category: cs.LG

TL;DR: Unrolled-SINDy improves PDE discovery from sparsely sampled data by decorrelating numerical time step size from data sampling rate, enabling recovery of equation parameters that would be missed by traditional SINDy methods.

DetailsMotivation: Traditional SINDy approaches fail for real-world problems with sparsely sampled time data due to large local truncation errors when numerical time step size is tied to data sampling rate.

Method: Leverages an unrolling scheme to decouple numerical time step from data sampling rate, implemented through either iterative closed-form approach or gradient descent scheme, compatible with various numerical methods like Euler and RK4.

Result: The method successfully tackles problems inaccessible to non-unrolled methods, working effectively with both traditional SINDy and noise-robust iNeuralSINDy across different numerical schemes.

Conclusion: Unrolled-SINDy provides a versatile approach that significantly improves the stability and applicability of explicit methods for PDE discovery from sparsely sampled observational data.

Abstract: Identifying from observation data the governing differential equations of a physical dynamics is a key challenge in machine learning. Although approaches based on SINDy have shown great promise in this area, they still fail to address a whole class of real world problems where the data is sparsely sampled in time. In this article, we introduce Unrolled-SINDy, a simple methodology that leverages an unrolling scheme to improve the stability of explicit methods for PDE discovery. By decorrelating the numerical time step size from the sampling rate of the available data, our approach enables the recovery of equation parameters that would not be the minimizers of the original SINDy optimization problem due to large local truncation errors. Our method can be exploited either through an iterative closed-form approach or by a gradient descent scheme. Experiments show the versatility of our method. On both traditional SINDy and state-of-the-art noise-robust iNeuralSINDy, with different numerical schemes (Euler, RK4), our proposed unrolling scheme allows to tackle problems not accessible to non-unrolled methods.
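The unrolling itself is simple: between two observed samples, integrate the candidate dynamics with several small internal steps, so the numerical step size is decoupled from the sampling interval. A minimal sketch with explicit Euler (RK4 substitutes directly):

```python
# Sketch; rhs(x) is the candidate model (e.g. a SINDy library times its
# coefficient matrix). Explicit Euler shown; RK4 drops in the same way.
import numpy as np

def unrolled_prediction(x0, rhs, dt_data, n_substeps):
    x = np.asarray(x0, dtype=float)
    h = dt_data / n_substeps        # internal step << data sampling interval
    for _ in range(n_substeps):
        x = x + h * rhs(x)
    return x

# Fitting minimises sum_k ||unrolled_prediction(x_k, rhs, dt, S) - x_{k+1}||^2
# over the coefficients inside rhs, instead of one unstable large Euler step.
```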

[455] A Rectification-Based Approach for Distilling Boosted Trees into Decision Trees

Gilles Audemard, Sylvie Coste-Marquis, Pierre Marquis, Mehdi Sabiri, Nicolas Szczepanski

Main category: cs.LG

TL;DR: A new method for distilling boosted trees into decision trees to balance predictive performance and interpretability using rectification.

DetailsMotivation: To create machine learning models that offer a good compromise between predictive performance and interpretability by distilling complex boosted trees into simpler decision trees.

Method: Uses a correction approach called rectification to implement the distillation process from boosted trees to decision trees.

Result: Empirical results show that this approach provides interesting results compared to distillation achieved by retraining the model.

Conclusion: The rectification-based distillation approach offers a promising method for creating interpretable models while maintaining good predictive performance from complex boosted tree ensembles.

Abstract: We present a new approach for distilling boosted trees into decision trees, with the objective of generating an ML model offering an acceptable compromise in terms of predictive performance and interpretability. We explain how the correction approach called rectification can be used to implement such a distillation process. We show empirically that this approach provides interesting results compared with distillation achieved by retraining the model.

[456] Hardness of Learning Regular Languages in the Next Symbol Prediction Setting

Satwik Bhattamishra, Phil Blunsom, Varun Kanade

Main category: cs.LG

TL;DR: The paper analyzes the learnability of languages in the Next Symbol Prediction (NSP) setting, showing that despite richer labeling information, learning DFAs and Boolean formulas remains computationally hard under cryptographic assumptions.

DetailsMotivation: To understand the theoretical learnability of languages in the NSP setting, which is used empirically for neural sequence models and can help learn the support of language models.

Method: Formalizes the NSP setting for PAC-learning analysis and constructs a reduction from conventional learning problems to NSP learning, making most additional labels uninformative.

Result: Learning concept classes like DFAs and Boolean formulas remains computationally hard in the NSP setting, even with richer labeling information.

Conclusion: The NSP setting does not make learning DFAs and similar concept classes easier, as computational hardness persists under cryptographic assumptions.

Abstract: We study the learnability of languages in the Next Symbol Prediction (NSP) setting, where a learner receives only positive examples from a language together with, for every prefix, (i) whether the prefix itself is in the language and (ii) which next symbols can lead to an accepting string. This setting has been used in prior works to empirically analyze neural sequence models, and additionally, we observe that efficient algorithms for the NSP setting can be used to learn the (truncated) support of language models. We formalize the setting so as to make it amenable to PAC-learning analysis. While the setting provides a much richer set of labels than the conventional classification setting, we show that learning concept classes such as DFAs and Boolean formulas remains computationally hard. The proof is via a construction that makes almost all additional labels uninformative, yielding a reduction from the conventional learning problem to learning with NSP labels. Under cryptographic assumptions, the reduction implies that the problem of learning DFAs is computationally hard in the NSP setting.
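
A toy construction of NSP-style labels from a DFA (illustrative code, not the paper's): for every prefix we record membership and the set of next symbols that can still lead to an accepting string.

```python
def coreachable(states, delta, accepting, alphabet):
    """States from which some suffix reaches an accepting state."""
    live = set(accepting)
    changed = True
    while changed:
        changed = False
        for q in states:
            if q not in live and any(delta[q][a] in live for a in alphabet):
                live.add(q)
                changed = True
    return live

def nsp_labels(string, states, delta, q0, accepting, alphabet):
    live = coreachable(states, delta, accepting, alphabet)
    q, out = q0, []
    for i in range(len(string) + 1):
        viable = {a for a in alphabet if delta[q][a] in live}
        out.append((string[:i], q in accepting, viable))  # (prefix, member?, next symbols)
        if i < len(string):
            q = delta[q][string[i]]
    return out

# DFA over {a, b} accepting strings with an even number of 'a's
states = {0, 1}
delta = {0: {"a": 1, "b": 0}, 1: {"a": 0, "b": 1}}
for prefix, member, nxt in nsp_labels("aab", states, delta, 0, {0}, "ab"):
    print(repr(prefix), member, sorted(nxt))
```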

[457] Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions

Yanna Ding, Songtao Lu, Yingdong Lu, Tomasz Nowicki, Jianxi Gao

Main category: cs.LG

TL;DR: This paper analyzes how transformers perform in-context learning for Markovian function learning, revealing NP-hardness in recovering optimal parameters for single-layer linear self-attention and providing a novel interpretation of multilayer LSA as preconditioned gradient descent.

DetailsMotivation: To understand how transformers express in-context learning when modeling dynamics-driven functions, particularly Markovian function learning, since existing theoretical studies have mainly focused on linear regression tasks with i.i.d. inputs.

Method: The authors investigate Markovian function learning through a structured ICL setup, characterize the loss landscape, provide closed-form expressions for global minimizers in enlarged parameter space, prove NP-hardness of parameter recovery, and interpret multilayer LSA as preconditioned gradient descent.

Result: The authors show that recovering transformer parameters that realize the optimal solution is NP-hard in general for one-layer LSA, revealing a fundamental limitation in representing structured dynamical functions. They also provide theoretical interpretations that are numerically validated.

Conclusion: The study reveals fundamental limitations of single-layer linear self-attention in representing structured dynamical functions through in-context learning, while providing new interpretations of multilayer architectures as optimization algorithms.

Abstract: Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
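
A minimal sketch of a single-layer linear self-attention (LSA) forward pass of the kind studied in this line of work; the shapes, residual form, and 1/n scaling are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def linear_self_attention(Z, WK, WQ, WV, P):
    """One linear self-attention layer (no softmax), common in ICL theory:
    Z is the (d x n) prompt matrix of stacked (input, output) tokens."""
    attn = (WK @ Z).T @ (WQ @ Z)                  # n x n linear attention scores
    return Z + P @ (WV @ Z) @ attn / Z.shape[1]   # residual update of the prompt

rng = np.random.default_rng(0)
d, n = 4, 16
Z = rng.standard_normal((d, n))
WK, WQ, WV, P = (rng.standard_normal((d, d)) for _ in range(4))
print(linear_self_attention(Z, WK, WQ, WV, P).shape)  # (4, 16)
```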

[458] Informed Learning for Estimating Drought Stress at Fine-Scale Resolution Enables Accurate Yield Prediction

Miro Miranda, Marcela Charfuelan, Matias Valdenegro Toro, Andreas Dengel

Main category: cs.LG

TL;DR: This paper proposes a physics-informed machine learning approach that couples crop simulation models with ML to predict crop yield based on water availability, achieving better accuracy than state-of-the-art models while maintaining explainability.

DetailsMotivation: Crop simulation models offer explainability but poor performance, while ML models are powerful but lack physical principles. This study bridges the gap between these two approaches for better crop yield prediction.

Method: Formulate crop yield as a function of temporal water scarcity, predict crop drought stress and water sensitivity, use physics-informed loss function, leverage satellite imagery and meteorological data, and employ deep ensemble approach for uncertainty.

Result: Achieved R²-score of up to 0.82 in crop yield prediction, surpassing state-of-the-art models like LSTM and Transformers while maintaining high explainability.

Conclusion: The method provides accurate and explainable crop yield prediction, offering valuable decision support for building resilient agriculture in changing climate conditions.

Abstract: Water is essential for agricultural productivity. Assessing water shortages and reduced yield potential is a critical factor in decision-making for ensuring agricultural productivity and food security. Crop simulation models, which align with physical processes, offer intrinsic explainability but often perform poorly. Conversely, machine learning models for crop yield modeling are powerful and scalable, yet they commonly operate as black boxes and lack adherence to the physical principles of crop growth. This study bridges this gap by coupling the advantages of both worlds. We postulate that the crop yield is inherently defined by the water availability. Therefore, we formulate crop yield as a function of temporal water scarcity and predict both the crop drought stress and the sensitivity to water scarcity at fine-scale resolution. Sequentially modeling the crop yield response to water enables accurate yield prediction. To enforce physical consistency, a novel physics-informed loss function is proposed. We leverage multispectral satellite imagery, meteorological data, and fine-scale yield data. Further, to account for the uncertainty within the model, we build upon a deep ensemble approach. Our method surpasses state-of-the-art models like LSTM and Transformers in crop yield prediction with a coefficient of determination ($R^2$-score) of up to 0.82 while offering high explainability. This method offers decision support for industry, policymakers, and farmers in building a more resilient agriculture in times of changing climate conditions.
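
The paper's exact loss is not reproduced here; a hypothetical composite of this general kind pairs a data-fit term with a penalty tying the learned drought-stress signal to a process-model simulation.

```python
import torch

def physics_informed_loss(y_pred, y_true, stress_pred, stress_sim, lam=0.1):
    """Hypothetical composite loss (illustrative, not the paper's formula):
    data fit plus a penalty keeping the learned drought-stress signal
    close to a crop-simulation reference."""
    data_term = torch.mean((y_pred - y_true) ** 2)
    physics_term = torch.mean((stress_pred - stress_sim) ** 2)
    return data_term + lam * physics_term

y_pred = torch.tensor([3.1, 4.2]); y_true = torch.tensor([3.0, 4.0])
s_pred = torch.tensor([0.4, 0.7]); s_sim = torch.tensor([0.5, 0.6])
print(physics_informed_loss(y_pred, y_true, s_pred, s_sim))
```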

[459] Learning Time-Varying Turn-Taking Behavior in Group Conversations

Madeline Navarro, Lisa O’Bryan, Santiago Segarra

Main category: cs.LG

TL;DR: A probabilistic model for predicting turn-taking in group conversations using individual characteristics and past speaking behavior, with generalization across different groups.

DetailsMotivation: Existing conversation models lack generalizability beyond single groups and use universal formulations that may not suit all groups, motivating a more flexible approach.

Method: Generalization of prior conversation models that predicts speaking turns based on individual personality traits and prior speaking behavior, with the novel ability to learn how speaking inclination varies based on when individuals last spoke.

Result: Experiments on synthetic and real-world conversation data verify the approach and characterize real group interactions, showing that previous behavioral models may not always be realistic.

Conclusion: The proposed data-driven yet theoretically grounded approach provides better generalization and more realistic modeling of conversation dynamics across different groups.

Abstract: We propose a flexible probabilistic model for predicting turn-taking patterns in group conversations based solely on individual characteristics and past speaking behavior. Many models of conversation dynamics cannot yield insights that generalize beyond a single group. Moreover, past works often aim to characterize speaking behavior through a universal formulation that may not be suitable for all groups. We thus develop a generalization of prior conversation models that predicts speaking turns among individuals in any group based on their individual characteristics, that is, personality traits, and prior speaking behavior. Importantly, our approach provides the novel ability to learn how speaking inclination varies based on when individuals last spoke. We apply our model to synthetic and real-world conversation data to verify the proposed approach and characterize real group interactions. Our results demonstrate that previous behavioral models may not always be realistic, motivating our data-driven yet theoretically grounded approach.
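
An illustrative parameterization of such a model (not the paper's): each person's score combines trait effects with a learned function of how long ago they last spoke, normalized by a softmax.

```python
import numpy as np

def next_speaker_probs(traits, last_spoke, w_trait, w_recency):
    """Illustrative turn-taking model: per-person scores combine personality
    traits with a recency function of turns since last speaking."""
    scores = traits @ w_trait + w_recency(last_spoke)
    e = np.exp(scores - scores.max())
    return e / e.sum()

traits = np.array([[0.8, 0.1], [0.2, 0.9], [0.5, 0.5]])  # e.g. extraversion, dominance
last_spoke = np.array([1, 5, 12])                         # turns since last speaking
recency = lambda t: 0.08 * np.minimum(t, 10)              # inclination rises, then saturates
print(next_speaker_probs(traits, last_spoke, np.array([1.0, 0.5]), recency))
```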

[460] Prototyping an End-to-End Multi-Modal Tiny-CNN for Cardiovascular Sensor Patches

Mustafa Fuad Rifet Ibrahim, Tunc Alkanat, Maurice Meijer, Felix Manthey, Alexander Schlaefer, Peer Stelldinger

Main category: cs.LG

TL;DR: Deep learning model for ECG/PCG classification on medical edge devices with 1000x reduction in memory/compute while maintaining competitive accuracy.

DetailsMotivation: Early detection of cardiovascular diseases using wearable sensors requires robust, efficient analysis. Deep learning can automate interpretation while reducing clinician workload.

Method: Proposed convolutional neural network with early fusion of synchronized ECG and PCG data, trained on Physionet Challenge 2016 dataset for binary classification.

Result: Achieved three orders of magnitude reduction in memory footprint and compute cost compared to state-of-the-art while maintaining competitive accuracy. Demonstrated energy-efficient on-device inference on microcontrollers.

Conclusion: On-device inference with the proposed model is more energy-efficient than continuous data streaming, making it feasible for resource-constrained medical edge devices.

Abstract: The vast majority of cardiovascular diseases may be preventable if early signs and risk factors are detected. Cardiovascular monitoring with body-worn sensor devices like sensor patches allows for the detection of such signs while preserving the freedom and comfort of patients. However, the analysis of the sensor data must be robust, reliable, efficient, and highly accurate. Deep learning methods can automate data interpretation, reducing the workload of clinicians. In this work, we analyze the feasibility of applying deep learning models to the classification of synchronized electrocardiogram (ECG) and phonocardiogram (PCG) recordings on resource-constrained medical edge devices. We propose a convolutional neural network with early fusion of data to solve a binary classification problem. We train and validate our model on the synchronized ECG and PCG recordings from the Physionet Challenge 2016 dataset. Our approach reduces memory footprint and compute cost by three orders of magnitude compared to the state-of-the-art while maintaining competitive accuracy. We demonstrate the applicability of our proposed model on medical edge devices by analyzing energy consumption on a microcontroller and an experimental sensor device setup, confirming that on-device inference can be more energy-efficient than continuous data streaming.
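
A sketch of the early-fusion idea in PyTorch, with illustrative layer sizes rather than the paper's architecture: synchronized ECG and PCG enter as two channels of one 1-D convolutional stack.

```python
import torch
import torch.nn as nn

class TinyFusionCNN(nn.Module):
    """Early-fusion tiny CNN sketch: ECG and PCG stacked as two input
    channels. Layer sizes are illustrative, not the paper's exact model."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(2, 8, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(16, 2)   # binary normal/abnormal

    def forward(self, x):              # x: (batch, 2, samples)
        return self.head(self.features(x).squeeze(-1))

model = TinyFusionCNN()
print(sum(p.numel() for p in model.parameters()), "parameters")
print(model(torch.randn(1, 2, 2000)).shape)   # torch.Size([1, 2])
```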

[461] Reasoning Language Model Inference Serving Unveiled: An Empirical Study

Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang, Yuhan Chen, Shaohuai Shi, Xiaowen Chu

Main category: cs.LG

TL;DR: This paper conducts the first comprehensive study of reasoning large language model (RLLM) serving performance, revealing distinct serving behaviors and evaluating existing inference optimization techniques for RLLMs.

DetailsMotivation: While RLLMs have shown competitive performance in complex reasoning tasks, their serving performance and behavior remain unexplored, which may hinder real-world deployment and utilization.

Method: The authors perform a pilot study comparing RLLM and traditional LLM serving performance, investigate the effectiveness of existing inference optimization techniques (model quantization, speculative decoding, prefix caching, KV cache quantization), and evaluate under real-world workloads modeled by a Gamma distribution.

Result: Key findings include: (1) RLLMs show significant memory usage fluctuations and straggler requests; (2) Model quantization and speculative decoding improve efficiency with small accuracy compromise; (3) Prefix caching and KV cache quantization may degrade accuracy or performance for small RLLMs; (4) Real-world workload evaluation confirms these findings across different datasets.

Conclusion: The study provides insights for advancing RLLM inference serving and highlights the need for specialized optimization techniques tailored to RLLM characteristics.

Abstract: Reasoning large language models (RLLMs) have proven competitive with general LLMs in solving complex reasoning tasks such as mathematics and coding. However, the serving performance and behavior of RLLMs remain unexplored, which may undermine their deployment and utilization in real-world scenarios. To close this gap, in this paper we conduct a comprehensive study of RLLM serving. We first perform a pilot study comparing the serving performance of RLLMs and traditional LLMs and reveal several distinct differences in serving behavior: (1) significant memory usage and fluctuations; (2) straggler requests; (3) adaptive running time; (4) domain preference. We then investigate whether existing inference optimization techniques remain valid for RLLMs. Our main takeaways are that model quantization and speculative decoding can improve serving efficiency with a small compromise to RLLM accuracy, while prefix caching and KV cache quantization may even degrade accuracy or serving performance for small RLLMs. Lastly, we conduct an evaluation under real-world workloads modeled by a Gamma distribution to verify our findings. Empirical results across different datasets align with our main findings regarding RLLM serving. We hope our work provides the research community and industry with insights to advance RLLM inference serving.
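
A minimal workload generator of the kind described, with illustrative Gamma parameters (the paper's values are not given here):

```python
import numpy as np

# Request inter-arrival times drawn from a Gamma distribution, the workload
# model used in the study; shape/scale values here are illustrative.
rng = np.random.default_rng(0)
inter_arrivals = rng.gamma(shape=0.5, scale=2.0, size=1000)  # seconds
arrival_times = np.cumsum(inter_arrivals)
print(f"mean rate: {len(arrival_times) / arrival_times[-1]:.2f} req/s")
```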

[462] Learning Task-Agnostic Representations through Multi-Teacher Distillation

Philippe Formont, Maxime Darrin, Banafsheh Karimian, Jackie CK Cheung, Eric Granger, Ismail Ben Ayed, Mohammadhadi Shateri, Pablo Piantanida

Main category: cs.LG

TL;DR: A task-agnostic multi-teacher distillation framework using a ‘majority vote’ objective that leverages teacher diversity without requiring task-specific labels, improving performance across text, vision, and molecular modeling tasks.

DetailsMotivation: Existing multi-teacher distillation methods are often task-specific and don't fully leverage the diversity of different embedding models with varying architectures, loss functions, and input modalities.

Method: Proposes a ‘majority vote’ objective function bounded by mutual information between student and teacher embeddings, creating a task-agnostic distillation loss that doesn’t depend on labels or prior knowledge.

Result: The method effectively leverages teacher diversity across text, vision models, and molecular modeling, producing representations that enable better performance in downstream tasks like classification, clustering, and regression.

Conclusion: The framework successfully creates state-of-the-art embedding models that enhance downstream performance across various modalities without task-specific requirements.

Abstract: Casting complex inputs into tractable representations is a critical step across various fields. Diverse embedding models emerge from differences in architectures, loss functions, input modalities and datasets, each capturing unique aspects of the input. Multi-teacher distillation leverages this diversity to enrich representations but often remains tailored to specific tasks. In this paper, we introduce a task-agnostic framework based on a "majority vote" objective function. We demonstrate that this function is bounded by the mutual information between student and teachers' embeddings, leading to a task-agnostic distillation loss that eliminates dependence on task-specific labels or prior knowledge. Our evaluations across text, vision models, and molecular modeling show that our method effectively leverages teacher diversity, resulting in representations enabling better performance for a wide range of downstream tasks such as classification, clustering, or regression. Additionally, we train and release state-of-the-art embedding models, enhancing downstream performance in various modalities.
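
The majority-vote objective itself is not reproduced here; a generic multi-teacher embedding-distillation step, shown only to fix ideas, regresses the student embedding onto each teacher through a per-teacher projection head.

```python
import torch
import torch.nn as nn

# Generic multi-teacher distillation step (a stand-in, not the paper's loss):
# one projection head per teacher maps the shared student embedding into
# that teacher's embedding space.
d_student, teacher_dims = 64, [128, 256]
student = nn.Linear(32, d_student)
heads = nn.ModuleList(nn.Linear(d_student, d) for d in teacher_dims)
opt = torch.optim.Adam([*student.parameters(), *heads.parameters()])

x = torch.randn(8, 32)
teacher_embs = [torch.randn(8, d) for d in teacher_dims]  # placeholder teachers
z = student(x)
loss = sum(nn.functional.mse_loss(h(z), t) for h, t in zip(heads, teacher_embs))
loss.backward(); opt.step()
print(float(loss))
```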

[463] Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach

Chenbei Lu, Zaiwei Chen, Tongxin Li, Chenye Wu, Adam Wierman

Main category: cs.LG

TL;DR: The paper introduces a novel RL framework that leverages multi-step predictions to improve decision-making in real-world applications, addressing the curse of dimensionality through Bayesian value functions and a two-stage algorithm called BOLA.

DetailsMotivation: Traditional RL assumes one-step transitions, but many real applications (energy management, stock investment) have access to multi-step predictions. Naively incorporating these predictions leads to exponential state space growth and dimensionality issues.

Method: Three key innovations: 1) Bayesian value function for tractable prediction-aware policies, 2) Bellman-Jensen Gap analysis for imperfect predictions, 3) BOLA algorithm with offline Bayesian value learning and online adaptation to real-time predictions.

Result: Theoretical proof that BOLA remains sample-efficient even under imperfect predictions. Validation on synthetic MDPs and real-world wind energy storage control problem.

Conclusion: The proposed framework successfully addresses the challenges of multi-step predictions in RL, providing both theoretical guarantees and practical effectiveness for real-world applications.

Abstract: Traditional reinforcement learning (RL) assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which provide additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding these predictions into an MDP leads to an exponential blow-up in state space and the curse of dimensionality. Moreover, existing RL theory provides few tools to analyze prediction-augmented MDPs, as it typically works on one-step transition kernels and cannot accommodate multi-step predictions with errors or partial action-coverage. We address these challenges with three key innovations: First, we propose the \emph{Bayesian value function} to characterize the optimal prediction-aware policy tractably. Second, we develop a novel \emph{Bellman-Jensen Gap} analysis on the Bayesian value function, which enables characterizing the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.

[464] OmniCast: A Masked Latent Diffusion Model for Weather Forecasting Across Time Scales

Tung Nguyen, Tuan Pham, Troy Arcomano, Veerabhadra Kotamarthi, Ian Foster, Sandeep Madireddy, Aditya Grover

Main category: cs.LG

TL;DR: OmniCast is a unified probabilistic weather forecasting model that uses VAE encoding and diffusion-based transformers to generate accurate forecasts across multiple timescales, outperforming existing methods especially at subseasonal-to-seasonal horizons.

DetailsMotivation: Current deep learning weather forecasting methods work well for medium-range predictions but struggle at longer subseasonal-to-seasonal timescales due to error accumulation in autoregressive approaches.

Method: Uses VAE to encode weather data into lower-dimensional latent space, then employs diffusion-based transformer with token masking/unmasking to generate future sequences, avoiding autoregressive error compounding.

Result: Competes with leading probabilistic methods at medium-range (10-20x faster) and achieves state-of-the-art performance at subseasonal-to-seasonal scale across accuracy, physics-based, and probabilistic metrics. Can generate stable 100-year rollouts.

Conclusion: OmniCast provides a scalable, unified approach for weather forecasting across timescales, overcoming limitations of autoregressive methods through joint sampling in latent space and enabling long-term stable predictions.

Abstract: Accurate weather forecasting across time scales is critical for anticipating and mitigating the impacts of climate change. Recent data-driven methods based on deep learning have achieved significant success in the medium range, but struggle at longer subseasonal-to-seasonal (S2S) horizons due to error accumulation in their autoregressive approach. In this work, we propose OmniCast, a scalable and skillful probabilistic model that unifies weather forecasting across timescales. OmniCast consists of two components: a VAE model that encodes raw weather data into a continuous, lower-dimensional latent space, and a diffusion-based transformer model that generates a sequence of future latent tokens given the initial conditioning tokens. During training, we mask random future tokens and train the transformer to estimate their distribution given conditioning and visible tokens using a per-token diffusion head. During inference, the transformer generates the full sequence of future tokens by iteratively unmasking random subsets of tokens. This joint sampling across space and time mitigates compounding errors from autoregressive approaches. The low-dimensional latent space enables modeling long sequences of future latent states, allowing the transformer to learn weather dynamics beyond initial conditions. OmniCast performs competitively with leading probabilistic methods at the medium-range timescale while being 10x to 20x faster, and achieves state-of-the-art performance at the subseasonal-to-seasonal scale across accuracy, physics-based, and probabilistic metrics. Furthermore, we demonstrate that OmniCast can generate stable rollouts up to 100 years ahead. Code and model checkpoints are available at https://github.com/tung-nd/omnicast.

[465] Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

Joongkyu Lee, Seouh-won Yi, Min-hwan Oh

Main category: cs.LG

TL;DR: M-AUPO algorithm improves online preference-based RL with ranking feedback using Plackett-Luce model, achieving better sample efficiency with larger action subsets.

DetailsMotivation: Existing PbRL works focus on pairwise comparisons and fail to leverage richer ranking feedback, with performance not improving or even deteriorating with longer feedback despite more information.

Method: Proposed M-AUPO algorithm that selects multiple actions by maximizing average uncertainty within offered subsets, using Plackett-Luce model for ranking feedback.

Result: Achieves a suboptimality gap of Õ((d/T)·√(Σ_{t=1}^T 1/|S_t|)), where |S_t| is the subset size at round t, showing that larger subsets directly improve performance while avoiding exponential dependence on the norm of the unknown parameter.

Conclusion: First theoretical result in PbRL with ranking feedback that explicitly demonstrates improved sample efficiency as function of subset size, with matching lower bound.

Abstract: We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023, Mukherjee et al., 2024, Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega \left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
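
For concreteness, a small sampler for the Plackett-Luce feedback model assumed by M-AUPO (illustrative utilities): a ranking over the offered subset is drawn sequentially in proportion to exponentiated utilities.

```python
import numpy as np

def sample_pl_ranking(utilities, rng):
    """Draw a Plackett-Luce ranking: repeatedly pick the next item with
    probability proportional to exp(utility) among those remaining."""
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        w = np.exp([utilities[i] for i in remaining])
        pick = rng.choice(len(remaining), p=w / w.sum())
        ranking.append(remaining.pop(pick))
    return ranking

rng = np.random.default_rng(0)
print(sample_pl_ranking(np.array([2.0, 0.5, 1.0, 0.1]), rng))
```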

[466] Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference

Harry Amad, Zhaozhi Qian, Dennis Frauen, Julianna Piskorz, Stefan Feuerriegel, Mihaela van der Schaar

Main category: cs.LG

TL;DR: STEAM is a novel method for generating synthetic medical data specifically optimized for treatment effect analysis, addressing limitations of existing generative models by preserving key causal inference properties.

DetailsMotivation: Real-world medical datasets are difficult to access due to regulatory barriers, making synthetic data valuable for causal inference and method development. Existing generative models don't consider the unique challenges of downstream causal inference tasks focused on treatments.

Method: Proposed STEAM method that mimics the data-generating process of treatment data and optimizes for three desiderata: preserving covariate distribution, treatment assignment mechanism, and outcome generation mechanism. Also introduced evaluation metrics to assess synthetic data quality.

Result: STEAM achieves state-of-the-art performance across the proposed metrics compared to existing generative models, particularly as the complexity of the true data-generating process increases.

Conclusion: STEAM provides an effective solution for generating synthetic medical data that maintains the necessary properties for reliable treatment effect analysis, outperforming existing methods especially in complex scenarios.

Abstract: Causal inference is essential for developing and evaluating medical interventions, yet real-world medical datasets are often difficult to access due to regulatory barriers. This makes synthetic data a potentially valuable asset that enables these medical analyses, along with the development of new inference methods themselves. Generative models can produce synthetic data that closely approximate real data distributions, yet existing methods do not consider the unique challenges that downstream causal inference tasks, and specifically those focused on treatments, pose. We establish a set of desiderata that synthetic data containing treatments should satisfy to maximise downstream utility: preservation of (i) the covariate distribution, (ii) the treatment assignment mechanism, and (iii) the outcome generation mechanism. Based on these desiderata, we propose a set of evaluation metrics to assess such synthetic data. Finally, we present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine that mimics the data-generating process of data containing treatments and optimises for our desiderata. We empirically demonstrate that STEAM achieves state-of-the-art performance across our metrics as compared to existing generative models, particularly as the complexity of the true data-generating process increases.

[467] Enhancing Fractional Gradient Descent with Learned Optimizers

Jan Sobotka, Petr Šimánek, Pavel Kordík

Main category: cs.LG

TL;DR: L2O-CFGD meta-learns dynamic hyperparameter tuning for Caputo Fractional Gradient Descent to address convergence and hyperparameter challenges in non-convex optimization.

DetailsMotivation: Fractional Gradient Descent shows promise for accelerating optimization but faces challenges with convergence behavior, hyperparameter selection, and scheduling difficulties in non-convex settings like neural network training.

Method: Proposed Learning to Optimize Caputo Fractional Gradient Descent (L2O-CFGD) which meta-learns dynamic hyperparameter tuning for CFGD, enabling adaptive scheduling of hyperparameters.

Result: L2O-CFGD’s meta-learned schedule outperforms CFGD with static hyperparameters from extensive search, and achieves performance comparable to fully black-box meta-learned optimizers in some tasks.

Conclusion: L2O-CFGD serves as a powerful tool for identifying high-performing hyperparameters and gaining insights into leveraging the history-dependence of fractional differentials in optimization.

Abstract: Fractional Gradient Descent (FGD) offers a novel and promising way to accelerate optimization by incorporating fractional calculus into machine learning. Although FGD has shown encouraging initial results across various optimization tasks, it faces significant challenges with convergence behavior and hyperparameter selection. Moreover, the impact of its hyperparameters is not fully understood, and scheduling them is particularly difficult in non-convex settings such as neural network training. To address these issues, we propose a novel approach called Learning to Optimize Caputo Fractional Gradient Descent (L2O-CFGD), which meta-learns how to dynamically tune the hyperparameters of Caputo FGD (CFGD). Our method’s meta-learned schedule outperforms CFGD with static hyperparameters found through an extensive search and, in some tasks, achieves performance comparable to a fully black-box meta-learned optimizer. L2O-CFGD can thus serve as a powerful tool for researchers to identify high-performing hyperparameters and gain insights on how to leverage the history-dependence of the fractional differential in optimization.
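
A sketch of one commonly used first-order truncation of the Caputo fractional gradient, which scales the ordinary gradient by |x - x_ref|^(1-α)/Γ(2-α); this is illustrative and not necessarily the exact variant the paper builds on, and the fixed hyperparameters stand in for the meta-learned schedule.

```python
import math

def caputo_fgd_step(x, grad, x_ref, alpha, lr):
    """One CFGD step under a common first-order truncation of the Caputo
    fractional derivative; alpha and lr are fixed here purely for
    illustration (the paper meta-learns such schedules)."""
    scale = abs(x - x_ref) ** (1 - alpha) / math.gamma(2 - alpha)
    return x - lr * grad * scale

# Minimize f(x) = x^2 from x = 3 with reference point 0
x, x_ref = 3.0, 0.0
for _ in range(50):
    x = caputo_fgd_step(x, 2 * x, x_ref, alpha=0.8, lr=0.1)
print(x)  # approaches the minimizer at 0
```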

[468] CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training

Soroush Tabesh, Mher Safaryan, Dan Alistarh

Main category: cs.LG

TL;DR: CAGE introduces a curvature-aware gradient correction to QAT that reduces the accuracy gap between quantized and native training by counteracting quantization-induced loss increase.

DetailsMotivation: There is still a large accuracy gap between low-bit quantization-aware training techniques and native training that needs to be addressed.

Method: CAGE augments the straight-through estimator gradient with a curvature-aware correction derived from a multi-objective view of QAT, balancing loss minimization with quantization constraints.

Result: CAGE recovers over 10% of quantization-induced loss increase in W4A4 regime for Llama-style models up to 800M parameters, outperforming outlier-mitigation methods.

Conclusion: Curvature-aware gradient corrections can bridge the remaining performance gap beyond current outlier-handling methods in quantization-aware training.

Abstract: Despite significant work on low-bit quantization-aware training (QAT), there is still a large accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with adherence to quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly-efficient implementation that leverages Adam statistics. When pre-training Llama-style models of up to 800M-parameters, CAGE recovers over 10% of the quantization-induced loss increase in the W4A4 regime over outlier-mitigation methods. These results indicate that curvature-aware gradient corrections can bridge the remaining performance gap beyond current outlier-handling methods.
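
For context, the straight-through estimator that CAGE augments can be written in a few lines of PyTorch; the curvature-aware correction itself is not reproduced here.

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Straight-through estimator: round in the forward pass, pass the
    gradient through unchanged in the backward pass. CAGE adds a
    curvature-aware correction to this gradient (not shown)."""
    @staticmethod
    def forward(ctx, w, scale):
        return torch.round(w / scale) * scale
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None   # identity gradient w.r.t. w

w = torch.randn(4, requires_grad=True)
loss = STEQuantize.apply(w, 0.1).sum()
loss.backward()
print(w.grad)   # all ones: gradient flowed straight through the rounding
```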

[469] Stick-Breaking Embedded Topic Model with Continuous Optimal Transport for Online Analysis of Document Streams

Federica Granese, Serena Villata, Charles Bouveyron

Main category: cs.LG

TL;DR: SB-SETM is an online topic model that extends the Embedded Topic Model to handle data streams by merging models from successive document batches, automatically inferring active topics and using optimal transport for topic embedding merging.

DetailsMotivation: Online topic models are needed for real-world data streams that evolve over time, but they receive less attention than offline models due to additional challenges like handling continuous data flow and topic evolution.

Method: Extends ETM with truncated stick-breaking construction for topic-per-document distribution to infer active topics automatically, and introduces optimal transport-based merging strategy for topic embeddings in high-dimensional latent space.

Result: Outperforms baselines on simulated scenarios and is extensively tested on a real-world corpus of news articles about the Russian-Ukrainian war (2022-2023).

Conclusion: SB-SETM successfully addresses challenges in online topic modeling through automatic topic inference and optimal transport-based merging, demonstrating effectiveness on both simulated and real-world data streams.

Abstract: Online topic models are unsupervised algorithms to identify latent topics in data streams that continuously evolve over time. Although these methods naturally align with real-world scenarios, they have received considerably less attention from the community compared to their offline counterparts, due to specific additional challenges. To tackle these issues, we present SB-SETM, an innovative model extending the Embedded Topic Model (ETM) to process data streams by merging models formed on successive partial document batches. To this end, SB-SETM (i) leverages a truncated stick-breaking construction for the topic-per-document distribution, enabling the model to automatically infer from the data the appropriate number of active topics at each timestep; and (ii) introduces a merging strategy for topic embeddings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space. Numerical experiments show SB-SETM outperforming baselines on simulated scenarios. We extensively test it on a real-world corpus of news articles covering the Russian-Ukrainian war throughout 2022-2023.
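
A truncated stick-breaking draw of topic proportions, the construction SB-SETM uses to let the data determine how many topics stay active (toy parameters):

```python
import numpy as np

def stick_breaking(alpha, truncation, rng):
    """Break a unit stick with Beta(1, alpha) fractions up to a truncation
    level; topics with negligible remaining mass are effectively inactive.
    The leftover mass sits on the truncated tail, so weights sum to < 1."""
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1 - betas[:-1])])
    return betas * remaining

rng = np.random.default_rng(0)
weights = stick_breaking(alpha=2.0, truncation=15, rng=rng)
print(weights.round(3), weights.sum().round(3))
```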

[470] On Biologically Plausible Learning in Continuous Time

Marc Gong Bacvanski, Liu Ziyin, Tomaso Poggio

Main category: cs.LG

TL;DR: Continuous-time neural model unifies biologically plausible learning algorithms without phase separation, showing that learning depends on temporal overlap between inputs and error signals, with robust learning requiring plasticity timescales 1-2 orders of magnitude longer than stimulus duration.

DetailsMotivation: Biological learning occurs continuously in time, but most algorithmic models use discrete updates and separate inference/learning phases, which doesn't reflect biological reality.

Method: Developed a continuous-time neural model that unifies several learning algorithms (SGD, FA, DFA, KP) as limiting cases, analyzed learning dynamics under temporal mismatches and integration noise.

Result: Learning depends on temporal overlap between input and error signals; when inputs are constant, learning strength declines linearly with delay; robust learning requires plasticity timescale to exceed stimulus duration by 1-2 orders of magnitude.

Conclusion: For cortical stimuli (tens of milliseconds), functional plasticity window should be in the few-second range, identifying seconds-scale eligibility traces as necessary for error-driven learning in biological circuits.

Abstract: Biological learning unfolds continuously in time, yet most algorithmic models rely on discrete updates and separate inference and learning phases. We study a continuous-time neural model that unifies several biologically plausible learning algorithms and removes the need for phase separation. Rules including stochastic gradient descent (SGD), feedback alignment (FA), direct feedback alignment (DFA), and Kolen-Pollack (KP) emerge naturally as limiting cases of the dynamics. Simulations show that these continuous-time networks stably learn at biological timescales, even under temporal mismatches and integration noise. Through analysis and simulation, we show that learning depends on temporal overlap: a synapse updates correctly only when its input and the corresponding error signal coincide in time. When inputs are held constant, learning strength declines linearly as the delay between input and error approaches the stimulus duration, explaining observed robustness and failure across network depths. Critically, robust learning requires the synaptic plasticity timescale to exceed the stimulus duration by one to two orders of magnitude. For typical cortical stimuli (tens of milliseconds), this places the functional plasticity window in the few-second range, a testable prediction that identifies seconds-scale eligibility traces as necessary for error-driven learning in biological circuits.

[471] When LRP Diverges from Leave-One-Out in Transformers

Weiqiu You, Siqi Zeng, Yao-Hung Hubert Tsai, Makoto Yamada, Han Zhao

Main category: cs.LG

TL;DR: LOO is computationally expensive for feature importance, while LRP’s axiomatic soundness in Transformers is questionable. Bilinear propagation rules in AttnLRP violate implementation invariance, and bypassing softmax in CP-LRP improves alignment with LOO.

DetailsMotivation: To examine the axiomatic soundness of LRP in modern Transformers and identify why LRP fails to approximate LOO effectively.

Method: Analytical proof and empirical confirmation of bilinear propagation rule violations in linear attention layers; revisiting CP-LRP with softmax bypass.

Result: Bilinear factorization sensitivity and softmax propagation error jointly undermine LRP’s ability to approximate LOO in Transformers.

Conclusion: LRP’s approximation of LOO in Transformers is compromised by bilinear rule violations and softmax propagation issues, with CP-LRP modification showing improved alignment.

Abstract: Leave-One-Out (LOO) provides an intuitive measure of feature importance but is computationally prohibitive. While Layer-Wise Relevance Propagation (LRP) offers a potentially efficient alternative, its axiomatic soundness in modern Transformers remains largely under-examined. In this work, we first show that the bilinear propagation rules used in recent advances of AttnLRP violate the implementation invariance axiom. We prove this analytically and confirm it empirically in linear attention layers. Second, we also revisit CP-LRP as a diagnostic baseline and find that bypassing relevance propagation through the softmax layer – backpropagating relevance only through the value matrices – significantly improves alignment with LOO, particularly in middle-to-late Transformer layers. Overall, our results suggest that (i) bilinear factorization sensitivity and (ii) softmax propagation error potentially jointly undermine LRP’s ability to approximate LOO in Transformers.

[472] A Unified Perspective on Optimization in Machine Learning and Neuroscience: From Gradient Descent to Neural Adaptation

Jesús García Fernández, Nasir Ahmad, Marcel van Gerven

Main category: cs.LG

TL;DR: This review provides a unified perspective on iterative optimization, bridging classic theory with neural network training and biological learning, highlighting zeroth-order methods as biologically plausible alternatives to backpropagation.

DetailsMotivation: To bridge the gap between gradient-based optimization methods (like backpropagation) and biological learning systems, exploring computationally lighter zeroth-order methods that align better with how the brain learns.

Method: Categorizes optimization approaches by derivative order used, explores adaptation to neural network training challenges, and applies these insights to biological learning through a zeroth-order optimization lens.

Result: Modern zeroth-order methods can effectively approximate gradients and achieve performance competitive with backpropagation in neural networks, while also providing a principled framework for understanding biological learning.

Conclusion: The zeroth-order optimization paradigm offers a mathematically principled perspective on biological learning, leveraging intrinsic noise as a computational resource, with implications for designing energy-efficient neuromorphic hardware.

Abstract: Iterative optimization is central to modern artificial intelligence (AI) and provides a crucial framework for understanding adaptive systems. This review provides a unified perspective on this subject, bridging classic theory with neural network training and biological learning. Although gradient-based methods, powered by the efficient but biologically implausible backpropagation (BP), dominate machine learning, their computational demands can hinder scalability in high-dimensional settings. In contrast, derivative-free or zeroth-order (ZO) optimization feature computationally lighter approaches that rely only on function evaluations and randomness. While generally less sample efficient, recent breakthroughs demonstrate that modern ZO methods can effectively approximate gradients and achieve performance competitive with BP in neural network models. This ZO paradigm is also particularly relevant for biology. Its core principles of random exploration (probing) and feedback-guided adaptation (reinforcing) parallel key mechanisms of biological learning, offering a mathematically principled perspective on how the brain learns. In this review, we begin by categorizing optimization approaches based on the order of derivative information they utilize, ranging from first-, second-, and higher-order gradient-based to ZO methods. We then explore how these methods are adapted to the unique challenges of neural network training and the resulting learning dynamics. Finally, we build upon these insights to view biological learning through an optimization lens, arguing that a ZO paradigm leverages the brain’s intrinsic noise as a computational resource. This framework not only illuminates our understanding of natural intelligence but also holds vast implications for neuromorphic hardware, helping us design fast and energy-efficient AI systems that exploit intrinsic hardware noise.
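
A standard two-point zeroth-order gradient estimator of the kind the review surveys, probing random directions and reinforcing them in proportion to the observed change in the objective:

```python
import numpy as np

def zo_gradient(f, x, sigma=1e-2, n_probes=32, rng=None):
    """Two-point zeroth-order estimator: only function evaluations are used,
    mirroring the probe-and-reinforce picture of biological learning."""
    rng = rng or np.random.default_rng()
    g = np.zeros_like(x)
    for _ in range(n_probes):
        u = rng.standard_normal(x.shape)           # random probe direction
        g += (f(x + sigma * u) - f(x - sigma * u)) / (2 * sigma) * u
    return g / n_probes

f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 3.0])
print(zo_gradient(f, x, rng=np.random.default_rng(0)))  # approx [2, -4, 6]
```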

[473] Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li

Main category: cs.LG

TL;DR: Online Supervised Finetuning (OSFT) is a simple, self-help training paradigm where LLMs generate their own responses and are immediately finetuned on this self-generated data, achieving reasoning performance comparable to complex RL methods.

DetailsMotivation: To develop an efficient training strategy for LLM reasoning that doesn't require complex reward systems and leverages the model's existing latent knowledge from pretraining.

Method: A self-help online supervised finetuning paradigm where the model generates responses and is immediately finetuned on this self-generated data using just one rollout by default.

Result: OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards methods like GRPO, while being more efficient and robust.

Conclusion: OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms by facilitating the model’s existing preference and latent knowledge learned from pretraining.

Abstract: We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experiment results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model’s own existing preference (latent knowledge) learned from pretraining, which leads to reasoning ability improvement. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
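
A minimal runnable sketch of one OSFT iteration using a tiny test checkpoint (details such as prompt masking and batching are omitted; this is not the authors' training code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Question: what is 2 + 3? Answer:"
ids = tok(prompt, return_tensors="pt").input_ids
rollout = model.generate(ids, max_new_tokens=16, do_sample=True,
                         pad_token_id=tok.eos_token_id)   # one rollout, no reward

loss = model(rollout, labels=rollout).loss   # SFT step on self-generated text
loss.backward()
opt.step()
print(float(loss))
```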

[474] Search Self-play: Pushing the Frontier of Agent Capability without Supervision

Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Haotian Xu, Jiaqi Guo, Chutian Wang, Haonan Chen, Xiaoxi Jiang, Guanjun Jiang

Main category: cs.LG

TL;DR: Search Self-Play (SSP) enables scalable reinforcement learning for LLM agents by using self-play where the same LLM acts as both task proposer and solver, generating deep search queries with increasing difficulty and verifiable ground-truth answers through retrieval-augmented generation.

DetailsMotivation: Traditional RLVR requires massive human effort for task queries and ground-truth answers, hindering scalability in agentic scenarios. Existing task synthesis methods struggle to control task difficulty for effective RL training.

Method: Self-play training where LLM acts as both task proposer (generating deep search queries with increasing difficulty) and problem solver. Uses retrieval-augmented generation to verify ground-truth answers from search results.

Result: SSP significantly improves search agents’ performance uniformly across various benchmarks without supervision, working effectively in both from-scratch and continuous RL training setups.

Conclusion: Search Self-Play provides a scalable approach for agentic RLVR by enabling co-evolution of agent capabilities through self-play, eliminating the need for human supervision while maintaining training effectiveness.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires massive human effort and hinders the RL scaling processes, especially under agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer's trajectory as external knowledge, then conduct retrieval-augmented generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents' performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups. The code is at https://github.com/Alibaba-Quark/SSP.

[475] BO4Mob: Bayesian Optimization Benchmarks for High-Dimensional Urban Mobility Problem

Seunghee Ryu, Donghoon Kwon, Seongjin Choi, Aryan Deshwal, Seungmo Kang, Carolina Osorio

Main category: cs.LG

TL;DR: BO4Mob is a new benchmark framework for high-dimensional Bayesian Optimization focused on origin-destination travel demand estimation in large urban road networks, with scenarios scaling up to 10,100 dimensions.

DetailsMotivation: The challenge of origin-destination travel demand estimation from limited traffic sensor data is a difficult inverse optimization problem in large-scale transportation networks, involving high-dimensional continuous spaces with computationally expensive, stochastic, and non-differentiable objective evaluations.

Method: BO4Mob comprises five scenarios based on real-world San Jose road networks using high-resolution open-source traffic simulations with realistic nonlinear and stochastic dynamics. The framework evaluates five optimization methods including three state-of-the-art BO algorithms and two non-BO baselines.

Result: The benchmark demonstrates utility for evaluating optimization methods in high-dimensional spaces up to 10,100 dimensions, supporting development of scalable algorithms for urban mobility modeling.

Conclusion: BO4Mob provides a valuable benchmark for both developing scalable optimization algorithms and applying them to data-driven urban mobility models, including high-resolution digital twins of metropolitan road networks.

Abstract: We introduce **BO4Mob**, a new benchmark framework for high-dimensional Bayesian Optimization (BO), driven by the challenge of origin-destination (OD) travel demand estimation in large urban road networks. Estimating OD travel demand from limited traffic sensor data is a difficult inverse optimization problem, particularly in real-world, large-scale transportation networks. This problem involves optimizing over high-dimensional continuous spaces where each objective evaluation is computationally expensive, stochastic, and non-differentiable. BO4Mob comprises five scenarios based on real-world San Jose, CA road networks, with input dimensions scaling up to 10,100. These scenarios utilize high-resolution, open-source traffic simulations that incorporate realistic nonlinear and stochastic dynamics. We demonstrate the benchmark's utility by evaluating five optimization methods: three state-of-the-art BO algorithms and two non-BO baselines. This benchmark is designed to support both the development of scalable optimization algorithms and their application for the design of data-driven urban mobility models, including high-resolution digital twins of metropolitan road networks. Code and documentation are available at https://github.com/UMN-Choi-Lab/BO4Mob.

[476] Actor-Free Continuous Control via Structurally Maximizable Q-Functions

Yigit Korkmaz, Urvi Bhuwania, Ayush Jain, Erdem Bıyık

Main category: cs.LG

TL;DR: A purely value-based framework for continuous control that uses structural maximization of Q-functions instead of actor-critic methods, achieving comparable performance without learning a separate actor.

DetailsMotivation: Traditional value-based algorithms are limited to discrete action spaces, while actor-critic methods for continuous spaces suffer from training instability. The authors aim to develop a stable, purely value-based approach for continuous control.

Method: Proposes an actor-free Q-learning framework with structural maximization of Q-functions, introducing key architectural and algorithmic choices for efficient learning in continuous action spaces.

Result: The method achieves performance and sample efficiency on par with state-of-the-art baselines on standard simulation tasks, and outperforms traditional actor-critic methods in environments with constrained action spaces where value functions are non-smooth.

Conclusion: The proposed purely value-based approach provides a viable alternative to actor-critic methods for continuous control, offering training stability and competitive performance without the need for learning a separate actor network.

Abstract: Value-based algorithms are a cornerstone of off-policy reinforcement learning due to their simplicity and training stability. However, their use has traditionally been restricted to discrete action spaces, as they rely on estimating Q-values for individual state-action pairs. In continuous action spaces, evaluating the Q-value over the entire action space becomes computationally infeasible. To address this, actor-critic methods are typically employed, where a critic is trained on off-policy data to estimate Q-values, and an actor is trained to maximize the critic’s output. Despite their popularity, these methods often suffer from instability during training. In this work, we propose a purely value-based framework for continuous control that revisits structural maximization of Q-functions, introducing a set of key architectural and algorithmic choices to enable efficient and stable learning. We evaluate the proposed actor-free Q-learning approach on a range of standard simulation tasks, demonstrating performance and sample efficiency on par with state-of-the-art baselines, without the cost of learning a separate actor. Particularly, in environments with constrained action spaces, where the value functions are typically non-smooth, our method with structural maximization outperforms traditional actor-critic methods with gradient-based maximization. We have released our code at https://github.com/USC-Lira/Q3C.
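
The paper's architecture is not reproduced here; a classic example of a structurally maximizable Q-function in the same spirit is the normalized advantage function (NAF), whose quadratic form makes the greedy action available in closed form without an actor.

```python
import torch
import torch.nn as nn

class NAF(nn.Module):
    """Normalized advantage function (a classic structurally maximizable
    Q-function, not the paper's exact model):
    Q(s, a) = V(s) - 0.5 * (a - mu(s))^T P(s) (a - mu(s)),
    so the maximizing action is mu(s) in closed form."""
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU())
        self.V = nn.Linear(hidden, 1)
        self.mu = nn.Linear(hidden, a_dim)
        self.L = nn.Linear(hidden, a_dim * a_dim)   # Cholesky factor of P(s)
        self.a_dim = a_dim

    def forward(self, s, a):
        h = self.body(s)
        L = torch.tril(self.L(h).view(-1, self.a_dim, self.a_dim))
        P = L @ L.transpose(1, 2)                    # positive semidefinite
        diff = (a - self.mu(h)).unsqueeze(-1)
        adv = -0.5 * (diff.transpose(1, 2) @ P @ diff).squeeze(-1)
        return self.V(h) + adv                       # greedy action is mu(h)

q = NAF(s_dim=3, a_dim=2)
s, a = torch.randn(5, 3), torch.randn(5, 2)
print(q(s, a).shape)   # torch.Size([5, 1])
```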

[477] A Hybrid Enumeration Framework for Optimal Counterfactual Generation in Post-Acute COVID-19 Heart Failure

Jingya Cheng, Alaleh Azhir, Jiazi Tian, Hossein Estiri

Main category: cs.LG

TL;DR: A counterfactual inference framework for personalized risk estimation and intervention analysis in COVID-19 patients with pre-existing heart failure, combining predictive modeling with optimization-based counterfactual search.

DetailsMotivation: To bridge causal reasoning and predictive modeling for individualized risk estimation and intervention analysis, specifically addressing post-acute sequelae of COVID-19 (PASC) in heart failure patients.

Method: Integrates regularized predictive modeling with counterfactual search using exact enumeration and optimization-based methods (NICE and MOC algorithms) to explore high-dimensional intervention spaces in longitudinal health data.

Result: Achieved strong discriminative performance (AUROC: 0.88, 95% CI: 0.84-0.91) on 2700+ patients, generating interpretable patient-specific counterfactuals showing how modifying comorbidities or treatments could alter outcomes.

Conclusion: Counterfactual reasoning can be formalized as an optimization problem over predictive functions, providing a rigorous, interpretable, and computationally efficient approach for personalized inference in complex biomedical systems.

Abstract: Counterfactual inference provides a mathematical framework for reasoning about hypothetical outcomes under alternative interventions, bridging causal reasoning and predictive modeling. We present a counterfactual inference framework for individualized risk estimation and intervention analysis, illustrated through a clinical application to post-acute sequelae of COVID-19 (PASC) among patients with pre-existing heart failure (HF). Using longitudinal diagnosis, laboratory, and medication data from a large health-system cohort, we integrate regularized predictive modeling with counterfactual search to identify actionable pathways to PASC-related HF hospital admissions. The framework combines exact enumeration with optimization-based methods, including the Nearest Instance Counterfactual Explanations (NICE) and Multi-Objective Counterfactuals (MOC) algorithms, to efficiently explore high-dimensional intervention spaces. Applied to more than 2700 individuals with confirmed SARS-CoV-2 infection and prior HF, the model achieved strong discriminative performance (AUROC: 0.88, 95% CI: 0.84-0.91) and generated interpretable, patient-specific counterfactuals that quantify how modifying comorbidity patterns or treatment factors could alter predicted outcomes. This work demonstrates how counterfactual reasoning can be formalized as an optimization problem over predictive functions, offering a rigorous, interpretable, and computationally efficient approach to personalized inference in complex biomedical systems.
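
As a concrete illustration of the exact-enumeration component, here is a minimal sketch that searches small sets of binary feature flips (e.g., comorbidity or treatment flags) for a scikit-learn-style classifier. The names and sparsity-first ranking are illustrative, not the paper's NICE/MOC implementations.

```python
from itertools import combinations
import numpy as np

def enumerate_counterfactuals(model, x, mutable_idx, threshold=0.5, max_flips=2):
    """Try every flip of up to `max_flips` mutable binary features and keep
    those that drive the predicted risk below `threshold`."""
    found = []
    for k in range(1, max_flips + 1):
        for subset in combinations(mutable_idx, k):
            x_cf = x.copy()
            x_cf[list(subset)] = 1 - x_cf[list(subset)]  # flip selected flags
            risk = model.predict_proba(x_cf.reshape(1, -1))[0, 1]
            if risk < threshold:
                found.append((subset, risk))
    # prefer counterfactuals that change the fewest features (sparsity)
    return sorted(found, key=lambda t: (len(t[0]), t[1]))
```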

[478] Analyzing Similarity Metrics for Data Selection for Language Model Pretraining

Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar

Main category: cs.LG

TL;DR: Standard off-the-shelf embedding models are not well-suited for pretraining data curation, underperforming simple embeddings from models trained on the same pretraining corpus.

DetailsMotivation: Similarity metrics for pretraining data selection are typically computed with generic embedding models, but their suitability for this specific task remains unexplored.

Method: Proposed a framework with three evaluation criteria: (1) how well distances reflect generalization in pretraining loss, (2) utility of embeddings in diversity-based data curation algorithms measured by downstream task performance, and (3) ability to distinguish between examples from different data sources.

Result: Experiments on the Pile dataset with a 1.7B parameter model trained on 200B tokens showed that standard embeddings underperform simple embeddings extracted from models trained on the same pretraining corpus.

Conclusion: The analysis framework serves as a foundation for designing embeddings specifically for reasoning about similarity in pretraining datasets.

Abstract: Measuring similarity between training examples is critical for curating high-quality and diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that has been trained for tasks such as retrieval. Whether these embedding-based similarity metrics are well-suited for pretraining data selection remains largely unexplored. In this paper, we propose a new framework to assess the suitability of a similarity metric specifically for data curation in language model pretraining applications. Our framework’s first evaluation criterion captures how well distances reflect generalization in pretraining loss between different training examples. Next, we use each embedding model to guide a standard diversity-based data curation algorithm and measure its utility by pretraining a language model on the selected data and evaluating downstream task performance. Finally, we evaluate the capabilities of embeddings to distinguish between examples from different data sources. With these evaluations, we demonstrate that standard off-the-shelf embedding models are not well-suited for the pretraining data curation setting, underperforming even remarkably simple embeddings that are extracted from models trained on the same pretraining corpus. Our experiments are performed on the Pile, for pretraining a 1.7B parameter language model on 200B tokens. We believe our analysis and evaluation framework serves as a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets.
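
A common diversity-based curation algorithm of the kind this evaluation plugs embeddings into is k-center greedy selection; the sketch below assumes precomputed embeddings and is not necessarily the exact algorithm used in the paper.

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k points that cover the embedding space: each new pick
    is the point farthest from the current selection."""
    selected = [0]
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))           # farthest remaining point
        selected.append(idx)
        new_d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new_d)      # distance to nearest selected center
    return selected
```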

[479] Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu

Main category: cs.LG

TL;DR: SPO is a novel RL framework that uses segment-level advantage estimation to balance between token-level and trajectory-level methods, achieving better reasoning performance in large language models.

DetailsMotivation: Existing RL approaches for enhancing LLM reasoning have limitations: token-level methods suffer from inaccurate critic model training, while trajectory-level methods provide only coarse-grained advantage signals that lead to imprecise credit assignment.

Method: SPO uses segment-level advantage estimation with three components: flexible segment partition, accurate segment advantage estimation, and policy optimization with probability-mask strategy. Two instantiations: SPO-chain for short CoT with cutpoint-based partition, and SPO-tree for long CoT with tree-based advantage estimation.

Result: SPO-chain achieves 6-12 percentage point improvements over PPO and GRPO on GSM8K. SPO-tree achieves 7-11 percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation.

Conclusion: SPO provides an effective RL framework that balances advantage estimation granularity, enabling more precise credit assignment without requiring a critic model, significantly improving reasoning performance in LLMs.

Abstract: Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving $6$-$12$ percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving $7$-$11$ percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
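
The core bookkeeping behind segment-level credit assignment can be sketched in a few lines: given Monte Carlo value estimates at segment boundaries, each segment's advantage is the change in value across it. This is a schematic reading of the method description, not the SPO codebase.

```python
# boundary_values[i]: Monte Carlo value estimate at segment boundary i,
# e.g., the mean return of rollouts sampled from that boundary.
def segment_advantages(boundary_values: list[float]) -> list[float]:
    """Advantage of segment i is V[i+1] - V[i]."""
    return [boundary_values[i + 1] - boundary_values[i]
            for i in range(len(boundary_values) - 1)]

# Every token inside segment i then receives advantage A[i] in the policy
# update; a probability mask can restrict the loss to selected tokens.
```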

[480] Generative or Discriminative? Revisiting Text Classification in the Era of Transformers

Siva Rajesh Kasa, Karan Gupta, Sumegh Roychowdhury, Ashutosh Kumar, Yaswanth Biruduraju, Santhosh Kumar Kasa, Nikhil Priyatam Pattisapu, Arindam Bhattacharya, Shailendra Agarwal, Vijay huddar

Main category: cs.LG

TL;DR: This paper provides the first comprehensive comparison of modern generative and discriminative transformer architectures for text classification, examining classical trade-offs in the context of contemporary models.

DetailsMotivation: To examine the classical trade-offs between discriminative and generative classifiers in the transformer era, where they remain unexplored despite Efron's seminal comparison of logistic regression and discriminant analysis.

Method: Comprehensive evaluation of modern generative and discriminative architectures including Auto-regressive modeling, Masked Language Modeling, Discrete Diffusion, and Encoders for text classification across multiple dimensions.

Result: The classical ‘two regimes’ phenomenon manifests distinctly across different architectures and training paradigms, with findings covering sample efficiency, calibration, noise robustness, and ordinality.

Conclusion: The study offers practical guidance for selecting the most suitable modeling approach based on real-world constraints such as latency and data limitations.

Abstract: The comparison between discriminative and generative classifiers has intrigued researchers since Efron’s seminal analysis of logistic regression versus discriminant analysis. While early theoretical work established that generative classifiers exhibit lower sample complexity but higher asymptotic error in simple linear settings, these trade-offs remain unexplored in the transformer era. We present the first comprehensive evaluation of modern generative and discriminative architectures - Auto-regressive modeling, Masked Language Modeling, Discrete Diffusion, and Encoders for text classification. Our study reveals that the classical ‘two regimes’ phenomenon manifests distinctly across different architectures and training paradigms. Beyond accuracy, we analyze sample efficiency, calibration, noise robustness, and ordinality across diverse scenarios. Our findings offer practical guidance for selecting the most suitable modeling approach based on real-world constraints such as latency and data limitations.

[481] Learning to Interpret Weight Differences in Language Models

Avichal Goel, Yoon Kim, Nir Shavit, Tony T. Wang

Main category: cs.LG

TL;DR: Diff Interpretation Tuning (DIT) trains models to describe their own finetuning-induced weight changes in natural language, enabling interpretable understanding of model modifications.

DetailsMotivation: Finetuning changes model weights in ways that are not interpretable, and finetuning datasets are often unavailable or too large to analyze directly, making it hard to understand how models have been modified.

Method: DIT uses synthetic, labeled weight diffs to train a DIT-adapter that can be applied to finetuned models to make them describe their modifications in natural language.

Result: The method enables models to accurately describe their finetuning-induced modifications in natural language across two proof-of-concept settings: reporting hidden behaviors and summarizing finetuned knowledge.

Conclusion: DIT provides a way to comprehensively understand weight diffs through natural language descriptions, making model finetuning modifications interpretable.

Abstract: Finetuning (pretrained) language models is a standard approach for updating their internal parametric knowledge and specializing them to new tasks and domains. However, the corresponding model weight changes (“weight diffs”) are not generally interpretable. While inspecting the finetuning dataset can give a sense of how the model might have changed, these datasets are often not publicly available or are too large to work with directly. Towards the goal of comprehensively understanding weight diffs in natural language, we introduce Diff Interpretation Tuning (DIT), a method that trains models to describe their own finetuning-induced modifications. Our approach uses synthetic, labeled weight diffs to train a DIT-adapter, which can be applied to a compatible finetuned model to make it describe how it has changed. We demonstrate in two proof-of-concept settings (reporting hidden behaviors and summarizing finetuned knowledge) that our method enables models to describe their finetuning-induced modifications using accurate natural language descriptions.
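
The object DIT operates on is easy to state concretely: a per-parameter difference between a finetuned checkpoint and its base. A minimal sketch follows; the adapter training on synthetic labeled diffs is not reproduced.

```python
import torch

def weight_diff(base_state: dict, finetuned_state: dict) -> dict:
    """Per-parameter difference; this is the object a DIT-style adapter is
    trained to describe in natural language."""
    return {name: finetuned_state[name] - base_state[name]
            for name in base_state}

# usage sketch:
# diff = weight_diff(base_model.state_dict(), finetuned_model.state_dict())
```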

[482] One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares

Youngjae Min, Namhoon Cho, Navid Azizan

Main category: cs.LG

TL;DR: ORFit enables efficient one-pass learning by updating model parameters orthogonally to past gradients, achieving linear memory/computation complexity and matching SGD performance for overparameterized models.

DetailsMotivation: Traditional ML training requires multiple passes over data, which is impractical for streaming data due to computational, memory, and privacy constraints.

Method: Propose Orthogonal Recursive Fitting (ORFit) that fits new datapoints perfectly while minimally altering previous predictions, using orthogonal gradient updates and incremental PCA for efficiency.

Result: ORFit achieves linear memory/computation complexity (vs quadratic in RLS), is minimax optimal for worst-case forgetting, and matches SGD convergence for overparameterized linear models.

Conclusion: ORFit provides an efficient one-pass learning solution that maintains performance while addressing practical constraints of streaming data scenarios.

Abstract: While large machine learning models have shown remarkable performance in various domains, their training typically requires iterating for many passes over the training data. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the demonstrated effectiveness of overparameterized models and the phenomenon of benign overfitting, we propose Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit each new datapoint while minimally altering the predictions on previous datapoints. ORFit updates the parameters in a direction orthogonal to past gradients, similar to orthogonal gradient descent (OGD) in continual learning. We show that, interestingly, ORFit’s update leads to an operation similar to the recursive least-squares (RLS) algorithm in adaptive filtering but with significantly improved memory and computational efficiency, i.e., linear, instead of quadratic, in the number of parameters. To further reduce memory usage, we leverage the structure of the streaming data via an incremental principal component analysis (IPCA). We show that using the principal components is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. Further, we prove that, for overparameterized linear models, the parameter vector obtained by ORFit matches what the standard multi-pass stochastic gradient descent (SGD) would converge to. Finally, we extend our results to the nonlinear setting for highly overparameterized models, relevant for deep learning.
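
For a linear model f(x) = w @ x, the ORFit-style update can be sketched directly: fit each arriving point exactly while stepping only in the component of x orthogonal to all past inputs, so earlier predictions are untouched. The IPCA-based memory reduction is omitted here.

```python
import numpy as np

def orfit_update(w, U, x, y, tol=1e-10):
    """w: weights; U: matrix whose orthonormal columns span past input
    directions; (x, y): the newly arriving datapoint."""
    d = x - U @ (U.T @ x)                     # component of x orthogonal to the past
    denom = d @ x                             # equals d @ d since d is a projection residual
    if abs(denom) > tol:                      # otherwise x adds no new direction
        w = w + ((y - w @ x) / denom) * d     # exact fit of (x, y)
        U = np.hstack([U, (d / np.linalg.norm(d))[:, None]])
    return w, U

rng = np.random.default_rng(0)
w, U = np.zeros(5), np.zeros((5, 0))
for _ in range(3):
    x, y = rng.normal(size=5), rng.normal()
    w, U = orfit_update(w, U, x, y)           # past predictions stay unchanged
```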

[483] Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential

Xuansheng Wu, Xiaoman Pan, Wenlin Yao, Jianshu Chen

Main category: cs.LG

TL;DR: The paper identifies that a model’s reasoning potential after reinforcement learning with verifiable rewards (RLVR) depends on its pre-trained ability to distinguish sound knowledge from unsound knowledge, quantified by the Soundness-Aware Level (SAL) metric.

DetailsMotivation: To understand why post-RLVR reasoning performance varies dramatically across different base LLMs and identify the microscopic properties that cause this variation.

Method: Formalize reasoning as chains of Horn clauses built from features extracted via cross-layer sparse autoencoders, estimate transition probabilities, categorize rules by semantic soundness levels, and introduce the SAL metric using Jensen-Shannon Divergence.

Result: High-potential models are soundness-aware with distinct internal probability distributions for different soundness levels, while weaker models are soundness-agnostic. SAL predicts post-RLVR reasoning performance with high accuracy (R²=0.87) across diverse model families and scales.

Conclusion: A model’s reasoning potential is intrinsically tied to its pre-trained ability to distinguish sound from unsound knowledge, highlighting the critical role of pre-training and providing a practical metric for model selection and design.

Abstract: Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), while their performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses (“if-then” rules) built from features extracted from the LLM’s latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between its features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules’ soundness levels, becoming highly distinct for “strict” versus “noisy” rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL’s predictions of post-RLVR reasoning performance follow a precise empirical law ($R^2=0.87$) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model’s reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound knowledge. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model’s internal mechanisms for selecting/designing stronger base models.
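
The separation measure at the heart of SAL is standard and easy to reproduce in miniature: the Jensen-Shannon divergence between rule-probability distributions. The binned distributions below are illustrative, and the SAE feature extraction is assumed already done.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

strict_probs = np.array([0.70, 0.20, 0.10])  # illustrative binned transition
noisy_probs  = np.array([0.35, 0.35, 0.30])  # probabilities per rule class

# scipy returns the JS *distance* (the square root of the divergence)
sal_like = jensenshannon(strict_probs, noisy_probs, base=2) ** 2
print(f"JSD between strict and noisy rule distributions: {sal_like:.4f}")
```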

[484] Better NTK Conditioning: A Free Lunch from (ReLU) Nonlinear Activation in Wide Neural Networks

Chaoyue Liu, Han Bi, Like Hui, Xiao Liu

Main category: cs.LG

TL;DR: Nonlinear activations (ReLU) improve feature separation and NTK conditioning in wide neural networks, with deeper networks amplifying these effects. Without nonlinear activations, data separation remains unchanged regardless of depth.

DetailsMotivation: To understand the specific effects of nonlinear activation functions on neural networks, particularly how they enhance expressivity through better feature separation and improved neural tangent kernel (NTK) conditioning.

Method: Compare neural networks with enabled vs disabled nonlinear activations (ReLU), analyze feature separation angles in gradient space, and examine NTK condition numbers across different network depths.

Result: Nonlinear activations provide: (a) better feature separation (larger angle separation for similar data), (b) better NTK conditioning (smaller condition number). These effects are amplified by network depth, and in infinite-width-then-depth limit, all data achieve equal separation regardless of input similarity.

Conclusion: Nonlinear activation functions help improve worst-case convergence rates of gradient-based methods by enhancing feature separation and NTK conditioning, with network depth further amplifying these beneficial effects.

Abstract: Nonlinear activation functions are widely recognized for enhancing the expressivity of neural networks, which is the primary reason for their widespread implementation. In this work, we focus on ReLU activation and reveal a novel and intriguing property of nonlinear activations. By comparing enabling and disabling the nonlinear activations in the neural network, we demonstrate their specific effects on wide neural networks: (a) better feature separation, i.e., a larger angle separation for similar data in the feature space of model gradient, and (b) better NTK conditioning, i.e., a smaller condition number of neural tangent kernel (NTK). Furthermore, we show that the network depth (i.e., with more nonlinear activation operations) further amplifies these effects; in addition, in the infinite-width-then-depth limit, all data are equally separated with a fixed angle in the model gradient feature space, regardless of how similar they are originally in the input space. Note that, without the nonlinear activation, i.e., in a linear neural network, the data separation remains the same as for the original inputs and NTK condition number is equivalent to the Gram matrix, regardless of the network depth. Due to the close connection between NTK condition number and convergence theories, our results imply that nonlinear activation helps to improve the worst-case convergence rates of gradient based methods.
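
The claim is straightforward to probe empirically at toy scale: compare the condition number of the empirical NTK for a small MLP with ReLU enabled versus disabled (a deep linear network). A self-contained sketch:

```python
import torch

def ntk_condition_number(model, X):
    """Empirical NTK: K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    grads = []
    for x in X:
        model.zero_grad()
        model(x).squeeze().backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    J = torch.stack(grads)
    eig = torch.linalg.eigvalsh(J @ J.T)      # ascending eigenvalues
    return (eig[-1] / eig[0]).item()

torch.manual_seed(0)
X = torch.randn(8, 16)
for act in (torch.nn.ReLU, torch.nn.Identity):
    model = torch.nn.Sequential(torch.nn.Linear(16, 512), act(),
                                torch.nn.Linear(512, 512), act(),
                                torch.nn.Linear(512, 1))
    print(act.__name__, ntk_condition_number(model, X))
```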

[485] Mitigating Prior Errors in Causal Structure Learning: A Resilient Approach via Bayesian Networks

Lyuzhou Chen, Taiyu Ban, Xiangyu Wang, Derui Lyu, Huanhuan Chen

Main category: cs.LG

TL;DR: Proposes a robust causal structure learning strategy resilient to edge-level prior errors by identifying and handling quasi-circle structures, minimizing human intervention while maintaining accuracy.

DetailsMotivation: Current methods for integrating prior knowledge in causal structure learning lack resilience to errors in the prior - hard constraints ignore priors entirely, while soft constraints require predetermined confidence levels and expert intervention.

Method: Classifies prior errors into types and analyzes their impact on Structural Hamming Distance. Identifies that strong hazard of prior errors is associated with ‘quasi-circle’ structures. Uses a post-hoc strategy to detect prior errors by their impact on quasi-circle increments.

Result: Empirical evaluation on real and synthetic datasets demonstrates robust performance against prior errors, particularly showing strong resistance to order-reversed errors while preserving correct prior knowledge.

Conclusion: The proposed strategy effectively handles prior errors in causal structure learning by leveraging quasi-circle detection, reducing the need for human intervention while maintaining learning quality.

Abstract: Causal structure learning (CSL) is a prominent technique for encoding cause-and-effect relationships among variables through Bayesian Networks (BNs). Although recovering causal structure solely from data is a challenge, the integration of prior knowledge, revealing partial structural truth, can markedly enhance learning quality. However, current methods based on prior knowledge exhibit limited resilience to errors in the prior, with hard constraint methods disregarding priors entirely, and soft constraints accepting priors based on a predetermined confidence level, which may require expert intervention. To address this issue, we propose a strategy resilient to edge-level prior errors for CSL, thereby minimizing human intervention. We classify prior errors into different types and provide their theoretical impact on the Structural Hamming Distance (SHD) under the presumption of sufficient data. Intriguingly, we discover and prove that the strong hazard of prior errors is associated with a unique acyclic closed structure, defined as a “quasi-circle”. Leveraging this insight, a post-hoc strategy is employed to identify prior errors by their impact on the increment of “quasi-circles”. Through empirical evaluation on both real and synthetic datasets, we demonstrate our strategy’s robustness against prior errors. Specifically, we highlight its substantial ability to resist order-reversed errors while preserving the majority of the correct prior knowledge.

[486] Estimating Model Performance Under Covariate Shift Without Labels

Jakub Białek, Juhani Kivimäki, Wojtek Kuberski, Nikolaos Perrakis

Main category: cs.LG

TL;DR: PAPE is a new method for estimating binary classification model performance on unlabeled tabular data under covariate shift, outperforming existing approaches.

DetailsMotivation: Machine learning models degrade after deployment due to data distribution shifts, and existing proxy methods like data drift detection fail to adequately measure performance when labels are missing or delayed.

Method: Probabilistic Adaptive Performance Estimation (PAPE) - evaluates binary classification models on unlabeled tabular data, works with any confusion matrix-based metric, operates independently of original model using only predictions/probabilities, and learns directly from data without assumptions about covariate shift.

Result: Tested on 900+ dataset-model combinations from US census data, PAPE outperformed other benchmarks across various metrics.

Conclusion: PAPE is a superior choice for estimating binary classification model performance under covariate shift conditions.

Abstract: After deployment, machine learning models often experience performance degradation due to shifts in data distribution. It is challenging to assess post-deployment performance accurately when labels are missing or delayed. Existing proxy methods, such as data drift detection, fail to measure the effects of these shifts adequately. To address this, we introduce a new method for evaluating binary classification models on unlabeled tabular data that accurately estimates model performance under covariate shift and call it Probabilistic Adaptive Performance Estimation (PAPE). It can be applied to any performance metric defined with elements of the confusion matrix. Crucially, PAPE operates independently of the original model, relying only on its predictions and probability estimates, and does not need any assumptions about the nature of covariate shift, learning directly from data instead. We tested PAPE using over 900 dataset-model combinations from US census data, assessing its performance against several benchmarks through various metrics. Our findings show that PAPE outperforms other methodologies, making it a superior choice for estimating the performance of binary classification models.
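
The core idea behind probability-based performance estimation can be sketched compactly: with calibrated scores p_i = P(y_i = 1 | x_i), the expected confusion matrix on unlabeled data follows directly. PAPE's calibration under covariate shift is not reproduced here.

```python
import numpy as np

def expected_confusion(probs: np.ndarray, threshold: float = 0.5):
    pred = probs >= threshold
    tp = probs[pred].sum()          # each positive prediction is a TP w.p. p_i
    fp = (1 - probs[pred]).sum()
    fn = probs[~pred].sum()
    tn = (1 - probs[~pred]).sum()
    return tp, fp, fn, tn

probs = np.random.beta(2, 5, size=1000)   # scores on unlabeled production data
tp, fp, fn, tn = expected_confusion(probs)
print("expected accuracy:", (tp + tn) / len(probs))
```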

[487] Sparse Explanations of Neural Networks Using Pruned Layer-Wise Relevance Propagation

Paulo Yanez Sarmiento, Simon Witzke, Nadja Klein, Bernhard Y. Renard

Main category: cs.LG

TL;DR: The paper presents a modified layer-wise relevance propagation method that enforces sparsity by pruning relevance propagation for different layers, achieving sparser relevance attributions for input features and intermediate layers.

DetailsMotivation: Current DNN explanation methods require humans to distinguish relevant explanations from noise, which is infeasible for complex data like genome sequences. The goal is to increase explainability and accessibility of DNN outputs from such complex data.

Method: Modification of layer-wise relevance propagation that enforces sparsity by pruning relevance propagation for different layers. The approach prunes relevance propagation rather than the underlying model architecture, allowing different neurons to be pruned for different inputs.

Result: The method leads to noise reduction and concentrates relevance on the most important features compared to baseline methods. It was evaluated on images and genome sequences, showing efficacy in both data types.

Conclusion: The proposed sparsity-enforcing modification of layer-wise relevance propagation successfully improves explanation quality by reducing noise and focusing on relevant features, making it suitable for complex data types like genome sequences.

Abstract: Explainability is a key component in many applications involving deep neural networks (DNNs). However, current explanation methods for DNNs commonly leave it to the human observer to distinguish relevant explanations from spurious noise. This is no longer feasible when going from easily human-accessible data such as images to more complex data such as genome sequences. To facilitate the accessibility of DNN outputs from such complex data and to increase explainability, we present a modification of the widely used explanation method layer-wise relevance propagation. Our approach enforces sparsity directly by pruning the relevance propagation for the different layers. Thereby, we achieve sparser relevance attributions for the input features as well as for the intermediate layers. As the relevance propagation is input-specific, we aim to prune the relevance propagation rather than the underlying model architecture. This allows different neurons to be pruned for different inputs and hence may be better suited to the local nature of explanation methods. To demonstrate the efficacy of our method, we evaluate it on two types of data: images and genome sequences. We show that our modification indeed leads to noise reduction and concentrates relevance on the most important features compared to the baseline.
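
A minimal sketch of the pruning idea for a single linear layer, using the standard epsilon-rule LRP backward step and keeping only the top-k relevance messages per output neuron; this is illustrative, not the authors' implementation.

```python
import numpy as np

def lrp_epsilon_pruned(a, W, b, R_out, k, eps=1e-6):
    """a: layer input; W, b: weights/bias; R_out: relevance of outputs;
    keep only the k largest contributions per output neuron (k <= len(a))."""
    z = W @ a + b                         # forward pre-activations
    msgs = W * a[None, :]                 # contribution of input j to output i
    # prune: zero all but the k largest |contributions| per output neuron
    thresh = -np.sort(-np.abs(msgs), axis=1)[:, k - 1:k]
    msgs = np.where(np.abs(msgs) >= thresh, msgs, 0.0)
    # epsilon-stabilized redistribution of output relevance to the inputs
    R_in = (msgs / (z[:, None] + eps * np.sign(z)[:, None])).T @ R_out
    return R_in
```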

[488] Learning Fairer Representations with FairVIC

Charmaine Barker, Daniel Bethell, Dimitar Kazakov

Main category: cs.LG

TL;DR: FairVIC is a novel approach that improves fairness in neural networks by incorporating variance, invariance, and covariance terms into the loss function, achieving ~70% fairness improvements without accuracy loss.

DetailsMotivation: Addressing bias in automated decision-making systems is challenging due to nuanced fairness definitions, dataset-specific biases, and the fairness-accuracy trade-off in deep learning models.

Method: FairVIC integrates variance, invariance, and covariance terms into the neural network loss function during training, abstracting fairness concepts to minimize dependency on protected characteristics.

Result: FairVIC demonstrates significant improvements (≈70%) in fairness across all tested metrics on benchmark datasets without compromising accuracy, outperforming comparable bias mitigation techniques.

Conclusion: FairVIC offers a robust, generalizable solution for fair deep learning across diverse tasks and datasets by effectively balancing fairness and accuracy.

Abstract: Mitigating bias in automated decision-making systems, particularly in deep learning models, is a critical challenge due to nuanced definitions of fairness, dataset-specific biases, and the inherent trade-off between fairness and accuracy. To address these issues, we introduce FairVIC, an innovative approach that enhances fairness in neural networks by integrating variance, invariance, and covariance terms into the loss function during training. Unlike methods that rely on predefined fairness criteria, FairVIC abstracts fairness concepts to minimise dependency on protected characteristics. We evaluate FairVIC against comparable bias mitigation techniques on benchmark datasets, considering both group and individual fairness, and conduct an ablation study on the accuracy-fairness trade-off. FairVIC demonstrates significant improvements ($\approx 70\%$) in fairness across all tested metrics without compromising accuracy, thus offering a robust, generalisable solution for fair deep learning across diverse tasks and datasets.
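
The paper's exact loss terms are not spelled out here, so the following is a loose, VICReg-inspired reading of "variance, invariance, and covariance terms" for fairness; every term below is an assumption for illustration only.

```python
import torch

def fairvic_like_terms(pred, protected):
    """pred: (n,) model outputs; protected: (n,) binary protected attribute.
    Hypothetical terms, not the published FairVIC loss."""
    p = pred - pred.mean()
    a = protected.float() - protected.float().mean()
    cov = (p * a).mean() ** 2                            # decorrelate output from attribute
    groups = [pred[protected == v] for v in (0, 1)]
    inv = (groups[0].mean() - groups[1].mean()) ** 2     # match group means
    var = sum(torch.relu(1.0 - g.std()) for g in groups) # keep within-group variance
    return cov + inv + var
```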

[489] A Flow-Based Model for Conditional and Probabilistic Electricity Consumption Profile Generation and Prediction

Weijie Xia, Chenguang Wang, Peter Palensky, Pedro P. Vergara

Main category: cs.LG

TL;DR: FCPFlow is a flow-based generative model for residential load profile generation and probabilistic forecasting, featuring novel invertible layers for continuous condition handling and superior scalability.

DetailsMotivation: Address the need for accurate residential load profile generation and prediction as low-carbon technologies like PV and EVs become more prevalent in distribution networks.

Method: Proposes Full Convolutional Profile Flow (FCPFlow) with two new layers: invertible linear layer and invertible normalization layer, designed for both conditional and unconditional RLP generation.

Result: FCPFlow shows three main advantages: suitable for continuous conditions like weather and consumption variations, superior scalability across datasets, and better modeling of complex RLP correlations compared to other models.

Conclusion: FCPFlow is an effective flow-based generative model that outperforms traditional statistical and contemporary deep generative models for residential load profile tasks.

Abstract: Residential Load Profile (RLP) generation and prediction are critical for the operation and planning of distribution networks, especially as diverse low-carbon technologies (e.g., photovoltaic and electric vehicles) are increasingly adopted. This paper introduces a novel flow-based generative model, termed Full Convolutional Profile Flow (FCPFlow), which is uniquely designed for both conditional and unconditional RLP generation, and for probabilistic load forecasting. By introducing two new layers, the invertible linear layer and the invertible normalization layer, the proposed FCPFlow architecture shows three main advantages compared to traditional statistical and contemporary deep generative models: 1) it is well-suited for RLP generation under continuous conditions, such as varying weather and annual electricity consumption, 2) it demonstrates superior scalability in different datasets compared to traditional statistical models, and 3) it also demonstrates better modeling capabilities in capturing the complex correlation of RLPs compared with deep generative models.
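
Invertible linear layers of the kind the abstract names are commonly built with a PLU parameterization, which makes the inverse and log-determinant cheap. The sketch below shows that generic construction (it requires PyTorch >= 1.13 for torch.linalg.lu) and is not necessarily FCPFlow's exact layer.

```python
import torch

class InvertibleLinear(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        q = torch.linalg.qr(torch.randn(dim, dim))[0]   # random orthogonal init
        P, L, U = torch.linalg.lu(q)
        self.register_buffer("P", P)                    # fixed permutation
        self.L = torch.nn.Parameter(L)
        self.U = torch.nn.Parameter(U)

    def weight(self):
        L = torch.tril(self.L, -1) + torch.eye(self.L.shape[0])
        return self.P @ L @ torch.triu(self.U)

    def forward(self, x):                               # x: (n, dim)
        # log|det W| comes from U's diagonal since P, L have unit |det|
        logdet = torch.log(torch.abs(torch.diag(self.U))).sum()
        return x @ self.weight().T, logdet
```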

[490] Learning Confidence Bounds for Classification with Imbalanced Data

Matt Clifford, Jonathan Erskine, Alexander Hepburn, Raúl Santos-Rodríguez, Dario Garcia-Garcia

Main category: cs.LG

TL;DR: A novel framework using learning theory and concentration inequalities to address class imbalance in classification, overcoming limitations of traditional undersampling/oversampling methods by incorporating class-dependent uncertainty estimates directly into learning.

DetailsMotivation: Class imbalance causes biased models and unreliable predictions in classification tasks. Traditional solutions like undersampling and oversampling have inherent limitations - undersampling loses information while oversampling introduces additional biases.

Method: Proposes a framework that leverages learning theory and concentration inequalities to understand uncertainty in a class-dependent manner. Embeds confidence bounds directly into the learning process and incorporates class-dependent estimates to adapt to varying degrees of imbalance across classes.

Result: The method effectively adapts to varying class imbalance levels, resulting in more robust and reliable classification outcomes. Empirical results demonstrate a promising direction for handling imbalanced data.

Conclusion: The framework provides practitioners with a valuable tool for building more accurate and trustworthy models in imbalanced classification scenarios, overcoming limitations of traditional approaches.

Abstract: Class imbalance poses a significant challenge in classification tasks, where traditional approaches often lead to biased models and unreliable predictions. Undersampling and oversampling techniques have been commonly employed to address this issue, yet they suffer from inherent limitations stemming from their simplistic approach such as loss of information and additional biases respectively. In this paper, we propose a novel framework that leverages learning theory and concentration inequalities to overcome the shortcomings of traditional solutions. We focus on understanding the uncertainty in a class-dependent manner, as captured by confidence bounds that we directly embed into the learning process. By incorporating class-dependent estimates, our method can effectively adapt to the varying degrees of imbalance across different classes, resulting in more robust and reliable classification outcomes. We empirically show how our framework provides a promising direction for handling imbalanced data in classification tasks, offering practitioners a valuable tool for building more accurate and trustworthy models.
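
For intuition on class-dependent confidence bounds, a Hoeffding-style concentration bound already shows the key effect: rarer classes get wider intervals. The exact bounds used in the paper, and how they enter the loss, are not reproduced here.

```python
import numpy as np

def class_error_bounds(errors_per_class: dict[int, np.ndarray], delta=0.05):
    """errors_per_class[c]: 0/1 error indicators for class-c examples.
    Returns (empirical_error, upper_bound) per class."""
    bounds = {}
    for c, errs in errors_per_class.items():
        eps = np.sqrt(np.log(2 / delta) / (2 * len(errs)))  # Hoeffding radius
        bounds[c] = (errs.mean(), errs.mean() + eps)
    return bounds

rng = np.random.default_rng(0)
bounds = class_error_bounds({0: rng.integers(0, 2, 5000).astype(float),
                             1: rng.integers(0, 2, 50).astype(float)})
print(bounds)  # the minority class (1) gets a much wider bound
```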

[491] One protein is all you need

Anton Bushuiev, Roman Bushuiev, Olga Pimenova, Nikola Zadorozhny, Raman Samusevich, Elisabet Manaskova, Rachel Seongeun Kim, Hannes Stärk, Jiri Sedlar, Martin Steinegger, Tomáš Pluskal, Josef Sivic

Main category: cs.LG

TL;DR: ProteinTTT enables self-supervised customization of protein language models to individual target proteins at test time, improving generalization without requiring additional training data.

DetailsMotivation: General-purpose protein models struggle with specific proteins not covered in training data, while experimentalists need accurate predictions for individual proteins they study.

Method: Self-supervised test-time training that customizes protein language models to one target protein at a time, on the fly, without assuming additional data.

Result: Consistently enhances generalization across models and datasets; improves structure prediction for challenging targets; achieves SOTA on protein fitness prediction; enhances function prediction; improves antibody-antigen loop modeling; and enhances 19% of structures in the Big Fantastic Virus Database.

Conclusion: ProteinTTT delivers improved predictions where general-purpose models like AlphaFold2 and ESMFold struggle, enabling better performance on specific proteins of interest.

Abstract: Generalization beyond training data remains a central challenge in machine learning for biology. A common way to enhance generalization is self-supervised pre-training on large datasets. However, aiming to perform well on all possible proteins can limit a model’s capacity to excel on any specific one, whereas experimentalists typically need accurate predictions for individual proteins they study, often not covered in training data. To address this limitation, we propose a method that enables self-supervised customization of protein language models to one target protein at a time, on the fly, and without assuming any additional data. We show that our Protein Test-Time Training (ProteinTTT) method consistently enhances generalization across different models, their sizes, and datasets. ProteinTTT improves structure prediction for challenging targets, achieves new state-of-the-art results on protein fitness prediction, and enhances function prediction on two tasks. Through two challenging case studies, we also show that customization via ProteinTTT achieves more accurate antibody-antigen loop modeling and enhances 19% of structures in the Big Fantastic Virus Database, delivering improved predictions where general-purpose AlphaFold2 and ESMFold struggle.
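
The test-time-training recipe can be sketched generically: a few gradient steps of the model's own masked-LM objective on the single target sequence before prediction. All names below (mask id, model interface) are illustrative, not the ProteinTTT API.

```python
import torch

def test_time_train(model, token_ids, n_steps=30, mask_id=32, p_mask=0.15, lr=1e-4):
    """token_ids: (1, L) encoded target protein; model returns logits (1, L, V)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_steps):
        mask = torch.rand_like(token_ids, dtype=torch.float) < p_mask
        corrupted = token_ids.masked_fill(mask, mask_id)
        logits = model(corrupted)
        # reconstruct only the masked positions of the single target sequence
        loss = torch.nn.functional.cross_entropy(logits[mask], token_ids[mask])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```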

[492] FedMeld: A Model-dispersal Federated Learning Framework for Space-ground Integrated Networks

Qian Chen, Xianhao Chen, Kaibin Huang

Main category: cs.LG

TL;DR: FedMeld is a federated learning framework for space-ground integrated networks that uses satellite movement patterns and store-carry-forward capabilities to enable parameter mixing without infrastructure, achieving better accuracy and lower communication costs than traditional FL schemes.

DetailsMotivation: To overcome the limitations of existing space-ground integrated FL frameworks that require ground stations or costly inter-satellite links, which result in excessive training latency and communication costs.

Method: Proposes FedMeld framework based on model dispersal strategy, exploiting satellite movement patterns and store-carry-forward capabilities. Formulates joint optimization problem for staleness control and mixing ratio (SC-MR), decomposing it into sequential subproblems to derive optimal solutions.

Result: FedMeld achieves superior model accuracy while significantly reducing communication costs compared to traditional FL schemes for SGINs, as demonstrated through experiments using various datasets.

Conclusion: FedMeld provides an infrastructure-free FL framework that enables efficient global model training in space-ground integrated networks by leveraging satellite mobility patterns, achieving optimal latency-accuracy tradeoff.

Abstract: To bridge the digital divide, the space-ground integrated networks (SGINs), which will be a key component of the sixth-generation (6G) mobile networks, are expected to deliver artificial intelligence (AI) services to every corner of the world. One mission of SGINs is to support federated learning (FL) at a global scale. However, existing space-ground integrated FL frameworks involve ground stations or costly inter-satellite links, entailing excessive training latency and communication costs. To overcome these limitations, we propose an infrastructure-free federated learning framework based on a model dispersal (FedMeld) strategy, which exploits periodic movement patterns and store-carry-forward capabilities of satellites to enable parameter mixing across large-scale geographical regions. We theoretically show that FedMeld leads to global model convergence and quantify the effects of round interval and mixing ratio between adjacent areas on its learning performance. Based on the theoretical results, we formulate a joint optimization problem to design the staleness control and mixing ratio (SC-MR) for minimizing the training loss. By decomposing the problem into sequential SC and MR subproblems without compromising the optimality, we derive the round interval solution in a closed form and the mixing ratio in a semi-closed form to achieve the optimal latency-accuracy tradeoff. Experiments using various datasets demonstrate that FedMeld achieves superior model accuracy while significantly reducing communication costs as compared with traditional FL schemes for SGINs.
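
In the simplest reading, the store-carry-forward mixing at the heart of the method reduces to pairwise parameter mixing between adjacent regions; the fixed alpha below stands in for the mixing ratio that FedMeld actually optimizes.

```python
import numpy as np

def meld(w_region_a, w_region_b, alpha=0.3):
    """Pairwise parameter mixing between adjacent regions, carried out by a
    satellite passing over both; alpha is the mixing ratio (here fixed)."""
    mixed_a = (1 - alpha) * w_region_a + alpha * w_region_b
    mixed_b = (1 - alpha) * w_region_b + alpha * w_region_a
    return mixed_a, mixed_b
```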

[493] LLM Safety Alignment is Divergence Estimation in Disguise

Rajdeep Haldar, Ziyi Wang, Qifan Song, Guang Lin, Yue Xing

Main category: cs.LG

TL;DR: The paper presents a theoretical framework showing that LLM alignment methods like RLHF can be viewed as divergence estimators between aligned and unaligned distributions, explaining latent space separation between safe and harmful prompts.

DetailsMotivation: To provide a unified theoretical understanding of popular LLM alignment methods and explain why they create separation between safe and harmful content in the latent space.

Method: Proposed a general divergence framework, introduced KLDO (a KL divergence-based alignment method), and used compliance-refusal datasets instead of standard preference datasets. Also proposed a distance-based metric to quantify separation.

Result: Empirical validation shows KLDO is effective, and using compliance-refusal datasets leads to stronger separation and improved safety alignment. The distance-based metric serves as a statistically significant indicator for model safety.

Conclusion: The divergence framework provides theoretical grounding for LLM alignment methods, KLDO offers an effective alignment approach, and compliance-refusal datasets with distance metrics enhance safety evaluation and alignment.

Abstract: We present a theoretical framework showing that popular LLM alignment methods, including RLHF and its variants, can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance-refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.
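
A hedged sketch of a distance-based separation metric in the prompt representation space: distance between class centroids scaled by within-class spread. The paper's exact metric may differ.

```python
import numpy as np

def separation_score(safe_reps: np.ndarray, harmful_reps: np.ndarray) -> float:
    """safe_reps, harmful_reps: (n, d) prompt representations per class."""
    mu_s, mu_h = safe_reps.mean(axis=0), harmful_reps.mean(axis=0)
    spread = 0.5 * (safe_reps.std(axis=0).mean() + harmful_reps.std(axis=0).mean())
    return float(np.linalg.norm(mu_s - mu_h) / (spread + 1e-8))

# higher scores after alignment would indicate stronger latent-space
# separation between safe and harmful prompts
```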

[494] Asynchronous Federated Learning: A Scalable Approach for Decentralized Machine Learning

Ali Forootani, Raffaele Iervolino

Main category: cs.LG

TL;DR: Proposes Asynchronous Federated Learning (AFL) algorithm to overcome limitations of synchronous FL, enabling independent client updates with convergence guarantees despite delays and model staleness.

DetailsMotivation: Traditional FL faces scalability and efficiency issues due to synchronous client updates, causing delays and high communication overhead in heterogeneous environments.

Method: Develops AFL algorithm with convergence analysis using martingale difference sequence theory and variance bounds, handles client delays and model staleness, assumes strongly convex objectives.

Result: Establishes convergence bounds under random client sampling, demonstrates practical applicability with linear regression and SVM classifiers on non-IID data, outperforms synchronous FL.

Conclusion: AFL addresses key FL limitations, enhances scalability, robustness, and efficiency for real-world heterogeneous environments and resource-constrained applications.

Abstract: Federated Learning (FL) has emerged as a powerful paradigm for decentralized machine learning, enabling collaborative model training across diverse clients without sharing raw data. However, traditional FL approaches often face limitations in scalability and efficiency due to their reliance on synchronous client updates, which can result in significant delays and increased communication overhead, particularly in heterogeneous and dynamic environments. To address these challenges, in this paper we propose an Asynchronous Federated Learning (AFL) algorithm, which allows clients to update the global model independently and asynchronously. Our key contributions include a comprehensive convergence analysis of AFL in the presence of client delays and model staleness. By leveraging martingale difference sequence theory and variance bounds, we ensure robust convergence despite asynchronous updates. Assuming strongly convex local objective functions, we establish bounds on gradient variance under random client sampling and derive a recursion formula quantifying the impact of client delays on convergence. Furthermore, we demonstrate the practical applicability of the AFL algorithm by training decentralized linear regression and Support Vector Machine (SVM) based classifiers, comparing its results with a synchronous FL algorithm in effectively handling non-IID data distributed among clients. The proposed AFL algorithm addresses key limitations of traditional FL methods, such as inefficiency due to global synchronization and susceptibility to client drift. It enhances scalability, robustness, and efficiency in real-world settings with heterogeneous client populations and dynamic network conditions. Our results underscore the potential of AFL to drive advancements in distributed learning systems, particularly for large-scale, privacy-preserving applications in resource-constrained environments.
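
A minimal sketch of a staleness-aware asynchronous server update: each client update is applied on arrival, down-weighted by how stale it is. The weighting rule below is illustrative; the paper derives its own convergence-backed scheme.

```python
import numpy as np

def afl_server_step(global_w, client_w, client_round, server_round, eta=0.5):
    staleness = server_round - client_round
    alpha = eta / (1.0 + staleness)        # older updates count less
    return (1 - alpha) * global_w + alpha * client_w

w = np.zeros(10)
for t, (cw, cr) in enumerate([(np.ones(10), 0), (2 * np.ones(10), 0)], start=1):
    w = afl_server_step(w, cw, client_round=cr, server_round=t)
```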

[495] A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI

Kazusato Oko, Licong Lin, Yuhang Cai, Song Mei

Main category: cs.LG

TL;DR: The paper develops a theoretical framework explaining why contrastive pre-training works for multi-modal AI systems, introducing approximate sufficient statistics and showing how transformers can efficiently learn cross-modal representations.

DetailsMotivation: While contrastive pre-training is widely used in multi-modal AI systems, there's limited theoretical understanding of why it works so well for downstream tasks like zero-shot classification and vision-language models.

Method: Introduces approximate sufficient statistics concept and Joint Generative Hierarchical Model for image-text distributions. Shows transformers can approximate relevant functions via belief propagation and derives sample complexity guarantees.

Result: Theoretical framework demonstrates that near-minimizers of contrastive pre-training loss are approximately sufficient, making them adaptable to diverse downstream tasks. Numerical simulations validate strong generalization performance.

Conclusion: Contrastive pre-training provides theoretically grounded, efficient representations for multi-modal learning, with transformers capable of approximating complex cross-modal functions through the proposed hierarchical model.

Abstract: Multi-modal generative AI systems, such as those combining vision and language, rely on contrastive pre-training to learn representations across different modalities. While their practical benefits are widely acknowledged, a rigorous theoretical understanding of the contrastive pre-training framework remains limited. This paper develops a theoretical framework to explain the success of contrastive pre-training in downstream tasks, such as zero-shot classification, conditional diffusion models, and vision-language models. We introduce the concept of approximate sufficient statistics, a generalization of the classical sufficient statistics, and show that near-minimizers of the contrastive pre-training loss are approximately sufficient, making them adaptable to diverse downstream tasks. We further propose the Joint Generative Hierarchical Model for the joint distribution of images and text, showing that transformers can efficiently approximate relevant functions within this model via belief propagation. Building on this framework, we derive sample complexity guarantees for multi-modal learning based on contrastive pre-trained representations. Numerical simulations validate these theoretical findings, demonstrating the strong generalization performance of contrastively pre-trained transformers in various multi-modal tasks.
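
For concreteness, the object the theory studies is a contrastive pre-training loss of the CLIP/InfoNCE type; a standard implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, tau=0.07):
    """img_emb, txt_emb: (n, d) L2-normalized embeddings of matched pairs."""
    logits = img_emb @ txt_emb.T / tau          # (n, n) similarity matrix
    labels = torch.arange(logits.shape[0])      # matched pair i <-> i
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```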

[496] Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs

Nir Ailon, Akhiad Bercovich, Yahel Uffenheimer, Omri Weinstein

Main category: cs.LG

TL;DR: The paper proposes Strassen-Tile (STL), a GPU-native bilinear operator that replaces matrix multiplications in neural networks, offering a tradeoff between speed, accuracy, and parameter count with significantly fewer FLOPs.

DetailsMotivation: Modern AI faces scalability problems due to the computational demands of huge matrix multiplications (MatMuls) for inference and training. There's a need for more efficient alternatives.

Method: STL uses local learnable change-of-basis applied to tiles of weight and activation matrices, followed by element-wise product between tiles implemented via MatMul. The key innovation is optimizing the change-of-basis using theory-backed initializations inspired by fast matrix and polynomial multiplication.

Result: STL can approximate 4x4 MatMul of tiles while reducing FLOPs by a factor of 2.66. It improves Imagenet-1K accuracy of SoTA T2T-ViT-7 (4.3M parameters) while lowering FLOPs. Even with non-optimized PyTorch code, STL achieves wall-clock speedups in compute-bound regimes.

Conclusion: STL is a promising building block for scalable and cost-efficient AI, offering theoretical grounds and practical benefits for reducing computational demands while maintaining or improving performance.

Abstract: Modern AI relies on huge matrix multiplications (MatMuls), whose computation poses a scalability problem for inference and training. We propose an alternative, GPU-native bilinear operator to replace MatMuls in neural networks, which offers a three-way tradeoff between speed, accuracy, and parameter count. In particular, this operator requires substantially fewer FLOPs to evaluate ($\ll n^3$), yet increases the parameter count compared to MatMul ($\gg n^2$). We call this operator Strassen-Tile (STL). The key idea behind STL is a local learnable change-of-basis, applied on tiles of the weight and activation matrices, followed by an element-wise product between the tiles, implemented simultaneously via MatMul. The key technical question we study is how to optimize the change-of-basis of a given layer, which is a highly non-convex problem. We show that theory-backed initializations (inspired by fast matrix and polynomial multiplication) lead to substantially better accuracy than random SGD initialization. This phenomenon motivates further algorithmic study of STL optimization in DNNs. Our experiments demonstrate that STL can approximate 4x4 MatMul of tiles while reducing FLOPs by a factor of 2.66, and can improve Imagenet-1K accuracy of SoTA T2T-ViT-7 (4.3M parameters) while lowering FLOPs. Even with non-CUDA optimized PyTorch code, STL achieves wall-clock speedups in the compute-bound regime. These results, together with its theoretical grounding, suggest STL as a promising building block for scalable and cost-efficient AI.
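
The bilinear structure of STL for one pair of tiles is the classical Strassen-like form: encode both tiles with change-of-basis matrices, take an element-wise product, and decode. With trained (rather than the random placeholder) bases this can approximate the true tile product; r = 24 below matches the quoted 2.66x FLOP reduction over the 64 multiplications of a naive 4x4 tile MatMul.

```python
import torch

def stl_tile(x_tile, w_tile, U, V, W):
    """x_tile, w_tile: flattened (t*t,) tiles; U, V: (r, t*t) encoders;
    W: (t*t, r) decoder. The element-wise product replaces the tile MatMul."""
    return W @ ((U @ x_tile) * (V @ w_tile))

t, r = 4, 24            # 24 products per tile pair vs. t^3 = 64 for naive 4x4
U, V = torch.randn(r, t * t), torch.randn(r, t * t)   # learnable in practice
W = torch.randn(t * t, r)
out = stl_tile(torch.randn(t * t), torch.randn(t * t), U, V, W)
```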

[497] Can We Validate Counterfactual Estimations in the Presence of General Network Interference?

Sadegh Shirani, Yuwei Luo, William Overman, Ruoxuan Xiong, Mohsen Bayati

Main category: cs.LG

TL;DR: A framework for causal inference in network experiments using distribution-preserving network bootstrap and counterfactual cross-validation to handle interference and enable rigorous model validation.

DetailsMotivation: Network interference in randomized experiments makes causal effect estimation and validation challenging, as treatments affect multiple units and outcomes are only observable under single treatment scenarios with complex correlations.

Method: Introduces distribution-preserving network bootstrap to generate statistically-valid subpopulations from single experiment data, and counterfactual cross-validation for model selection and evaluation. Extends causal message-passing with heterogeneous characteristics and local interactions.

Result: Extensive testing across diverse experimental environments (AI agent networks, ride-sharing) demonstrates robustness to various forms of network interference. Method provides reliable finite-sample performance through non-asymptotic analysis.

Conclusion: The framework successfully addresses network interference challenges in causal inference, enabling rigorous estimation and validation through novel bootstrap and cross-validation techniques that work across diverse real-world scenarios.

Abstract: Randomized experiments have become a cornerstone of evidence-based decision-making in contexts ranging from online platforms to public health. However, in experimental settings with network interference, a unit’s treatment can influence outcomes of other units, challenging both causal effect estimation and its validation. Classic validation approaches fail as outcomes are only observable under a single treatment scenario and exhibit complex correlation patterns due to interference. To address these challenges, we introduce a framework that facilitates the use of machine learning tools for both estimation and validation in causal inference. Central to our approach is the new distribution-preserving network bootstrap, a theoretically-grounded technique that generates multiple statistically-valid subpopulations from a single experiment’s data. This amplification of experimental samples enables our second contribution: a counterfactual cross-validation procedure. This procedure adapts the principles of model validation to the unique constraints of causal settings, providing a rigorous, data-driven method for selecting and evaluating estimators. We extend recent causal message-passing developments by incorporating heterogeneous unit-level characteristics and varying local interactions, ensuring reliable finite-sample performance through non-asymptotic analysis. Additionally, we develop and publicly release a comprehensive benchmark toolbox featuring diverse experimental environments, from networks of interacting AI agents to ride-sharing applications. These environments provide known ground truth values while maintaining realistic complexities, enabling systematic evaluation of causal inference methods. Extensive testing across these environments demonstrates our method’s robustness to diverse forms of network interference.

[498] Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling

Michal Balcerak, Tamaz Amiranashvili, Antonio Terpin, Suprosanna Shit, Lea Bogensperger, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze

Main category: cs.LG

TL;DR: Energy Matching is a framework that combines flow-based generative models with energy-based model flexibility, using a single scalar field to guide samples from noise to data while capturing likelihood structure.

DetailsMotivation: Current generative models (flows/scores) cannot readily integrate partial observations and priors, while EBMs can but have limitations. The goal is to give flow-based approaches EBM flexibility.

Method: Parameterizes dynamics with a single time-independent scalar field that serves as both generator and prior. Samples move from noise to data along optimal transport paths, then guided by entropic energy term into Boltzmann distribution.

Result: Substantially outperforms existing EBMs on CIFAR-10 and ImageNet generation in fidelity, while retaining simulation-free training. Also demonstrated in protein generation with interaction energy for diverse mode exploration.

Conclusion: The simplified formulation without time conditioning, auxiliary generators, or additional networks significantly advances EBM capabilities and enables wider adoption in diverse domains.

Abstract: Current state-of-the-art generative models map noise to data distributions by matching flows or scores. A key limitation of these models is their inability to readily integrate available partial observations and additional priors. In contrast, energy-based models (EBMs) address this by incorporating corresponding scalar energy terms. Here, we propose Energy Matching, a framework that endows flow-based approaches with the flexibility of EBMs. Far from the data manifold, samples move from noise to data along irrotational, optimal transport paths. As they approach the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium distribution, explicitly capturing the underlying likelihood structure of the data. We parameterize these dynamics with a single time-independent scalar field, which serves as both a powerful generator and a flexible prior for effective regularization of inverse problems. The present method substantially outperforms existing EBMs on CIFAR-10 and ImageNet generation in terms of fidelity, while retaining simulation-free training of transport-based approaches away from the data manifold. Furthermore, we leverage the flexibility of the method to introduce an interaction energy that supports the exploration of diverse modes, which we demonstrate in a controlled protein generation setting. This approach learns a scalar potential energy, without time conditioning, auxiliary generators, or additional networks, marking a significant departure from recent EBM methods. We believe this simplified yet rigorous formulation significantly advances EBMs capabilities and paves the way for their wider adoption in generative modeling in diverse domains.
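For intuition, the sampling dynamics described above amount to following the gradient of a single learned scalar field. Below is a minimal, illustrative sketch (not the paper's implementation) of drawing samples from a time-independent energy network via gradient flow with small Langevin-style noise; `energy_net` and all hyperparameters are assumptions.

```python
import torch

def sample_from_energy(energy_net, n_samples, dim, n_steps=500,
                       step_size=1e-2, noise_scale=1e-2):
    """Follow -grad E(x) from noise toward the data manifold; the small
    injected noise lets the chain settle into a Boltzmann-like
    equilibrium near the data, as the abstract describes."""
    x = torch.randn(n_samples, dim)  # initialize from noise
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        energy = energy_net(x).sum()              # scalar field E(x)
        grad = torch.autograd.grad(energy, x)[0]  # dE/dx
        x = x - step_size * grad + noise_scale * torch.randn_like(x)
    return x.detach()
```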

[499] Beyond Benign Overfitting in Nadaraya-Watson Interpolators

Daniel Barzilai, Guy Kornowski, Ohad Shamir

Main category: cs.LG

TL;DR: The paper analyzes the Nadaraya-Watson interpolating estimator and shows it exhibits multiple overfitting behaviors (catastrophic, benign, tempered) depending on bandwidth choice, with over-estimating data dimension being less harmful than under-estimating.

DetailsMotivation: To understand the generalization behavior of interpolating predictors that overfit noisy training data, particularly examining how classical methods like NW estimator exhibit complex generalization patterns.

Method: Theoretical analysis of the Nadaraya-Watson interpolating estimator by varying a bandwidth hyperparameter, complemented by numerical experiments.

Result: Proved existence of multiple overfitting behaviors ranging non-monotonically from catastrophic to benign to tempered; showed over-estimating intrinsic dimension is less harmful than under-estimating.

Conclusion: Even classical interpolating methods can exhibit intricate generalization behaviors, and careful hyperparameter tuning is crucial, with dimension over-estimation being preferable.

Abstract: In recent years, there has been much interest in understanding the generalization behavior of interpolating predictors, which overfit on noisy training data. Whereas standard analyses are concerned with whether a method is consistent or not, recent observations have shown that even inconsistent predictors can generalize well. In this work, we revisit the classic interpolating Nadaraya-Watson (NW) estimator (also known as Shepard’s method), and study its generalization capabilities through this modern viewpoint. In particular, by varying a single bandwidth-like hyperparameter, we prove the existence of multiple overfitting behaviors, ranging non-monotonically from catastrophic, through benign, to tempered. Our results highlight how even classical interpolating methods can exhibit intricate generalization behaviors. In addition, for the purpose of tuning the hyperparameter, the results suggest that over-estimating the intrinsic dimension of the data is less harmful than under-estimating it. Numerical experiments complement our theory, demonstrating the same phenomena.
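For concreteness, the interpolating NW (Shepard) estimator studied here fits in a few lines; the inverse-distance exponent `gamma` below stands in for the paper's bandwidth-like hyperparameter (a hedged sketch, not necessarily the authors' exact parameterization).

```python
import numpy as np

def nw_shepard_predict(X_train, y_train, X_test, gamma):
    """Interpolating Nadaraya-Watson / Shepard estimator with
    inverse-distance weights ||x - x_i||^(-gamma)."""
    preds = np.empty(len(X_test))
    for j, x in enumerate(X_test):
        d = np.linalg.norm(X_train - x, axis=1)
        if np.any(d == 0):                     # at a training point the
            preds[j] = y_train[d == 0].mean()  # estimator interpolates
            continue
        w = d ** (-gamma)
        preds[j] = w @ y_train / w.sum()
    return preds
```

Varying `gamma` is what moves the estimator between the catastrophic, benign, and tempered overfitting regimes the paper characterizes.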

[500] In-Context Learning of Linear Dynamical Systems with Transformers: Approximation Bounds and Depth-Separation

Frank Cole, Yuxuan Zhao, Yulong Lu, Tianhao Zhang

Main category: cs.LG

TL;DR: Transformers with logarithmic depth can approximate noisy linear dynamical systems as well as least-squares estimators, while single-layer linear transformers have fundamental limitations, revealing a depth-separation phenomenon.

DetailsMotivation: To understand the approximation-theoretic capabilities of transformers in in-context learning of noisy linear dynamical systems, particularly examining how depth affects their performance.

Method: Theoretical analysis establishing upper bounds on approximation error for multi-layer transformers and lower bounds for single-layer linear transformers, comparing performance across different data distributions (IID vs non-IID).

Result: Multi-layer transformers with logarithmic depth achieve error bounds comparable to least-squares estimators, while single-layer linear transformers have non-diminishing approximation errors, showing depth-separation and revealing different approximation power for IID vs non-IID data.

Conclusion: Transformer depth is crucial for in-context learning of dynamical systems, with multi-layer transformers achieving optimal performance while single-layer transformers have fundamental limitations, highlighting the importance of architecture depth and data distribution characteristics.

Abstract: This paper investigates approximation-theoretic aspects of the in-context learning capability of the transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the approximation error of multi-layer transformers with respect to an $L^2$-testing loss uniformly defined across tasks. This result demonstrates that transformers with logarithmic depth can achieve error bounds comparable with those of the least-squares estimator. In contrast, our second result establishes a non-diminishing lower bound on the approximation error for a class of single-layer linear transformers, which suggests a depth-separation phenomenon for transformers in the in-context learning of dynamical systems. Moreover, this second result uncovers a critical distinction in the approximation power of single-layer linear transformers when learning from IID versus non-IID data.
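The least-squares baseline that the transformer error bounds are compared against is simple to state; here is a sketch for a single noisy trajectory (shapes and noise model are assumptions).

```python
import numpy as np

def fit_linear_dynamics(X):
    """Least-squares estimate of A in x_{t+1} = A x_t + noise from one
    trajectory X of shape (T, d) -- the classical baseline the
    transformers' in-context error bounds are compared against."""
    X_past, X_next = X[:-1], X[1:]
    # Solve min_A ||X_next - X_past @ A.T||_F^2 via least squares.
    A_hat_T, *_ = np.linalg.lstsq(X_past, X_next, rcond=None)
    return A_hat_T.T
```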

[501] Enhancing Sample Selection Against Label Noise by Cutting Mislabeled Easy Examples

Suqin Yuan, Lei Feng, Bo Han, Tongliang Liu

Main category: cs.LG

TL;DR: The paper proposes Early Cutting, a method that uses the model’s later training state to re-select confident samples identified early in training, effectively filtering out harmful Mislabeled Easy Examples (MEEs) to improve sample selection in learning with noisy labels.

DetailsMotivation: Existing sample selection methods overlook that not all mislabeled examples harm model performance equally. The authors found that mislabeled examples correctly predicted by the model early in training (MEEs) are particularly harmful to performance.

Method: Early Cutting introduces a recalibration step that employs the model’s later training state to re-select the confident subset identified early in training, avoiding misleading confidence from early learning and effectively filtering out MEEs.

Result: Experiments on CIFAR, WebVision, and full ImageNet-1k datasets demonstrate that the method effectively improves sample selection and model performance by reducing MEEs.

Conclusion: The proposed Early Cutting method successfully addresses the problem of Mislabeled Easy Examples by leveraging later training states for sample re-selection, leading to improved performance in learning with noisy labels.

Abstract: Sample selection is a prevalent approach in learning with noisy labels, aiming to identify confident samples for training. Although existing sample selection methods have achieved decent results by reducing the noise rate of the selected subset, they often overlook that not all mislabeled examples harm the model’s performance equally. In this paper, we demonstrate that mislabeled examples correctly predicted by the model early in the training process are particularly harmful to model performance. We refer to these examples as Mislabeled Easy Examples (MEEs). To address this, we propose Early Cutting, which introduces a recalibration step that employs the model’s later training state to re-select the confident subset identified early in training, thereby avoiding misleading confidence from early learning and effectively filtering out MEEs. Experiments on the CIFAR, WebVision, and full ImageNet-1k datasets demonstrate that our method effectively improves sample selection and model performance by reducing MEEs.
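One plausible reading of the recalibration step is sketched below under assumed names (`early_conf_idx`, `late_model`, `threshold` are all illustrative): examples the early model was confident about are kept only if the model's later training state still supports their labels; the paper's exact selection rule may differ.

```python
import torch

@torch.no_grad()
def early_cutting_select(early_conf_idx, late_model, dataset, labels,
                         threshold=0.5):
    """Hedged sketch of the recalibration step: keep an example that the
    *early* model was confident about only if the *later* model still
    assigns its (possibly noisy) label reasonable probability. Examples
    the late model firmly rejects are treated as suspected Mislabeled
    Easy Examples (MEEs) and cut."""
    keep = []
    for i in early_conf_idx:
        x, y = dataset[i], labels[i]
        prob = torch.softmax(late_model(x.unsqueeze(0)), dim=-1)[0, y]
        if prob >= threshold:
            keep.append(i)
    return keep
```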

[502] How Transformers Learn In-Context Recall Tasks? Optimality, Training Dynamics and Generalization

Quan Nguyen, Thanh Nguyen-Tang

Main category: cs.LG

TL;DR: This paper analyzes transformers’ approximation capabilities, convergence rates, and out-of-distribution generalization for in-context recall tasks, showing they achieve Bayes-optimal performance with linear convergence rates.

DetailsMotivation: Existing theoretical work only examines transformers after one gradient descent step, leaving gaps in understanding convergence behavior, convergence speed, and generalization capabilities of transformers trained on in-context recall tasks.

Method: Theoretical analysis of transformers with linear, ReLU or softmax attentions trained with gradient descent on in-context recall tasks, supported by extensive empirical validations.

Result: Transformers achieve Bayes-optimal performance for in-context recall tasks, converge at linear rates to Bayes risks, and exhibit out-of-distribution generalization. Without proper parameterization, models with larger expressive power fail to generalize OOD.

Conclusion: Transformers are provably effective for in-context recall tasks with strong convergence properties and generalization capabilities, but proper parameterization is crucial for achieving out-of-distribution generalization.

Abstract: We study the approximation capabilities, convergence speeds and on-convergence behaviors of transformers trained on in-context recall tasks – which require recognizing the \emph{positional} association between a pair of tokens from in-context examples. Existing theoretical results only focus on the in-context reasoning behavior of transformers after being trained for \emph{one} gradient descent step. It remains unclear what the on-convergence behavior of transformers trained by gradient descent is, and how fast they converge. In addition, the generalization of transformers in one-step in-context reasoning has not been formally investigated. This work addresses these gaps. We first show that a class of transformers with either linear, ReLU or softmax attentions is provably Bayes-optimal for an in-context recall task. When trained with gradient descent, we show via a finite-sample analysis that the expected loss converges at a linear rate to the Bayes risk. Moreover, we show that the trained transformers exhibit out-of-distribution (OOD) generalization, i.e., generalizing to samples outside of the population distribution. Our theoretical findings are further supported by extensive empirical validations, showing that \emph{without} proper parameterization, models with larger expressive power surprisingly \emph{fail} to generalize OOD after being trained by gradient descent.

[503] CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs

A. Chervov, A. Soibelman, S. Lytkin, I. Kiselev, S. Fironov, A. Lukyanenko, A. Dolgorukova, A. Ogurtsov, F. Petrov, S. Krymskii, M. Evseev, L. Grunvald, D. Gorodkov, G. Antiufeev, G. Verbii, V. Zamkovoy, L. Cheldieva, I. Koltsov, A. Sychev, M. Obozov, A. Eliseev, S. Nikolenko, N. Narynbaev, R. Turtayev, N. Rokotyan, S. Kovalev, A. Rozanov, V. Nelin, S. Ermilov, L. Shishina, D. Mamayeva, A. Korolkova, K. Khoruzhii, A. Romanov

Main category: cs.LG

TL;DR: This paper presents an AI-based approach combining reinforcement learning with diffusion distance for pathfinding on large Cayley graphs, achieving better performance than classical methods like GAP and providing mathematical insights including diameter bounds and conjectures for symmetric groups.

DetailsMotivation: To develop efficient AI approaches for pathfinding on extremely large graphs (up to 10^70 nodes) with applications to Cayley graphs and mathematical problems, overcoming limitations of classical computer algebra systems.

Method: Combines reinforcement learning with diffusion distance approach from previous work, using neural network architectures, random walk generators, and beam search pathfinding. Applied to Cayley graphs of symmetric groups with cyclic shift and transposition generators.

Result: Outperformed classical GAP system, provided strong support for OEIS-A186783 conjecture (diameter = n(n-1)/2), proved diameter bounds (lower: n(n-1)/2-n/2, upper: n(n-1)/2+3n), identified longest elements, and generated conjectures about distribution patterns and graph spectrum.

Conclusion: The AI-based approach successfully handles large-scale Cayley graph pathfinding, provides mathematical insights and bounds, and establishes a framework for collaborative research through Kaggle challenges to further improve methods.

Abstract: This paper is the second in a series of studies on developing efficient artificial intelligence-based approaches to pathfinding on extremely large graphs (e.g. $10^{70}$ nodes) with a focus on Cayley graphs and mathematical applications. The open-source CayleyPy project is a central component of our research. The present paper proposes a novel combination of a reinforcement learning approach with a more direct diffusion distance approach from the first paper. Our analysis includes benchmarking various choices for the key building blocks of the approach: architectures of the neural network, generators for the random walks and beam search pathfinding. We compared these methods against the classical computer algebra system GAP, demonstrating that they “overcome the GAP” for the considered examples. As a particular mathematical application we examine the Cayley graph of the symmetric group with cyclic shift and transposition generators. We provide strong support for the OEIS-A186783 conjecture that the diameter is equal to n(n-1)/2 by machine learning and mathematical methods. We identify the conjectured longest element and generate its decomposition of the desired length. We prove a diameter lower bound of n(n-1)/2-n/2 and an upper bound of n(n-1)/2 + 3n by presenting an algorithm with the given complexity. We also present several conjectures motivated by numerical experiments, including observations on the central limit phenomenon (with growth approximated by a Gumbel distribution), the uniform distribution for the spectrum of the graph, and a numerical study of sorting networks. To stimulate crowdsourcing activity, we create challenges on the Kaggle platform and invite contributions to improve and benchmark approaches on Cayley graph pathfinding and other tasks.

[504] MetaBox-v2: A Unified Benchmark Platform for Meta-Black-Box Optimization

Zeyuan Ma, Yue-Jiao Gong, Hongshu Guo, Wenjie Qiu, Sijie Ma, Hongqiao Lian, Jiajun Zhan, Kaixu Chen, Chen Wang, Zhiyang Huang, Zechuan Huang, Guojun Peng, Ran Cheng, Yining Ma

Main category: cs.LG

TL;DR: MetaBox-v2 is a major upgrade to the MetaBBO framework that supports RL, evolutionary, and gradient-based approaches, offers efficient parallelization, comprehensive benchmarks, and extensible interfaces for optimization algorithm design automation.

DetailsMotivation: The original MetaBox framework had limited scope and couldn't keep up with rapid advancements in meta-black-box optimization, necessitating a more comprehensive and efficient framework.

Method: Developed a unified architecture supporting multiple approaches (RL, evolutionary, gradient-based), implemented efficient parallelization schemes, created comprehensive benchmarks with 18 tasks across various optimization scenarios, and provided extensible interfaces.

Result: Successfully reproduced 23 up-to-date baselines, achieved 10-40x reduction in training/testing time, and demonstrated utility through systematic case studies evaluating optimization performance, generalization, and learning efficiency.

Conclusion: MetaBox-v2 provides valuable insights for practitioners and newcomers, serving as a milestone upgrade that addresses the limitations of previous frameworks and supports the growing needs of meta-black-box optimization research.

Abstract: Meta-Black-Box Optimization (MetaBBO) streamlines the automation of optimization algorithm design through meta-learning. It typically employs a bi-level structure: the meta-level policy undergoes meta-training to reduce the manual effort required in developing algorithms for low-level optimization tasks. The original MetaBox (2023) provided the first open-source framework for reinforcement learning-based single-objective MetaBBO. However, its relatively narrow scope no longer keeps pace with the swift advancement in this field. In this paper, we introduce MetaBox-v2 (https://github.com/MetaEvo/MetaBox) as a milestone upgrade with four novel features: 1) a unified architecture supporting RL, evolutionary, and gradient-based approaches, by which we reproduce $23$ up-to-date baselines; 2) efficient parallelization schemes, which reduce the training/testing time by $10-40$x; 3) a comprehensive benchmark suite of $18$ synthetic/realistic tasks ($1900$+ instances) spanning single-objective, multi-objective, multi-model, and multi-task optimization scenarios; 4) plentiful and extensible interfaces for custom analysis/visualization and integration with external optimization tools/benchmarks. To show the utility of MetaBox-v2, we carry out a systematic case study that evaluates the built-in baselines in terms of optimization performance, generalization ability and learning efficiency. Valuable insights for practitioners and those new to the field are drawn from thorough and detailed analysis.

[505] In-Context Learning of Stochastic Differential Equations with Foundation Inference Models

Patrick Seifner, Kostadin Cvejoski, David Berghaus, Cesar Ojeda, Ramses J. Sanchez

Main category: cs.LG

TL;DR: FIM-SDE is a pretrained model that provides accurate in-context estimation of drift and diffusion functions for SDEs from noisy time series data, requiring no prior knowledge and allowing rapid finetuning.

DetailsMotivation: Current SDE function estimation methods either require extensive prior knowledge of dynamics or involve complex training procedures, limiting their practical application across sciences.

Method: Leveraging amortized inference and neural operators, FIM-SDE is pretrained in supervised fashion to map noisy SDE paths to drift/diffusion functions, enabling in-context estimation and rapid finetuning.

Result: FIM-SDE achieves robust in-context estimation across synthetic and real-world processes (double-well dynamics, Lorenz attractors, stock prices, oil/wind data), matching specialized baselines trained on target data.

Conclusion: FIM-SDE provides effective zero-shot SDE function estimation and consistently outperforms all baselines when finetuned, offering a practical solution for dynamical system analysis.

Abstract: Stochastic differential equations (SDEs) describe dynamical systems where deterministic flows, governed by a drift function, are superimposed with random fluctuations, dictated by a diffusion function. The accurate estimation (or discovery) of these functions from data is a central problem in machine learning, with wide application across the natural and social sciences. Yet current solutions either rely heavily on prior knowledge of the dynamics or involve intricate training procedures. We introduce FIM-SDE (Foundation Inference Model for SDEs), a pretrained recognition model that delivers accurate in-context (or zero-shot) estimation of the drift and diffusion functions of low-dimensional SDEs, from noisy time series data, and allows rapid finetuning to target datasets. Leveraging concepts from amortized inference and neural operators, we (pre)train FIM-SDE in a supervised fashion to map a large set of noisy, discretely observed SDE paths onto the space of drift and diffusion functions. We demonstrate that FIM-SDE achieves robust in-context function estimation across a wide range of synthetic and real-world processes – from canonical SDE systems (e.g., double-well dynamics or weakly perturbed Lorenz attractors) to stock price recordings and oil-price and wind-speed fluctuations – while matching the performance of symbolic, Gaussian process and Neural SDE baselines trained on the target datasets. When finetuned to the target processes, we show that FIM-SDE consistently outperforms all these baselines.
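To make the data modality concrete, the input FIM-SDE consumes is a discretely observed SDE path. A standard Euler-Maruyama simulator for the double-well system mentioned in the abstract is sketched below (illustrative, not the authors' data pipeline).

```python
import numpy as np

def simulate_sde(drift, diffusion, x0, dt=1e-3, n_steps=5000, rng=None):
    """Euler-Maruyama simulation of dX = f(X) dt + g(X) dW -- the kind of
    noisy, discretely observed path FIM-SDE is pretrained to invert back
    into drift and diffusion functions."""
    if rng is None:
        rng = np.random.default_rng(0)
    path = np.empty(n_steps + 1)
    path[0] = x0
    for t in range(n_steps):
        x = path[t]
        path[t + 1] = (x + drift(x) * dt
                       + diffusion(x) * np.sqrt(dt) * rng.standard_normal())
    return path

# Canonical double-well example from the paper's evaluation suite:
# drift is the negative gradient of the potential x^4/4 - x^2/2.
path = simulate_sde(drift=lambda x: x - x**3, diffusion=lambda x: 0.5, x0=0.0)
```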

[506] GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection

Pingbang Hu, Joseph Melkonian, Weijing Tang, Han Zhao, Jiaqi W. Ma

Main category: cs.LG

TL;DR: GraSS is a gradient compression algorithm that leverages the sparsity of per-sample gradients to achieve sub-linear space and time complexity for data attribution methods like influence functions, enabling faster throughput on large models.

DetailsMotivation: Gradient-based data attribution methods are computationally expensive due to per-sample gradient computation, limiting their scalability for large models.

Method: Proposed GraSS algorithm and its variant FactGraSS for linear layers that exploit inherent sparsity in per-sample gradients through compression techniques.

Result: Achieved substantial speedups while preserving data influence fidelity, with FactGraSS achieving up to 165% faster throughput on billion-scale models compared to state-of-the-art baselines.

Conclusion: GraSS provides an efficient solution for scalable data attribution by reducing computational and memory costs through gradient compression.

Abstract: Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm and its variants FactGraSS for linear layers specifically, that explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. Our code is publicly available at https://github.com/TRAIS-Lab/GraSS.
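A minimal sketch of the sparsify-then-project recipe the abstract describes: keep each per-sample gradient's largest entries, then compress with a random sign projection. The concrete sparsification and projection operators in GraSS/FactGraSS differ; everything here is illustrative.

```python
import torch

def grass_style_compress(per_sample_grads, k, proj_dim, seed=0):
    """Top-k sparsify per-sample gradients (shape (n, d)), then map into
    a small proj_dim-dimensional space with a random sign projection."""
    n, d = per_sample_grads.shape
    _, idx = per_sample_grads.abs().topk(k, dim=1)          # sparsify
    sparse = torch.zeros_like(per_sample_grads).scatter_(
        1, idx, per_sample_grads.gather(1, idx))
    g = torch.Generator().manual_seed(seed)
    proj = torch.randint(0, 2, (d, proj_dim), generator=g).float() * 2 - 1
    proj /= proj_dim ** 0.5                                 # JL-style scaling
    # Dense matmul for clarity; a real implementation would only touch
    # the k nonzero coordinates per sample, giving sub-linear cost.
    return sparse @ proj                                    # (n, proj_dim)
```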

[507] Reinforcement Learning with Verifiable Rewards: GRPO’s Effective Loss, Dynamics, and Success Amplification

Youssef Mroueh

Main category: cs.LG

TL;DR: GRPO is analyzed as inducing a weighted contrastive loss with synthetic data from previous policy. Different variants with reward normalization and KL regularization forms are studied, showing explicit optimal policy forms and convergence to success probability fixed points.

DetailsMotivation: To understand how GRPO promotes reasoning in LLMs under verifiable binary rewards and analyze different regularization approaches for policy optimization.

Method: Analyze GRPO variants differing in reward normalization (mean-only vs mean+variance) and KL regularization forms (divergence from previous model, fixed reference model, or both). Derive explicit optimal policy forms and study convergence properties.

Result: The optimal policy has explicit form in terms of binary reward statistics and previous/reference policies. The sequence converges to a fixed point for probability of success that exceeds the reference policy’s success probability.

Conclusion: GRPO amplifies the policy’s probability of success through the analyzed regularization mechanisms, with the fixed point exceeding the reference policy’s performance.

Abstract: Group Relative Policy Optimization (GRPO) was introduced and used recently for promoting reasoning in LLMs under verifiable (binary) rewards. We show that the mean + variance calibration of these rewards induces a weighted contrastive loss in which the contrastive samples are synthetic data drawn from the previous policy. While GRPO was originally paired with clipping to keep updates near the old policy, we analyze variants that differ in reward normalization (mean-only vs mean + variance) and in how they regularize updates using KL divergence: either penalizing divergence from the previous model (mirror), penalizing divergence from a fixed reference model $\pi_{\mathrm{ref}}$, or combining both forms of regularization. For each, the optimal policy $\pi_n$ admits an explicit form in terms of the binary reward and the first and second order statistics of the reward under $\pi_{n-1}$, as well as the policies $\pi_{n-1}$ and $\pi_{\mathrm{ref}}$. Iterating results in a sequence ${\pi_n}$ whose probability of success (PoS) obeys a simple recurrence that converges to a fixed point determined by the reference PoS and the regularization strength. We further show that this fixed point exceeds the reference, demonstrating that GRPO amplifies the policy’s probability of success.
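The mean + variance calibration that induces the weighted contrastive loss is easy to write out for binary rewards. With a group of $G$ sampled responses and empirical success rate $p$, the normalized advantage is (a standard derivation consistent with the abstract):

```latex
A_i \;=\; \frac{r_i - p}{\sqrt{p(1-p)}} \;=\;
\begin{cases}
\sqrt{\dfrac{1-p}{p}}, & r_i = 1,\\[6pt]
-\sqrt{\dfrac{p}{1-p}}, & r_i = 0,
\end{cases}
\qquad p = \frac{1}{G}\sum_{i=1}^{G} r_i .
```

Correct responses are thus up-weighted exactly when success is rare, which is the contrastive reweighting the paper analyzes; the fixed-point recurrence for the probability of success follows from iterating the resulting optimal-policy update.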

[508] REOrdering Patches Improves Vision Models

Declan Kutscher, David M. Chan, Yutong Bai, Trevor Darrell, Ritwik Gupta

Main category: cs.LG

TL;DR: REOrder is a framework that discovers optimal patch orderings for vision transformers, improving accuracy by learning task-specific sequences rather than using fixed row-major ordering.

DetailsMotivation: Modern transformers with long-sequence approximations are sensitive to patch ordering, and simple alternatives like column-major or Hilbert curves show significant performance variations, indicating the importance of finding optimal patch sequences.

Method: Two-stage framework: 1) Information-theoretic prior based on patch sequence compressibility, 2) Learning permutation policy using Plackett-Luce policy optimized with REINFORCE for efficient combinatorial learning.

Result: Improves top-1 accuracy by up to 3.01% on ImageNet-1K and 13.35% on Functional Map of the World compared to standard row-major ordering.

Conclusion: Patch ordering significantly impacts transformer performance, and REOrder’s learned task-optimal orderings provide substantial accuracy improvements over conventional fixed orderings.

Abstract: Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
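The Plackett-Luce policy over permutations can be sampled with the Gumbel trick, and its log-probability is differentiable in the patch scores, which is what makes REINFORCE applicable. A hedged sketch (function and variable names are assumptions):

```python
import torch

def sample_plackett_luce(scores):
    """Sample a permutation from a Plackett-Luce model over patch scores
    via the Gumbel trick, and return its log-probability for REINFORCE."""
    u = torch.rand_like(scores).clamp_min(1e-20)
    gumbel = -torch.log(-torch.log(u))
    perm = torch.argsort(scores + gumbel, descending=True)
    s = scores[perm]
    # log p(perm) = sum_t [ s_t - log sum_{u >= t} exp(s_u) ], tail sums
    # computed with a flipped logcumsumexp.
    tail_lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    logp = (s - tail_lse).sum()
    return perm, logp

# REINFORCE: run the vision model on patches reordered by `perm`, observe
# a task reward, then minimize -(reward - baseline) * logp.
```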

[509] Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization

Guanchen Li, Yixing Xu, Zeping Li, Ji Liu, Xuanwu Yin, Dong Li, Emad Barsoum

Main category: cs.LG

TL;DR: Týr-the-Pruner is an end-to-end search-based global structural pruning framework that achieves 97% performance retention while removing 50% of Llama-3.1-70B’s parameters.

DetailsMotivation: Address limitations of existing pruning methods: local pruning ignores global topology, while global pruning uses two-stage approaches that ignore inter-structure dependencies and lack end-to-end optimization.

Method: Constructs a supernet by applying local pruning across sparsity ratios per layer, uses expectation error accumulation for supernet construction, and employs iterative prune-and-search with coarse-to-fine sparsity granularity.

Result: Achieves state-of-the-art structural pruning, retaining 97% of dense model’s performance while removing 50% of Llama-3.1-70B’s parameters.

Conclusion: Proposed framework effectively addresses global structural pruning challenges and demonstrates superior performance compared to existing methods.

Abstract: Structural pruning enhances hardware-agnostic inference efficiency for large language models (LLMs) yet often fails to maintain comparable performance. Local pruning performs efficient layer-by-layer compression but ignores global topology. Although global pruning aims to identify an optimal sparse model, intuitive methods typically adopt a two-stage paradigm that first evaluates substructure saliency and then applies global pruning, which ignores inter-structure dependencies and fails to achieve end-to-end optimization. To address these limitations, we propose Týr-the-Pruner, an efficient end-to-end search-based global structural pruning framework. This framework constructs a supernet by repeatedly applying local pruning across a range of sparsity ratios to each layer in an LLM, with the core goal of determining the optimal sparsity distribution under a target overall sparsity ratio. Concretely, we introduce an effective local pruning and an expectation error accumulation approach to improve supernet construction. Furthermore, we employ an iterative prune-and-search strategy with coarse-to-fine sparsity granularity to ensure efficient search convergence. Experimental results show that Týr-the-Pruner achieves state-of-the-art structural pruning, retaining 97% of the dense model’s performance while removing a challenging 50% of Llama-3.1-70B’s parameters. Code will be available at https://github.com/AMD-AGI/Tyr-the-Pruner.

[510] Noise-Robustness Through Noise: A Framework combining Asymmetric LoRA with Poisoning MoE

Zhaokun Wang, Jinyu Guo, Jingwen Pu, Lingfeng Chen, Hongli Pu, Jie Ou, Libo Qin, Wenhong Tian

Main category: cs.LG

TL;DR: LoPE is a noise-robust adaptation method using asymmetric LoRA poisoning experts that enhances model robustness through generated noisy data, eliminating the need for data cleaning.

DetailsMotivation: Current parameter-efficient fine-tuning methods are vulnerable to noisy data interference, and existing noise-handling approaches require laborious data pre-processing or error-prone model modifications.

Method: LoPE integrates a dedicated poisoning expert in asymmetric LoRA configuration, using a two-stage paradigm: noise injection during fine-tuning to enhance noise discrimination, and selective masking of the poisoning expert during inference to leverage purified knowledge.

Result: Extensive experiments show LoPE achieves strong performance and robustness purely through low-cost noise injection.

Conclusion: LoPE provides an effective noise-robust adaptation framework that completely eliminates data cleaning requirements while maintaining strong performance.

Abstract: Current parameter-efficient fine-tuning methods for adapting pre-trained language models to downstream tasks are susceptible to interference from noisy data. Conventional noise-handling approaches either rely on laborious data pre-processing or employ model architecture modifications prone to error accumulation. In contrast to existing noise-process paradigms, we propose a noise-robust adaptation method via asymmetric LoRA poisoning experts (LoPE), a novel framework that enhances model robustness to noise only with generated noisy data. Drawing inspiration from the mixture-of-experts architecture, LoPE strategically integrates a dedicated poisoning expert in an asymmetric LoRA configuration. Through a two-stage paradigm, LoPE performs noise injection on the poisoning expert during fine-tuning to enhance its noise discrimination and processing ability. During inference, we selectively mask the dedicated poisoning expert to leverage purified knowledge acquired by normal experts for noise-robust output. Extensive experiments demonstrate that LoPE achieves strong performance and robustness purely through the low-cost noise injection, which completely eliminates the requirement of data cleaning.

[511] PAUSE: Low-Latency and Privacy-Aware Active User Selection for Federated Learning

Ori Peleg, Natalie Lang, Dan Ben Ami, Stefano Rini, Nir Shlezinger, Kobi Cohen

Main category: cs.LG

TL;DR: PAUSE: A multi-armed bandit-based method that jointly addresses privacy leakage accumulation and communication latency in federated learning through active user selection while maintaining model performance.

DetailsMotivation: Federated learning faces two key challenges: accumulation of privacy leakage over time and communication latency. Current approaches address these separately via perturbed updates (for privacy) and user selection (for latency), both at the expense of accuracy.

Method: Proposed PAUSE algorithm using multi-armed bandit framework with a reward function balancing privacy, latency, and model performance. Also introduced a simulated annealing-based relaxation for reduced complexity.

Result: Theoretical analysis shows PAUSE achieves reward growth rate matching best-known MAB rates. Numerical validation demonstrates improved privacy leakage control, reduced latency, and accuracy gains across various federated training scenarios.

Conclusion: PAUSE effectively addresses the joint optimization of privacy, latency, and accuracy in federated learning through active user selection with bounded privacy leakage and theoretical guarantees.

Abstract: Federated learning (FL) enables multiple edge devices to collaboratively train a machine learning model without the need to share potentially private data. Federated learning proceeds through iterative exchanges of model updates, which pose two key challenges: First, the accumulation of privacy leakage over time, and second, communication latency. These two limitations are typically addressed separately: The former via perturbed updates to enhance privacy and the latter using user selection to mitigate latency - both at the expense of accuracy. In this work, we propose a method that jointly addresses the accumulation of privacy leakage and communication latency via active user selection, aiming to improve the trade-off among privacy, latency, and model performance. To achieve this, we construct a reward function that accounts for these three objectives. Building on this reward, we propose a multi-armed bandit (MAB)-based algorithm, termed Privacy-aware Active User SElection (PAUSE) which dynamically selects a subset of users each round while ensuring bounded overall privacy leakage. We establish a theoretical analysis, systematically showing that the reward growth rate of PAUSE follows that of the best-known rate in MAB literature. To address the complexity overhead of active user selection, we propose a simulated annealing-based relaxation of PAUSE and analyze its ability to approximate the reward-maximizing policy under reduced complexity. We numerically validate the privacy leakage, associated improved latency, and accuracy gains of our methods for the federated training in various scenarios.
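The MAB ingredient can be illustrated with a plain upper-confidence-bound selection rule over users; PAUSE's actual index additionally enforces bounded cumulative privacy leakage through its reward design, which this hedged sketch omits.

```python
import numpy as np

def ucb_select_users(mean_reward, counts, t, m, c=2.0):
    """Pick m users by an upper-confidence-bound score -- the standard
    MAB ingredient behind PAUSE-style active user selection. The
    per-round reward is assumed to already blend model-utility, latency,
    and privacy terms, as in the paper's reward function."""
    ucb = mean_reward + np.sqrt(c * np.log(t + 1) / np.maximum(counts, 1))
    ucb[counts == 0] = np.inf          # force exploration of unseen users
    return np.argsort(-ucb)[:m]
```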

[512] Efficient Verified Machine Unlearning For Distillation

Yijun Quan, Zushu Li, Giovanni Montana

Main category: cs.LG

TL;DR: Summary unavailable: the arXiv API request for 2503.22539 returned HTTP 503, so no abstract, motivation, method, results, or conclusion could be extracted for this entry.

[513] Enabling Automatic Differentiation with Mollified Graph Neural Operators

Ryan Y. Lin, Julius Berner, Valentin Duruisseaux, David Pitt, Daniel Leibovici, Jean Kossaifi, Kamyar Azizzadenesheli, Anima Anandkumar

Main category: cs.LG

TL;DR: mGNO uses automatic differentiation to compute exact gradients for physics-informed neural operators, enabling efficient training on irregular grids and improved generalization.

DetailsMotivation: Existing physics losses in neural operators rely on derivatives computed with approximation methods like spectral or finite differences, which introduce errors due to finite resolution.

Method: Proposes mollified graph neural operator (mGNO) that leverages automatic differentiation to compute exact gradients on arbitrary geometries, allowing physics loss evaluation at randomly sampled points.

Result: On regular grids, mGNO reduced L2 relative data error by 20x vs finite differences. On unstructured point clouds, achieved errors 2 orders of magnitude lower than ML baselines with comparable runtimes, and 1-3 orders of magnitude speedup vs numerical solvers.

Conclusion: mGNO enables efficient PDE solving on complex geometries using only physics losses at low resolutions, and can be applied to inverse design and shape optimization problems.

Abstract: Physics-informed neural operators offer a powerful framework for learning solution operators of partial differential equations (PDEs) by combining data and physics losses. However, these physics losses rely on derivatives. Computing these derivatives remains challenging, with spectral and finite difference methods introducing approximation errors due to finite resolution. Here, we propose the mollified graph neural operator ($m$GNO), the first method to leverage automatic differentiation and compute exact gradients on arbitrary geometries. This enhancement enables efficient training on irregular grids and varying geometries while allowing seamless evaluation of physics losses at randomly sampled points for improved generalization. For a PDE example on regular grids, $m$GNO paired with autograd reduced the L2 relative data error by 20x compared to finite differences, although training was slower. It can also solve PDEs on unstructured point clouds seamlessly, using physics losses only, at resolutions vastly lower than those needed for finite differences to be accurate enough. On these unstructured point clouds, $m$GNO leads to errors that are consistently 2 orders of magnitude lower than machine learning baselines (Meta-PDE, which accelerates PINNs) for comparable runtimes, and also delivers speedups from 1 to 3 orders of magnitude compared to the numerical solver for similar accuracy. $m$GNOs can also be used to solve inverse design and shape optimization problems on complex geometries.
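The autograd capability the paper adds is the ability to evaluate exact physics residuals at arbitrary (e.g., randomly sampled) points instead of finite-difference stencils. A minimal example for a 1D Poisson-type residual, assuming a model applied pointwise to each sample coordinate:

```python
import torch

def poisson_residual(model, x):
    """Exact PDE residual via autograd at arbitrary sample points.
    Example: residual of u''(x) = sin(x), where `model` maps each row of
    x (shape (n, 1)) to a scalar prediction u(x)."""
    x = x.detach().requires_grad_(True)
    u = model(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return d2u - torch.sin(x)          # zero when the PDE is satisfied

# physics_loss = poisson_residual(model, torch.rand(1024, 1)).pow(2).mean()
```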

[514] Denoising the Future: Top-p Distributions for Moving Through Time

Florian Andreas Marwitz, Ralf Möller, Magnus Bender, Marcel Gehrke

Main category: cs.LG

TL;DR: The paper proposes using only the top-p most probable states in Hidden Markov Models to speed up inference and reduce noise, with bounded error and significant speed improvements.

DetailsMotivation: Dynamic probabilistic model inference is computationally expensive as it requires enumerating all states, including those with negligible probabilities, leading to inefficiency and noise propagation.

Method: Use only the top-p states (most probable states with accumulated probability p) for inference, which denoises the future and speeds up computation.

Result: Empirical evaluation shows speedups of at least an order of magnitude while maintaining total variation distance error below 0.09.

Conclusion: The top-p states approach effectively accelerates inference in Hidden Markov Models with bounded error, making it a practical solution for computational efficiency.

Abstract: Inference in dynamic probabilistic models is a complex task involving expensive operations. In particular, for Hidden Markov Models, the whole state space has to be enumerated for advancing in time. Even states with negligible probabilities are considered, resulting in computational inefficiency and increased noise due to the propagation of unlikely probability mass. We propose to denoise the future and speed up inference by using only the top-p states, i.e., the most probable states with accumulated probability p. We show that the error introduced by using only the top-p states is bound by p and the so-called minimal mixing rate of the underlying model. Moreover, in our empirical evaluation, we show that we can expect speedups of at least an order of magnitude, while the error in terms of total variation distance is below 0.09.
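A sketch of one top-p filtered forward step for an HMM, under assumed array conventions (`belief` is the current distribution over states, `transition` is row-stochastic, and `emission_col` holds P(observation | state) for the symbol just observed):

```python
import numpy as np

def top_p_forward_step(belief, transition, emission_col, p=0.9):
    """One HMM time update that only propagates the top-p states: the
    most probable states whose accumulated mass reaches p. All other
    entries are zeroed before the (now cheaper) transition step."""
    order = np.argsort(-belief)
    cutoff = np.searchsorted(np.cumsum(belief[order]), p) + 1
    keep = order[:cutoff]
    truncated = np.zeros_like(belief)
    truncated[keep] = belief[keep]
    truncated /= truncated.sum()            # renormalize kept mass
    predicted = truncated @ transition      # advance in time
    posterior = predicted * emission_col    # fold in the observation
    return posterior / posterior.sum()
```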

[515] Trial and Trust: Addressing Byzantine Attacks with Comprehensive Defense Strategy

Gleb Molodtsov, Daniil Medyakov, Sergey Skorik, Nikolas Khachaturov, Shahane Tigranyan, Vladimir Aletov, Aram Avetisyan, Martin Takáč, Aleksandr Beznosikov

Main category: cs.LG

TL;DR: The paper proposes Byzantine-robust federated learning methods using trust scores and trial functions to filter malicious updates, working even when Byzantine nodes are in majority and adapting to various optimization methods and practical scenarios.

DetailsMotivation: Federated learning systems are vulnerable to Byzantine attacks where compromised clients inject adversarial updates to disrupt global convergence, requiring robust defense mechanisms.

Method: Combines trust scores concept with trial function methodology to dynamically filter outliers, adapting to scaled methods like Adam/RMSProp and practical scenarios including local training and partial participation.

Result: Extensive experiments on synthetic and real ECG data validate robustness, with convergence guarantees comparable to classical algorithms without Byzantine interference.

Conclusion: The proposed methods provide effective Byzantine resilience in federated learning, maintaining performance even under majority Byzantine attacks while supporting practical deployment scenarios.

Abstract: Recent advancements in machine learning have improved performance while also increasing computational demands. While federated and distributed setups address these issues, their structure is vulnerable to malicious influences. In this paper, we address a specific threat, Byzantine attacks, where compromised clients inject adversarial updates to derail global convergence. We combine the trust scores concept with trial function methodology to dynamically filter outliers. Our methods address the critical limitations of previous approaches, allowing functionality even when Byzantine nodes are in the majority. Moreover, our algorithms adapt to widely used scaled methods like Adam and RMSProp, as well as practical scenarios, including local training and partial participation. We validate the robustness of our methods by conducting extensive experiments on both synthetic and real ECG data collected from medical institutions. Furthermore, we provide a broad theoretical analysis of our algorithms and their extensions to aforementioned practical setups. The convergence guarantees of our methods are comparable to those of classical algorithms developed without Byzantine interference.

[516] Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao

Main category: cs.LG

TL;DR: This paper provides a theoretical analysis of in-context learning for regression of Hölder functions on manifolds, establishing connections between transformers’ attention mechanism and kernel methods, and deriving generalization bounds that scale with intrinsic manifold dimension.

DetailsMotivation: While in-context learning has achieved empirical success, its theoretical understanding - especially for structured geometric data - remains unexplored. The paper aims to bridge this gap by studying ICL for regression on manifolds.

Method: The authors establish a connection between attention mechanisms and kernel methods, showing transformers perform kernel-based prediction. They validate this through numerical experiments and derive generalization error bounds based on prompt length and number of training tasks.

Result: Transformers achieve minimax regression rates for Hölder functions on manifolds, with error scaling exponentially with intrinsic manifold dimension rather than ambient space dimension. The learned query-prompt scores correlate strongly with Gaussian kernels.

Conclusion: The findings provide foundational insights into geometry’s role in ICL and offer new tools for studying ICL of nonlinear models, characterizing transformers as in-context kernel algorithm learners.

Abstract: While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding, particularly in the context of structured geometric data, remains unexplored. This paper initiates a theoretical study of ICL for regression of Hölder functions on manifolds. We establish a novel connection between the attention mechanism and classical kernel methods, demonstrating that transformers effectively perform kernel-based prediction at a new query through their interaction with the prompt. This connection is validated by numerical experiments, revealing that the learned query-prompt scores for Hölder functions are highly correlated with the Gaussian kernel. Building on this insight, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of Hölder functions on manifolds, which scales exponentially with the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context kernel algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novel tools to study ICL of nonlinear models.
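The attention-kernel correspondence at the heart of the analysis can be seen in a few lines: softmax scores over the prompt act as normalized kernel weights, and the prediction at the query is a Nadaraya-Watson weighted mean of the prompt labels (Gaussian-like scores are used here purely for illustration):

```python
import numpy as np

def attention_as_kernel_predictor(x_query, X_prompt, y_prompt, tau=1.0):
    """Softmax attention over in-context examples (x_i, y_i) is a
    Nadaraya-Watson kernel smoother: normalized query-prompt scores play
    the role of a kernel, and the output is the weighted label mean."""
    scores = -np.sum((X_prompt - x_query) ** 2, axis=1) / (2 * tau**2)
    w = np.exp(scores - scores.max())     # stable softmax weights
    w /= w.sum()
    return w @ y_prompt
```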

[517] Spike-timing-dependent Hebbian learning as noisy gradient descent

Niklas Dexheimer, Sascha Gaudlitz, Johannes Schmidt-Hieber

Main category: cs.LG

TL;DR: Hebbian STDP learning is shown to be equivalent to noisy gradient descent on a non-convex loss function, yet it converges exponentially fast to identify the most active presynaptic neuron despite noise and non-convexity.

DetailsMotivation: To understand how Hebbian learning principles in biological neural networks relate to optimization theory and why they work effectively despite noise and non-convex optimization landscapes.

Method: Relates a Hebbian spike-timing-dependent plasticity (STDP) rule to noisy gradient descent on a non-convex loss function defined on the probability simplex.

Result: Proves that the Hebbian learning dynamic identifies the presynaptic neuron with highest activity and converges exponentially fast in iterations, which is surprising given noise and non-convexity.

Conclusion: Hebbian learning can achieve fast, stable convergence to optimal solutions despite inherent noise and non-convex optimization problems, providing theoretical justification for its effectiveness in biological systems.

Abstract: Hebbian learning is a key principle underlying learning in biological neural networks. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a non-convex loss function on the probability simplex. Despite the constant injection of noise and the non-convexity of the underlying optimization problem, one can rigorously prove that the considered Hebbian learning dynamic identifies the presynaptic neuron with the highest activity and that the convergence is exponentially fast in the number of iterations. This is non-standard and surprising as typically noisy gradient descent with fixed noise level only converges to a stationary regime where the noise causes the dynamic to fluctuate around a minimiser.

[518] FlashBias: Fast Computation of Attention with Bias

Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long

Main category: cs.LG

TL;DR: FlashBias is a method that optimizes attention computation with bias by leveraging low-rank compressed sensing theory, achieving significant speedups without accuracy loss.

DetailsMotivation: Attention with bias creates efficiency bottlenecks that disrupt optimized memory-compute pipelines like FlashAttention, making biased attention computationally expensive despite its widespread use in vision, language, and scientific models.

Method: Based on theoretical analysis showing optimal efficiency depends on attention weight matrix rank, FlashBias uses low-rank compressed sensing theory to provide fast-exact computation for common attention biases and fast-accurate approximation for general bias formalizations.

Result: FlashBias achieves 1.5× speedup for Pairformer in AlphaFold 3 and over 2× speedup for attention with bias in vision and language models while maintaining full accuracy.

Conclusion: FlashBias effectively addresses the efficiency bottleneck in attention with bias computation, enabling practical deployment in complex tasks by fully leveraging GPU matrix multiplication optimizations.

Abstract: Attention with bias, which extends standard attention by introducing prior knowledge as an additive bias matrix to the query-key scores, has been widely deployed in vision, language, protein-folding and other advanced scientific models, underscoring its status as a key evolution of this foundational module. However, introducing bias terms creates a severe efficiency bottleneck in attention computation. It disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention, thereby stripping away most of their performance gains and leaving biased attention computationally expensive. Surprisingly, despite its common usage, targeted efficiency optimization for attention with bias remains absent, which seriously hinders its application in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases and a fast-accurate approximation for biases in general formalizations. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5$\times$ speedup for Pairformer in AlphaFold 3, and over 2$\times$ speedup for attention with bias in vision and language models without loss of accuracy. Code is available at this repository: https://github.com/thuml/FlashBias.
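One way to see why low rank enables a fast-exact path: if the bias factorizes as B = U W^T, the biased scores fold into ordinary attention on augmented queries and keys, so a fused bias-free kernel applies unchanged. The sketch below is a plausible reading of this idea, not FlashBias's actual kernel; tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_with_lowrank_bias(Q, K, V, U, W):
    """If the additive bias is B = U @ W.transpose(-2, -1) with rank r,
    then Q K^T / sqrt(d) + B = [Q/sqrt(d) | U] [K | W]^T, so biased
    attention reduces exactly to bias-free attention on augmented
    queries/keys. Q, K, V: (batch, heads, seq, d); U, W: (..., seq, r)."""
    d = Q.shape[-1]
    Qa = torch.cat([Q / d**0.5, U], dim=-1)   # pre-scale Q, leave U alone
    Ka = torch.cat([K, W], dim=-1)
    # scale=1.0 keeps the augmented scores unscaled (PyTorch >= 2.1).
    return F.scaled_dot_product_attention(Qa, Ka, V, scale=1.0)
```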

[519] Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks

Niket Patel, Randall Balestriero

Main category: cs.LG

TL;DR: The paper proposes a probabilistic framework with Task Priors to evaluate AI models over all possible downstream tasks, addressing limitations of fixed benchmark evaluations.

DetailsMotivation: Current AI evaluation relies on fixed downstream benchmarks, creating a bottleneck. The goal is to develop systems that can solve any task, but rigid evaluation protocols don't capture this comprehensive capability.

Method: Define a probabilistic space of downstream tasks using Task Priors - a distribution over tasks. This allows evaluation of model performance across all possible tasks weighted by their probability.

Result: The framework enables answering key questions about average performance and variance across all possible downstream tasks under defined Task Priors.

Conclusion: Task Priors establish a new evaluation standard that can accelerate SSL research by providing comprehensive performance assessment beyond fixed benchmarks.

Abstract: The grand goal of AI research, and particularly Self Supervised Learning (SSL), is to produce systems that can successfully solve any possible task. In contrast, current evaluation methods available to AI researchers typically rely on a fixed collection of hand-picked downstream benchmarks. Hence, a large amount of effort is put into designing and searching for large collection of evaluation tasks that can serve as a proxy of our grand goal. We argue that such a rigid evaluation protocol creates a silent bottleneck in AI research. To remedy that, we define a probabilistic space of downstream tasks obtained by adopting a distribution of tasks and by defining Task Priors. Under this view, one can evaluate a model’s performance over the set of all possible downstream tasks. Our framework is the first to provide answers to key questions such as (i) what is the average performance of my model over all possible downstream tasks weighted by the probability to encounter each task? or (ii) what is the variance of my model’s performance across all downstream tasks under the defined Task Priors? Beyond establishing a new standard for evaluation, we believe that Task Priors will accelerate the pace of research in SSL - where downstream task evaluation is the sole qualitative signal that researchers have access to.
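Operationally, evaluation under a Task Prior is a Monte Carlo average over sampled tasks rather than a fixed benchmark suite. A minimal sketch, with `sample_task` (draws a task from the prior) and `model_score` (evaluates the model on one task) as assumed callables:

```python
import numpy as np

def evaluate_under_task_prior(model_score, sample_task, n_tasks=1000,
                              rng=None):
    """Monte Carlo estimate of a model's mean and variance of performance
    under a Task Prior: draw tasks from the prior, score the model on
    each, and aggregate."""
    if rng is None:
        rng = np.random.default_rng(0)
    scores = np.array([model_score(sample_task(rng))
                       for _ in range(n_tasks)])
    return scores.mean(), scores.var()
```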

[520] Neural Graduated Assignment for Maximum Common Edge Subgraphs

Chaolong Ying, Yingqi Ruan, Xuemin Chen, Yaomin Wang, Tianshu Yu

Main category: cs.LG

TL;DR: NGA is a scalable, unsupervised method for Maximum Common Edge Subgraph problems that combines differentiable assignment optimization with neural components, achieving faster computation and better performance than traditional approaches.

DetailsMotivation: Traditional MCES methods like max-clique transformations and search-based algorithms face scalability issues with larger instances, limiting their practical applications in domains like biology and chemistry.

Method: Neural Graduated Assignment (NGA) stacks differentiable assignment optimization with neural components, using a learnable temperature mechanism for high-dimensional parameterization of the matching process.

Result: NGA significantly improves computation time and scalability on large instances, enhances performance across MCES computation, graph similarity estimation, and graph retrieval tasks compared to existing methods.

Conclusion: NGA represents a significant advancement in MCES computation and provides insights applicable to other assignment problems, with theoretical analysis showing fast convergence and better exploration-exploitation tradeoffs.

Abstract: The Maximum Common Edge Subgraph (MCES) problem is a crucial challenge with significant implications in domains such as biology and chemistry. Traditional approaches, which include transformations into max-clique and search-based algorithms, suffer from scalability issues when dealing with larger instances. This paper introduces “Neural Graduated Assignment” (NGA), a simple, scalable, unsupervised-training-based method that addresses these limitations. Central to NGA is the stacking of differentiable assignment optimization with neural components, enabling high-dimensional parameterization of the matching process through a learnable temperature mechanism. We further theoretically analyze the learning dynamics of NGA, showing its design leads to fast convergence, a better exploration-exploitation tradeoff, and the ability to escape local optima. Extensive experiments across MCES computation, graph similarity estimation, and graph retrieval tasks reveal that NGA not only significantly improves computation time and scalability on large instances but also enhances performance compared to existing methodologies. The introduction of NGA marks a significant advancement in the computation of MCES and offers insights into other assignment problems.
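The classical graduated-assignment core that NGA builds on alternates Sinkhorn normalization with temperature annealing. The sketch below keeps the similarity matrix fixed and uses a fixed schedule, whereas NGA makes the temperature learnable and adds neural components (and graduated assignment proper would also re-estimate similarities each round).

```python
import torch

def graduated_assignment(sim, n_outer=50, n_sinkhorn=20, t0=1.0, decay=0.95):
    """Soft assignment between two equal-size node sets from a square
    similarity matrix `sim`, via log-domain Sinkhorn normalization with
    a gradually lowered temperature (annealing toward a hard matching)."""
    tau = t0
    for _ in range(n_outer):
        log_a = sim / tau
        for _ in range(n_sinkhorn):   # alternate row/column normalization
            log_a = log_a - torch.logsumexp(log_a, dim=1, keepdim=True)
            log_a = log_a - torch.logsumexp(log_a, dim=0, keepdim=True)
        tau *= decay                  # lower temperature: sharper assignment
    return log_a.exp()                # approximately doubly stochastic
```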

[521] Fair Supervised Learning Through Constraints on Smooth Nonconvex Unfairness-Measure Surrogates

Zahra Khatti, Daniel P. Robinson, Frank E. Curtis

Main category: cs.LG

TL;DR: A new fair ML strategy using smooth nonconvex surrogates for unfairness measures and hard constraints instead of regularization, enabling tractable optimization with multiple fairness constraints.

DetailsMotivation: Existing fair ML methods often use convex surrogates that fail to ensure fairness in practice, and regularization approaches lead to difficult optimization problems with expensive parameter tuning.

Method: Proposes smooth nonconvex surrogates to approximate Heaviside functions in unfairness measures, and uses hard constraints instead of regularization to enforce fairness tolerances.

Result: The method provides tight fairness approximations, enables multiple conflicting unfairness constraints simultaneously, and leads to tractable optimization problems with minimal tuning.

Conclusion: The proposed strategy offers practical advantages over existing approaches by ensuring fairness through tight approximations and hard constraints while maintaining computational tractability.

Abstract: A new strategy for fair supervised machine learning is proposed. The main advantages of the proposed strategy as compared to others in the literature are as follows. (a) We introduce a new smooth nonconvex surrogate to approximate the Heaviside functions involved in discontinuous unfairness measures. The surrogate is based on smoothing methods from the optimization literature, and is new for the fair supervised learning literature. The surrogate is a tight approximation which ensures the trained prediction models are fair, as opposed to other (e.g., convex) surrogates that can fail to lead to a fair prediction model in practice. (b) Rather than rely on regularizers (that lead to optimization problems that are difficult to solve) and corresponding regularization parameters (that can be expensive to tune), we propose a strategy that employs hard constraints so that specific tolerances for unfairness can be enforced without the complications associated with the use of regularization. (c) Our proposed strategy readily allows for constraints on multiple (potentially conflicting) unfairness measures at the same time. Multiple measures can be considered with a regularization approach, but at the cost of having even more difficult optimization problems to solve and further expense for tuning. By contrast, through hard constraints, our strategy leads to optimization models that can be solved tractably with minimal tuning.
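
As a concrete illustration of point (a), one standard smoothing of the Heaviside step is shown below; this particular functional form is an assumption for illustration, not necessarily the paper's surrogate. It is smooth, nonconvex, and tightens as the smoothing parameter shrinks.

```python
import numpy as np

def smooth_heaviside(x, tau=0.1):
    """A smooth, nonconvex surrogate for the step function 1[x > 0].
    As tau -> 0 the approximation tightens toward the true Heaviside."""
    return 0.5 * (1.0 + x / np.sqrt(x**2 + tau**2))

# Example: a demographic-parity-style gap, which involves Heaviside terms
# P(h(x) > 0 | group), becomes differentiable after smoothing:
rng = np.random.default_rng(0)
scores_g0, scores_g1 = rng.normal(size=100), rng.normal(size=100) + 0.3
gap = abs(smooth_heaviside(scores_g0).mean() - smooth_heaviside(scores_g1).mean())
```

A constrained formulation in the paper's spirit would then bound `gap` by a tolerance rather than adding it to the objective as a regularizer.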

[522] Understanding Differential Transformer Unchains Pretrained Self-Attentions

Chaerin Kong, Jiho Jang, Nojun Kwak

Main category: cs.LG

TL;DR: No summary is available for this paper: the abstract for arXiv:2505.16333 could not be retrieved (the arXiv export API returned HTTP 503), so the paper’s content could not be analyzed.

Abstract: Not available. The request to https://export.arxiv.org/api/query?search_query=&id_list=2505.16333&sortBy=relevance&sortOrder=descending&start=0&max_results=100 failed with HTTP 503.

[523] Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle

Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai

Main category: cs.LG

TL;DR: Shuffle-R1 is a framework that addresses training inefficiencies in RL fine-tuning of MLLMs by solving Advantage Collapsing and Rollout Silencing through pairwise trajectory sampling and advantage-based batch shuffling.

DetailsMotivation: Current RL pipelines for MLLMs suffer from training inefficiencies due to Advantage Collapsing (advantages concentrate near zero) and Rollout Silencing (few rollouts contribute non-zero gradients), leading to suboptimal gradient updates.

Method: Proposes Shuffle-R1 with two components: (1) Pairwise Trajectory Sampling that selects high-contrast trajectories with large advantages, and (2) Advantage-based Trajectory Shuffle that increases exposure of valuable rollouts through informed batch reshuffling.

Result: Experiments across multiple reasoning benchmarks show consistent outperformance over strong RL baselines with minimal overhead.

Conclusion: The framework highlights the importance of data-centric adaptations for more efficient RL training in MLLMs.

Abstract: Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address these issues, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLMs.
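
A toy sketch of advantage-aware batch composition in the spirit of component (2); the exact sampling rule is an assumption, not taken from the paper.

```python
import numpy as np

def advantage_shuffle(trajectories, advantages, batch_size, rng):
    """Sample trajectories with probability proportional to |advantage|,
    down-weighting near-zero-advantage rollouts (Advantage Collapsing)
    and keeping informative rollouts exposed (Rollout Silencing)."""
    adv = np.abs(np.asarray(advantages, dtype=float)) + 1e-8
    idx = rng.choice(len(trajectories), size=batch_size, replace=False, p=adv / adv.sum())
    return [trajectories[i] for i in idx]
```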

[524] The Spacetime of Diffusion Models: An Information Geometry Perspective

Rafał Karczewski, Markus Heinonen, Alison Pouplin, Søren Hauberg, Vikas Garg

Main category: cs.LG

TL;DR: The paper presents a novel geometric perspective on diffusion model latent spaces, showing standard approaches are flawed and introducing a latent spacetime framework that enables principled geodesic computation and applications like Diffusion Edit Distance and molecular transition path sampling.

DetailsMotivation: Standard geometric approaches to diffusion model latent spaces are fundamentally flawed: the deterministic decoder forces geodesics to be straight lines that ignore data geometry, while the stochastic approach collapses due to memorylessness. A new geometric framework is needed to properly capture intrinsic data geometry.

Method: Introduce a latent spacetime z=(x_t,t) that indexes denoising distributions across all noise scales. Prove these distributions form an exponential family and derive simulation-free estimators for curve lengths. This enables efficient geodesic computation in the resulting geometric structure.

Result: The method yields a principled Diffusion Edit Distance where geodesics trace minimal sequences of noise and denoise edits between data. Also demonstrates benefits for transition path sampling in molecular systems, including constrained variants like low-variance transitions and region avoidance.

Conclusion: The latent spacetime framework provides a mathematically sound geometric structure for diffusion models that properly captures intrinsic data geometry, enabling applications in edit distance computation and molecular dynamics simulation.

Abstract: We present a novel geometric perspective on the latent space of diffusion models. We first show that the standard pullback approach, utilizing the deterministic probability flow ODE decoder, is fundamentally flawed. It provably forces geodesics to decode as straight segments in data space, effectively ignoring any intrinsic data geometry beyond the ambient Euclidean space. Complementing this view, diffusion also admits a stochastic decoder via the reverse SDE, which enables an information geometric treatment with the Fisher-Rao metric. However, a choice of $x_T$ as the latent representation collapses this metric due to memorylessness. We address this by introducing a latent spacetime $z=(x_t,t)$ that indexes the family of denoising distributions $p(x_0 | x_t)$ across all noise scales, yielding a nontrivial geometric structure. We prove these distributions form an exponential family and derive simulation-free estimators for curve lengths, enabling efficient geodesic computation. The resulting structure induces a principled Diffusion Edit Distance, where geodesics trace minimal sequences of noise and denoise edits between data. We also demonstrate benefits for transition path sampling in molecular systems, including constrained variants such as low-variance transitions and region avoidance. Code is available at: https://github.com/rafalkarczewski/spacetime-geometry

[525] Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^π$-Realizable MDPs

Antoine Moulin, Gergely Neu, Luca Viano

Main category: cs.LG

TL;DR: SPOIL is a new offline imitation learning algorithm for MDPs that matches expert performance with O(ε⁻²) samples in linear Q^π-realizable MDPs, and O(ε⁻⁴) in non-linear cases, outperforming behavior cloning.

DetailsMotivation: To address offline imitation learning with different structural assumptions about the environment, moving beyond the assumption that the expert belongs to a known policy class.

Method: Introduces saddle-point offline imitation learning (SPOIL) algorithm that leverages Q^π-realizability in MDPs, with extensions to neural net implementation using a new critic loss function.

Result: SPOIL guarantees matching expert performance up to additive error ε with O(ε⁻²) samples for linear Q^π-realizable MDPs and O(ε⁻⁴) for non-linear cases, showing superior performance to behavior cloning and competitiveness with state-of-the-art methods.

Conclusion: SPOIL provides an effective approach for offline imitation learning with provable guarantees and practical neural implementations that outperform traditional methods.

Abstract: We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment. Specifically, for the class of linear $Q^\pi$-realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (SPOIL), which is guaranteed to match the performance of any expert up to an additive error $\varepsilon$ with access to $\mathcal{O}(\varepsilon^{-2})$ samples. Moreover, we extend this result to possibly non-linear $Q^\pi$-realizable MDPs at the cost of a worse sample complexity of order $\mathcal{O}(\varepsilon^{-4})$. Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.

[526] Sign-SGD is the Golden Gate between Multi-Node to Single-Node Learning: Significant Boost via Parameter-Free Optimization

Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov

Main category: cs.LG

TL;DR: The paper proposes deterministic Sign-SGD variants that automatically determine effective stepsizes, addressing a key limitation of standard Sign-SGD where stepsize selection depends on inaccessible dataset parameters.

DetailsMotivation: Training large language models is extremely resource-intensive, and Sign-SGD offers memory-efficient training and gradient compression for distributed learning. However, standard Sign-SGD cannot automatically determine effective stepsizes as this depends on inaccessible dataset parameters.

Method: The authors design several variants of single-node deterministic Sign-SGD and extend them to practical scenarios including stochastic single-node and multi-node learning, as well as methods with incorporated momentum.

Result: Extensive experiments on real machine learning problems demonstrate the practical applicability of the proposed approaches.

Conclusion: The proposed deterministic Sign-SGD variants successfully address the stepsize determination problem and show practical effectiveness across various learning scenarios.

Abstract: Quite recently, large language models have made a significant breakthrough across various disciplines. However, training them is an extremely resource-intensive task, even for major players with vast computing resources. One of the methods gaining popularity in light of these challenges is Sign-SGD. This method can be applied both as a memory-efficient approach in single-node training and as a gradient compression technique in distributed learning. Nevertheless, it is impossible to automatically determine the effective stepsize from a theoretical standpoint. Indeed, it depends on the parameters of the dataset to which we do not have access in the real-world learning paradigm. To address this issue, we design several variants of single-node deterministic Sign-SGD. We extend our approaches to practical scenarios: stochastic single-node and multi-node learning, and methods with incorporated momentum. We conduct extensive experiments on real machine learning problems that emphasize the practical applicability of our ideas.
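
The baseline update the paper starts from is simply the sign of the gradient times a stepsize; the paper's point is that a good fixed stepsize depends on unknown problem constants, which their parameter-free variants estimate automatically. A minimal sketch of the plain update:

```python
import numpy as np

def sign_sgd(grad_fn, x0, stepsize, n_steps=1000):
    """Plain deterministic Sign-SGD: x <- x - gamma * sign(grad f(x)).
    (The paper's variants replace the fixed `stepsize` with adaptive rules.)"""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        x -= stepsize * np.sign(grad_fn(x))
    return x

# Usage on a toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x:
x_final = sign_sgd(lambda x: x, x0=np.ones(5), stepsize=1e-2)
```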

[527] FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization

Fangxin Liu, Zongwu Wang, JinHong Xia, Junping Zhao, Shouren Zhao, Jinjin Li, Jian Liu, Li Jiang, Haibing Guan

Main category: cs.LG

TL;DR: FlexQuant is a dynamic precision-switching framework for LLMs that optimizes memory efficiency by adapting bit-widths during token generation, achieving 1.3x speedup with minimal accuracy loss.

DetailsMotivation: Address the memory bottleneck in large language models caused by the gap between model scaling and hardware capabilities, overcoming limitations of static quantization methods that can't adapt to dynamic workloads.

Method: Uses model perplexity entropy and Kullback-Leibler divergence to enable fine-grained, layer-wise mixed-precision quantization. Dynamically adjusts bit-widths during each token generation and implements precision requirement modeling for optimal switching.

Result: Achieves 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss, providing efficient fine-grained precision management.

Conclusion: FlexQuant offers a flexible and adaptive solution for efficient LLM deployment by dynamically optimizing the trade-off between inference speed and accuracy through precision switching.

Abstract: The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods predominantly rely on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths during each token generation. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Evaluations demonstrate that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss. This framework offers a flexible and adaptive solution for efficient LLM deployment. Code is released at https://github.com/ZongwuWang/FlexQuant.git.
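
One plausible shape for a precision-switching rule is sketched below; the specific thresholding is an assumption, whereas the paper derives its switching from a precision requirement model built on perplexity entropy and KL divergence.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def choose_bitwidth(probs_by_bits, probs_ref, kl_budget=0.05):
    """Pick the lowest bit-width whose next-token distribution stays
    within a KL budget of a high-precision reference distribution.

    probs_by_bits: e.g. {4: p4, 8: p8, 16: p16} next-token distributions.
    """
    for bits in sorted(probs_by_bits):
        if kl(probs_ref, probs_by_bits[bits]) <= kl_budget:
            return bits
    return max(probs_by_bits)
```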

[528] PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries

Steven Kolawole, Keshav Santhanam, Virginia Smith, Pratiksha Thaker

Main category: cs.LG

TL;DR: PARALLELPROMPT is a benchmark for measuring intra-query parallelism in LLM prompts, showing that decomposing prompts into parallel subtasks can achieve up to 5x speedups with minimal quality loss.

DetailsMotivation: Current LLM serving systems treat prompts as monolithic inputs, missing opportunities for parallelism in prompts with decomposable structures where subtasks can be executed independently to reduce latency.

Method: Created a dataset of 37,000+ real-world prompts annotated with structured schemas using LLM-assisted prompting and rule-based multilingual validation, plus an execution suite to benchmark serial vs. parallel strategies.

Result: Intra-query parallelism can be successfully parsed in over 75% of curated datasets, achieving up to 5x speedups on tasks like translation, comprehension, and comparative analysis with minimal quality degradation.

Conclusion: The benchmark provides the first standardized testbed for studying structure-aware execution in LLM serving pipelines, demonstrating significant latency improvements through prompt decomposition.

Abstract: LLM serving systems typically treat user prompts as monolithic inputs, optimizing inference through decoding tricks or inter-query batching. However, many real-world prompts contain latent semantic parallelism: decomposable structures where subtasks can be executed independently to reduce latency while preserving meaning. We introduce PARALLELPROMPT, the first benchmark for measuring intra-query parallelism in natural user prompts. Our dataset comprises over 37,000 real-world prompts from public LLM chat logs, each annotated with a structured schema capturing task templates, shared context, and iteration inputs. These schemas are extracted using LLM-assisted prompting with rule-based multilingual validation. To evaluate the benefits of decomposition, we provide an execution suite that benchmarks serial vs. parallel strategies, measuring latency, structural adherence, and semantic fidelity. Our results show that intra-query parallelism can be successfully parsed in over 75% of curated datasets, unlocking up to 5x speedups on tasks like translation, comprehension, and comparative analysis, with minimal quality degradation. By releasing this benchmark, curation pipeline, and evaluation suite, we provide the first standardized testbed for studying structure-aware execution in LLM serving pipelines.
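
Executing a decomposed schema is straightforward once the task template, shared context, and iteration inputs are extracted; a minimal sketch follows, where the `llm_call` client and prompt layout are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel_schema(llm_call, task_template, shared_context, iteration_inputs):
    """Fan a decomposed prompt out over its iteration inputs in parallel.

    llm_call(prompt) -> str is an assumed client function; task_template
    must contain an `{item}` placeholder.
    """
    prompts = [
        f"{shared_context}\n\n{task_template.format(item=item)}"
        for item in iteration_inputs
    ]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(llm_call, prompts))

# e.g. run_parallel_schema(call, "Translate to French: {item}", "", ["cat", "dog"])
```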

[529] A unified framework for establishing the universal approximation of transformer-type architectures

Jingpu Cheng, Ting Lin, Zuowei Shen, Qianxiao Li

Main category: cs.LG

TL;DR: The paper establishes universal approximation property (UAP) for transformer architectures by identifying token distinguishability as a key requirement and providing a unified theoretical framework with simplified verification methods.

DetailsMotivation: To extend theoretical understanding of universal approximation from residual networks to transformer architectures with attention mechanisms, providing a unified framework for analyzing UAP across various transformer variants.

Method: Developed a general sufficient condition for UAP requiring token distinguishability, leveraged analyticity assumption on attention layers to simplify verification, and used non-constructive approach to establish UAP for transformers with various attention mechanisms.

Result: Proved UAP for transformers with kernel-based and sparse attention mechanisms, generalized prior works, established UAP for previously uncovered architectures, and provided foundation for designing novel transformers with UAP guarantees.

Conclusion: The framework successfully extends UAP theory to transformer architectures, identifies fundamental requirements, simplifies verification, and enables principled design of transformers with inherent approximation guarantees and functional symmetries.

Abstract: We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach in establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention mechanisms. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.

[530] Measuring the Measures: Discriminative Capacity of Representational Similarity Metrics Across Model Families

Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla

Main category: cs.LG

TL;DR: Systematic comparison of representational similarity metrics shows that separability increases with more stringent alignment constraints, with soft-matching performing best among mapping-based methods.

DetailsMotivation: Lack of systematic comparisons of representational similarity metrics' discriminative power across different model families and training regimes.

Method: Quantitative framework using three separability measures (dprime, silhouette coefficients, ROC-AUC) to evaluate metrics including RSA, linear predictivity, Procrustes, and soft matching across CNN, Vision Transformer, Swin Transformer, ConvNeXt architectures with supervised vs. self-supervised training.

Result: Separability systematically increases with more stringent alignment constraints; soft-matching achieves highest separability among mapping-based approaches, followed by Procrustes and linear predictivity; non-fitting methods like RSA also show strong separability.

Conclusion: Provides first systematic comparison of similarity metrics through separability lens, clarifying relative sensitivity and guiding metric choice for large-scale model and brain comparisons.

Abstract: Representational similarity metrics are fundamental tools in neuroscience and AI, yet we lack systematic comparisons of their discriminative power across model families. We introduce a quantitative framework to evaluate representational similarity measures based on their ability to separate model families across architectures (CNNs, Vision Transformers, Swin Transformers, ConvNeXt) and training regimes (supervised vs. self-supervised). Using three complementary separability measures (d-prime from signal detection theory, silhouette coefficients, and ROC-AUC), we systematically assess the discriminative capacity of commonly used metrics including RSA, linear predictivity, Procrustes, and soft matching. We show that separability systematically increases as metrics impose more stringent alignment constraints. Among mapping-based approaches, soft-matching achieves the highest separability, followed by Procrustes alignment and linear predictivity. Non-fitting methods such as RSA also yield strong separability across families. These results provide the first systematic comparison of similarity metrics through a separability lens, clarifying their relative sensitivity and guiding metric choice for large-scale model and brain comparisons.
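
Of the three separability measures, d-prime is the simplest to state: the gap between within-family and between-family similarity scores in pooled standard-deviation units. A minimal sketch:

```python
import numpy as np

def dprime_separability(sim_within, sim_between):
    """Signal-detection d' between within-family and between-family
    pairwise similarity scores computed under some metric (RSA,
    Procrustes, linear predictivity, ...)."""
    w, b = np.asarray(sim_within), np.asarray(sim_between)
    pooled_sd = np.sqrt(0.5 * (w.var(ddof=1) + b.var(ddof=1)))
    return (w.mean() - b.mean()) / pooled_sd
```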

[531] Class-wise Balancing Data Replay for Federated Class-Incremental Learning

Zhuang Qi, Ying-Peng Tang, Lei Meng, Han Yu, Xiaoxiao Li, Xiangxu Meng

Main category: cs.LG

TL;DR: FedCBDR is a federated class incremental learning method that addresses class imbalance through balanced data replay and task-aware temperature scaling, achieving 2%-15% accuracy improvement over state-of-the-art methods.

DetailsMotivation: To solve class imbalance issues in federated class incremental learning, both within replay buffers due to limited global awareness and between replayed and new classes, which limits performance of existing data replay methods.

Method: Two key components: 1) Global-perspective data replay module that reconstructs global representations of prior tasks privately and guides class-aware, importance-sensitive sampling for balanced replay; 2) Task-aware temperature scaling module that adaptively adjusts logit temperatures at class and instance levels based on task dynamics.

Result: FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance, yielding 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.

Conclusion: The proposed FedCBDR effectively addresses class imbalance in federated class incremental learning through coordinated data replay and adaptive temperature scaling, demonstrating significant performance improvements.

Abstract: Federated Class Incremental Learning (FCIL) aims to collaboratively process continuously increasing incoming tasks across multiple clients. Among various approaches, data replay has become a promising solution, which can alleviate forgetting by reintroducing representative samples from previous tasks. However, their performance is typically limited by class imbalance, both within the replay buffer due to limited global awareness and between replayed and newly arrived classes. To address this issue, we propose a class-wise balancing data replay method for FCIL (FedCBDR), which employs a global coordination mechanism for class-level memory construction and reweights the learning objective to alleviate the aforementioned imbalances. Specifically, FedCBDR has two key components: 1) the global-perspective data replay module reconstructs global representations of prior tasks in a privacy-preserving manner, which then guides a class-aware and importance-sensitive sampling strategy to achieve balanced replay; 2) subsequently, to handle class imbalance across tasks, the task-aware temperature scaling module adaptively adjusts the temperature of logits at both class and instance levels based on task dynamics, which reduces the model’s overconfidence in majority classes while enhancing its sensitivity to minority classes. Experimental results verified that FedCBDR achieves balanced class-wise sampling under heterogeneous data distributions and improves generalization under task imbalance between earlier and recent tasks, yielding a 2%-15% Top-1 accuracy improvement over six state-of-the-art methods.

[532] TPP-SD: Accelerating Transformer Point Process Sampling with Speculative Decoding

Shukai Gong, Yiyang Fu, Fengyuan Ran, Quyu Kong, Feng Zhou

Main category: cs.LG

TL;DR: TPP-SD accelerates Transformer temporal point process sampling using speculative decoding, achieving 2-6× speedup while maintaining identical output distributions to standard methods.

DetailsMotivation: Bridge the gap between powerful Transformer TPP models and the practical need for rapid sequence sampling by leveraging structural similarities between TPP thinning algorithms and speculative decoding.

Method: Adapt speculative decoding techniques from language models to TPPs, using a smaller draft model to generate multiple candidate events that are verified in parallel by the larger target model.

Result: Produces samples from identical distributions as standard methods with 2-6× speedup across synthetic and real datasets. Ablation studies analyze impact of draft length and model size on efficiency.

Conclusion: TPP-SD successfully accelerates Transformer TPP sampling while maintaining distributional fidelity, making powerful TPP models more practical for real-world applications requiring rapid sequence generation.

Abstract: We propose TPP-SD, a novel approach that accelerates Transformer temporal point process (TPP) sampling by adapting speculative decoding (SD) techniques from language models. By identifying the structural similarities between thinning algorithms for TPPs and speculative decoding for language models, we develop an efficient sampling framework that leverages a smaller draft model to generate multiple candidate events, which are then verified by the larger target model in parallel. TPP-SD maintains the same output distribution as autoregressive sampling while achieving significant acceleration. Experiments on both synthetic and real datasets demonstrate that our approach produces samples from identical distributions as standard methods, but with 2-6$\times$ speedup. Our ablation studies analyze the impact of hyperparameters such as draft length and draft model size on sampling efficiency. TPP-SD bridges the gap between powerful Transformer TPP models and the practical need for rapid sequence sampling.
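
The draft-and-verify pattern TPP-SD adapts can be illustrated with the standard token-level acceptance test; the continuous-time, thinning-based analogue in the paper is not reproduced here.

```python
import numpy as np

def speculative_accept(draft_samples, q_probs, p_probs, rng):
    """Schematic speculative-decoding verification: accept each draft
    sample with probability min(1, p/q); on the first rejection, stop
    and fall back to the target model (residual resampling omitted).

    q_probs / p_probs: draft- and target-model probabilities of each draft sample.
    """
    accepted = []
    for x, q, p in zip(draft_samples, q_probs, p_probs):
        if rng.random() < min(1.0, p / q):
            accepted.append(x)
        else:
            break
    return accepted
```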

[533] A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications

Zhenyu Tao, Wei Xu, Xiaohu You

Main category: cs.LG

TL;DR: This paper introduces a generalized bisimulation metric (GBSM) for comparing states across different MDPs, with rigorous mathematical properties and improved theoretical bounds for policy transfer and state aggregation.

DetailsMotivation: The bisimulation metric (BSM) has been limited to single MDP scenarios, and existing attempts to generalize it to multiple MDPs lack rigorous mathematical analysis, hindering theoretical progress in applications like policy transfer.

Method: The authors formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, proving three fundamental properties: symmetry, inter-MDP triangle inequality, and distance bounds on identical state spaces.

Result: GBSM provides strictly tighter theoretical bounds for policy transfer, state aggregation, and sampling-based estimation compared to standard BSM, and offers closed-form sample complexity for estimation that improves upon existing asymptotic results.

Conclusion: GBSM enables rigorous analysis of multi-MDP scenarios with validated theoretical improvements and practical effectiveness in numerical experiments.

Abstract: The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, which is rigorously proven with the three fundamental properties: GBSM symmetry, inter-MDP triangle inequality, and the distance bound on identical state spaces. Leveraging these properties, we theoretically analyse policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
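
For context, the standard single-MDP bisimulation metric that GBSM generalizes is (up to the choice of reward and transition weights) the unique fixed point of

$$d(s, s') = \max_{a \in \mathcal{A}} \Big( \lvert r(s,a) - r(s',a) \rvert + \gamma\, \mathcal{W}_{d}\big(P(\cdot \mid s,a),\, P(\cdot \mid s',a)\big) \Big),$$

where $\mathcal{W}_{d}$ denotes the Wasserstein-1 distance under $d$. GBSM extends this notion to state pairs drawn from two different MDPs, with the symmetry, inter-MDP triangle inequality, and identical-state-space bounds proven in the paper.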

[534] The Impact of Coreset Selection on Spurious Correlations and Group Robustness

Amaya Dharmasiri, William Yang, Polina Kirichenko, Lydia Liu, Olga Russakovsky

Main category: cs.LG

TL;DR: Analysis of how coreset selection methods affect spurious biases in datasets and model robustness across multiple benchmarks and selection policies.

DetailsMotivation: To understand whether dataset reduction methods perpetuate, amplify, or mitigate biases in datasets that cause models to learn spurious correlations instead of causal features.

Method: Comprehensive analysis using 10 spurious correlations benchmarks, 5 score metrics for sample importance/difficulty, and 5 data selection policies across various coreset sizes.

Result: Embedding-based sample characterization scores have lower risk of exacerbating bias compared to learning dynamics-based methods. Some coreset selection methods can achieve lower bias levels by prioritizing difficult samples but don’t guarantee downstream robustness.

Conclusion: Coreset selection methods have nuanced interactions with dataset bias, and while some methods can reduce bias, they don’t reliably ensure model robustness against spurious correlations.

Abstract: Coreset selection methods have shown promise in reducing the training data size while maintaining model performance for data-efficient machine learning. However, as many datasets suffer from biases that cause models to learn spurious correlations instead of causal features, it is important to understand whether and how dataset reduction methods may perpetuate, amplify, or mitigate these biases. In this work, we conduct the first comprehensive analysis of the implications of data selection on the spurious bias levels of the selected coresets and the robustness of downstream models trained on them. We use an extensive experimental setting spanning ten different spurious correlations benchmarks, five score metrics to characterize sample importance/difficulty, and five data selection policies across a broad range of coreset sizes. Thereby, we unravel a series of nontrivial nuances in interactions between sample difficulty and bias alignment, as well as dataset bias and resultant model robustness. For example, we find that selecting coresets using embedding-based sample characterization scores runs a comparatively lower risk of inadvertently exacerbating bias than selecting using characterizations based on learning dynamics. Most importantly, our analysis reveals that although some coreset selection methods could achieve lower bias levels by prioritizing difficult samples, they do not reliably guarantee downstream robustness.

[535] Who cuts emissions, who turns up the heat? causal machine learning estimates of energy efficiency interventions

Bernardino D’Amico, Francesco Pomponi, Jay H. Arehart, Lina Khaddour

Main category: cs.LG

TL;DR: Wall insulation reduces gas consumption by up to 19% on average, but benefits are uneven - low energy burden households save significantly while high burden households see little reduction due to reallocating savings to improved comfort.

DetailsMotivation: To understand the heterogeneous impacts of energy efficiency interventions and distributional effects across different energy burden subgroups, as reducing domestic energy demand is central to climate mitigation and fuel poverty strategies.

Method: Causal machine learning model trained on nationally representative data of the English housing stock to estimate average and conditional treatment effects of wall insulation on gas consumption.

Result: Wall insulation reduces gas demand by up to 19% on average, but households with high energy burdens (costs-to-income ratios >0.1) see little to no reduction as they reallocate savings toward improved thermal comfort rather than lowering consumption.

Conclusion: Energy efficiency interventions have complex distributional effects that require broader evaluation frameworks accounting for both climate impacts and equity implications, as comfort-seeking behavior in high-burden households represents rational adjustments to prior deprivation with potential health co-benefits.

Abstract: Reducing domestic energy demand is central to climate mitigation and fuel poverty strategies, yet the impact of energy efficiency interventions is highly heterogeneous. Using a causal machine learning model trained on nationally representative data of the English housing stock, we estimate average and conditional treatment effects of wall insulation on gas consumption, focusing on distributional effects across energy burden subgroups. While interventions reduce gas demand on average (by as much as 19 percent), low energy burden groups achieve substantial savings, whereas those experiencing high energy burdens see little to no reduction. This pattern reflects a behaviourally-driven mechanism: households constrained by high costs-to-income ratios (e.g. more than 0.1) reallocate savings toward improved thermal comfort rather than lowering consumption. Far from wasteful, such responses represent rational adjustments in contexts of prior deprivation, with potential co-benefits for health and well-being. These findings call for a broader evaluation framework that accounts for both climate impacts and the equity implications of domestic energy policy.
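
For readers unfamiliar with conditional treatment effect estimation, a generic T-learner baseline (not the paper's estimator) conveys the idea: fit separate outcome models for insulated and non-insulated homes, then contrast their predictions per household.

```python
from sklearn.ensemble import GradientBoostingRegressor

def t_learner_cate(X, treated, y):
    """Minimal T-learner sketch for heterogeneous treatment effects.

    X: (n, d) household covariates; treated: boolean mask for wall
    insulation; y: gas consumption. Returns per-household CATE estimates.
    """
    m1 = GradientBoostingRegressor().fit(X[treated], y[treated])
    m0 = GradientBoostingRegressor().fit(X[~treated], y[~treated])
    return m1.predict(X) - m0.predict(X)
```

Subgroup averages of these estimates (e.g., by energy-burden bracket) would then surface the distributional pattern the paper reports.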

[536] Inductive Domain Transfer In Misspecified Simulation-Based Inference

Ortal Senouf, Antoine Wehenkel, Cédric Vincent-Cuaz, Emmanuel Abbé, Pascal Frossard

Main category: cs.LG

TL;DR: An inductive and amortized SBI framework that integrates calibration and distributional alignment using mini-batch optimal transport and conditional normalizing flows, addressing model misspecification without requiring test samples at inference time.

DetailsMotivation: To overcome limitations of existing SBI methods like RoPE that operate in transductive settings requiring test samples at inference time, which limits scalability and generalization in misspecified environments.

Method: Uses mini-batch optimal transport with closed-form coupling to align real and simulated observations corresponding to same latent parameters, then trains conditional normalizing flow to approximate OT-induced posterior for efficient inference.

Result: Matches or surpasses performance of RoPE and other SBI/non-SBI estimators across synthetic and real-world benchmarks including medical biomarker estimation, while offering improved scalability.

Conclusion: The proposed inductive framework successfully addresses model misspecification in SBI with better scalability and applicability than transductive approaches like RoPE.

Abstract: Simulation-based inference (SBI) is a statistical inference approach for estimating latent parameters of a physical system when the likelihood is intractable but simulations are available. In practice, SBI is often hindered by model misspecification–the mismatch between simulated and real-world observations caused by inherent modeling simplifications. RoPE, a recent SBI approach, addresses this challenge through a two-stage domain transfer process that combines semi-supervised calibration with optimal transport (OT)-based distribution alignment. However, RoPE operates in a fully transductive setting, requiring access to a batch of test samples at inference time, which limits scalability and generalization. We propose here a fully inductive and amortized SBI framework that integrates calibration and distributional alignment into a single, end-to-end trainable model. Our method leverages mini-batch OT with a closed-form coupling to align real and simulated observations that correspond to the same latent parameters, using both paired calibration data and unpaired samples. A conditional normalizing flow is then trained to approximate the OT-induced posterior, enabling efficient inference without simulation access at test time. Across a range of synthetic and real-world benchmarks–including complex medical biomarker estimation–our approach matches or surpasses the performance of RoPE, as well as other standard SBI and non-SBI estimators, while offering improved scalability and applicability in challenging, misspecified environments.

[537] MatPROV: A Provenance Graph Dataset of Material Synthesis Extracted from Scientific Literature

Hirofumi Tsuruta, Masaya Kumagai

Main category: cs.LG

TL;DR: MatPROV introduces a dataset of synthesis procedures extracted from scientific literature using PROV-DM standard, enabling graph-based modeling of complex materials synthesis workflows.

DetailsMotivation: Existing approaches for structuring synthesis procedures rely on rigid schemas or linear sequences, limiting their ability to capture the structural complexity of real-world procedures.

Method: Adopt PROV-DM international standard for provenance information to create graph-based models of synthesis procedures, extracted from scientific literature using large language models.

Result: Created MatPROV dataset that captures structural complexities and causal relationships among materials, operations, and conditions through directed graphs.

Conclusion: PROV-DM-based representation enables machine-interpretable synthesis knowledge, opening opportunities for automated synthesis planning and optimization.

Abstract: Synthesis procedures play a critical role in materials research, as they directly affect material properties. With data-driven approaches increasingly accelerating materials discovery, there is growing interest in extracting synthesis procedures from scientific literature as structured data. However, existing studies often rely on rigid, domain-specific schemas with predefined fields for structuring synthesis procedures or assume that synthesis procedures are linear sequences of operations, which limits their ability to capture the structural complexity of real-world procedures. To address these limitations, we adopt PROV-DM, an international standard for provenance information, which supports flexible, graph-based modeling of procedures. We present MatPROV, a dataset of PROV-DM-compliant synthesis procedures extracted from scientific literature using large language models. MatPROV captures structural complexities and causal relationships among materials, operations, and conditions through visually intuitive directed graphs. This representation enables machine-interpretable synthesis knowledge, opening opportunities for future research such as automated synthesis planning and optimization.
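
A PROV-DM-style synthesis step maps naturally onto a directed graph of entities (materials) and activities (operations with conditions); an illustrative sketch follows, where the concrete node and edge attributes are assumptions, not MatPROV's exact schema.

```python
import networkx as nx

G = nx.DiGraph()
G.add_node("TiO2 powder", prov_type="entity")                        # input material
G.add_node("ball milling", prov_type="activity", hours=2, rpm=300)   # operation + conditions
G.add_node("milled TiO2", prov_type="entity")                        # output material

G.add_edge("ball milling", "TiO2 powder", relation="used")
G.add_edge("milled TiO2", "ball milling", relation="wasGeneratedBy")
```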

[538] DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training

Qi Cao, Pengtao Xie

Main category: cs.LG

TL;DR: DreamPRM-1.5 introduces instance-level reweighting via bi-level optimization to address distribution shift and quality imbalance in multimodal process reward models, achieving state-of-the-art performance on multimodal reasoning benchmarks.

DetailsMotivation: Training multimodal process reward models faces challenges from distribution shift between training and test sets, and quality imbalance across training data samples. Existing domain-level reweighting methods leave a significant gap to oracle upper bounds.

Method: Instance-level reweighting framework using bi-level optimization with two regimes: Instance Table (explicit per-sample weights for small/medium data) and Instance Net (lightweight neural network for large corpora). Includes practical training techniques like time-scale matching, cold-start initialization, and bounded-range weights.

Result: Achieves 84.6% accuracy on MMMU validation set, 31.3% accuracy on R-Bench-V, and first-place results on public multimodal reasoning leaderboards when paired with GPT-5-mini. Closes the gap toward oracle performance and demonstrates stable training.

Conclusion: DreamPRM-1.5 effectively addresses the limitations of domain-level reweighting through instance-level optimization, achieving leading performance in multimodal reasoning tasks while maintaining training stability.

Abstract: Training multimodal process reward models (PRMs) is hard due to (i) distribution shift between training set and test set and (ii) quality imbalance across training data samples. While domain-level reweighting (e.g., DreamPRM) aligns training with test-time objectives, it leaves a clear gap to an oracle upper bound (pass@N), even under a “sanity check” that uses test set data to probe headroom – pointing to meta-level under-parameterization. We introduce DreamPRM-1.5, an instance-level reweighting framework that assigns an adaptive weight to every training example via bi-level optimization. To realize instance reweighting across scales, we develop two complementary regimes: Instance Table, which learns explicit per-sample weights and excels on small/medium data, and Instance Net, a lightweight neural network that generalizes better and scales to large corpora. A practical, stable training recipe – time-scale matching between upper/lower updates, cold-start initialization, and bounded-range weights – prevents divergence. Integrated with test-time scaling, DreamPRM-1.5 attains 84.6% accuracy on the MMMU validation set, 31.3% accuracy on R-Bench-V and, when paired with a leading backbone (e.g., GPT-5-mini), achieves first-place results on public multimodal reasoning leaderboards. Moreover, extensive experiments, including benchmark evaluations, baseline comparisons, and a sanity check, demonstrate that DreamPRM-1.5 closes the gap toward the oracle, achieves leading performance, and trains stably.

[539] Adversarial Graph Fusion for Incomplete Multi-view Semi-supervised Learning with Tensorial Imputation

Zhangqi Jiang, Tingjin Luo, Xu Yang, Xinyan Liang

Main category: cs.LG

TL;DR: AGF-TI addresses view missing in multi-view semi-supervised learning by using adversarial graph fusion and tensor completion to handle sub-cluster problems and incomplete structures.

DetailsMotivation: Traditional methods ignore missing samples which can cause discontinuous local structures (sub-cluster problem), breaking the smoothness assumption in label propagation and degrading classification performance.

Method: Proposes adversarial graph fusion to learn robust consensus graphs, uses tensor completion for structure recovery from high-order consistency, and incorporates anchor-based strategy for computational efficiency.

Result: Extensive experiments show AGF-TI outperforms state-of-the-art methods on various datasets.

Conclusion: AGF-TI effectively addresses the sub-cluster problem in incomplete multi-view learning through adversarial graph fusion and tensor completion, achieving superior performance.

Abstract: View missing remains a significant challenge in graph-based multi-view semi-supervised learning, hindering their real-world applications. To address this issue, traditional methods introduce a missing indicator matrix and focus on mining partial structure among existing samples in each view for label propagation (LP). However, we argue that these disregarded missing samples sometimes induce discontinuous local structures, i.e., sub-clusters, breaking the fundamental smoothness assumption in LP. Consequently, such a Sub-Cluster Problem (SCP) would distort graph fusion and degrade classification performance. To alleviate SCP, we propose a novel incomplete multi-view semi-supervised learning method, termed AGF-TI. Firstly, we design an adversarial graph fusion scheme to learn a robust consensus graph against the distorted local structure through a min-max framework. By stacking all similarity matrices into a tensor, we further recover the incomplete structure from the high-order consistency information based on the low-rank tensor learning. Additionally, the anchor-based strategy is incorporated to reduce the computational complexity. An efficient alternative optimization algorithm combining a reduced gradient descent method is developed to solve the formulated objective, with theoretical convergence. Extensive experimental results on various datasets validate the superiority of our proposed AGF-TI as compared to state-of-the-art methods. Code is available at https://github.com/ZhangqiJiang07/AGF_TI.

[540] Deep Edge Filter: Return of the Human-Crafted Layer in Deep Learning

Dongkwan Lee, Junhoo Lee, Nojun Kwak

Main category: cs.LG

TL;DR: Deep Edge Filter applies high-pass filtering to neural network features to improve generalization by isolating task-relevant high-frequency components while removing domain-biased low-frequency components.

DetailsMotivation: The hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features.

Method: Subtracting low-pass filtered outputs from original features to isolate generalizable representations while preserving architectural integrity.

Result: Experimental results across Vision, Text, 3D, and Audio domains show consistent performance improvements regardless of model architecture and data modality. The method induces feature sparsification and effectively isolates high-frequency components.

Conclusion: The approach provides empirical validation of the core hypothesis and improves model generalizability across diverse domains.

Abstract: We introduce the Deep Edge Filter, a novel approach that applies high-pass filtering to deep neural network features to improve model generalizability. Our method is motivated by our hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features. By subtracting low-pass filtered outputs from original features, our approach isolates generalizable representations while preserving architectural integrity. Experimental results across diverse domains such as Vision, Text, 3D, and Audio demonstrate consistent performance improvements regardless of model architecture and data modality. Analysis reveals that our method induces feature sparsification and effectively isolates high-frequency components, providing empirical validation of our core hypothesis. The code is available at https://github.com/dongkwani/DeepEdgeFilter.
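
The core operation is a one-liner on feature maps; a minimal PyTorch sketch follows, with average pooling standing in for whatever low-pass operator the paper uses.

```python
import torch.nn.functional as F

def deep_edge_filter(feat, kernel_size=3):
    """High-pass filter deep features by subtracting a low-pass copy.

    feat: (B, C, H, W) activations. Keeps high-frequency (task-relevant)
    components and removes low-frequency (domain-biased) ones.
    """
    low = F.avg_pool2d(feat, kernel_size, stride=1, padding=kernel_size // 2)
    return feat - low
```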

[541] $\boldsymbolλ$-Orthogonality Regularization for Compatible Representation Learning

Simone Ricci, Niccolò Biondi, Federico Pernici, Ioannis Patras, Alberto Del Bimbo

Main category: cs.LG

TL;DR: The paper proposes λ-Orthogonality regularization to adapt different learned representations while preserving original structure, achieving compatibility across model updates.

DetailsMotivation: Address the challenge of adapting latent spaces between updated and previous models while preserving newly learned representations, overcoming limitations of affine and orthogonal transformations.

Method: Impose λ-Orthogonality regularization while learning affine transformation, providing distribution-specific adaptation while retaining original learned representations.

Result: Extensive experiments show the approach preserves zero-shot performance and ensures compatibility across model updates across various architectures and datasets.

Conclusion: The proposed λ-Orthogonality regularization successfully balances adaptation and preservation of learned representations, enabling effective model compatibility.

Abstract: Retrieval systems rely on representations learned by increasingly powerful models. However, due to the high training cost and inconsistencies in learned representations, there is significant interest in facilitating communication between representations and ensuring compatibility across independently trained neural networks. In the literature, two primary approaches are commonly used to adapt different learned representations: affine transformations, which adapt well to specific distributions but can significantly alter the original representation, and orthogonal transformations, which preserve the original structure with strict geometric constraints but limit adaptability. A key challenge is adapting the latent spaces of updated models to align with those of previous models on downstream distributions while preserving the newly learned representation spaces. In this paper, we impose a relaxed orthogonality constraint, namely $\lambda$-Orthogonality regularization, while learning an affine transformation, to obtain distribution-specific adaptation while retaining the original learned representations. Extensive experiments across various architectures and datasets validate our approach, demonstrating that it preserves the model’s zero-shot performance and ensures compatibility across model updates. Code available at: https://github.com/miccunifi/lambda_orthogonality
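
The objective has a natural two-term shape: an alignment loss plus a soft orthogonality penalty weighted by $\lambda$. A minimal sketch under assumed loss choices (the paper's exact formulation may differ, e.g. a full affine map with bias):

```python
import torch

def lambda_orth_loss(W, z_new, z_old, lam=0.1):
    """Fit a linear map from new embeddings z_new (B, d_new) to old
    embeddings z_old (B, d_old) while softly penalizing departure of
    W (d_new, d_old) from orthogonality."""
    I = torch.eye(W.shape[1], device=W.device)
    align = ((z_new @ W - z_old) ** 2).mean()   # distribution-specific adaptation
    orth = ((W.T @ W - I) ** 2).sum()           # relaxed orthogonality constraint
    return align + lam * orth
```

Setting `lam` large recovers a near-orthogonal (structure-preserving) map; small `lam` approaches an unconstrained fit.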

[542] Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection

Denis Janiak, Jakub Binkowski, Tomasz Kajdanowicz

Main category: cs.LG

TL;DR: The paper analyzes how representation geometry and normalization affect Mahalanobis-based OOD detection, shows these methods aren’t universally reliable, and proposes a new radially scaled L2 normalization method to improve performance.

DetailsMotivation: To understand the impact of representation geometry and normalization on Mahalanobis distance methods for OOD detection, as this gap limits their downstream application.

Method: Comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes; analysis of ideal geometry using spectral and intrinsic-dimensionality metrics; proposal of radially scaled L2 normalization with tunable parameter.

Result: Mahalanobis-based methods aren’t universally reliable; spectral and intrinsic-dimensionality metrics can predict OOD performance; radially scaled L2 normalization significantly improves OOD detection by controlling feature space radial geometry.

Conclusion: The findings bridge representation geometry, normalization, and OOD performance, offering insights for designing more effective and reliable deep learning models.

Abstract: Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. While Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren’t universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model’s OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled $\ell_2$ normalization, a method that generalizes the standard $\ell_2$ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.
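
A sketch of the Mahalanobis score with a radial rescaling in front; the exact functional form of the paper's radially scaled $\ell_2$ normalization is an assumption here (`alpha = 1` recovers plain $\ell_2$ normalization).

```python
import numpy as np

def radial_scale(z, alpha=1.0, eps=1e-8):
    """Rescale features along the radial direction; alpha tunes how much
    the representation is contracted or expanded."""
    r = np.linalg.norm(z, axis=-1, keepdims=True) + eps
    return z / r**alpha

def mahalanobis_ood_score(z, mu, cov_inv):
    """Squared Mahalanobis distance to a class/feature mean (larger = more OOD)."""
    d = z - mu
    return np.einsum("...i,ij,...j->...", d, cov_inv, d)
```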

[543] Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration

Yiyuan Pan, Zhe Liu, Hesheng Wang

Main category: cs.LG

TL;DR: CERMIC is a multi-agent exploration framework that filters noisy surprise signals and calibrates intrinsic curiosity using inferred multi-agent context, outperforming state-of-the-art methods in sparse-reward environments.

DetailsMotivation: Existing curiosity mechanisms in MARL confuse environmental stochasticity with meaningful novelty and treat all unexpected observations equally, while overlooking peer behavior novelty that encodes latent task dynamics, leading to suboptimal exploration.

Method: CERMIC filters noisy surprise signals and guides exploration by dynamically calibrating intrinsic curiosity with inferred multi-agent context. It generates theoretically-grounded intrinsic rewards that encourage exploration of state transitions with high information gain.

Result: CERMIC significantly outperforms state-of-the-art algorithms in sparse-reward environments across benchmark suites including VMAS, Meltingpot, and SMACv2.

Conclusion: CERMIC provides an effective approach for multi-agent exploration by robustly filtering noise and leveraging peer behavior context, demonstrating superior performance in challenging sparse-reward MARL settings.

Abstract: Autonomous exploration in complex multi-agent reinforcement learning (MARL) with sparse rewards critically depends on providing agents with effective intrinsic motivation. While artificial curiosity offers a powerful self-supervised signal, it often confuses environmental stochasticity with meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform novelty bias, treating all unexpected observations equally. However, peer behavior novelty, which encodes latent task dynamics, is often overlooked, resulting in suboptimal exploration in decentralized, communication-free MARL settings. To this end, inspired by how human children adaptively calibrate their own exploratory behaviors via observing peers, we propose a novel approach to enhance multi-agent exploration. We introduce CERMIC, a principled framework that empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating their intrinsic curiosity with inferred multi-agent context. Additionally, CERMIC generates theoretically-grounded intrinsic rewards, encouraging agents to explore state transitions with high information gain. We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse-reward environments.

[544] Automotive Crash Dynamics Modeling Accelerated with Machine Learning

Mohammad Amin Nabian, Sudeep Chavare, Deepak Akhare, Rishikesh Ranade, Ram Cherukuri, Srinivas Tadepalli

Main category: cs.LG

TL;DR: Machine learning surrogate models (MeshGraphNet and Transolver) are developed to predict structural deformation in automotive crash scenarios, achieving significant computational speedup compared to traditional finite element simulations.

DetailsMotivation: Traditional finite element simulations for crashworthiness assessment are computationally expensive and time-consuming, creating a need for faster surrogate models to enable rapid design exploration.

Method: Used two neural network architectures (MeshGraphNet and Transolver) with three transient dynamics modeling strategies (time-conditional, standard Autoregressive, and stability-enhanced Autoregressive) on a Body-in-White crash dataset with 150 LS-DYNA simulations.

Result: Models captured overall deformation trends with reasonable fidelity, achieving orders-of-magnitude computational cost reduction compared to full FE simulations, though not yet matching full FE accuracy.

Conclusion: Machine learning approaches are feasible for structural crash dynamics prediction and enable rapid design exploration in crashworthiness evaluation, despite current limitations in matching full FE simulation accuracy.

Abstract: Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
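
As a rough illustration of the stability-enhanced autoregressive scheme, the sketch below trains through a multi-step rollout so the model sees its own predictions during training; the model, feature sizes, and loss are toy stand-ins, and the paper’s scheme may add further stabilization details.

```python
import torch

def rollout_loss(model, initial_state, targets, rollout_len=4):
    """Backpropagate through a multi-step autoregressive rollout so the
    model is trained on its own predictions, the error regime it will
    face at inference time. Toy stand-in; real schemes may add curricula,
    noise injection, or truncated gradients.
    """
    state, loss = initial_state, 0.0
    for step in range(rollout_len):
        state = model(state)                       # predict next deformation state
        loss = loss + torch.mean((state - targets[step]) ** 2)
    return loss / rollout_len

model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.GELU(),
                            torch.nn.Linear(64, 16))
x0 = torch.randn(8, 16)                            # 8 mesh nodes, toy feature dim
targets = [torch.randn(8, 16) for _ in range(4)]   # placeholder FE ground truth
print(rollout_loss(model, x0, targets).item())
```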

[545] ProtoTS: Learning Hierarchical Prototypes for Explainable Time Series Forecasting

Ziheng Peng, Shijie Ren, Xinyue Gu, Linxiao Yang, Xiting Wang, Liang Sun

Main category: cs.LG

TL;DR: ProtoTS is an interpretable time series forecasting framework that uses prototypical temporal patterns to achieve both high accuracy and transparent decision-making through hierarchical prototype organization and denoised representations.

DetailsMotivation: Existing interpretable models provide only local and partial explanations, lacking the capability to reveal how heterogeneous and interacting input variables jointly shape overall temporal patterns in forecast curves, which is crucial for building trust in high-stakes scenarios.

Method: ProtoTS computes instance-prototype similarity based on denoised representations that preserve heterogeneous information, with prototypes organized hierarchically to capture both global temporal patterns (coarse prototypes) and finer-grained local variations (detailed prototypes).

Result: Experiments on multiple realistic benchmarks including the LOF dataset show that ProtoTS exceeds existing methods in forecast accuracy and delivers expert-steerable interpretations for better model understanding and decision support.

Conclusion: ProtoTS successfully achieves both high forecasting accuracy and transparent decision-making through its hierarchical prototype-based approach, enabling multi-level interpretability and expert steering capabilities.

Abstract: While deep learning has achieved impressive performance in time series forecasting, it becomes increasingly crucial to understand its decision-making process for building trust in high-stakes scenarios. Existing interpretable models often provide only local and partial explanations, lacking the capability to reveal how heterogeneous and interacting input variables jointly shape the overall temporal patterns in the forecast curve. We propose ProtoTS, a novel interpretable forecasting framework that achieves both high accuracy and transparent decision-making through modeling prototypical temporal patterns. ProtoTS computes instance-prototype similarity based on a denoised representation that preserves abundant heterogeneous information. The prototypes are organized hierarchically to capture global temporal patterns with coarse prototypes while capturing finer-grained local variations with detailed prototypes, enabling expert steering and multi-level interpretability. Experiments on multiple realistic benchmarks, including a newly released LOF dataset, show that ProtoTS not only exceeds existing methods in forecast accuracy but also delivers expert-steerable interpretations for better model understanding and decision support.

[546] Preference-driven Knowledge Distillation for Few-shot Node Classification

Xing Wei, Chunchun Chen, Rui Fan, Xiaofeng Cao, Sourav Medya, Wei Ye

Main category: cs.LG

TL;DR: A preference-driven knowledge distillation framework that synergizes LLMs and GNNs for few-shot node classification on text-attributed graphs.

DetailsMotivation: GNNs rely heavily on labeled data and struggle with diverse local topologies, while LLMs have scalability issues despite good zero/few-shot performance. The framework aims to combine their complementary strengths.

Method: Uses GNN-preference-driven node selector to distill predictions from LLMs to teacher GNNs, and node-preference-driven GNN selector to identify the best teacher GNN for each node for tailored knowledge distillation to student GNN.

Result: Extensive experiments validate the framework’s efficacy in few-shot node classification on real-world TAGs.

Conclusion: The proposed PKD framework successfully combines LLMs and GNNs through preference-driven knowledge distillation for effective few-shot learning on text-attributed graphs.

Abstract: Graph neural networks (GNNs) can efficiently process text-attributed graphs (TAGs) due to their message-passing mechanisms, but their training heavily relies on human-annotated labels. Moreover, the complex and diverse local topologies of nodes in real-world TAGs are challenging for any single mechanism to handle. Large language models (LLMs) perform well in zero-/few-shot learning on TAGs but suffer from a scalability challenge. Therefore, we propose a preference-driven knowledge distillation (PKD) framework to synergize the complementary strengths of LLMs and various GNNs for few-shot node classification. Specifically, we develop a GNN-preference-driven node selector that effectively promotes prediction distillation from LLMs to teacher GNNs. To further tackle nodes’ intricate local topologies, we develop a node-preference-driven GNN selector that identifies the most suitable teacher GNN for each node, thereby facilitating tailored knowledge distillation from teacher GNNs to the student GNN. Extensive experiments validate the efficacy of our proposed framework in few-shot node classification on real-world TAGs. Our code is available.

[547] Language Models are Injective and Hence Invertible

Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodolà

Main category: cs.LG

TL;DR: Transformer language models are injective (lossless) - different inputs map to different representations, enabling exact input reconstruction from hidden activations.

DetailsMotivation: Challenge the view that transformer components like non-linear activations and normalization prevent exact input recovery due to non-injectivity.

Method: Mathematical proof of injectivity at initialization preserved during training, empirical collision tests on 6 state-of-the-art models, and development of SipIt algorithm for exact input reconstruction.

Result: No collisions observed in billions of tests, SipIt algorithm achieves linear-time exact input reconstruction from hidden activations.

Conclusion: Injectivity is a fundamental and exploitable property of language models with implications for transparency, interpretability, and safe deployment.

Abstract: Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model’s representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
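
A drastically scaled-down version of the collision-testing methodology might look like the sketch below, which fingerprints final-token hidden states and checks that distinct inputs never share one; `gpt2` is used here only as a small public stand-in for the six models actually tested.

```python
import hashlib
from collections import defaultdict

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # small public stand-in model
model = AutoModel.from_pretrained("gpt2").eval()

def hidden_fingerprint(text: str) -> str:
    """Hash the final-token hidden state; identical hashes across distinct
    inputs would be a collision, which injectivity predicts never occurs."""
    with torch.no_grad():
        ids = tok(text, return_tensors="pt")
        h = model(**ids).last_hidden_state[0, -1]
    return hashlib.sha256(h.numpy().tobytes()).hexdigest()

seen = defaultdict(list)
for s in ["the cat sat", "the cat sits", "a dog ran", "the cat sat."]:
    seen[hidden_fingerprint(s)].append(s)

print({k[:8]: v for k, v in seen.items() if len(v) > 1} or "no collisions")
```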

[548] Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking

Daria Frolova, Talgat Daulbaev, Egor Sevriugov, Sergei A. Nikolenko, Dmitry N. Ivankov, Ivan Oseledets, Marina A. Pak

Main category: cs.LG

TL;DR: Matcha is a fast molecular docking pipeline that uses multi-stage flow matching with scoring and physical filtering to achieve high accuracy in protein-ligand binding pose prediction.

DetailsMotivation: Existing protein-ligand docking methods struggle to balance speed, accuracy, and physical plausibility, creating a need for improved approaches.

Method: Three-stage flow matching pipeline operating on different geometric spaces (R³, SO(3), SO(2)) with learned scoring and unsupervised physical validity filtering to refine docking predictions.

Result: Superior performance on Astex and PDBbind test sets with higher docking success rate and physical plausibility, running ~25x faster than large-scale co-folding models.

Conclusion: Matcha provides an effective solution for accurate and fast protein-ligand docking with improved physical plausibility compared to existing methods.

Abstract: Accurate prediction of protein-ligand binding poses is crucial for structure-based drug design, yet existing methods struggle to balance speed, accuracy, and physical plausibility. We introduce Matcha, a novel molecular docking pipeline that combines multi-stage flow matching with learned scoring and physical validity filtering. Our approach consists of three sequential stages that progressively refine docking predictions, each implemented as a flow matching model operating on an appropriate geometric space ($\mathbb{R}^3$, $\mathrm{SO}(3)$, and $\mathrm{SO}(2)$). We enhance the prediction quality through a dedicated scoring model and apply unsupervised physical validity filters to eliminate unrealistic poses. Compared to various approaches, Matcha demonstrates superior performance on Astex and PDBbind test sets in terms of docking success rate and physical plausibility. Moreover, our method works approximately 25 times faster than modern large-scale co-folding models. The model weights and inference code to reproduce our results are available at https://github.com/LigandPro/Matcha.
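
For intuition, the sketch below shows plain Euler integration of a learned flow-matching velocity field over ligand coordinates; it covers only the Euclidean $\mathbb{R}^3$ stage, with the network architecture and step count as illustrative assumptions (the $\mathrm{SO}(3)$ and $\mathrm{SO}(2)$ stages would need manifold-aware updates).

```python
import torch

def euler_flow_integrate(velocity_net, x0, n_steps=20):
    """Euler integration of a learned velocity field from t=0 to t=1;
    covers the Euclidean translation stage only."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_net(torch.cat([x, t], dim=-1))
    return x

# Stand-in velocity network mapping (x, y, z, t) -> (dx, dy, dz)
velocity_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.SiLU(),
                                   torch.nn.Linear(64, 3))
ligand_coords = torch.randn(30, 3)                 # 30 ligand atoms
print(euler_flow_integrate(velocity_net, ligand_coords).shape)  # (30, 3)
```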

[549] ProSh: Probabilistic Shielding for Model-free Reinforcement Learning

Edwin Hamel-De le Court, Gaspard Ohlmann, Francesco Belardinelli

Main category: cs.LG

TL;DR: ProSh is a model-free safe RL algorithm that uses risk augmentation and shielding to ensure safety under cost constraints while preserving optimality in deterministic environments.

DetailsMotivation: Safety is crucial for deploying RL systems, requiring formal guarantees about their behavior while maintaining optimal performance.

Method: Augments Constrained MDP state space with risk budget and applies a shield to policy distribution using learned cost critic to ensure sampled actions remain safe in expectation.

Result: Provides tight upper-bound on expected cost depending on backup-critic accuracy, and guarantees safety during training under practical assumptions.

Conclusion: ProSh enables safe reinforcement learning with formal safety guarantees while preserving optimality in deterministic settings.

Abstract: Safety is a major concern in reinforcement learning (RL): we aim at developing RL systems that not only perform optimally, but are also safe to deploy by providing formal guarantees about their safety. To this end, we introduce Probabilistic Shielding via Risk Augmentation (ProSh), a model-free algorithm for safe reinforcement learning under cost constraints. ProSh augments the Constrained MDP state space with a risk budget and enforces safety by applying a shield to the agent’s policy distribution using a learned cost critic. The shield ensures that all sampled actions remain safe in expectation. We also show that optimality is preserved when the environment is deterministic. Since ProSh is model-free, safety during training depends on the knowledge we have acquired about the environment. We provide a tight upper-bound on the cost in expectation, depending only on the backup-critic accuracy, that is always satisfied during training. Under mild, practically achievable assumptions, ProSh guarantees safety even at training time, as shown in the experiments.
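
One simplified reading of the shield is sketched below: given the remaining risk budget carried in the augmented state, mask out actions whose cost-critic estimate exceeds the budget and renormalize the policy. The function signature and fallback rule are assumptions, not the authors’ exact construction.

```python
import numpy as np

def shielded_policy(policy_probs, cost_q, budget):
    """Zero out actions whose cost-critic estimate Q_c(s, a) exceeds the
    remaining risk budget, then renormalize; if nothing is safe, fall back
    to the least-costly action. Simplified reading of the shield."""
    safe = cost_q <= budget
    if not safe.any():
        safe = cost_q == cost_q.min()
    masked = np.where(safe, policy_probs, 0.0)
    return masked / masked.sum()

probs = np.array([0.5, 0.3, 0.2])                  # agent's proposed policy
q_cost = np.array([0.9, 0.2, 0.4])                 # learned cost estimates
print(shielded_policy(probs, q_cost, budget=0.5))  # action 0 is shielded out
```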

[550] Doubly Robust Estimation of Causal Effects in Strategic Equilibrium Systems

Sibo Xiao

Main category: cs.LG

TL;DR: SDR estimator combines strategic equilibrium modeling with doubly robust estimation for causal inference in strategic environments, addressing endogenous treatment from agent behavior.

DetailsMotivation: To handle endogenous treatment assignment caused by strategic agent behavior in causal inference, maintaining doubly robust properties while incorporating strategic considerations.

Method: Strategic Doubly Robust (SDR) estimator that integrates strategic equilibrium modeling with doubly robust estimation framework.

Result: SDR achieves 7.6%-29.3% bias reduction over baseline methods across varying strategic strengths and maintains robust scalability with agent populations.

Conclusion: SDR provides a principled approach for reliable causal inference when agents respond strategically to interventions, with theoretical guarantees of consistency and asymptotic normality.

Abstract: We introduce the Strategic Doubly Robust (SDR) estimator, a novel framework that integrates strategic equilibrium modeling with doubly robust estimation for causal inference in strategic environments. SDR addresses endogenous treatment assignment arising from strategic agent behavior, maintaining double robustness while incorporating strategic considerations. Theoretical analysis confirms SDR’s consistency and asymptotic normality under strategic unconfoundedness. Empirical evaluations demonstrate SDR’s superior performance over baseline methods, achieving 7.6%-29.3% bias reduction across varying strategic strengths and maintaining robust scalability with agent populations. The framework provides a principled approach for reliable causal inference when agents respond strategically to interventions.
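
SDR builds on the classical augmented inverse-propensity-weighted (AIPW) estimator, whose doubly robust core is sketched below; the strategic-equilibrium component that distinguishes SDR is not shown.

```python
import numpy as np

def aipw_ate(y, t, e_hat, mu1_hat, mu0_hat):
    """Classical augmented inverse-propensity-weighted (doubly robust) ATE:
    consistent if either the propensity model e_hat or the outcome models
    (mu1_hat, mu0_hat) are correct. SDR adds strategic-equilibrium modeling
    on top of this core; that extension is not shown here.
    """
    y, t = np.asarray(y, float), np.asarray(t, float)
    return np.mean(
        mu1_hat - mu0_hat
        + t * (y - mu1_hat) / e_hat
        - (1 - t) * (y - mu0_hat) / (1 - e_hat)
    )

rng = np.random.default_rng(0)
n = 1000
t = rng.integers(0, 2, n)
y = 2.0 * t + rng.normal(size=n)                   # true ATE = 2
print(aipw_ate(y, t, e_hat=0.5, mu1_hat=0.0, mu0_hat=0.0))  # ~2.0
```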

[551] Learning from Mistakes: Enhancing Harmful Meme Detection via Misjudgment Risk Patterns

Wenshuo Wang, Ziyou Jiang, Junjie Wang, Mingyang Li, Jie Huang, Yuekai Huang, Zhiyuan Chang, Feiyan Duan, Qing Wang

Main category: cs.LG

TL;DR: PatMD improves harmful meme detection by identifying misjudgment risk patterns and proactively guiding MLLMs to avoid known pitfalls, achieving significant performance improvements over state-of-the-art methods.

DetailsMotivation: Internet memes are increasingly weaponized to convey harmful opinions through subtle rhetorical devices like irony and metaphor, and existing detection approaches struggle with these implicit expressions, leading to frequent misjudgments.

Method: Construct a knowledge base where each meme is deconstructed into misjudgment risk patterns explaining potential false negatives or false positives. For target memes, retrieve relevant patterns to dynamically guide MLLM reasoning.

Result: Experiments on 6,626 memes across 5 harmful detection tasks show PatMD outperforms state-of-the-art baselines with 8.30% improvement in F1-score and 7.71% improvement in accuracy.

Conclusion: PatMD demonstrates strong generalizability and improved detection capability for harmful memes by proactively mitigating misjudgment risks through pattern-based guidance.

Abstract: Internet memes have emerged as a popular multimodal medium, yet they are increasingly weaponized to convey harmful opinions through subtle rhetorical devices like irony and metaphor. Existing detection approaches, including MLLM-based techniques, struggle with these implicit expressions, leading to frequent misjudgments. This paper introduces PatMD, a novel approach that improves harmful meme detection by learning from and proactively mitigating these potential misjudgment risks. Our core idea is to move beyond superficial content-level matching and instead identify the underlying misjudgment risk patterns, proactively guiding the MLLMs to avoid known misjudgment pitfalls. We first construct a knowledge base where each meme is deconstructed into a misjudgment risk pattern explaining why it might be misjudged, either overlooking harmful undertones (false negative) or overinterpreting benign content (false positive). For a given target meme, PatMD retrieves relevant patterns and utilizes them to dynamically guide the MLLM’s reasoning. Experiments on a benchmark of 6,626 memes across 5 harmful detection tasks show that PatMD outperforms state-of-the-art baselines, achieving an average of 8.30% improvement in F1-score and 7.71% improvement in accuracy, demonstrating strong generalizability and improved detection capability of harmful memes.
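
A generic reading of the retrieval step is sketched below: embed the target meme, retrieve the top-k misjudgment-risk patterns by cosine similarity, and hand them to the MLLM as guidance. The embedding model and pattern texts are placeholder assumptions.

```python
import numpy as np

def retrieve_risk_patterns(meme_embedding, pattern_embeddings, patterns, k=3):
    """Cosine-similarity retrieval of misjudgment-risk patterns to prepend
    to the MLLM prompt; a generic RAG-style reading of the retrieval step."""
    q = meme_embedding / np.linalg.norm(meme_embedding)
    P = pattern_embeddings / np.linalg.norm(pattern_embeddings, axis=1,
                                            keepdims=True)
    top = np.argsort(P @ q)[::-1][:k]
    return [patterns[i] for i in top]

patterns = ["irony misread as benign (false negative)",
            "slang overinterpreted as slur (false positive)",
            "metaphor masking dehumanization (false negative)"]
emb = np.random.randn(3, 64)                       # stand-in pattern embeddings
print(retrieve_risk_patterns(np.random.randn(64), emb, patterns, k=2))
```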

[552] MEET-Sepsis: Multi-Endogenous-View Enhanced Time-Series Representation Learning for Early Sepsis Prediction

Zexi Tan, Tao Xie, Binbin Sun, Xiang Zhang, Yiqun Zhang, Yiu-Ming Cheung

Main category: cs.LG

TL;DR: The paper proposes MEET-Sepsis, a framework for early sepsis prediction using multi-view feature enhancement and multi-scale temporal attention, achieving competitive accuracy with only 20% of ICU monitoring time compared to state-of-the-art methods.

DetailsMotivation: Early sepsis prediction is critical for timely intervention but remains challenging due to subtle early manifestations and rapidly escalating mortality. Existing AI methods struggle to capture weak early temporal signals.

Method: Introduces a Multi-Endogenous-view Representation Enhancement (MERE) mechanism for enriched feature views and a Cascaded Dual-convolution Time-series Attention (CDTA) module for multi-scale temporal representation learning.

Result: The MEET-Sepsis framework achieves competitive prediction accuracy using only 20% of the ICU monitoring time required by state-of-the-art methods.

Conclusion: The proposed method significantly advances early sepsis prediction, with extensive validation confirming its efficacy.

Abstract: Sepsis is a life-threatening infectious syndrome associated with high mortality in intensive care units (ICUs). Early and accurate sepsis prediction (SP) is critical for timely intervention, yet remains challenging due to subtle early manifestations and rapidly escalating mortality. While AI has improved SP efficiency, existing methods struggle to capture weak early temporal signals. This paper introduces a Multi-Endogenous-view Representation Enhancement (MERE) mechanism to construct enriched feature views, coupled with a Cascaded Dual-convolution Time-series Attention (CDTA) module for multi-scale temporal representation learning. The proposed MEET-Sepsis framework achieves competitive prediction accuracy using only 20% of the ICU monitoring time required by SOTA methods, significantly advancing early SP. Extensive validation confirms its efficacy. Code is available at: https://github.com/yueliangy/MEET-Sepsis.

[553] Airfoil optimization using Design-by-Morphing with minimized design-space dimensionality

Sangjoon Lee, Haris Moazam Sheikh

Main category: cs.LG

TL;DR: AirDbM is a specialized Design-by-Morphing approach for airfoil optimization that uses only 12 optimal baseline airfoils to achieve high reconstruction accuracy and superior performance in multi-objective optimization compared to methods using more baselines.

DetailsMotivation: To enable effective airfoil geometry optimization by exploring diverse designs with minimal design variables, reducing design-space dimensionality while maintaining performance.

Method: Selects optimal baseline airfoils from UIUC database (1,600+ shapes) by sequentially adding baselines that maximize design capacity, then uses these 12 baselines for airfoil reconstruction and optimization.

Result: Reconstructs 99% of database with mean absolute error <0.005, achieves rapid convergence in multi-objective optimization with greater Pareto front hypervolume, and discovers new Pareto-optimal solutions with enhanced lift-to-drag ratios.

Conclusion: AirDbM demonstrates superior performance over previous methods with fewer baselines, and shows outstanding adaptability for reinforcement learning agents, indicating broader potential of DbM in machine learning-driven design.

Abstract: Effective airfoil geometry optimization requires exploring a diverse range of designs using as few design variables as possible. This study introduces AirDbM, a Design-by-Morphing (DbM) approach specialized for airfoil optimization that systematically reduces design-space dimensionality. AirDbM selects an optimal set of 12 baseline airfoils from the UIUC airfoil database, which contains over 1,600 shapes, by sequentially adding the baseline that most increases the design capacity. With these baselines, AirDbM reconstructs 99% of the database with a mean absolute error below 0.005, which matches the performance of a previous DbM approach that used more baselines. In multi-objective aerodynamic optimization, AirDbM demonstrates rapid convergence and achieves a Pareto front with a greater hypervolume than that of the previous larger-baseline study, where new Pareto-optimal solutions are discovered with enhanced lift-to-drag ratios at moderate stall tolerances. Furthermore, AirDbM demonstrates outstanding adaptability for reinforcement learning (RL) agents in generating airfoil geometry when compared to conventional airfoil parameterization methods, implying the broader potential of DbM in machine learning-driven design.
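
The core Design-by-Morphing operation can be pictured as a weighted blend of baseline contours, as in the sketch below; the convex-weight constraint and point-wise correspondence are simplifying assumptions, and AirDbM’s actual morphing scheme may differ.

```python
import numpy as np

def morph_airfoil(baselines: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Blend baseline airfoil contours by weighted combination; `baselines`
    has shape (n_baselines, n_points, 2) with all shapes sampled at matching
    arc-length coordinates. Convex weights are an illustrative choice."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                # convex combination
    return np.tensordot(w, baselines, axes=1)

# 12 baseline shapes, each a closed contour of 200 (x, y) points
baselines = np.random.rand(12, 200, 2)
new_shape = morph_airfoil(baselines, np.ones(12))  # equal-weight morph
print(new_shape.shape)                             # (200, 2)
```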

[554] Expressive Reward Synthesis with the Runtime Monitoring Language

Daniel Donnelly, Angelo Ferrando, Francesco Belardinelli

Main category: cs.LG

TL;DR: The paper introduces a new class of language-based Reward Machines using Runtime Monitoring Language (RML) to handle non-regular, non-Markovian reward functions that traditional Reward Machines cannot capture.

DetailsMotivation: Traditional RL reward functions are black-box mappings that lack interpretability and cannot capture complex behaviors like counting or parameterized conditions. Reward Machines help but are limited to regular languages.

Method: Developed language-based Reward Machines using Runtime Monitoring Language (RML), leveraging its built-in memory to specify non-regular, non-Markovian reward functions.

Result: The approach demonstrates expressiveness for complex tasks, with additional advantages in flexible event-handling and task specification compared to existing Reward Machine methods.

Conclusion: The proposed language-based Reward Machines using RML successfully extend the expressivity beyond regular languages, enabling specification of more complex reward functions while maintaining interpretability.

Abstract: A key challenge in reinforcement learning (RL) is reward (mis)specification, whereby imprecisely defined reward functions can result in unintended, possibly harmful, behaviours. Indeed, reward functions in RL are typically treated as black-box mappings from state-action pairs to scalar values. While effective in many settings, this approach provides no information about why rewards are given, which can hinder learning and interpretability. Reward Machines address this issue by representing reward functions as finite state automata, enabling the specification of structured, non-Markovian reward functions. However, their expressivity is typically bounded by regular languages, leaving them unable to capture more complex behaviours such as counting or parametrised conditions. In this work, we build on the Runtime Monitoring Language (RML) to develop a novel class of language-based Reward Machines. By leveraging the built-in memory of RML, our approach can specify reward functions for non-regular, non-Markovian tasks. We demonstrate the expressiveness of our approach through experiments, highlighting additional advantages in flexible event-handling and task specification over existing Reward Machine-based methods.
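
To see why memory matters, the sketch below hand-rolls a monitor for the non-regular pattern a^n b^n (equal counts of two events), which no finite-state Reward Machine can express; RML generalizes this with a full specification language, so the class and reward rule here are only an illustrative analogue.

```python
class CountingRewardMachine:
    """Toy reward monitor with unbounded memory for the non-regular task
    "see exactly as many `b` events as preceding `a` events" (a^n b^n).
    Hand-rolled analogue of what RML's built-in memory enables."""
    def __init__(self):
        self.count = 0
        self.failed = False

    def step(self, event: str) -> float:
        if event == "a":
            self.count += 1
        elif event == "b":
            self.count -= 1
            if self.count < 0:       # a `b` with no matching `a`
                self.failed = True
        # reward 1.0 the moment all a's are matched by b's, else 0.0
        return 1.0 if (not self.failed and self.count == 0
                       and event == "b") else 0.0

rm = CountingRewardMachine()
print([rm.step(e) for e in ["a", "a", "b", "b"]])  # [0.0, 0.0, 0.0, 1.0]
```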

[555] Diverse Influence Component Analysis: A Geometric Approach to Nonlinear Mixture Identifiability

Hoang-Son Nguyen, Xiao Fu

Main category: cs.LG

TL;DR: Diverse Influence Component Analysis (DICA) enables latent component identification from nonlinear mixtures using Jacobian Volume Maximization, achieving identifiability without auxiliary signals, independence assumptions, or sparsity requirements.

DetailsMotivation: To address the fundamental challenge of identifying latent components from unknown nonlinear mixtures, overcoming limitations of existing methods that rely on auxiliary signals, independence assumptions, or Jacobian sparsity.

Method: Proposes DICA framework with Jacobian Volume Maximization (J-VolMax) criterion that exploits convex geometry of mixing function’s Jacobian to encourage diversity in latent components’ influence on observed variables.

Result: Achieves identifiability of latent components without requiring auxiliary information, latent component independence, or Jacobian sparsity assumptions.

Conclusion: Extends the scope of identifiability analysis and provides a complementary perspective to existing nonlinear ICA methods by leveraging geometric properties of the mixing function.

Abstract: Latent component identification from unknown nonlinear mixtures is a foundational challenge in machine learning, with applications in tasks such as disentangled representation learning and causal inference. Prior work in nonlinear independent component analysis (nICA) has shown that auxiliary signals – such as weak supervision – can support identifiability of conditionally independent latent components. More recent approaches explore structural assumptions, e.g., sparsity in the Jacobian of the mixing function, to relax such requirements. In this work, we introduce Diverse Influence Component Analysis (DICA), a framework that exploits the convex geometry of the mixing function’s Jacobian. We propose a Jacobian Volume Maximization (J-VolMax) criterion, which enables latent component identification by encouraging diversity in their influence on the observed variables. Under reasonable conditions, this approach achieves identifiability without relying on auxiliary information, latent component independence, or Jacobian sparsity assumptions. These results extend the scope of identifiability analysis and offer a complementary perspective to existing methods.
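
One way a Jacobian-volume criterion could be computed is sketched below: evaluate the mixing function’s Jacobian at a latent point and score its volume via the log-determinant of the Gram matrix. The decoder architecture and exact objective are assumptions rather than the paper’s J-VolMax formulation.

```python
import torch

def log_jacobian_volume(mixing_fn, z):
    """log sqrt(det(J^T J)) of the mixing function at latent z; the volume
    term a J-VolMax-style criterion would encourage to be large."""
    J = torch.autograd.functional.jacobian(mixing_fn, z)  # shape (d_x, d_z)
    gram = J.T @ J
    return 0.5 * torch.logdet(gram)

# Stand-in nonlinear mixing function from 2 latents to 5 observed variables
mixing_fn = torch.nn.Sequential(torch.nn.Linear(2, 5), torch.nn.Tanh(),
                                torch.nn.Linear(5, 5))
z = torch.randn(2)
print(log_jacobian_volume(mixing_fn, z))
```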

[556] TrajMamba: An Efficient and Semantic-rich Vehicle Trajectory Pre-training Model

Yichen Liu, Yan Lin, Shengnan Guo, Zeyu Zhou, Youfang Lin, Huaiyu Wan

Main category: cs.LG

TL;DR: TrajMamba is a novel approach for efficient and semantically rich vehicle trajectory learning that addresses computational burden from textual data and redundancy in trajectory points through specialized encoders and pre-training schemes.

DetailsMotivation: To overcome challenges in learning travel semantics from GPS trajectories, including computational burden from textual road/POI information and efficiency issues from redundant trajectory points.

Method: Uses Traj-Mamba Encoder for joint GPS-road modeling, Travel Purpose-aware Pre-training for embedding travel purposes, and Knowledge Distillation Pre-training with learnable mask generator to compress trajectories.

Result: Outperforms state-of-the-art baselines in both efficiency and accuracy on two real-world datasets and three downstream tasks.

Conclusion: TrajMamba provides an effective solution for learning semantically rich vehicle trajectory embeddings while maintaining computational efficiency.

Abstract: Vehicle GPS trajectories record how vehicles move over time, storing valuable travel semantics, including movement patterns and travel purposes. Learning travel semantics effectively and efficiently is crucial for real-world applications of trajectory data, which is hindered by two major challenges. First, travel purposes are tied to the functions of the roads and points-of-interest (POIs) involved in a trip. Such information is encoded in textual addresses and descriptions and introduces heavy computational burden to modeling. Second, real-world trajectories often contain redundant points, which harm both computational efficiency and trajectory embedding quality. To address these challenges, we propose TrajMamba, a novel approach for efficient and semantically rich vehicle trajectory learning. TrajMamba introduces a Traj-Mamba Encoder that captures movement patterns by jointly modeling both GPS and road perspectives of trajectories, enabling robust representations of continuous travel behaviors. It also incorporates a Travel Purpose-aware Pre-training procedure to integrate travel purposes into the learned embeddings without introducing extra overhead to embedding calculation. To reduce redundancy in trajectories, TrajMamba features a Knowledge Distillation Pre-training scheme to identify key trajectory points through a learnable mask generator and obtain effective compressed trajectory embeddings. Extensive experiments on two real-world datasets and three downstream tasks show that TrajMamba outperforms state-of-the-art baselines in both efficiency and accuracy.
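
The learnable mask generator could plausibly be realized with a binary Concrete (Gumbel-sigmoid) relaxation so that keep/drop decisions over trajectory points stay differentiable, as sketched below; TrajMamba’s actual generator is unspecified here, so treat this as an assumed stand-in.

```python
import torch

def concrete_keep_mask(point_scores: torch.Tensor, temperature: float = 0.5):
    """Differentiable keep/drop mask over trajectory points via the binary
    Concrete (Gumbel-sigmoid) relaxation; an assumed stand-in, not
    necessarily the paper's generator."""
    u = torch.rand_like(point_scores).clamp(1e-6, 1 - 1e-6)
    logistic_noise = torch.log(u) - torch.log(1 - u)
    return torch.sigmoid((point_scores + logistic_noise) / temperature)

scores = torch.randn(128)              # one learned score per GPS point
mask = concrete_keep_mask(scores)
print((mask > 0.5).float().mean())     # fraction of points (softly) kept
```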

[557] Enabling Fine-Grained Operating Points for Black-Box LLMs

Ege Beyazit, KL Navaneet, Prashant Mathur, Roi Blanco, Vidit Bansal, Karim Bouyarmane

Main category: cs.LG

TL;DR: Black-box LLMs have limited operational granularity due to low numerical output cardinalities, making it hard to control specific metrics like precision. This paper proposes efficient methods to increase operating points without performance loss.

DetailsMotivation: Black-box LLMs are practical for decision-making but lack fine-grained control over operating points due to low numerical output cardinalities, preventing adjustment of metrics like precision constraints.

Method: Investigates reasons for low-cardinality outputs, tests standard techniques (prompt engineering, uncertainty estimation), and proposes efficient approaches to increase operating point diversity.

Result: Proposed approaches significantly increase operating points and achieve comparable or better performance than benchmarks across 11 datasets and 3 LLMs.

Conclusion: Efficient methods can improve black-box LLMs’ operational granularity without sacrificing performance, enabling better control over decision-making metrics.

Abstract: Black-box Large Language Models (LLMs) provide practical and accessible alternatives to other machine learning methods, as they require minimal labeled data and machine learning expertise to develop solutions for various decision making problems. However, for applications that need operating with constraints on specific metrics (e.g., precision $\geq$ 95%), decision making with black-box LLMs remains unfavorable, due to their low numerical output cardinalities. This results in limited control over their operating points, preventing fine-grained adjustment of their decision making behavior. In this paper, we study using black-box LLMs as classifiers, focusing on efficiently improving their operational granularity without performance loss. Specifically, we first investigate the reasons behind their low-cardinality numerical outputs and show that they are biased towards generating rounded but informative verbalized probabilities. Then, we experiment with standard prompt engineering, uncertainty estimation and confidence elicitation techniques, and observe that they do not effectively improve operational granularity without sacrificing performance or increasing inference cost. Finally, we propose efficient approaches to significantly increase the number and diversity of available operating points. Our proposed approaches provide finer-grained operating points and achieve comparable to or better performance than the benchmark methods across 11 datasets and 3 LLMs.
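
The granularity problem is easy to see numerically: thresholding rounded verbalized probabilities yields only as many operating points as there are distinct scores. The sketch below enumerates them; the data is synthetic.

```python
import numpy as np

def operating_points(scores, labels):
    """Enumerate the precision/recall operating points reachable by
    thresholding; with rounded verbalized probabilities, only a handful
    of distinct thresholds (and hence points) exist."""
    points = []
    for thresh in sorted(set(scores), reverse=True):
        pred = np.asarray(scores) >= thresh
        tp = (pred & np.asarray(labels).astype(bool)).sum()
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(sum(labels), 1)
        points.append((thresh, precision, recall))
    return points

# Only 3 distinct verbalized scores -> only 3 operating points
scores = [0.9, 0.9, 0.7, 0.7, 0.5, 0.5]
labels = [1, 1, 1, 0, 0, 0]
for t, p, r in operating_points(scores, labels):
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```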

cs.MA

[558] TACLA: An LLM-Based Multi-Agent Tool for Transactional Analysis Training in Education

Monika Zamojska, Jarosław A. Chudziak

Main category: cs.MA

TL;DR: TACLA is a Multi-Agent architecture that integrates Transactional Analysis principles to model psychologically authentic social dynamics using LLMs, validated in educational scenarios.

DetailsMotivation: Existing LLM-based social simulations lack psychological depth and consistent persona behavior needed for high-fidelity training tools in education and other domains.

Method: TACLA models agents as orchestrated systems of Parent, Adult, and Child ego states with individual pattern memories, using an Orchestrator Agent to prioritize ego state activation based on contextual triggers and life scripts.

Result: TACLA demonstrates realistic ego state shifts in Student Agents, effectively modeling conflict de-escalation and escalation based on teacher intervention strategies, with high conversational credibility.

Conclusion: TACLA advances psychologically-grounded social simulations for effective AI tools in education and beyond by creating dynamic, authentic agent interactions.

Abstract: Simulating nuanced human social dynamics with Large Language Models (LLMs) remains a significant challenge, particularly in achieving psychological depth and consistent persona behavior crucial for high-fidelity training tools. This paper introduces TACLA (Transactional Analysis Contextual LLM-based Agents), a novel Multi-Agent architecture designed to overcome these limitations. TACLA integrates core principles of Transactional Analysis (TA) by modeling agents as an orchestrated system of distinct Parent, Adult, and Child ego states, each with its own pattern memory. An Orchestrator Agent prioritizes ego state activation based on contextual triggers and an agent’s life script, ensuring psychologically authentic responses. Validated in an educational scenario, TACLA demonstrates realistic ego state shifts in Student Agents, effectively modeling conflict de-escalation and escalation based on different teacher intervention strategies. Evaluation shows high conversational credibility and confirms TACLA’s capacity to create dynamic, psychologically-grounded social simulations, advancing the development of effective AI tools for education and beyond.

[559] Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning

Rui Jerry Huang, Wendy Liu, Anastasia Miin

Main category: cs.MA

TL;DR: AdCo is a novel inference-time framework that uses adaptive coopetition (collaboration + competition) with UCB-based mechanisms to enhance LLM reasoning without requiring high-performance verifiers, achieving 20% improvement on challenging math benchmarks.

DetailsMotivation: Existing inference-time methods like self-correction reinforce biases and Multi-Agent Collaboration lacks efficient coordination, while high-performance verifiers require substantial training. The goal is to develop a robust framework that leverages model diversity without these limitations.

Method: LLM agents use adaptive UCB-based coopetition mechanism with coarse verifier signals to decide between collaboration and competition, iteratively refining reasoning through peer feedback without relying on high-performance verifiers.

Result: Achieved 20% relative improvement over baselines on challenging mathematical reasoning benchmarks, with robust and consistent performance across different sample sizes and configurations.

Conclusion: The adaptive coopetition framework enhances reasoning robustness by leveraging model diversity and reasoning trace measures while promoting uncertainty-driven exploration, offering a new perspective on inference-time computation for resilient multi-agent LLM systems.

Abstract: Inference-time computation is a critical yet challenging paradigm for enhancing the reasoning performance of large language models (LLMs). While existing strategies improve reasoning stability and consistency, they suffer from notable limitations: self-correction often reinforces the model’s initial biases, and Multi-Agent Collaboration (MAC) often fails due to the lack of efficient coordination mechanisms, leading to collective errors. Although high-performing verifiers can detect reasoning errors, making them reliable requires substantial training. To address these challenges, we introduce a novel inference-time framework, Adaptive Coopetition (AdCo), in which LLM agents utilize an adaptive, UCB-based “coopetition” mechanism. At each round, agents leverage coarse verifier signals to determine whether to collaborate or compete, and iteratively refine their reasoning based on peer feedback. Without relying on high-performance verifiers, our adaptive strategy achieves significant performance gains on mathematical reasoning benchmarks, yielding a 20% relative improvement over baselines on the more challenging dataset. Our approach remains robust and consistent in terms of accuracy under different sample sizes and configurations. This adaptive, signal-guided “coopetition” framework enhances reasoning robustness by leveraging both model knowledge diversity and reasoning trace measures, while also promoting uncertainty-driven exploration, especially when participants have comparable capabilities. From this perspective, our work offers a fresh lens on inference-time computation and paves the way for more resilient multi-agent LLM systems. Our code is available at: https://github.com/AdCo-Research/adaptive-coopetition.
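
The abstract describes a UCB-based choice between collaborating and competing; a schematic version of such a rule is sketched below, treating coarse verifier scores as rewards. The exploration constant and bookkeeping are assumptions, not the authors’ exact formula.

```python
import math

def choose_mode(stats, total_rounds, c=1.4):
    """UCB over the two meta-actions {collaborate, compete}, using coarse
    verifier scores as rewards; a schematic reading of adaptive coopetition.

    stats maps mode -> (pull count, mean verifier reward so far).
    """
    best, best_ucb = None, -float("inf")
    for mode, (n, mean_reward) in stats.items():
        if n == 0:
            return mode                # try each mode at least once
        ucb = mean_reward + c * math.sqrt(math.log(total_rounds) / n)
        if ucb > best_ucb:
            best, best_ucb = mode, ucb
    return best

stats = {"collaborate": (5, 0.62), "compete": (3, 0.55)}
print(choose_mode(stats, total_rounds=8))
```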

[560] The Emergence of Complex Behavior in Large-Scale Ecological Environments

Joseph Bejjani, Chase Van Amburg, Chengrui Wang, Chloe Huangyuan Su, Sarah M. Pratt, Yasin Mazloumi, Naeem Khoshnevis, Sham M. Kakade, Kianté Brantley

Main category: cs.MA

TL;DR: This paper explores how physical scale and population size shape the emergence of complex behaviors in open-ended ecological environments through unsupervised evolution, reproduction, mutation, and natural selection.

DetailsMotivation: To examine how behaviors emerge and evolve across large populations due to natural competition and environmental pressures, rather than optimizing single high-performance policies.

Method: Conducted experiments in large-scale worlds with populations over 60,000 agents, each with evolved neural network policies, using unsupervised evolution with reproduction, mutation, and natural selection in dynamic ecological environments.

Result: Identified emergent behaviors like long-range resource extraction, vision-based foraging, and predation that arise under competitive pressures. Found that some behaviors only appear in sufficiently large environments and populations, with larger scales increasing behavioral stability and consistency.

Conclusion: Scaling results provide promising directions to explore ecology as an instrument of machine learning in an era of abundant computational resources, showing that environmental scale and population size significantly impact behavioral emergence.

Abstract: We explore how physical scale and population size shape the emergence of complex behaviors in open-ended ecological environments. In our setting, agents are unsupervised and have no explicit rewards or learning objectives but instead evolve over time according to reproduction, mutation, and natural selection. As they act, agents also shape their environment and the population around them in an ongoing dynamic ecology. Our goal is not to optimize a single high-performance policy, but instead to examine how behaviors emerge and evolve across large populations due to natural competition and environmental pressures. In an effort to discover how complex behaviors naturally emerge, we conduct experiments in large-scale worlds that reach populations of more than 60,000 individual agents, each with their own evolved neural network policy. We identify various emergent behaviors such as long-range resource extraction, vision-based foraging, and predation that arise under competitive and survival pressures. We examine how sensing modalities and environmental scale affect the emergence of these behaviors, finding that some appear only in sufficiently large environments and populations, with larger scales increasing behavioral stability and consistency. While there is a rich history of research in evolutionary settings, our scaling results provide promising new directions to explore ecology as an instrument of machine learning in an era of abundant computational resources. Experimental code is available at https://github.com/jbejjani2022/ecological-emergent-behavior.

[561] From Agent Simulation to Social Simulator: A Comprehensive Review (Part 1)

Xiao Xue, Deyu Zhou, Ming Zhang, Fei-Yue Wang

Main category: cs.MA

TL;DR: This paper reviews the historical development of Agent-Based Modeling (ABM), its design principles, and classic applications in social simulation.

DetailsMotivation: To address the limitations of traditional physical simulation methods in social domains and provide a comprehensive understanding of ABM’s evolution and foundational concepts.

Method: The review covers ABM’s development history, design principles, and foundational models including individual models, environmental models, and rule-based models. It also examines classic social simulation cases categorized as thought experiments, mechanism exploration, and parallel optimization.

Result: The paper presents a systematic overview of ABM’s historical trajectory and its application in social simulation through various classic case studies.

Conclusion: This comprehensive review establishes ABM as a valuable approach for social system simulation, highlighting its ability to overcome traditional simulation limitations and providing a foundation for understanding complex social phenomena.

Abstract: This is the first part of a comprehensive review, focusing on the historical development of Agent-Based Modeling (ABM) and its classic cases. It begins by discussing the development history and design principles of Agent-Based Modeling (ABM), helping readers understand the significant challenges that traditional physical simulation methods face in the social domain. Then, it provides a detailed introduction to foundational models for simulating social systems, including individual models, environmental models, and rule-based models. Finally, it presents classic cases of social simulation, covering three types: thought experiments, mechanism exploration, and parallel optimization.

[562] Socialized Learning and Emergent Behaviors in Multi-Agent Systems based on Multimodal Large Language Models

Sureyya Akin, Shruti T. Tiwari, Ram Bhattacharya, Sagar A. Raman, Kiran Mohanty, Sita Krishnan

Main category: cs.MA

TL;DR: M-S2L framework integrates multimodal LLMs with social learning to develop AI agents with emergent social intelligence, enabling collaborative problem-solving through multimodal perception and communication.

DetailsMotivation: To foster emergent social intelligence in AI agents by combining multimodal perception with social learning mechanisms for human-like collaborative capabilities.

Method: Multimodal Socialized Learning Framework (M-S2L) that integrates multimodal LLMs with direct reinforcement learning, multimodal observational learning, communication-driven learning from feedback, and episodic memory for social context.

Result: M-S2L agents outperformed Text-Only and No-Social-Learning baselines in Collaborative Assembly Environment tasks, showing improved task completion rates and faster completion times, especially in dynamic scenarios. Agents developed efficient communication protocols and role specialization.

Conclusion: Integrating multimodal perception with explicit social learning is essential for developing human-like collaborative intelligence in multi-agent systems, enabling emergent social cognition and adaptive problem-solving.

Abstract: This research introduces the Multimodal Socialized Learning Framework (M-S2L), designed to foster emergent social intelligence in AI agents by integrating Multimodal Large Language Models (M-LLMs) with social learning mechanisms. The framework equips agents with multimodal perception (vision and text) and structured action capabilities, enabling physical manipulation and grounded multimodal communication (e.g., text with visual pointers). M-S2L combines direct reinforcement learning with two novel social learning pathways: multimodal observational learning and communication-driven learning from feedback, augmented by an episodic memory system for long-term social context. We evaluate M-S2L in a Collaborative Assembly Environment (CAE), where agent teams must construct complex devices from ambiguous blueprints under informational asymmetry. Across tasks of increasing complexity, M-S2L agents consistently outperform Text-Only and No-Social-Learning baselines in Task Completion Rate and Time to Completion, particularly in dynamic problem-solving scenarios. Ablation studies confirm the necessity of both multimodality and socialized learning. Our analysis reveals the emergence of efficient communication protocols integrating visual pointers with concise text, alongside rapid role specialization leading to stable labor division. Qualitative case studies demonstrate agents’ abilities for shared awareness, dynamic re-planning, and adaptive problem-solving, suggesting a nascent form of machine social cognition. These findings indicate that integrating multimodal perception with explicit social learning is critical for developing human-like collaborative intelligence in multi-agent systems.

[563] Fetch.ai: An Architecture for Modern Multi-Agent Systems

Michael J. Wooldridge, Attila Bagoly, Jonathan J. Ward, Emanuele La Malfa, Gabriel Paludo Licks

Main category: cs.MA

TL;DR: Fetch.ai architecture bridges classical multi-agent systems with modern AI by providing a decentralized blockchain-based platform for creating secure, interoperable agents with intelligent orchestration.

DetailsMotivation: Current LLM-driven systems overlook decades of MAS research, resulting in limitations like centralization and poor trust/communication protocols.

Method: Multi-layered solution with decentralized blockchain services for identity/discovery/transactions, development framework for agents, cloud deployment platform, and agent-native LLM for workflow orchestration.

Result: Demonstrated through a decentralized logistics use case where autonomous agents dynamically discover, negotiate and transact securely.

Conclusion: Provides a principled architecture for open, collaborative, and economically sustainable multi-agent ecosystems beyond current implementations.

Abstract: Recent surges in LLM-driven intelligent systems largely overlook decades of foundational multi-agent systems (MAS) research, resulting in frameworks with critical limitations such as centralization and inadequate trust and communication protocols. This paper introduces the Fetch.ai architecture, an industrial-strength platform designed to bridge this gap by facilitating the integration of classical MAS principles with modern AI capabilities. We present a novel, multi-layered solution built on a decentralized foundation of on-chain blockchain services for verifiable identity, discovery, and transactions. This is complemented by a comprehensive development framework for creating secure, interoperable agents, a cloud-based platform for deployment, and an intelligent orchestration layer where an agent-native LLM translates high-level human goals into complex, multi-agent workflows. We demonstrate the deployed nature of this system through a decentralized logistics use case where autonomous agents dynamically discover, negotiate, and transact with one another securely. Ultimately, the Fetch.ai stack provides a principled architecture for moving beyond current agent implementations towards open, collaborative, and economically sustainable multi-agent ecosystems.

[564] Computational Foundations for Strategic Coopetition: Formalizing Interdependence and Complementarity

Vik Pant, Eric Yu

Main category: cs.MA

TL;DR: This paper bridges the gap between qualitative conceptual modeling and quantitative game theory by developing computational foundations for analyzing strategic coopetition, focusing on interdependence and complementarity dimensions.

DetailsMotivation: Modern socio-technical systems involve strategic coopetition where actors cooperate to create value and compete to capture it. Existing approaches like i* modeling provide qualitative representations but lack quantitative analysis, while game theory offers mathematical rigor but lacks contextual richness.

Method: The paper formalizes interdependence through i* structural dependency analysis translated into quantitative coefficients, and formalizes complementarity using Brandenburger and Nalebuff’s Added Value concept. It integrates structural dependencies with bargaining power and introduces a game-theoretic formulation with Nash Equilibrium incorporating structural interdependence.

Result: Validation shows functional form robustness across power and logarithmic value function specifications. Empirical application to the Samsung-Sony S-LCD joint venture demonstrates logarithmic specifications achieve superior empirical fit (45/60 validation score) while power functions provide theoretical tractability.

Conclusion: This technical report serves as the foundational reference for a coordinated research program on strategic coopetition in requirements engineering and multi-agent systems, with companion work addressing trust dynamics, team production, and reciprocity mechanisms.

Abstract: Modern socio-technical systems are characterized by strategic coopetition where actors simultaneously cooperate to create value and compete to capture it. While conceptual modeling languages like i* provide rich qualitative representations of strategic dependencies, they lack mechanisms for quantitative analysis of dynamic trade-offs. Conversely, classical game theory offers mathematical rigor but strips away contextual richness. This technical report bridges this gap by developing computational foundations that formalize two critical dimensions of coopetition: interdependence and complementarity. We ground interdependence in i* structural dependency analysis, translating depender-dependee-dependum relationships into quantitative interdependence coefficients through a structured translation framework. We formalize complementarity following Brandenburger and Nalebuff’s Added Value concept, modeling synergistic value creation with validated parameterization. We integrate structural dependencies with bargaining power in value appropriation and introduce a game-theoretic formulation where Nash Equilibrium incorporates structural interdependence. Validation combines comprehensive experimental testing across power and logarithmic value function specifications, demonstrating functional form robustness, with empirical application to the Samsung-Sony S-LCD joint venture (2004-2011), where logarithmic specifications achieve superior empirical fit (validation score 45/60) while power functions provide theoretical tractability. This technical report serves as the foundational reference for a coordinated research program examining strategic coopetition in requirements engineering and multi-agent systems, with companion work addressing trust dynamics, team production, and reciprocity mechanisms.
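
Brandenburger and Nalebuff’s Added Value has a compact definition: a player’s added value is the total value created with everyone present minus the total value without that player. The sketch below computes it for a toy two-player coalition with a complementarity bonus; the report’s parameterized power/logarithmic value functions would replace the stand-in `value_fn`.

```python
def added_value(players, value_fn):
    """Added Value: total value with everyone minus total value without
    each player in turn (Brandenburger & Nalebuff)."""
    total = value_fn(players)
    return {p: total - value_fn([q for q in players if q != p])
            for p in players}

synergy = {("samsung", "sony"): 3.0}   # toy complementarity bonus

def value_fn(coalition):
    base = 2.0 * len(coalition)        # stand-alone value per player
    bonus = sum(v for pair, v in synergy.items()
                if all(p in coalition for p in pair))
    return base + bonus

print(added_value(["samsung", "sony"], value_fn))  # {'samsung': 5.0, 'sony': 5.0}
```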

[565] Safe Voting: Resilience to Abstention and Sybils

Reshef Meir, Gal Shahaf, Ehud Shapiro, Nimrod Talmon

Main category: cs.MA

TL;DR: The paper analyzes voting rules that maintain social choice integrity when facing sybil (fake/duplicate) votes and voter abstention by enforcing the status quo as a default option.

DetailsMotivation: To address the failure of traditional voting rules when sybil votes are present and honest voters don’t participate, ensuring social choice integrity in imperfect voting environments.

Method: Uses Reality-aware Social Choice framework with status quo Enforcing (QUE) voting rules that add virtual votes supporting the status quo, analyzing tradeoffs between safety (maintaining status quo) and liveness (changing status quo).

Result: Characterizes optimal tradeoff between safety and liveness in various domains, identifying exact conditions for mechanisms to remain resilient to sybils while responsive to verified participation.

Conclusion: Provides quantitative tools for voting system designers to measure benefits of increased participation and verified identities, offering optimal voting rules for sybil-resilient social choice.

Abstract: Voting rules may implement the will of the society when all eligible voters vote, and only them. However, they may fail to do so when sybil (fake or duplicate) votes are present and when only some honest (non-sybil) voters actively participate. As, unfortunately, sometimes this is the case, our aim here is to address social choice in the presence of sybils and voter abstention. To do so, we build upon the framework of Reality-aware Social Choice: we assume the status quo as an ever-present distinguished alternative, and study status quo Enforcing (QUE) voting rules, which add virtual votes in support of the status quo. We characterize the tradeoff between safety and liveness (the ability of active honest voters to maintain/change the status quo, respectively) in several domains, and show that the voting rules are often optimal. Our characterization identifies the exact conditions under which mechanisms remain both resilient to sybils and responsive to verified participation, offering a quantitative tool for designers to measure the benefit of increased participation and verified identities.
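
A schematic QUE rule is sketched below: each absent eligible voter contributes sigma virtual votes for the status quo, so low turnout defaults to no change. The single parameter sigma and the simple majority comparison are illustrative simplifications of the paper’s rules.

```python
def que_outcome(votes_for_change, votes_for_status_quo, active_voters,
                eligible_voters, sigma=0.5):
    """Status-quo-Enforcing (QUE) majority, schematically: add sigma virtual
    status-quo votes per absent eligible voter, so abstention (or suspected
    sybil inflation) biases the outcome toward no change."""
    absent = eligible_voters - active_voters
    effective_status_quo = votes_for_status_quo + sigma * absent
    return "change" if votes_for_change > effective_status_quo else "status quo"

# 40 of 100 eligible voters abstain: sigma * 40 = 20 virtual votes for the
# status quo, so a 40-20 raw majority for change still loses.
print(que_outcome(40, 20, active_voters=60, eligible_voters=100))
```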

[566] Static Sandboxes Are Inadequate: Modeling Societal Complexity Requires Open-Ended Co-Evolution in LLM-Based Multi-Agent Simulations

Jinkun Chen, Sher Badshah, Xuemin Yu, Sijia Han

Main category: cs.MA

TL;DR: The paper argues that current multi-agent simulations using LLMs are too static and constrained, and proposes a shift toward open-ended, evolving systems that better model real-world complexity.

DetailsMotivation: Current LLM-powered multi-agent simulations are limited by predefined tasks, static environments, and rigid evaluation, preventing them from capturing real-world societal complexity.

Method: Critical review of emerging LLM-multi-agent architectures, analysis of key challenges (stability vs diversity, unexpected behavior evaluation, scaling), and development of a new taxonomy for the field.

Result: A research roadmap focused on open-endedness, continuous co-evolution, and resilient socially-aligned AI ecosystems is presented.

Conclusion: The community should move beyond static paradigms and develop adaptive, socially-aware multi-agent simulations that can evolve unpredictably.

Abstract: What if artificial agents could not just communicate, but also evolve, adapt, and reshape their worlds in ways we cannot fully predict? With LLMs now powering multi-agent systems and social simulations, we are witnessing new possibilities for modeling open-ended, ever-changing environments. Yet, most current simulations remain constrained within static sandboxes, characterized by predefined tasks, limited dynamics, and rigid evaluation criteria. These limitations prevent them from capturing the complexity of real-world societies. In this paper, we argue that static, task-specific benchmarks are fundamentally inadequate and must be rethought. We critically review emerging architectures that blend LLMs with multi-agent dynamics, highlight key hurdles such as balancing stability and diversity, evaluating unexpected behaviors, and scaling to greater complexity, and introduce a fresh taxonomy for this rapidly evolving field. Finally, we present a research roadmap centered on open-endedness, continuous co-evolution, and the development of resilient, socially aligned AI ecosystems. We call on the community to move beyond static paradigms and help shape the next generation of adaptive, socially-aware multi-agent simulations.

[567] Stop Reducing Responsibility in LLM-Powered Multi-Agent Systems to Local Alignment

Jinwei Hu, Yi Dong, Shuang Ao, Zhuoyun Li, Boxuan Wang, Lokesh Singh, Guangliang Cheng, Sarvapali D. Ramchurn, Xiaowei Huang

Main category: cs.MA

TL;DR: LLM-powered Multi-Agent Systems need global systemic agreement and lifecycle-wide responsibility management instead of local agent alignment, requiring interdisciplinary governance with human-AI oversight.

DetailsMotivation: LLM-MAS introduce risks like unguaranteed agreement, cascading uncertainty, and adversarial vulnerabilities that require a paradigm shift from local to global responsibility.

Method: Propose a dual-perspective governance framework combining interdisciplinary design with human-AI collaborative oversight, treating LLM-MAS as unified socio-technical systems.

Result: Conceptualizes responsibility as lifecycle-wide property encompassing agreement, uncertainty, and security with complementary integration of subjective human values and objective verifiability.

Conclusion: LLM-MAS should be viewed as unified dynamic socio-technical systems requiring principled mechanisms for ethically aligned, verifiably coherent, and resilient behavior to achieve sustained system-wide agreement.

Abstract: LLM-powered Multi-Agent Systems (LLM-MAS) unlock new potentials in distributed reasoning, collaboration, and task generalization but also introduce additional risks due to unguaranteed agreement, cascading uncertainty, and adversarial vulnerabilities. We argue that ensuring responsible behavior in such systems requires a paradigm shift: from local, superficial agent-level alignment to global, systemic agreement. We conceptualize responsibility not as a static constraint but as a lifecycle-wide property encompassing agreement, uncertainty, and security, each requiring the complementary integration of subjective human-centered values and objective verifiability. Furthermore, a dual-perspective governance framework that combines interdisciplinary design with human-AI collaborative oversight is essential for tracing and ensuring responsibility throughout the lifecycle of LLM-MAS. Our position views LLM-MAS not as loose collections of agents, but as unified, dynamic socio-technical systems that demand principled mechanisms to support each dimension of responsibility and enable ethically aligned, verifiably coherent, and resilient behavior for sustained, system-wide agreement.

[568] Disaster Management in the Era of Agentic AI Systems: A Vision for Collective Human-Machine Intelligence for Augmented Resilience

Bo Li, Junwei Ma, Kai Yin, Yiming Xiao, Chia-Wei Hsu, Ali Mostafavi

Main category: cs.MA

TL;DR: Disaster Copilot is a multi-agent AI system that coordinates specialized AI tools to overcome challenges in disaster management, providing real-time operational insights and serving as an AI backbone for Disaster Digital Twins.

DetailsMotivation: Traditional disaster response is overwhelmed by fragmented data, siloed technologies, resource constraints, and loss of institutional memory, which hinder effective decision-making.

Method: A central orchestrator coordinates diverse sub-agents specializing in predictive risk analytics, situational awareness, and impact assessment, integrating multi-modal data for real-time operational pictures with on-device orchestration for resource-limited environments.

Result: The system delivers holistic, real-time operational awareness and transforms Disaster Digital Twins from passive models to active, intelligent environments while capturing institutional knowledge to mitigate staff turnover impacts.

Conclusion: Disaster Copilot offers a transformative vision for building more adaptive, data-driven, and resilient communities through collective human-machine intelligence, with a three-phased roadmap for technology, organizational capacity, and human-AI teaming development.

Abstract: The escalating frequency and severity of disasters routinely overwhelm traditional response capabilities, exposing critical vulnerabilities in disaster management. Current practices are hindered by fragmented data streams, siloed technologies, resource constraints, and the erosion of institutional memory, which collectively impede timely and effective decision making. This study introduces Disaster Copilot, a vision for a multi-agent artificial intelligence system designed to overcome these systemic challenges by unifying specialized AI tools within a collaborative framework. The proposed architecture utilizes a central orchestrator to coordinate diverse sub-agents, each specializing in critical domains such as predictive risk analytics, situational awareness, and impact assessment. By integrating multi-modal data, the system delivers a holistic, real-time operational picture and serves as the essential AI backbone required to advance Disaster Digital Twins from passive models to active, intelligent environments. Furthermore, it ensures functionality in resource-limited environments through on-device orchestration and incorporates mechanisms to capture institutional knowledge, mitigating the impact of staff turnover. We detail the system architecture and propose a three-phased roadmap emphasizing the parallel growth of technology, organizational capacity, and human-AI teaming. Disaster Copilot offers a transformative vision, fostering collective human-machine intelligence to build more adaptive, data-driven, and resilient communities.

cs.MM

[569] EVER: Edge-Assisted Auto-Verification for Mobile MR-Aided Operation

Jiangong Chen, Mingyu Zhu, Bin Li

Main category: cs.MM

TL;DR: EVER is an edge-assisted auto-verification system for Mixed Reality operations that uses segmentation models and IoU metrics to verify user compliance with MR guidance, achieving over 90% accuracy within 100ms.

DetailsMotivation: Existing MR operation verification approaches fail to account for discrepancies between physical and virtual objects due to imperfect 3D modeling or lighting estimation, requiring a more robust verification method.

Method: EVER uses segmentation models and rendering pipeline adapted to physical/virtual object attributes, employs IoU-based threshold strategy for verification, and offloads compute-intensive tasks to edge servers for efficiency.

Result: EVER achieves over 90% verification accuracy within 100 milliseconds (faster than human reaction time of 273ms) while consuming minimal additional computational resources and energy.

Conclusion: The proposed EVER system provides fast, accurate, and energy-efficient auto-verification for MR-aided operations, significantly outperforming existing approaches and human reaction times.

Abstract: Mixed Reality (MR)-aided operation overlays digital objects on the physical world to provide a more immersive and intuitive operation process. A primary challenge is the precise and fast auto-verification of whether the user follows MR guidance by comparing frames before and after each operation. The pre-operation frame includes virtual guiding objects, while the post-operation frame contains physical counterparts. Existing approaches fall short of accounting for the discrepancies between physical and virtual objects due to imperfect 3D modeling or lighting estimation. In this paper, we propose EVER: an edge-assisted auto-verification system for mobile MR-aided operations. Unlike traditional frame-based similarity comparisons, EVER leverages the segmentation model and rendering pipeline adapted to the unique attributes of frames with physical pieces and those with their virtual counterparts; it adopts a threshold-based strategy using Intersection over Union (IoU) metrics for accurate auto-verification. To ensure fast auto-verification and low energy consumption, EVER offloads compute-intensive tasks to an edge server. Through comprehensive evaluations of public datasets and custom datasets with practical implementation, EVER achieves over 90% verification accuracy within 100 milliseconds (significantly faster than average human reaction time of approximately 273 milliseconds), while consuming only minimal additional computational resources and energy compared to a system without auto-verification.
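
The verification rule itself is compact. Here is a minimal sketch of the IoU thresholding step, assuming binary masks from the segmentation stage; the 0.8 threshold is an illustrative placeholder, not EVER's calibrated value.

```python
import numpy as np

def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over Union of two boolean segmentation masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def operation_verified(virtual_mask, physical_mask, threshold=0.8):
    """Declare the MR-guided step complete when the physical object's
    mask overlaps the virtual guide's mask sufficiently. The threshold
    is an assumed placeholder for EVER's tuned value."""
    return iou(virtual_mask, physical_mask) >= threshold
```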

[570] How2Compress: Scalable and Efficient Edge Video Analytics via Adaptive Granular Video Compression

Yuheng Wu, Thanh-Tung Nguyen, Lucas Liebe, Quang Tau, Pablo Espinosa Campos, Jinghan Cheng, Dongman Lee

Main category: cs.MM

TL;DR: How2Compress is a plug-and-play framework that enhances video compression efficiency through fine-grained quality control at the macroblock level, achieving up to 50.4% bitrate savings without compromising analytical accuracy.

DetailsMotivation: Existing learning-based adaptive quantization frameworks for video compression fail to fully exploit the fine-grained quality control capabilities of modern block-based video codecs, leaving significant compression efficiency untapped.

Method: How2Compress is a simple plug-and-play module that provides precise, fine-grained quality control at the macroblock level. It can be seamlessly integrated into existing edge video analytics pipelines and was implemented on the H.264 codec.

Result: Experimental results show How2Compress achieves up to 50.4% bitrate savings and outperforms baselines by up to 3.01× without compromising accuracy across diverse real-world scenarios.

Conclusion: How2Compress demonstrates practical effectiveness and efficiency in enhancing video compression for IoT and edge video analytics applications under bandwidth constraints.

Abstract: With the rapid proliferation of the Internet of Things, video analytics has become a cornerstone application in wireless multimedia sensor networks. To support such applications under bandwidth constraints, learning-based adaptive quantization for video compression has demonstrated strong potential in reducing bitrate while maintaining analytical accuracy. However, existing frameworks often fail to fully exploit the fine-grained quality control enabled by modern block-based video codecs, leaving significant compression efficiency untapped. In this paper, we present How2Compress, a simple yet effective framework designed to enhance video compression efficiency through precise, fine-grained quality control at the macroblock level. How2Compress is a plug-and-play module and can be seamlessly integrated into any existing edge video analytics pipeline. We implement How2Compress on the H.264 codec and evaluate its performance across diverse real-world scenarios. Experimental results show that How2Compress achieves up to 50.4% bitrate savings and outperforms baselines by up to 3.01× without compromising accuracy, demonstrating its practical effectiveness and efficiency. Code is available at https://github.com/wyhallenwu/how2compress and a reproducible docker image at https://hub.docker.com/r/wuyuheng/how2compress.
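
To make "fine-grained quality control at the macroblock level" concrete, here is a hedged sketch that maps per-macroblock importance scores to H.264 quantization parameters (QP). The linear mapping, the QP range, and the notion of importance scores as the module's output are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def qp_map_from_importance(importance: np.ndarray,
                           qp_min: int = 20, qp_max: int = 40) -> np.ndarray:
    """Map per-macroblock importance in [0, 1] to H.264 QP values:
    salient blocks get low QP (high quality), background blocks get
    high QP (strong compression)."""
    qp = qp_max - importance * (qp_max - qp_min)
    return np.rint(qp).astype(int)

# A 2x3 grid of macroblocks: one salient region, the rest background.
importance = np.array([[0.0, 0.9, 0.1],
                       [0.0, 0.8, 0.0]])
print(qp_map_from_importance(importance))
# [[40 22 38]
#  [40 24 40]]
```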

[571] DeLoad: Demand-Driven Short-Video Preloading with Scalable Watch-Time Estimation

Tong Liu, Zhiwei Fan, Guanyan Peng, Haodan Zhang, Yucheng Zhang, Zhen Wang, Pengjin Xie, Liang Liu

Main category: cs.MM

TL;DR: DeLoad is a novel preloading framework for short video streaming that introduces dynamic task sizing and practical watch time estimation, enhanced by DRL to optimize download decisions, achieving significant QoE improvements and bandwidth savings.

DetailsMotivation: Existing preloading approaches have limitations: insufficient adaptation of download task sizes to dynamic conditions and unreliable watch time prediction models at scale, which hinder effective QoE and bandwidth efficiency in short video streaming.

Method: Proposed DeLoad framework with dynamic task sizing, multi-dimensional watch time estimation, and a DRL-enhanced agent to adaptively optimize download range decisions.

Result: Extensive offline evaluations show 34.4% to 87.4% QoE improvement. After commercial deployment, it increased user watch time by 0.09%, reduced rebuffering events, and cut bandwidth consumption by 3.76%.

Conclusion: DeLoad effectively addresses limitations of existing approaches and demonstrates significant improvements in both QoE metrics and bandwidth efficiency in real-world commercial short video platforms.

Abstract: Short video streaming has become a dominant paradigm in digital media, characterized by rapid swiping interactions and diverse media content. A key technical challenge is designing an effective preloading strategy that dynamically selects and prioritizes download tasks from an evolving playlist, balancing Quality of Experience (QoE) and bandwidth efficiency under practical commercial constraints. However, real-world analysis reveals critical limitations of existing approaches: (1) insufficient adaptation of download task sizes to dynamic conditions, and (2) watch time prediction models that are difficult to deploy reliably at scale. In this paper, we propose DeLoad, a novel preloading framework that addresses these issues by introducing dynamic task sizing and a practical, multi-dimensional watch time estimation method. Additionally, a Deep Reinforcement Learning (DRL)-enhanced agent is trained to optimize the download range decisions adaptively. Extensive evaluations conducted on an offline testing platform, leveraging massive real-world network data, demonstrate that DeLoad achieves significant improvements in QoE metrics (34.4% to 87.4% gain). Furthermore, after deployment on a large-scale commercial short video platform, DeLoad has increased overall user watch time by 0.09% while simultaneously reducing rebuffering events and cutting bandwidth consumption by 3.76%.

[572] PIRA: Pan-CDN Intra-video Resource Adaptation for Short Video Streaming

Chunyu Qiao, Tong Liu, Yucheng Zhang, Zhiwei Fan, Pengjin Xie, Zhen Wang, Liang Liu

Main category: cs.MM

TL;DR: PIRA is a dynamic CDN resource selection algorithm that optimizes Quality of Experience (QoE) and cost in real-time for short video streaming, achieving significant improvements in startup delay, rebuffering time, and traffic costs.

DetailsMotivation: CDN resource selection is critical for maintaining QoE while controlling traffic costs in large-scale short video platforms. Current approaches show that higher QoE CDNs come at greater financial cost, and connection quality fluctuates even within single videos, creating a dynamic trade-off between QoE and cost that remains insufficiently investigated.

Method: PIRA integrates QoE and cost through a mathematical model and introduces an intra-video control theoretic CDN resource selection approach. It employs state space pruning and adaptive parameter adjustment to efficiently solve the high-dimensional optimization problem and reduce computation overheads.

Result: In large-scale production experiments with 450,000 users over two weeks, PIRA outperformed the production baseline by achieving 2.1% reduction in startup delay, 15.2% shorter rebuffering time, and 10% lower average unit traffic cost.

Conclusion: PIRA effectively balances user experience and financial cost at scale, demonstrating its practical effectiveness in dynamic CDN resource selection for short video streaming platforms.

Abstract: In large-scale short video platforms, CDN resource selection plays a critical role in maintaining Quality of Experience (QoE) while controlling escalating traffic costs. To better understand this phenomenon, we conduct in-the-wild network measurements during video playback in a production short video system. The results reveal that CDNs delivering higher average QoE often come at greater financial cost, yet their connection quality fluctuates even within a single video, underscoring a fundamental and dynamic trade-off between QoE and cost. However, the problem of sustaining high QoE under cost constraints remains insufficiently investigated in the context of CDN selection for short video streaming. To address this, we propose PIRA, a dynamic resource selection algorithm that optimizes QoE and cost in real time during video playback. PIRA formally integrates QoE and cost in a mathematical model and introduces an intra-video control-theoretic CDN resource selection approach that can balance QoE and cost under network dynamics. To reduce the computation overhead, PIRA employs state space pruning and adaptive parameter adjustment to efficiently solve the high-dimensional optimization problem. In large-scale production experiments involving 450,000 users over two weeks, PIRA outperforms the production baseline, achieving a 2.1% reduction in startup delay, 15.2% shorter rebuffering time, and 10% lower average unit traffic cost, demonstrating its effectiveness in balancing user experience and financial cost at scale.
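
As a rough intuition for the QoE-cost integration (not PIRA's actual control law), a single scalarized score per CDN already exposes the trade-off that the controller adapts during playback:

```python
def select_cdn(cdns, cost_weight):
    """Pick the CDN maximizing predicted QoE minus weighted unit cost.
    `cdns` maps name -> (predicted_qoe, unit_cost); `cost_weight` plays
    the role of the trade-off knob that PIRA's control-theoretic
    formulation adjusts under network dynamics."""
    return max(cdns, key=lambda c: cdns[c][0] - cost_weight * cdns[c][1])

cdns = {"premium": (4.5, 1.0), "budget": (3.8, 0.4)}
print(select_cdn(cdns, cost_weight=0.5))  # "premium" (4.0 vs 3.6)
print(select_cdn(cdns, cost_weight=2.0))  # "budget"  (2.5 vs 3.0)
```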

[573] M3ST-DTI: A multi-task learning model for drug-target interactions based on multi-modal features and multi-stage alignment

Xiangyu Li, Ran Su, Liangliang Liu

Main category: cs.MM

TL;DR: M3ST-DTI is a multi-task learning model for drug-target interaction prediction that integrates textual, structural, and functional features using multi-stage alignment and fusion mechanisms to improve prediction accuracy.

DetailsMotivation: Existing DTI prediction approaches fail to capture deep intra-modal feature interactions and effective cross-modal alignment, limiting predictive performance and generalization.

Method: Uses multi-task learning with three feature types (textual, structural, functional), self-attention mechanisms, hybrid pooling graph attention, MCA with Gram loss for early alignment, BCA for fine-grained interactions, and deep orthogonal fusion to reduce redundancy.

Result: Extensive evaluations show M3ST-DTI consistently outperforms state-of-the-art methods across diverse metrics on benchmark datasets.

Conclusion: The proposed multi-stage integration and alignment approach effectively addresses limitations in existing DTI prediction methods, achieving superior performance through comprehensive feature fusion and interaction modeling.

Abstract: Accurate prediction of drug-target interactions (DTI) is pivotal in drug discovery. However, existing approaches often fail to capture deep intra-modal feature interactions or achieve effective cross-modal alignment, limiting predictive performance and generalization. To address these challenges, we propose M3ST-DTI, a multi-task learning model that enables multi-stage integration and alignment of multi-modal features for DTI prediction. M3ST-DTI incorporates three types of features (textual, structural, and functional) and enhances intra-modal representations using self-attention mechanisms and a hybrid pooling graph attention module. For early-stage feature alignment and fusion, the model integrates MCA with Gram loss as a structural constraint. In the later stage, a BCA module captures fine-grained interactions between drugs and targets within each modality, while a deep orthogonal fusion module mitigates feature redundancy. Extensive evaluations on benchmark datasets demonstrate that M3ST-DTI consistently outperforms state-of-the-art methods across diverse metrics.

eess.AS

[574] Hearing Health in Home Healthcare: Leveraging LLMs for Illness Scoring and ALMs for Vocal Biomarker Extraction

Yu-Wen Chen, William Ho, Sasha M. Vergez, Grace Flaherty, Pallavi Gupta, Zhihong Zhang, Maryam Zolnoori, Margaret V. McDonald, Maxim Topaz, Zoran Kostic, Julia Hirschberg

Main category: eess.AS

TL;DR: This paper explores using LLMs and ALMs for automatic health assessment from voice data in home care settings, integrating SOAP notes and vital signs to create illness scores and analyzing vocal biomarkers.

DetailsMotivation: The growing demand for home healthcare requires tools to support care delivery, particularly automatic health assessment from voice using real-world home care visit data.

Method: Uses LLMs to integrate SOAP notes from audio transcripts and vital signs into illness scores, and employs ALMs with multi-stage preprocessing to extract speech segments and describe vocal biomarkers.

Result: LLMs effectively estimate illness scores aligned with clinical outcomes, with SOAP notes being more informative than vital signs. ALMs successfully identify health-related acoustic patterns from home care recordings.

Conclusion: LLMs and ALMs have significant potential to leverage heterogeneous in-home visit data for improved patient monitoring and care delivery.

Abstract: The growing demand for home healthcare calls for tools that can support care delivery. In this study, we explore automatic health assessment from voice using real-world home care visit data, leveraging the diverse patient information it contains. First, we utilize Large Language Models (LLMs) to integrate Subjective, Objective, Assessment, and Plan (SOAP) notes derived from unstructured audio transcripts and structured vital signs into a holistic illness score that reflects a patient’s overall health. This compact representation facilitates cross-visit health status comparisons and downstream analysis. Next, we design a multi-stage preprocessing pipeline to extract short speech segments from target speakers in home care recordings for acoustic analysis. We then employ an Audio Language Model (ALM) to produce plain-language descriptions of vocal biomarkers and examine their association with individuals’ health status. Our experimental results benchmark both commercial and open-source LLMs in estimating illness scores, demonstrating their alignment with actual clinical outcomes, and revealing that SOAP notes are substantially more informative than vital signs. Building on the illness scores, we provide the first evidence that ALMs can identify health-related acoustic patterns from home care recordings and present them in a human-readable form. Together, these findings highlight the potential of LLMs and ALMs to harness heterogeneous in-home visit data for better patient monitoring and care.

[575] Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-task Multi-Scale Network

Zhanhong He, Hanyu Meng, David Huang, Roberto Togneri

Main category: eess.AS

TL;DR: Proposes an efficient multi-task network for piano dynamic estimation that jointly predicts dynamic levels, change points, beats, and downbeats using a multi-scale network with Bark-scale specific loudness input, achieving state-of-the-art results with significantly reduced model size.

DetailsMotivation: Estimating piano dynamics from audio recordings is a fundamental challenge in computational music analysis, and existing methods often use large models with limited input lengths.

Method: Uses a multi-task network that jointly predicts four targets (dynamic levels, change points, beats, downbeats) from shared latent representation. Employs multi-scale network backbone with Bark-scale specific loudness as input feature instead of log-Mel, enabling 60-second audio input (double typical beat tracking length).

Result: Achieves state-of-the-art results on MazurkaBL dataset across all tasks. Reduces model size from 14.7M to 0.5M parameters while maintaining performance.

Conclusion: Sets new benchmark for piano dynamic estimation and provides a powerful, compact tool for large-scale, resource-efficient analysis of musical expression.

Abstract: Estimating piano dynamics from audio recordings is a fundamental challenge in computational music analysis. In this paper, we propose an efficient multi-task network that jointly predicts dynamic levels, change points, beats, and downbeats from a shared latent representation. These four targets form the metrical structure of dynamics in the music score. Inspired by recent vocal dynamics research, we use a multi-scale network as the backbone, which takes Bark-scale specific loudness as the input feature. Compared to log-Mel input, this reduces the model size from 14.7M to 0.5M parameters, enabling long sequential input. We use 60-second audio segments, double the input length commonly used in beat tracking. Evaluated on the public MazurkaBL dataset, our model achieves state-of-the-art results across all tasks. This work sets a new benchmark for piano dynamics estimation and delivers a powerful and compact tool, paving the way for large-scale, resource-efficient analysis of musical expression.

[576] Adaptive Per-Channel Energy Normalization Front-end for Robust Audio Signal Processing

Hanyu Meng, Vidhyasaharan Sethu, Eliathamby Ambikairajah, Qiquan Zhang, Haizhou Li

Main category: eess.AS

TL;DR: A novel adaptive audio front-end with neural controller that dynamically tunes parameters during inference, outperforming fixed front-ends in various acoustic conditions.

DetailsMotivation: Current learnable audio front-ends have fixed parameters after training, lacking flexibility during inference and limiting robustness in dynamic acoustic environments.

Method: Simplified LEAF architecture integrated with a neural controller that dynamically tunes Per-Channel Energy Normalization using current and buffered past subband energies for input-dependent adaptation.

Result: Consistently outperforms prior fixed and learnable front-ends on multiple audio classification tasks under both clean and complex acoustic conditions.

Conclusion: Neural adaptability represents a promising direction for next-generation audio front-ends, enabling dynamic parameter adjustment during inference.

Abstract: In audio signal processing, learnable front-ends have shown strong performance across diverse tasks by optimizing task-specific representation. However, their parameters remain fixed once trained, lacking flexibility during inference and limiting robustness under dynamic complex acoustic environments. In this paper, we introduce a novel adaptive paradigm for audio front-ends that replaces static parameterization with a closed-loop neural controller. Specifically, we simplify the learnable front-end LEAF architecture and integrate a neural controller for adaptive representation via dynamically tuning Per-Channel Energy Normalization. The neural controller leverages both the current and the buffered past subband energies to enable input-dependent adaptation during inference. Experimental results on multiple audio classification tasks demonstrate that the proposed adaptive front-end consistently outperforms prior fixed and learnable front-ends under both clean and complex acoustic conditions. These results highlight neural adaptability as a promising direction for the next generation of audio front-ends.
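
For reference, the Per-Channel Energy Normalization being tuned here follows the standard formulation (Wang et al., 2017). Below is a minimal NumPy version with fixed parameters; the paper's contribution is precisely to let a neural controller adapt s, alpha, delta, and r per input at inference time.

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-Channel Energy Normalization over a (time, channels) array
    of subband energies E: an AGC-like division by a smoothed energy
    M, followed by root compression."""
    M = np.zeros_like(E)
    M[0] = E[0]
    for t in range(1, len(E)):              # first-order IIR smoother
        M[t] = (1.0 - s) * M[t - 1] + s * E[t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```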

[577] MVDR Beamforming for Cyclostationary Processes

Giovanni Bologni, Martin Bo Møller, Richard Heusdens, Richard C. Hendriks

Main category: eess.AS

TL;DR: The paper introduces cMVDR, a cyclic MVDR beamformer that extends conventional MVDR by exploiting spectral correlations in cyclostationary noise sources like musical instruments and engines, achieving up to 5 dB SI-SDR improvement.

DetailsMotivation: Conventional beamformers assume stationary noise, which prevents them from exploiting frequency correlations in cyclostationary noise sources that have periodically varying statistics.

Method: Extends MVDR beamformer using frequency-shifted (FRESH) filtering to combine shifted input versions, with a data-driven strategy to estimate resonant frequencies via periodogram analysis for handling inharmonicity.

Result: cMVDR achieves up to 5 dB gain in scale-invariant signal-to-distortion ratio (SI-SDR) over conventional MVDR, with performance improving with increasing spectral correlation, and remains effective even with a single microphone.

Conclusion: The cyclic MVDR beamformer successfully leverages both spatial and spectral correlations in cyclostationary noise, providing significant noise reduction improvements particularly in low-SNR scenarios.

Abstract: Conventional acoustic beamformers assume that noise is stationary within short time frames. This assumption prevents them from exploiting correlations between frequencies in almost-periodic noise sources such as musical instruments, fans, and engines. These signals exhibit periodically varying statistics and are better modeled as cyclostationary processes. This paper introduces the cyclic MVDR (cMVDR) beamformer, an extension of the conventional MVDR that leverages both spatial and spectral correlations to improve noise reduction, particularly in low-SNR scenarios. The method builds on frequency-shifted (FRESH) filtering, where shifted versions of the input are combined to attenuate or amplify components that are coherent across frequency. To address inharmonicity, where harmonic partials deviate from exact integer multiples of the fundamental frequency, we propose a data-driven strategy that estimates resonant frequencies via periodogram analysis and computes the frequency shifts from their spacing. Analytical and experimental results demonstrate that performance improves with increasing spectral correlation. On real recordings, the cMVDR achieves up to 5 dB gain in scale-invariant signal-to-distortion ratio (SI-SDR) over the MVDR and remains effective even with a single microphone. Code is available at https://github.com/Screeen/cMVDR.
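
The conventional MVDR weights underlying the method have the familiar closed form; a minimal sketch follows (cMVDR's FRESH stage, which stacks frequency-shifted copies of the inputs before this step, is omitted).

```python
import numpy as np

def mvdr_weights(R: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Classic MVDR solution w = R^{-1} d / (d^H R^{-1} d), where R is
    the noise covariance matrix and d the steering vector. In cMVDR,
    the input vector additionally contains frequency-shifted (FRESH)
    copies of each channel, so R also captures spectral correlation."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)
```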

[578] ProLAP: Probabilistic Language-Audio Pre-Training

Toranosuke Manabe, Yuchi Ishikawa, Hokuto Munakata, Tatsuya Komatsu

Main category: eess.AS

TL;DR: ProLAP introduces probabilistic modeling for language-audio joint representation learning to handle many-to-many relationships, using hierarchical inclusion loss and mask repulsive loss to capture semantic hierarchies even with small datasets.

DetailsMotivation: Traditional language-audio frameworks use deterministic embeddings assuming one-to-one correspondence, but real-world relationships are many-to-many (one audio can have multiple captions and vice versa).

Method: Proposes Probabilistic Language-Audio Pre-training (ProLAP) that models multiplicity as probability distributions in joint embedding space, with hierarchical inclusion loss for semantic hierarchy and mask repulsive loss for training efficiency.

Result: ProLAP outperforms existing deterministic approaches on audio-text retrieval tasks and demonstrates plausible semantic hierarchy capture in audio traversal tasks.

Conclusion: ProLAP effectively handles many-to-many language-audio relationships through probabilistic modeling and hierarchical learning objectives, achieving superior performance even with limited data.

Abstract: Language-audio joint representation learning frameworks typically depend on deterministic embeddings, assuming a one-to-one correspondence between audio and text. In real-world settings, however, the language-audio relationship is inherently many-to-many: one audio segment can be described by multiple captions and vice versa. To address this, we propose Probabilistic Language-Audio Pre-training (ProLAP), which models multiplicity as the spread of probability distributions in a joint language-audio embedding space. To train the intra-modal hierarchical relationship effectively, we also introduce two objectives: (i) hierarchical inclusion loss to promote semantic hierarchical understanding of inputs and (ii) mask repulsive loss to improve the efficiency of learning when optimizing the hierarchical inclusion loss. With this training strategy, our model can learn the hierarchical structure inherent in the data even from small datasets, in contrast to prior probabilistic approaches that rely on large-scale datasets. In our experiments, ProLAP outperforms existing deterministic approaches on audio-text retrieval tasks. Moreover, through experiments on the audio traversal task introduced in this paper, we demonstrate that ProLAP captures the plausible semantic hierarchy.
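
A minimal sketch of the probabilistic-embedding idea, assuming simple linear mean/log-variance heads and reparameterized sampling; ProLAP's actual heads and its hierarchical inclusion and mask repulsive losses are not reproduced here.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Map an encoder feature to a diagonal Gaussian N(mu, sigma^2);
    the spread can express one-to-many language-audio relationships."""

    def __init__(self, dim_in, dim_emb):
        super().__init__()
        self.mean = nn.Linear(dim_in, dim_emb)
        self.logvar = nn.Linear(dim_in, dim_emb)

    def forward(self, x, n_samples=8):
        mu, logvar = self.mean(x), self.logvar(x)
        std = (0.5 * logvar).exp()
        # Reparameterized samples for distribution-level matching.
        samples = mu + std * torch.randn(n_samples, *mu.shape)
        return mu, logvar, samples
```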

[579] SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization

Wenxi Chen, Xinsheng Wang, Ruiqi Yan, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiquan Li, Yuzhe Liang, Hanlin Wen, Shunshun Yin, Ming Tao, Xie Chen

Main category: eess.AS

TL;DR: SAC is a neural speech codec with semantic-acoustic dual-stream quantization that achieves superior reconstruction quality and semantic representation by disentangling semantic and acoustic modeling into two optimized streams.

DetailsMotivation: Existing speech codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks for speech language models.

Method: Proposes SAC with semantic-acoustic dual-stream quantization, disentangling semantic and acoustic modeling into two dedicated streams that can be optimized for their respective roles.

Result: SAC achieves strong reconstruction performance across diverse bitrates under clean and noisy conditions, with particularly strong UTMOS and WER results. It substantially outperforms state-of-the-art codecs in semantic representation, reaching a level comparable to SSL continuous embeddings.

Conclusion: The dual-stream design effectively disentangles speech components, offering new potential for controllable speech applications while providing superior perceptual quality, intelligibility, and semantic representation.

Abstract: Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models (SLMs). However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly strong results on UTMOS and WER, demonstrating superior perceptual quality and intelligibility. Moreover, SAC substantially outperforms state-of-the-art codecs in semantic representation, achieving a level comparable to that of self-supervised learning (SSL) continuous embeddings. Finally, our analysis of speech disentanglement highlights the effectiveness of the dual-stream design, offering new potential for controllable speech applications.

[580] Diffusion Buffer for Online Generative Speech Enhancement

Bunlong Lay, Rostislav Makarov, Simon Welker, Maris Hillemann, Timo Gerkmann

Main category: eess.AS

TL;DR: The paper introduces Diffusion Buffer, a generative diffusion-based speech enhancement model that enables online processing with only one neural network call per frame, significantly reducing latency while maintaining quality.

DetailsMotivation: Traditional generative speech enhancement models require multiple neural network calls per frame, making them computationally expensive for online applications. Predictive models work online but generative models generally perform better on unseen data.

Method: Proposes Diffusion Buffer that aligns physical time with diffusion time-steps, progressively denoising frames through time. Uses a carefully designed 2D convolutional UNet architecture and Data Prediction loss instead of Denoising Score Matching for flexible latency-quality trade-off.

Result: Drastically reduces algorithmic latency from 320-960 ms to 32-176 ms while increasing performance. The online Diffusion Buffer outperforms predictive approaches on unseen noisy speech data.

Conclusion: The Diffusion Buffer successfully enables generative diffusion models to work efficiently in online speech enhancement scenarios, achieving both low latency and superior performance on unseen data compared to predictive models.

Abstract: Online speech enhancement has mainly been reserved for predictive models. A key advantage of these models is that for an incoming signal frame from a stream of data, the model is called only once for enhancement. In contrast, generative speech enhancement models often require multiple calls, resulting in a computational complexity that is too high for many online speech enhancement applications. This work presents the Diffusion Buffer, a generative diffusion-based speech enhancement model which only requires one neural network call per incoming signal frame from a stream of data and performs enhancement in an online fashion on a consumer-grade GPU. The key idea of the Diffusion Buffer is to align physical time with diffusion time-steps. The approach progressively denoises frames through physical time, where past frames have more noise removed. Consequently, an enhanced frame is output to the listener with a delay defined by the Diffusion Buffer, and the output frame has a corresponding look-ahead. In this work, we extend upon our previous work by carefully designing a 2D convolutional UNet architecture that specifically aligns with the Diffusion Buffer's look-ahead. We observe that the proposed UNet improves performance, particularly when the algorithmic latency is low. Moreover, we show that using a Data Prediction loss instead of a Denoising Score Matching loss enables flexible control over the trade-off between algorithmic latency and quality during inference. The extended Diffusion Buffer, equipped with a novel NN and loss function, drastically reduces the algorithmic latency from 320-960 ms to 32-176 ms while even improving performance. While it has been shown before that offline generative diffusion models outperform predictive approaches on unseen noisy speech data, we confirm that the online Diffusion Buffer also outperforms its predictive counterpart on unseen noisy speech data.
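
The buffer mechanics can be sketched compactly. Assuming a stand-in `denoiser` that advances every buffered frame one diffusion step per call (the paper's 2D convolutional UNet in practice), the alignment of physical time with diffusion time looks like this:

```python
from collections import deque

class DiffusionBuffer:
    """Illustrative sketch: each incoming frame enters at the noisiest
    diffusion step; one network call per frame moves all buffered
    frames one step closer to clean, and the frame that entered
    `depth` calls ago is emitted fully denoised. Latency and
    look-ahead are thus set by the buffer depth."""

    def __init__(self, depth, denoiser):
        self.depth = depth          # number of diffusion steps = delay
        self.denoiser = denoiser    # stand-in for one joint UNet pass
        self.buf = deque()

    def push(self, noisy_frame):
        self.buf.append(noisy_frame)                  # newest = noisiest
        self.buf = deque(self.denoiser(list(self.buf)))
        return self.buf.popleft() if len(self.buf) >= self.depth else None
```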

[581] Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform

Xiangzhu Kong, Huang Hao, Zhijian Ou

Main category: eess.AS

TL;DR: SHTNet is a lightweight spherical harmonic transform framework that addresses cross-array generalization in multi-channel ASR through geometry-invariant sound field decomposition, spatio-spectral attention fusion, and robust training methods.

DetailsMotivation: To overcome cross-array generalization challenges in multi-channel automatic speech recognition, where conventional methods struggle with diverse microphone array geometries.

Method: Three key innovations: 1) SHT-based spatial sound field decomposition for geometry-invariant processing, 2) Spatio-Spectral Attention Fusion Network combining spatial modeling, attention mechanisms, and spectral noise suppression, 3) Rand-SHT training with random channel selection and array geometry reconstruction.

Result: Achieves 39.26% average CER across heterogeneous arrays (circular, square, binaural) on Aishell-4, Alimeeting, and XMOS datasets, with 97.1% fewer computations than conventional neural beamformers.

Conclusion: SHTNet provides an effective lightweight solution for cross-array generalization in multi-channel ASR by decoupling signal processing from array geometry and achieving significant computational efficiency.

Abstract: This paper presents SHTNet, a lightweight spherical harmonic transform (SHT) based framework, which is designed to address cross-array generalization challenges in multi-channel automatic speech recognition (ASR) through three key innovations. First, SHT based spatial sound field decomposition converts microphone signals into geometry-invariant spherical harmonic coefficients, isolating signal processing from array geometry. Second, the Spatio-Spectral Attention Fusion Network (SSAFN) combines coordinate-aware spatial modeling, refined self-attention channel combinator, and spectral noise suppression without conventional beamforming. Third, Rand-SHT training enhances robustness through random channel selection and array geometry reconstruction. The system achieves 39.26% average CER across heterogeneous arrays (e.g., circular, square, and binaural) on datasets including Aishell-4, Alimeeting, and XMOS, with 97.1% fewer computations than conventional neural beamformers.
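
The geometry invariance of the first step comes from projecting microphone signals onto a spherical harmonic basis. Here is a hedged sketch via least squares, ignoring the radial and array-response compensation a full encoder may apply:

```python
import numpy as np
from scipy.special import sph_harm

def sh_coefficients(signals, azimuth, polar, order=2):
    """Project (mics, time) signals onto spherical harmonics up to
    `order`, given each microphone's direction, yielding coefficients
    that no longer depend on the array layout."""
    basis = np.column_stack([
        sph_harm(m, n, azimuth, polar)           # (mics,) per (n, m)
        for n in range(order + 1) for m in range(-n, n + 1)
    ])                                           # (mics, (order+1)**2)
    coeffs, *_ = np.linalg.lstsq(basis, signals, rcond=None)
    return coeffs                                # ((order+1)**2, time)
```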

[582] Post-training for Deepfake Speech Detection

Wanying Ge, Xin Wang, Xuechen Liu, Junichi Yamagishi

Main category: eess.AS

TL;DR: Post-training approach adapts SSL models for deepfake speech detection using large multilingual dataset, achieving strong robustness and outperforming state-of-the-art detectors.

DetailsMotivation: Bridge the gap between general pre-training and domain-specific fine-tuning for deepfake speech detection.

Method: Post-training SSL models using large-scale multilingual speech dataset (56k+ hours genuine + 18k+ hours artifacts) across 100+ languages.

Result: Post-trained models show strong robustness and generalization to unseen deepfake speech, consistently surpass existing SOTA detectors when fine-tuned.

Conclusion: Post-training effectively adapts SSL models for deepfake detection, providing robust performance and outperforming existing methods.

Abstract: We introduce a post-training approach that adapts self-supervised learning (SSL) models for deepfake speech detection by bridging the gap between general pre-training and domain-specific fine-tuning. We present AntiDeepfake models, a series of post-trained models developed using a large-scale multilingual speech dataset containing over 56,000 hours of genuine speech and 18,000 hours of speech with various artifacts in over one hundred languages. Experimental results show that the post-trained models already exhibit strong robustness and generalization to unseen deepfake speech. When they are further fine-tuned on the Deepfake-Eval-2024 dataset, these models consistently surpass existing state-of-the-art detectors that do not leverage post-training. Model checkpoints and source code are available online.

eess.IV

[583] Conformal Lesion Segmentation for 3D Medical Images

Binyu Tan, Zhiyuan Wang, Jinhao Duan, Kaidi Xu, Heng Tao Shen, Xiaoshuang Shi, Fumin Shen

Main category: eess.IV

TL;DR: Proposes Conformal Lesion Segmentation (CLS) - a risk-constrained framework that uses conformal prediction to calibrate data-driven thresholds, ensuring test-time false negative rate (FNR) remains below target tolerance with statistical guarantees.

DetailsMotivation: Existing medical image segmentation models use fixed thresholds (e.g., 0.5) without statistical guarantees on key metrics like false negative rate, undermining reliable deployment in high-stakes clinical applications like 3D lesion segmentation.

Method: CLS holds out a calibration set to analyze threshold settings under FNR tolerance, defines FNR-specific loss function, identifies critical thresholds for calibration data, and determines approximate 1-α quantile as test-time confidence threshold using conformal prediction.

Result: Validated on six 3D-LS datasets across five backbone models, CLS provides rigorous FNR constraint while yielding more precise and reliable segmentations with statistical soundness.

Conclusion: CLS enables deployment of risk-aware segmentation in clinical practice by generalizing statistical regularities from calibration to test data, offering principled risk control for medical image segmentation.

Abstract: Medical image segmentation serves as a critical component of precision medicine, enabling accurate localization and delineation of pathological regions, such as lesions. However, existing models empirically apply fixed thresholds (e.g., 0.5) to differentiate lesions from the background, offering no statistical guarantees on key metrics such as the false negative rate (FNR). This lack of principled risk control undermines their reliable deployment in high-stakes clinical applications, especially in challenging scenarios like 3D lesion segmentation (3D-LS). To address this issue, we propose a risk-constrained framework, termed Conformal Lesion Segmentation (CLS), that calibrates data-driven thresholds via conformalization to ensure the test-time FNR remains below a target tolerance $\varepsilon$ under desired risk levels. CLS begins by holding out a calibration set to analyze the threshold setting for each sample under the FNR tolerance, drawing on the idea of conformal prediction. We define an FNR-specific loss function and identify the critical threshold at which each calibration data point just satisfies the target tolerance. Given a user-specified risk level $\alpha$, we then determine the approximate $1-\alpha$ quantile of all the critical thresholds in the calibration set as the test-time confidence threshold. By conformalizing such critical thresholds, CLS generalizes the statistical regularities observed in the calibration set to new test data, providing rigorous FNR constraint while yielding more precise and reliable segmentations. We validate the statistical soundness and predictive performance of CLS on six 3D-LS datasets across five backbone models, and conclude with actionable insights for deploying risk-aware segmentation in clinical practice.
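
The calibration step can be sketched in a few lines. Assuming the decision rule "predict lesion where prob >= 1 - lambda" (so FNR is nonincreasing in lambda) and a grid search standing in for the paper's exact critical-threshold computation:

```python
import numpy as np

def calibrate_lambda(probs, gts, eps=0.05, alpha=0.1):
    """For each calibration volume, find the smallest ("critical")
    lambda whose segmentation keeps FNR <= eps, then return the
    approximate (1 - alpha) quantile of the critical values as the
    conservative test-time threshold parameter."""
    grid = np.linspace(0.0, 1.0, 101)
    critical = []
    for prob, gt in zip(probs, gts):
        lesion = prob[gt.astype(bool)]      # scores on true lesion voxels
        if lesion.size == 0:
            continue                        # no lesion: nothing to miss
        fnr = np.array([(lesion < 1.0 - lam).mean() for lam in grid])
        critical.append(grid[np.argmax(fnr <= eps)])
    return np.quantile(critical, 1.0 - alpha)

# Test-time rule: segmentation = prob >= 1.0 - calibrate_lambda(...)
```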

[584] RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation

Juntao Jiang, Jiangning Zhang, Weixuan Liu, Muxuan Gao, Xiaobin Hu, Zhucun Xue, Yong Liu, Shuicheng Yan

Main category: eess.IV

TL;DR: RWKV-UNet integrates RWKV structure into U-Net to overcome CNN’s long-range dependency limitations and transformer’s computational complexity, achieving SOTA performance on 11 medical image segmentation datasets.

DetailsMotivation: Address limitations of CNNs in capturing long-range dependencies and transformers' high computational complexity in medical image segmentation.

Method: Integrate RWKV structure into U-Net with Global-Local Spatial Perception blocks combining CNNs and RWKVs, and Cross-Channel Mix module for multi-scale feature fusion.

Result: Achieves state-of-the-art performance on 11 benchmark datasets for various medical image segmentation tasks, with smaller variants balancing accuracy and efficiency.

Conclusion: RWKV-UNet effectively enhances long-range dependency capture and contextual understanding while maintaining computational efficiency, making it suitable for clinical applications.

Abstract: In recent years, significant advancements have been made in deep learning for medical image segmentation, particularly with convolutional neural networks (CNNs) and transformer models. However, CNNs face limitations in capturing long-range dependencies, while transformers suffer from high computational complexity. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model’s ability to capture long-range dependencies and to improve contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with developed Global-Local Spatial Perception (GLSP) blocks combining CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module to improve skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on 11 benchmark datasets show that the RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation tasks. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.

[585] A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography

Yui Lo, Yuqian Chen, Dongnan Liu, Leo Zekelman, Jarrett Rushmore, Yogesh Rathi, Nikos Makris, Alexandra J. Golby, Fan Zhang, Weidong Cai, Lauren J. O’Donnell

Main category: eess.IV

TL;DR: Tract2Shape is a multimodal deep learning framework that predicts white matter tractography shape measures from geometric and scalar features, outperforming SOTA models and showing strong cross-dataset generalization.

DetailsMotivation: Conventional methods for computing white matter shape measures are computationally expensive and time-consuming for large-scale datasets due to voxel-based representations, creating a need for more efficient approaches.

Method: Proposed Tract2Shape framework uses multimodal inputs (geometric point cloud and scalar tabular features) with dimensionality reduction via PCA to predict five primary shape components, trained and evaluated on HCP-YA and PPMI datasets.

Result: Tract2Shape outperforms SOTA deep learning models across all ten shape measures on HCP-YA dataset, achieving highest average Pearson’s r and lowest nMSE. Maintains strong performance on unseen PPMI dataset, demonstrating cross-dataset generalization.

Conclusion: Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures, supporting scalable analysis across datasets and laying foundation for future large-scale white matter shape analysis.

Abstract: Shape measures have emerged as promising descriptors of white matter tractography, offering complementary insights into anatomical variability and associations with cognitive and clinical phenotypes. However, conventional methods for computing shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations. We propose Tract2Shape, a novel multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features to predict ten white matter tractography shape measures. To enhance model efficiency, we utilize a dimensionality reduction algorithm for the model to predict five primary shape components. The model is trained and evaluated on two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset. We evaluate the performance of Tract2Shape by training and testing it on the HCP-YA dataset and comparing the results with state-of-the-art models. To further assess its robustness and generalization ability, we also test Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep learning models across all ten shape measures, achieving the highest average Pearson’s r and the lowest nMSE on the HCP-YA dataset. The ablation study shows that both multimodal input and PCA contribute to performance gains. On the unseen testing PPMI dataset, Tract2Shape maintains a high Pearson’s r and low nMSE, demonstrating strong generalizability in cross-dataset evaluation. Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures from tractography data, supporting scalable analysis across datasets. This framework lays a promising foundation for future large-scale white matter shape analysis.
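
The dimensionality-reduction step is standard PCA; below is a minimal sketch with stand-in data showing how ten measures become five regression targets and are mapped back after prediction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
shape_measures = rng.normal(size=(200, 10))  # stand-in: 10 measures/tract

pca = PCA(n_components=5).fit(shape_measures)
targets = pca.transform(shape_measures)      # 5 components the net predicts
recovered = pca.inverse_transform(targets)   # back to the 10 measures
```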

[586] Regression is all you need for medical image translation

Sebastian Rassmann, David Kügler, Christian Ewert, Martin Reuter

Main category: eess.IV

TL;DR: YODA is a 2.5D diffusion-based framework for medical image translation that uses expectation-approximation sampling and regression sampling to produce noise-free images efficiently, challenging the superiority of traditional GANs and diffusion models in medical applications.

DetailsMotivation: GANs and diffusion models risk introducing hallucinations and noise replication in medical imaging where accuracy and fidelity are crucial, unlike in natural image synthesis where creativity and realism are strengths.

Method: Proposes YODA with two sampling approaches: (1) Expectation-Approximation (ExpA) sampling that averages multiple samples to approximate the expected value, and (2) regression sampling that retains initial DM prediction and omits iterative refinement for single-step noise-free images.

Result: YODA outperforms eight state-of-the-art DMs and GANs across five multi-modal datasets, with regression sampling being more efficient and matching/exceeding image quality of full diffusion sampling even with ExpA. Translated images are interchangeable with or superior to physical acquisitions.

Conclusion: Iterative refinement in diffusion models only enhances perceptual realism without benefiting information translation. YODA challenges the presumed superiority of DMs and GANs over computationally cheap regression models for high-quality medical image translation.

Abstract: While Generative Adversarial Nets (GANs) and Diffusion Models (DMs) have achieved impressive results in natural image synthesis, their core strengths - creativity and realism - can be detrimental in medical applications, where accuracy and fidelity are paramount. These models instead risk introducing hallucinations and replication of unwanted acquisition noise. Here, we propose YODA (You Only Denoise once - or Average), a 2.5D diffusion-based framework for medical image translation (MIT). Consistent with DM theory, we find that conventional diffusion sampling stochastically replicates noise. To mitigate this, we draw and average multiple samples, akin to physical signal averaging. As this effectively approximates the DM’s expected value, we term this Expectation-Approximation (ExpA) sampling. We additionally propose regression sampling YODA, which retains the initial DM prediction and omits iterative refinement to produce noise-free images in a single step. Across five diverse multi-modal datasets - including multi-contrast brain MRI and pelvic MRI-CT - we demonstrate that regression sampling is not only substantially more efficient but also matches or exceeds image quality of full diffusion sampling even with ExpA. Our results reveal that iterative refinement solely enhances perceptual realism without benefiting information translation, which we confirm in relevant downstream tasks. YODA outperforms eight state-of-the-art DMs and GANs and challenges the presumed superiority of DMs and GANs over computationally cheap regression models for high-quality MIT. Furthermore, we show that YODA-translated images are interchangeable with, or even superior to, physical acquisitions for several medical applications.
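
ExpA sampling is just signal averaging over independent diffusion samples; a minimal sketch, with `diffusion_sample` standing in for one full reverse pass of the model:

```python
import torch

def expa_sample(diffusion_sample, source, n=8):
    """Expectation-Approximation (ExpA): average n independent samples
    to approximate the model's expected value, cancelling the noise
    that a single stochastic sample would replicate. Regression
    sampling instead keeps the initial prediction and skips the
    iterative refinement entirely."""
    samples = torch.stack([diffusion_sample(source) for _ in range(n)])
    return samples.mean(dim=0)
```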

[587] SimCortex: Collision-free Simultaneous Cortical Surfaces Reconstruction

Kaveh Moradkhani, R Jarrett Rushmore, Sylvain Bouix

Main category: eess.IV

TL;DR: SimCortex is a deep learning framework that reconstructs all brain cortical surfaces from T1-weighted MRI while preserving topological properties, reducing overlaps and self-intersections compared to current methods.

DetailsMotivation: Current cortical surface reconstruction methods struggle with complex cortical geometries, topological requirements, and often produce surfaces with overlaps, self-intersections, and topological defects.

Method: First segments the T1w image into a nine-class tissue label map, generates collision-free initial surface meshes, then applies multiscale diffeomorphic deformations using stationary velocity fields integrated via scaling-and-squaring for smooth, topology-preserving transformations.

Result: SimCortex dramatically reduces surface overlaps and self-intersections while maintaining state-of-the-art geometric accuracy on standard datasets.

Conclusion: The method successfully overcomes limitations of current approaches by providing topology-preserving cortical surface reconstruction with significantly reduced surface collisions.

Abstract: Accurate cortical surface reconstruction from magnetic resonance imaging (MRI) data is crucial for reliable neuroanatomical analyses. Current methods have to contend with complex cortical geometries, strict topological requirements, and often produce surfaces with overlaps, self-intersections, and topological defects. To overcome these shortcomings, we introduce SimCortex, a deep learning framework that simultaneously reconstructs all brain surfaces (left/right white-matter and pial) from T1-weighted(T1w) MRI volumes while preserving topological properties. Our method first segments the T1w image into a nine-class tissue label map. From these segmentations, we generate subject-specific, collision-free initial surface meshes. These surfaces serve as precise initializations for subsequent multiscale diffeomorphic deformations. Employing stationary velocity fields (SVFs) integrated via scaling-and-squaring, our approach ensures smooth, topology-preserving transformations with significantly reduced surface collisions and self-intersections. Evaluations on standard datasets demonstrate that SimCortex dramatically reduces surface overlaps and self-intersections, surpassing current methods while maintaining state-of-the-art geometric accuracy.
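
Scaling-and-squaring is the standard way to exponentiate a stationary velocity field into a diffeomorphic displacement. A minimal PyTorch sketch follows; the (x, y, z) channel convention and step count are assumptions for illustration, not details taken from the paper:

```python
import torch
import torch.nn.functional as F

def compose(disp_a, disp_b):
    # phi_a o phi_b for phi = id + u: sample field a at points moved by b,
    # then add b. Fields are (B, 3, D, H, W), channels in (x, y, z)
    # voxel-displacement order (an assumed convention).
    B, _, D, H, W = disp_a.shape
    device = disp_a.device
    zz, yy, xx = torch.meshgrid(
        torch.arange(D, device=device),
        torch.arange(H, device=device),
        torch.arange(W, device=device), indexing="ij")
    identity = torch.stack((xx, yy, zz), dim=-1).float()   # (D, H, W, 3)
    loc = identity + disp_b.permute(0, 2, 3, 4, 1)         # (B, D, H, W, 3)
    size = torch.tensor([W, H, D], dtype=torch.float32, device=device)
    grid = 2.0 * loc / (size - 1) - 1.0                    # normalize to [-1, 1]
    warped_a = F.grid_sample(disp_a, grid, align_corners=True,
                             padding_mode="border")
    return warped_a + disp_b

def scaling_and_squaring(velocity, num_steps=7):
    # Integrate an SVF into a displacement field: start from v / 2^N,
    # then self-compose ("square") N times, preserving invertibility.
    disp = velocity / (2 ** num_steps)
    for _ in range(num_steps):
        disp = compose(disp, disp)
    return disp
```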

[588] Adapting Medical Vision Foundation Models for Volumetric Medical Image Segmentation via Active Learning and Selective Semi-supervised Fine-tuning

Jin Yang, Daniel S. Marcus, Aristeidis Sotiras

Main category: eess.IV

TL;DR: Proposes Active Source-Free Domain Adaptation (ASFDA) method to efficiently adapt Medical Vision Foundation Models to target domains for medical image segmentation using active learning to select informative samples without source data access.

DetailsMotivation: Medical Vision Foundation Models (Med-VFMs) need efficient adaptation to target domains, but current methods lack systematic approaches for selecting informative samples to maximize performance within a minimal fine-tuning budget.

Method: Uses Active Learning with two query metrics: Diversified Knowledge Divergence (DKD) to measure source-target knowledge gap and diversity, and Anatomical Segmentation Difficulty (ASD) to evaluate segmentation difficulty. Also employs Selective Semi-supervised Fine-tuning.

Result: The method enables efficient adaptation of Med-VFMs to target domains by selecting the most informative samples, maximizing performance while minimizing the number of samples needed for fine-tuning.

Conclusion: ASFDA provides an effective framework for adapting Med-VFMs to new medical domains efficiently through strategic sample selection and semi-supervised fine-tuning, addressing the challenge of limited annotated data in medical imaging.

Abstract: Medical Vision Foundation Models (Med-VFMs) have superior capabilities of interpreting medical images due to the knowledge learned from self-supervised pre-training with extensive unannotated images. To improve their performance on downstream evaluations, especially segmentation, a few samples from target domains are typically selected at random for fine-tuning. However, few works have explored how to adapt Med-VFMs to achieve optimal performance on target domains efficiently. Thus, it is highly desirable to design an efficient way of fine-tuning Med-VFMs by selecting informative samples to maximize their adaptation performance on target domains. To achieve this, we propose an Active Source-Free Domain Adaptation (ASFDA) method to efficiently adapt Med-VFMs to target domains for volumetric medical image segmentation. ASFDA employs a novel Active Learning (AL) method to select the most informative samples from target domains for fine-tuning Med-VFMs without access to source pre-training samples, thus maximizing their performance with a minimal selection budget. In this AL method, we design an Active Test-Time Sample Query strategy to select samples from the target domains via two query metrics: Diversified Knowledge Divergence (DKD) and Anatomical Segmentation Difficulty (ASD). DKD measures the source-target knowledge gap and intra-domain diversity, utilizing the knowledge from pre-training to guide the querying of source-dissimilar and semantically diverse samples from the target domains. ASD evaluates the difficulty of segmenting anatomical structures by adaptively measuring predictive entropy over foreground regions. Additionally, our ASFDA method employs a Selective Semi-supervised Fine-tuning to improve the performance and efficiency of fine-tuning by identifying high-reliability samples among unqueried ones.
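
As a rough illustration of how such a two-metric query could be scored, here is a hypothetical sketch: source dissimilarity as distance to the nearest source prototype (standing in for DKD) plus mean foreground entropy (standing in for ASD). The weighting and formulations are placeholders, not the paper's definitions:

```python
import numpy as np

def foreground_entropy(probs, fg_mask):
    # ASD-like difficulty: mean voxel-wise predictive entropy over the
    # predicted foreground. probs: (C, D, H, W) softmax; fg_mask: (D, H, W).
    eps = 1e-8
    ent = -(probs * np.log(probs + eps)).sum(axis=0)
    return float(ent[fg_mask].mean()) if fg_mask.any() else 0.0

def source_divergence(feat, source_protos):
    # DKD-like source dissimilarity: distance to the nearest prototype
    # summarizing the pre-training (source) feature distribution.
    return float(np.linalg.norm(source_protos - feat, axis=1).min())

def select_queries(candidates, budget, w_dkd=0.5, w_asd=0.5):
    # Rank unlabeled target volumes by a weighted query score and
    # greedily take the top `budget` samples for fine-tuning.
    scored = []
    for c in candidates:
        s = (w_dkd * source_divergence(c["feat"], c["source_protos"])
             + w_asd * foreground_entropy(c["probs"], c["fg_mask"]))
        scored.append((s, c["id"]))
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:budget]]
```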

[589] Curriculum Learning with Synthetic Data for Enhanced Pulmonary Nodule Detection in Chest Radiographs

Pranav Sambhu, Om Guin, Madhav Sambhu, Jinho Cha

Main category: eess.IV

TL;DR: Curriculum learning with diffusion-based synthetic augmentation improves pulmonary nodule detection, especially for difficult cases with low size, brightness, and contrast.

DetailsMotivation: Conventional AI models struggle with detecting difficult pulmonary nodules due to data imbalance and limited annotation, particularly for nodules with low size, brightness, and contrast.

Method: Used Faster R-CNN with FPN backbone trained on hybrid dataset (NODE21, VinDr-CXR, CheXpert, and 11,206 DDPM-generated synthetic images) with curriculum learning guided by difficulty scores based on size, brightness, and contrast.

Result: Curriculum model achieved mean AUC of 0.95 vs 0.89 baseline (p<0.001), with sensitivity 70% vs 48% and accuracy 82% vs 70%. Consistent improvements across all difficulty levels.

Conclusion: Curriculum-guided synthetic augmentation enhances model robustness and generalization for pulmonary nodule detection, with more anatomically focused attention.

Abstract: This study evaluates whether integrating curriculum learning with diffusion-based synthetic augmentation can enhance the detection of difficult pulmonary nodules in chest radiographs, particularly those with low size, brightness, and contrast, which often challenge conventional AI models due to data imbalance and limited annotation. A Faster R-CNN with a Feature Pyramid Network (FPN) backbone was trained on a hybrid dataset comprising expert-labeled NODE21 (1,213 patients; 52.4 percent male; mean age 63.2 +/- 11.5 years), VinDr-CXR, CheXpert, and 11,206 DDPM-generated synthetic images. Difficulty scores based on size, brightness, and contrast guided curriculum learning. Performance was compared to a non-curriculum baseline using mean average precision (mAP), Dice score, and area under the curve (AUC). Statistical tests included bootstrapped confidence intervals, DeLong tests, and paired t-tests. The curriculum model achieved a mean AUC of 0.95 versus 0.89 for the baseline (p < 0.001), with improvements in sensitivity (70 percent vs. 48 percent) and accuracy (82 percent vs. 70 percent). Stratified analysis demonstrated consistent gains across all difficulty bins (Easy to Very Hard). Grad-CAM visualizations confirmed more anatomically focused attention under curriculum learning. These results suggest that curriculum-guided synthetic augmentation enhances model robustness and generalization for pulmonary nodule detection.
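
A difficulty-scored curriculum of this kind can be sketched in a few lines. The weights, normalization, and staging below are illustrative assumptions, not the study's actual scoring function:

```python
import numpy as np

def difficulty_score(size, brightness, contrast, w=(0.4, 0.3, 0.3)):
    # Smaller, dimmer, lower-contrast nodules are harder to detect.
    # Inputs are assumed pre-normalized to [0, 1]; weights are illustrative.
    ease = w[0] * size + w[1] * brightness + w[2] * contrast
    return 1.0 - ease  # higher = harder

def curriculum_order(samples):
    # Sort training samples easy-to-hard.
    return sorted(samples, key=lambda s: difficulty_score(
        s["size"], s["brightness"], s["contrast"]))

def staged_subsets(samples, num_stages=4):
    # Yield cumulative easy-to-hard subsets: stage k trains on the
    # easiest k/num_stages fraction of the data, expanding each stage.
    ordered = curriculum_order(samples)
    for k in range(1, num_stages + 1):
        yield ordered[: int(len(ordered) * k / num_stages)]
```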

[590] Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion

Xiaojian Ding, Lin Zhao, Xian Li, Xiaoying Zhu

Main category: eess.IV

TL;DR: Proposes HSACC, a novel incomplete multi-view clustering framework using hierarchical semantic alignment and cooperative completion to handle missing views and achieve robust cross-view fusion.

DetailsMotivation: Existing deep incomplete multi-view clustering methods suffer from static fusion strategies and two-stage pipelines, leading to suboptimal fusion and error propagation issues.

Method: HSACC uses a dual-level semantic space design: the low-level space ensures consistency via mutual information maximization, while the high-level space assigns adaptive view weights based on distributional affinity for weighted fusion. Missing views are implicitly recovered through aligned latent projections, and reconstruction and clustering objectives are jointly optimized.

Result: HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies confirm effectiveness of hierarchical alignment and dynamic weighting, parameter analysis shows robustness to hyperparameter variations.

Conclusion: The proposed HSACC framework effectively addresses incomplete multi-view clustering challenges through hierarchical semantic alignment and cooperative completion, achieving superior performance and robustness.

Abstract: Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model’s robustness to hyperparameter variations.
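
The adaptive weighting step can be illustrated with a small sketch, assuming affinity is approximated by negative L2 distance between each view embedding and an initial mean-fused representation (the paper's exact affinity measure may differ):

```python
import torch
import torch.nn.functional as F

def adaptive_fusion(view_embeds, temperature=1.0):
    # view_embeds: list of (B, d) embeddings, one per available view.
    # Weights come from each view's affinity to an initial fused
    # representation; affinity = negative squared distance, softmaxed.
    Z = torch.stack(view_embeds)             # (V, B, d)
    init = Z.mean(dim=0)                     # initial fused representation
    aff = -((Z - init) ** 2).sum(dim=-1)     # (V, B) per-sample affinities
    w = F.softmax(aff / temperature, dim=0)  # per-sample view weights
    return (w.unsqueeze(-1) * Z).sum(dim=0)  # (B, d) unified representation
```

Views closer to the consensus receive larger weights, so a corrupted or poorly aligned view is down-weighted per sample rather than discarded globally.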

[591] Computer Navigated Spinal Surgery Using Magnetic Resonance Imaging and Augmented Reality

Songyuan Lu, Jingwen Hui, Jake Weeks, David B. Berry, Fanny Chapelin, Frank Talke

Main category: eess.IV

TL;DR: A radiation-free surgical navigation system using MRI and AR with fiducial markers for spinal pain management procedures, showing comparable accuracy to conventional fluoroscopy-based methods.

DetailsMotivation: Current spinal pain management procedures like RFA and ESI rely on fluoroscopy, which exposes patients and physicians to ionizing radiation, creating a need for radiation-free alternatives.

Method: Combines MRI with fiducial ArUco marker-based AR. Uses high-resolution MRI scans converted to surface meshes with Laplacian smoothing. Stereo camera tracks single or dual fiducial markers for real-time patient pose tracking. Custom AR software overlays MRI images onto the patient.

Result: Dual-ArUco marker tracking increased accuracy and reduced the average needle misplacement distance compared to single-marker procedures, with misplacement comparable to the roughly 2 mm average deviation of conventional fluoroscopy-guided epidural techniques.

Conclusion: The radiation-free system demonstrates promise as an alternative to fluoroscopy by improving image-guided spinal navigation while eliminating radiation exposure.

Abstract: Current spinal pain management procedures, such as radiofrequency ablation (RFA) and epidural steroid injection (ESI), rely on fluoroscopy for needle placement which exposes patients and physicians to ionizing radiation. In this paper, we investigate a radiation-free surgical navigation system for spinal pain management procedures that combines magnetic resonance imaging (MRI) with fiducial ArUco marker-based augmented reality (AR). High-resolution MRI scans of a lumbar spinal phantom were obtained and assembled as a surface mesh. Laplacian smoothing algorithms were then applied to smoothen the surface and improve the model fidelity. A commercially available stereo camera (ZED2) was used to track single or dual fiducial ArUco markers on the patient to determine the patient’s real-time pose. Custom AR software was applied to overlay the MRI image onto the patient, allowing the physician to see not only the outer surface of the patient but also the complete anatomy of the patient below the surface. Needle-insertion trials on a 3D-printed 3-vertebra phantom showed that dual-ArUco marker tracking increased the accuracy of needle insertions and reduced the average needle misplacement distance compared to single-ArUco marker procedures. The average needle misplacement is comparable to the average deviation of 2 mm for conventional epidural techniques using fluoroscopy. Our radiation-free system demonstrates promise to serve as an alternative to fluoroscopy by improving image-guided spinal navigation.
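
Marker-based pose tracking of this kind is commonly built on OpenCV's aruco module. A minimal sketch using the pre-4.7 `cv2.aruco` API from opencv-contrib-python; the intrinsics, marker size, and dictionary are placeholders, not the authors' calibration or software:

```python
import cv2
import numpy as np

# Camera intrinsics K and distortion coefficients would come from
# calibration (the paper uses a ZED2 stereo camera; these are placeholders).
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)
MARKER_LEN = 0.05  # marker side length in meters (assumed)

aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
params = cv2.aruco.DetectorParameters_create()  # pre-4.7 contrib API

def track_markers(frame):
    # Detect ArUco markers and return {marker_id: (rvec, tvec)} poses.
    # Averaging two markers' translations is one simple way to emulate
    # the dual-marker tracking described above.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(
        gray, aruco_dict, parameters=params)
    if ids is None:
        return {}
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, MARKER_LEN, K, dist)
    return {int(i): (r.ravel(), t.ravel())
            for i, r, t in zip(ids.ravel(), rvecs, tvecs)}
```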

Last updated: 2025-10-30
Built with Hugo, theme modified from Stack