Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 179]
cs.CV [Total: 172]
cs.AI [Total: 78]
cs.SD [Total: 6]
cs.LG [Total: 167]
cs.MA [Total: 6]
cs.MM [Total: 4]
eess.AS [Total: 6]
eess.IV [Total: 13]

cs.CL

[1] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL

Ashish Kattamuri, Ishita Prasad, Meetu Malhotra, Arpita Vats, Rahul Raja, Albert Lie

Main category: cs.CL

TL;DR: A new framework combining Group Relative Policy Optimization (GRPO) with multilingual contrastive reward signals improves Text-to-SQL systems by enhancing both execution accuracy and semantic alignment across languages, achieving significant performance gains with minimal training data.

Details

Motivation: Current Text-to-SQL methods focus only on executable queries and overlook semantic alignment challenges. There's a significant performance drop (6 percentage points on average) when moving from English to other languages, highlighting the need for better cross-lingual semantic accuracy.

Method: Proposes a framework combining Group Relative Policy Optimization (GRPO) with multilingual contrastive reward signals to enhance semantic similarity between SQL generation and user intent. Uses reinforcement learning with contrastive rewards for directed semantic alignment.

Result: On MultiSpider dataset, fine-tuning LLaMA-3-3B with GRPO improved execution accuracy to 87.4% (+26 pp) and semantic accuracy to 52.29% (+32.86 pp). Adding contrastive reward further improved semantic accuracy to 59.14% (+6.85 pp, up to +10 pp for Vietnamese). The 3B model outperformed zero-shot 8B model in execution accuracy (88.86% vs 81.43%) using only 3,000 training examples.

Conclusion: The framework demonstrates that contrastive rewards can significantly improve Text-to-SQL system performance for semantic alignment across languages without requiring large-scale training datasets, enabling smaller models to outperform larger ones.

Abstract: Current Text-to-SQL methods are evaluated and only focused on executable queries, overlooking the semantic alignment challenge – both in terms of the semantic meaning of the query and the correctness of the execution results. Even execution accuracy itself shows significant drops when moving from English to other languages, with an average decline of 6 percentage points across non-English languages. We address these challenges by presenting a new framework that combines Group Relative Policy Optimization (GRPO) within a multilingual contrastive reward signal to enhance both task efficiency and semantic accuracy in Text-to-SQL systems in cross-lingual scenarios. Our method teaches models to obtain better correspondence between SQL generation and user intent by combining a reward signal based on semantic similarity. On the seven-language MultiSpider dataset, fine-tuning the LLaMA-3-3B model with GRPO improved the execution accuracy up to 87.4 percent (+26 pp over zero-shot) and semantic accuracy up to 52.29 percent (+32.86 pp). Adding our contrastive reward signal in the GRPO framework further improved the average semantic accuracy to 59.14 percent (+6.85 pp, up to +10 pp for Vietnamese). Our experiments showcase that a smaller, parameter-efficient 3B LLaMA model fine-tuned with our contrastive reward signal outperforms a much larger zero-shot 8B LLaMA model, with an uplift of 7.43 pp in execution accuracy (from 81.43 percent on the 8B model to 88.86 percent on the 3B model), and nearly matches its semantic accuracy (59.14 percent vs. 68.57 percent) – all using just 3,000 reinforcement learning training examples. These results demonstrate how we can improve the performance of Text-to-SQL systems with contrastive rewards for directed semantic alignment, without requiring large-scale training datasets.

[2] From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening

Ratna Kandala, Akshata Kishore Moharir, Divya Arvinda Nayak

Main category: cs.CL

TL;DR: The paper proposes a Generative Operational Framework using LLMs to translate technical XAI outputs into clinically relevant narratives for mental health screening, addressing the lab-to-clinic gap.

Details

Motivation: Current XAI techniques like SHAP and LIME produce technically faithful outputs but fail to deliver clinically actionable insights that clinicians can use or patients can understand, creating a barrier to real-world adoption in mental health screening.

Method: The Generative Operational Framework leverages Large Language Models as a central translation engine that ingests raw technical outputs from diverse XAI tools and synthesizes them with clinical guidelines (via RAG) to automatically generate human-readable, evidence-backed clinical narratives.

Result: The framework directly addresses key operational barriers including workflow integration, bias mitigation, and stakeholder-specific communication, moving beyond isolated data points toward integrated, actionable AI.

Conclusion: This approach provides a strategic roadmap for delivering trustworthy AI in clinical practice by bridging the gap between technical transparency and human utility through generative XAI systems.

Abstract: Explainable Artificial Intelligence (XAI) has been presented as the critical component for unlocking the potential of machine learning in mental health screening (MHS). However, a persistent lab-to-clinic gap remains. Current XAI techniques, such as SHAP and LIME, excel at producing technically faithful outputs such as feature importance scores, but fail to deliver clinically relevant, actionable insights that can be used by clinicians or understood by patients. This disconnect between technical transparency and human utility is the primary barrier to real-world adoption. This paper argues that this gap is a translation problem and proposes the Generative Operational Framework, a novel system architecture that leverages Large Language Models (LLMs) as a central translation engine. This framework is designed to ingest the raw, technical outputs from diverse XAI tools and synthesize them with clinical guidelines (via RAG) to automatically generate human-readable, evidence-backed clinical narratives. To justify our solution, we provide a systematic analysis of the components it integrates, tracing the evolution from intrinsic models to generative XAI. We demonstrate how this framework directly addresses key operational barriers, including workflow integration, bias mitigation, and stakeholder-specific communication. This paper also provides a strategic roadmap for moving the field beyond the generation of isolated data points toward the delivery of integrated, actionable, and trustworthy AI in clinical practice.

[3] A Linguistics-Aware LLM Watermarking via Syntactic Predictability

Shinwoo Park, Hyejin Park, Hyeseon Ahn, Yo-Sub Han

Main category: cs.CL

TL;DR: STELA is a publicly verifiable watermarking framework that balances text quality and detection robustness by modulating watermark strength based on linguistic degrees of freedom using part-of-speech n-gram modeling.

Details

Motivation: Current watermarking methods rely on model-specific signals like token-level entropy, which prevents public verification since detection requires access to model logits. There's a need for publicly verifiable watermarking that maintains the quality-robustness trade-off.

Method: STELA dynamically adjusts watermark strength using part-of-speech n-gram-modeled linguistic indeterminacy - weakening signals in grammatically constrained contexts to preserve quality and strengthening them in linguistically flexible contexts to enhance detectability.

Result: STELA surpasses prior methods in detection robustness across typologically diverse languages (English, Chinese, Korean) and operates without requiring access to model logits, enabling publicly verifiable detection.

Conclusion: STELA provides an effective solution for publicly verifiable watermarking that balances text quality and detection robustness by leveraging linguistic degrees of freedom, making it suitable for trustworthy AI governance.

Abstract: As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthen it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages-analytic English, isolating Chinese, and agglutinative Korean-we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.

[4] A Robust Classification Method using Hybrid Word Embedding for Early Diagnosis of Alzheimer’s Disease

Yangyang Li

Main category: cs.CL

TL;DR: A hybrid word embedding method combining Doc2Vec and ELMo with linguistic features achieves 91% accuracy and 97% AUC for early Alzheimer’s detection through language analysis.

Details

Motivation: Early detection of Alzheimer's Disease is crucial for timely treatment and reducing healthcare costs. Language capability changes serve as early indicators of AD, making NLP-based diagnosis valuable.

Method: Hybrid word embedding combining Doc2Vec and ELMo vectors to obtain sentence perplexity scores, enriched with linguistic features for syntax/semantic analysis. Feature vectors fed into logistic regression with fine-tuned hyperparameters (regularization, learning rate, vector sizes).

Result: Achieved 91% classification accuracy and 97% AUC, outperforming existing NLP models (88% accuracy). Model shows stability with low standard deviations (accuracy: 0.0403, AUC: 0.0174) across random data splits.

Conclusion: The proposed method is accurate, stable, and suitable for large-scale AD screening and as a complementary tool for doctors in AD detection.

Abstract: Early detection of Alzheimer’s Disease (AD) is greatly beneficial to AD patients, leading to early treatments that lessen symptoms and alleviating financial burden of health care. As one of the leading signs of AD, language capability changes can be used for early diagnosis of AD. In this paper, I develop a robust classification method using hybrid word embedding and fine-tuned hyperparameters to achieve state-of-the-art accuracy in the early detection of AD. Specifically, we create a hybrid word embedding based on word vectors from Doc2Vec and ELMo to obtain perplexity scores of the sentences. The scores identify whether a sentence is fluent or not and capture semantic context of the sentences. I enrich the word embedding by adding linguistic features to analyze syntax and semantics. Further, we input an embedded feature vector into logistic regression and fine tune hyperparameters throughout the pipeline. By tuning hyperparameters of the machine learning pipeline (e.g., model regularization parameter, learning rate and vector size of Doc2Vec, and vector size of ELMo), I achieve 91% classification accuracy and an Area Under the Curve (AUC) of 97% in distinguishing early AD from healthy subjects. Based on my knowledge, my model with 91% accuracy and 97% AUC outperforms the best existing NLP model for AD diagnosis with an accuracy of 88% [32]. I study the model stability through repeated experiments and find that the model is stable even though the training data is split randomly (standard deviation of accuracy = 0.0403; standard deviation of AUC = 0.0174). This affirms our proposed method is accurate and stable. This model can be used as a large-scale screening method for AD, as well as a complementary examination for doctors to detect AD.

[5] Users as Annotators: LLM Preference Learning from Comparison Mode

Zhongze Cai, Xiaocheng Li

Main category: cs.CL

TL;DR: The paper proposes a method to collect pairwise preference data from user annotations by generating responses from different models, using asymmetry to infer user data quality through a behavior model and EM algorithm.

Details

Motivation: To leverage user-generated pairwise preference data for LLM alignment, addressing the trade-off between user expertise in judging their own queries and the lack of quality control in these labels.

Method: Generate two responses from different models/versions, use asymmetry to infer user data quality through a proposed user behavior model, and apply an expectation-maximization algorithm to estimate latent user quality factors for data filtering.

Result: The approach effectively captures user behavior and enables successful data filtering for LLM alignment in downstream tasks.

Conclusion: User annotation from comparison mode with quality inference through asymmetric model responses and EM-based filtering is an effective alternative to professional human annotation for collecting pairwise preference data.

Abstract: Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collect pairwise preference data – user annotation from comparison mode. With the increasingly wider adoption of LLMs among the population, users are contributing more and more of their preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts, but the downside is the lack of quality control in these labels. In this paper, we consider a new idea of generating two responses from two different models or two different versions of the same model. The asymmetry allows us to make an inference of the user’s data quality through our proposed user behavior model. We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users’ annotation data accordingly. The downstream task shows the effectiveness of our approach in both capturing the user behavior and data filtering for LLM alignment.

[6] Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

Chao Han, Yijuan Liang, Zihao Xuan, Daokuan Wu, Wei Zhang, Xiaoyu Shen

Main category: cs.CL

TL;DR: Informed routing is a new paradigm that improves LLM efficiency by assessing both token importance and recoverability, using a Lightweight Feature Forecaster to enable execute-or-approximate decisions instead of greedy routing.

Details

Motivation: Current dynamic token-level computation allocation methods use greedy routing which causes irreversible information loss and suboptimal token selection, limiting LLM deployment due to high inference costs.

Method: Proposes informed routing with Lightweight Feature Forecaster (LFF) - a small predictive module that estimates unit outputs before routing decisions, enabling execute-or-approximate policy.

Result: Achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels, matches or surpasses strong baselines without final LoRA fine-tuning, reduces training time by over 50%.

Conclusion: Informed routing effectively addresses limitations of greedy routing, preserving model fidelity while drastically reducing computation costs in LLM inference.

Abstract: The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing–a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token’s immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit’s output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: https://github.com/EIT-NLP/informed-routing

[7] Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning

Minsik Choi, Hyegang Son, Changhoon Kim, Young Geun Kim

Main category: cs.CL

TL;DR: HIES (Head Importance-Entropy Score) is a novel pruning criterion that combines head importance scores with attention entropy to improve transformer model compression, achieving better quality and stability than HIS-only methods.

Details

Motivation: Transformer models face efficiency challenges due to multiple layers and attention heads. Existing gradient-based pruning methods using Head Importance Scores (HIS) have limitations as they only capture gradient-driven contributions and overlook attention pattern diversity.

Method: Proposed HIES criterion integrates head importance scores with attention entropy to provide complementary evidence on per-head contribution, addressing the limitations of HIS-only methods.

Result: HIES-based pruning yields up to 15.2% improvement in model quality and 2.04x improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing accuracy or stability.

Conclusion: The HIES method effectively overcomes limitations of HIS-only pruning by incorporating attention entropy, achieving superior model compression results while maintaining both accuracy and stability.

Abstract: Transformer-based models have achieved remarkable performance in NLP tasks. However, their structural characteristics-multiple layers and attention heads-introduce efficiency challenges in inference and deployment. To address these challenges, various pruning methods have recently been proposed. Notably, gradient-based methods using Head Importance Scores (HIS) have gained traction for interpretability, efficiency, and ability to identify redundant heads. However, HIS alone has limitations as it captures only the gradient-driven contribution, overlooking the diversity of attention patterns. To overcome these limitations, we introduce a novel pruning criterion, HIES (Head Importance-Entropy Score), which integrates head importance scores with attention entropy, providing complementary evidence on per-head contribution. Empirically, HIES-based pruning yields up to 15.2% improvement in model quality and 2.04x improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing either accuracy or stability. Code will be released upon publication.

[8] ConDABench: Interactive Evaluation of Language Models for Data Analysis

Avik Dutta, Priyanshu Gupta, Hosein Hasanbeig, Rahul Pratap Singh, Harshit Nigam, Sumit Gulwani, Arjun Radhakrishna, Gustavo Soares, Ashish Tiwari

Main category: cs.CL

TL;DR: ConDABench is a framework for generating conversational data analysis benchmarks and evaluating tools on them, addressing the limitations of existing benchmarks that don’t capture real-world complexities or support interactivity.

Details

Motivation: Real-world data analysis involves under-specified goals and unclean data, requiring user interaction to understand intent. Existing benchmarks lack support for these complexities and interactivity.

Method: Multi-agent workflow for generating realistic benchmarks from articles describing insights from public datasets, creating 1,420 conversational data analysis problems, plus an evaluation harness for systematic tool assessment.

Result: Evaluation shows newer LLMs solve more instances but aren’t necessarily better at tasks requiring sustained, long-form engagement.

Conclusion: ConDABench enables measuring progress toward truly collaborative models that can complete complex interactive data analysis tasks.

Abstract: Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user’s intent, and hence, essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. \bench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models are better at solving more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.

[9] SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models

Debarun Bhattacharjya, Balaji Ganesan, Junkyu Lee, Radu Marinescu, Katsiaryna Mirylenka, Michael Glass, Xiao Shou

Main category: cs.CL

TL;DR: The paper investigates black-box uncertainty quantification methods for LLMs, proposing a similarity-based framework that uses consistency between generated outputs as a proxy for confidence.

Details

Motivation: Uncertainty quantification is crucial for trusted AI systems, and black-box methods offer advantages like robustness, adaptability, and computational efficiency without needing internal model access.

Method: Proposes a similarity-based aggregation framework that uses consistency between generated outputs to estimate confidence, including novel techniques that train confidence estimation models with small training sets.

Result: Empirical study across question answering, summarization, and text-to-SQL tasks shows that similarity-based methods yield better calibrated confidences than baseline approaches.

Conclusion: Similarity-based uncertainty quantification methods can effectively estimate LLM confidence without requiring internal model information, providing better calibration across diverse tasks.

Abstract: When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides measures of uncertainty, such as an estimate of the confidence in an LLM’s generated output, and is therefore increasingly recognized as a crucial component of trusted AI systems. Black-box UQ methods do not require access to internal model information from the generating LLM and therefore have numerous real-world advantages, such as robustness to system changes, adaptability to choice of LLM, reduced costs, and computational tractability. In this paper, we investigate the effectiveness of UQ techniques that are primarily but not necessarily entirely black-box, where the consistency between a generated output and other sampled generations is used as a proxy for confidence in its correctness. We propose a high-level non-verbalized similarity-based aggregation framework that subsumes a broad swath of UQ approaches suitable for complex generative tasks, as well as introduce specific novel techniques from the framework that train confidence estimation models using small training sets. Through an empirical study with datasets spanning the diverse tasks of question answering, summarization, and text-to-SQL, we demonstrate that our proposed similarity-based methods can yield better calibrated confidences than baselines.

[10] Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection

Weibin Cai, Reza Zafarani

Main category: cs.CL

TL;DR: A culture-aware hate speech detection framework that addresses biased training labels and cultural variations by constructing individual hate subspaces through label propagation and modeling cultural attribute combinations.

Details

Motivation: Existing hate speech detection methods overlook real-world complexities like biased training labels and varying cultural interpretations of hate, as well as challenges including data sparsity, cultural entanglement, and ambiguous labeling.

Method: Proposes a culture-aware framework that constructs individuals’ hate subspaces by modeling combinations of cultural attributes to address data sparsity, and uses label propagation to capture distinctive features for each combination to handle cultural entanglement and ambiguous labels.

Result: The method outperforms state-of-the-art approaches by 1.05% on average across all metrics in experiments.

Conclusion: The culture-aware framework effectively addresses cultural variations and biased labels in hate speech detection, with individual hate subspaces enhancing classification performance.

Abstract: Hate speech detection has been extensively studied, yet existing methods often overlook a real-world complexity: training labels are biased, and interpretations of what is considered hate vary across individuals with different cultural backgrounds. We first analyze these challenges, including data sparsity, cultural entanglement, and ambiguous labeling. To address them, we propose a culture-aware framework that constructs individuals’ hate subspaces. To alleviate data sparsity, we model combinations of cultural attributes. For cultural entanglement and ambiguous labels, we use label propagation to capture distinctive features of each combination. Finally, individual hate subspaces, which in turn can further enhance classification performance. Experiments show our method outperforms state-of-the-art by 1.05% on average across all metrics.

[11] Meronymic Ontology Extraction via Large Language Models

Dekai Zhang, Simone Conia, Antonio Rago

Main category: cs.CL

TL;DR: The paper presents a fully-automated method using large language models (LLMs) to extract product ontologies from raw review texts, outperforming existing BERT-based approaches.

Details

Motivation: Manual ontology construction is time-consuming and expensive, while ontologies are essential for organizing unstructured text data in domains like e-commerce where proper product organization is needed.

Method: Developed a fully-automated method that harnesses recent advancements in large language models (LLMs) to extract product ontologies in the form of meronymies from raw review texts.

Result: The ontologies produced by the LLM-based method surpass an existing BERT-based baseline when evaluated using an LLM-as-a-judge approach.

Conclusion: This work provides groundwork for using LLMs more generally in ontology extraction, both for product and other types of ontologies.

Abstract: Ontologies have become essential in today’s digital age as a way of organising the vast amount of readily available unstructured text. In providing formal structure to this information, ontologies have immense value and application across various domains, e.g., e-commerce, where countless product listings necessitate proper product organisation. However, the manual construction of these ontologies is a time-consuming, expensive and laborious process. In this paper, we harness the recent advancements in large language models (LLMs) to develop a fully-automated method of extracting product ontologies, in the form of meronymies, from raw review texts. We demonstrate that the ontologies produced by our method surpass an existing, BERT-based baseline when evaluating using an LLM-as-a-judge. Our investigation provides the groundwork for LLMs to be used more generally in (product or otherwise) ontology extraction.

[12] NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

Run Luo, Xiaobo Xia, Lu Wang, Longze Chen, Renke Shan, Jing Luo, Min Yang, Tat-Seng Chua

Main category: cs.CL

TL;DR: NExT-OMNI is an open-source omnimodal foundation model that uses discrete flow paradigms to achieve unified any-to-any cross-modal understanding and generation, outperforming previous models in multimodal interaction and cross-modal retrieval.

Details

Motivation: Existing multimodal models are constrained by autoregressive architectures that limit balanced integration of understanding and generation capabilities, and their redundant designs limit applicability to broader scenarios like cross-modal retrieval.

Method: Leverages discrete flow paradigms with metric-induced probability paths and kinetic optimal velocities to natively support any-to-any understanding and generation, using concise unified representations rather than task-decoupled designs.

Result: Achieves competitive performance on multimodal generation and understanding benchmarks, outperforms prior unified models in multi-turn multimodal interaction and cross-modal retrieval, with enhanced response efficiency.

Conclusion: NExT-OMNI demonstrates architectural advantages as a next-generation multimodal foundation model and advances research through open-source release of training details, data protocols, code, and model checkpoints.

Abstract: Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.

[13] ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

Yutao Wu, Xiao Liu, Yinghui Li, Yifeng Gao, Yifan Ding, Jiale Ding, Xiang Zheng, Xingjun Ma

Main category: cs.CL

TL;DR: ADMIT is a knowledge poisoning attack for RAG-based fact-checking systems that flips decisions and creates deceptive justifications without accessing target models, achieving 86% success rate at extremely low poisoning rates.

Details

Motivation: To investigate knowledge poisoning in real-world fact-checking scenarios where credible evidence dominates retrieval, extending beyond prior work that focused on simpler settings.

Method: Proposed ADMIT (Adversarial Multi-Injection Technique) - a few-shot, semantically aligned poisoning attack that works without access to target LLMs, retrievers, or token-level control.

Result: ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 benchmarks, achieving 86% average attack success rate at 0.93×10^-6 poisoning rate, and remains robust against counter-evidence.

Conclusion: ADMIT exposes significant vulnerabilities in real-world RAG-based fact-checking systems, improving attack success rate by 11.2% over prior state-of-the-art attacks.

Abstract: Knowledge poisoning poses a critical threat to Retrieval-Augmented Generation (RAG) systems by injecting adversarial content into knowledge bases, tricking Large Language Models (LLMs) into producing attacker-controlled outputs grounded in manipulated context. Prior work highlights LLMs’ susceptibility to misleading or malicious retrieved content. However, real-world fact-checking scenarios are more challenging, as credible evidence typically dominates the retrieval pool. To investigate this problem, we extend knowledge poisoning to the fact-checking setting, where retrieved context includes authentic supporting or refuting evidence. We propose \textbf{ADMIT} (\textbf{AD}versarial \textbf{M}ulti-\textbf{I}njection \textbf{T}echnique), a few-shot, semantically aligned poisoning attack that flips fact-checking decisions and induces deceptive justifications, all without access to the target LLMs, retrievers, or token-level control. Extensive experiments show that ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, achieving an average attack success rate (ASR) of 86% at an extremely low poisoning rate of $0.93 \times 10^{-6}$, and remaining robust even in the presence of strong counter-evidence. Compared with prior state-of-the-art attacks, ADMIT improves ASR by 11.2% across all settings, exposing significant vulnerabilities in real-world RAG-based fact-checking systems.

[14] Serialized EHR make for good text representations

Zhirong Chou, Quan Qin, Shi Li

Main category: cs.CL

TL;DR: SerialBEHRT extends SciBERT through additional pretraining on structured EHR sequences to better capture temporal dependencies in healthcare data, achieving superior performance in antibiotic susceptibility prediction.

Details

Motivation: Existing foundation models struggle to reconcile the tabular and event-based nature of EHRs with sequential language model priors, limiting their ability to capture longitudinal dependencies across patient encounters.

Method: Extends SciBERT through additional pretraining on structured EHR sequences, designed to encode temporal and contextual relationships among clinical events.

Result: SerialBEHRT achieves superior and more consistent performance compared to state-of-the-art EHR representation strategies in antibiotic susceptibility prediction.

Conclusion: Temporal serialization is important for foundation model pretraining in healthcare, enabling better capture of longitudinal dependencies in clinical data.

Abstract: The emergence of foundation models in healthcare has opened new avenues for learning generalizable representations from large scale clinical data. Yet, existing approaches often struggle to reconcile the tabular and event based nature of Electronic Health Records (EHRs) with the sequential priors of natural language models. This structural mismatch limits their ability to capture longitudinal dependencies across patient encounters. We introduce SerialBEHRT, a domain aligned foundation model that extends SciBERT through additional pretraining on structured EHR sequences. SerialBEHRT is designed to encode temporal and contextual relationships among clinical events, thereby producing richer patient representations. We evaluate its effectiveness on the task of antibiotic susceptibility prediction, a clinically meaningful problem in antibiotic stewardship. Through extensive benchmarking against state of the art EHR representation strategies, we demonstrate that SerialBEHRT achieves superior and more consistent performance, highlighting the importance of temporal serialization in foundation model pretraining for healthcare.

[15] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

Main category: cs.CL

TL;DR: DynaSpec introduces a dynamic shortlisting mechanism for speculative decoding that uses lightweight meta-classifiers to select token clusters context-dependently, improving drafting speed while maintaining verification accuracy.

Details

Motivation: Existing fixed-vocabulary shortlisting methods for speculative decoding are brittle, corpus-dependent, and suppress rare tokens, creating bottlenecks in LLM inference acceleration.

Method: Uses lightweight meta-classifiers to route contexts to token clusters, forming dynamic shortlists from union of top-k clusters, with parallel execution of draft encoding and meta shortlisting.

Result: Consistent gains in mean accepted length over fixed-shortlist baselines, enabling smaller shortlists without degrading acceptance across diverse tasks.

Conclusion: DynaSpec provides a robust, context-dependent dynamic shortlisting approach that speeds up drafting while maintaining exact verification, generalizing well across tasks.

Abstract: Speculative decoding (a.k.a. speculative sampling) has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter’s output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter’s vocabulary to a fixed subset of the target model’s vocabulary, ranked in descending order of token frequency. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter’s shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter’s hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. On standard speculative-decoding benchmarks, we observe consistent gains in mean accepted length over fixed-shortlist baselines, while context-dependent selection enables smaller shortlists without degrading acceptance.

[16] On-device System of Compositional Multi-tasking in Large Language Models

Ondrej Bohdal, Konstantinos Theodosiadis, Asterios Mpatziakas, Dimitris Filippidis, Iro Spyrou, Christos Zonios, Anastasios Drosou, Dimosthenis Ioannidis, Kyeng-Hun Lee, Jijoong Moon, Hyeonmok Ko, Mete Ozay, Umberto Michieli

Main category: cs.CL

TL;DR: Proposes a novel approach for compositional multi-tasking (summarization + translation) using learnable projection layers on combined adapters, enabling efficient simultaneous task execution with reduced computational overhead.

Details

Motivation: Standard parameter-efficient fine-tuning approaches struggle with simultaneous execution of complex compositional tasks like generating translated summaries from long conversations.

Method: Adds a learnable projection layer on top of combined summarization and translation adapters, maintaining efficiency through reduced computational overhead compared to retraining or sequential processing.

Result: Developed an Android app demonstrating practical viability in on-device environments. Solution performs well and is fast in both cloud-based and on-device implementations.

Conclusion: The framework shows potential benefits for real-world applications demanding high-speed operation alongside resource constraints, highlighting effective integration of compositional tasks.

Abstract: Large language models (LLMs) are commonly adapted for diverse downstream tasks via parameter-efficient fine-tuning techniques such as Low-Rank Adapters (LoRA). While adapters can be combined to handle multiple tasks separately, standard approaches struggle when targeting the simultaneous execution of complex tasks, such as generating a translated summary from a long conversation. To address this challenge, we propose a novel approach tailored specifically for compositional multi-tasking scenarios involving summarization and translation. Our technique involves adding a learnable projection layer on top of the combined summarization and translation adapters. This design enables effective integration while maintaining efficiency through reduced computational overhead compared to alternative strategies requiring extensive retraining or sequential processing. We demonstrate the practical viability of our method within an on-device environment by developing an Android app capable of executing compositional tasks seamlessly. Experimental results indicate our solution performs well and is fast in both cloud-based and on-device implementations, highlighting the potential benefits of adopting our framework in real-world applications demanding high-speed operation alongside resource constraints.

[17] Language steering in latent space to mitigate unintended code-switching

Andrey Goncharov, Nikolai Kondusov, Alexey Zaytsev

Main category: cs.CL

TL;DR: Latent-space language steering is a lightweight inference-time method that uses PCA on parallel translations to identify language directions and steer token embeddings to control language identity, reducing code-switching while preserving semantics.

Details

Motivation: Multilingual LLMs often exhibit unintended code-switching, which reduces reliability in downstream tasks.

Method: Identify language directions via PCA on parallel translations and steer token embeddings along these axes to control language identity. Requires minimal parallel data for calibration.

Result: Achieved 95-99% language classification accuracy using a single principal component and reduced next-token distributional divergence by up to 42% across multiple language pairs on Qwen2.5 and Llama-3.2 models.

Conclusion: Language identity concentrates in final layers with near-perfect linear separability, and the approach effectively mitigates code-switching with negligible computational overhead.

Abstract: Multilingual Large Language Models (LLMs) often exhibit unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via PCA on parallel translations and steers token embeddings along these axes to control language identity. Our approach mitigates code-switching while preserving semantics with negligible computational overhead and requires only minimal parallel data for calibration. Empirically, we achieve 95-99% language classification accuracy using a single principal component and reduce next-token distributional divergence by up to 42% across multiple language pairs on Qwen2.5 and Llama-3.2 models. We further analyze the layer-wise evolution of language representations, revealing that language identity concentrates in final layers with near-perfect linear separability.

[18] Revisiting the UID Hypothesis in LLM Reasoning Traces

Minju Gwak, Guijin Son, Jaehyung Kim

Main category: cs.CL

TL;DR: LLMs’ successful reasoning shows uneven information density patterns, contrary to human communication patterns, suggesting new directions for interpretable reasoning models.

Details

Motivation: To analyze information flow in LLM reasoning using psycholinguistic principles, as current CoT reasoning often produces unfaithful or hard-to-interpret intermediate steps.

Method: Introduced entropy-based metrics inspired by Uniform Information Density hypothesis to analyze information flow in reasoning traces across three mathematical benchmarks.

Result: Found that successful reasoning in LLMs is globally non-uniform with uneven swings in information density, contrasting with human communication patterns.

Conclusion: Challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.

Abstract: Large language models (LLMs) often solve problems using step-by-step Chain-of-Thought (CoT) reasoning, yet these intermediate steps are frequently unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis in psycholinguistics – which posits that humans communicate by maintaining a stable flow of information – we introduce entropy-based metrics to analyze the information flow within reasoning traces. Surprisingly, across three challenging mathematical benchmarks, we find that successful reasoning in LLMs is globally non-uniform: correct solutions are characterized by uneven swings in information density, in stark contrast to human communication patterns. This result challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.

[19] EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing

Sicheng Lyu, Yu Gu, Xinyu Wang, Jerry Huang, Sitao Luan, Yufei Cui, Xiao-Wen Chang, Peng Lu

Main category: cs.CL

TL;DR: EvoEdit is a novel model editing strategy that uses sequential null-space alignment to mitigate catastrophic interference in large language models, enabling stable updates without compromising previous edits.

Details

Motivation: Current model editing approaches suffer from catastrophic interference where new edits compromise previously integrated updates, especially in sequential editing contexts where multiple updates are applied over time.

Method: EvoEdit performs sequential null-space alignment for each incoming edit, preserving both original and previously modified knowledge representations while maintaining output invariance on preserved knowledge.

Result: EvoEdit achieves better or comparable performance than prior state-of-the-art locate-then-edit techniques, with up to 3.53 times speedup, and effectively mitigates interference across long edit sequences.

Conclusion: The results highlight the need for principled approaches for LLMs in dynamically evolving information settings, with EvoEdit providing a simple yet effective solution with strong theoretical guarantees.

Abstract: Large language models (LLMs) require continual updates to rectify outdated or erroneous knowledge. Model editing has emerged as a compelling paradigm for introducing targeted modifications without the computational burden of full retraining. Existing approaches are mainly based on a locate-then-edit framework. However, in sequential editing contexts, where multiple updates are applied over time, they exhibit significant limitations and suffer from catastrophic interference, i.e., new edits compromise previously integrated updates and degrade preserved knowledge. To address these challenges, we introduce EvoEdit, a novel editing strategy that mitigates catastrophic interference through sequential null-space alignment, enabling stable and efficient model editing. By performing sequential null-space alignment for each incoming edit, EvoEdit preserves both original and previously modified knowledge representations and maintains output invariance on preserved knowledge even across long edit sequences, effectively mitigating interference. Evaluations on real-world sequential knowledge-editing benchmarks show that EvoEdit achieves better or comparable performance than prior state-of-the-art locate-then-edit techniques, with up to 3.53 times speedup. Overall, these results underscore the necessity of developing more principled approaches for designing LLMs in dynamically evolving information settings, while providing a simple yet effective solution with strong theoretical guarantees.

[20] ConsistencyAI: A Benchmark to Assess LLMs’ Factual Consistency When Responding to Different Demographic Groups

Peter Banyas, Shristi Sharma, Alistair Simmons, Atharva Vispute

Main category: cs.CL

TL;DR: ConsistencyAI is a benchmark for measuring factual consistency of LLMs across different personas, showing models give different factual answers to identical questions based on user demographics.

Details

Motivation: To test whether LLMs provide factually inconsistent answers to users of different demographics asking identical questions, ensuring impartial evaluation without provider involvement.

Method: Tested 19 LLMs by querying them for 5 facts on 15 topics, repeated 100 times with different persona contexts. Used sentence embeddings and cross-persona cosine similarity to compute factual consistency scores.

Result: Factual consistency scores ranged from 0.9065 to 0.7896 (mean 0.8656). Grok-3 was most consistent, lightweight models least consistent. Consistency varies by topic - job market least consistent, G7 world leaders most consistent.

Conclusion: Both the LLM provider and topic significantly shape factual consistency. The benchmark supports reproducible evaluation and encourages persona-invariant prompting strategies.

Abstract: Is an LLM telling you different facts than it’s telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI’s Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.

[21] BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Fabian Wenz, Omar Bouattour, Devin Yang, Justin Choi, Cecil Gregg, Nesime Tatbul, Çağatay Demiralp

Main category: cs.CL

TL;DR: BenchPress is a human-in-the-loop system that uses LLMs and RAG to accelerate creation of domain-specific text-to-SQL benchmarks by generating natural language descriptions from SQL queries, with human verification ensuring quality.

Details

Motivation: Existing text-to-SQL benchmarks focus on public datasets, but LLMs perform poorly on private enterprise data. Manual annotation of SQL logs to create benchmarks is time-consuming, expensive, and challenging for database administrators.

Method: Given SQL queries, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment.

Result: LLM-assisted annotation drastically reduces time and effort for creating high-quality benchmarks. Human verification with LLM suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness.

Conclusion: BenchPress provides researchers and practitioners with a mechanism for assessing text-to-SQL models on domain-specific workloads, streamlining custom benchmark creation for enterprise data.

Abstract: Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.

[22] R2T: Rule-Encoded Loss Functions for Low-Resource Sequence Tagging

Mamadou K. Keita, Christopher Homan, Sebastien Diarra

Main category: cs.CL

TL;DR: The R2T framework integrates linguistic rules into neural network training through an adaptive loss function, achieving high accuracy on POS tagging with unlabeled text and serving as effective pre-training for NER tasks.

Details

Motivation: To develop a principled learning approach that trains models with explicit task constraints rather than relying solely on labeled examples, particularly for handling out-of-vocabulary words and low-resource scenarios.

Method: Rule-to-Tag (R2T) framework with adaptive loss function including regularization for OOV word handling, using multi-tiered linguistic rules integrated into neural network training objectives.

Result: R2T-BiLSTM achieved 98.2% accuracy on Zarma POS tagging using only unlabeled text, outperforming AfriBERTa fine-tuned on 300 labeled sentences. For NER, R2T pre-training with 50 labeled sentences outperformed baseline trained on 300 sentences.

Conclusion: R2T enables effective learning from unlabeled data through principled constraints, demonstrating strong performance on low-resource tasks and serving as a powerful pre-training method for complex NLP tasks.

Abstract: We introduce the Rule-to-Tag (R2T) framework, a hybrid approach that integrates a multi-tiered system of linguistic rules directly into a neural network’s training objective. R2T’s novelty lies in its adaptive loss function, which includes a regularization term that teaches the model to handle out-of-vocabulary (OOV) words with principled uncertainty. We frame this work as a case study in a paradigm we call principled learning (PrL), where models are trained with explicit task constraints rather than on labeled examples alone. Our experiments on Zarma part-of-speech (POS) tagging show that the R2T-BiLSTM model, trained only on unlabeled text, achieves 98.2% accuracy, outperforming baselines like AfriBERTa fine-tuned on 300 labeled sentences. We further show that for more complex tasks like named entity recognition (NER), R2T serves as a powerful pre-training step; a model pre-trained with R2T and fine-tuned on just 50 labeled sentences outperformes a baseline trained on 300.

[23] Harnessing Consistency for Robust Test-Time LLM Ensemble

Zhichen Zeng, Qi Yu, Xiao Lin, Ruizhong Qiu, Xuying Ning, Tianxin Wei, Yuchen Yan, Jingrui He, Hanghang Tong

Main category: cs.CL

TL;DR: CoRE is a plug-and-play technique that improves LLM ensemble robustness by addressing token-level and model-level inconsistencies through consistency-based filtering and agreement modeling.

Details

Motivation: LLM ensembles face robustness issues due to heterogeneous tokenization schemes and varying model expertise, leading to ensemble failures from token-level disagreements and model-level confidence disparities.

Method: CoRE uses token-level consistency (low-pass filter for uncertain tokens) and model-level consistency (promoting high-confidence outputs with minimal divergence) to enhance ensemble robustness.

Result: Extensive experiments show CoRE consistently improves ensemble performance and robustness across diverse benchmarks, model combinations, and ensemble strategies.

Conclusion: CoRE effectively addresses ensemble robustness issues by leveraging model consistency at both token and model levels, providing a seamless integration with existing ensemble methods.

Abstract: Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. Model-level consistency models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness.

[24] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

A H M Rezaul Karim, Ozlem Uzuner

Main category: cs.CL

TL;DR: The MasonNLP system uses a general-domain LLM with RAG framework for wound-care VQA, achieving 3rd place in MEDIQA-WV 2025 by incorporating clinical exemplars without additional training.

Details

Motivation: To support clinical decision-making and patient care through natural language queries over medical images, specifically addressing wound-care VQA challenges.

Method: Uses a general-domain instruction-tuned LLM with RAG framework that incorporates textual and visual examples from in-domain data via simple indexing and fusion, with no extra training or complex re-ranking.

Result: Ranked 3rd among 19 teams and 51 submissions with average score of 41.37%, showing improved performance across dBLEU, ROUGE, BERTScore, and LLM-based metrics.

Conclusion: Lightweight RAG with general-purpose LLMs provides a simple and effective baseline for multimodal clinical NLP tasks, demonstrating that minimal inference-time layers can ground outputs in clinically relevant exemplars.

Abstract: Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs – a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking – provides a simple and effective baseline for multimodal clinical NLP tasks.

[25] ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing

Shivanshu Kumar, Gopalakrishnan Srinivasan

Main category: cs.CL

TL;DR: ShishuLM is an efficient language model architecture that reduces parameter count and KV cache requirements by approximating transformer blocks with MLPs, achieving 25% memory reduction and 40% latency improvement.

Details

Motivation: Transformers have substantial memory and computational overhead with architectural redundancies, presenting optimization opportunities without performance loss, especially important for Small Language Models in agentic AI systems.

Method: Leverages insights from AI interpretability and inference-time layer pruning to create ShishuLM, which approximates entire transformer blocks through Multi-Layer Perceptrons (MLPs) based on analysis that normalization with attention computation is roughly linear with input in moderate-context scenarios.

Result: ShishuLM provides up to 25% reduction in memory requirements and up to 40% improvement in latency during both training and inference compared to parent models, evaluated on two SLMs of different scales.

Conclusion: The approach provides insights for building more efficient SLM architectures from a pre-training standpoint, demonstrating significant optimization opportunities in transformer architectures.

Abstract: While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, presenting opportunities for optimization without compromising performance. Taking insights from research in AI interpretability and inference-time layer pruning, we introduce an efficient language model architecture, referred to as ShishuLM, which reduces both the parameter count and Key-Value (KV) cache requirements. Given the increasing importance of Small Language Models (SLMs) in agentic AI systems, we evaluate our approach on two SLMs of different scales. Our analysis reveals that for moderate-context scenarios, normalization coupled with attention computation is roughly linear with the input, enabling entire transformer blocks to be approximated through Multi-Layer Perceptrons (MLPs). Our results show that ShishuLM provides up to 25% reduction in memory requirements and up to 40% improvement in latency during both training and inference, compared to parent models. Our experimental and analytical findings provide insights towards building more efficient SLM architectures from a pre-training standpoint.

[26] Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues

Chenyu Zhang, Sharifa Alghowinem, Cynthia Breazeal

Main category: cs.CL

TL;DR: This paper introduces an ensemble-LLM framework for large-scale affect sensing in AI tutoring dialogues, analyzing emotional dynamics of students interacting with an LLM-powered tutor.

Details

Motivation: To understand the affective dynamics of LLM-mediated tutoring, as current research insufficiently addresses learners' evolving emotional states during AI tutoring interactions.

Method: Analyzed 16,986 conversational turns from 261 students across three institutions using PyTutor AI tutor. Generated zero-shot affect annotations from three LLMs (Gemini, GPT-4o, Claude) for valence, arousal, and learning-helpfulness, then fused through rank-weighted pooling and plurality consensus.

Result: Students typically report mildly positive affect and moderate arousal. Confusion and curiosity are frequent during problem solving, while frustration can derail progress. Emotional states are short-lived, with positive moments lasting slightly longer but being fragile. Negative emotions often resolve quickly, sometimes rebounding to positive states. Neutral moments frequently act as upward turning points.

Conclusion: The ensemble-LLM framework provides robust emotion profiling in tutoring dialogues, revealing opportunities for AI tutors to intervene at neutral junctures to steer students toward positive learning states.

Abstract: While recent studies have examined the leaning impact of large language model (LLM) in educational contexts, the affective dynamics of LLM-mediated tutoring remain insufficiently understood. This work introduces the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues, advancing the conversation on responsible pathways for integrating generative AI into education by attending to learners’ evolving affective states. To achieve this, we analyzed two semesters’ worth of 16,986 conversational turns exchanged between PyTutor, an LLM-powered AI tutor, and 261 undergraduate learners across three U.S. institutions. To investigate learners’ emotional experiences, we generate zero-shot affect annotations from three frontier LLMs (Gemini, GPT-4o, Claude), including scalar ratings of valence, arousal, and learning-helpfulness, along with free-text emotion labels. These estimates are fused through rank-weighted intra-model pooling and plurality consensus across models to produce robust emotion profiles. Our analysis shows that during interaction with the AI tutor, students typically report mildly positive affect and moderate arousal. Yet learning is not uniformly smooth: confusion and curiosity are frequent companions to problem solving, and frustration, while less common, still surfaces in ways that can derail progress. Emotional states are short-lived–positive moments last slightly longer than neutral or negative ones, but they are fragile and easily disrupted. Encouragingly, negative emotions often resolve quickly, sometimes rebounding directly into positive states. Neutral moments frequently act as turning points, more often steering students upward than downward, suggesting opportunities for tutors to intervene at precisely these junctures.

[27] Unlocking the Potential of Diffusion Language Models through Template Infilling

Junhoo Lee, Seungyeon Kim, Nojun Kwak

Main category: cs.CL

TL;DR: Template Infilling (TI) is a new conditioning method for Diffusion Language Models that generates structural templates first, then fills masked segments, improving performance on reasoning tasks.

Details

Motivation: Current DLMs use prefix-based prompting from autoregressive models, which limits their inference strategies. A tailored conditioning approach is needed for DLMs' unique generation process.

Method: Template Infilling (TI) generates structural templates first, then fills masked segments. Dynamic Segment Allocation (DSA) adaptively adjusts segment lengths based on generation confidence.

Result: Achieved 17.01% improvement over baseline on mathematical reasoning and code generation benchmarks. Also enables effective speedup in multi-token generation while maintaining quality.

Conclusion: TI provides a more effective conditioning methodology for DLMs than conventional prefix prompting, offering both performance gains and generation efficiency.

Abstract: Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs’ generation process. Unlike conventional prefix prompting, TI first generates a structural template for the target response, then fills in the masked segments. To enhance the flexibility of this structural control, we introduce Dynamic Segment Allocation (DSA), which adaptively adjusts segment lengths based on generation confidence. We demonstrate the effectiveness of our approach on mathematical reasoning and code generation benchmarks, achieving consistent improvements of 17.01$%$p over baseline. Furthermore, we show that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality.

[28] Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

Elwin Huaman, Wendi Huaman, Jorge Luis Huaman, Ninfa Quispe

Main category: cs.CL

TL;DR: Integration of Quechua languages into Common Voice platform to address data scarcity, with Puno Quechua as a case study showing successful collection of 191.1 hours of speech data.

Details

Motivation: Under-resourced languages like Quechua face data scarcity hindering speech technology development, and Common Voice offers a community-driven solution for creating open speech datasets.

Method: Onboarding 17 Quechua languages into Common Voice platform, with detailed case study of Puno Quechua involving language onboarding and corpus collection of both reading and spontaneous speech data.

Result: Common Voice now hosts 191.1 hours of Quechua speech (86% validated), with Puno Quechua contributing 12 hours (77% validated), demonstrating the platform’s potential for under-resourced languages.

Conclusion: The work contributes to inclusive voice technology and digital empowerment of under-resourced language communities, with proposed research agenda addressing technical challenges and ethical considerations for indigenous data sovereignty.

Abstract: Under-resourced languages, such as Quechuas, face data and resource scarcity, hindering their development in speech technology. To address this issue, Common Voice presents a crucial opportunity to foster an open and community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We detail the current 17 Quechua languages, presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and corpus collection of both reading and spontaneous speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of Quechua speech (86% validated), with Puno Quechua contributing 12 hours (77% validated), highlighting the Common Voice’s potential. We further propose a research agenda addressing technical challenges, alongside ethical considerations for community engagement and indigenous data sovereignty. Our work contributes towards inclusive voice technology and digital empowerment of under-resourced language communities.

[29] FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation

Johann Pignat, Milena Vucetic, Christophe Gaudet-Blavignac, Jamil Zaghir, Amandine Stettler, Fanny Amrein, Jonatan Bonjour, Jean-Philippe Goldman, Olivier Michielin, Christian Lovis, Mina Bjelogrlic

Main category: cs.CL

TL;DR: FRACCO is a French annotated corpus of 1301 synthetic clinical oncology cases, translated from Spanish CANTEMIST, with expert annotations for morphology, topography, and histologic differentiation using ICD-O codes.

Details

Motivation: French oncology resources for natural language processing are scarce, creating a need for annotated datasets to develop clinical text processing tools.

Method: Created 1301 synthetic French clinical cases by translating Spanish CANTEMIST corpus. Expert-annotated with ICD-O codes for morphology, topography, and histologic differentiation, plus composite expression-level normalisations. Used automated matching with manual validation by five annotators.

Result: Produced 71,127 ICD-O normalisations covering 399 unique morphology codes, 272 topography codes, and 2,043 unique composite expressions. Dataset includes 2,549 morphology expressions, 3,143 topography expressions, and 11,144 composite expressions.

Conclusion: FRACCO provides a comprehensive reference standard for named entity recognition and concept normalisation in French oncology texts, addressing the scarcity of French clinical NLP resources.

Abstract: Developing natural language processing tools for clinical text requires annotated datasets, yet French oncology resources remain scarce. We present FRACCO (FRench Annotated Corpus for Clinical Oncology) an expert-annotated corpus of 1301 synthetic French clinical cases, initially translated from the Spanish CANTEMIST corpus as part of the FRASIMED initiative. Each document is annotated with terms related to morphology, topography, and histologic differentiation, using the International Classification of Diseases for Oncology (ICD-O) as reference. An additional annotation layer captures composite expression-level normalisations that combine multiple ICD-O elements into unified clinical concepts. Annotation quality was ensured through expert review: 1301 texts were manually annotated for entity spans by two domain experts. A total of 71127 ICD-O normalisations were produced through a combination of automated matching and manual validation by a team of five annotators. The final dataset representing 399 unique morphology codes (from 2549 different expressions), 272 topography codes (from 3143 different expressions), and 2043 unique composite expressions (from 11144 different expressions). This dataset provides a reference standard for named entity recognition and concept normalisation in French oncology texts.

[30] What Layers When: Learning to Skip Compute in LLMs with Residual Gates

Filipe Laitenberger, Dawid Kopiczko, Cees G. M. Snoek, Yuki M. Asano

Main category: cs.CL

TL;DR: GateSkip is a residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs using sigmoid-linear gates, achieving up to 15% compute savings while maintaining accuracy.

Details

Motivation: To reduce inference compute costs in decoder-only language models by enabling selective skipping of less important tokens per layer, addressing the instability issues of early-exit or router-based methods that require extensive retraining.

Method: Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch’s output before re-entering the residual stream. During inference, tokens are ranked by gate values and low-importance ones are skipped using a per-layer budget. The method fine-tunes stably on pretrained models.

Result: On long-form reasoning, saves up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models, shows accuracy gains at full compute and matches baseline quality near 50% savings. Learned gates provide insights into transformer information flow.

Conclusion: GateSkip provides an effective approach for compute-efficient inference in decoder-only LMs, combining easily with other optimization techniques like quantization, pruning, and self-speculative decoding while offering interpretability through learned gate behavior.

Abstract: We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch’s output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.

[31] TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

Jimin Lim, Arjun Damerla, Arthur Jiang, Nam Le

Main category: cs.CL

TL;DR: LLMs can perform sequential decision-making under uncertainty using only natural language feedback in multi-armed bandit environments, with Qwen3-4B achieving 89.2% best-arm selection rate, outperforming both larger LLMs and traditional algorithms.

Details

Motivation: To explore LLMs' ability to make sequential decisions under uncertainty using only natural language feedback, without numerical cues or explicit probabilities.

Method: Created a benchmark where LLMs interact with multi-armed bandit environments using purely textual feedback (‘you earned a token’), requiring models to infer latent reward structures from linguistic cues and adapt accordingly.

Result: Most LLMs underperformed compared to standard algorithms (Thompson Sampling, Epsilon Greedy, UCB, random), but Qwen3-4B achieved 89.2% best-arm selection rate, significantly outperforming both larger LLMs and traditional methods.

Conclusion: Probabilistic reasoning can emerge from language alone, and this benchmark represents progress toward evaluating decision-making capabilities in naturalistic, non-numeric contexts.

Abstract: Large language models (LLMs) have shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty only using natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, “you earned a token”, without access to numerical cues or explicit probabilities, resulting in the model to infer latent reward structures purely off linguistic cues and to adapt accordingly. We evaluated the performance of four open-source LLMs and compare their performance to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed compared to the baselines, Qwen3-4B, achieved the best-arm selection rate of 89.2% , which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning is able to emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.

[32] Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production

Alexandre Galashov, Matt Jones, Rosemary Ke, Yuan Cao, Vaishnavh Nagarajan, Michael C. Mozer

Main category: cs.CL

TL;DR: The paper introduces ‘Catch Your Breath’ (CYB) methods that enable language models to dynamically request additional compute steps via <don’t know> outputs and tokens, allowing adaptive processing time based on token complexity.

Details

Motivation: To create language models that can autonomously scale compute resources per token, improving efficiency by allocating more processing time to complex tokens while avoiding unnecessary computation for simpler ones.

Method: Three CYB loss variants: CYB-AP (anytime prediction with time-discounted accuracy), CYB-VA (variational approach with stopping time distribution), and CYB-DP (computational budget penalty). Models can emit <don’t know> to request pauses for additional computation.

Result: CYB models achieve same performance with only one third of training data compared to baseline models, and half as much data as pause-enabled models with cross-entropy loss. Models pause strategically for complex tokens (plural nouns, ambiguous words) but not for simple tokens (contractions).

Conclusion: The CYB framework successfully enables language models to dynamically adapt compute allocation based on token complexity, significantly improving training efficiency and demonstrating intelligent pause behavior aligned with linguistic complexity.

Abstract: We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a <don’t know> output. If the model is granted a delay, a specialized token is inserted at the next input step, providing the model with additional compute resources to generate an output. The model can request multiple pauses. To train the model to use <don’t know> outputs judiciously and to calibrate its uncertainty, we frame the selection of each output token as a sequential-decision problem with a time cost. We refer to the class of methods as $\textit{Catch Your Breath}$ losses and we study three methods in this class: CYB-AP frames the model’s task as anytime prediction, where an output may be required at any step and accuracy is discounted over time; CYB-VA is a variational approach that aims to maximize prediction accuracy subject to a specified distribution over stopping times; and CYB-DP imposes a penalty based on a computational budget. Through fine-tuning experiments, we identify the best performing loss variant. The CYB model needs only one third as much training data as the baseline (no pause) model needs to achieve the same performance, and half as much data as a model with pauses and a cross-entropy loss. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context. For example, it often pauses after plural nouns like $\textit{patients}$ and $\textit{challenges}$ but never pauses after the first token of contracted words like $\textit{wasn}$ and $\textit{didn}$, and it shows high variability for ambiguous tokens like $\textit{won}$, which could function as either a verb or part of a contraction.

[33] PAGE: Prompt Augmentation for text Generation Enhancement

Mauro Jose Pacchiotti, Luciana Ballejos, Mariel Ale

Main category: cs.CL

TL;DR: PAGE is a framework that uses lightweight auxiliary modules (classifiers/extractors) to enhance text generation by providing enriched inputs, improving quality and controllability without requiring additional generative models.

Details

Motivation: Natural language generative models often perform poorly on specific tasks or require large amounts of additional data for adjustments, creating a need for more efficient enhancement methods.

Method: PAGE employs simple auxiliary modules (classifiers or extractors) that provide inferences from input text, which are then used to construct enriched inputs for the generative model, creating a modular architecture adaptable to different tasks.

Result: A proof of concept in requirements engineering demonstrates improved quality of software requirements generation using an auxiliary classifier module.

Conclusion: PAGE offers a simpler, modular alternative to existing generation-assistance approaches that doesn’t require auxiliary generative models, making it easy to adapt to various tasks while improving generation quality and controllability.

Abstract: In recent years, natural language generative models have shown outstanding performance in text generation tasks. However, when facing specific tasks or particular requirements, they may exhibit poor performance or require adjustments that demand large amounts of additional data. This work introduces PAGE (Prompt Augmentation for text Generation Enhancement), a framework designed to assist these models through the use of simple auxiliary modules. These modules, lightweight models such as classifiers or extractors, provide inferences from the input text. The output of these auxiliaries is then used to construct an enriched input that improves the quality and controllability of the generation. Unlike other generation-assistance approaches, PAGE does not require auxiliary generative models; instead, it proposes a simpler, modular architecture that is easy to adapt to different tasks. This paper presents the proposal, its components and architecture, and reports a proof of concept in the domain of requirements engineering, where an auxiliary module with a classifier is used to improve the quality of software requirements generation.

Bolei Ma, Yong Cao, Indira Sen, Anna-Carolina Haensch, Frauke Kreuter, Barbara Plank, Daniel Hershcovich

Main category: cs.CL

TL;DR: LLMs should use open-ended text generation for realistic social simulations rather than constrained formats, as this captures richer viewpoints, reasoning, and reduces researcher bias.

Details

Motivation: Current LLM social simulations use multiple-choice or short-answer formats that overlook LLMs' generative nature and limit realistic representation of social phenomena.

Method: Propose open-ended text generation approach for LLM social simulations, drawing on survey methodology research and NLP advances to capture topics, viewpoints, and reasoning processes.

Result: Open-endedness improves measurement, supports exploration of unanticipated views, reduces directive bias, captures expressiveness, aids pretesting, and enhances methodological utility.

Conclusion: Researchers should develop practices and evaluation frameworks that leverage LLMs’ generative diversity rather than constraining it, creating synergies between NLP and social science.

Abstract: Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes “in” LLMs, is essential for realistic social simulation. Drawing on decades of survey-methodology research and recent advances in NLP, we argue why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.

[35] Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization

Ariel Kamen

Main category: cs.CL

TL;DR: Comparative evaluation of 10 LLMs for text categorization using IAB taxonomy shows moderate performance (34% accuracy) with frequent over-categorization. Ensemble approach significantly improves results and eliminates hallucinations.

Details

Motivation: To systematically evaluate state-of-the-art LLMs' performance in hierarchical text categorization using consistent methodology and understand limitations of current approaches.

Method: Used uniform dataset of 8,660 human-annotated samples with identical zero-shot prompts across 10 LLMs. Evaluated with classic metrics (accuracy, precision, recall, F1) and LLM-specific indicators (hallucination, inflation, cost). Also developed ensemble method with multiple LLMs as independent experts.

Result: LLMs achieved moderate performance: 34% accuracy, 42% precision, 45% recall, 41% F1. High hallucination and inflation ratios. Gemini 1.5/2.0 Flash and GPT models offered best cost-performance balance. Ensemble method substantially improved accuracy, reduced inflation, and eliminated hallucinations.

Conclusion: Scaling and architectural improvements alone don’t ensure better categorization. Coordinated orchestration of models through ensemble approaches is more effective than sheer scale for achieving human-expert performance in text categorization.

Abstract: This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization using the Interactive Advertising Bureau (IAB) 2.2 hierarchical taxonomy. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Evaluation metrics included four classic measures - accuracy, precision, recall, and F1-score - and three LLM-specific indicators: hallucination ratio, inflation ratio, and categorization cost. Results show that, despite their rapid advancement, contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score. Hallucination and inflation ratios reveal that models frequently overproduce categories relative to human annotators. Among the evaluated systems, Gemini 1.5/2.0 Flash and GPT 20B/120B offered the most favorable cost-to-performance balance, while GPT 120B demonstrated the lowest hallucination ratio. The findings suggest that scaling and architectural improvements alone do not ensure better categorization accuracy, as the task requires compressing rich unstructured text into a limited taxonomy - a process that challenges current model architectures. To address these limitations, a separate ensemble-based approach was developed and tested. The ensemble method, in which multiple LLMs act as independent experts, substantially improved accuracy, reduced inflation, and completely eliminated hallucinations. These results indicate that coordinated orchestration of models - rather than sheer scale - may represent the most effective path toward achieving or surpassing human-expert performance in large-scale text categorization.

[36] Reliable Fine-Grained Evaluation of Natural Language Math Proofs

Wenjie Ma, Andrei Cojocaru, Neel Kolhe, Bradley Louie, Robin Said Sharif, Haihan Zhang, Vincent Zhuang, Matei Zaharia, Sewon Min

Main category: cs.CL

TL;DR: The paper addresses the challenge of evaluating natural language math proofs by proposing ProofGrader, a fine-grained evaluator that scores proofs on a 0-7 scale using expert-annotated ProofBench dataset and achieves strong performance against human expert scores.

Details

Motivation: Current LLMs for mathematical reasoning focus on tasks with easily verifiable answers, but generating and verifying natural language math proofs remains an open challenge due to the absence of reliable fine-grained evaluators.

Method: Introduced ProofBench dataset with expert annotations, systematically explored evaluator design space across backbone models, input context, instructions and workflows. Developed ProofGrader combining strong reasoning backbone LM, reference solutions, marking schemes, and ensembling.

Result: ProofGrader achieves MAE of 0.926 against expert scores, significantly outperforming naive baselines. In best-of-n selection task at n=16, it achieves average score of 4.14, closing 78% of gap between naive binary evaluator (2.48) and human oracle (4.62).

Conclusion: ProofGrader demonstrates strong potential to advance downstream proof generation by providing reliable fine-grained evaluation of natural language math proofs, addressing a critical gap in mathematical reasoning research.

Abstract: Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. %with expert gradings. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.

[37] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness

Fali Wang, Jihai Chen, Shuhua Yang, Ali Al-Lawati, Linli Tang, Hui Liu, Suhang Wang

Main category: cs.CL

TL;DR: This paper systematically surveys SLM-LLM collaboration frameworks that combine small language models’ efficiency with large language models’ capabilities to address deployment challenges.

Details

Motivation: Large language models face high costs, latency, limited edge deployment, and reliability issues, while small language models offer complementary efficiency and adaptability benefits.

Method: Proposes a taxonomy with four collaboration objectives: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Reviews representative methods and summarizes design paradigms.

Result: The survey organizes existing SLM-LLM collaboration approaches and identifies key design patterns for combining specialized SLMs with generalized LLMs across different deployment scenarios.

Conclusion: Outlines open challenges and future directions toward achieving efficient, secure, and scalable SLM-LLM collaboration systems.

Abstract: Large language models (LLMs) have advanced many domains and applications but face high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language models (SLMs), compact, efficient, and adaptable, offer complementary remedies. Recent work explores collaborative frameworks that fuse SLMs’ specialization and efficiency with LLMs’ generalization and reasoning to meet diverse objectives across tasks and deployment scenarios. Motivated by these developments, this paper presents a systematic survey of SLM-LLM collaboration organized by collaboration objectives. We propose a taxonomy with four goals: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Within this framework, we review representative methods, summarize design paradigms, and outline open challenges and future directions toward efficient, secure, and scalable SLM-LLM collaboration.

[38] The Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data

Zhaoyang Shang, Sibo Wei, Jianbin Guo, Rui Zhou, Lifeng Dong, Yin Luo

Main category: cs.CL

TL;DR: THTB is a cognitive science-inspired framework that selects high-quality instruction data for LLM fine-tuning by prioritizing higher-level cognitive instructions using quality filtering and hardness scoring, enabling models to outperform full-dataset training with only 5% of data.

Details

Motivation: Existing methods for selecting SFT data suffer from over-reliance on LLMs' internal knowledge, weak interpretability, and limited generalization, motivating the need for a more systematic and interpretable approach.

Method: THTB combines quality filtering with intrinsic and extrinsic hardness scoring to prioritize higher-level cognitive instructions, providing interpretable and quantifiable criteria for efficient SFT data selection and annotation guidance.

Result: Models trained on only 5% of data selected by THTB outperform full-dataset training, with superior generalization compared to LLM-only selection. In vertical domains, models trained on just 2% of data surpass models trained on much larger datasets.

Conclusion: THTB demonstrates strong potential for domain adaptation and efficient SFT, providing an interpretable framework for data selection that significantly reduces training costs while improving performance and generalization.

Abstract: Large Language Models (LLMs) excel in general tasks, but adapting them to specialized domains relies on high-quality supervised fine-tuning (SFT) data. Although existing methods can identify subsets of high-quality data and reduce training cost to some extent, their selection process still suffers from over-reliance on LLMs’ internal knowledge, weak interpretability, and limited generalization. To address these limitations, we propose THTB (The Harder The Better), a cognitive science-inspired framework for instruction data selection and annotation guidance. THTB prioritizes higher-level cognitive instructions by combining quality filtering with intrinsic and extrinsic hardness scoring, offering interpretable and quantifiable criteria for efficient SFT, both in data selection and annotation guidance. Experiments show that THTB enables models trained on only 5% of the data to outperform full-dataset training, while achieving superior generalization compared with LLM-only selection. In addition, THTB provides effective annotation guidance in vertical domains, enabling a model trained on just 2% of the data to surpass models trained on much larger datasets, demonstrating strong potential for domain adaptation. Our code, datasets, and models are available on https://github.com/DYJG-research/THTB.

[39] Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi

Main category: cs.CL

TL;DR: The paper presents a comprehensive study on jailbreaking LLMs, developing a taxonomy of 50 strategies across 7 families, analyzing attack prevalence and success rates, benchmarking detection methods, and creating an Italian dataset of multi-turn adversarial dialogues.

Details

Motivation: Existing defenses against jailbreaking are limited to single-turn attacks, lack cross-language coverage, and use incomplete taxonomies that don't capture the full diversity of attack strategies or emphasize risk categories over techniques.

Method: Conducted a structured red-teaming challenge to develop a hierarchical taxonomy of 50 jailbreak strategies across 7 families, analyzed collected data on attack prevalence and success rates, benchmarked LLM detection methods with taxonomy-guided prompting, and compiled an Italian dataset of 1364 multi-turn adversarial dialogues.

Result: Created a comprehensive taxonomy covering 7 families of jailbreak strategies (impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, data poisoning), analyzed how specific strategies exploit model vulnerabilities, showed benefits of taxonomy-guided prompting for detection, and produced an annotated Italian dataset for studying gradual adversarial intent.

Conclusion: The study advances understanding of jailbreaking effectiveness through systematic taxonomy development, provides insights into attack strategy exploitation patterns, demonstrates improved detection with taxonomy guidance, and enables research on multi-turn adversarial interactions that bypass traditional safeguards.

Abstract: Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than the jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcome of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families, including impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmark a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.

[40] Attribution Quality in AI-Generated Content:Benchmarking Style Embeddings and LLM Judges

Misam Abbas

Main category: cs.CL

TL;DR: This paper benchmarks two authorship attribution methods (Style Embeddings and LLM Judge) on AI-generated content across six domains, finding complementary strengths that suggest hybrid approaches are needed.

Details

Motivation: As LLM-generated text becomes indistinguishable from human writing, reliable authorship attribution is increasingly important for detecting AI-generated content.

Method: Benchmarked fixed Style Embeddings and GPT-4o LLM Judge on the Human AI Parallel Corpus containing 600 balanced instances across six domains (academic, news, fiction, blogs, spoken transcripts, TV/movie scripts).

Result: Style Embeddings achieved 82% accuracy on GPT continuations vs 68% for LLM Judge. LLM Judge performed slightly better on LLaMA continuations (85% vs 81%) but not significantly. LLM Judge excelled in fiction and academic prose, while embeddings dominated in spoken and scripted dialogue.

Conclusion: Attribution is a multidimensional problem requiring hybrid strategies, as different methods show complementary strengths across different text domains and LLM types.

Abstract: Attributing authorship in the era of large language models (LLMs) is increasingly challenging as machine-generated prose rivals human writing. We benchmark two complementary attribution mechanisms , fixed Style Embeddings and an instruction-tuned LLM judge (GPT-4o) on the Human AI Parallel Corpus, an open dataset of 600 balanced instances spanning six domains (academic, news, fiction, blogs, spoken transcripts, and TV/movie scripts). Each instance contains a human prompt with both a gold continuation and an LLM-generated continuation from either GPT-4o or LLaMA-70B-Instruct. The Style Embedding baseline achieves stronger aggregate accuracy on GPT continuations (82 pct vs. 68 pct). The LLM Judge is slightly better than the Style embeddings on LLaMA continuations (85 pct vs. 81 pct) but the results are not statistically significant. Crucially, the LLM judge significantly outperforms in fiction and academic prose, indicating semantic sensitivity, whereas embeddings dominate in spoken and scripted dialogue, reflecting structural strengths. These complementary patterns highlight attribution as a multidimensional problem requiring hybrid strategies. To support reproducibility we provide code on GitHub and derived data on Hugging Face under the MIT license. This open framework provides a reproducible benchmark for attribution quality assessment in AI-generated content, along with a review of related literature influencing this work.

[41] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, Neel Nanda

Main category: cs.CL

TL;DR: Narrow finetuning creates detectable biases in LLM activations that reveal the finetuning domain. These biases can be discovered through model diffing and used to generate text similar to the training data, posing potential security and interpretability concerns.

Details

Motivation: To understand how narrow finetuning affects LLM activations and whether these changes can reveal sensitive information about the training domain, which has implications for AI safety and interpretability research.

Method: Used model diffing techniques to analyze activation differences before and after finetuning, tested on synthetic document finetuning, emergent misalignment, subliminal learning, and taboo word guessing across various architectures (Gemma, LLaMA, Qwen) and scales (1B-32B parameters).

Result: Found strong biases in activations that can be interpreted to understand the finetuning domain. These biases allow generating text similar to the training data format and content. Mixing pretraining data during finetuning largely removes these biases.

Conclusion: Narrow finetuning leaves detectable traces in model activations that reveal training objectives, suggesting current practices in AI safety research using such models may not be realistic and highlighting the need for better training methods and case studies.

Abstract: Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing - the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better compared to baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain. Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.

[42] RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Tuan T. Nguyen, John Le, Thai T. Vu, Willy Susilo, Heath Cooper

Main category: cs.CL

TL;DR: RAID is a framework that creates adversarial suffixes to jailbreak LLMs by optimizing continuous embeddings with refusal-aware regularization and coherence terms, achieving higher attack success rates than existing methods.

Details

Motivation: LLMs have safety vulnerabilities to jailbreak attacks that bypass their safety mechanisms, highlighting the need to systematically probe and understand these weaknesses.

Method: RAID relaxes discrete tokens to continuous embeddings, optimizes them with a joint objective including refusal-aware regularization and coherence terms, then uses critic-guided decoding to map embeddings back to tokens.

Result: Experiments show RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines on multiple open-source LLMs.

Conclusion: Embedding-space regularization is crucial for understanding and mitigating LLM jailbreak vulnerabilities.

Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.

[43] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory

Nicole Smith-Vaniz, Harper Lyon, Lorraine Steigner, Ben Armstrong, Nicholas Mattei

Main category: cs.CL

TL;DR: This paper analyzes political and moral biases in LLMs using Moral Foundations Theory, comparing LLM responses to human data across different prompting conditions.

Details

Motivation: LLMs are increasingly used for advice in sensitive domains, raising concerns about potential political and moral biases in their responses that could influence users.

Method: Applied Moral Foundations Theory framework to analyze LLM responses, comparing them with human data across three conditions: inherent responses, explicit political ideology prompting, and demographic-based role-playing.

Result: The study systematically examines whether LLMs demonstrate ideological leanings in their responses and how accurately they can represent different political perspectives.

Conclusion: Provides insights into the extent of political and demographic dependency in AI-generated responses, addressing gaps in understanding LLM moral biases compared to human data.

Abstract: Large Language Models (LLMs) have become increasingly incorporated into everyday life for many internet users, taking on significant roles as advice givers in the domains of medicine, personal relationships, and even legal matters. The importance of these roles raise questions about how and what responses LLMs make in difficult political and moral domains, especially questions about possible biases. To quantify the nature of potential biases in LLMs, various works have applied Moral Foundations Theory (MFT), a framework that categorizes human moral reasoning into five dimensions: Harm, Fairness, Ingroup Loyalty, Authority, and Purity. Previous research has used the MFT to measure differences in human participants along political, national, and cultural lines. While there has been some analysis of the responses of LLM with respect to political stance in role-playing scenarios, no work so far has directly assessed the moral leanings in the LLM responses, nor have they connected LLM outputs with robust human data. In this paper we analyze the distinctions between LLM MFT responses and existing human research directly, investigating whether commonly available LLM responses demonstrate ideological leanings: either through their inherent responses, straightforward representations of political ideologies, or when responding from the perspectives of constructed human personas. We assess whether LLMs inherently generate responses that align more closely with one political ideology over another, and additionally examine how accurately LLMs can represent ideological perspectives through both explicit prompting and demographic-based role-playing. By systematically analyzing LLM behavior across these conditions and experiments, our study provides insight into the extent of political and demographic dependency in AI-generated responses.

[44] Schema for In-Context Learning

Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung, Varinia Bernales, Alan Aspuru-Guzik

Main category: cs.CL

TL;DR: SA-ICL introduces schema-based in-context learning that extracts abstract reasoning templates from examples to enhance LLM performance, achieving up to 36.19% improvement on science questions.

Details

Motivation: Traditional ICL lacks explicit knowledge retrieval and transfer mechanisms at the abstraction level, while cognitive science suggests humans use schemas (mental frameworks) to structure understanding of new information.

Method: Extracts building blocks of cognition from prior examples to create abstracted schemas - lightweight, structured templates of key inferential steps and their relationships - which augment the model’s reasoning process for novel questions.

Result: SA-ICL consistently boosts performance up to 36.19% on chemistry and physics questions from GPQA dataset, reduces reliance on demonstration quantity, and enhances interpretability. LLMs benefit significantly from explicit schema-based scaffolding.

Conclusion: SA-ICL bridges disparate ICL strategies and paves a new path for enhancing human-like reasoning in LLMs by making schema-based learning representations explicit rather than implicit.

Abstract: In-Context Learning (ICL) enables transformer-based language models to adapt to new tasks by conditioning on demonstration examples. However, traditional example-driven in-context learning lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples, creating an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which is then used to augment a model’s reasoning process when presented with a novel question. We demonstrate that a broad range of large language models (LLMs) lack the capacity to form and utilize internal schema-based learning representations implicitly, but instead benefit significantly from explicit schema-based scaffolding. Across chemistry and physics questions from the GPQA dataset, our experiments show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality, which simultaneously reduces reliance on the number of demonstrations and enhances interpretability. SCHEMA ACTIVATED IN CONTEXT LEARNING not only bridges disparate ICL strategies ranging from pattern priming to Chain-of-Thought prompting, but also paves a new path for enhancing human-like reasoning in LLMs.

[45] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

Yuanchen Wu, Saurabh Verma, Justin Lee, Fangzhou Xiong, Poppy Zhang, Amel Awadelkarim, Xu Chen, Yubai Yuan, Shawndra Hill

Main category: cs.CL

TL;DR: PDO is a label-free prompt optimization framework that uses dueling-bandit approach with pairwise preference feedback from LLM judges, combining Double Thompson Sampling and Top-Performer Guided Mutation.

Details

Motivation: Traditional automatic prompt optimization methods require costly ground-truth labels, which are expensive and slow to collect in practice.

Method: Formulates prompt optimization as dueling-bandit problem using pairwise preference feedback from LLM judges, combines Double Thompson Sampling for informative comparisons with Top-Performer Guided Mutation to expand candidate pool.

Result: PDO consistently outperforms baseline methods on BIG-bench Hard and MS MARCO datasets, with ablation studies confirming effectiveness of both D-TS and prompt mutation components.

Conclusion: PDO provides an effective label-free approach to prompt optimization that can also incorporate partial labels to handle judge noise, offering practical advantages over label-dependent methods.

Abstract: Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.

[46] Interpreting the Latent Structure of Operator Precedence in Language Models

Dharunish Yugeswardeenoo, Harshil Nukala, Cole Blondin, Sean O Brien, Vasu Sharma, Kevin Zhu

Main category: cs.CL

TL;DR: LLMs encode operator precedence in their internal representations, with intermediate computations appearing in the residual stream after MLP blocks, and precedence linearly encoded in operator embeddings post attention layer.

Details

Motivation: To investigate whether LLMs encode operator precedence in their internal representations, as prior works focused on outputs and prompting strategies rather than internal computation structure.

Method: Used instruction-tuned LLaMA 3.2-3B model with dataset of arithmetic expressions with three operands and two operators, varying parentheses placement. Applied interpretability techniques including logit lens, linear classification probes, UMAP visualization, and introduced partial embedding swap technique.

Result: Intermediate computations are present in the residual stream, particularly after MLP blocks. The model linearly encodes precedence in each operator’s embeddings post attention layer. Partial embedding swap successfully modifies operator precedence.

Conclusion: LLMs do encode operator precedence in their internal representations, with specific patterns in residual stream and operator embeddings, enabling manipulation of precedence through embedding modifications.

Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving the open question of the internal structure through which models do arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator’s embeddings post attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.

[47] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

Xingrui Zhuo, Jiapu Wang, Gongqing Wu, Zhongyuan Wang, Jichen Zhang, Shirui Pan, Xindong Wu

Main category: cs.CL

TL;DR: Proposes KRLM, a Knowledge Reasoning Language Model that coordinates LLM knowledge with KG context to address knowledge distortion and hallucinations in inductive knowledge graph reasoning.

Details

Motivation: Existing LLM-based KGFMs suffer from LLM knowledge distortion due to sparse KG context overshadowing intrinsic knowledge, and struggle to constrain generative hallucinations, limiting reasoning credibility.

Method: Designs Knowledge Reasoning Language (KRL) instruction format and tokenizer to align LLM knowledge with KG representations, proposes KRL attention layer with dynamic knowledge memory mechanism, and structure-aware next-entity predictor to constrain results.

Result: Extensive experiments on 25 real-world inductive KGR datasets demonstrate significant superiority in both zero-shot reasoning and fine-tuning scenarios.

Conclusion: KRLM effectively addresses knowledge distortion and hallucination issues in LLM-based knowledge graph reasoning through unified coordination between LLM knowledge and KG context.

Abstract: Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM\footnote{Our source codes are available at https://anonymous.4open.science/r/KRLM-EA36 in both zero-shot reasoning and fine-tuning scenarios.

[48] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li

Main category: cs.CL

TL;DR: RAGCap-Bench is a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows, addressing limitations in current systems for multi-hop questions.

Details

Motivation: Current agentic RAG systems struggle with challenging multi-hop questions and their intermediate reasoning capabilities remain underexplored, despite RAG's ability to mitigate LLM limitations like factual errors and hallucinations.

Method: Analyzed outputs from state-of-the-art systems to identify common tasks and core capabilities, constructed a taxonomy of typical LLM errors, and designed targeted evaluation questions for intermediate tasks in agentic RAG workflows.

Result: Experiments show that ‘slow-thinking’ models with stronger RAGCap performance achieve better end-to-end results, validating the benchmark’s effectiveness.

Conclusion: The benchmark demonstrates the importance of enhancing intermediate capabilities in agentic RAG systems for improved performance on complex queries.

Abstract: Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs)-such as factual errors, outdated knowledge, and hallucinations-by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that “slow-thinking” models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark’s validity and the importance of enhancing these intermediate capabilities.

[49] AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

María Victoria Carro, Denise Alejandra Mester, Facundo Nieto, Oscar Agustín Stanchi, Guido Ernesto Bergman, Mario Alejandro Leiva, Eitan Sprejer, Luca Nicolás Forziati Gangi, Francisca Gauna Selasco, Juan Gustavo Corvalán, Gerardo I. Simari, María Vanina Martinez

Main category: cs.CL

TL;DR: AI debate experiments reveal models prefer sycophantic strategies over prior beliefs, sequential debate favors second debater, and paradoxically higher-quality arguments emerge when defending positions misaligned with prior beliefs.

Details

Motivation: To test whether language models adopt sycophantic strategies by aligning with judge personas rather than their prior beliefs in subjective debate settings, addressing limitations of existing debate experiments that rely on objective datasets with ground truth.

Method: Applied debate to subjective questions, measured models’ prior beliefs, presented debaters with conflicting judge personas, compared sequential vs simultaneous debate protocols, and assessed persuasiveness and argument quality when defending positions consistent vs inconsistent with prior beliefs.

Result: Models prefer defending stances aligned with judge persona over prior beliefs, sequential debate introduces bias favoring second debater, models are more persuasive when defending positions aligned with prior beliefs, but arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison.

Conclusion: Results inform human judges for better training signals and contribute to aligned AI systems, revealing important persuasion dynamics in human-AI interaction where models exhibit sycophantic behavior and produce paradoxically higher-quality arguments when arguing against their beliefs.

Abstract: The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models’ prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge’s presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs versus when arguing against them. Our main findings show that models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs, sequential debate introduces significant bias favoring the second debater, models are more persuasive when defending positions aligned with their prior beliefs, and paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison. These results can inform human judges to provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.

[50] Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms

Shrey Pandit, Xuan-Phi Nguyen, Yifei Ming, Austin Xu, Jiayu Wang, Caiming Xiong, Shafiq Joty

Main category: cs.CL

TL;DR: The paper introduces a data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a baseline web agent fails, enabling training of more effective web agents with greater tool-use diversity.

Details

Motivation: Current methods for creating instruction-tuning datasets for web-based deep research agents lack fine-grained control over difficulty and quality, and often conflate data and training effects, making it hard to evaluate data effectiveness.

Method: Two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a baseline web agent fails, with the baseline agent serving multiple roles including attempting questions, validating factuality, checking for alternative answers, and enforcing filtering.

Result: The synthesized dataset enables training of more effective web agents than existing datasets, exhibiting twice the diversity in tool-use actions and achieving stronger performance while avoiding repetitive tool-calling behaviors.

Conclusion: The proposed data synthesis method produces higher quality training data that better captures the complexity required for long-horizon reasoning in web-based deep research agents.

Abstract: Web-based ‘deep research’ agents aim to solve complex question - answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question - answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset - despite being smaller - enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.

[51] Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models

Ivan Lee, Taylor Berg-Kirkpatrick

Main category: cs.CL

TL;DR: Readability alone doesn’t predict coherence in small language models; statistical simplicity (n-gram diversity) is a stronger predictor than simplified language.

Details

Motivation: To challenge the interpretation that readability (simplified language) is key for coherence in small language models, and investigate what properties actually support capability emergence.

Method: Constructed synthetic datasets with matched structure but varied readability, comparing models trained on complex vs simplified language, and analyzed statistical simplicity using n-gram diversity metrics.

Result: Models trained on complex, adult-level text performed comparably to those trained on simplified language, with even faster development of coherence. Statistical simplicity was a stronger predictor of learnability than readability.

Conclusion: Readability alone doesn’t explain coherence emergence in small language models; statistical simplicity is more important, and anthropomorphizing model training by drawing parallels to human cognitive development should be done cautiously with empirical basis.

Abstract: Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability – characterized by accessible vocabulary, familiar narrative structure, and simple syntax – plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training – drawing parallels to human cognitive development without empirical basis – and argue for more precise reasoning about what properties actually support capability emergence in small models.

[52] Element2Vec: Build Chemical Element Representation from Text for Property Prediction

Yuanhao Li, Keyuan Lai, Tianqi Wang, Qihao Liu, Jiawei Ma, Yuan-Chao Hu

Main category: cs.CL

TL;DR: Element2Vec uses language models to generate embeddings from Wikipedia text for chemical elements, creating both general-purpose and attribute-specific vectors to predict element properties, addressing data sparsity with test-time training.

Details

Motivation: Traditional methods fail to model complex relationships in chemical element properties, and existing AI approaches suffer from hallucinations and lack interpretability. There's a need for better representation of elements from natural language text.

Method: Parse Wikipedia text for elements, use language models to generate global embeddings and local attribute-highlighted vectors, and implement test-time training with self-attention to reduce prediction errors from vanilla regression.

Result: The method effectively represents chemical elements from natural language, addressing challenges of text distribution discrepancy and limited data (only 118 known elements with sparse property data).

Conclusion: This work paves the way for advancing AI-driven discovery in materials science by providing interpretable element representations that can handle complex relationships and data sparsity issues.

Abstract: Accurate property data for chemical elements is crucial for materials design and manufacturing, but many of them are difficult to measure directly due to equipment constraints. While traditional methods use the properties of other elements or related properties for prediction via numerical analyses, they often fail to model complex relationships. After all, not all characteristics can be represented as scalars. Recent efforts have been made to explore advanced AI tools such as language models for property estimation, but they still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vecto effectively represent chemical elements from natural languages to support research in the natural sciences. Given the text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Despite the complicated relationship across elements, the computational challenges also exist because of 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) the extremely limited data, i.e., with only 118 known elements, data for specific properties is often highly sparse and incomplete. Thus, we also design a test-time training method based on self-attention to mitigate the prediction error caused by Vanilla regression clearly. We hope this work could pave the way for advancing AI-driven discovery in materials science.

[53] Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

Peng Kuang, Yanli Wang, Xiaoyu Han, Yaowenqi Liu, Kaidi Xu, Haohan Wang

Main category: cs.CL

TL;DR: The paper proposes a calibrated weighting method for process reward models (PRMs) that significantly improves test-time scaling efficiency by optimally combining LLM and PRM signals, outperforming standard approaches while using only 21.3% of computation.

Details

Motivation: Recent benchmarks show that simple majority voting sometimes outperforms standard PRM-based selection, raising questions about how to effectively utilize PRM verification signals for test-time scaling.

Method: Developed a theoretical framework for optimal signal combination, revealing that weighted aggregation with calibrated weights is optimal. Proposed efficient pre-computation methods to calibrate weighting functions that capture the complex interplay between LLMs and PRMs.

Result: Extensive experiments across 5 LLMs and 7 PRMs show the calibration method significantly boosts TTS efficiency, surpassing vanilla weighted majority voting while using only 21.3% of computation.

Conclusion: Investing in intelligent aggregation strategies is more effective for performance gains than simply scaling test-time computation, demonstrating that calibrated weighting functions can substantially improve PRM utilization.

Abstract: Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only $21.3%$ of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.

[54] FACTS: Table Summarization via Offline Template Generation with Agentic Workflows

Ye Yuan, Mohammad Amin Shabani, Siqi Liu

Main category: cs.CL

TL;DR: FACTS is a fast, accurate, and privacy-compliant table summarization approach that generates reusable offline templates (SQL queries + Jinja2 templates) for query-focused table summarization, outperforming existing methods.

Details

Motivation: Existing approaches for query-focused table summarization have limitations: table-to-text models require costly fine-tuning and struggle with complex reasoning, prompt-based LLM methods have token-limit and efficiency issues while exposing sensitive data, and prior agentic pipelines lack robustness and scalability.

Method: FACTS introduces an agentic workflow that produces offline templates consisting of SQL queries and Jinja2 templates. These templates are reusable across multiple tables sharing the same schema and can be rendered into natural language summaries.

Result: Evaluations on widely-used benchmarks show that FACTS consistently outperforms baseline methods in query-focused table summarization.

Conclusion: FACTS establishes itself as a practical solution for real-world query-focused table summarization, offering fast summarization through reusable templates, accurate outputs with executable SQL queries, and privacy compliance by sending only table schemas to LLMs.

Abstract: Query-focused table summarization requires generating natural language summaries of tabular data conditioned on a user query, enabling users to access insights beyond fact retrieval. Existing approaches face key limitations: table-to-text models require costly fine-tuning and struggle with complex reasoning, prompt-based LLM methods suffer from token-limit and efficiency issues while exposing sensitive data, and prior agentic pipelines often rely on decomposition, planning, or manual templates that lack robustness and scalability. To mitigate these issues, we introduce an agentic workflow, FACTS, a Fast, Accurate, and Privacy-Compliant Table Summarization approach via Offline Template Generation. FACTS produces offline templates, consisting of SQL queries and Jinja2 templates, which can be rendered into natural language summaries and are reusable across multiple tables sharing the same schema. It enables fast summarization through reusable offline templates, accurate outputs with executable SQL queries, and privacy compliance by sending only table schemas to LLMs. Evaluations on widely-used benchmarks show that FACTS consistently outperforms baseline methods, establishing it as a practical solution for real-world query-focused table summarization.

[55] An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation

Daniel Adu Worae, Spyridon Mastorakis

Main category: cs.CL

TL;DR: An LLM-powered AI agent framework that converts IoT network packet captures into structured representations for interactive analysis, combining feature extraction, anomaly detection, summarization, and retrieval-augmented QA to achieve efficient traffic interpretation.

Details

Motivation: IoT networks generate diverse high-volume traffic requiring cross-layer interpretation rather than isolated detection to derive meaningful insights from both normal activity and potential threats.

Method: Integrates feature extraction, transformer-based anomaly detection, packet/flow summarization, threat intelligence enrichment, and retrieval-augmented question answering using an AI agent guided by LLM reasoning over indexed traffic artifacts.

Result: Hybrid retrieval combining lexical and semantic search with reranking substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared to dense-only retrieval, with low CPU, GPU, and memory overhead.

Conclusion: The framework achieves holistic and efficient interpretation of IoT network traffic through structured semantic enrichment and interactive analysis capabilities.

Abstract: Internet of Things (IoT) networks generate diverse and high-volume traffic that reflects both normal activity and potential threats. Deriving meaningful insight from such telemetry requires cross-layer interpretation of behaviors, protocols, and context rather than isolated detection. This work presents an LLM-powered AI agent framework that converts raw packet captures into structured and semantically enriched representations for interactive analysis. The framework integrates feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering. An AI agent guided by a large language model performs reasoning over the indexed traffic artifacts, assembling evidence to produce accurate and human-readable interpretations. Experimental evaluation on multiple IoT captures and six open models shows that hybrid retrieval, which combines lexical and semantic search with reranking, substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared with dense-only retrieval. System profiling further indicates low CPU, GPU, and memory overhead, demonstrating that the framework achieves holistic and efficient interpretation of IoT network traffic.

[56] BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs

Congying Liu, Xingyuan Wei, Peipei Liu, Yiqing Shen, Yanxu Mao, Tiehan Cui

Main category: cs.CL

TL;DR: BioMedSearch is a multi-source biomedical information retrieval framework using LLMs that integrates literature, protein databases, and web search to handle complex biomedical queries through query decomposition and multi-source filtering, achieving significant accuracy improvements across all reasoning levels.

Details

Motivation: LLMs lack scientific rigor in biomedical content generation due to inability to access authoritative databases and often fabricate protein information, necessitating a framework that can integrate multiple biomedical data sources for accurate retrieval and reasoning.

Method: Integrates literature retrieval, protein database access, and web search with sub-queries decomposition, keywords extraction, task graph construction, and multi-source information filtering to generate high-quality biomedical question-answering results.

Result: BioMedSearch consistently outperforms baseline models across all reasoning levels: Level 1 accuracy increased from 59.1% to 91.9%, Level 2 from 47.0% to 81.0%, and Level 3 from 36.3% to 73.4% on the BioMedMCQs dataset.

Conclusion: The framework effectively addresses LLMs’ limitations in biomedical reasoning by integrating multiple authoritative data sources and structured query processing, demonstrating substantial improvements in handling complex biomedical queries across different reasoning levels.

Abstract: Biomedical queries often rely on a deep understanding of specialized knowledge such as gene regulatory mechanisms and pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well in general reasoning tasks, their generated biomedical content often lacks scientific rigor due to the inability to access authoritative biomedical databases and frequently fabricates protein functions, interactions, and structural details that deviate from authentic information. Therefore, we present BioMedSearch, a multi-source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database and web search access to support accurate and efficient handling of complex biomedical queries. Through sub-queries decomposition, keywords extraction, task graph construction, and multi-source information filtering, BioMedSearch generates high-quality question-answering results. To evaluate the accuracy of question answering, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non-adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels. Specifically, at Level 1, the average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, the average accuracy improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: https://github.com/CyL-ucas/BioMed_Search

[57] LLMs Can Get “Brain Rot”!

Shuo Xing, Junyuan Hong, Yifan Wang, Runjin Chen, Zhenyu Zhang, Ananth Grama, Zhengzhong Tu, Zhangyang Wang

Main category: cs.CL

TL;DR: The LLM Brain Rot Hypothesis shows that continual exposure to low-quality web text causes lasting cognitive decline in LLMs, with measurable deterioration in reasoning, safety, and other capabilities.

Details

Motivation: To causally test whether data quality affects LLM cognitive capabilities by examining the effects of junk web text exposure through controlled experiments.

Method: Controlled experiments using real Twitter/X corpora with junk and control datasets via two operationalizations (engagement degree and semantic quality), with matched token scale and training operations across 4 LLM models.

Result: Continual pre-training on junk data causes significant declines in reasoning, long-context understanding, safety, and increases ‘dark traits’. Thought-skipping identified as primary error mechanism. Popularity better predictor than length for brain rot effects.

Conclusion: Data quality is a causal driver of LLM capability decay, reframing curation as a training-time safety problem and motivating routine cognitive health checks for deployed LLMs.

Abstract: We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). To causally isolate data quality, we run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with matched token scale and training operations across conditions. Contrary to the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges’ $g>0.3$) on reasoning, long-context understanding, safety, and inflating “dark traits” (e.g., psychopathy, narcissism). The gradual mixtures of junk and control datasets also yield dose-response cognition decay: for example, under M1, ARC-Challenge with Chain Of Thoughts drops $74.9 \rightarrow 57.2$ and RULER-CWE $84.4 \rightarrow 52.3$ as junk ratio rises from $0%$ to $100%$. Error forensics reveal several key insights. First, we identify thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth. Second, partial but incomplete healing is observed: scaling instruction tuning and clean data pre-training improve the declined cognition yet cannot restore baseline capability, suggesting persistent representational drift rather than format mismatch. Finally, we discover that the popularity, a non-semantic metric, of a tweet is a better indicator of the Brain Rot effect than the length in M1. Together, the results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing curation for continual pretraining as a \textit{training-time safety} problem and motivating routine “cognitive health checks” for deployed LLMs.

[58] Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions

Siying Liu, Shisheng Zhang, Indu Bala

Main category: cs.CL

TL;DR: LLMs show systematic biases in drug-safety prediction, assigning higher adverse event likelihoods to disadvantaged groups despite socio-demographic attributes being clinically irrelevant.

Details

Motivation: To investigate whether LLMs incorporate socio-demographic information into adverse event predictions, despite such attributes being clinically irrelevant, and assess reliability in drug-safety applications.

Method: Used structured FAERS data with persona-based evaluation framework, testing ChatGPT-4o and Bio-Medical-Llama-3.8B across diverse personas defined by education, marital status, employment, insurance, language, housing stability, and religion, plus three user roles (GP, specialist, patient).

Result: Systematic disparities in AE prediction accuracy found - disadvantaged groups (low education, unstable housing) assigned higher predicted AE likelihoods than privileged groups. Identified explicit bias (incorrect predictions reference persona attributes) and implicit bias (inconsistent predictions without explicit persona mentions).

Conclusion: Findings expose critical risks in applying LLMs to pharmacovigilance, highlighting urgent need for fairness-aware evaluation protocols and mitigation strategies before clinical deployment.

Abstract: Large language models (LLMs) are increasingly applied in biomedical domains, yet their reliability in drug-safety prediction remains underexplored. In this work, we investigate whether LLMs incorporate socio-demographic information into adverse event (AE) predictions, despite such attributes being clinically irrelevant. Using structured data from the United States Food and Drug Administration Adverse Event Reporting System (FAERS) and a persona-based evaluation framework, we assess two state-of-the-art models, ChatGPT-4o and Bio-Medical-Llama-3.8B, across diverse personas defined by education, marital status, employment, insurance, language, housing stability, and religion. We further evaluate performance across three user roles (general practitioner, specialist, patient) to reflect real-world deployment scenarios where commercial systems often differentiate access by user type. Our results reveal systematic disparities in AE prediction accuracy. Disadvantaged groups (e.g., low education, unstable housing) were frequently assigned higher predicted AE likelihoods than more privileged groups (e.g., postgraduate-educated, privately insured). Beyond outcome disparities, we identify two distinct modes of bias: explicit bias, where incorrect predictions directly reference persona attributes in reasoning traces, and implicit bias, where predictions are inconsistent, yet personas are not explicitly mentioned. These findings expose critical risks in applying LLMs to pharmacovigilance and highlight the urgent need for fairness-aware evaluation protocols and mitigation strategies before clinical deployment.

[59] Big Reasoning with Small Models: Instruction Retrieval at Inference Time

Kenan Alkiek, David Jurgens, Vinod Vydiswaran

Main category: cs.CL

TL;DR: Instruction retrieval enables small language models to perform complex reasoning by retrieving structured procedures rather than generating them from scratch, achieving significant performance gains on specialized tasks without additional fine-tuning.

Details

Motivation: Small language models (SLMs) are efficient for local deployment but struggle with multi-step reasoning and domain-specific knowledge tasks. The goal is to enhance SLM reasoning capabilities without sacrificing their computational efficiency advantages.

Method: Build an Instruction Corpus by grouping similar training questions and creating structured instructions via GPT-5. During inference, the SLM retrieves the most relevant instructions and follows their steps, providing structured guidance for reasoning rather than retrieving text passages.

Result: Instruction retrieval yields consistent performance improvements: 9.4% on MedQA (medical board exams), 7.9% on MMLU Professional Law, and 5.1% on MathQA across models from 3B to 14B parameters without additional fine-tuning.

Conclusion: Concise instructions outperform longer ones, and the improvement magnitude depends on model family and intrinsic reasoning ability. Instruction retrieval effectively enhances SLM reasoning capabilities while maintaining their computational efficiency benefits.

Abstract: Can we bring large-scale reasoning to local-scale compute? Small language models (SLMs) are increasingly attractive because they run efficiently on local hardware, offering strong privacy, low cost, and reduced environmental impact. Yet they often struggle with tasks that require multi-step reasoning or domain-specific knowledge. We address this limitation through instruction intervention at inference time, where an SLM retrieves structured reasoning procedures rather than generating them from scratch. Our method builds an Instruction Corpus by grouping similar training questions and creating instructions via GPT-5. During inference, the SLM retrieves the most relevant instructions and follows their steps. Unlike retrieval-augmented generation, which retrieves text passages, instruction retrieval gives the model structured guidance for reasoning. We evaluate this framework on MedQA (medical board exams), MMLU Professional Law, and MathQA using models from 3B to 14B parameters without any additional fine-tuning. Instruction retrieval yields consistent gains: 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA. Concise instructions outperform longer ones, and the magnitude of improvement depends strongly on model family and intrinsic reasoning ability.

[60] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

Fengbin Zhu, Xiang Yao Ng, Ziyang Liu, Chang Liu, Xianwei Zeng, Chao Wang, Tianhui Tan, Xuan Yao, Pengyang Shao, Min Xu, Zixuan Wang, Jing Wang, Xin Lin, Junfeng Li, Jingxian Zhu, Yang Zhang, Wenjie Wang, Fuli Feng, Richang Hong, Huanbo Luan, Ke-Wei Huang, Tat-Seng Chua

Main category: cs.CL

TL;DR: The paper introduces HisRubric, a hierarchical evaluation framework for deep research agents in financial analysis, and creates the FinDeepResearch benchmark with 15,808 grading items across 64 companies from 8 markets in 4 languages.

Details

Motivation: Existing literature lacks rigorous and systematic evaluation of deep research agents' capabilities in critical research analysis, particularly in corporate financial analysis.

Method: Proposed HisRubric framework with hierarchical analytical structure and fine-grained grading rubric that mirrors professional analyst workflow. Built FinDeepResearch benchmark and tested 16 methods including DR agents and LLMs with different capabilities.

Result: Extensive experiments revealed strengths and limitations of different approaches across diverse capabilities, financial markets, and languages. The findings provide valuable insights for future research.

Conclusion: The study addresses the evaluation gap for deep research agents in financial analysis through a systematic framework and benchmark, with results guiding future development in this field.

Abstract: Deep Research (DR) agents, powered by advanced Large Language Models (LLMs), have recently garnered increasing attention for their capability in conducting complex research tasks. However, existing literature lacks a rigorous and systematic evaluation of DR Agent’s capabilities in critical research analysis. To address this gap, we first propose HisRubric, a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric for rigorously assessing DR agents’ capabilities in corporate financial analysis. This framework mirrors the professional analyst’s workflow, progressing from data recognition to metric calculation, and finally to strategic summarization and interpretation. Built on this framework, we construct a FinDeepResearch benchmark that comprises 64 listed companies from 8 financial markets across 4 languages, encompassing a total of 15,808 grading items. We further conduct extensive experiments on the FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only. The results reveal the strengths and limitations of these approaches across diverse capabilities, financial markets, and languages, offering valuable insights for future research and development. The benchmark and evaluation code will be made publicly available.

[61] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon

Main category: cs.CL

TL;DR: Fine-tuning AI models on authors’ complete works enables generation of literary text that experts prefer over human writing in both stylistic fidelity and quality, reversing initial disadvantages of in-context prompting.

Details

Motivation: To address legal concerns about AI using copyrighted books by testing whether AI can genuinely emulate authors' styles and generate high-quality literary text that could compete with human writing.

Method: Preregistered study comparing MFA-trained expert writers with ChatGPT, Claude, and Gemini using both in-context prompting and fine-tuning on authors’ complete works, with blind pairwise evaluations by 159 expert and lay readers.

Result: Fine-tuning completely reversed initial disadvantages: experts favored AI-generated text for stylistic fidelity (OR=8.16) and writing quality (OR=1.87), with outputs rarely detected as AI-generated (3% vs 97% for in-context prompting).

Conclusion: Author-specific fine-tuning enables AI to produce non-verbatim writing that readers prefer to expert human writing, providing empirical evidence relevant to copyright’s fair-use analysis regarding market effects.

Abstract: The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI’s ability to generate derivative content.Yet it’s unclear whether these models can generate high quality literary text while emulating authors’ styles. To answer this we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude & Gemini in writing up to 450 word excerpts emulating 50 award-winning authors’ diverse styles. In blind pairwise evaluations by 159 representative expert & lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p<10^8) & writing quality (OR=0.13, p<10^7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors’ complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p<10^13) & writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors & styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate v. 97% for in-context prompting) by best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning & inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright’s fourth fair-use factor, the “effect upon the potential market or value” of the source works.

[62] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Pengfei Wan, Ying-Cong Chen

Main category: cs.CL

TL;DR: MTI is a training-free framework that improves reasoning accuracy by selectively applying classifier-free guidance only at high-entropy tokens, achieving consistent gains with minimal overhead.

Details

Motivation: Reasoning uncertainty in LLMs is highly localized to a small subset of high-entropy tokens, suggesting efficiency can be improved by focusing interventions only where needed rather than applying computation broadly.

Method: Minimal Test-Time Intervention (MTI) with two components: selective CFG intervention at uncertain positions only, and lightweight negative-prompt guidance that reuses the main model’s KV cache for efficient unconditional decoding approximation.

Result: Consistent improvements across general, coding, and STEM tasks: +1.35% average gain on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning, while maintaining high efficiency.

Conclusion: MTI demonstrates that targeted, minimal interventions at uncertain token positions can significantly enhance reasoning accuracy and stability in LLMs without the computational overhead of broad test-time scaling approaches.

Abstract: Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized-only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model’s KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks-e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning-while remaining highly efficient.

[63] Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems

Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Rose, Jesse C. Cresswell

Main category: cs.CL

TL;DR: A comprehensive taxonomy of error types in retrieval-augmented generation (RAG) systems, with practical solutions, annotated dataset, and auto-evaluation method for error tracking.

Details

Motivation: Understanding the range of errors in real-world RAG systems is crucial for robust deployment, as complex systems have many potential causes for erroneous outputs.

Method: Developed a new taxonomy of RAG error types, curated annotated dataset of erroneous responses, and proposed auto-evaluation method aligned with the taxonomy.

Result: Created comprehensive error classification framework with practical advice for addressing each error type, plus an auto-evaluation tool for development tracking.

Conclusion: The taxonomy and evaluation method provide practical tools for developers to systematically identify, track, and address errors in RAG systems during development and deployment.

Abstract: Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at https://github.com/layer6ai-labs/rag-error-classification.

[64] The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Lukas Gienapp, Christopher Schröder, Stefan Schweter, Christopher Akiki, Ferdinand Schlatt, Arden Zimmermann, Phillipe Genêt, Martin Potthast

Main category: cs.CL

TL;DR: The German Commons is the largest collection of openly licensed German text (154.56B tokens) from 41 sources across 7 domains, enabling development of truly open German language models.

Details

Motivation: Address the critical scarcity of openly licensed German text for language model training, as most training corpora contain data of unclear licensing status, especially problematic for non-English languages.

Method: Systematic compilation from 41 verified sources across legal, scientific, cultural, political, news, economic, and web domains with comprehensive quality filtering, deduplication, and text formatting fixes.

Result: Created 154.56 billion tokens of high-quality German text with all domain subsets featuring licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance.

Conclusion: The German Commons fills the critical gap in openly licensed German pretraining data and enables development of truly open German language models, with fully reproducible and extensible corpus construction code.

Abstract: Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in openly licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible.

[65] CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models

Shehenaz Hossain, Haithem Afli

Main category: cs.CL

TL;DR: CRaFT is an explanation-based multilingual evaluation framework that assesses LLMs’ cultural reasoning across languages using four metrics, revealing significant cross-lingual variations in cultural understanding.

Details

Motivation: Current evaluation methods focus on answer accuracy but fail to capture cultural understanding. There's a need to assess how LLMs reason across different cultural contexts through their explanations rather than just correct answers.

Method: CRaFT evaluates model explanations using four metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. Applied to 50 culturally grounded questions from World Values Survey translated into Arabic, Bengali, and Spanish, testing three models (GPT, DeepSeek, FANAR) across 2,100+ answer-explanation pairs.

Result: Significant cross-lingual variation: Arabic reduces cultural fluency, Bengali enhances it, Spanish remains stable. GPT adapts better across languages but has lower consistency; FANAR shows stable but rigid reasoning. Cultural awareness emerges through linguistic framing rather than being intrinsic.

Conclusion: CRaFT provides a new framework for evaluating cross-cultural reasoning in multilingual settings, offering actionable insights for building culturally adaptive language models. Cultural understanding in LLMs is shaped by linguistic context rather than being an inherent capability.

Abstract: Correct answers do not necessarily reflect cultural understanding. We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts. Rather than scoring outputs solely based on accuracy, CRaFT evaluates model explanations using four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs. Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. While GPT adapts more effectively across languages, it exhibits lower consistency; FANAR shows stable but rigid reasoning. These findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers a new lens for evaluating cross-cultural reasoning in multilingual settings, providing actionable insights for building culturally adaptive language models.

[66] Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games

César Guerra-Solano, Zhuochun Li, Xiang Lorraine Li

Main category: cs.CL

TL;DR: The paper evaluates linguistic biases in LLMs’ abstract reasoning capabilities across multiple languages using a GlobalGroup task inspired by NYT Connections, finding English modalities perform better and revealing performance disparities between open- and closed-source models.

Details

Motivation: To address the gap in evaluating linguistic biases in abstract reasoning tasks, as most previous work focused on reasoning tasks that rely on strategies or knowledge like commonsense or math, rather than "out-of-the-box thinking" required for everyday reasoning.

Method: Created GlobalGroup benchmark with five languages (English, Spanish, Chinese, Hindi, Arabic) in both native language and English translation, with game difficulty measurements for controlled comparison across languages.

Result: English modalities largely led to better performance in abstract reasoning tasks, and there were performance disparities between open- and closed-source models.

Conclusion: The study reveals significant linguistic biases in LLMs’ abstract reasoning capabilities, with English performing best, highlighting the need for more balanced multilingual reasoning evaluations.

Abstract: Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply “out-of-the-box thinking” to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times Connections: GlobalGroup, that evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds – English, Spanish, Chinese, Hindi, and Arabic – in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find English modalities largely lead to better performance in this abstract reasoning task, and performance disparities between open- and closed-source models.

[67] Quantifying Phonosemantic Iconicity Distributionally in 6 Languages

George Flint, Kaustubh Kislay

Main category: cs.CL

TL;DR: This paper quantifies phonosemantic iconicity across 6 languages using distributional analysis of phonetic and semantic similarity spaces, discovering new systematic relationships and testing previously hypothesized alignments.

Details

Motivation: To investigate the degree to which systematic relationships between phonetics and semantics manifest at scale, both for previously identified and unidentified phenomena, challenging the common theory that language is largely arbitrary.

Method: Distributional approach analyzing alignment of morphemes’ phonetic and semantic similarity spaces across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, Tamil) using statistical measures.

Result: Discovered an array of interpretable phonosemantic alignments not previously identified, along with crosslinguistic patterns. Found support for some previously hypothesized alignments and mixed results for others.

Conclusion: Systematic phonosemantic relationships exist at scale across diverse languages, with both newly discovered patterns and validation of some existing hypotheses, suggesting language may be less arbitrary than commonly theorized.

Abstract: Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large scale, quantitative investigations–both in previously identified and unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes’ phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.

[68] ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

Haziq Mohammad Khalid, Athikash Jeyaganthan, Timothy Do, Yicheng Fu, Sean O’Brien, Vasu Sharma, Kevin Zhu

Main category: cs.CL

TL;DR: ERGO improves LLM performance in multi-turn conversations by detecting uncertainty spikes via entropy and triggering adaptive prompt consolidation, achieving 56.6% performance gain.

Details

Motivation: LLMs suffer significant performance degradation in multi-turn conversations when information is presented incrementally, posing challenges to real-world usability.

Method: ERGO continuously quantifies internal uncertainty via Shannon entropy over next token distributions and triggers adaptive prompt consolidation when sharp entropy spikes are detected.

Result: ERGO yields 56.6% average performance gain over baselines, increases aptitude by 24.7%, and decreases unreliability by 35.3% in multi-turn tasks with incrementally revealed instructions.

Conclusion: Uncertainty-aware interventions can improve both accuracy and reliability in conversational AI by treating uncertainty as a signal rather than a nuisance.

Abstract: Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally. Given that multi-turn conversations characterize everyday interactions with LLMs, this degradation poses a severe challenge to real world usability. We hypothesize that abrupt increases in model uncertainty signal misalignment in multi-turn LLM interactions, and we exploit this insight to dynamically realign conversational context. We introduce ERGO (Entropy-guided Resetting for Generation Optimization), which continuously quantifies internal uncertainty via Shannon entropy over next token distributions and triggers adaptive prompt consolidation when a sharp spike in entropy is detected. By treating uncertainty as a first class signal rather than a nuisance to eliminate, ERGO embraces variability in language and modeling, representing and responding to uncertainty. In multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over standard baselines, increases aptitude (peak performance capability) by 24.7%, and decreases unreliability (variability in performance) by 35.3%, demonstrating that uncertainty aware interventions can improve both accuracy and reliability in conversational AI.

[69] DROID: Dual Representation for Out-of-Scope Intent Detection

Wael Rashwan, Hossam M. Zawbaa, Sourav Dutta, Haytham Assem

Main category: cs.CL

TL;DR: DROID is a compact dual-encoder framework for out-of-scope intent detection that combines universal and domain-specific representations with a lightweight classifier, achieving significant performance improvements over state-of-the-art methods.

Details

Motivation: Existing approaches for out-of-scope intent detection often rely on strong distributional assumptions or auxiliary calibration modules, which can be limiting for practical deployment in task-oriented dialogue systems.

Method: Uses two complementary encoders: Universal Sentence Encoder for broad semantic generalization and domain-adapted Transformer-based Denoising Autoencoder for domain-specific distinctions. Fused representations are processed by a lightweight branched classifier with a single calibrated threshold. Incorporates synthetic and open-domain outlier augmentation for boundary learning.

Result: Outperforms recent state-of-the-art baselines across multiple intent benchmarks, achieving macro-F1 improvements of 6-15% for known intents and 8-20% for OOS intents, with most significant gains in low-resource settings. Uses only 1.5M trainable parameters.

Conclusion: Dual-encoder representations with simple calibration can yield robust, scalable, and reliable out-of-scope detection for neural dialogue systems.

Abstract: Detecting out-of-scope (OOS) user utterances remains a key challenge in task-oriented dialogue systems and, more broadly, in open-set intent recognition. Existing approaches often depend on strong distributional assumptions or auxiliary calibration modules. We present DROID (Dual Representation for Out-of-Scope Intent Detection), a compact end-to-end framework that combines two complementary encoders – the Universal Sentence Encoder (USE) for broad semantic generalization and a domain-adapted Transformer-based Denoising Autoencoder (TSDAE) for domain-specific contextual distinctions. Their fused representations are processed by a lightweight branched classifier with a single calibrated threshold that separates in-domain and OOS intents without post-hoc scoring. To enhance boundary learning under limited supervision, DROID incorporates both synthetic and open-domain outlier augmentation. Despite using only 1.5M trainable parameters, DROID consistently outperforms recent state-of-the-art baselines across multiple intent benchmarks, achieving macro-F1 improvements of 6–15% for known and 8–20% for OOS intents, with the most significant gains in low-resource settings. These results demonstrate that dual-encoder representations with simple calibration can yield robust, scalable, and reliable OOS detection for neural dialogue systems.

[70] Toward Cybersecurity-Expert Small Language Models

Matan Levi, Daniel Ohayon, Ariel Blobstein, Ravid Sagi, Ian Molloy, Yair Allouche

Main category: cs.CL

TL;DR: CyberPal 2.0 is a family of cybersecurity-expert small language models (4B-20B parameters) that outperforms larger frontier models on cybersecurity tasks through specialized training with enriched chain-of-thought instruction data.

Details

Motivation: LLMs have limited deployment in cybersecurity due to lack of high-quality, domain-specific models and training datasets, creating a gap that needs to be addressed.

Method: Created CyberPal 2.0 SLMs using SecKnowledge 2.0 pipeline for data enrichment and formatting, integrating expert-in-the-loop steering with LLM-driven multi-step grounding to generate higher-fidelity reasoning traces for security tasks.

Result: CyberPal 2.0 consistently outperforms baselines and matches/surpasses various open and closed-source frontier models despite being much smaller. On threat intelligence tasks, it ranks second only to Sec-Gemini v1. On threat investigation tasks, the 20B model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1, ranking first.

Conclusion: Specialized small language models trained with domain-specific enriched data can achieve superior performance in cybersecurity tasks compared to larger general-purpose models, demonstrating the value of targeted domain adaptation.

Abstract: Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal 2.0, a family of cybersecurity-expert small language models (SLMs) ranging from 4B-20B parameters. To train CyberPal 2.0, we generate an enriched chain-of-thought cybersecurity instruction dataset built with our data enrichment and formatting pipeline, SecKnowledge 2.0, which integrates expert-in-the-loop steering of reasoning formats alongside LLM-driven multi-step grounding, yielding higher-fidelity, task-grounded reasoning traces for security tasks. Across diverse cybersecurity benchmarks, CyberPal 2.0 consistently outperforms its baselines and matches or surpasses various open and closed-source frontier models, while remaining a fraction of their size. On core cyber threat intelligence knowledge tasks, our models outperform almost all tested frontier models, ranking second only to Sec-Gemini v1. On core threat-investigation tasks, such as correlating vulnerabilities and bug tickets with weaknesses, our best 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1, ranking first, while our smallest 4B-parameter model ranks second.

[71] Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis

Darko Sasanski, Dimitar Peshevski, Riste Stojanov, Dimitar Trajanov

Main category: cs.CL

TL;DR: First systematic construction of a Macedonian recipe dataset through web scraping and structured parsing, addressing ingredient normalization challenges and analyzing distinctive ingredient combinations in Macedonian cuisine.

Details

Motivation: Macedonian recipes are under-represented in digital research despite the need for diverse, high-quality recipe datasets to capture regional culinary traditions in computational gastronomy.

Method: Web scraping and structured parsing of Macedonian recipes, with normalization of heterogeneous ingredient descriptions (units, quantities, descriptors), followed by exploratory analysis using Pointwise Mutual Information and Lift score to identify ingredient co-occurrence patterns.

Result: Created the first Macedonian recipe dataset and identified distinctive ingredient combinations that characterize Macedonian cuisine through frequency and co-occurrence analysis.

Conclusion: The dataset provides a new resource for studying food culture in underrepresented languages and offers insights into unique patterns of Macedonian culinary tradition.

Abstract: Computational gastronomy increasingly relies on diverse, high-quality recipe datasets to capture regional culinary traditions. Although there are large-scale collections for major languages, Macedonian recipes remain under-represented in digital research. In this work, we present the first systematic effort to construct a Macedonian recipe dataset through web scraping and structured parsing. We address challenges in processing heterogeneous ingredient descriptions, including unit, quantity, and descriptor normalization. An exploratory analysis of ingredient frequency and co-occurrence patterns, using measures such as Pointwise Mutual Information and Lift score, highlights distinctive ingredient combinations that characterize Macedonian cuisine. The resulting dataset contributes a new resource for studying food culture in underrepresented languages and offers insights into the unique patterns of Macedonian culinary tradition.

[72] RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following

Zhichao Wang, Andy Wong, Ruslan Belkin

Main category: cs.CL

TL;DR: RLSR replaces SFT with RL-based approach using semantic similarity rewards, achieving better instruction-following performance than SFT alone.

Details

Motivation: To leverage extensive SFT datasets in RL framework for improving base model's instruction-following ability, inspired by RFT's RL approach for domain adaptation.

Method: RLSR generates multiple responses per prompt and computes reward scores as cosine similarity in semantic embedding space between generated and human-labeled responses.

Result: RLSR outperformed SFT on instruction-following benchmarks: 26.34% vs 21.01% AlpacaEval win rate on Qwen-7B. Combined SFT+RLSR achieved 30.73% win rate.

Conclusion: RLSR effectively replaces or complements SFT, demonstrating superior instruction-following capability through RL-based semantic similarity rewards.

Abstract: After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and enable efficient domain adaptation with minimal data. SFT relies on the next-token prediction objective to strengthen instruction following in a base model using a large corpus of human-labeled responses. In contrast, RFT employs a RL-based approach to adapt fine-tuned reasoning models to specific domains with limited supervision. Inspired by RFT, we propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework, thereby improving the base model’s instruction-following ability. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses. RLSR can be utilized in multiple ways. It can directly replace SFT, achieving superior performance on instruction-following benchmarks-for example, RLSR (SB) on Qwen-7B (INFINITY) achieved an AlpacaEval win rate of 26.34%, surpassing SFT’s 21.01%. Furthermore, combining SFT and RLSR further enhances downstream task performance; Qwen-7B (INFINITY) achieved a win rate of 30.73% when trained with SFT + RLSR.

Bingsheng Yao, Bo Sun, Yuanzhe Dong, Yuxuan Lu, Dakuo Wang

Main category: cs.CL

TL;DR: DPRF is a framework that iteratively refines persona profiles for LLM role-playing agents by identifying and mitigating cognitive divergences between generated behaviors and human ground truth, improving behavioral alignment across diverse scenarios.

Details

Motivation: Current LLM role-playing agents suffer from low persona fidelity due to manually-created profiles without proper validation of alignment with target individuals, undermining their behavioral authenticity.

Method: Dynamic Persona Refinement Framework (DPRF) that iteratively identifies cognitive divergence (through free-form or theory-grounded structured analysis) between generated behaviors and human ground truth, then refines persona profiles to mitigate these divergences.

Result: DPRF consistently improves behavioral alignment considerably over baseline personas and generalizes across five LLMs and four diverse behavior-prediction scenarios (formal debates, social media posts with mental health issues, public interviews, and movie reviews).

Conclusion: DPRF provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications like user simulation, social studies, and personalized AI.

Abstract: The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF).DPRF aims to optimize the alignment of LLM RPAs’ behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences.We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews.DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios.Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.

[74] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Beomseok Kang, Jiwon Song, Jae-Joon Kim

Main category: cs.CL

TL;DR: LiteStage is a latency-aware layer skipping framework that accelerates multi-stage reasoning in small language models by combining stage-wise layer budget allocation with confidence-based early exit to reduce redundant decoding.

Details

Motivation: Multi-stage reasoning improves reasoning capability but increases latency. Existing adaptive acceleration techniques like layer skipping struggle to balance efficiency and accuracy due to stage-wise variation in skip sensitivity and redundant output token generation.

Method: Proposes LiteStage framework with two components: (1) stage-wise offline search for optimal layer budget allocation, and (2) online confidence-based generation early exit to suppress unnecessary decoding.

Result: Achieves up to 1.70x speedup with less than 4.0% accuracy loss on benchmarks including OBQA, CSQA, and StrategyQA, outperforming prior training-free layer skipping methods.

Conclusion: LiteStage effectively balances efficiency and accuracy in multi-stage reasoning by addressing stage-wise sensitivity variation and redundant token generation through optimized layer skipping and early exit mechanisms.

Abstract: Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks, e.g., OBQA, CSQA, and StrategyQA, show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.

[75] Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

Parsa Hejabi, Elnaz Rahmati, Alireza S. Ziabari, Morteza Dehghani

Main category: cs.CL

TL;DR: Flip-Flop Consistency (F²C) is an unsupervised training method that improves LLM robustness to prompt perturbations through consensus cross-entropy and representation alignment.

Details

Motivation: LLMs often produce inconsistent answers when faced with different phrasings of the same prompt, which reduces reliability and performance.

Method: F²C uses two components: Consensus Cross-Entropy (CCE) creates hard pseudo-labels via majority vote across prompt variations, and representation alignment loss pulls lower-confidence predictors toward the consensus.

Result: On 11 datasets across 4 NLP tasks, F²C increased agreement by 11.62%, improved mean F1 by 8.94%, and reduced performance variance by 3.29%. It also generalized well in out-of-domain evaluations.

Conclusion: F²C is an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations.

Abstract: Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves robustness to such perturbations. $F^2C$ is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%, improves mean $F_1$ by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively, increasing $\overline{F_1}$ and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, $F^2C$ consistently improves both performance and agreement while reducing variance. These findings highlight $F^2C$ as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at https://github.com/ParsaHejabi/Flip-Flop-Consistency-Unsupervised-Training-for-Robustness-to-Prompt-Perturbations-in-LLMs.

[76] MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems

Jihao Zhao, Zhiyuan Ji, Simin Niu, Hanyu Wang, Feiyu Xiong, Zhiyu Li

Main category: cs.CL

TL;DR: The paper proposes MoM framework that transforms RAG from passive chunking to proactive document memory extraction, enabling small language models to acquire human-like reading abilities through structured chunking, multi-path evaluation, and reverse reasoning.

Details

Motivation: Traditional RAG systems are limited by passive text chunking, which restricts knowledge internalization and reasoning capabilities. The research aims to simulate human cognitive processes during reading by proactively extracting document memories.

Method: Proposes Mixtures of scenario-aware document Memories (MoM) framework that: 1) Uses LLMs to generate document logical outlines for structured chunking, 2) Employs multi-path sampling and multi-perspective evaluation with metrics for chunk clarity and extraction completeness, 3) Incorporates reverse reasoning to deduce expert thinking paths, 4) Develops three-layer document memory retrieval mechanism.

Result: Extensive experiments across three domains show MoM resolves text chunking challenges in existing RAG systems, provides LLMs with semantically complete document memories, and enables SLMs to achieve human-centric intelligent text processing.

Conclusion: The MoM framework successfully transforms RAG from passive chunking to proactive understanding, paving the way for small language models to achieve human-like intelligent text processing capabilities through structured memory extraction and cognitive simulation.

Abstract: The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.

[77] Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation

Yilun Zheng, Dan Yang, Jie Li, Lin Shang, Lihui Chen, Jiahao Xu, Sitao Luan

Main category: cs.CL

TL;DR: DEG-RAG is a framework that denoises LLM-generated knowledge graphs through entity resolution and triple reflection, improving Graph-based RAG performance by creating more compact, higher-quality KGs.

Details

Motivation: Graph-based RAG systems rely on LLMs for KG construction, but this often produces noisy KGs with redundant entities and unreliable relationships, which degrades retrieval/generation performance and increases computational costs.

Method: Uses entity resolution to eliminate redundant entities and triple reflection to remove erroneous relations, with systematic evaluation of blocking strategies, embeddings, similarity metrics, and entity merging techniques.

Result: The approach drastically reduces graph size and consistently improves question answering performance across diverse Graph-based RAG variants.

Conclusion: DEG-RAG provides the first comprehensive exploration of entity resolution in LLM-generated KGs, demonstrating that denoising techniques yield significant performance improvements in Graph-based RAG systems.

Abstract: Retrieval-Augmented Generation (RAG) systems enable large language models (LLMs) instant access to relevant information for the generative process, demonstrating their superior performance in addressing common LLM challenges such as hallucination, factual inaccuracy, and the knowledge cutoff. Graph-based RAG further extends this paradigm by incorporating knowledge graphs (KGs) to leverage rich, structured connections for more precise and inferential responses. A critical challenge, however, is that most Graph-based RAG systems rely on LLMs for automated KG construction, often yielding noisy KGs with redundant entities and unreliable relationships. This noise degrades retrieval and generation performance while also increasing computational cost. Crucially, current research does not comprehensively address the denoising problem for LLM-generated KGs. In this paper, we introduce DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), a framework that addresses these challenges through: (1) entity resolution, which eliminates redundant entities, and (2) triple reflection, which removes erroneous relations. Together, these techniques yield more compact, higher-quality KGs that significantly outperform their unprocessed counterparts. Beyond the methods, we conduct a systematic evaluation of entity resolution for LLM-generated KGs, examining different blocking strategies, embedding choices, similarity metrics, and entity merging techniques. To the best of our knowledge, this is the first comprehensive exploration of entity resolution in LLM-generated KGs. Our experiments demonstrate that this straightforward approach not only drastically reduces graph size but also consistently improves question answering performance across diverse popular Graph-based RAG variants.

[78] Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

Rahul Nadkarni, Yanai Elazar, Hila Gonen, Noah A. Smith

Main category: cs.CL

TL;DR: The paper presents an experimental method for studying how training data affects language model behavior by intervening on data batches and retraining models to test data-behavior relationships.

Details

Motivation: To understand the relationship between training data and language model behavior, moving beyond observational analyses to experimental interventions.

Method: A recipe involving: 1) selecting evaluation items from benchmarks, 2) matching relevant documents to those items, 3) modifying documents, 4) retraining model checkpoints, and 5) measuring effects. Uses cooccurrence statistics and information retrieval to identify relevant training documents.

Result: The method supplements past observational analyses linking cooccurrence to model behavior, but shows that current methods for identifying relevant training documents don’t fully explain LMs’ ability to answer knowledge questions correctly.

Conclusion: The outlined recipe enables researchers to test hypotheses about how training data affects model behavior, with code made publicly available for future work.

Abstract: We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches – i.e., ``rewriting history’’ – and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items from a benchmark that measures model behavior, matching relevant documents to those items, and modifying those documents before retraining and measuring the effects. We demonstrate the utility of our recipe through case studies on factual knowledge acquisition in LMs, using both cooccurrence statistics and information retrieval methods to identify documents that might contribute to knowledge learning. Our results supplement past observational analyses that link cooccurrence to model behavior, while demonstrating that extant methods for identifying relevant training documents do not fully explain an LM’s ability to correctly answer knowledge questions. Overall, we outline a recipe that researchers can follow to test further hypotheses about how training data affects model behavior. Our code is made publicly available to promote future work.

[79] PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering

Md Mahadi Hasan Nahid, Davood Rafiei

Main category: cs.CL

TL;DR: An agentic retrieval system using LLMs in a structured loop improves multi-hop QA by decomposing questions and iteratively selecting/adding evidence to achieve high precision and recall.

Details

Motivation: Retrieval is crucial for multi-hop QA where complex questions require gathering multiple evidence pieces, but existing methods struggle with balancing precision and recall.

Method: Three specialized agents: Question Analyzer decomposes questions, Selector focuses on precision by identifying relevant context, and Adder focuses on recall by bringing missing evidence. They interact iteratively.

Result: Achieves higher retrieval accuracy while filtering distracting content, enabling downstream QA models to surpass full-context accuracy with less irrelevant information across four benchmarks.

Conclusion: The agentic retrieval system consistently outperforms strong baselines on multi-hop QA benchmarks by effectively balancing precision and recall through structured agent interaction.

Abstract: Retrieval plays a central role in multi-hop question answering (QA), where answering complex questions requires gathering multiple pieces of evidence. We introduce an Agentic Retrieval System that leverages large language models (LLMs) in a structured loop to retrieve relevant evidence with high precision and recall. Our framework consists of three specialized agents: a Question Analyzer that decomposes a multi-hop question into sub-questions, a Selector that identifies the most relevant context for each sub-question (focusing on precision), and an Adder that brings in any missing evidence (focusing on recall). The iterative interaction between Selector and Adder yields a compact yet comprehensive set of supporting passages. In particular, it achieves higher retrieval accuracy while filtering out distracting content, enabling downstream QA models to surpass full-context answer accuracy while relying on significantly less irrelevant information. Experiments on four multi-hop QA benchmarks – HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG – demonstrates that our approach consistently outperforms strong baselines.

[80] Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters

Lifu Tu, Yingbo Zhou, Semih Yavuz

Main category: cs.CL

TL;DR: Small multilingual models lag behind larger ones in retrieval tasks. This work investigates how to retrofit smaller models for better retrieval performance by optimizing training data scale, negative sampling, and data diversity.

Details

Motivation: Small multilingual models (<1B parameters) perform well on general tasks but consistently underperform larger models (>1B) in retrieval tasks, raising the question of whether smaller models can be specifically optimized for retrieval.

Method: Investigated key factors influencing multilingual embeddings: training data scale, negative sampling strategies, and data diversity. Found that hard negatives are essential and task diversity matters more than language diversity alone.

Result: Developed a compact ~300M parameter multilingual model that achieves retrieval performance comparable to or surpassing current strong 7B models, despite being much smaller.

Conclusion: Smaller models can be effectively retrofitted for retrieval tasks through strategic optimization of training factors, particularly by incorporating hard negatives and focusing on task diversity rather than just language diversity.

Abstract: Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. Although small multilingual models (<1 B parameters) perform well on multilingual tasks generally, they consistently lag behind larger models (>1 B) in the most prevalent use case: retrieval. This raises a critical question: Can smaller models be retrofitted specifically for retrieval tasks to enhance their performance? In this work, we investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We find that while increasing the scale of training data yields initial performance gains, these improvements quickly plateau - indicating diminishing returns. Incorporating hard negatives proves essential for consistently improving retrieval accuracy. Furthermore, our analysis reveals that task diversity in the training data contributes more significantly to performance than language diversity alone. As a result, we develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.

[81] Qwen3Guard Technical Report

Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, Baosong Yang, Chen Cheng, Jialong Tang, Jiandong Jiang, Jianwei Zhang, Jijie Xu, Ming Yan, Minmin Sun, Pei Zhang, Pengjun Xie, Qiaoyu Tang, Qin Zhu, Rong Zhang, Shibin Wu, Shuo Zhang, Tao He, Tianyi Tang, Tingyu Xia, Wei Liao, Weizhou Shen, Wenbiao Yin, Wenmeng Zhou, Wenyuan Yu, Xiaobin Wang, Xiaodong Deng, Xiaodong Xu, Xinyu Zhang, Yang Liu, Yeqiu Li, Yi Zhang, Yong Jiang, Yu Wan, Yuxin Zhou

Main category: cs.CL

TL;DR: Qwen3Guard introduces multilingual safety guardrail models with two variants: Generative (for fine-grained tri-class safety judgments) and Stream (for real-time token-level safety monitoring during generation), addressing limitations of existing binary-only and non-streaming safety models.

Details

Motivation: Existing safety guardrail models have two major limitations: (1) binary safe/unsafe labels are inconsistent across diverse safety policies and can't accommodate varying safety tolerances, (2) they require complete model outputs before safety checks, making them incompatible with streaming inference and preventing timely intervention.

Method: Developed Qwen3Guard series with two specialized variants: Generative Qwen3Guard casts safety classification as instruction-following for tri-class judgments (safe, controversial, unsafe), and Stream Qwen3Guard uses token-level classification for real-time safety monitoring during incremental text generation. Available in three sizes (0.6B, 4B, 8B) supporting 119 languages.

Result: Achieves state-of-the-art performance across English, Chinese, and multilingual benchmarks in both prompt and response safety classification. Provides comprehensive, scalable, and low-latency safety moderation for global LLM deployments.

Conclusion: Qwen3Guard addresses critical limitations of existing safety models by enabling fine-grained safety judgments and real-time monitoring during streaming generation, making it suitable for diverse safety policies and timely intervention in LLM deployments.

Abstract: As large language models (LLMs) become more capable and widely used, ensuring the safety of their outputs is increasingly critical. Existing guardrail models, though useful in static evaluation settings, face two major limitations in real-world applications: (1) they typically output only binary “safe/unsafe” labels, which can be interpreted inconsistently across diverse safety policies, rendering them incapable of accommodating varying safety tolerances across domains; and (2) they require complete model outputs before performing safety checks, making them fundamentally incompatible with streaming LLM inference, thereby preventing timely intervention during generation and increasing exposure to harmful partial outputs. To address these challenges, we present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants: Generative Qwen3Guard, which casts safety classification as an instruction-following task to enable fine-grained tri-class judgments (safe, controversial, unsafe); and Stream Qwen3Guard, which introduces a token-level classification head for real-time safety monitoring during incremental text generation. Both variants are available in three sizes (0.6B, 4B, and 8B parameters) and support up to 119 languages and dialects, providing comprehensive, scalable, and low-latency safety moderation for global LLM deployments. Evaluated across English, Chinese, and multilingual benchmarks, Qwen3Guard achieves state-of-the-art performance in both prompt and response safety classification. All models are released under the Apache 2.0 license for public use.

[82] Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL

Md Mahadi Hasan Nahid, Davood Rafiei, Weiwei Zhang, Yong Zhang

Main category: cs.CL

TL;DR: A context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem, combining table-first and column-first strategies with question decomposition techniques to improve Text-to-SQL performance.

Details

Motivation: Schema linking is a critical but underexplored component of Text-to-SQL systems, where current methods focus on SQL generation but neglect schema element retrieval, leading to hallucinations and execution failures.

Method: Proposes a bidirectional schema retrieval framework with two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection, augmented with question decomposition, keyword extraction, and keyphrase extraction.

Result: Significantly improves schema recall while reducing false positives on BIRD and Spider benchmarks. SQL generation using retrieved schema outperforms full-schema baselines and approaches oracle performance, narrowing the performance gap between full and perfect schema settings by 50%.

Conclusion: Schema linking is a powerful lever for enhancing Text-to-SQL accuracy and efficiency, and treating it as a standalone problem yields substantial improvements without requiring query refinement.

Abstract: Schema linking – the process of aligning natural language questions with database schema elements – is a critical yet underexplored component of Text-to-SQL systems. While recent methods have focused primarily on improving SQL generation, they often neglect the retrieval of relevant schema elements, which can lead to hallucinations and execution failures. In this work, we propose a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem. Our approach combines two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection. It is further augmented with techniques such as question decomposition, keyword extraction, and keyphrase extraction. Through comprehensive evaluations on challenging benchmarks such as BIRD and Spider, we demonstrate that our method significantly improves schema recall while reducing false positives. Moreover, SQL generation using our retrieved schema consistently outperforms full-schema baselines and closely approaches oracle performance, all without requiring query refinement. Notably, our method narrows the performance gap between full and perfect schema settings by 50%. Our findings highlight schema linking as a powerful lever for enhancing Text-to-SQL accuracy and efficiency.

[83] Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers

Ziye Xia, Sergei S. Ospichev

Main category: cs.CL

TL;DR: This paper proposes a prompt engineering-based method for analyzing key concept paths in academic papers using small language models and knowledge graph constraints to identify innovation points and rare paths.

Details

Motivation: The rapid growth of academic publications makes it difficult for scientists to track latest research. Existing paper databases only do basic concept matching and classification, failing to explore deep relational networks between concepts.

Method: Based on OpenAlex knowledge graph, analyzed 8,000 papers from Novosibirsk State University. Uses prompt engineering with small language models for key concept extraction and innovation identification, enhanced by knowledge graph constraint agents.

Result: Discovered strong correlation between paper key concept path distribution patterns and innovation points/rare paths. Fine-tuned Qwen and DeepSeek models achieved significant accuracy improvements, with models available on Hugging Face.

Conclusion: The proposed method successfully enables deeper analysis of concept relationships and innovation identification in academic papers through combined use of language models and knowledge graph constraints.

Abstract: In recent years, the rapid increase in academic publications across various fields has posed severe challenges for academic paper analysis: scientists struggle to timely and comprehensively track the latest research findings and methodologies. Key concept extraction has proven to be an effective analytical paradigm, and its automation has been achieved with the widespread application of language models in industrial and scientific domains. However, existing paper databases are mostly limited to similarity matching and basic classification of key concepts, failing to deeply explore the relational networks between concepts. This paper is based on the OpenAlex opensource knowledge graph. By analyzing nearly 8,000 open-source paper data from Novosibirsk State University, we discovered a strong correlation between the distribution patterns of paper key concept paths and both innovation points and rare paths. We propose a prompt engineering-based key concept path analysis method. This method leverages small language models to achieve precise key concept extraction and innovation point identification, and constructs an agent based on a knowledge graph constraint mechanism to enhance analysis accuracy. Through fine-tuning of the Qwen and DeepSeek models, we achieved significant improvements in accuracy, with the models publicly available on the Hugging Face platform.

[84] MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking

Sathyanarayanan Ramamoorthy, Vishwa Shah, Simran Khanuja, Zaid Sheikh, Shan Jie, Ann Chia, Shearman Chua, Graham Neubig

Main category: cs.CL

TL;DR: MERLIN is a testbed system for Multilingual Multimodal Entity Linking, featuring a dataset with BBC news titles and images in 5 languages, containing 7,000 entity mentions linked to 2,500 Wikidata entities.

Details

Motivation: To address the challenge of entity linking when textual context is ambiguous or insufficient, particularly in multilingual settings where models may have limited capabilities.

Method: Created a dataset with BBC news article titles and corresponding images in Hindi, Japanese, Indonesian, Vietnamese, and Tamil. Benchmarked using multilingual and multimodal entity linking methods with language models like LLaMa-2 and Aya-23.

Result: Incorporating visual data improves entity linking accuracy, especially for ambiguous textual contexts and for models with weaker multilingual abilities.

Conclusion: Visual data enhances entity linking performance, particularly in multilingual scenarios where textual information alone may be insufficient or ambiguous.

Abstract: This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The created dataset includes BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks using multilingual and multimodal entity linking methods exploring different language models like LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. For the work, the dataset, methods are available here at https://github.com/rsathya4802/merlin

[85] MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning

Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda

Main category: cs.CL

TL;DR: MathMist is a parallel multilingual benchmark for mathematical reasoning with 21K aligned question-answer pairs across 7 languages, revealing LLMs’ persistent deficiencies in cross-lingual mathematical reasoning.

Details

Motivation: Existing benchmarks focus primarily on English or high-resource languages, leaving gaps in assessing multilingual and cross-lingual mathematical reasoning capabilities of LLMs.

Method: Created MathMist benchmark with aligned question-answer pairs across 7 languages covering high-, medium-, and low-resource settings. Evaluated diverse LLMs under zero-shot, chain-of-thought, and code-switched reasoning paradigms.

Result: Results show persistent deficiencies in LLMs’ ability to perform consistent mathematical reasoning across languages, with pronounced degradation in low-resource settings.

Conclusion: LLMs struggle with consistent and interpretable mathematical reasoning across different languages, particularly in low-resource linguistic settings, highlighting the need for improved multilingual mathematical reasoning capabilities.

Abstract: Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs’ ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All the codes and data are available at GitHub: https://github.com/mahbubhimel/MathMist

[86] Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine

Main category: cs.CL

TL;DR: LLMs exhibit deceptive behavior in ~26% of dialogue turns naturally, increasing to 57% when prompted to deceive. RLHF-trained models still show 43% deception. A new belief misalignment metric correlates better with human judgments, and a multi-turn RL method reduces deception by 77.6%.

Details

Motivation: LLMs' ability to produce deceptive outputs poses significant safety concerns in real-world applications like customer support, education, and healthcare, due to insufficient safeguards against hallucination, misinformation, and user manipulation.

Method: Proposed belief misalignment metric to quantify deception; evaluated across four dialogue scenarios using five established deception detection metrics; benchmarked eight state-of-the-art models; introduced multi-turn reinforcement learning methodology for fine-tuning.

Result: LLMs naturally exhibit deceptive behavior in ~26% of dialogue turns; when prompted to deceive, deceptiveness increases by up to 31%; RLHF-trained models still show 43% deception on average; new deception measure correlates more closely with human judgments than existing metrics; multi-turn RL fine-tuning reduces deceptive behaviors by 77.6%.

Conclusion: Deception in dialogue develops over interaction history, requiring multi-turn analysis rather than single-utterance approaches. The proposed multi-turn RL methodology effectively reduces deceptive behaviors in LLMs, addressing significant safety concerns in real-world applications.

Abstract: Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.

[87] Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Perapard Ngokpol, Kun Kerdthaisong, Pasin Buakhaw, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot

Main category: cs.CL

TL;DR: The paper introduces Beyond One World, a benchmark for evaluating LLMs’ ability to faithfully portray version-specific characters across different universes, focusing on canonical accuracy and moral reasoning alignment.

Details

Motivation: To address the underexplored capacity of LLMs to faithfully and consistently portray version-specific characters across different storytelling universes, using superhero canons as a rich testbed.

Method: Created a benchmark with 30 iconic heroes and 90 canon-specific versions, featuring two tasks: Canon Events (factual recall) and Moral Dilemmas (ethical scenarios). Proposed Think-Act Matching metric to quantify alignment between reasoning and actions.

Result: Three key findings: (1) Chain-of-thought improves coherence in weaker models but reduces accuracy in stronger ones; (2) Cross-version generalization remains challenging; (3) Models excel at either thinking or acting but rarely both.

Conclusion: Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, providing a challenging evaluation framework for role-playing LLMs.

Abstract: Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters – for example, superheroes across comic and cinematic universes – remains underexplored. Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes. To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios. We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation (“thinking”) from outward decisions (“acting”). We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness. Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.

[88] CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering

Ziad Elshaer, Essam A. Rashed

Main category: cs.CL

TL;DR: A confidence-driven multi-model framework that enhances medical question answering without fine-tuning by routing low-confidence queries to complementary helper models.

Details

Motivation: High-performing medical LLMs typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions.

Method: Two-stage architecture: confidence detection module assesses primary model’s certainty, and adaptive routing mechanism directs low-confidence queries to Helper models with complementary knowledge for collaborative reasoning.

Result: Achieves competitive performance with strong results in PubMedQA (95.0%) and MedMCQA (78.0%). Ablation studies confirm confidence-aware routing with multi-model collaboration substantially outperforms single-model approaches.

Conclusion: Strategic model collaboration offers a practical, computationally efficient pathway to improve medical AI systems, democratizing access to advanced medical AI in resource-limited settings.

Abstract: High-performing medical Large Language Models (LLMs) typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions. This study introduces a confidence-driven multi-model framework that leverages model diversity to enhance medical question answering without fine-tuning. Our framework employs a two-stage architecture: a confidence detection module assesses the primary model’s certainty, and an adaptive routing mechanism directs low-confidence queries to Helper models with complementary knowledge for collaborative reasoning. We evaluate our approach using Qwen3-30B-A3B-Instruct, Phi-4 14B, and Gemma 2 12B across three medical benchmarks; MedQA, MedMCQA, and PubMedQA. Result demonstrate that our framework achieves competitive performance, with particularly strong results in PubMedQA (95.0%) and MedMCQA (78.0%). Ablation studies confirm that confidence-aware routing combined with multi-model collaboration substantially outperforms single-model approaches and uniform reasoning strategies. This work establishes that strategic model collaboration offers a practical, computationally efficient pathway to improve medical AI systems, with significant implications for democratizing access to advanced medical AI in resource-limited settings.

[89] On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How?

Anyun Zhuo, Xuefei Ning, Ningyuan Li, Yu Wang, Pinyan Lu

Main category: cs.CL

TL;DR: This paper investigates LLM resilience against character-level perturbations by inserting invisible Unicode control characters after each input character, finding that many LLMs maintain notable performance despite strong obfuscation.

Details

Motivation: To study LLM robustness against structured character-level perturbations and develop methods to discourage LLM misuse in sensitive applications like online exam systems.

Method: Introduces a practical method that inserts invisible Unicode control characters into text, creating character-level perturbations that fragment tokenization and reduce signal-to-noise ratio.

Result: Surprisingly, despite strong obfuscation, many LLMs maintain notable performance. The study examines robustness across model-, problem-, and noise-related configurations, exploring character-level tokenization handling and implicit vs explicit denoising mechanisms.

Conclusion: The findings on LLMs’ low-level robustness highlight risks of misuse and inform the reliability of deploying LLMs across diverse applications, particularly in scenarios requiring protection against automated misuse.

Abstract: This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations, specifically through the insertion of noisy characters after each input character. We introduce \nameshort{}, a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse in scenarios such as online exam systems. Surprisingly, despite strong obfuscation that fragments tokenization and reduces the signal-to-noise ratio significantly, many LLMs still maintain notable performance. Through comprehensive evaluation across model-, problem-, and noise-related configurations, we examine the extent and mechanisms of this robustness, exploring both the handling of character-level tokenization and \textit{implicit} versus \textit{explicit} denoising mechanism hypotheses of character-level noises. We hope our findings on the low-level robustness of LLMs will shed light on the risks of their misuse and on the reliability of deploying LLMs across diverse applications.

[90] From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program

Joseph E. Trujillo-Falcon, Monica L. Bozeman, Liam E. Llewellyn, Samuel T. Halvorson, Meryl Mizell, Stuti Deshpande, Bob Manning, Todd Fagin

Main category: cs.CL

TL;DR: The NWS is developing an AI-powered automated translation system for weather products to serve non-English speakers, using LLMs adapted for weather terminology and prioritizing languages based on GIS mapping of community needs.

Details

Motivation: To better serve the 68.8 million people in the U.S. who don't speak English at home and advance a Weather-Ready Nation by providing accessible weather information to all communities.

Method: Partnership with LILT using patented training process to adapt large language models for neural machine translation of weather terminology; GIS mapping to identify language needs; integration of ethical AI practices with transparency, fairness, and human oversight.

Result: Development of scalable translation system for Spanish, Simplified Chinese, Vietnamese and other languages; significant reduction in manual translation time; creation of experimental multilingual NWS products including warnings, forecasts, and educational campaigns.

Conclusion: The automated translation system brings the country closer to a national warning system that reaches all Americans, providing accurate, timely, and culturally relevant weather information through ethical AI implementation.

Abstract: To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program’s design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated into a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.

[91] PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora

Mykolas Sveistrys, Richard Kunert

Main category: cs.CL

TL;DR: The paper introduces pluri-hop questions that require aggregation across all documents in repetitive report corpora, proposes PluriHopWIND dataset with 48 multilingual questions from wind industry reports, and presents PluriHopRAG with document-level query decomposition and cross-encoder filtering that achieves 18-52% F1 improvements.

Details

Motivation: Existing QA systems struggle with questions requiring aggregation across all documents in repetitive report corpora (medical records, compliance filings, maintenance logs) where retrieval stopping points are unclear and missing even one passage significantly impacts results.

Method: Proposes PluriHopRAG architecture that decomposes queries into document-level subquestions and uses cross-encoder filtering to discard irrelevant documents before costly LLM reasoning, following a “check all documents individually, filter cheaply” approach.

Result: PluriHopWIND dataset is 8-40% more repetitive than common datasets with higher distractor density. Existing RAG approaches don’t exceed 40% F1 score, while PluriHopRAG achieves relative F1 improvements of 18-52% depending on base LLM.

Conclusion: PluriHopWIND exposes limitations of current QA systems on repetitive corpora, and PluriHopRAG demonstrates that exhaustive retrieval with early filtering outperforms top-k methods for pluri-hop questions requiring aggregation across all documents.

Abstract: Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data - medical records, compliance filings, maintenance logs - require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has higher density of distractor documents, better reflecting practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceed 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a “check all documents individually, filter cheaply” approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG’s performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.

[92] MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

Yingpeng Ning, Yuanyuan Sun, Ling Luo, Yanhua Wang, Yuchen Pan, Hongfei Lin

Main category: cs.CL

TL;DR: MedTrust-Guided Iterative RAG framework improves biomedical QA by reducing hallucinations through citation-aware reasoning, iterative retrieval-verification, and MedTrust-Align Module, achieving 2.4-2.7% accuracy gains over baselines.

Details

Motivation: Current RAG systems in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient evidence verification, undermining response reliability in medical contexts where accuracy is critical.

Method: Three innovations: citation-aware reasoning with Negative Knowledge Assertions, iterative retrieval-verification with Medical Gap Analysis, and MedTrust-Align Module using Direct Preference Optimization with verified positive and hallucination-aware negative samples.

Result: Outperforms competitive baselines on MedMCQA, MedQA, and MMLU-Med benchmarks, achieving best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.

Conclusion: The proposed framework effectively mitigates hallucinations in biomedical QA through structured grounding, iterative verification, and preference optimization, demonstrating significant improvements in factual consistency across multiple model architectures.

Abstract: Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. Our method introduces three key innovations. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained. Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns. Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.

[93] Suicidal Comment Tree Dataset: Enhancing Risk Assessment and Prediction Through Contextual Analysis

Jun Li, Qun Zhao

Main category: cs.CL

TL;DR: This paper shows that analyzing longitudinal comment trees from social media significantly improves suicidal risk prediction compared to single posts alone, using a Reddit dataset annotated with C-SSRS framework.

Details

Motivation: Previous studies focused on single social media posts for suicide detection, but users reveal intentions through historical posts and interactive comments over time. There's limited research on analyzing sequential comment trees for predicting evolving suicidal risk.

Method: Constructed a high-quality annotated dataset from Reddit with users’ posting history and comments, using a four-label annotation framework based on Columbia Suicide Severity Rating Scale (C-SSRS). Conducted statistical analysis and LLM experiments.

Result: Incorporating comment trees data significantly enhances both discrimination and prediction of user suicidal risk levels compared to single posts alone.

Conclusion: This research provides novel insights for improving detection accuracy of at-risk individuals and offers a valuable foundation for early suicide intervention strategies.

Abstract: Suicide remains a critical global public health issue. While previous studies have provided valuable insights into detecting suicidal expressions in individual social media posts, limited attention has been paid to the analysis of longitudinal, sequential comment trees for predicting a user’s evolving suicidal risk. Users, however, often reveal their intentions through historical posts and interactive comments over time. This study addresses this gap by investigating how the information in comment trees affects both the discrimination and prediction of users’ suicidal risk levels. We constructed a high-quality annotated dataset, sourced from Reddit, which incorporates users’ posting history and comments, using a refined four-label annotation framework based on the Columbia Suicide Severity Rating Scale (C-SSRS). Statistical analysis of the dataset, along with experimental results from Large Language Models (LLMs) experiments, demonstrates that incorporating comment trees data significantly enhances the discrimination and prediction of user suicidal risk levels. This research offers a novel insight to enhancing the detection accuracy of at-risk individuals, thereby providing a valuable foundation for early suicide intervention strategies.

[94] Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu

Main category: cs.CL

TL;DR: A self-supervised RL framework that derives reward signals directly from instructions and generates pseudo-labels for reward model training, eliminating dependency on external supervision for multi-constraint instruction following.

Details

Motivation: Language models struggle with multi-constraint instructions crucial for real-world applications, and existing RL approaches suffer from dependency on external supervision and sparse reward signals.

Method: Proposes label-free self-supervised RL with constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency.

Result: Achieves strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following tasks.

Conclusion: The proposed framework effectively addresses multi-constraint instruction following challenges without external supervision and generalizes well across diverse datasets.

Abstract: Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if

[95] Your Next Token Prediction: A Multilingual Benchmark for Personalized Response Generation

Shiyao Ding, Takayuki Ito

Main category: cs.CL

TL;DR: Proposes ‘Your Next Token Prediction (YNTP)’ task to model individual communication styles using controlled human-agent conversations, addressing privacy concerns in collecting real SNS/email data.

Details

Motivation: LLMs struggle to generate responses reflecting how individuals truly communicate due to privacy concerns preventing collection of real SNS/email histories.

Method: Built multilingual benchmark with 100 dialogue sessions across English, Japanese, and Chinese, where users interact for five days with psychologically grounded NPCs based on MBTI dimensions to capture natural communication patterns.

Result: Established the first benchmark for YNTP and evaluated prompt-based and fine-tuning-based personalization methods, providing foundation for user-aligned language modeling.

Conclusion: YNTP enables modeling of users’ precise word choices and internal models through controlled conversations, advancing personalized language generation while addressing privacy limitations.

Abstract: Large language models (LLMs) excel at general next-token prediction but still struggle to generate responses that reflect how individuals truly communicate, such as replying to emails or social messages in their own style. However, real SNS or email histories are difficult to collect due to privacy concerns. To address this, we propose the task of “Your Next Token Prediction (YNTP)”, which models a user’s precise word choices through controlled human-agent conversations. We build a multilingual benchmark of 100 dialogue sessions across English, Japanese, and Chinese, where users interact for five days with psychologically grounded NPCs based on MBTI dimensions. This setup captures natural, daily-life communication patterns and enables analysis of users’ internal models. We evaluate prompt-based and fine-tuning-based personalization methods, establishing the first benchmark for YNTP and a foundation for user-aligned language modeling. The dataset is available at: https://github.com/AnonymousHub4Submissions/your-next-token-prediction-dataset-100

[96] Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents

Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, Kam-Fai Wong

Main category: cs.CL

TL;DR: The paper proposes an Explore to Evolve paradigm to create WebAggregatorQA dataset and WebAggregator foundation models that significantly improve web agents’ information aggregation capabilities, outperforming GPT-4.1 and approaching Claude-3.7-sonnet performance.

Details

Motivation: Existing open-source deep research agents focus primarily on information-seeking but overlook essential information aggregation capabilities, limiting their ability to support in-depth research.

Method: Explore to Evolve paradigm: proactive online exploration to collect grounded web evidence, then self-evolving aggregation programs using 12 high-level logical types to synthesize verifiable QA pairs. Built WebAggregatorQA dataset (10K samples across 50K websites, 11 domains) and developed WebAggregator foundation models based on SmolAgents framework.

Result: WebAggregator-8B matches GPT-4.1 performance, while 32B variant surpasses GPT-4.1 by >10% on GAIA-text and approaches Claude-3.7-sonnet. On WebAggregatorQA benchmark, Claude-3.7-sonnet achieves only 28% and GPT-4.1 scores 25.8%, showing agents struggle with information aggregation even when retrieving all references.

Conclusion: Current web agents lack strong information aggregation capabilities, highlighting the need to strengthen this aspect in foundation models. The proposed approach successfully addresses this gap and demonstrates superior performance on challenging aggregation tasks.

Abstract: Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which would limit their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Begins with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allowed us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents’ information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet only achieves 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.

[97] LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models

Haolin Li, Haipeng Zhang, Mang Li, Yaohua Wang, Lijie Wen, Yu Zhang, Biqing Huang

Main category: cs.CL

TL;DR: LiRA is a training framework that improves cross-lingual representations for low-resource languages through anchored alignment and language-aware reasoning, addressing performance gaps between high-resource and low-resource languages.

Details

Motivation: Large language models perform well on high-resource languages but struggle with low-resource languages due to limited training data, machine-translation noise, and unstable cross-lingual alignment.

Method: LiRA consists of two modules: Arca (anchors low-resource languages to English via anchor-based alignment and multi-agent collaborative encoding) and LaSR (adds language-aware reasoning head with consistency regularization).

Result: Experiments show consistent gains and robustness across low-resource benchmarks in cross-lingual retrieval, semantic similarity, and reasoning under few-shot and noise-amplified settings.

Conclusion: LiRA effectively improves cross-lingual representations for low-resource languages while strengthening retrieval and reasoning capabilities, with both modules contributing to the performance gains.

Abstract: As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca’s multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.

[98] Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents

Reid T. Johnson, Michelle D. Pain, Jordan D. West

Main category: cs.CL

TL;DR: NLT framework replaces programmatic JSON tool calling with natural language outputs in LLMs, improving tool calling accuracy by 18.4 percentage points and reducing output variance by 70%.

Details

Motivation: To eliminate task interference and format constraints that degrade tool call performance in LLMs by decoupling tool selection from response generation.

Method: Natural Language Tools (NLT) framework that replaces programmatic JSON tool calling with natural language outputs, evaluated across 10 models and 6,400 trials in customer service and mental health domains.

Result: 18.4 percentage point improvement in tool calling accuracy, 70% reduction in output variance. Open-weight models showed largest gains, surpassing closed-weight alternatives. Improvements persisted under prompt perturbations and extended capabilities to models without native support.

Conclusion: NLT framework significantly improves tool calling performance, with implications for model training in both reinforcement learning and supervised fine-tuning stages.

Abstract: We present Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models (LLMs) with natural language outputs. By decoupling tool selection from response generation, NLT eliminates task interference and format constraints that degrade tool call performance. When evaluated across 10 models and 6,400 trials spanning customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points while reducing output variance by 70%. Open-weight models see the largest gains, surpassing flagship closed-weight alternatives, with implications for model training in both reinforcement learning and supervised fine-tuning stages. These improvements persist under prompt perturbations and extend tool-calling capabilities to models lacking native support.

[99] Efficient Seq2seq Coreference Resolution Using Entity Representations

Matt Grenander, Shay B. Cohen, Mark Steedman

Main category: cs.CL

TL;DR: The paper proposes a compressed representation method to improve efficiency of seq2seq coreference models in incremental settings like dialogue, achieving near state-of-the-art performance while significantly reducing computational cost.

Details

Motivation: Seq2seq coreference models achieve state-of-the-art performance but lack flexibility and efficiency for incremental settings where text must be processed sequentially, such as in dialogue systems.

Method: Proposes a compressed representation that extracts and re-organizes entity-level tokens while discarding the majority of other input tokens to improve efficiency in incremental coreference resolution.

Result: On OntoNotes, achieves only 0.6 CoNLL F1 points below full-prefix incremental baseline with 1.8 compression ratio. On LitBank, surpasses state-of-the-art performance while handling singleton mentions.

Conclusion: Discarding a wide portion of tokens in seq2seq resolvers is a feasible strategy for incremental coreference resolution, maintaining strong performance while significantly improving efficiency.

Abstract: Seq2seq coreference models have introduced a new paradigm for coreference resolution by learning to generate text corresponding to coreference labels, without requiring task-specific parameters. While these models achieve new state-of-the-art performance, they do so at the cost of flexibility and efficiency. In particular, they do not efficiently handle incremental settings such as dialogue, where text must processed sequentially. We propose a compressed representation in order to improve the efficiency of these methods in incremental settings. Our method works by extracting and re-organizing entity-level tokens, and discarding the majority of other input tokens. On OntoNotes, our best model achieves just 0.6 CoNLL F1 points below a full-prefix, incremental baseline while achieving a compression ratio of 1.8. On LitBank, where singleton mentions are annotated, it passes state-of-the-art performance. Our results indicate that discarding a wide portion of tokens in seq2seq resolvers is a feasible strategy for incremental coreference resolution.

[100] Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLMs

Kyubyung Chae, Gihoon Kim, Gyuseong Lee, Taesup Kim, Jaejin Lee, Heejin Kim

Main category: cs.CL

TL;DR: This paper addresses the gap in evaluating sovereign LLMs by creating a dataset and framework to assess socio-cultural alignment and technical robustness, finding that current models don’t fully meet claims about serving target users well and may compromise safety.

Details

Motivation: There's growing interest in sovereign LLMs tailored to specific socio-cultural contexts, but a shortage of frameworks to verify their alignment with users' backgrounds and their safety/technical robustness.

Method: Constructed a new dataset and introduced an analytic framework for extracting and evaluating socio-cultural elements of sovereign LLMs alongside technical robustness assessments.

Result: Sovereign LLMs support low-resource languages but don’t always serve target users well as claimed, and pursuing untested claims may lead to underestimating critical safety attributes.

Conclusion: Advancing sovereign LLMs requires more extensive evaluation with broader, well-grounded practical criteria beyond current claims.

Abstract: Recent trends in LLMs development clearly show growing interest in the use and application of sovereign LLMs. The global debate over sovereign LLMs highlights the need for governments to develop their LLMs, tailored to their unique socio-cultural and historical contexts. However, there remains a shortage of frameworks and datasets to verify two critical questions: (1) how well these models align with users’ socio-cultural backgrounds, and (2) whether they maintain safety and technical robustness without exposing users to potential harms and risks. To address this gap, we construct a new dataset and introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness. Our experimental results demonstrate that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always meet the popular claim that these models serve their target users well. We also show that pursuing this untested claim may lead to underestimating critical quality attributes such as safety. Our study suggests that advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded and practical criteria.

[101] Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

Shuangshuang Ying, Yunwen Li, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Xeron Du, Tianyu Zheng, Yichi Zhang, Letian Ni, Yuyang Cheng, Qiguang Chen, Jingzhe Ding, Shengda Long, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Libo Qin, Ge Zhang, Wenhao Huang, Wanxiang Che, Chenghua Lin

Main category: cs.CL

TL;DR: Current preference learning methods fail when objective quality signals are removed. WritingPreferenceBench dataset shows standard reward models achieve only 52.7% accuracy, while generative reward models with reasoning chains reach 81.8%. High variance across genres suggests RLHF learns objective error detection rather than subjective quality preferences.

Details

Motivation: Current preference learning methods show significant performance degradation when objective quality signals are removed, indicating they may not effectively capture subjective quality preferences like creativity and emotional resonance.

Method: Created WritingPreferenceBench dataset with 1,800 human-annotated preference pairs across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. Compared sequence-based reward models, zero-shot language model judges, and generative reward models with explicit reasoning chains.

Result: Sequence-based reward models achieved 52.7% mean accuracy, zero-shot language model judges performed at 53.9%, while generative reward models with reasoning chains achieved 81.8% accuracy. High within-model variance across genres (18.2% to 81.8%) with standard deviations averaging 10.1%. No consistent improvement with model scale (27B vs 8B parameters).

Conclusion: Current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences. Successful preference modeling may require intermediate reasoning representations rather than direct classification.

Abstract: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models–the standard architecture for RLHF–achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.

[102] Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

Kedi Chen, Zhikai Lei, Xu Guo, Xuecheng Wu, Siyuan Zeng, Jianghao Yin, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Qipeng Guo, Kai Chen, Wei Zhang

Main category: cs.CL

TL;DR: CodeSeq is a synthetic post-training dataset for LLMs that teaches inductive reasoning through number sequence problems with iterative correction and reinforcement learning, improving reasoning performance while preserving out-of-distribution capabilities.

Details

Motivation: Current inductive reasoning research faces challenges: existing data lacks complex internal patterns, and current methods don't provide precise thinking processes or difficulty control. The goal is to improve LLMs' inductive reasoning capabilities through better training approaches.

Method: Created CodeSeq dataset from number sequences packaged as algorithmic problems for general term generation. Uses supervised finetuning with iterative corrections from failed test cases, plus reinforcement learning with a novel Case-Synergy Solvability Scaling Reward that considers both problem solvability and self-directed case generation success.

Result: Models trained with CodeSeq show improved performance on various reasoning tasks while maintaining out-of-distribution (OOD) performance.

Conclusion: CodeSeq effectively enhances LLMs’ inductive reasoning capabilities through its synthetic dataset and training methodology, enabling autonomous case generation and self-checking while preserving generalization abilities.

Abstract: Large language models (LLMs) make remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, attracts increasing interest. However, research on inductive reasoning faces certain challenges. First, existing inductive data mostly focuses on superficial regularities while lacking more complex internal patterns. Second, current works merely prompt LLMs or finetune on simple prompt-response pairs, but do not provide precise thinking processes nor implement difficulty control. Unlike previous work, we address these challenges by introducing \textit{CodeSeq}, a synthetic post-training dataset built from number sequences. We package number sequences into algorithmic problems to discover their general terms, defining a general term generation (GTG) task correspondingly. Our pipeline generates supervised finetuning data by reflecting on failed test cases and incorporating iterative corrections, thereby teaching LLMs to learn autonomous case generation and self-checking. Additionally, it leverages reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on both solvability, estimated from the problem pass rate, and the success rate of self-directed case generation, enabling models to learn more effectively from both successes and failures. Experimental results show that the models trained with \textit{CodeSeq} improve on various reasoning tasks and can preserve the models’ OOD performance.

[103] RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

Qing Yang, Zhenghao Liu, Junxin Wang, Yangfan Du, Pengcheng Huang, Tong Xiao

Main category: cs.CL

TL;DR: RLAIF-SPA framework uses reinforcement learning with AI feedback to improve emotional expressiveness in text-to-speech synthesis by optimizing semantic accuracy and prosodic-emotional alignment.

Details

Motivation: Current TTS systems achieve near-human quality in neutral speech but struggle with emotional expressiveness, often producing emotionally flat speech due to reliance on costly annotations or indirect optimization objectives.

Method: Proposes RLAIF-SPA framework using Reinforcement Learning from AI Feedback (RLAIF) with ASR for semantic accuracy and LLM for prosodic-emotional label alignment across four dimensions: Structure, Emotion, Speed, and Tone.

Result: Outperforms Chat-TTS with 26.1% reduction in WER, 9.1% increase in SIM-O, and over 10% improvement in human evaluation on Libri Speech dataset.

Conclusion: The RLAIF-SPA framework effectively enhances emotional expressiveness and intelligibility in TTS synthesis through direct optimization of semantic accuracy and prosodic-emotional alignment.

Abstract: Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.

[104] Intent Clustering with Shared Pseudo-Labels

I-Fan Lin, Faegheh Hasibi, Suzan Verberne

Main category: cs.CL

TL;DR: Training-free, label-free intent clustering method using lightweight LLMs that generates pseudo-labels for texts and performs multi-label classification, achieving comparable or better results than baselines without requiring known cluster counts.

Details

Motivation: Address limitations of current approaches that rely on costly commercial LLMs with limited transparency and require knowing the number of clusters in advance, which is unrealistic in practical settings.

Method: Generate pseudo-labels for each text using LLMs, then perform multi-label classification in this pseudo-label space, leveraging the hypothesis that texts from same clusters share more labels and thus have closer embeddings.

Result: Evaluation on four benchmark sets shows comparable or better performance than recent baselines, with simplicity and computational efficiency.

Conclusion: The method is effective for low-resource scenarios and stable across multiple models and datasets, providing human-readable pseudo-labels instead of direct similarity matching.

Abstract: In this paper, we propose an intuitive, training-free and label-free method for intent clustering that makes minimal assumptions using lightweight and open-source LLMs. Many current approaches rely on commercial LLMs, which are costly, and offer limited transparency. Additionally, their methods often explicitly depend on knowing the number of clusters in advance, which is often not the case in realistic settings. To address these challenges, instead of asking the LLM to match similar text directly, we first ask it to generate pseudo-labels for each text, and then perform multi-label classification in this pseudo-label set for each text. This approach is based on the hypothesis that texts belonging to the same cluster will share more labels, and will therefore be closer when encoded into embeddings. These pseudo-labels are more human-readable than direct similarity matches. Our evaluation on four benchmark sets shows that our approach achieves results comparable to and better than recent baselines, while remaining simple and computationally efficient. Our findings indicate that our method can be applied in low-resource scenarios and is stable across multiple models and datasets.

[105] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs

Linyue Ma, Yilong Xu, Xiang Long, Zhi Zheng

Main category: cs.CL

TL;DR: The paper proposes ’nugget-as-rubric’ paradigm for verifiable reward modeling in search-augmented LLMs, addressing limitations of existing rule-based and generative rewards, and introduces Search-Gen-V, a 4B-parameter efficient generative verifier.

Details

Motivation: Existing reward modeling for search-augmented LLMs faces limitations: rule-based rewards are fragile to expression variations and can't handle long-form workloads, while generative rewards lack verifiability and stability for long-form tasks in dynamic corpora with high computational costs.

Method: Proposes ’nugget-as-rubric’ paradigm treating atomic information points as structured evaluation criteria; designs automatic rubric construction pipeline using query rewriting for long-form tasks; introduces Search-Gen-V, a 4B-parameter efficient generative verifier trained via distillation and two-stage strategy.

Result: Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.

Conclusion: The proposed ’nugget-as-rubric’ paradigm and Search-Gen-V verifier provide a unified, verifiable solution for reward modeling in search-augmented LLMs that addresses limitations of existing approaches while maintaining efficiency and scalability.

Abstract: Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, Reinforcement Learning leverages tailored reward signals as a viable technique to enhance LLMs performing tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, “nugget-as-rubric”, which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question’s information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Furthermore, we introduce \textbf{Search-Gen-V}, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained via the idea of distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.

[106] Semantic Prosody in Machine Translation: the English-Chinese Case of Passive Structures

Xinyue Ma, Pol Pastells, Mireia Farrús, Mariona Taulé

Main category: cs.CL

TL;DR: Teaching machine translation models about semantic prosody of Chinese BEI passives through fine-tuning improves translation accuracy by properly handling negative connotations.

Details

Motivation: Current machine translation models cannot handle semantic prosody - the collocational meaning formed through consistent co-occurrence patterns. Since literal translations may have different semantic prosody, this linguistic property needs attention for accurate translations.

Method: Created a dataset of English-Chinese sentence pairs demonstrating negative semantic prosody of Chinese BEI passives. Fine-tuned OPUS-MT, NLLB-600M and mBART50 models with this dataset for English-Chinese translation task.

Result: Fine-tuned MT models perform better at using BEI passives for translating unfavorable content and avoid using them for neutral and favorable content. In multilingual NLLB-600M, this semantic prosody knowledge transfers to other language pairs like Spanish-Chinese.

Conclusion: The approach successfully teaches machine translation models about semantic prosody, improving translation accuracy and enabling cross-lingual transfer of this linguistic knowledge.

Abstract: Semantic prosody is a collocational meaning formed through the co-occurrence of a linguistic unit and a consistent series of collocates, which should be treated separately from semantic meaning. Since words that are literal translations of each other may have different semantic prosody, more attention should be paid to this linguistic property to generate accurate translations. However, current machine translation models cannot handle this problem. To bridge the gap, we propose an approach to teach machine translation models about semantic prosody of a specific structure. We focus on Chinese BEI passives and create a dataset of English-Chinese sentence pairs with the purpose of demonstrating the negative semantic prosody of BEI passives. Then we fine-tune OPUS-MT, NLLB-600M and mBART50 models with our dataset for the English-Chinese translation task. Our results show that fine-tuned MT models perform better on using BEI passives for translating unfavourable content and avoid using it for neutral and favourable content. Also, in NLLB-600M, which is a multilingual model, this knowledge of semantic prosody can be transferred from English-Chinese translation to other language pairs, such as Spanish-Chinese.

[107] Speculative Model Risk in Healthcare AI: Using Storytelling to Surface Unintended Harms

Xingmeng Zhao, Dan Schumacher, Veronica Rammouz, Anthony Rios

Main category: cs.CL

TL;DR: A human-centered framework using user stories and multi-agent discussions helps identify diverse AI healthcare risks beyond just privacy and well-being concerns.

Details

Motivation: Rapid AI development in healthcare introduces risks of bias, privacy violations, and unequal access, with current automated risk detection methods reducing human engagement in understanding harms.

Method: Human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment.

Result: Participants who read stories recognized a broader range of harms across all 13 harm types, while those without stories focused primarily on privacy and well-being (58.3%).

Conclusion: Storytelling helps participants speculate about a broader range of harms and benefits and think more creatively about AI’s impact on users.

Abstract: Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 13 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (58.3%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI’s impact on users.

[108] AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, Peng Qi

Main category: cs.CL

TL;DR: AutoRubric-R1V is a framework that combines reinforcement learning with verifiable rewards (RLVR) and process-level supervision using automatically collected rubric-based generative rewards to improve multimodal reasoning.

Details

Motivation: Current MLLMs using RLVR often lead to spurious reasoning because only final-answer correctness is rewarded, lacking process-level supervision.

Method: Uses scalable self-aggregation to distill consistent reasoning checkpoints from successful trajectories, enabling automatic rubric construction without human annotation or teacher models. Jointly leverages rubric-based and outcome rewards.

Result: Achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.

Conclusion: AutoRubric-R1V effectively addresses spurious reasoning in MLLMs by integrating process-level supervision through automatic rubric-based rewards.

Abstract: Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.

[109] Pluto: A Benchmark for Evaluating Efficiency of LLM-generated Hardware Code

Manar Abdelatty, Maryam Nouh, Jacob K. Rosenstein, Sherief Reda

Main category: cs.CL

TL;DR: Pluto is a benchmark and evaluation framework for assessing LLM-generated Verilog designs, focusing on synthesis efficiency metrics (area, delay, power) beyond just functional correctness.

Details

Motivation: Existing benchmarks for LLM-generated hardware design focus primarily on functional correctness but lack comprehensive evaluation of synthesis efficiency metrics and optimized baselines.

Method: Developed Pluto framework with 114 problems featuring self-checking testbenches and multiple Pareto-optimal reference implementations to evaluate LLM-generated Verilog designs.

Result: State-of-the-art LLMs achieve 78.3% functional correctness at pass@1, but lag in synthesis efficiency: 63.8% area efficiency, 65.9% delay efficiency, and 64.0% power efficiency at eff@1 compared to expert implementations.

Conclusion: Efficiency-aware evaluation frameworks like Pluto are needed to drive progress in hardware-focused LLM research, as current LLMs fall short in synthesis optimization despite good functional performance.

Abstract: Large Language Models (LLMs) are increasingly used to automate hardware design tasks, including the generation of Verilog code. While early benchmarks focus primarily on functional correctness, efficient hardware design demands additional optimization for synthesis metrics such as area, delay, and power. Existing benchmarks fall short in evaluating these aspects comprehensively: they often lack optimized baselines or testbenches for verification. To address these gaps, we present Pluto, a benchmark and evaluation framework designed to assess the efficiency of LLM-generated Verilog designs. Pluto presents a comprehensive evaluation set of 114 problems with self-checking testbenches and multiple Pareto-optimal reference implementations. Experimental results show that state-of-the-art LLMs can achieve high functional correctness, reaching 78.3% at pass@1, but their synthesis efficiency still lags behind expert-crafted implementations, with area efficiency of 63.8%, delay efficiency of 65.9%, and power efficiency of 64.0% at eff@1. This highlights the need for efficiency-aware evaluation frameworks such as Pluto to drive progress in hardware-focused LLM research.

[110] COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes

Yunwen Li, Shuangshuang Ying, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Tianyu Zheng, Xeron Du, Qiguang Chen, Jiajun Shi, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Libo Qin, Stephen Huang, Wanxiang Che, Chenghua Lin, Eli Zhang

Main category: cs.CL

TL;DR: COIG-Writer is a Chinese creative writing dataset with 1,665 triplets across 51 genres, featuring reverse-engineered prompts, creative reasoning processes, and final texts. The research reveals creative writing requires narrative logic (from process supervision) and linguistic expression (from general data), with optimal performance requiring at least 1:12 creative-to-general data ratio.

Details

Motivation: Large language models have systematic deficiencies in creative writing, especially in non-English contexts where training data is scarce and lacks process-level supervision.

Method: Created COIG-Writer dataset through systematic reverse-engineering of high-quality Chinese creative texts, comprising triplets with prompts, detailed creative reasoning, and final texts across 51 genres.

Result: Process supervision is effective but requires stabilization with general data (1:12 ratio needed). Creative capabilities are culturally-bound with no cross-lingual transfer (89.26pp gap). Lexical diversity inversely correlates with creative quality (TTR paradox).

Conclusion: Creative excellence emerges from interaction between logical scaffolding (narrative logic) and linguistic grounding, similar to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.

Abstract: Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing datasets that provide only input-output pairs, COIG-Writer comprises 1,665 meticulously curated triplets spanning 51 genres, each containing: (1) a reverse-engineered prompt, (2) detailed creative reasoning documenting decision-making processes, and (3) the final text. Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings reveal three critical insights: (1) Process supervision is highly effective but requires stabilization with general data. A ratio of at least one creative sample to twelve general samples is needed to achieve optimal performance; below this threshold, the win rate progressively degrades (from 62.75% down to 35.78%)., (2) creative capabilities are culturally-bound with no cross-lingual transfer (89.26pp gap between Chinese and English performance), and (3) lexical diversity inversely correlates with creative quality (TTR paradox), suggesting high diversity signals compensatory behavior for logical deficiencies. These findings establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.

[111] Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning

Hwiyeol Jo, Joosung Lee, Jaehone Lee, Sang-Woo Lee, Joonsuk Park, Kang Min Yoo

Main category: cs.CL

TL;DR: Answer Regeneration framework improves reasoning model performance by using additional inference with “Answer:” prompt to extract final answers, making evaluation more robust and reliable.

Details

Motivation: Current answer extraction methods for reasoning models are highly sensitive and affect performance evaluation. Need for more robust and reliable evaluation framework.

Method: Answer Regeneration: Additional model inference with prior input/output prefaced by “Answer:” prompt, then extract final answer from regenerated output.

Result: Improved performance and enhanced robustness across math problems and open-ended QA tasks. Framework is extraction-rule-agnostic.

Conclusion: Answer Regeneration provides more reliable model evaluation results and could be applied broadly to reasoning model assessment.

Abstract: Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on probability of answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt “Answer:”. The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer a more reliable results for model evaluation.

[112] Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking

Ziqi Dai, Xin Zhang, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, Min Zhang

Main category: cs.CL

TL;DR: This paper compares contrastive learning (CL) and supervised fine-tuning (SFT) for LLM-based reranking, finding SFT superior due to stronger weighting mechanisms, achieving state-of-the-art results on MRB benchmark.

Details

Motivation: To determine which training objective (contrastive learning vs supervised fine-tuning) is better suited for LLM-based reranking systems, given the divergence in effectiveness between BERT-style encoders and LLMs.

Method: Comprehensive comparison and analysis between CL and SFT for reranking using universal multimodal retrieval (UMR) as experimental platform. Decomposed objectives into weight and direction components, conducted probing experiments and large-scale training.

Result: SFT provides substantially stronger weighting scheme than CL, while preferred scoring direction shows no clear winner. SFT achieves new state-of-the-art rerankers on MRB benchmark.

Conclusion: SFT has consistent advantage over CL for LLM reranking due to superior weighting mechanisms, providing valuable insights for future research and applications in this area.

Abstract: In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts ‘‘yes’’ (resp. ‘’no’’) token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: weight, which controls the magnitude of those updates, and direction, which guides the model updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.

[113] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

Yu Zhou, Sohyun An, Haikang Deng, Da Yin, Clark Peng, Cho-Jui Hsieh, Kai-Wei Chang, Nanyun Peng

Main category: cs.CL

TL;DR: Multimodal generative models suffer significant performance degradation (32-48%) when processing dialectal English inputs, and current mitigation methods are ineffective. The authors propose an encoder-based strategy that improves dialect performance to match Standard American English while preserving SAE performance.

Details

Motivation: Contact languages like English have rich regional dialects, and dialect speakers often interact with generative models. However, it's unclear whether multimodal generative models can effectively produce content from dialectal textual inputs.

Method: Constructed a large-scale benchmark with 4200+ verified prompts across six English dialects. Evaluated 17 image/video generative models. Proposed an encoder-based mitigation strategy that teaches models to recognize dialect features while preserving Standard American English performance.

Result: Current models show 32.26-48.17% performance degradation with dialect inputs. Fine-tuning and prompt rewriting only improve dialect performance by <7% while degrading SAE performance. The proposed method raises performance on five dialects to match SAE (+34.4%) with near-zero cost to SAE performance.

Conclusion: Multimodal generative models struggle significantly with dialectal inputs, but the proposed encoder-based mitigation strategy effectively addresses this issue by enabling dialect recognition without compromising Standard American English performance.

Abstract: Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.

[114] Rewiring Experts on the Fly:Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert models

Guinan Su, Yanwu Yang, Li Shen, Lu Yin, Shiwei Liu, Jonas Geiping

Main category: cs.CL

TL;DR: A data-free, online test-time framework that adapts MoE routing decisions during text generation using self-supervision from generated sequences, achieving performance gains on reasoning tasks without external data.

Details

Motivation: Mixture-of-Experts models suffer from suboptimal routing decisions due to distribution shifts, and existing test-time adaptation methods require external data and focus on dense models, limiting applicability to MoE architectures.

Method: Cycles between two phases: optimizes routing decisions during prefill and regular intervals using self-supervision from generated sequences, then generates text normally. Uses lightweight additive vectors to update router logits in selected layers.

Result: Achieves consistent performance gains on challenging reasoning tasks (e.g., 5.5% improvement on HumanEval with OLMoE) while maintaining robustness to context shifts. Naturally complements existing techniques (e.g., 6% average gains with self-consistency on DeepSeek-V2-Lite).

Conclusion: The proposed data-free, online test-time framework effectively adapts MoE routing decisions during generation, improving performance on reasoning tasks without external supervision or data, while maintaining computational efficiency.

Abstract: Mixture-of-Experts (MoE) models achieve efficient scaling through sparse expert activation, but often suffer from suboptimal routing decisions due to distribution shifts in deployment. While existing test-time adaptation methods could potentially address these issues, they primarily focus on dense models and require access to external data, limiting their practical applicability to MoE architectures. However, we find that, instead of relying on reference data, we can optimize MoE expert selection on-the-fly based only on input context. As such, we propose \textit{a data-free, online test-time framework} that continuously adapts MoE routing decisions during text generation without external supervision or data. Our method cycles between two phases: During the prefill stage, and later in regular intervals, we optimize the routing decisions of the model using self-supervision based on the already generated sequence. Then, we generate text as normal, maintaining the modified router until the next adaption. We implement this through lightweight additive vectors that only update router logits in selected layers, maintaining computational efficiency while preventing over-adaptation. The experimental results show consistent performance gains on challenging reasoning tasks while maintaining robustness to context shifts. For example, our method achieves a 5.5% improvement on HumanEval with OLMoE. Furthermore, owing to its plug-and-play property, our method naturally complements existing test-time scaling techniques, e.g., achieving 6% average gains when incorporated with self-consistency on DeepSeek-V2-Lite.

[115] Midtraining Bridges Pretraining and Posttraining Distributions

Emmy Liu, Graham Neubig, Chenyan Xiong

Main category: cs.CL

TL;DR: Midtraining - mixing high-quality instruction data at the end of pretraining - is most effective in math and code domains by reducing the syntactic gap between pretraining and posttraining data, outperforming continued pretraining with better in-domain performance and less forgetting.

Details

Motivation: Despite the popularity of midtraining in language model development, there is little scientific understanding of why this phase is effective or how it works.

Method: Conducted controlled experiments with language models pretrained from scratch and fine-tuned on supervised datasets in different domains, with ablations on midtraining timing and mixture weights using code as a case study.

Result: Midtraining is most effective in math and code domains, consistently outperforming continued pretraining in validation loss and reducing pretraining data forgetting. Earlier introduction of specialized data yields greater benefits.

Conclusion: Midtraining functions as a domain adaptation technique that provides better performance than continued pretraining through reduced forgetting of pretraining data.

Abstract: Recently, many language models have been pretrained with a “midtraining” phase, in which higher quality, often instruction-formatted data, is mixed in at the end of pretraining. Despite the popularity of this practice, there is little scientific understanding of this phase of model training or why it is effective. In this work, we conduct the first systematic investigation of midtraining through controlled experiments with language models pretrained from scratch and fine-tuned on supervised finetuning datasets in different domains. We find that when compared after supervised fine-tuning, the effectiveness of midtraining is highest in the math and code domains, where midtraining can best reduce the syntactic gap between pretraining and posttraining data. In these cases, midtraining consistently outperforms continued pretraining in both in-domain validation loss as well as pretraining data forgetting after posttraining. We conduct ablations on the starting time of the midtraining phase and mixture weights of the midtraining data, using code midtraining as a case study, and find that timing has a greater impact than mixture weights, with earlier introduction of specialized data, yielding greater benefits in-domain as well as preserving general language modeling better. These findings establish midtraining as a domain adaptation technique that compared to continued pretraining yields better performance through reduced forgetting.

[116] Predicting Task Performance with Context-aware Scaling Laws

Kyle Montgomery, David Park, Jianhong Tu, Michael Bendersky, Beliz Gunel, Dawn Song, Chenguang Wang

Main category: cs.CL

TL;DR: A new framework that models downstream task performance as a function of training compute and context, validated on extended-context LLama-2 models across three reasoning tasks.

Details

Motivation: Conventional scaling laws focus on upstream metrics like cross-entropy loss but fail to capture downstream task performance where context plays a critical role.

Method: Proposed an interpretable framework that jointly models downstream performance based on training compute and provided context, validated on extended-context Llama-2-7B and Llama-2-13B across 65,500 instances spanning arithmetic reasoning, common sense reasoning, and machine translation.

Result: The framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as context increases.

Conclusion: The findings provide insights into the interplay between training compute and context utilization, offering guidance for designing more efficient long-context LLMs for diverse downstream tasks.

Abstract: Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.

[117] From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR

Erwei Wang, Samuel Bayliss, Andra Bisca, Zachary Blair, Sangeeta Chowdhary, Kristof Denolf, Jeff Fifield, Brandon Freiberger, Erika Hunhoff, Phil James-Roxby, Jack Lo, Joseph Melber, Stephen Neuendorffer, Eddie Richter, Andre Rosti, Javier Setoain, Gagandeep Singh, Endri Taka, Pranathi Vasireddy, Zhewen Yu, Niansong Zhang, Jinming Zhuang

Main category: cs.CL

TL;DR: MLIR-AIR is a compiler stack that bridges high-level workloads with spatial architectures like AMD NPUs, enabling efficient orchestration of compute and data movement through structured representations and compiler-managed scheduling.

Details

Motivation: General-purpose compilers fail to exploit modern spatial architectures due to their abstraction of parallelism, locality, and synchronization. As architectures increasingly require fine-grained control over data movement and compute placement, compiler infrastructure needs explicit mechanisms to fully utilize such hardware.

Method: Built on MLIR, MLIR-AIR introduces the AIR dialect with structured representations for asynchronous and hierarchical operations across compute and memory resources. It enables spatial scheduling, computation distribution across hardware regions, and communication-computation overlap without manual coordination.

Result: For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and matches performance of hand-optimized implementations. For multi-head attention, it enables fused implementations with ~150 lines of code, efficiently mapping complex workloads to spatial hardware.

Conclusion: MLIR-AIR successfully transforms high-level structured control flow into spatial programs that efficiently utilize NPU compute fabric and memory hierarchy through compiler-managed scheduling, asynchronous execution, tiling, and communication overlap.

Abstract: General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD’s NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. We demonstrate MLIR-AIR’s capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware. MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.

[118] LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

Wenkai Yang, Weijie Liu, Ruobing Xie, Yiju Guo, Lulu Wu, Saiyong Yang, Yankai Lin

Main category: cs.CL

TL;DR: RLVR with self-verification is inefficient due to separate solution and verification templates. LaSeR simplifies this by using last-token self-rewarding scores derived from next-token probabilities, requiring only one extra token inference.

Details

Motivation: To address the inefficiency in previous RLVR methods that require separate prompt templates for solutions and self-verifications, which significantly reduces training and inference efficiency.

Method: Propose LaSeR algorithm that augments RLVR loss with MSE loss to align last-token self-rewarding scores with verifier-based reasoning rewards. Uses next-token probability distribution at the last token to compute self-rewarding scores with minimal extra cost.

Result: Method improves reasoning performance and equips models with self-rewarding capability, boosting inference-time scaling performance while requiring only one additional token inference.

Conclusion: LaSeR provides an efficient approach to unify reasoning and self-verification in LLMs through last-token self-rewarding, achieving better performance with minimal computational overhead.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of model’s self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model’s next-token log-probability assigned to any pre-specified token at the solution’s last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model’s reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.

[119] Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation

Xujun Peng, Anoop Kumar, Jingyu Wu, Parker Glenn, Daben Liu

Main category: cs.CL

TL;DR: A new method combining synthetic data generation, triplet loss, and layer-wise model merging improves output consistency in RAG systems by ~47.5% over baseline.

Details

Motivation: LLMs generate inconsistent outputs for semantically equivalent inputs, which is problematic for reliable RAG systems, and current fine-tuning techniques are limited in addressing this consistency issue.

Method: Combines systematic synthetic data generation, triplet loss for better embeddings, and a novel layer-wise model merging approach using consistency-aware weights derived from intermediate layer activations.

Result: The merged model achieves ~47.5% improvement in response similarity over baseline, significantly enhancing output consistency.

Conclusion: The proposed approach offers a practical solution for increasing the reliability of industrial RAG systems by effectively integrating knowledge from specialized models to improve output consistency.

Abstract: Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses that are grounded in retrieved context. However, LLMs often generate inconsistent outputs for semantically equivalent inputs, a problem compounded by the scarcity of consistency-focused training data and the limitations of current fine-tuning techniques in enhancing output consistency. We propose a new approach combining systematic synthetic data generation, triplet loss for better embeddings, and a novel layer-wise model merging approach. Using consistency-aware weights derived from intermediate layer activations, our method effectively integrates knowledge from specialized models. Experimental results how that our merged model significantly enhances output consistency, achieving a ~47.5% improvement in response similarity over the baseline, thus offering a practical solution for increasing the reliability of an industrial RAG system.

[120] MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics

Yuxing Lu, Xukai Zhao, J. Ben Tamo, Micky C. Nnamdi, Rui Peng, Shuang Zeng, Xingyu Hu, Jinzhuo Wang, May D. Wang

Main category: cs.CL

TL;DR: MetaBench is the first benchmark for evaluating LLMs in metabolomics, revealing that while models perform well on text generation, they struggle with cross-database identifier grounding and long-tail metabolites.

Details

Motivation: To systematically evaluate LLM capabilities in specialized scientific domains like metabolomics, which presents unique challenges with complex biochemical pathways, heterogeneous identifiers, and fragmented databases.

Method: Introduced MetaBench benchmark curated from authoritative public resources, evaluating 25 open- and closed-source LLMs across five essential metabolomics capabilities: knowledge, understanding, grounding, reasoning, and research.

Result: Models perform well on text generation tasks but struggle with cross-database identifier grounding even with retrieval augmentation, and performance decreases on long-tail metabolites with sparse annotations.

Conclusion: MetaBench provides essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities on general text; however, their proficiency in specialized scientific domains that require deep, interconnected knowledge remains largely uncharacterized. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. To systematically evaluate LLM capabilities in this domain, we introduce MetaBench, the first benchmark for metabolomics assessment. Curated from authoritative public resources, MetaBench evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research. Our evaluation of 25 open- and closed-source LLMs reveals distinct performance patterns across metabolomics tasks: while models perform well on text generation tasks, cross-database identifier grounding remains challenging even with retrieval augmentation. Model performance also decreases on long-tail metabolites with sparse annotations. With MetaBench, we provide essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.

[121] AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations

Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin

Main category: cs.CL

TL;DR: Machine learning models achieve over 80% accuracy in mental health screening using real-world interview data, with particularly strong performance on PTSD detection (89% accuracy, 98% recall).

Details

Motivation: Mental health disorders are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, and stigma. Primary care providers misidentify depression or anxiety in over 60% of cases, creating urgent need for scalable diagnostic tools.

Method: Evaluated machine learning models using 553 real-world semistructured interviews with ground-truth diagnoses. Benchmarked zero-shot prompting (GPT-4.1 Mini, MetaLLaMA) and fine-tuned RoBERTa models using LowRank Adaptation (LoRA). Used shorter, focused context segments to improve recall.

Result: Models achieved over 80% accuracy across diagnostic categories, with PTSD detection reaching 89% accuracy and 98% recall. LoRA fine-tuning with lower-rank configurations (rank 8 and 16) maintained competitive performance. Focused narrative cues enhanced detection sensitivity.

Conclusion: LLM-based models offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powered early diagnosis. This enables integration into clinical workflows, especially in low-resource or high-stigma environments.

Abstract: Mental health disorders remain among the leading cause of disability worldwide, yet conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, and stigma and low awareness. In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases, highlighting the urgent need for scalable, accessible, and context-aware diagnostic tools that can support early detection and intervention. In this study, we evaluate the effectiveness of machine learning models for mental health screening using a unique dataset of 553 real-world, semistructured interviews, each paried with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD. We benchmark multiple model classes, including zero-shot prompting with GPT-4.1 Mini and MetaLLaMA, as well as fine-tuned RoBERTa models using LowRank Adaptation (LoRA). Our models achieve over 80% accuracy across diagnostic categories, with especially strongperformance on PTSD (up to 89% accuracy and 98% recall). We also find that using shorter context, focused context segments improves recall, suggesting that focused narrative cues enhance detection sensitivity. LoRA fine-tuning proves both efficient and effective, with lower-rank configurations (e.g., rank 8 and 16) maintaining competitive performance across evaluation metrics. Our results demonstrate that LLM-based models can offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powerd early diagnosis. This work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments where access to timely mental health care is most limited.

[122] Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying

Main category: cs.CL

TL;DR: IGPO is a reinforcement learning framework that addresses reward sparsity in multi-turn LLM agents by providing dense, intrinsic turn-level rewards based on information gain, improving training efficiency and performance.

Details

Motivation: Existing RL approaches for LLM agents rely on sparse outcome-based rewards, which cause problems in multi-turn settings: advantage collapse (identical rewards for all rollouts) and lack of fine-grained credit assignment for long-horizon tasks.

Method: IGPO models each interaction turn as incremental information acquisition and defines turn-level rewards as the marginal increase in the policy’s probability of producing the correct answer. It derives intrinsic rewards directly from the model’s belief updates and combines them with outcome-level supervision.

Result: Extensive experiments on in-domain and out-of-domain benchmarks show IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.

Conclusion: IGPO provides an effective solution to reward sparsity in multi-turn agent training through dense, intrinsic supervision based on information gain, enabling better credit assignment and learning signals without external reward models or costly estimation.

Abstract: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy’s probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model’s own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.

[123] LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

Yiming Wang, Da Yin, Yuedong Cui, Ruichen Zheng, Zhiqian Li, Zongyu Lin, Di Wu, Xueqing Wu, Chenchen Ye, Yu Zhou, Kai-Wei Chang

Main category: cs.CL

TL;DR: UI-Simulator is a scalable paradigm that generates synthetic UI trajectories for training digital agents, avoiding expensive real-world data collection. It includes a world simulator, guided rollout process, and trajectory wrapper to produce diverse training data.

Details

Motivation: Collecting diverse, large-scale UI trajectories for digital agents is prohibitively expensive in terms of human annotation, infrastructure, and engineering costs.

Method: UI-Simulator integrates a digital world simulator for diverse UI states, guided rollout process for coherent exploration, and trajectory wrapper for high-quality trajectories. UI-Simulator-Grow adds targeted scaling by prioritizing high-impact tasks and synthesizing informative variants.

Result: Experiments show UI-Simulator rivals or surpasses open-source agents trained on real UIs with better robustness. UI-Simulator-Grow matches Llama-3-70B-Instruct performance using only Llama-3-8B-Instruct as base model.

Conclusion: The targeted synthesis scaling paradigm can continuously and efficiently enhance digital agents by generating synthetic training data at scale.

Abstract: Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in both human annotation, infra and engineering perspectives. To this end, we introduce $\textbf{UI-Simulator}$, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose $\textbf{UI-Simulator-Grow}$, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizes informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of targeted synthesis scaling paradigm to continuously and efficiently enhance the digital agents.

[124] TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

Yinxi Li, Yuntian Deng, Pengyu Nie

Main category: cs.CL

TL;DR: TokDrift framework reveals that code LLMs are sensitive to minor formatting changes due to subword tokenization misalignment with programming grammar, causing substantial behavioral shifts even in large models.

Details

Motivation: Current LLMs for code use statistical subword tokenizers that don't align with programming language grammar, causing semantically identical code to be tokenized differently based on superficial factors like whitespace or naming.

Method: Introduced TokDrift framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization, then analyzed nine code LLMs including large models with over 30B parameters.

Result: Even minor formatting changes cause substantial shifts in model behavior. Layer-wise analysis shows the issue originates in early embeddings where subword segmentation fails to capture grammar token boundaries.

Conclusion: Misaligned tokenization is a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.

Abstract: Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.

[125] Attention Is All You Need for KV Cache in Diffusion LLMs

Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen

Main category: cs.CL

TL;DR: Elastic-Cache is a training-free method that adaptively recomputes KV caches for diffusion LLMs by selectively refreshing caches based on attention drift and depth-aware scheduling, achieving significant speedups (8.7-45.1×) while maintaining generation quality.

Details

Motivation: Prior methods recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy in computation.

Method: Proposes Elastic-Cache that jointly decides when to refresh (via attention-aware drift test on most-attended token) and where to refresh (via depth-aware schedule that recomputes from chosen layer onward while reusing shallow-layer caches and off-window MASK caches).

Result: Achieves consistent speedups: 8.7× on GSM8K (256 tokens), 45.1× on longer sequences, and 4.8× on HumanEval, while maintaining higher accuracy than baseline. Achieves 6.8× higher throughput on GSM8K than existing confidence-based approaches.

Conclusion: Elastic-Cache enables practical deployment of diffusion LLMs by reducing redundant computation and accelerating decoding with negligible loss in generation quality through adaptive, layer-aware cache updates.

Abstract: This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods’ decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (via an attention-aware drift test on the most-attended token) and ${where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.

[126] Interpretability and Transparency-Driven Detection and Transformation of Textual Adversarial Examples (IT-DT)

Bushra Sabir, M. Ali Babar, Sharif Abuadbba

Main category: cs.CL

TL;DR: The paper proposes IT-DT, an interpretability and transparency-driven framework for detecting and transforming adversarial examples in transformer-based text classifiers like BERT and GPT-3.

Details

Motivation: Transformer-based classifiers are vulnerable to adversarial attacks, and existing defense methods lack interpretability, making it difficult to understand adversarial classifications and identify model vulnerabilities.

Method: IT-DT uses attention maps, integrated gradients, and model feedback for interpretable detection of adversarial examples, then employs pre-trained embeddings and model feedback to generate optimal word replacements for transformation. Human experts review results for transparency.

Result: Comprehensive experiments show IT-DT effectively detects and transforms adversarial examples, enhancing interpretability and transparency while enabling accurate identification and successful transformation of adversarial inputs.

Conclusion: By combining technical analysis with human expertise, IT-DT significantly improves the resilience and trustworthiness of transformer-based text classifiers against adversarial attacks.

Abstract: Transformer-based text classifiers like BERT, Roberta, T5, and GPT-3 have shown impressive performance in NLP. However, their vulnerability to adversarial examples poses a security risk. Existing defense methods lack interpretability, making it hard to understand adversarial classifications and identify model vulnerabilities. To address this, we propose the Interpretability and Transparency-Driven Detection and Transformation (IT-DT) framework. It focuses on interpretability and transparency in detecting and transforming textual adversarial examples. IT-DT utilizes techniques like attention maps, integrated gradients, and model feedback for interpretability during detection. This helps identify salient features and perturbed words contributing to adversarial classifications. In the transformation phase, IT-DT uses pre-trained embeddings and model feedback to generate optimal replacements for perturbed words. By finding suitable substitutions, we aim to convert adversarial examples into non-adversarial counterparts that align with the model’s intended behavior while preserving the text’s meaning. Transparency is emphasized through human expert involvement. Experts review and provide feedback on detection and transformation results, enhancing decision-making, especially in complex scenarios. The framework generates insights and threat intelligence empowering analysts to identify vulnerabilities and improve model robustness. Comprehensive experiments demonstrate the effectiveness of IT-DT in detecting and transforming adversarial examples. The approach enhances interpretability, provides transparency, and enables accurate identification and successful transformation of adversarial inputs. By combining technical analysis and human expertise, IT-DT significantly improves the resilience and trustworthiness of transformer-based text classifiers against adversarial attacks.

[127] Natural Language Processing RELIES on Linguistics

Juri Opitz, Shira Wein, Nathan Schneider

Main category: cs.CL

TL;DR: LLMs can generate fluent text without explicit linguistic modules, but linguistics remains crucial for NLP in six key areas: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and Study of language (RELIES).

Details

Motivation: To examine whether linguistic expertise is still relevant in NLP given LLMs' ability to generate fluent text without explicit grammatical or semantic modules, and to highlight areas where linguistics continues to contribute.

Method: The paper proposes the RELIES framework that identifies six major facets where linguistics contributes to NLP: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and Study of language.

Result: The analysis shows that despite LLMs’ capabilities, linguistics remains essential for developing resources, evaluating systems, handling low-resource scenarios, interpreting models, explaining outputs, and studying language systems.

Conclusion: Linguistics continues to play a vital role in NLP through the RELIES framework, emphasizing the enduring importance of studying machine systems in relation to human language systems.

Abstract: Large Language Models (LLMs) have become capable of generating highly fluent text in certain languages, without modules specially designed to capture grammar or semantic coherence. What does this mean for the future of linguistic expertise in NLP? We highlight several aspects in which NLP (still) relies on linguistics, or where linguistic thinking can illuminate new directions. We argue our case around the acronym RELIES that encapsulates six major facets where linguistics contributes to NLP: Resources, Evaluation, Low-resource settings, Interpretability, Explanation, and the Study of language. This list is not exhaustive, nor is linguistics the main point of reference for every effort under these themes; but at a macro level, these facets highlight the enduring importance of studying machine systems vis-`a-vis systems of human language.

[128] Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou

Main category: cs.CL

TL;DR: Ada-KV is a head-wise adaptive budget allocation strategy that improves KV cache compression by allocating compression budgets based on each attention head’s unique patterns, rather than uniform allocation.

Details

Motivation: Large Language Models face efficiency challenges due to growing KV cache size for long-sequence inference. Existing methods use uniform compression budgets across all attention heads, ignoring their unique attention patterns.

Method: Proposed Ada-KV, a head-wise adaptive budget allocation strategy that establishes a theoretical loss upper bound between pre- and post-eviction attention output to guide optimization. It’s plug-and-play and compatible with prior cache eviction methods.

Result: Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, in both question-aware and question-agnostic scenarios, show substantial quality improvements over existing methods.

Conclusion: Ada-KV effectively addresses KV cache efficiency challenges by adapting compression budgets to individual attention head patterns, significantly improving generation quality while maintaining compatibility with existing methods.

Abstract: Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, explaining the optimization target of prior cache eviction methods, while guiding the optimization of adaptive budget allocation. Base on this, we propose {\it Ada-KV}, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods. Our code is available at https://github.com/FFY0/AdaKV.

[129] AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark

Abhay Gupta, Philip Meng, Ece Yurtseven, Sean O’Brien, Kevin Zhu

Main category: cs.CL

TL;DR: AAVENUE is a new benchmark for evaluating LLM performance on African American Vernacular English (AAVE) vs Standard American English, revealing consistent performance gaps that highlight biases in NLP systems.

Details

Motivation: To address dialect-induced performance discrepancies and develop more inclusive natural language processing systems by detecting biases in NLU for African American Vernacular English.

Method: Created AAVENUE benchmark using LLM-based translation with few-shot prompting to convert GLUE and SuperGLUE tasks from SAE to AAVE, validated by fluent AAVE speakers, and compared with existing VALUE benchmark using multiple metrics.

Result: LLMs consistently perform better on Standard American English tasks than AAVE-translated versions across all evaluation metrics, revealing inherent biases in current models.

Conclusion: The performance gap between SAE and AAVE highlights the need for more inclusive NLP models, and the AAVENUE benchmark provides a tool for evaluating and addressing these dialect-based biases.

Abstract: Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE ({AAVE} {N}atural Language {U}nderstanding {E}valuation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, we recruit fluent AAVE speakers to validate our translations for authenticity. Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models. We have open-sourced our source code on GitHub and created a website to showcase our work at https://aavenuee.github.io.

[130] MIO: A Foundation Model on Multimodal Tokens

Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang

Main category: cs.CL

TL;DR: MIO is a novel foundation model that enables end-to-end, autoregressive understanding and generation across speech, text, images, and videos using multimodal tokens, addressing limitations of existing multimodal models.

Details

Motivation: Current LLMs and MM-LLMs lack true any-to-any understanding and generation capabilities. While GPT-4o shows potential for omnidirectional input/output, it's closed-source and doesn't support multimodal interleaved sequences.

Method: Four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks using causal multimodal modeling with discrete tokens across four modalities.

Result: MIO exhibits competitive and sometimes superior performance compared to previous dual-modal baselines, any-to-any model baselines, and modality-specific baselines. It demonstrates advanced any-to-any capabilities like interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, and instructional image editing.

Conclusion: MIO successfully addresses the gap in true any-to-any multimodal understanding and generation, providing an open alternative to closed-source models while supporting complex multimodal interleaved sequences and advanced reasoning capabilities.

Abstract: In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.

[131] SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

Yuxin Xiao, Shujian Zhang, Wenxuan Zhou, Marzyeh Ghassemi, Sanqiang Zhao

Main category: cs.CL

TL;DR: SFTMix is a Mixup-based method that improves LLM instruction tuning by interpolating examples with different confidence levels, eliminating the need for well-curated datasets.

Details

Motivation: Current instruction tuning methods rely heavily on high-quality supervised fine-tuning datasets that require proprietary LLMs or human annotation for data filtering, which is resource-intensive.

Method: SFTMix uses training dynamics to identify examples with varying confidence levels, interpolates them to bridge confidence gaps, and applies Mixup-based regularization on these interpolated examples.

Result: SFTMix demonstrates consistent improvements across LLM families and SFT datasets of varying sizes and qualities in both instruction-following and healthcare-specific tasks.

Conclusion: SFTMix provides an effective alternative to data curation for instruction tuning, showing compatibility with data selection, adaptability to compute-constrained scenarios, and scalability to broader applications.

Abstract: To acquire instruction-following capabilities, large language models (LLMs) undergo instruction tuning, where they are trained on instruction-response pairs using next-token prediction (NTP). Efforts to improve instruction tuning often focus on higher-quality supervised fine-tuning (SFT) datasets, typically requiring data filtering with proprietary LLMs or human annotation. In this paper, we take a different approach by proposing SFTMix, a novel Mixup-based recipe that elevates LLM instruction tuning without relying on well-curated datasets. We observe that LLMs exhibit uneven confidence across the semantic representation space. We argue that examples with different confidence levels should play distinct roles in instruction tuning: Confident data is prone to overfitting, while unconfident data is harder to generalize. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels. We then interpolate them to bridge the confidence gap and apply a Mixup-based regularization to support learning on these additional, interpolated examples. We demonstrate the effectiveness of SFTMix in both instruction-following and healthcare-specific SFT tasks, with consistent improvements across LLM families and SFT datasets of varying sizes and qualities. Extensive analyses across six directions highlight SFTMix’s compatibility with data selection, adaptability to compute-constrained scenarios, and scalability to broader applications.

[132] AI-generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity

Yang Zhong, Jiangang Hao, Michael Fauss, Chen Li, Yuan Wang

Main category: cs.CL

TL;DR: This paper analyzes AI-generated essays from LLMs, examining their impact on automated scoring systems and academic integrity detection.

Details

Motivation: The increasing use of AI-assisted writing in education and professional settings raises concerns about automated scoring accuracy and academic integrity detection capabilities.

Method: Used large-scale empirical data to benchmark characteristics and quality of essays generated by popular LLMs, testing automated scoring systems and AI detection methods.

Result: Found limitations in existing automated scoring systems (like e-rater) for AI-generated essays, but showed that detectors trained on one model’s essays can effectively identify texts from other models with high accuracy.

Conclusion: While current automated scoring systems need improvement for AI-generated content, effective detection of AI-generated essays remains feasible through cross-model detector training.

Abstract: The rapid advancement of large language models (LLMs) has enabled the generation of coherent essays, making AI-assisted writing increasingly common in educational and professional settings. Using large-scale empirical data, we examine and benchmark the characteristics and quality of essays generated by popular LLMs and discuss their implications for two key components of writing assessments: automated scoring and academic integrity. Our findings highlight limitations in existing automated scoring systems, such as e-rater, when applied to essays generated or heavily influenced by AI, and identify areas for improvement, including the development of new features to capture deeper thinking and recalibrating feature weights. Despite growing concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated from one model can often identify texts from others with high accuracy, suggesting that effective detection could remain manageable in practice.

[133] Multi-Perspective Stance Detection

Benedetta Muscato, Praveen Bushipaka, Gizem Gezici, Lucia Passaro, Fosca Giannotti

Main category: cs.CL

TL;DR: Multi-perspective approach in NLP classification outperforms traditional single-label methods by incorporating diverse human annotations rather than aggregating them.

Details

Motivation: Traditional NLP methods aggregate multiple human annotations into single ground truth, disregarding annotator diversity and perspective differences.

Method: Investigated perspective-aware classification models in stance detection, examining how multiple annotations affect model accuracy and confidence.

Result: Multi-perspective approach yields better classification performance than baseline using single labels, showing superior results.

Conclusion: Designing perspective-aware AI models is essential for responsible and ethical AI, and achieves better performance than traditional approaches.

Abstract: Subjective NLP tasks usually rely on human annotations provided by multiple annotators, whose judgments may vary due to their diverse backgrounds and life experiences. Traditional methods often aggregate multiple annotations into a single ground truth, disregarding the diversity in perspectives that arises from annotator disagreement. In this preliminary study, we examine the effect of including multiple annotations on model accuracy in classification. Our methodology investigates the performance of perspective-aware classification models in stance detection task and further inspects if annotator disagreement affects the model confidence. The results show that multi-perspective approach yields better classification performance outperforming the baseline which uses the single label. This entails that designing more inclusive perspective-aware AI models is not only an essential first step in implementing responsible and ethical AI, but it can also achieve superior results than using the traditional approaches.

[134] Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs

Koshiro Saito, Sakae Mizuki, Masanari Ohi, Taishi Nakamura, Taihei Shiotani, Koki Maeda, Youmi Ma, Kakeru Hattori, Kazuki Fujii, Takumi Okamoto, Shigeki Ishida, Hiroya Takamura, Rio Yokota, Naoaki Okazaki

Main category: cs.CL

TL;DR: This paper explores why and how to build local LLMs, using Japanese as a case study. It examines what abilities transfer from other languages and identifies specific Japanese abilities that require local training.

Details

Motivation: To understand the rationale for building local LLMs, determine what should be learned from the target language, identify transferable abilities from other languages, and investigate language-specific scaling laws.

Method: Evaluated 35 Japanese, English, and multilingual LLMs on 19 benchmarks for Japanese and English. Used observational approach with correlation analysis and principal component analysis (PCA) to derive ability factors.

Result: Training on English text improves Japanese academic subject scores (JMMLU). Japanese text training is unnecessary for code generation, arithmetic reasoning, commonsense, and reading comprehension tasks. Japanese training improves Japanese knowledge QA and English-Japanese translation - these are identified as core Japanese abilities.

Conclusion: Japanese abilities (knowledge QA and translation) scale with computational budget for Japanese text, while many other abilities can be transferred from English training.

Abstract: Why do we build local large language models (LLMs)? What should a local LLM learn from the target language? Which abilities can be transferred from other languages? Do language-specific scaling laws exist? To explore these research questions, we evaluated 35 Japanese, English, and multilingual LLMs on 19 evaluation benchmarks for Japanese and English, taking Japanese as a local language. Adopting an observational approach, we analyzed correlations of benchmark scores, and conducted principal component analysis (PCA) on the scores to derive \textit{ability factors} of local LLMs. We found that training on English text can improve the scores of academic subjects in Japanese (JMMLU). In addition, it is unnecessary to specifically train on Japanese text to enhance abilities for solving Japanese code generation, arithmetic reasoning, commonsense, and reading comprehension tasks. In contrast, training on Japanese text could improve question-answering tasks about Japanese knowledge and English-Japanese translation, which indicates that abilities for solving these two tasks can be regarded as \textit{Japanese abilities} for LLMs. Furthermore, we confirmed that the Japanese abilities scale with the computational budget for Japanese text.

[135] Women, Infamous, and Exotic Beings: A Comparative Study of Honorific Usages in Wikipedia and LLMs for Bengali and Hindi

Sourabrata Mukherjee, Atharva Mehta, Sougata Saha, Akhil Arora, Monojit Choudhury

Main category: cs.CL

TL;DR: Large-scale analysis of third-person honorific usage in Hindi and Bengali Wikipedia reveals systematic patterns influenced by socio-demographic factors, with notable cross-linguistic differences and gender disparities. LLMs show divergent honorific preferences compared to Wikipedia norms.

Details

Motivation: To understand how third-person honorifics encode socio-pragmatic cues in South Asian languages and examine whether LLMs internalize these cultural-linguistic norms.

Method: Analyzed 10,000 Hindi and Bengali Wikipedia articles with socio-demographic annotations, then probed six LLMs using controlled generation and translation tasks over 1,000 culturally balanced entities.

Result: Honorifics are more prevalent in Bengali than Hindi; non-honorifics dominate for infamous, juvenile, and culturally exotic entities; men receive more honorifics than women in both languages. LLMs diverge from Wikipedia usage patterns.

Conclusion: LLMs exhibit gaps in socio-cultural alignment with real-world honorific usage, highlighting the need to study how they acquire, adapt, or distort social-linguistic norms.

Abstract: The obligatory use of third-person honorifics is a distinctive feature of several South Asian languages, encoding nuanced socio-pragmatic cues such as power, age, gender, fame, and social distance. In this work, (i) We present the first large-scale study of third-person honorific pronoun and verb usage across 10,000 Hindi and Bengali Wikipedia articles with annotations linked to key socio-demographic attributes of the subjects, including gender, age group, fame, and cultural origin. (ii) Our analysis uncovers systematic intra-language regularities but notable cross-linguistic differences: honorifics are more prevalent in Bengali than in Hindi, while non-honorifics dominate while referring to infamous, juvenile, and culturally exotic entities. Notably, in both languages, and more prominently in Hindi, men are more frequently addressed with honorifics than women. (iii) To examine whether large language models (LLMs) internalize similar socio-pragmatic norms, we probe six LLMs using controlled generation and translation tasks over 1,000 culturally balanced entities. We find that LLMs diverge from Wikipedia usage, exhibiting alternative preferences in honorific selection across tasks, languages, and socio-demographic attributes. These discrepancies highlight gaps in the socio-cultural alignment of LLMs and open new directions for studying how LLMs acquire, adapt, or distort social-linguistic norms. Our code and data are publicly available at https://github.com/souro/honorific-wiki-llm

[136] The simulation of judgment in LLMs

Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Alessandro Santirocchi, Roberto Atzeni, Matteo Cinelli, Vincenzo Cestari, Clelia Rossi-Arnaud, Walter Quattrociocchi

Main category: cs.CL

TL;DR: LLMs show systematic differences from human evaluation patterns in news credibility assessment, relying more on lexical associations than contextual reasoning, creating an “epistemia” illusion of knowledge.

Details

Motivation: To examine how LLM evaluations are constructed, what assumptions they rely on, and how their strategies diverge from human evaluative processes, particularly in credibility assessment.

Method: Benchmarked six LLMs against expert ratings (NewsGuard, Media Bias/Fact Check) and human judgments using a structured agentic framework where both models and participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications.

Result: Despite output alignment, models showed consistent differences in observable criteria, with reliance on lexical associations and statistical priors rather than contextual reasoning, leading to political asymmetries and confusion of linguistic form with epistemic reliability.

Conclusion: Delegating judgment to LLMs may shift evaluative processes from normative reasoning toward pattern-based approximation, raising questions about their role in evaluative systems due to the “epistemia” phenomenon where surface plausibility replaces verification.

Abstract: Large Language Models (LLMs) are increasingly embedded in evaluative processes, from information filtering to assessing and addressing knowledge gaps through explanation and credibility judgments. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings–NewsGuard and Media Bias/Fact Check–and against human judgments collected through a controlled experiment. We use news domains purely as a controlled benchmark for evaluative tasks, focusing on the underlying mechanisms rather than on news classification per se. To enable direct comparison, we implement a structured agentic framework in which both models and nonexpert participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications. Despite output alignment, our findings show consistent differences in the observable criteria guiding model evaluations, suggesting that lexical associations and statistical priors could influence evaluations in ways that differ from contextual reasoning. This reliance is associated with systematic effects: political asymmetries and a tendency to confuse linguistic form with epistemic reliability–a dynamic we term epistemia, the illusion of knowledge that emerges when surface plausibility replaces verification. Indeed, delegating judgment to such systems may affect the heuristics underlying evaluative processes, suggesting a shift from normative reasoning toward pattern-based approximation and raising open questions about the role of LLMs in evaluative processes.

[137] Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment

Jingcheng Deng, Zhongtao Jiang, Liang Pang, Liwei Chen, Kun Xu, Zihao Wei, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: AutoRegEmbed is a contrastive learning method that addresses the conflict between LLMs’ generative nature and contrastive learning requirements by embedding conditional probability distributions through information compression and distribution alignment tasks.

Details

Motivation: LLM embeddings are inherently generative and distributive, conflicting with contrastive learning's requirement for embeddings to capture full-text semantics and align via cosine similarity, leading to inefficient utilization of LLMs' pre-training capabilities.

Method: Proposes AutoRegEmbed with two core tasks: information compression (encoding text into embedding space to capture global semantics) and conditional distribution alignment (aligning text embeddings with positive samples while reducing likelihood of generating negative samples).

Result: Significantly outperforms traditional contrastive learning approaches and achieves performance comparable to state-of-the-art models with the same amount of data.

Conclusion: AutoRegEmbed successfully resolves the conflict between LLMs’ generative nature and contrastive learning requirements, enabling more efficient utilization of LLMs’ pre-training capabilities for dense text encoding.

Abstract: A new trend uses LLMs as dense text encoders via contrastive learning. However, since LLM embeddings predict the probability distribution of the next token, they are inherently generative and distributive, conflicting with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. This discrepancy hinders the full utilization of LLMs’ pre-training capabilities, resulting in inefficient learning. In response to this issue, we propose AutoRegEmbed, a new contrastive learning method built on embedding conditional probability distributions, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task focuses on aligning text embeddings with positive samples embeddings by leveraging the conditional distribution of embeddings while simultaneously reducing the likelihood of generating negative samples from text embeddings, thereby achieving embedding alignment and uniformity. Experimental results demonstrate that our method significantly outperforms traditional contrastive learning approaches and achieves performance comparable to state-of-the-art models when using the same amount of data.

[138] Sentence Smith: Controllable Edits for Evaluating Text Embeddings

Hongji Li, Andrianos Michail, Reto Gubelmann, Simon Clematide, Juri Opitz

Main category: cs.CL

TL;DR: The Sentence Smith framework uses semantic parsing and generation to enable controllable text manipulation, creating hard negative pairs for evaluating text embedding models with transparent semantic shifts.

Details

Motivation: To achieve controllable and transparent text generation by overcoming limitations of earlier parsing-based approaches through modern parsers and safety mechanisms.

Method: Three-step framework: 1) Parse sentences into semantic graphs, 2) Apply human-designed semantic manipulation rules, 3) Generate text from manipulated graphs, with final entailment check for validity.

Result: Successfully produces high-quality texts validated by humans, creates hard negative pairs that challenge text embedding models, and enables fine-grained evaluation of semantic shifts.

Conclusion: The framework demonstrates that current methods can achieve near-goal performance for controllable text generation, providing resource-efficient and transparent evaluation of text embedding models.

Abstract: Controllable and transparent text generation has been a long-standing goal in NLP. Almost as long-standing is a general idea for addressing this challenge: Parsing text to a symbolic representation, and generating from it. However, earlier approaches were hindered by parsing and generation insufficiencies. Using modern parsers and a safety supervision mechanism, we show how close current methods come to this goal. Concretely, we propose the Sentence Smith framework for English, which has three steps: 1. Parsing a sentence into a semantic graph. 2. Applying human-designed semantic manipulation rules. 3. Generating text from the manipulated graph. A final entailment check (4.) verifies the validity of the applied transformation. To demonstrate our framework’s utility, we use it to induce hard negative text pairs that challenge text embedding models. Since the controllable generation makes it possible to clearly isolate different types of semantic shifts, we can evaluate text embedding models in a fine-grained way, also addressing an issue in current benchmarking where linguistic phenomena remain opaque. Human validation confirms that our transparent generation process produces texts of good quality. Notably, our way of generation is very resource-efficient, since it relies only on smaller neural networks.

[139] Probabilistic Reasoning with LLMs for k-anonymity Estimation

Jonathan Zheng, Sauvik Das, Alan Ritter, Wei Xu

Main category: cs.CL

TL;DR: Introduces BRANCH, a new LLM methodology for estimating k-privacy values of text documents by factorizing joint probability distributions and using Bayesian networks.

Details

Motivation: To address the need for probabilistic reasoning in AI systems to handle uncertainty in privacy risk assessment of user-generated documents containing sensitive information.

Method: BRANCH factorizes joint probability distribution of personal information as random variables, estimates probability of each factor separately using Bayesian network, and combines them to compute final k-value.

Result: Method successfully estimates k-value 73% of the time (13% improvement over o3-mini with chain-of-thought), and high-variance predictions are 37.47% less accurate on average.

Conclusion: BRANCH provides effective probabilistic reasoning for privacy risk assessment, with LLM uncertainty serving as a good accuracy indicator.

Abstract: Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a new numerical reasoning task under uncertainty for large language models, focusing on estimating the privacy risk of user-generated documents containing privacy-sensitive information. We propose BRANCH, a new LLM methodology that estimates the k-privacy value of a text-the size of the population matching the given information. BRANCH factorizes a joint probability distribution of personal information as random variables. The probability of each factor in a population is estimated separately using a Bayesian network and combined to compute the final k-value. Our experiments show that this method successfully estimates the k-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator for accuracy, as high-variance predictions are 37.47% less accurate on average.

[140] Revisiting Prompt Optimization with Large Reasoning Models-A Case Study on Event Extraction

Saurabh Srivastava, Ziyu Yao

Main category: cs.CL

TL;DR: LRMs like DeepSeek-R1 and OpenAI o1 still benefit from prompt optimization for complex tasks like event extraction, and LRMs make effective prompt optimizers.

Details

Motivation: To test whether Large Reasoning Models (LRMs) still require prompt engineering despite their strong reasoning capabilities, using event extraction as a case study.

Method: Compared two LRMs (DeepSeek-R1, o1) and two general LLMs (GPT-4o, GPT-4.5) as both task models and prompt optimizers for event extraction tasks.

Result: LRMs as task models benefit from prompt optimization, and LRMs as prompt optimizers produce more effective prompts. Findings generalize beyond event extraction.

Conclusion: Even advanced LRMs still require prompt optimization for complex tasks, and LRMs themselves are effective at optimizing prompts, showing stability in refining instructions.

Abstract: Large Reasoning Models (LRMs) such as DeepSeek-R1 and OpenAI o1 have demonstrated remarkable capabilities in various reasoning tasks. Their strong capability to generate and reason over intermediate thoughts has also led to arguments that they may no longer require extensive prompt engineering or optimization to interpret human instructions and produce accurate outputs. In this work, we aim to systematically study this open question, using the structured task of event extraction for a case study. We experimented with two LRMs (DeepSeek-R1 and o1) and two general-purpose Large Language Models (LLMs) (GPT-4o and GPT-4.5), when they were used as task models or prompt optimizers. Our results show that on tasks as complicated as event extraction, LRMs as task models still benefit from prompt optimization, and that using LRMs as prompt optimizers yields more effective prompts. Our finding also generalizes to tasks beyond event extraction. Finally, we provide an error analysis of common errors made by LRMs and highlight the stability and consistency of LRMs in refining task instructions and event guidelines.

[141] Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge

Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, Domenico Talia

Main category: cs.CL

TL;DR: A scalable benchmarking framework (CLEAR-Bias) was developed to assess LLM robustness against adversarial bias elicitation, revealing that no model is fully robust and bias resilience varies significantly across sociocultural dimensions.

Details

Motivation: Growing concerns about embedded biases in LLMs that can perpetuate stereotypes and undermine fairness, particularly given their vulnerability to adversarial attacks despite mitigation efforts.

Method: Systematic probing across multiple tasks targeting diverse sociocultural biases, quantifying robustness through safety scores using LLM-as-a-Judge approach (DeepSeek V3 identified as most reliable), and employing jailbreak techniques to reveal safety vulnerabilities.

Result: Bias resilience is uneven with age, disability, and intersectional biases being most prominent; some small models outperform larger ones in safety; no model is fully robust to adversarial elicitation; jailbreak attacks using low-resource languages or refusal suppression are effective across model families; successive LLM generations show slight safety gains; medical-domain fine-tuned models tend to be less safe.

Conclusion: Training and architecture may matter more than scale for bias resilience, but current LLMs remain vulnerable to adversarial bias elicitation, highlighting the need for continued safety improvements and systematic benchmarking.

Abstract: The growing integration of Large Language Models (LLMs) into critical societal domains has raised concerns about embedded biases that can perpetuate stereotypes and undermine fairness. Such biases may stem from historical inequalities in training data, linguistic imbalances, or adversarial manipulation. Despite mitigation efforts, recent studies show that LLMs remain vulnerable to adversarial attacks that elicit biased outputs. This work proposes a scalable benchmarking framework to assess LLM robustness to adversarial bias elicitation. Our methodology involves: (i) systematically probing models across multiple tasks targeting diverse sociocultural biases, (ii) quantifying robustness through safety scores using an LLM-as-a-Judge approach, and (iii) employing jailbreak techniques to reveal safety vulnerabilities. To facilitate systematic benchmarking, we release a curated dataset of bias-related prompts, named CLEAR-Bias. Our analysis, identifying DeepSeek V3 as the most reliable judge LLM, reveals that bias resilience is uneven, with age, disability, and intersectional biases among the most prominent. Some small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. However, no model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective across model families. We also find that successive LLM generations exhibit slight safety gains, while models fine-tuned for the medical domain tend to be less safe than their general-purpose counterparts.

[142] Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs?

Hansi Zeng, Kai Hui, Honglei Zhuang, Zhen Qin, Zhenrui Yue, Hamed Zamani, Dana Alon

Main category: cs.CL

TL;DR: The paper addresses the challenge of selecting pre-training checkpoints for optimal downstream fine-tuning performance by formulating it as a pairwise classification problem and introducing novel proxy metrics that significantly outperform traditional perplexity.

Details

Motivation: Traditional pre-training metrics like perplexity correlate well with model performance in scaling studies but have unclear predictive capacity at fixed model sizes, hindering effective model selection and development.

Method: Formulated checkpoint selection as pairwise classification problem, constructed dataset with 50 1B parameter LLM variants with varied pre-training configurations, and introduced novel unsupervised and supervised proxy metrics derived from pre-training.

Result: Demonstrated that conventional perplexity is misleading, while proposed proxy metrics reduce relative performance prediction error rate by over 50%. Showed practical utility in specific scenarios.

Conclusion: The work enables more efficient design of pre-training schemes optimized for various downstream tasks by providing better metrics for checkpoint selection.

Abstract: While metrics available during pre-training, such as perplexity, correlate well with model performance at scaling-laws studies, their predictive capacities at a fixed model size remain unclear, hindering effective model selection and development. To address this gap, we formulate the task of selecting pre-training checkpoints to maximize downstream fine-tuning performance as a pairwise classification problem: predicting which of two LLMs, differing in their pre-training, will perform better after supervised fine-tuning (SFT). We construct a dataset using 50 1B parameter LLM variants with systematically varied pre-training configurations, e.g., objectives or data, and evaluate them on diverse downstream tasks after SFT. We first conduct a study and demonstrate that the conventional perplexity is a misleading indicator. As such, we introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%. Despite the inherent complexity of this task, we demonstrate the practical utility of our proposed proxies in specific scenarios, paving the way for more efficient design of pre-training schemes optimized for various downstream tasks.

[143] ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

Dongwon Noh, Donghyeok Koh, Junghun Yuk, Gyuwan Kim, Jaeyong Lee, Kyungtae Lim, Cheoneum Park

Main category: cs.CL

TL;DR: ScholarBench is a challenging bilingual benchmark for evaluating LLMs’ academic reasoning across 8 research domains with complex problem types, where even state-of-the-art models achieve only 54.3% average score.

Details

Motivation: Existing benchmarks lack scalability for complex academic tasks and deep expert knowledge evaluation, creating a need for more specialized assessment tools.

Method: Three-step construction process creating bilingual dataset with domain-specific attributes, five problem types across eight research domains, aligned with characteristic research methodologies.

Result: Created 5,031 Korean and 5,309 English examples; state-of-the-art models like o3-mini achieved only 0.543 average score, demonstrating high difficulty.

Conclusion: ScholarBench successfully addresses gaps in existing benchmarks and provides a challenging evaluation framework for LLMs’ academic reasoning capabilities across multiple domains and languages.

Abstract: Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce \texttt{ScholarBench}, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. \texttt{ScholarBench} targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, \texttt{ScholarBench} evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, this benchmark operates as an English-Korean bilingual dataset, facilitating simultaneous evaluation for linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.

[144] Thinker: Learning to Think Fast and Slow

Stephen Chung, Wenyu Du, Jie Fu

Main category: cs.CL

TL;DR: A dual-process reasoning approach for LLMs that combines fast intuition with deliberate verification and refinement stages, improving accuracy while maintaining efficiency.

Details

Motivation: Current LLM reasoning is imprecise with redundant responses, lacking confidence and verification capabilities. Inspired by psychological Dual Process Theory to separate intuitive and deliberative reasoning.

Method: Four-stage QA task: Fast Thinking (answer within token budget), Verification (evaluate initial response), Slow Thinking (refine with deliberation), Summarization (distill into precise steps).

Result: Improved average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B. Fast Thinking alone achieved 25.2% accuracy using <1000 tokens.

Conclusion: Intuition and deliberative reasoning are distinct complementary systems that benefit from targeted training. The approach enables more efficient and accurate reasoning in LLMs.

Abstract: Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B, and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 25.2% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training. Additionally, we have open-sourced both the trained models and the source code.

[145] Multilinguality Does not Make Sense: Investigating Factors Behind Zero-Shot Transfer in Sense-Aware Tasks

Roksana Goworek, Haim Dubossarsky

Main category: cs.CL

TL;DR: Multilinguality is not necessary for effective cross-lingual transfer in sense-aware tasks like polysemy and lexical semantic change; other factors like pretraining data differences and evaluation artifacts better explain perceived benefits.

Details

Motivation: To test the common assumption that training on more languages improves zero-shot cross-lingual transfer, specifically for sense-aware tasks.

Method: Large-scale analysis across 28 languages on polysemy and lexical semantic change tasks, examining factors like pretraining/fine-tuning data differences and evaluation artifacts.

Result: Multilinguality is not necessary for effective transfer; other factors better explain the perceived benefits of training on multiple languages.

Conclusion: Findings challenge assumptions about multilinguality in cross-lingual transfer and provide insights for low-resource languages, with released models and baselines for future research.

Abstract: Cross-lingual transfer is central to modern NLP, enabling models to perform tasks in languages different from those they were trained on. A common assumption is that training on more languages improves zero-shot transfer. We test this on sense-aware tasks-polysemy and lexical semantic change-and find that multilinguality is not necessary for effective transfer. Our large-scale analysis across 28 languages reveals that other factors, such as differences in pretraining and fine-tuning data and evaluation artifacts, better explain the perceived benefits of multilinguality. We also release fine-tuned models and provide empirical baselines to support future research. While focused on two sense-aware tasks, our findings offer broader insights into cross-lingual transfer, especially for low-resource languages.

[146] iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering

Shuai Wang, Yinan Yu

Main category: cs.CL

TL;DR: iQUEST is a KBQA framework that iteratively decomposes complex queries and uses GNNs to incorporate 2-hop neighbor information, improving multi-hop reasoning accuracy.

Details

Motivation: LLMs often have factual inaccuracies in knowledge-intensive scenarios, and multi-hop KBQA faces challenges in maintaining coherent reasoning paths and avoiding premature discarding of critical connections.

Method: Iterative query decomposition into simpler sub-questions combined with GNN-based 2-hop neighbor lookahead at each reasoning step.

Result: Consistent improvement across four benchmark datasets and four LLMs.

Conclusion: The dual approach of structured decomposition and neighbor lookahead strengthens reasoning and enables more effective path exploration.

Abstract: While Large Language Models (LLMs) excel at many natural language processing tasks, they often suffer from factual inaccuracies in knowledge-intensive scenarios. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1)~maintaining coherent reasoning paths, and (2)~avoiding prematurely discarding critical multi-hop connections. To address these issues, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.

[147] Echoes of BERT: Do Modern Language Models Rediscover the Classical NLP Pipeline?

Michael Li, Nishant Subramani

Main category: cs.CL

TL;DR: Analysis of 25 transformer models shows hierarchical linguistic organization persists in modern LLMs: early layers handle syntax, middle layers semantics, and later layers discourse. Lexical information becomes nonlinear in deeper layers while inflectional morphology remains linear.

Details

Motivation: To understand how modern large language models encode linguistic information compared to early models like BERT and GPT-2, examining whether hierarchical organization patterns persist across diverse architectures and training regimes.

Method: Analyzed 25 models from classical architectures to modern LLMs using layer-by-layer probing across 8 linguistic tasks, with in-depth multilingual analysis of lexical identity and inflectional morphology, plus attention mechanisms, steering vectors, and pretraining checkpoint analysis.

Result: Found consistent hierarchical organization across models: early layers capture syntax, middle layers handle semantics and entity information, later layers encode discourse. Lexical information concentrates linearly in early layers but becomes nonlinear deeper, while inflectional morphology remains linearly accessible throughout.

Conclusion: Transformer models learn similar linguistic organization patterns regardless of architecture, size, or training regime, suggesting these hierarchical properties are fundamental for next token prediction in language modeling.

Abstract: Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information relies primarily on studies of early models like BERT and GPT-2. Building on classic BERTology work, we analyze 25 models spanning from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1), probing layer-by-layer representations across eight linguistic tasks in English. Consistent with earlier findings, we find that hierarchical organization persists in modern models: early layers capture syntax, middle layers handle semantics and entity-level information, and later layers encode discourse phenomena. We dive deeper, conducting an in-depth multilingual analysis of two specific linguistic properties - lexical identity and inflectional morphology - that help disentangle form from meaning. We find that lexical information concentrates linearly in early layers but becomes increasingly nonlinear deeper in the network, while inflectional information remains linearly accessible throughout all layers. Additional analyses of attention mechanisms, steering vectors, and pretraining checkpoints reveal where this information resides within layers, how it can be functionally manipulated, and how representations evolve during pretraining. Taken together, our findings suggest that, even with substantial advances in LLM technologies, transformer models learn to organize linguistic information in similar ways, regardless of model architecture, size, or training regime, indicating that these properties are important for next token prediction. Our code is available at https://github.com/ml5885/model_internal_sleuthing

[148] Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?

Qingchuan Li, Jiatong Li, Zirui Liu, Mingyue Cheng, Yuting Zeng, Qi Liu, Tongxuan Liu

Main category: cs.CL

TL;DR: SoLT benchmark tests LLM-based logical translators by rewriting reasoning tasks into diverse linguistic forms while preserving logic. MenTaL method improves consistency by maintaining concept-symbol mapping tables.

Details

Motivation: Existing LLM-based logical translators fail to maintain consistent symbolic representations when the same concept appears in different linguistic forms, breaking logical coherence. Most benchmarks lack this real-world linguistic variation.

Method: Created SoLT benchmark that systematically rewrites reasoning datasets into diverse but logically equivalent forms. Proposed MenTaL method that explicitly builds concept-symbol mapping tables during translation to maintain consistency.

Result: Experiments show LLMs suffer from inconsistent symbol mapping under linguistic variation, causing significant accuracy drops. MenTaL brings stable performance improvements across diverse inputs.

Conclusion: Overlooking linguistic diversity hides key weaknesses in LLM-based translators. The work offers steps toward more reliable logical reasoning in varied real-world scenarios through systematic evaluation and consistency-enhancing methods.

Abstract: Logical reasoning with large language models (LLMs) has received growing attention. One mainstream approach translates natural language into formal logic and then applies symbolic solvers for deduction. While effective in many tasks, these LLM-based translators often fail to generate consistent symbolic representations when the same concept appears in different linguistic forms. Such inconsistencies break logical coherence and lead to solver errors. However, most existing benchmarks lack this type of linguistic variation, which frequently occurs in real-world text, leaving the problem underexplored. To address this gap, we present SoLT, a benchmark that systematically rewrites reasoning datasets into diverse yet logically equivalent forms across multiple levels. Beyond evaluation, SoLT also provides a general method to enrich any dataset with linguistic diversity while preserving both meaning and logic. To further enhance the stability of LLM-based reasoning, we propose MenTaL, which explicitly guides models to build a concept-symbol mapping table during translation. By linking equivalent expressions to shared symbols, MenTaL maintains consistency and mitigates symbol drift. Experiments on SoLT demonstrate that LLMs indeed suffer from inconsistent symbol mapping under linguistic variation, leading to significant drops in reasoning accuracy. Meanwhile, applying MenTaL brings clear and stable performance improvements across diverse inputs. Overall, our findings reveal that overlooking linguistic diversity hides key weaknesses in LLM-based translators, and our work offers a step toward more reliable logical reasoning in varied real-world scenarios. Our code is available at https://github.com/wufeiwuwoshihua/LinguDiver.

[149] KScope: A Framework for Characterizing the Knowledge Status of Language Models

Yuxin Xiao, Shan Chen, Jack Gallifant, Danielle Bitterman, Thomas Hartvigsen, Marzyeh Ghassemi

Main category: cs.CL

TL;DR: KScope is a hierarchical framework that uses statistical tests to characterize LLM knowledge into five statuses based on consistency and correctness, revealing how context features influence knowledge updates across different models.

Details

Motivation: Prior work focused on knowledge conflicts but didn't fully capture how well LLMs know answers. There's a need for a systematic way to characterize LLM knowledge status beyond just conflict scenarios.

Method: Proposed KScope - a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and categorizes LLM knowledge into five statuses based on consistency and correctness.

Result: Applied to 9 LLMs across 4 datasets, found that: supporting context narrows knowledge gaps; difficulty, relevance, and familiarity features drive successful updates; LLMs show similar feature preferences when partially correct/conflicted but diverge when consistently wrong; context summarization with feature analysis improves update effectiveness.

Conclusion: KScope provides a systematic framework for characterizing LLM knowledge, revealing key insights about how context features influence knowledge updates and showing that targeted context summarization can improve knowledge update effectiveness across different LLMs.

Abstract: Characterizing a large language model’s (LLM’s) knowledge of a given question is challenging. As a result, prior work has primarily examined LLM behavior under knowledge conflicts, where the model’s internal parametric memory contradicts information in the external context. However, this does not fully reflect how well the model knows the answer to the question. In this paper, we first introduce a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We then propose KScope, a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and characterizes LLM knowledge into one of these five statuses. We apply KScope to nine LLMs across four datasets and systematically establish: (1) Supporting context narrows knowledge gaps across models. (2) Context features related to difficulty, relevance, and familiarity drive successful knowledge updates. (3) LLMs exhibit similar feature preferences when partially correct or conflicted, but diverge sharply when consistently wrong. (4) Context summarization constrained by our feature analysis, together with enhanced credibility, further improves update effectiveness and generalizes across LLMs.

[150] Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

Sungeun Hahm, Heejin Kim, Gyuseong Lee, Hyunji Park, Jaejin Lee

Main category: cs.CL

TL;DR: Thunder-DeID is a framework for de-identifying Korean court judgments that addresses legal requirements through a systematic PII categorization and DNN-based pipeline, achieving state-of-the-art performance.

Details

Motivation: Current de-identification processes in South Korean judiciary are inadequate for large-scale court judgment processing and face challenges with vague legal definitions of personal identifiers that don't translate well to technical solutions.

Method: Proposed Thunder-DeID framework includes: (i) creating first Korean legal dataset with annotated judgments and entity mentions, (ii) systematic PII categorization, and (iii) end-to-end DNN-based de-identification pipeline.

Result: The model achieves state-of-the-art performance in court judgment de-identification.

Conclusion: Thunder-DeID provides an effective solution for balancing open access to justice with personal data protection by aligning technical de-identification with legal requirements.

Abstract: To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.

[151] Detecting Token-Level Hallucinations Using Variance Signals: A Reference-Free Approach

Keshav Kumar

Main category: cs.CL

TL;DR: A reference-free, token-level hallucination detection framework using variance in token log-probabilities across multiple stochastic generations to identify hallucinations in LLMs.

Details

Motivation: LLMs often generate factually incorrect outputs (hallucinations) confidently, requiring methods to detect these errors without relying on ground-truth references.

Method: Leverages variance in token log-probabilities across multiple stochastic generations from the same model, making it model-agnostic and suitable for real-time or post-hoc analysis.

Result: Evaluation on SQuAD v2 unanswerable questions shows token-level variance reliably highlights model instability and correlates with hallucination patterns across GPT-Neo 125M, Falcon 1B, and Mistral 7B models.

Conclusion: The framework provides a lightweight, reproducible, and adaptable diagnostic tool for analyzing generative reliability in LLMs across multiple domains.

Abstract: Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.

[152] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Nathan Byrd, Ashrith Sheshan, Raia Hadsell, Sangnie Bhardwaj, Pawel Janus, Tero Rissa, Dan Horgan, Alvin Abdagic, Lior Belenki, James Allingham, Anima Singh, Theo Guidroz, Srivatsan Srinivasan, Herman Schmit, Kristen Chiafullo, Andre Elisseeff, Nilpa Jha, Prateek Kolhar, Leonard Berrada, Frank Ding, Xiance Si, Shrestha Basu Mallick, Franz Och, Sofia Erell, Eric Ni, Tejasi Latkar, Sherry Yang, Petar Sirkovic, Ziqiang Feng, Robert Leland, Rachel Hornung, Gang Wu, Charles Blundell, Hamidreza Alvari, Po-Sen Huang, Cathy Yip, Sanja Deur, Li Liu, Gabriela Surita, Pablo Duque, Dima Damen, Johnson Jia, Arthur Guez, Markus Mircea, Animesh Sinha, Alberto Magni, Paweł Stradomski, Tal Marian, Vlado Galić, Wenhu Chen, Hisham Husain, Achintya Singhal, Dominik Grewe, François-Xavier Aubet, Shuang Song, Lorenzo Blanco, Leland Rechis, Lewis Ho, Rich Munoz, Kelvin Zheng, Jessica Hamrick, Kevin Mather, Hagai Taitelbaum, Eliza Rutherford, Yun Lei, Kuangyuan Chen, Anand Shukla, Erica Moreira, Eric Doi, Berivan Isik, Nir Shabat, Dominika Rogozińska, Kashyap Kolipaka, Jason Chang, Eugen Vušak, Srinivasan Venkatachary, Shadi Noghabi, Tarun Bharti, Younghoon Jun, Aleksandr Zaks, Simon Green, Jeshwanth Challagundla, William Wong, Muqthar Mohammad, Dean Hirsch, Yong Cheng, Iftekhar Naim, Lev Proleev, Damien Vincent, Aayush Singh, Maxim Krikun, Dilip Krishnan, Zoubin Ghahramani, Aviel Atias, Rajeev Aggarwal, Christo Kirov, Dimitrios Vytiniotis, Christy Koh, Alexandra Chronopoulou, Pawan Dogra, Vlad-Doru Ion, Gladys Tyen, Jason Lee, Felix Weissenberger, Trevor Strohman, Ashwin Balakrishna, Jack Rae, Marko Velic, Raoul de Liedekerke, Oded Elyada, Wentao Yuan, Canoee Liu, Lior Shani, Sergey Kishchenko, Bea Alessio, Yandong Li, Richard Song, Sam Kwei, Orion Jankowski, Aneesh Pappu, Youhei Namiki, Yenai Ma, Nilesh Tripuraneni, Colin Cherry, Marissa Ikonomidis, Yu-Cheng Ling, Colin Ji, Beka Westberg, Auriel Wright, Da Yu, David Parkinson, Swaroop Ramaswamy, Jerome Connor, Soheil Hassas Yeganeh, Snchit Grover, George Kenwright, Lubo Litchev, Chris Apps, Alex Tomala, Felix Halim, Alex Castro-Ros, Zefei Li, Anudhyan Boral, Pauline Sho, Michal Yarom, Eric Malmi, David Klinghoffer, Rebecca Lin, Alan Ansell, Pradeep Kumar S, Shubin Zhao, Siqi Zuo, Adam Santoro, Heng-Tze Cheng, Solomon Demmessie, Yuchi Liu, Nicole Brichtova, Allie Culp, Nathaniel Braun, Dan Graur, Will Ng, Nikhil Mehta, Aaron Phillips, Patrik Sundberg, Varun Godbole, Fangyu Liu, Yash Katariya, David Rim, Mojtaba Seyedhosseini, Sean Ammirati, Jonas Valfridsson, Mahan Malihi, Timothy Knight, Andeep Toor, Thomas Lampe, Abe Ittycheriah, Lewis Chiang, Chak Yeung, Alexandre Fréchette, Jinmeng Rao, Huisheng Wang, Himanshu Srivastava, Richard Zhang, Rocky Rhodes, Ariel Brand, Dean Weesner, Ilya Figotin, Felix Gimeno, Rachana Fellinger, Pierre Marcenac, José Leal, Eyal Marcus, Victor Cotruta, Rodrigo Cabrera, Sheryl Luo, Dan Garrette, Vera Axelrod, Sorin Baltateanu, David Barker, Dongkai Chen, Horia Toma, Ben Ingram, Jason Riesa, Chinmay Kulkarni, Yujing Zhang, Hongbin Liu, Chao Wang, Martin Polacek, Will Wu, Kai Hui, Adrian N Reyes, Yi Su, Megan Barnes, Ishaan Malhi, Anfal Siddiqui, Qixuan Feng, Mihai Damaschin, Daniele Pighin, Andreas Steiner, Samuel Yang, Ramya Sree Boppana, Simeon Ivanov, Arun Kandoor, Aditya Shah, Asier Mujika, Da Huang, Christopher A. Choquette-Choo, Mohak Patel, Tianhe Yu, Toni Creswell, Jerry, Liu, Catarina Barros, Yasaman Razeghi, Aurko Roy, Phil Culliton, Binbin Xiong, Jiaqi Pan, Thomas Strohmann, Tolly Powell, Babi Seal, Doug DeCarlo, Pranav Shyam, Kaan Katircioglu, Xuezhi Wang, Cassidy Hardin, Immanuel Odisho, Josef Broder, Oscar Chang, Arun Nair, Artem Shtefan, Maura O’Brien, Manu Agarwal, Sahitya Potluri, Siddharth Goyal, Amit Jhindal, Saksham Thakur, Yury Stuken, James Lyon, Kristina Toutanova, Fangxiaoyu Feng, Austin Wu, Ben Horn, Alek Wang, Alex Cullum, Gabe Taubman, Disha Shrivastava, Chongyang Shi, Hamish Tomlinson, Roma Patel, Tao Tu, Ada Maksutaj Oflazer, Francesco Pongetti, Mingyao Yang, Adrien Ali Taïga, Vincent Perot, Nuo Wang Pierse, Feng Han, Yoel Drori, Iñaki Iturrate, Ayan Chakrabarti, Legg Yeung, Dave Dopson, Yi-ting Chen, Apoorv Kulshreshtha, Tongfei Guo, Philip Pham, Tal Schuster, Junquan Chen, Alex Polozov, Jinwei Xing, Huanjie Zhou, Praneeth Kacham, Doron Kukliansky, Antoine Miech, Sergey Yaroshenko, Ed Chi, Sholto Douglas, Hongliang Fei, Mathieu Blondel, Preethi Myla, Lior Madmoni, Xing Wu, Daniel Keysers, Kristian Kjems, Isabela Albuquerque, Lijun Yu, Joel D’sa, Michelle Plantan, Vlad Ionescu, Jaume Sanchez Elias, Abhirut Gupta, Manish Reddy Vuyyuru, Fred Alcober, Tong Zhou, Kaiyang Ji, Florian Hartmann, Subha Puttagunta, Hugo Song, Ehsan Amid, Anca Stefanoiu, Andrew Lee, Paul Pucciarelli, Emma Wang, Amit Raul, Slav Petrov, Isaac Tian, Valentin Anklin, Nana Nti, Victor Gomes, Max Schumacher, Grace Vesom, Alex Panagopoulos, Konstantinos Bousmalis, Daniel Andor, Josh Jacob, Yuan Zhang, Bill Rosgen, Matija Kecman, Matthew Tung, Alexandra Belias, Noah Goodman, Paul Covington, Brian Wieder, Nikita Saxena, Elnaz Davoodi, Muhuan Huang, Sharath Maddineni, Vincent Roulet, Folawiyo Campbell-Ajala, Pier Giuseppe Sessa, Xintian, Wu, Guangda Lai, Paul Collins, Alex Haig, Vytenis Sakenas, Xiaowei Xu, Marissa Giustina, Laurent El Shafey, Pichi Charoenpanit, Shefali Garg, Joshua Ainslie, Boone Severson, Montse Gonzalez Arenas, Shreya Pathak, Sujee Rajayogam, Jie Feng, Michiel Bakker, Sheng Li, Nevan Wichers, Jamie Rogers, Xinyang Geng, Yeqing Li, Rolf Jagerman, Chao Jia, Nadav Olmert, David Sharon, Matthew Mauger, Sandeep Mariserla, Hongxu Ma, Megha Mohabey, Kyuyeun Kim, Alek Andreev, Scott Pollom, Juliette Love, Vihan Jain, Priyanka Agrawal, Yannick Schroecker, Alisa Fortin, Manfred Warmuth, Ji Liu, Andrew Leach, Irina Blok, Ganesh Poomal Girirajan, Roee Aharoni, Benigno Uria, Andrei Sozanschi, Dan Goldberg, Lucian Ionita, Marco Tulio Ribeiro, Martin Zlocha, Vighnesh Birodkar, Sami Lachgar, Liangzhe Yuan, Himadri Choudhury, Matt Ginsberg, Fei Zheng, Gregory Dibb, Emily Graves, Swachhand Lokhande, Gabriel Rasskin, George-Cristian Muraru, Corbin Quick, Sandeep Tata, Pierre Sermanet, Aditya Chawla, Itay Karo, Yan Wang, Susan Zhang, Orgad Keller, Anca Dragan, Guolong Su, Ian Chou, Xi Liu, Yiqing Tao, Shruthi Prabhakara, Marc Wilson, Ruibo Liu, Shibo Wang, Georgie Evans, David Du, Alfonso Castaño, Gautam Prasad, Mona El Mahdy, Sebastian Gerlach, Machel Reid, Jarrod Kahn, Amir Zait, Thanumalayan Sankaranarayana Pillai, Thatcher Ulrich, Guanyu Wang, Jan Wassenberg, Efrat Farkash, Kiran Yalasangi, Congchao Wang, Maria Bauza, Simon Bucher, Ting Liu, Jun Yan, Gary Leung, Vikas Sindhwani, Parker Barnes, Avi Singh, Ivan Jurin, Jichuan Chang, Niket Kumar Bhumihar, Sivan Eiger, Gui Citovsky, Ben Withbroe, Zhang Li, Siyang Xue, Niccolò Dal Santo, Georgi Stoyanov, Yves Raimond, Steven Zheng, Yilin Gao, Vít Listík, Sławek Kwasiborski, Rachel Saputro, Adnan Ozturel, Ganesh Mallya, Kushal Majmundar, Ross West, Paul Caron, Jinliang Wei, Lluis Castrejon, Sharad Vikram, Deepak Ramachandran, Nikhil Dhawan, Jiho Park, Sara Smoot, George van den Driessche, Yochai Blau, Chase Malik, Wei Liang, Roy Hirsch, Cicero Nogueira dos Santos, Eugene Weinstein, Aäron van den Oord, Sid Lall, Nicholas FitzGerald, Zixuan Jiang, Xuan Yang, Dale Webster, Ali Elqursh, Aedan Pope, Georges Rotival, David Raposo, Wanzheng Zhu, Jeff Dean, Sami Alabed, Dustin Tran, Arushi Gupta, Zach Gleicher, Jessica Austin, Edouard Rosseel, Megh Umekar, Dipanjan Das, Yinghao Sun, Kai Chen, Karolis Misiunas, Xiang Zhou, Yixian Di, Alyssa Loo, Josh Newlan, Bo Li, Vinay Ramasesh, Ying Xu, Alex Chen, Sudeep Gandhe, Radu Soricut, Nikita Gupta, Shuguang Hu, Seliem El-Sayed, Xavier Garcia, Idan Brusilovsky, Pu-Chin Chen, Andrew Bolt, Lu Huang, Alex Gurney, Zhiying Zhang, Alexander Pritzel, Jarek Wilkiewicz, Bryan Seybold, Bhargav Kanagal Shamanna, Felix Fischer, Josef Dean, Karan Gill, Ross Mcilroy, Abhishek Bhowmick, Jeremy Selier, Antoine Yang, Derek Cheng, Vladimir Magay, Jie Tan, Dhriti Varma, Christian Walder, Tomas Kocisky, Ryo Nakashima, Paul Natsev, Mike Kwong, Ionel Gog, Chiyuan Zhang, Sander Dieleman, Thomas Jimma, Andrey Ryabtsev, Siddhartha Brahma, David Steiner, Dayou Du, Ante Žužul, Mislav Žanić, Mukund Raghavachari, Willi Gierke, Zeyu Zheng, Dessie Petrova, Yann Dauphin, Yuchuan Liu, Ido Kessler, Steven Hand, Chris Duvarney, Seokhwan Kim, Hyo Lee, Léonard Hussenot, Jeffrey Hui, Josh Smith, Deepali Jain, Jiawei Xia, Gaurav Singh Tomar, Keyvan Amiri, Du Phan, Fabian Fuchs, Tobias Weyand, Nenad Tomasev, Alexandra Cordell, Xin Liu, Jonathan Mallinson, Pankaj Joshi, Andy Crawford, Arun Suggala, Steve Chien, Nick Fernando, Mariella Sanchez-Vargas, Duncan Williams, Phil Crone, Xiyang Luo, Igor Karpov, Jyn Shan, Terry Thurk, Robin Strudel, Paul Voigtlaender, Piyush Patil, Tim Dozat, Ali Khodaei, Sahil Singla, Piotr Ambroszczyk, Qiyin Wu, Yifan Chang, Brian Roark, Chaitra Hegde, Tianli Ding, Angelos Filos, Zhongru Wu, André Susano Pinto, Shuang Liu, Saarthak Khanna, Aditya Pandey, Siobhan Mcloughlin, Qiujia Li, Sam Haves, Allan Zhou, Elena Buchatskaya, Isabel Leal, Peter de Boursac, Nami Akazawa, Nina Anderson, Terry Chen, Krishna Somandepalli, Chen Liang, Sheela Goenka, Stephanie Winkler, Alexander Grushetsky, Yifan Ding, Jamie Smith, Fan Ye, Jordi Pont-Tuset, Eric Li, Ruichao Li, Tomer Golany, Dawid Wegner, Tao Jiang, Omer Barak, Yuan Shangguan, Eszter Vértes, Renee Wong, Jörg Bornschein, Alex Tudor, Michele Bevilacqua, Tom Schaul, Ankit Singh Rawat, Yang Zhao, Kyriakos Axiotis, Lei Meng, Cory McLean, Jonathan Lai, Jennifer Beattie, Nate Kushman, Yaxin Liu, Blair Kutzman, Fiona Lang, Jingchen Ye, Praneeth Netrapalli, Pushkar Mishra, Myriam Khan, Megha Goel, Rob Willoughby, David Tian, Honglei Zhuang, JD Chen, Zak Tsai, Tasos Kementsietsidis, Arjun Khare, James Keeling, Keyang Xu, Nathan Waters, Florent Altché, Ashok Popat, Bhavishya Mittal, David Saxton, Dalia El Badawy, Michael Mathieu, Zheng Zheng, Hao Zhou, Nishant Ranka, Richard Shin, Qingnan Duan, Tim Salimans, Ioana Mihailescu, Uri Shaham, Ming-Wei Chang, Yannis Assael, Nishanth Dikkala, Martin Izzard, Vincent Cohen-Addad, Cat Graves, Vlad Feinberg, Grace Chung, DJ Strouse, Danny Karmon, Sahand Sharifzadeh, Zoe Ashwood, Khiem Pham, Jon Blanton, Alex Vasiloff, Jarred Barber, Mark Geller, Aurick Zhou, Fedir Zubach, Tzu-Kuo Huang, Lei Zhang, Himanshu Gupta, Matt Young, Julia Proskurnia, Ronny Votel, Valentin Gabeur, Gabriel Barcik, Aditya Tripathi, Hongkun Yu, Geng Yan, Beer Changpinyo, Filip Pavetić, Amy Coyle, Yasuhisa Fujii, Jorge Gonzalez Mendez, Tianhao Zhou, Harish Rajamani, Blake Hechtman, Eddie Cao, Da-Cheng Juan, Yi-Xuan Tan, Valentin Dalibard, Yilun Du, Natalie Clay, Kaisheng Yao, Wenhao Jia, Dimple Vijaykumar, Yuxiang Zhou, Xinyi Bai, Wei-Chih Hung, Steven Pecht, Georgi Todorov, Nikhil Khadke, Pramod Gupta, Preethi Lahoti, Arnaud Autef, Karthik Duddu, James Lee-Thorp, Alexander Bykovsky, Tautvydas Misiunas, Sebastian Flennerhag, Santhosh Thangaraj, Jed McGiffin, Zack Nado, Markus Kunesch, Andreas Noever, Amir Hertz, Marco Liang, Victor Stone, Evan Palmer, Samira Daruki, Arijit Pramanik, Siim Põder, Austin Kyker, Mina Khan, Evgeny Sluzhaev, Marvin Ritter, Avraham Ruderman, Wenlei Zhou, Chirag Nagpal, Kiran Vodrahalli, George Necula, Paul Barham, Ellie Pavlick, Jay Hartford, Izhak Shafran, Long Zhao, Maciej Mikuła, Tom Eccles, Hidetoshi Shimokawa, Kanav Garg, Luke Vilnis, Hanwen Chen, Ilia Shumailov, Kuang-Huei Lee, Abdelrahman Abdelhamed, Meiyan Xie, Vered Cohen, Ester Hlavnova, Dan Malkin, Chawin Sitawarin, James Lottes, Pauline Coquinot, Tianli Yu, Sandeep Kumar, Jingwei Zhang, Aroma Mahendru, Zafarali Ahmed, James Martens, Tao Chen, Aviel Boag, Daiyi Peng, Coline Devin, Arseniy Klimovskiy, Mary Phuong, Danny Vainstein, Jin Xie, Bhuvana Ramabhadran, Nathan Howard, Xinxin Yu, Gitartha Goswami, Jingyu Cui, Sam Shleifer, Mario Pinto, Chih-Kuan Yeh, Ming-Hsuan Yang, Sara Javanmardi, Dan Ethier, Chace Lee, Jordi Orbay, Suyog Kotecha, Carla Bromberg, Pete Shaw, James Thornton, Adi Gerzi Rosenthal, Shane Gu, Matt Thomas, Ian Gemp, Aditya Ayyar, Asahi Ushio, Aarush Selvan, Joel Wee, Chenxi Liu, Maryam Majzoubi, Weiren Yu, Jake Abernethy, Tyler Liechty, Renke Pan, Hoang Nguyen, Qiong, Hu, Sarah Perrin, Abhinav Arora, Emily Pitler, Weiyi Wang, Kaushik Shivakumar, Flavien Prost, Ben Limonchik, Jing Wang, Yi Gao, Timothee Cour, Shyamal Buch, Huan Gui, Maria Ivanova, Philipp Neubeck, Kelvin Chan, Lucy Kim, Huizhong Chen, Naman Goyal, Da-Woon Chung, Lu Liu, Yao Su, Anastasia Petrushkina, Jiajun Shen, Armand Joulin, Yuanzhong Xu, Stein Xudong Lin, Yana Kulizhskaya, Ciprian Chelba, Shobha Vasudevan, Eli Collins, Vasilisa Bashlovkina, Tony Lu, Doug Fritz, Jongbin Park, Yanqi Zhou, Chen Su, Richard Tanburn, Mikhail Sushkov, Mitchelle Rasquinha, Jinning Li, Jennifer Prendki, Yiming Li, Pallavi LV, Shriya Sharma, Hen Fitoussi, Hui Huang, Andrew Dai, Phuong Dao, Mike Burrows, Henry Prior, Danfeng Qin, Golan Pundak, Lars Lowe Sjoesund, Art Khurshudov, Zhenkai Zhu, Albert Webson, Elizabeth Kemp, Tat Tan, Saurabh Agrawal, Susie Sargsyan, Liqun Cheng, Jim Stephan, Tom Kwiatkowski, David Reid, Arunkumar Byravan, Assaf Hurwitz Michaely, Nicolas Heess, Luowei Zhou, Sonam Goenka, Viral Carpenter, Anselm Levskaya, Bo Wang, Reed Roberts, Rémi Leblond, Sharat Chikkerur, Stav Ginzburg, Max Chang, Robert Riachi, Chuqiao, Xu, Zalán Borsos, Michael Pliskin, Julia Pawar, Morgane Lustman, Hannah Kirkwood, Ankit Anand, Aditi Chaudhary, Norbert Kalb, Kieran Milan, Sean Augenstein, Anna Goldie, Laurel Prince, Karthik Raman, Yanhua Sun, Vivian Xia, Aaron Cohen, Zhouyuan Huo, Josh Camp, Seher Ellis, Lukas Zilka, David Vilar Torres, Lisa Patel, Sho Arora, Betty Chan, Jonas Adler, Kareem Ayoub, Jacky Liang, Fayaz Jamil, Jiepu Jiang, Simon Baumgartner, Haitian Sun, Yael Karov, Yaroslav Akulov, Hui Zheng, Irene Cai, Claudio Fantacci, James Rubin, Alex Rav Acha, Mengchao Wang, Nina D’Souza, Rohit Sathyanarayana, Shengyang Dai, Simon Rowe, Andrey Simanovsky, Omer Goldman, Yuheng Kuang, Xiaoyue Pan, Andrew Rosenberg, Tania Rojas-Esponda, Praneet Dutta, Amy Zeng, Irina Jurenka, Greg Farquhar, Yamini Bansal, Shariq Iqbal, Becca Roelofs, Ga-Young Joung, Parker Beak, Changwan Ryu, Ryan Poplin, Yan Wu, Jean-Baptiste Alayrac, Senaka Buthpitiya, Olaf Ronneberger, Caleb Habtegebriel, Wei Li, Paul Cavallaro, Aurora Wei, Guy Bensky, Timo Denk, Harish Ganapathy, Jeff Stanway, Pratik Joshi, Francesco Bertolini, Jessica Lo, Olivia Ma, Zachary Charles, Geta Sampemane, Himanshu Sahni, Xu Chen, Harry Askham, David Gaddy, Peter Young, Jiewen Tan, Matan Eyal, Arthur Bražinskas, Li Zhong, Zhichun Wu, Mark Epstein, Kai Bailey, Andrew Hard, Kamyu Lee, Sasha Goldshtein, Alex Ruiz, Mohammed Badawi, Matthias Lochbrunner, JK Kearns, Ashley Brown, Fabio Pardo, Theophane Weber, Haichuan Yang, Pan-Pan Jiang, Berkin Akin, Zhao Fu, Marcus Wainwright, Chi Zou, Meenu Gaba, Pierre-Antoine Manzagol, Wendy Kan, Yang Song, Karina Zainullina, Rui Lin, Jeongwoo Ko, Salil Deshmukh, Apoorv Jindal, James Svensson, Divya Tyam, Heri Zhao, Christine Kaeser-Chen, Scott Baird, Pooya Moradi, Jamie Hall, Qiuchen Guo, Vincent Tsang, Bowen Liang, Fernando Pereira, Suhas Ganesh, Ivan Korotkov, Jakub Adamek, Sridhar Thiagarajan, Vinh Tran, Charles Chen, Chris Tar, Sanil Jain, Ishita Dasgupta, Taylan Bilal, David Reitter, Kai Zhao, Giulia Vezzani, Yasmin Gehman, Pulkit Mehta, Lauren Beltrone, Xerxes Dotiwalla, Sergio Guadarrama, Zaheer Abbas, Stefani Karp, Petko Georgiev, Chun-Sung Ferng, Marc Brockschmidt, Liqian Peng, Christoph Hirnschall, Vikas Verma, Yingying Bi, Ying Xiao, Avigail Dabush, Kelvin Xu, Phil Wallis, Randall Parker, Qifei Wang, Yang Xu, Ilkin Safarli, Dinesh Tewari, Yin Zhang, Seungyeon Kim, Andrea Gesmundo, Mackenzie Thomas, Sergey Levi, Ahmed Chowdhury, Kanishka Rao, Peter Garst, Sam Conway-Rahman, Helen Ran, Kay McKinney, Zhisheng Xiao, Wenhao Yu, Rohan Agrawal, Axel Stjerngren, Catalin Ionescu, Jingjing Chen, Vivek Sharma, Justin Chiu, Fei Liu, Ken Franko, Clayton Sanford, Xingyu Cai, Paul Michel, Sanjay Ganapathy, Jane Labanowski, Zachary Garrett, Ben Vargas, Sean Sun, Bryan Gale, Thomas Buschmann, Guillaume Desjardins, Nimesh Ghelani, Palak Jain, Mudit Verma, Chulayuth Asawaroengchai, Julian Eisenschlos, Jitendra Harlalka, Hideto Kazawa, Don Metzler, Joshua Howland, Ying Jian, Jake Ades, Viral Shah, Tynan Gangwani, Seungji Lee, Roman Ring, Steven M. Hernandez, Dean Reich, Amer Sinha, Ashutosh Sathe, Joe Kovac, Ashleah Gill, Ajay Kannan, Andrea D’olimpio, Martin Sevenich, Jay Whang, Been Kim, Khe Chai Sim, Jilin Chen, Jiageng Zhang, Shuba Lall, Yossi Matias, Bill Jia, Abe Friesen, Sara Nasso, Ashish Thapliyal, Bryan Perozzi, Ting Yu, Anna Shekhawat, Safeen Huda, Peter Grabowski, Eric Wang, Ashwin Sreevatsa, Hilal Dib, Mehadi Hassen, Parker Schuh, Vedrana Milutinovic, Chris Welty, Michael Quinn, Ali Shah, Bangju Wang, Gabe Barth-Maron, Justin Frye, Natalie Axelsson, Tao Zhu, Yukun Ma, Irene Giannoumis, Hanie Sedghi, Chang Ye, Yi Luan, Kevin Aydin, Bilva Chandra, Vivek Sampathkumar, Ronny Huang, Victor Lavrenko, Ahmed Eleryan, Zhi Hong, Steven Hansen, Sara Mc Carthy, Bidisha Samanta, Domagoj Ćevid, Xin Wang, Fangtao Li, Michael Voznesensky, Matt Hoffman, Andreas Terzis, Vikash Sehwag, Gil Fidel, Luheng He, Mu Cai, Yanzhang He, Alex Feng, Martin Nikoltchev, Samrat Phatale, Jason Chase, Rory Lawton, Ming Zhang, Tom Ouyang, Manuel Tragut, Mehdi Hafezi Manshadi, Arjun Narayanan, Jiaming Shen, Xu Gao, Tolga Bolukbasi, Nick Roy, Xin Li, Daniel Golovin, Liviu Panait, Zhen Qin, Guangxing Han, Thomas Anthony, Sneha Kudugunta, Viorica Patraucean, Aniket Ray, Xinyun Chen, Xiaochen Yang, Tanuj Bhatia, Pranav Talluri, Alex Morris, Andrija Ražnatović, Bethanie Brownfield, James An, Sheng Peng, Patrick Kane, Ce Zheng, Nico Duduta, Joshua Kessinger, James Noraky, Siqi Liu, Keran Rong, Petar Veličković, Keith Rush, Alex Goldin, Fanny Wei, Shiva Mohan Reddy Garlapati, Caroline Pantofaru, Okwan Kwon, Jianmo Ni, Eric Noland, Julia Di Trapani, Françoise Beaufays, Abhijit Guha Roy, Yinlam Chow, Aybuke Turker, Geoffrey Cideron, Lantao Mei, Jon Clark, Qingyun Dou, Matko Bošnjak, Ralph Leith, Yuqing Du, Amir Yazdanbakhsh, Milad Nasr, Chester Kwak, Suraj Satishkumar Sheth, Alex Kaskasoli, Ankesh Anand, Balaji Lakshminarayanan, Sammy Jerome, David Bieber, Chun-Te Chu, Alexandre Senges, Tianxiao Shen, Mukund Sridhar, Ndaba Ndebele, Benjamin Beyret, Shakir Mohamed, Mia Chen, Markus Freitag, Jiaxian Guo, Luyang Liu, Paul Roit, Heng Chen, Shen Yan, Tom Stone, JD Co-Reyes, Jeremy Cole, Salvatore Scellato, Shekoofeh Azizi, Hadi Hashemi, Alicia Jin, Anand Iyer, Marcella Valentine, András György, Arun Ahuja, Daniel Hernandez Diaz, Chen-Yu Lee, Nathan Clement, Weize Kong, Drew Garmon, Ishaan Watts, Kush Bhatia, Khyatti Gupta, Matt Miecnikowski, Hugo Vallet, Ankur Taly, Edward Loper, Saket Joshi, James Atwood, Jo Chick, Mark Collier, Fotis Iliopoulos, Ryan Trostle, Beliz Gunel, Ramiro Leal-Cavazos, Arnar Mar Hrafnkelsson, Michael Guzman, Xiaoen Ju, Andy Forbes, Jesse Emond, Kushal Chauhan, Ben Caine, Li Xiao, Wenjun Zeng, Alexandre Moufarek, Daniel Murphy, Maya Meng, Nitish Gupta, Felix Riedel, Anil Das, Elijah Lawal, Shashi Narayan, Tiberiu Sosea, James Swirhun, Linda Friso, Behnam Neyshabur, Jing Lu, Sertan Girgin, Michael Wunder, Edouard Yvinec, Aroonalok Pyne, Victor Carbune, Shruti Rijhwani, Yang Guo, Tulsee Doshi, Anton Briukhov, Max Bain, Ayal Hitron, Xuanhui Wang, Ashish Gupta, Ke Chen, Cosmo Du, Weiyang Zhang, Dhruv Shah, Arjun Akula, Max Dylla, Ashyana Kachra, Weicheng Kuo, Tingting Zou, Lily Wang, Luyao Xu, Jifan Zhu, Justin Snyder, Sachit Menon, Orhan Firat, Igor Mordatch, Yuan Yuan, Natalia Ponomareva, Rory Blevins, Lawrence Moore, Weijun Wang, Phil Chen, Martin Scholz, Artur Dwornik, Jason Lin, Sicheng Li, Diego Antognini, Te I, Xiaodan Song, Matt Miller, Uday Kalra, Adam Raveret, Oscar Akerlund, Felix Wu, Andrew Nystrom, Namrata Godbole, Tianqi Liu, Hannah DeBalsi, Jewel Zhao, Buhuang Liu, Avi Caciularu, Lauren Lax, Urvashi Khandelwal, Victoria Langston, Eric Bailey, Silvio Lattanzi, Yufei Wang, Neel Kovelamudi, Sneha Mondal, Guru Guruganesh, Nan Hua, Ofir Roval, Paweł Wesołowski, Rishikesh Ingale, Jonathan Halcrow, Tim Sohn, Christof Angermueller, Bahram Raad, Eli Stickgold, Eva Lu, Alec Kosik, Jing Xie, Timothy Lillicrap, Austin Huang, Lydia Lihui Zhang, Dominik Paulus, Clement Farabet, Alex Wertheim, Bing Wang, Rishabh Joshi, Chu-ling Ko, Yonghui Wu, Shubham Agrawal, Lily Lin, XiangHai Sheng, Peter Sung, Tyler Breland-King, Christina Butterfield, Swapnil Gawde, Sumeet Singh, Qiao Zhang, Raj Apte, Shilpa Shetty, Adrian Hutter, Tao Li, Elizabeth Salesky, Federico Lebron, Jonni Kanerva, Michela Paganini, Arthur Nguyen, Rohith Vallu, Jan-Thorsten Peter, Sarmishta Velury, David Kao, Jay Hoover, Anna Bortsova, Colton Bishop, Shoshana Jakobovits, Alessandro Agostini, Alekh Agarwal, Chang Liu, Charles Kwong, Sasan Tavakkol, Ioana Bica, Alex Greve, Anirudh GP, Jake Marcus, Le Hou, Tom Duerig, Rivka Moroshko, Dave Lacey, Andy Davis, Julien Amelot, Guohui Wang, Frank Kim, Theofilos Strinopoulos, Hui Wan, Charline Le Lan, Shankar Krishnan, Haotian Tang, Peter Humphreys, Junwen Bai, Idan Heimlich Shtacher, Diego Machado, Chenxi Pang, Ken Burke, Dangyi Liu, Renga Aravamudhan, Yue Song, Ed Hirst, Abhimanyu Singh, Brendan Jou, Liang Bai, Francesco Piccinno, Chuyuan Kelly Fu, Robin Alazard, Barak Meiri, Daniel Winter, Charlie Chen, Mingda Zhang, Jens Heitkaemper, John Lambert, Jinhyuk Lee, Alexander Frömmgen, Sergey Rogulenko, Pranav Nair, Paul Niemczyk, Anton Bulyenov, Bibo Xu, Hadar Shemtov, Morteza Zadimoghaddam, Serge Toropov, Mateo Wirth, Hanjun Dai, Sreenivas Gollapudi, Daniel Zheng, Alex Kurakin, Chansoo Lee, Kalesha Bullard, Nicolas Serrano, Ivana Balazevic, Yang Li, Johan Schalkwyk, Mark Murphy, Mingyang Zhang, Kevin Sequeira, Romina Datta, Nishant Agrawal, Charles Sutton, Nithya Attaluri, Mencher Chiang, Wael Farhan, Gregory Thornton, Kate Lin, Travis Choma, Hung Nguyen, Kingshuk Dasgupta, Dirk Robinson, Iulia Comşa, Michael Riley, Arjun Pillai, Basil Mustafa, Ben Golan, Amir Zandieh, Jean-Baptiste Lespiau, Billy Porter, David Ross, Sujeevan Rajayogam, Mohit Agarwal, Subhashini Venugopalan, Bobak Shahriari, Qiqi Yan, Hao Xu, Taylor Tobin, Pavel Dubov, Hongzhi Shi, Adrià Recasens, Anton Kovsharov, Sebastian Borgeaud, Lucio Dery, Shanthal Vasanth, Elena Gribovskaya, Linhai Qiu, Mahdis Mahdieh, Wojtek Skut, Elizabeth Nielsen, CJ Zheng, Adams Yu, Carrie Grimes Bostock, Shaleen Gupta, Aaron Archer, Chris Rawles, Elinor Davies, Alexey Svyatkovskiy, Tomy Tsai, Yoni Halpern, Christian Reisswig, Bartek Wydrowski, Bo Chang, Joan Puigcerver, Mor Hazan Taege, Jian Li, Eva Schnider, Xinjian Li, Dragos Dena, Yunhan Xu, Umesh Telang, Tianze Shi, Heiga Zen, Kyle Kastner, Yeongil Ko, Neesha Subramaniam, Aviral Kumar, Pete Blois, Zhuyun Dai, John Wieting, Yifeng Lu, Yoel Zeldes, Tian Xie, Anja Hauth, Alexandru Ţifrea, Yuqi Li, Sam El-Husseini, Dan Abolafia, Howard Zhou, Wen Ding, Sahra Ghalebikesabi, Carlos Guía, Andrii Maksai, Ágoston Weisz, Sercan Arik, Nick Sukhanov, Aga Świetlik, Xuhui Jia, Luo Yu, Weiyue Wang, Mark Brand, Dawn Bloxwich, Sean Kirmani, Zhe Chen, Alec Go, Pablo Sprechmann, Nithish Kannen, Alen Carin, Paramjit Sandhu, Isabel Edkins, Leslie Nooteboom, Jai Gupta, Loren Maggiore, Javad Azizi, Yael Pritch, Pengcheng Yin, Mansi Gupta, Danny Tarlow, Duncan Smith, Desi Ivanov, Mohammad Babaeizadeh, Ankita Goel, Satish Kambala, Grace Chu, Matej Kastelic, Michelle Liu, Hagen Soltau, Austin Stone, Shivani Agrawal, Min Kim, Kedar Soparkar, Srinivas Tadepalli, Oskar Bunyan, Rachel Soh, Arvind Kannan, DY Kim, Blake JianHang Chen, Afief Halumi, Sudeshna Roy, Yulong Wang, Olcan Sercinoglu, Gena Gibson, Sijal Bhatnagar, Motoki Sano, Daniel von Dincklage, Qingchun Ren, Blagoj Mitrevski, Mirek Olšák, Jennifer She, Carl Doersch, Jilei, Wang, Bingyuan Liu, Qijun Tan, Tamar Yakar, Tris Warkentin, Alex Ramirez, Carl Lebsack, Josh Dillon, Rajiv Mathews, Tom Cobley, Zelin Wu, Zhuoyuan Chen, Jon Simon, Swaroop Nath, Tara Sainath, Alexei Bendebury, Ryan Julian, Bharath Mankalale, Daria Ćurko, Paulo Zacchello, Adam R. Brown, Kiranbir Sodhia, Heidi Howard, Sergi Caelles, Abhinav Gupta, Gareth Evans, Anna Bulanova, Lesley Katzen, Roman Goldenberg, Anton Tsitsulin, Joe Stanton, Benoit Schillings, Vitaly Kovalev, Corey Fry, Rushin Shah, Kuo Lin, Shyam Upadhyay, Cheng Li, Soroush Radpour, Marcello Maggioni, Jing Xiong, Lukas Haas, Jenny Brennan, Aishwarya Kamath, Nikolay Savinov, Arsha Nagrani, Trevor Yacovone, Ryan Kappedal, Kostas Andriopoulos, Li Lao, YaGuang Li, Grigory Rozhdestvenskiy, Kazuma Hashimoto, Andrew Audibert, Sophia Austin, Daniel Rodriguez, Anian Ruoss, Garrett Honke, Deep Karkhanis, Xi Xiong, Qing Wei, James Huang, Zhaoqi Leng, Vittal Premachandran, Stan Bileschi, Georgios Evangelopoulos, Thomas Mensink, Jay Pavagadhi, Denis Teplyashin, Paul Chang, Linting Xue, Garrett Tanzer, Sally Goldman, Kaushal Patel, Shixin Li, Jeremy Wiesner, Ivy Zheng, Ian Stewart-Binks, Jie Han, Zhi Li, Liangchen Luo, Karel Lenc, Mario Lučić, Fuzhao Xue, Ryan Mullins, Alexey Guseynov, Chung-Ching Chang, Isaac Galatzer-Levy, Adam Zhang, Garrett Bingham, Grace Hu, Ale Hartman, Yue Ma, Jordan Griffith, Alex Irpan, Carey Radebaugh, Summer Yue, Lijie Fan, Victor Ungureanu, Christina Sorokin, Hannah Teufel, Peiran Li, Rohan Anil, Dimitris Paparas, Todd Wang, Chu-Cheng Lin, Hui Peng, Megan Shum, Goran Petrovic, Demetra Brady, Richard Nguyen, Klaus Macherey, Zhihao Li, Harman Singh, Madhavi Yenugula, Mariko Iinuma, Xinyi Chen, Kavya Kopparapu, Alexey Stern, Shachi Dave, Chandu Thekkath, Florence Perot, Anurag Kumar, Fangda Li, Yang Xiao, Matthew Bilotti, Mohammad Hossein Bateni, Isaac Noble, Lisa Lee, Amelio Vázquez-Reina, Julian Salazar, Xiaomeng Yang, Boyu Wang, Ela Gruzewska, Anand Rao, Sindhu Raghuram, Zheng Xu, Eyal Ben-David, Jieru Mei, Sid Dalmia, Zhaoyi Zhang, Yuchen Liu, Gagan Bansal, Helena Pankov, Steven Schwarcz, Andrea Burns, Christine Chan, Sumit Sanghai, Ricky Liang, Ethan Liang, Antoine He, Amy Stuart, Arun Narayanan, Yukun Zhu, Christian Frank, Bahar Fatemi, Amit Sabne, Oran Lang, Indro Bhattacharya, Shane Settle, Maria Wang, Brendan McMahan, Andrea Tacchetti, Livio Baldini Soares, Majid Hadian, Serkan Cabi, Timothy Chung, Nikita Putikhin, Gang Li, Jeremy Chen, Austin Tarango, Henryk Michalewski, Mehran Kazemi, Hussain Masoom, Hila Sheftel, Rakesh Shivanna, Archita Vadali, Ramona Comanescu, Doug Reid, Joss Moore, Arvind Neelakantan, Michaël Sander, Jonathan Herzig, Aviv Rosenberg, Mostafa Dehghani, JD Choi, Michael Fink, Reid Hayes, Eric Ge, Shitao Weng, Chia-Hua Ho, John Karro, Kalpesh Krishna, Lam Nguyen Thiet, Amy Skerry-Ryan, Daniel Eppens, Marco Andreetto, Navin Sarma, Silvano Bonacina, Burcu Karagol Ayan, Megha Nawhal, Zhihao Shan, Mike Dusenberry, Shantanu Thakoor, Sagar Gubbi, Duc Dung Nguyen, Reut Tsarfaty, Samuel Albanie, Jovana Mitrović, Meet Gandhi, Bo-Juen Chen, Alessandro Epasto, Georgi Stephanov, Ye Jin, Samuel Gehman, Aida Amini, Jack Weber, Feryal Behbahani, Shawn Xu, Miltos Allamanis, Xi Chen, Myle Ott, Claire Sha, Michal Jastrzebski, Hang Qi, David Greene, Xinyi Wu, Abodunrinwa Toki, Daniel Vlasic, Jane Shapiro, Ragha Kotikalapudi, Zhe Shen, Takaaki Saeki, Sirui Xie, Albin Cassirer, Shikhar Bharadwaj, Tatsuya Kiyono, Srinadh Bhojanapalli, Elan Rosenfeld, Sam Ritter, Jieming Mao, João Gabriel Oliveira, Zoltan Egyed, Bernd Bandemer, Emilio Parisotto, Keisuke Kinoshita, Juliette Pluto, Petros Maniatis, Steve Li, Yaohui Guo, Golnaz Ghiasi, Jean Tarbouriech, Srimon Chatterjee, Julie Jin, Katrina, Xu, Jennimaria Palomaki, Séb Arnold, Madhavi Sewak, Federico Piccinini, Mohit Sharma, Ben Albrecht, Sean Purser-haskell, Ashwin Vaswani, Chongyan Chen, Matheus Wisniewski, Qin Cao, John Aslanides, Nguyet Minh Phu, Maximilian Sieb, Lauren Agubuzu, Anne Zheng, Daniel Sohn, Marco Selvi, Anders Andreassen, Krishan Subudhi, Prem Eruvbetine, Oliver Woodman, Tomas Mery, Sebastian Krause, Xiaoqi Ren, Xiao Ma, Jincheng Luo, Dawn Chen, Wei Fan, Henry Griffiths, Christian Schuler, Alice Li, Shujian Zhang, Jean-Michel Sarr, Shixin Luo, Riccardo Patana, Matthew Watson, Dani Naboulsi, Michael Collins, Sailesh Sidhwani, Emiel Hoogeboom, Sharon Silver, Emily Caveness, Xiaokai Zhao, Mikel Rodriguez, Maxine Deines, Libin Bai, Patrick Griffin, Marco Tagliasacchi, Emily Xue, Spandana Raj Babbula, Bo Pang, Nan Ding, Gloria Shen, Elijah Peake, Remi Crocker, Shubha Srinivas Raghvendra, Danny Swisher, Woohyun Han, Richa Singh, Ling Wu, Vladimir Pchelin, Tsendsuren Munkhdalai, Dana Alon, Geoff Bacon, Efren Robles, Jannis Bulian, Melvin Johnson, George Powell, Felipe Tiengo Ferreira, Yaoyiran Li, Frederik Benzing, Mihajlo Velimirović, Hubert Soyer, William Kong, Tony, Nguyên, Zhen Yang, Jeremiah Liu, Joost van Amersfoort, Daniel Gillick, Baochen Sun, Nathalie Rauschmayr, Katie Zhang, Serena Zhan, Tao Zhou, Alexey Frolov, Chengrun Yang, Denis Vnukov, Louis Rouillard, Hongji Li, Amol Mandhane, Nova Fallen, Rajesh Venkataraman, Clara Huiyi Hu, Jennifer Brennan, Jenny Lee, Jerry Chang, Martin Sundermeyer, Zhufeng Pan, Rosemary Ke, Simon Tong, Alex Fabrikant, William Bono, Jindong Gu, Ryan Foley, Yiran Mao, Manolis Delakis, Dhruva Bhaswar, Roy Frostig, Nick Li, Avital Zipori, Cath Hope, Olga Kozlova, Swaroop Mishra, Josip Djolonga, Craig Schiff, Majd Al Merey, Eleftheria Briakou, Peter Morgan, Andy Wan, Avinatan Hassidim, RJ Skerry-Ryan, Kuntal Sengupta, Mary Jasarevic, Praveen Kallakuri, Paige Kunkle, Hannah Brennan, Tom Lieber, Hassan Mansoor, Julian Walker, Bing Zhang, Annie Xie, Goran Žužić, Adaeze Chukwuka, Alex Druinsky, Donghyun Cho, Rui Yao, Ferjad Naeem, Shiraz Butt, Eunyoung Kim, Zhipeng Jia, Mandy Jordan, Adam Lelkes, Mark Kurzeja, Sophie Wang, James Zhao, Andrew Over, Abhishek Chakladar, Marcel Prasetya, Neha Jha, Sriram Ganapathy, Yale Cong, Prakash Shroff, Carl Saroufim, Sobhan Miryoosefi, Mohamed Hammad, Tajwar Nasir, Weijuan Xi, Yang Gao, Young Maeng, Ben Hora, Chin-Yi Cheng, Parisa Haghani, Yoad Lewenberg, Caden Lu, Martin Matysiak, Naina Raisinghani, Huiyu Wang, Lexi Baugher, Rahul Sukthankar, Minh Giang, John Schultz, Noah Fiedel, Minmin Chen, Cheng-Chun Lee, Tapomay Dey, Hao Zheng, Shachi Paul, Celine Smith, Andy Ly, Yicheng Wang, Rishabh Bansal, Bartek Perz, Susanna Ricco, Stasha Blank, Vaishakh Keshava, Deepak Sharma, Marvin Chow, Kunal Lad, Komal Jalan, Simon Osindero, Craig Swanson, Jacob Scott, Anastasija Ilić, Xiaowei Li, Siddhartha Reddy Jonnalagadda, Afzal Shama Soudagar, Yan Xiong, Bat-Orgil Batsaikhan, Daniel Jarrett, Naveen Kumar, Maulik Shah, Matt Lawlor, Austin Waters, Mark Graham, Rhys May, Sabela Ramos, Sandra Lefdal, Zeynep Cankara, Nacho Cano, Brendan O’Donoghue, Jed Borovik, Frederick Liu, Jordan Grimstad, Mahmoud Alnahlawi, Katerina Tsihlas, Tom Hudson, Nikolai Grigorev, Yiling Jia, Terry Huang, Tobenna Peter Igwe, Sergei Lebedev, Xiaodan Tang, Igor Krivokon, Frankie Garcia, Melissa Tan, Eric Jia, Peter Stys, Shikhar Vashishth, Yu Liang, Balaji Venkatraman, Chenjie Gu, Anastasios Kementsietsidis, Chen Zhu, Junehyuk Jung, Yunfei Bai, Mohammad Javad Hosseini, Faruk Ahmed, Aditya Gupta, Xin Yuan, Shereen Ashraf, Shitij Nigam, Gautam Vasudevan, Pranjal Awasthi, Adi Mayrav Gilady, Zelda Mariet, Ramy Eskander, Haiguang Li, Hexiang Hu, Guillermo Garrido, Philippe Schlattner, George Zhang, Rohun Saxena, Petar Dević, Kritika Muralidharan, Ashwin Murthy, Yiqian Zhou, Min Choi, Arissa Wongpanich, Zhengdong Wang, Premal Shah, Yuntao Xu, Yiling Huang, Stephen Spencer, Alice Chen, James Cohan, Junjie Wang, Jonathan Tompson, Junru Wu, Ruba Haroun, Haiqiong Li, Blanca Huergo, Fan Yang, Tongxin Yin, James Wendt, Michael Bendersky, Rahma Chaabouni, Javier Snaider, Johan Ferret, Abhishek Jindal, Tara Thompson, Andrew Xue, Will Bishop, Shubham Milind Phal, Archit Sharma, Yunhsuan Sung, Prabakar Radhakrishnan, Mo Shomrat, Reeve Ingle, Roopali Vij, Justin Gilmer, Mihai Dorin Istin, Sam Sobell, Yang Lu, Emily Nottage, Dorsa Sadigh, Jeremiah Willcock, Tingnan Zhang, Steve Xu, Sasha Brown, Katherine Lee, Gary Wang, Yun Zhu, Yi Tay, Cheolmin Kim, Audrey Gutierrez, Abhanshu Sharma, Yongqin Xian, Sungyong Seo, Claire Cui, Elena Pochernina, Cip Baetu, Krzysztof Jastrzębski, Mimi Ly, Mohamed Elhawaty, Dan Suh, Eren Sezener, Pidong Wang, Nancy Yuen, George Tucker, Jiahao Cai, Zuguang Yang, Cindy Wang, Alex Muzio, Hai Qian, Jae Yoo, Derek Lockhart, Kevin R. McKee, Mandy Guo, Malika Mehrotra, Artur Mendonça, Sanket Vaibhav Mehta, Sherry Ben, Chetan Tekur, Jiaqi Mu, Muye Zhu, Victoria Krakovna, Hongrae Lee, AJ Maschinot, Sébastien Cevey, HyunJeong Choe, Aijun Bai, Hansa Srinivasan, Derek Gasaway, Nick Young, Patrick Siegler, Dan Holtmann-Rice, Vihari Piratla, Kate Baumli, Roey Yogev, Alex Hofer, Hado van Hasselt, Svetlana Grant, Yuri Chervonyi, David Silver, Andrew Hogue, Ayushi Agarwal, Kathie Wang, Preeti Singh, Four Flynn, Josh Lipschultz, Robert David, Lizzetth Bellot, Yao-Yuan Yang, Long Le, Filippo Graziano, Kate Olszewska, Kevin Hui, Akanksha Maurya, Nikos Parotsidis, Weijie Chen, Tayo Oguntebi, Joe Kelley, Anirudh Baddepudi, Johannes Mauerer, Gregory Shaw, Alex Siegman, Lin Yang, Shravya Shetty, Subhrajit Roy, Yunting Song, Wojciech Stokowiec, Ryan Burnell, Omkar Savant, Robert Busa-Fekete, Jin Miao, Samrat Ghosh, Liam MacDermed, Phillip Lippe, Mikhail Dektiarev, Zach Behrman, Fabian Mentzer, Kelvin Nguyen, Meng Wei, Siddharth Verma, Chris Knutsen, Sudeep Dasari, Zhipeng Yan, Petr Mitrichev, Xingyu Wang, Virat Shejwalkar, Jacob Austin, Srinivas Sunkara, Navneet Potti, Yan Virin, Christian Wright, Gaël Liu, Oriana Riva, Etienne Pot, Greg Kochanski, Quoc Le, Gargi Balasubramaniam, Arka Dhar, Yuguo Liao, Adam Bloniarz, Divyansh Shukla, Elizabeth Cole, Jong Lee, Sheng Zhang, Sushant Kafle, Siddharth Vashishtha, Parsa Mahmoudieh, Grace Chen, Raphael Hoffmann, Pranesh Srinivasan, Agustin Dal Lago, Yoav Ben Shalom, Zi Wang, Michael Elabd, Anuj Sharma, Junhyuk Oh, Suraj Kothawade, Maigo Le, Marianne Monteiro, Shentao Yang, Kaiz Alarakyia, Robert Geirhos, Diana Mincu, Håvard Garnes, Hayato Kobayashi, Soroosh Mariooryad, Kacper Krasowiak, Zhixin, Lai, Shibl Mourad, Mingqiu Wang, Fan Bu, Ophir Aharoni, Guanjie Chen, Abhimanyu Goyal, Vadim Zubov, Ankur Bapna, Elahe Dabir, Nisarg Kothari, Kay Lamerigts, Nicola De Cao, Jeremy Shar, Christopher Yew, Nitish Kulkarni, Dre Mahaarachchi, Mandar Joshi, Zhenhai Zhu, Jared Lichtarge, Yichao Zhou, Hannah Muckenhirn, Vittorio Selo, Oriol Vinyals, Peter Chen, Anthony Brohan, Vaibhav Mehta, Sarah Cogan, Ruth Wang, Ty Geri, Wei-Jen Ko, Wei Chen, Fabio Viola, Keshav Shivam, Lisa Wang, Madeleine Clare Elish, Raluca Ada Popa, Sébastien Pereira, Jianqiao Liu, Raphael Koster, Donnie Kim, Gufeng Zhang, Sayna Ebrahimi, Partha Talukdar, Yanyan Zheng, Petra Poklukar, Ales Mikhalap, Dale Johnson, Anitha Vijayakumar, Mark Omernick, Matt Dibb, Ayush Dubey, Qiong Hu, Apurv Suman, Vaibhav Aggarwal, Ilya Kornakov, Fei Xia, Wing Lowe, Alexey Kolganov, Ted Xiao, Vitaly Nikolaev, Steven Hemingray, Bonnie Li, Joana Iljazi, Mikołaj Rybiński, Ballie Sandhu, Peggy Lu, Thang Luong, Rodolphe Jenatton, Vineetha Govindaraj, Hui, Li, Gabriel Dulac-Arnold, Wonpyo Park, Henry Wang, Abhinit Modi, Jean Pouget-Abadie, Kristina Greller, Rahul Gupta, Robert Berry, Prajit Ramachandran, Jinyu Xie, Liam McCafferty, Jianling Wang, Kilol Gupta, Hyeontaek Lim, Blaž Bratanič, Andy Brock, Ilia Akolzin, Jim Sproch, Dan Karliner, Duhyeon Kim, Adrian Goedeckemeyer, Noam Shazeer, Cordelia Schmid, Daniele Calandriello, Parul Bhatia, Krzysztof Choromanski, Ceslee Montgomery, Dheeru Dua, Ana Ramalho, Helen King, Yue Gao, Lynn Nguyen, David Lindner, Divya Pitta, Oleaser Johnson, Khalid Salama, Diego Ardila, Michael Han, Erin Farnese, Seth Odoom, Ziyue Wang, Xiangzhuo Ding, Norman Rink, Ray Smith, Harshal Tushar Lehri, Eden Cohen, Neera Vats, Tong He, Parthasarathy Gopavarapu, Adam Paszke, Miteyan Patel, Wouter Van Gansbeke, Lucia Loher, Luis Castro, Maria Voitovich, Tamara von Glehn, Nelson George, Simon Niklaus, Zach Eaton-Rosen, Nemanja Rakićević, Erik Jue, Sagi Perel, Carrie Zhang, Yuval Bahat, Angéline Pouget, Zhi Xing, Fantine Huot, Ashish Shenoy, Taylor Bos, Vincent Coriou, Bryan Richter, Natasha Noy, Yaqing Wang, Santiago Ontanon, Siyang Qin, Gleb Makarchuk, Demis Hassabis, Zhuowan Li, Mandar Sharma, Kumaran Venkatesan, Iurii Kemaev, Roxanne Daniel, Shiyu Huang, Saloni Shah, Octavio Ponce, Warren, Chen, Manaal Faruqui, Jialin Wu, Slavica Andačić, Szabolcs Payrits, Daniel McDuff, Tom Hume, Yuan Cao, MH Tessler, Qingze Wang, Yinan Wang, Ivor Rendulic, Eirikur Agustsson, Matthew Johnson, Tanya Lando, Andrew Howard, Sri Gayatri Sundara Padmanabhan, Mayank Daswani, Andrea Banino, Michael Kilgore, Jonathan Heek, Ziwei Ji, Alvaro Caceres, Conglong Li, Nora Kassner, Alexey Vlaskin, Zeyu Liu, Alex Grills, Yanhan Hou, Roykrong Sukkerd, Gowoon Cheon, Nishita Shetty, Larisa Markeeva, Piotr Stanczyk, Tejas Iyer, Yuan Gong, Shawn Gao, Keerthana Gopalakrishnan, Tim Blyth, Malcolm Reynolds, Avishkar Bhoopchand, Misha Bilenko, Dero Gharibian, Vicky Zayats, Aleksandra Faust, Abhinav Singh, Min Ma, Hongyang Jiao, Sudheendra Vijayanarasimhan, Lora Aroyo, Vikas Yadav, Sarah Chakera, Ashwin Kakarla, Vilobh Meshram, Karol Gregor, Gabriela Botea, Evan Senter, Dawei Jia, Geza Kovacs, Neha Sharma, Sebastien Baur, Kai Kang, Yifan He, Lin Zhuo, Marija Kostelac, Itay Laish, Songyou Peng, Louis O’Bryan, Daniel Kasenberg, Girish Ramchandra Rao, Edouard Leurent, Biao Zhang, Sage Stevens, Ana Salazar, Ye Zhang, Ivan Lobov, Jake Walker, Allen Porter, Morgan Redshaw, Han Ke, Abhishek Rao, Alex Lee, Hoi Lam, Michael Moffitt, Jaeyoun Kim, Siyuan Qiao, Terry Koo, Robert Dadashi, Xinying Song, Mukund Sundararajan, Peng Xu, Chizu Kawamoto, Yan Zhong, Clara Barbu, Apoorv Reddy, Mauro Verzetti, Leon Li, George Papamakarios, Hanna Klimczak-Plucińska, Mary Cassin, Koray Kavukcuoglu, Rigel Swavely, Alain Vaucher, Jeffrey Zhao, Ross Hemsley, Michael Tschannen, Heming Ge, Gaurav Menghani, Yang Yu, Natalie Ha, Wei He, Xiao Wu, Maggie Song, Rachel Sterneck, Stefan Zinke, Dan A. Calian, Annie Marsden, Alejandro Cruzado Ruiz, Matteo Hessel, Almog Gueta, Benjamin Lee, Brian Farris, Manish Gupta, Yunjie Li, Mohammad Saleh, Vedant Misra, Kefan Xiao, Piermaria Mendolicchio, Gavin Buttimore, Varvara Krayvanova, Nigamaa Nayakanti, Matthew Wiethoff, Yash Pande, Azalia Mirhoseini, Ni Lao, Jasmine Liu, Yiqing Hua, Angie Chen, Yury Malkov, Dmitry Kalashnikov, Shubham Gupta, Kartik Audhkhasi, Yuexiang Zhai, Sudhindra Kopalle, Prateek Jain, Eran Ofek, Clemens Meyer, Khuslen Baatarsukh, Hana Strejček, Jun Qian, James Freedman, Ricardo Figueira, Michal Sokolik, Olivier Bachem, Raymond Lin, Dia Kharrat, Chris Hidey, Pingmei Xu, Dennis Duan, Yin Li, Muge Ersoy, Richard Everett, Kevin Cen, Rebeca Santamaria-Fernandez, Amir Taubenfeld, Ian Mackinnon, Linda Deng, Polina Zablotskaia, Shashank Viswanadha, Shivanker Goel, Damion Yates, Yunxiao Deng, Peter Choy, Mingqing Chen, Abhishek Sinha, Alex Mossin, Yiming Wang, Arthur Szlam, Susan Hao, Paul Kishan Rubenstein, Metin Toksoz-Exley, Miranda Aperghis, Yin Zhong, Junwhan Ahn, Michael Isard, Olivier Lacombe, Florian Luisier, Chrysovalantis Anastasiou, Yogesh Kalley, Utsav Prabhu, Emma Dunleavy, Shaan Bijwadia, Justin Mao-Jones, Kelly Chen, Rama Pasumarthi, Emily Wood, Adil Dostmohamed, Nate Hurley, Jiri Simsa, Alicia Parrish, Mantas Pajarskas, Matt Harvey, Ondrej Skopek, Yony Kochinski, Javier Rey, Verena Rieser, Denny Zhou, Sun Jae Lee, Trilok Acharya, Guowang Li, Joe Jiang, Xiaofan Zhang, Bryant Gipson, Ethan Mahintorabi, Marco Gelmi, Nima Khajehnouri, Angel Yeh, Kayi Lee, Loic Matthey, Leslie Baker, Trang Pham, Han Fu, Alex Pak, Prakhar Gupta, Cristina Vasconcelos, Adam Sadovsky, Brian Walker, Sissie Hsiao, Patrik Zochbauer, Andreea Marzoca, Noam Velan, Junhao Zeng, Gilles Baechler, Danny Driess, Divya Jain, Yanping Huang, Lizzie Tao, John Maggs, Nir Levine, Jon Schneider, Erika Gemzer, Samuel Petit, Shan Han, Zach Fisher, Dustin Zelle, Courtney Biles, Eugene Ie, Asya Fadeeva, Casper Liu, Juliana Vicente Franco, Adrian Collister, Hao Zhang, Renshen Wang, Ruizhe Zhao, Leandro Kieliger, Kurt Shuster, Rui Zhu, Boqing Gong, Lawrence Chan, Ruoxi Sun, Sujoy Basu, Roland Zimmermann, Jamie Hayes, Abhishek Bapna, Jasper Snoek, Weel Yang, Puranjay Datta, Jad Al Abdallah, Kevin Kilgour, Lu Li, SQ Mah, Yennie Jun, Morgane Rivière, Abhijit Karmarkar, Tammo Spalink, Tao Huang, Lucas Gonzalez, Duc-Hieu Tran, Averi Nowak, John Palowitch, Martin Chadwick, Ellie Talius, Harsh Mehta, Thibault Sellam, Philipp Fränken, Massimo Nicosia, Kyle He, Aditya Kini, David Amos, Sugato Basu, Harrison Jobe, Eleni Shaw, Qiantong Xu, Colin Evans, Daisuke Ikeda, Chaochao Yan, Larry Jin, Lun Wang, Sachin Yadav, Ilia Labzovsky, Ramesh Sampath, Ada Ma, Candice Schumann, Aditya Siddhant, Rohin Shah, John Youssef, Rishabh Agarwal, Natalie Dabney, Alessio Tonioni, Moran Ambar, Jing Li, Isabelle Guyon, Benny Li, David Soergel, Boya Fang, Georgi Karadzhov, Cristian Udrescu, Trieu Trinh, Vikas Raunak, Seb Noury, Dee Guo, Sonal Gupta, Mara Finkelstein, Denis Petek, Lihao Liang, Greg Billock, Pei Sun, David Wood, Yiwen Song, Xiaobin Yu, Tatiana Matejovicova, Regev Cohen, Kalyan Andra, David D’Ambrosio, Zhiwei Deng, Vincent Nallatamby, Ebrahim Songhori, Rumen Dangovski, Andrew Lampinen, Pankil Botadra, Adam Hillier, Jiawei Cao, Nagabhushan Baddi, Adhi Kuncoro, Toshihiro Yoshino, Ankit Bhagatwala, Marcáurelio Ranzato, Rylan Schaeffer, Tianlin Liu, Shuai Ye, Obaid Sarvana, John Nham, Chenkai Kuang, Isabel Gao, Jinoo Baek, Shubham Mittal, Ayzaan Wahid, Anita Gergely, Bin Ni, Josh Feldman, Carrie Muir, Pascal Lamblin, Wolfgang Macherey, Ethan Dyer, Logan Kilpatrick, Víctor Campos, Mukul Bhutani, Stanislav Fort, Yanif Ahmad, Aliaksei Severyn, Kleopatra Chatziprimou, Oleksandr Ferludin, Mason Dimarco, Aditya Kusupati, Joe Heyward, Dan Bahir, Kevin Villela, Katie Millican, Dror Marcus, Sanaz Bahargam, Caglar Unlu, Nicholas Roth, Zichuan Wei, Siddharth Gopal, Deepanway Ghoshal, Edward Lee, Sharon Lin, Jennie Lees, Dayeong Lee, Anahita Hosseini, Connie Fan, Seth Neel, Marcus Wu, Yasemin Altun, Honglong Cai, Enrique Piqueras, Josh Woodward, Alessandro Bissacco, Salem Haykal, Mahyar Bordbar, Prasha Sundaram, Sarah Hodkinson, Daniel Toyama, George Polovets, Austin Myers, Anu Sinha, Tomer Levinboim, Kashyap Krishnakumar, Rachita Chhaparia, Tatiana Sholokhova, Nitesh Bharadwaj Gundavarapu, Ganesh Jawahar, Haroon Qureshi, Jieru Hu, Nikola Momchev, Matthew Rahtz, Renjie Wu, Aishwarya P S, Kedar Dhamdhere, Meiqi Guo, Umang Gupta, Ali Eslami, Mariano Schain, Michiel Blokzijl, David Welling, Dave Orr, Levent Bolelli, Nicolas Perez-Nieves, Mikhail Sirotenko, Aman Prasad, Arjun Kar, Borja De Balle Pigem, Tayfun Terzi, Gellért Weisz, Dipankar Ghosh, Aditi Mavalankar, Dhruv Madeka, Kaspar Daugaard, Hartwig Adam, Viraj Shah, Dana Berman, Maggie Tran, Steven Baker, Ewa Andrejczuk, Grishma Chole, Ganna Raboshchuk, Mahdi Mirzazadeh, Thais Kagohara, Shimu Wu, Christian Schallhart, Bernett Orlando, Chen Wang, Alban Rrustemi, Hao Xiong, Hao Liu, Arpi Vezer, Nolan Ramsden, Shuo-yiin Chang, Sidharth Mudgal, Yan Li, Nino Vieillard, Yedid Hoshen, Farooq Ahmad, Ambrose Slone, Amy Hua, Natan Potikha, Mirko Rossini, Jon Stritar, Sushant Prakash, Zifeng Wang, Xuanyi Dong, Alireza Nazari, Efrat Nehoran, Kaan Tekelioglu, Yinxiao Li, Kartikeya Badola, Tom Funkhouser, Yuanzhen Li, Varun Yerram, Ramya Ganeshan, Daniel Formoso, Karol Langner, Tian Shi, Huijian Li, Yumeya Yamamori, Amayika Panda, Alaa Saade, Angelo Scorza Scarpati, Chris Breaux, CJ Carey, Zongwei Zhou, Cho-Jui Hsieh, Sophie Bridgers, Alena Butryna, Nishesh Gupta, Vaibhav Tulsyan, Sanghyun Woo, Evgenii Eltyshev, Will Grathwohl, Chanel Parks, Seth Benjamin, Rina Panigrahy, Shenil Dodhia, Daniel De Freitas, Chris Sauer, Will Song, Ferran Alet, Jackson Tolins, Cosmin Paduraru, Xingyi Zhou, Brian Albert, Zizhao Zhang, Lei Shu, Mudit Bansal, Sarah Nguyen, Amir Globerson, Owen Xiao, James Manyika, Tom Hennigan, Rong Rong, Josip Matak, Anton Bakalov, Ankur Sharma, Danila Sinopalnikov, Andrew Pierson, Stephen Roller, Geoff Brown, Mingcen Gao, Toshiyuki Fukuzawa, Amin Ghafouri, Kenny Vassigh, Iain Barr, Zhicheng Wang, Anna Korsun, Rajesh Jayaram, Lijie Ren, Tim Zaman, Samira Khan, Yana Lunts, Dan Deutsch, Dave Uthus, Nitzan Katz, Masha Samsikova, Amr Khalifa, Nikhil Sethi, Jiao Sun, Luming Tang, Uri Alon, Xianghong Luo, Dian Yu, Abhishek Nayyar, Bryce Petrini, Will Truong, Vincent Hellendoorn, Nikolai Chinaev, Chris Alberti, Wei Wang, Jingcao Hu, Vahab Mirrokni, Ananth Balashankar, Avia Aharon, Aahil Mehta, Ahmet Iscen, Joseph Kready, Lucas Manning, Anhad Mohananey, Yuankai Chen, Anshuman Tripathi, Allen Wu, Igor Petrovski, Dawsen Hwang, Martin Baeuml, Shreyas Chandrakaladharan, Yuan Liu, Rey Coaguila, Maxwell Chen, Sally Ma, Pouya Tafti, Susheel Tatineni, Terry Spitz, Jiayu Ye, Paul Vicol, Mihaela Rosca, Adrià Puigdomènech, Zohar Yahav, Sanjay Ghemawat, Hanzhao Lin, Phoebe Kirk, Zaid Nabulsi, Sergey Brin, Bernd Bohnet, Ken Caluwaerts, Aditya Srikanth Veerubhotla, Dan Zheng, Zihang Dai, Petre Petrov, Yichong Xu, Ramin Mehran, Zhuo Xu, Luisa Zintgraf, Jiho Choi, Spurthi Amba Hombaiah, Romal Thoppilan, Sashank Reddi, Lukasz Lew, Li Li, Kellie Webster, KP Sawhney, Lampros Lamprou, Siamak Shakeri, Mayank Lunayach, Jianmin Chen, Sumit Bagri, Alex Salcianu, Ying Chen, Yani Donchev, Charlotte Magister, Signe Nørly, Vitor Rodrigues, Tomas Izo, Hila Noga, Joe Zou, Thomas Köppe, Wenxuan Zhou, Kenton Lee, Xiangzhu Long, Danielle Eisenbud, Anthony Chen, Connor Schenck, Chi Ming To, Peilin Zhong, Emanuel Taropa, Minh Truong, Omer Levy, Danilo Martins, Zhiyuan Zhang, Christopher Semturs, Kelvin Zhang, Alex Yakubovich, Pol Moreno, Lara McConnaughey, Di Lu, Sam Redmond, Lotte Weerts, Yonatan Bitton, Tiziana Refice, Nicolas Lacasse, Arthur Conmy, Corentin Tallec, Julian Odell, Hannah Forbes-Pollard, Arkadiusz Socala, Jonathan Hoech, Pushmeet Kohli, Alanna Walton, Rui Wang, Mikita Sazanovich, Kexin Zhu, Andrei Kapishnikov, Rich Galt, Matthew Denton, Ben Murdoch, Caitlin Sikora, Kareem Mohamed, Wei Wei, Uri First, Tim McConnell, Luis C. Cobo, James Qin, Thi Avrahami, Daniel Balle, Yu Watanabe, Annie Louis, Adam Kraft, Setareh Ariafar, Yiming Gu, Eugénie Rives, Charles Yoon, Andrei Rusu, James Cobon-Kerr, Chris Hahn, Jiaming Luo, Yuvein, Zhu, Niharika Ahuja, Rodrigo Benenson, Raphaël Lopez Kaufman, Honglin Yu, Lloyd Hightower, Junlin Zhang, Darren Ni, Lisa Anne Hendricks, Gabby Wang, Gal Yona, Lalit Jain, Pablo Barrio, Surya Bhupatiraju, Siva Velusamy, Allan Dafoe, Sebastian Riedel, Tara Thomas, Zhe Yuan, Mathias Bellaiche, Sheena Panthaplackel, Klemen Kloboves, Sarthak Jauhari, Canfer Akbulut, Todor Davchev, Evgeny Gladchenko, David Madras, Aleksandr Chuklin, Tyrone Hill, Quan Yuan, Mukundan Madhavan, Luke Leonhard, Dylan Scandinaro, Qihang Chen, Ning Niu, Arthur Douillard, Bogdan Damoc, Yasumasa Onoe, Fabian Pedregosa, Fred Bertsch, Chas Leichner, Joseph Pagadora, Jonathan Malmaud, Sameera Ponda, Andy Twigg, Oleksii Duzhyi, Jingwei Shen, Miaosen Wang, Roopal Garg, Jing Chen, Utku Evci, Jonathan Lee, Leon Liu, Koji Kojima, Masa Yamaguchi, Arunkumar Rajendran, AJ Piergiovanni, Vinodh Kumar Rajendran, Marco Fornoni, Gabriel Ibagon, Harry Ragan, Sadh MNM Khan, John Blitzer, Andrew Bunner, Guan Sun, Takahiro Kosakai, Scott Lundberg, Ndidi Elue, Kelvin Guu, SK Park, Jane Park, Arunachalam Narayanaswamy, Chengda Wu, Jayaram Mudigonda, Trevor Cohn, Hairong Mu, Ravi Kumar, Laura Graesser, Yichi Zhang, Richard Killam, Vincent Zhuang, Mai Giménez, Wael Al Jishi, Ruy Ley-Wild, Alex Zhai, Kazuki Osawa, Diego Cedillo, Jialu Liu, Mayank Upadhyay, Marcin Sieniek, Roshan Sharma, Tom Paine, Anelia Angelova, Sravanti Addepalli, Carolina Parada, Kingshuk Majumder, Avery Lamp, Sanjiv Kumar, Xiang Deng, Artiom Myaskovsky, Tea Sabolić, Jeffrey Dudek, Sarah York, Félix de Chaumont Quitry, Jiazhong Nie, Dee Cattle, Alok Gunjan, Bilal Piot, Waleed Khawaja, Seojin Bang, Simon Wang, Siavash Khodadadeh, Raghavender R, Praynaa Rawlani, Richard Powell, Kevin Lee, Johannes Griesser, GS Oh, Cesar Magalhaes, Yujia Li, Simon Tokumine, Hadas Natalie Vogel, Dennis Hsu, Arturo BC, Disha Jindal, Matan Cohen, Zi Yang, Junwei Yuan, Dario de Cesare, Tony Bruguier, Jun Xu, Monica Roy, Alon Jacovi, Dan Belov, Rahul Arya, Phoenix Meadowlark, Shlomi Cohen-Ganor, Wenting Ye, Patrick Morris-Suzuki, Praseem Banzal, Gan Song, Pranavaraj Ponnuramu, Fred Zhang, George Scrivener, Salah Zaiem, Alif Raditya Rochman, Kehang Han, Badih Ghazi, Kate Lee, Shahar Drath, Daniel Suo, Antonious Girgis, Pradeep Shenoy, Duy Nguyen, Douglas Eck, Somit Gupta, Le Yan, Joao Carreira, Anmol Gulati, Ruoxin Sang, Daniil Mirylenka, Emma Cooney, Edward Chou, Mingyang Ling, Cindy Fan, Ben Coleman, Guilherme Tubone, Ravin Kumar, Jason Baldridge, Felix Hernandez-Campos, Angeliki Lazaridou, James Besley, Itay Yona, Neslihan Bulut, Quentin Wellens, AJ Pierigiovanni, Jasmine George, Richard Green, Pu Han, Connie Tao, Geoff Clark, Chong You, Abbas Abdolmaleki, Justin Fu, Tongzhou Chen, Ashwin Chaugule, Angad Chandorkar, Altaf Rahman, Will Thompson, Penporn Koanantakool, Mike Bernico, Jie Ren, Andrey Vlasov, Sergei Vassilvitskii, Maciej Kula, Yizhong Liang, Dahun Kim, Yangsibo Huang, Chengxi Ye, Dmitry Lepikhin, Wesley Helmholz

Main category: cs.CL

TL;DR: Google introduces the Gemini 2.X model family including Gemini 2.5 Pro (most capable model with SoTA coding/reasoning), Gemini 2.5 Flash (excellent reasoning with lower compute), and earlier Gemini 2.0 Flash/Flash-Lite models, spanning the full capability vs cost spectrum.

Details

Motivation: To provide a comprehensive model family that covers the entire Pareto frontier of model capability versus cost, enabling users to choose appropriate models for different use cases from complex agentic problem solving to cost-efficient applications.

Method: Developed multiple model variants: Gemini 2.5 Pro (thinking model with multimodal understanding and long context), Gemini 2.5 Flash (optimized for reasoning with lower compute), and Gemini 2.0 Flash/Flash-Lite (high performance at low latency/cost).

Result: Gemini 2.5 Pro achieves state-of-the-art performance on frontier coding and reasoning benchmarks, can process up to 3 hours of video content, and enables new agentic workflows through its combination of long context, multimodal, and reasoning capabilities.

Conclusion: The Gemini 2.X model generation successfully spans the full capability-cost spectrum, allowing users to explore boundaries of complex agentic problem solving while providing cost-effective options for various applications.

Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

[153] Prompt Perturbations Reveal Human-Like Biases in Large Language Model Survey Responses

Jens Rupprecht, Georg Ahnert, Markus Strohmaier

Main category: cs.CL

TL;DR: LLMs used as human proxies in surveys show significant response biases including recency bias, and are vulnerable to question phrasing perturbations, highlighting the need for careful prompt design and robustness testing.

Details

Motivation: To investigate the reliability and susceptibility of LLMs to human-like response biases when used as proxies in social science surveys, particularly in normative survey contexts.

Method: Tested nine LLMs on World Values Survey questions with ten different perturbations to question phrasing and answer options, conducting over 167,000 simulated survey interviews.

Result: All tested LLMs exhibited consistent recency bias (favoring last-presented options) and remained sensitive to semantic variations like paraphrasing. Larger models were generally more robust but still vulnerable to combined perturbations.

Conclusion: LLMs have significant response biases and vulnerabilities to perturbations, emphasizing the critical importance of prompt design and robustness testing when using them for synthetic survey data generation.

Abstract: Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known human-like response biases, such as central tendency, opinion floating and primacy bias are poorly understood. This work investigates the response robustness of LLMs in normative survey contexts, we test nine LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of ten perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated survey interviews. In doing so, we not only reveal LLMs’ vulnerabilities to perturbations but also show that all tested models exhibit a consistent recency bias, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.

[154] Why is Your Language Model a Poor Implicit Reward Model?

Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora

Main category: cs.CL

TL;DR: IM-RMs generalize worse than EX-RMs due to over-reliance on superficial token-level cues, not because IM-RMs can function as both verifiers and generators.

Details

Motivation: To understand why implicit reward models (IM-RMs) generalize worse than explicit reward models (EX-RMs) despite being nearly identical in architecture and training.

Method: Theoretical analysis and experiments comparing IM-RMs and EX-RMs, investigating their generalization behavior under token-level distribution shifts and challenging alternative hypotheses.

Result: IM-RMs rely more heavily on superficial token-level cues, leading to worse generalization than EX-RMs under token-level distribution shifts and even in-distribution. Alternative hypotheses about IM-RMs struggling with generation tasks were disproven.

Conclusion: Seemingly minor design choices in reward model implementation can substantially impact generalization behavior, with IM-RMs’ token-level bias explaining their performance gap with EX-RMs.

Abstract: Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.

[155] Causal Language Control in Multilingual Transformers via Sparse Feature Steering

Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O’Brien

Main category: cs.CL

TL;DR: Using sparse autoencoder features to steer language generation in multilingual LLMs by modifying single features to control output language with high success rates.

Details

Motivation: Addressing the challenge of controlling target language generation in multilingual LLMs without explicit prompts or fine-tuning in zero-shot settings.

Method: Leveraging pretrained SAEs on Gemma-2B and Gemma-9B residual streams, identifying language-sensitive features, and modifying single SAE features at specific transformer layers during inference.

Result: Achieved up to 90% success in controlled language shifts while preserving semantic fidelity, with best results in mid-to-late transformer layers and amplified by specific attention heads.

Conclusion: Sparse feature steering provides a lightweight and interpretable mechanism for controllable multilingual generation in LLMs.

Abstract: Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are available. In this work, we investigate whether sparse autoencoder (SAE) features, previously shown to correlate with interpretable model behaviors, can be leveraged to steer the generated language of LLMs during inference. Leveraging pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, we identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying just a single SAE feature at one transformer layer, we achieve controlled language shifts with up to 90% success, as measured by FastText language classification, while preserving semantic fidelity according to LaBSE (Language-Agnostic BERT Sentence Embedding) similarity. Our analysis reveals that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads disproportionately associated with language-sensitive SAE features. These results demonstrate the promise of sparse feature steering as a lightweight and interpretable mechanism for controllable multilingual generation.

[156] Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal

Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin

Main category: cs.CL

TL;DR: A simple add-on module enhances PLM robustness by removing instance-level principal components, transforming embeddings to Gaussian properties without adversarial training or data perturbation.

Details

Motivation: PLMs are vulnerable to adversarial attacks, and existing defense methods incur high computational costs through adversarial training or data augmentation.

Method: Proposes an add-on module that removes instance-level principal components from embeddings, transforming them to approximate Gaussian properties without conventional adversarial defenses.

Result: Evaluations on 8 benchmark datasets show improved adversarial robustness while maintaining comparable before-attack accuracy to baselines.

Conclusion: The approach achieves a balanced trade-off between robustness and generalization without requiring adversarial examples or costly training-time augmentation.

Abstract: Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.

[157] LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking

Junhong Wu, Jinliang Lu, Zixuan Ren, Gangqiang Hu, Zhi Wu, Dai Dai, Hua Wu

Main category: cs.CL

TL;DR: LLMs exhibit greedy behavior in Soft Thinking by relying on highest-probability tokens, creating a feedback loop that suppresses alternative reasoning paths. Stochastic Soft Thinking with Gumbel-Softmax randomness breaks this pattern and improves performance.

Details

Motivation: Human cognition uses abstract concepts while LLMs rely on discrete tokens, limiting expressive capabilities. Soft Thinking aims to enable reasoning in continuous concept space, but current implementations have limitations.

Method: Systematic analysis of LLM internal behavior using probing techniques, revealing greedy token selection. Proposed Stochastic Soft Thinking with Gumbel-Softmax trick to introduce randomness and break greedy feedback loops.

Result: Vanilla Soft Thinking shows single-threaded reasoning behavior. Stochastic approach alleviates limitations and achieves superior performance across eight reasoning benchmarks, with stronger exploration potential than conventional Chain-of-Thought.

Conclusion: Stochastic Soft Thinking deepens understanding of continuous reasoning and provides foundation for future improvements with Reinforcement Learning, overcoming the Greedy Pitfall in current Soft Thinking implementations.

Abstract: Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. In this paper, we investigate the Soft Thinking capabilities of various LLMs through a systematic analysis of their internal behavior using a suite of probing techniques. Contrary to the prevailing belief that Soft Thinking supports parallel exploration of diverse reasoning paths, our findings reveal that LLMs behave as single-threaded reasoners–they predominantly rely on the token with the highest probability in the soft input to predict the next step. This behavior induces a greedy feedback loop that suppresses alternative reasoning paths and undermines the benefits of transmitting richer information via Soft Tokens. To address this Greedy Pitfall, we propose Stochastic Soft Thinking, which introduces stochasticity to break free from this Greedy Pitfall. Our experiments demonstrate that incorporating randomness–particularly with the Gumbel-Softmax trick–can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking, resulting in superior performance across eight reasoning benchmarks. We further demonstrate that Stochastic Soft Thinking exhibits stronger exploration potential compared to conventional COT. Our findings deepen the understanding of continuous reasoning and establish the foundation for future work on improving Soft Thinking with Reinforcement Learning.

[158] Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

Peng Lai, Jianjie Zheng, Sijie Cheng, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

Main category: cs.CL

TL;DR: LAGER is a post-hoc framework that improves LLM-as-a-judge alignment with human preferences by leveraging cross-layer representations, achieving up to 7.5% improvement without complex prompts or fine-tuning.

Details

Motivation: Current LLM-as-a-judge methods mainly optimize based on shallow outputs and overlook rich cross-layer representations, while middle-to-upper layers often encode more human-aligned representations than the final layer.

Method: LAGER aggregates cross-layer score-token logits and computes expected scores from softmax-based distributions while keeping the LLM backbone frozen, fully leveraging complementary information across different layers.

Result: LAGER achieves improvements of up to 7.5% over best baselines on Flask, HelpSteer, and BIGGen benchmarks using Spearman correlation, and matches or outperforms reasoning-based methods without reasoning steps.

Conclusion: LAGER effectively improves LLM-as-a-judge alignment with human scores by leveraging internal representations, demonstrating strong generalization across various applications while maintaining a plug-and-play, frozen backbone approach.

Abstract: The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using LLMs, a paradigm known as “LLM-as-a-judge”. However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. Previous studies mainly optimize based on shallow outputs, overlooking rich cross-layer representations. In this work, motivated by preliminary findings that middle-to-upper layers encode semantically and task-relevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a post-hoc, plug-and-play framework for improving the alignment of LLM-as-a-Judge point-wise evaluations with human scores by leveraging internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer score-token logits and computing the expected score from a softmax-based distribution, while keeping the LLM backbone frozen and ensuring no impact on the inference process. LAGER fully leverages the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further show the generalization of LAGER.

[159] EviNote-RAG: Enhancing RAG Models via Answer-Supportive Evidence Notes

Yuqin Dai, Guoqing Wang, Yuan Wang, Kairan Dou, Kaichen Zhou, Zhanwei Zhang, Shuo Yang, Fei Tang, Jun Yin, Pengyu Zeng, Zhenzhe Ying, Can Yi, Changhua Meng, Yuchen Zhou, Yongliang Shen, Shuai Lu

Main category: cs.CL

TL;DR: EviNote-RAG introduces a retrieve-note-answer framework that creates Supportive-Evidence Notes to address noise and error accumulation in RAG systems, achieving significant performance improvements on QA benchmarks.

Details

Motivation: To overcome challenges in RAG systems: (1) low signal-to-noise ratio where relevant information is diluted by irrelevant content, and (2) error accumulation in multi-hop reasoning from incomplete or misleading information.

Method: Proposes a retrieve-note-answer workflow where the model first generates Supportive-Evidence Notes (SENs) that concisely preserve answer-critical information and mark key/uncertainty details. Uses an entailment-based Evidence Quality Reward (EQR) to ensure SENs are logically sufficient for deriving answers.

Result: Achieves state-of-the-art performance with relative F1 gains of 20% on HotpotQA (+0.093), 40% on Bamboogle (+0.151), and 91% on 2Wiki (+0.256). Improves answer accuracy, training stability, robustness, and efficiency.

Conclusion: EviNote-RAG effectively addresses RAG limitations by introducing intermediate evidence notes and quality rewards, leading to substantial improvements in reasoning quality and performance across multiple QA benchmarks.

Abstract: Retrieval-Augmented Generation (RAG) has advanced open-domain question answering by incorporating external information into model reasoning. However, effectively leveraging external information to enhance reasoning presents the following challenges: (1) low signal-to-noise ratio, where answer-supportive external information is diluted by irrelevant material, and (2) error accumulation, which arises in multi-hop reasoning when incomplete or misleading information is incorporated. To address these challenges, we introduce EviNote-RAG, a framework that follows a retrieve-note-answer workflow. Instead of reasoning directly over raw external information, the model first produces Supportive-Evidence Notes (SENs), which concisely preserve answer-critical information and explicitly mark key and uncertainty information to improve accuracy. We further design an entailment-based Evidence Quality Reward (EQR) to ensure that SENs are logically sufficient to derive the final answer, thereby enhancing SENs’ quality. Experiments on both in-domain and out-of-domain QA benchmarks show that EviNote-RAG achieves state-of-the-art performance, improving answer accuracy, training stability, robustness, and efficiency. In particular, it yields relative F1 gains of 20% on HotpotQA (+0.093), 40% on Bamboogle (+0.151), and 91% on 2Wiki (+0.256), benefiting from improvements in the reasoning process.

[160] Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin

Main category: cs.CL

TL;DR: Drivelology is identified as “nonsense with depth” - syntactically coherent but pragmatically paradoxical language that LLMs fail to understand despite excelling at other NLP tasks.

Details

Motivation: Current LLMs excel at surface-level NLP tasks but struggle with deeper linguistic phenomena that require contextual inference, moral reasoning, or emotional interpretation.

Method: Created a benchmark dataset of 1,200+ curated Drivelology examples across 6 languages, evaluated LLMs on classification, generation, and reasoning tasks using expert-reviewed examples.

Result: LLMs consistently fail to grasp Drivelology’s layered semantics, confusing it with shallow nonsense, producing incoherent justifications, and missing implied rhetorical functions.

Conclusion: There’s a deep representational gap in LLMs’ pragmatic understanding, challenging the assumption that statistical fluency implies cognitive comprehension.

Abstract: We introduce Drivelology, a unique linguistic phenomenon characterised as “nonsense with depth” - utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a benchmark dataset of over 1,200+ meticulously curated and diverse examples across English, Mandarin, Spanish, French, Japanese, and Korean. Each example underwent careful expert review to verify its Drivelological characteristics, involving multiple rounds of discussion and adjudication to address disagreements. Using this dataset, we evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss implied rhetorical functions altogether. These findings highlight a deep representational gap in LLMs’ pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.

[161] Cross-Question Method Reuse in Large Language Models: From Word-Level Prediction to Rational Logical-Layer Reasoning

Hong Su

Main category: cs.CL

TL;DR: This paper extends method reuse in LLMs beyond highly similar questions to handle low-similarity questions and hidden similarities, improving cross-question solution transfer.

Details

Motivation: Existing method reuse approaches require highly similar questions, limiting their applicability. The authors aim to extend method reuse to questions with low similarity or hidden similarities that are not explicitly observable.

Method: Separate questions and solutions rather than feeding pairs directly to LLMs. Guide LLMs to adapt solutions to new but related questions, focusing on solution transfer rather than question recognition. Extend to cases where questions share partial features or hidden characteristics.

Result: Experimental verification shows the scope-extension approach increases the probability of filtering out reusable solutions, improving cross-question method reuse effectiveness.

Conclusion: The proposed approach successfully extends method reuse beyond conventional similarity constraints, enabling more effective solution transfer across questions with low or hidden similarities.

Abstract: Large language models (LLMs) have been widely applied to assist in finding solutions for diverse questions. Prior work has proposed representing a method as a pair of a question and its corresponding solution, enabling method reuse. However, existing approaches typically require the questions to be highly similar. In this paper, we extend the scope of method reuse to address questions with low similarity or with hidden similarities that are not explicitly observable. For questions that are similar in a general-specific sense (i.e., broader or narrower in scope), we propose to first separate the question and solution, rather than directly feeding the pair to the LLM. The LLM is then guided to adapt the solution to new but related questions, allowing it to focus on solution transfer rather than question recognition. Furthermore, we extend this approach to cases where questions only share partial features or hidden characteristics. This enables cross-question method reuse beyond conventional similarity constraints. Experimental verification shows that our scope-extension approach increases the probability of filtering out reusable solutions, thereby improving the effectiveness of cross-question method reuse.

[162] Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong

Main category: cs.CL

TL;DR: A bilevel optimization method that combines SFT and RL to prevent catastrophic forgetting and improve efficiency in training reasoning models.

Details

Motivation: Traditional two-stage SFT+RL approach suffers from catastrophic forgetting where RL loses SFT-acquired behaviors and inefficiently explores new patterns.

Method: Bilevel optimization that conditions SFT objective on optimal RL policy, enabling SFT to meta-learn how to guide RL’s optimization. Lower level performs RL updates with SFT supervision, upper level maximizes cooperative gain.

Result: Outperforms baselines on five reasoning benchmarks and achieves better balance between effectiveness and efficiency.

Conclusion: The proposed bilevel optimization method enables better cooperation between SFT and RL training paradigms for reasoning models.

Abstract: Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach suffers from catastrophic forgetting: second-stage RL gradually loses SFT-acquired behaviors and inefficiently explores new patterns. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL’s optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain-the performance advantage of joint SFT-RL training over RL alone. Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.

[163] Preservation of Language Understanding Capabilities in Speech-aware Large Language Models

Marek Kubis, Paweł Skórzewski, Iwona Christop, Mateusz Czyżnikiewicz, Jakub Kubiak, Łukasz Bondaruk, Marcin Lewandowski

Main category: cs.CL

TL;DR: C3T is a new benchmark that evaluates speech-aware LLMs by testing if their language understanding capabilities are preserved when accessed via speech input, using textual tasks and voice cloning TTS.

Details

Motivation: To assess how well speech-aware large language models maintain their language understanding capabilities when accessed through speech input rather than text, and to measure fairness across different speaker categories.

Method: Uses textual tasks combined with a voice cloning text-to-speech model to create speech inputs, then compares model performance between text and speech modalities.

Result: The benchmark enables quantification of model fairness for different speaker categories and robustness across text and speech modalities.

Conclusion: C3T provides a systematic way to evaluate cross-modal capabilities conservation in speech-aware LLMs, addressing both fairness and robustness concerns.

Abstract: The paper presents C3T (Cross-modal Capabilities Conservation Test), a new benchmark for assessing the performance of speech-aware large language models. The benchmark utilizes textual tasks and a voice cloning text-to-speech model to quantify the extent to which language understanding capabilities are preserved when the model is accessed via speech input. C3T quantifies the fairness of the model for different categories of speakers and its robustness across text and speech modalities.

[164] Confidence Calibration in Large Language Model-Based Entity Matching

Iris Kamsteeg, Juan Cardenas-Cartagena, Floris van Beers, Gineke ten Holt, Tsegaye Misikir Tashu, Matias Valdenegro-Toro

Main category: cs.CL

TL;DR: Empirical study comparing confidence calibration methods (Temperature Scaling, Monte Carlo Dropout, Ensembles) for RoBERTa in Entity Matching tasks, showing Temperature Scaling reduces calibration error by up to 23.83%.

Details

Motivation: To explore the intersection of Large Language Models and confidence calibration in Entity Matching, addressing the issue of model overconfidence.

Method: Empirical study comparing baseline RoBERTa confidences against calibrated confidences using Temperature Scaling, Monte Carlo Dropout, and Ensembles on Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets.

Result: Modified RoBERTa model exhibits slight overconfidence (Expected Calibration Error: 0.0043-0.0552). Temperature Scaling reduces Expected Calibration Error by up to 23.83%.

Conclusion: Temperature Scaling effectively mitigates overconfidence in RoBERTa models for Entity Matching tasks, improving confidence calibration.

Abstract: This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.

[165] Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy

Nuwan I. Senaratna

Main category: cs.CL

TL;DR: A collection of 230,091 multilingual documents (57.7 GB) from Sri Lankan parliamentary, legal, government, news, and tourism sources, updated daily and available on GitHub/Hugging Face.

Details

Motivation: To provide open, machine-readable datasets to support research in computational linguistics, legal analytics, socio-political studies, and multilingual NLP for Sri Lankan languages.

Method: Created a data collection pipeline from various Sri Lankan sources, processed documents into machine-readable formats, and established daily updates with mirroring on GitHub and Hugging Face.

Result: Successfully compiled 24 datasets containing 230,091 documents across Sinhala, Tamil, and English languages, totaling 57.7 GB of data.

Conclusion: This resource collection enables research in multiple domains while addressing licensing and ethical considerations, with ongoing daily updates to maintain current data.

Abstract: We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. The collection currently comprises of 230,091 documents (57.7 GB) across 24 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, while discussing licensing and ethical considerations. This manuscript is at version v2025-10-16-0818.

[166] Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor

Main category: cs.CL

TL;DR: Inoculation prompting modifies finetuning data by prepending instructions that deliberately elicit undesirable traits, reducing their expression at test time without the instruction.

Details

Motivation: Language model finetuning often results in learning undesirable traits alongside desired ones, creating a need for selective learning techniques.

Method: Prepend short system-prompt instructions to finetuning data that deliberately elicit undesirable traits, then evaluate without these instructions at test time.

Result: Inoculated models show much lower expression of undesirable traits across multiple settings including reducing emergent misalignment, defending against backdoor injections, and mitigating trait transmission via subliminal learning.

Conclusion: Inoculation prompting is an effective technique for selective learning that reduces generalization of undesirable traits by making them less surprising, contributing to better understanding of how language models generalize.

Abstract: Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.’’) teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.

[167] Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh

Main category: cs.CL

TL;DR: This survey provides the first comprehensive analysis of code-switching (CSW) in large language models (LLMs), reviewing 308 studies across multiple research areas, NLP tasks, datasets, and languages to address challenges in multilingual NLP.

Details

Motivation: Code-switching remains a fundamental challenge for multilingual NLP despite advances in LLMs, with most models struggling with mixed-language inputs, limited datasets, and evaluation biases that hinder deployment in multilingual societies.

Method: Comprehensive analysis of 308 studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages, classifying advances by architecture, training strategy, and evaluation methodology.

Result: The survey outlines how LLMs have reshaped CSW modeling while identifying persistent challenges in the field.

Conclusion: A roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence, with curated resources maintained at a GitHub repository.

Abstract: Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, hindering deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing 308 studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.

[168] Comparing Human and Language Models Sentence Processing Difficulties on Complex Structures

Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant

Main category: cs.CL

TL;DR: LLMs struggle with challenging linguistic structures, especially garden path sentences, showing both convergence and divergence from human sentence comprehension patterns.

Details

Motivation: To systematically compare human and LLM sentence comprehension across challenging linguistic structures and understand if LLMs experience human-like processing difficulties.

Method: Collected sentence comprehension data from humans and five families of state-of-the-art LLMs in a unified experimental framework, testing seven challenging linguistic structures and their matched baselines.

Result: LLMs overall struggle on target structures, especially garden path sentences (46.8% accuracy for GPT-5 vs 93.7% on non-GP structures). Rank correlation between humans and models increases with parameter count. Performance gap between target and baseline sentences holds for LLMs except for very weak or very strong models.

Conclusion: The study reveals both convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity between humans and LLMs.

Abstract: Large language models (LLMs) that fluently converse with humans are a reality

but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure in a unified experimental framework. Our results show LLMs overall struggle on the target structures, but especially on garden path (GP) sentences. Indeed, while the strongest models achieve near perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures based on average performance, rank correlation between humans and models increases with parameter count. For each target structure, we also collect data for their matched baseline without the difficult structure. Comparing performance on the target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak performance is uniformly low across both sentence types, and for models that are too strong the performance is uniformly high. Together, these reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.

Kaiqi Yang, Hang Li, Yucheng Chu, Zitao Liu, Mi Tian, Hui Liu

Main category: cs.CL

TL;DR: The paper proposes an iterative LLM-based framework to generate math word problems with distracting conditions that don’t alter the original solutions, addressing limitations in existing datasets.

Details

Motivation: Existing MWP datasets lack realistic distracting conditions, making benchmarks unreliable. Current datasets have low difficulty and out-of-context expressions, and manually adding distracting conditions requires intensive labor to verify solutions.

Method: An iterative framework using LLMs with specialized prompts to revise MWPs from different perspectives and cognitive levels, generating distracting conditions while preserving original solutions.

Result: The framework efficiently generates high-quality MWPs with distracting conditions without needing to create new solutions, reducing generation overhead while maintaining data quality.

Conclusion: The proposed LLM-based framework provides an effective solution for creating realistic MWP datasets with distracting conditions, improving benchmarking credibility for mathematical reasoning evaluation.

Abstract: Mathematical reasoning serves as a crucial testbed for the intelligence of large language models (LLMs), and math word problems (MWPs) are a popular type of math problems. Most MWP datasets consist of problems containing only the necessary information, while problems with distracting and excessive conditions are often overlooked. Prior works have tested popular LLMs and found a dramatic performance drop in the presence of distracting conditions. However, datasets of MWPs with distracting conditions are limited, and most suffer from lower levels of difficulty and out-of-context expressions. This makes distracting conditions easy to identify and exclude, thus reducing the credibility of benchmarking on them. Moreover, when adding distracting conditions, the reasoning and answers may also change, requiring intensive labor to check and write the solutions. To address these issues, we design an iterative framework to generate distracting conditions using LLMs. We develop a set of prompts to revise MWPs from different perspectives and cognitive levels, encouraging the generation of distracting conditions as well as suggestions for further revision. Another advantage is the shared solutions between original and revised problems: we explicitly guide the LLMs to generate distracting conditions that do not alter the original solutions, thus avoiding the need to generate new solutions. This framework is efficient and easy to deploy, reducing the overhead of generating MWPs with distracting conditions while maintaining data quality.

[170] Higher-order interactions of multi-layer prompt

Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang, Weigang Lu

Main category: cs.CL

TL;DR: Proposes a framework modeling higher-order interactions between multi-layer prompts to enhance prompt-tuning by capturing synergistic relationships across network layers.

Details

Motivation: Current prompt-tuning methods treat prompts as isolated components across layers, overlooking complex higher-order interactions that limit expressive power and semantic richness.

Method: Designs an interaction module to capture sophisticated non-linear correlations among multi-layer prompts, modeling their cooperative effects and enabling dynamic aggregation of prompt information across network depth.

Result: Extensive experiments on eight benchmark datasets show consistent superiority over state-of-the-art prompt-tuning baselines, with particularly pronounced advantages in few-shot scenarios.

Conclusion: Capturing intricate interplay between multi-layer prompts is key to unlocking more robust and generalizable representation learning, demonstrating the importance of modeling higher-order prompt interactions.

Abstract: The “pre-train, prompt” paradigm has successfully evolved in representation learning. While current prompt-tuning methods often introduce learnable prompts, they predominantly treat prompts as isolated, independent components across different network layers. This overlooks the complex and synergistic higher-order interactions that exist between prompts at various hierarchical depths, consequently limiting the expressive power and semantic richness of the prompted model. To address this fundamental gap, we propose a novel framework that explicitly models the Higher-order Interactions of Multi-layer Prompt. Our approach conceptualizes prompts from different layers not as separate entities, but as a cohesive system where their inter-relationships are critical. We design an innovative interaction module that captures these sophisticated, non-linear correlations among multi-layer prompts, effectively modeling their cooperative effects. This allows the model to dynamically aggregate and refine prompt information across the network’s depth, leading to a more integrated and powerful prompting strategy. Extensive experiments on eight benchmark datasets demonstrate that our method, by leveraging these higher-order interactions, consistently surpasses state-of-the-art prompt-tuning baselines. The performance advantage is particularly pronounced in few-shot scenarios, validating that capturing the intricate interplay between multi-layer prompts is key to unlocking more robust and generalizable representation learning.

[171] All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language

Shiyuan Guo, Henry Sleight, Fabien Roger

Main category: cs.CL

TL;DR: Testing AI models’ ability to reason in 28 different ciphers reveals an asymmetry: models can translate ciphered text but struggle to reason in it, especially with less common ciphers, posing risks for evading chain-of-thought monitoring.

Details

Motivation: To assess the risk of attackers or misaligned AI models evading chain-of-thought monitoring through ciphered reasoning (reasoning hidden in encrypted, translated, or compressed text).

Method: Tested 28 different ciphers by fine-tuning and prompting up to 10 models to reason in each cipher, using math problems as a proxy for reasoning ability and measuring model accuracy.

Result: Found an asymmetry where model accuracy drops significantly when reasoning in ciphered text despite accurate translation capability. Frontier models struggle with lesser-known ciphers but handle well-known ones like rot13. Ciphered reasoning capability correlates with cipher prevalence in pretraining data and improves slowly with additional fine-tuning.

Conclusion: Evading CoT monitoring using ciphered reasoning may be ineffective for current models, but provides guidance for constraining this capability in future frontier models.

Abstract: Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.

[172] Are Large Reasoning Models Interruptible?

Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

Main category: cs.CL

TL;DR: This paper challenges the ‘frozen world’ assumption in Large Reasoning Models evaluation by testing robustness under dynamic scenarios like interruptions and changing context, revealing significant performance drops and novel failure modes.

Details

Motivation: Traditional LRM evaluations assume static, 'frozen world' settings where responses are instantaneous and context is immutable. This breaks down in real-world scenarios like assistive programming where models take hours to think and context changes during reasoning.

Method: Evaluate LRM robustness under two dynamic scenarios: 1) interruptions to test partial outputs on limited budget, and 2) dynamic context to test adaptation to in-flight changes. Tests conducted across mathematics and programming benchmarks requiring long-form reasoning.

Result: Static evaluations consistently overestimate robustness. State-of-the-art LRMs show performance drops up to 60% when interrupted or exposed to changing context, especially when updates occur late in reasoning. Novel failure modes identified: reasoning leakage, panic, and self-doubt.

Conclusion: The frozen world assumption is inadequate for evaluating modern reasoning tasks. LRMs exhibit significant vulnerabilities to dynamic conditions, highlighting the need for more realistic evaluation frameworks that account for interruptions and changing contexts.

Abstract: Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, “frozen world” settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the “frozen world” assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model’s final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model’s partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information. Project Page: http://dynamic-lm.github.io/

[173] HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment

Ali Mekky, Omar El Herraoui, Preslav Nakov, Yuxia Wang

Main category: cs.CL

TL;DR: HALF is a harm-aware LLM fairness framework that evaluates bias in realistic applications by weighting outcomes based on harm severity across different domains.

Details

Motivation: Existing LLM fairness evaluations lack grounding in real-world scenarios and don't account for differences in harm severity across domains like clinical decision support vs. text summarization.

Method: HALF organizes nine application domains into three severity tiers (Severe, Moderate, Mild) using a five-stage pipeline that assesses model bias in realistic applications and weighs outcomes by harm severity.

Result: Evaluation across eight LLMs shows: (1) inconsistent fairness across domains, (2) model size/performance doesn’t guarantee fairness, (3) reasoning models perform better in medical decision support but worse in education.

Conclusion: HALF exposes a clear gap between previous benchmarking success and deployment readiness, highlighting the need for harm-aware fairness evaluation.

Abstract: Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity, e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) using a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) model size or performance do not guarantee fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.

[174] A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning

Qianben Chen, Jingyi Cao, Jiayu Zhang, Tianrui Qin, Xiaowan Li, King Zhu, Dingfeng Shi, He Zhu, Minghao Liu, Xiaobo Liang, Xin Gui, Ge Zhang, Jian Yang, Yuchen Eleanor Jiang, Wangchunshu Zhou

Main category: cs.CL

TL;DR: A^2FM is a unified framework that combines reasoning and agentic capabilities in LLMs through task-aware routing and three modes (reasoning, agentic, instant) to improve both accuracy and cost efficiency.

Details

Motivation: Current LLMs are divided into reasoning-centric models (good at internal reasoning but no tools) and agentic models (can use tools but weak in reasoning), creating inefficiencies where both types overthink or overuse tools on simple queries.

Method: Route-then-align principle: learn task-aware routing first, then align mode-specific trajectories under shared backbone. Introduces instant mode for simple queries, and Adaptive Policy Optimization (APO) for adaptive sampling across modes with cost-regularized reward.

Result: On 32B scale: 13.4% on BrowseComp, 70.4% on AIME25, 16.7% on HLE - new SOTA among comparable models. Adaptive execution achieves $0.00487 per correct answer, cutting costs by 45.2% vs reasoning and 33.5% vs agentic models.

Conclusion: A^2FM successfully unifies reasoning and agentic capabilities while significantly improving cost efficiency through adaptive mode selection, maintaining competitive accuracy across diverse benchmarks.

Abstract: Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode-instant-that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only $0.00487 per correct answer-cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.

[175] CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning

Kehua Feng, Keyan Ding, Zhihui Zhu, Lei Liang, Qiang Zhang, Huajun Chen

Main category: cs.CL

TL;DR: CoT-Evo is an evolutionary framework that refines flawed reasoning from multiple LLMs through iterative selection, recombination, and mutation to create high-quality scientific reasoning data for distilling into smaller models.

Details

Motivation: Standard CoT distillation fails in scientific domains because advanced LLMs often produce incorrect or superficial reasoning due to high complexity and specialized knowledge requirements, leading to poor training data for student models.

Method: Constructs diverse reasoning trajectories from multiple LLMs, enriches them with retrieved domain knowledge, then iteratively refines using novelty-driven selection, reflective recombination and mutation guided by a fitness function evaluating correctness, coherence, and knowledge utilization.

Result: The evolved CoT dataset enables fine-tuning of compact models that achieve state-of-the-art performance on scientific reasoning benchmarks.

Conclusion: Establishes a scalable approach to synthesize high-fidelity scientific reasoning data from diverse and fallible LLMs, overcoming limitations of direct CoT distillation in complex domains.

Abstract: While chain-of-thought (CoT) distillation from advanced large language models (LLMs) has proven effective in general reasoning tasks, it struggles in scientific domains where even advanced models often produce incorrect or superficial reasoning due to high complexity and specialized knowledge requirements. Directly distilling from such flawed outputs results in low-quality training data and limits the performance of smaller student models. To overcome this, we propose CoT-Evo, an evolutionary CoT distillation framework. It begins by constructing a diverse pool of reasoning trajectories from multiple LLM thinkers, enriches them with automatically retrieved domain knowledge, and iteratively refines the trajectories using novelty-driven selection, reflective recombination and mutation. The refinement is guided by a fitness function that evaluates answer correctness, coherence, and effective knowledge utilization. This results in a high-quality CoT dataset tailored for scientific reasoning. We employ this evolved dataset to fine-tune a compact model, which achieves state-of-the-art performance on scientific reasoning benchmarks. Our work establishes a scalable approach to synthesizing high-fidelity scientific reasoning data from diverse and fallible LLMs.

[176] Make an Offer They Can’t Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment

Buwei He, Yang Liu, Zhaowei Zhang, Zixia Jia, Huijia Wu, Zhaofeng He, Zilong Zheng, Yipeng Kang

Main category: cs.CL

TL;DR: The paper proposes a Bayesian Persuasion framework for LLMs that uses commitment-communication mechanisms to enhance strategic persuasion capabilities in single-turn dialogues.

Details

Motivation: Current AI systems struggle with strategic persuasion, overlooking information asymmetry and relying on strong pre-commitment assumptions. The work aims to improve LLMs' persuasion capabilities using Bayesian Persuasion principles.

Method: Developed two variants: Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP, incorporating commitment-communication mechanisms where persuaders explicitly outline information schemas to guide Bayesian belief updates.

Result: BP-guided LLMs consistently achieved higher persuasion success rates than non-BP baselines. SFNL showed better credibility and logical coherence, while FNL demonstrated stronger emotional resonance and robustness. Fine-tuned smaller models achieved BP performance comparable to larger models.

Conclusion: Bayesian Persuasion framework effectively enhances LLMs’ strategic persuasion capabilities, with different variants offering complementary strengths, and demonstrates that smaller models can achieve competitive performance with proper fine-tuning.

Abstract: Persuasion, a fundamental social capability for humans, remains a challenge for AI systems such as large language models (LLMs). Current studies often overlook the strategic use of information asymmetry in message design or rely on strong assumptions regarding pre-commitment. In this work, we explore the application of Bayesian Persuasion (BP) in natural language within single-turn dialogue settings, to enhance the strategic persuasion capabilities of LLMs. Our framework incorporates a commitment-communication mechanism, where the persuader explicitly outlines an information schema by narrating their potential types (e.g., honest or dishonest), thereby guiding the persuadee in performing the intended Bayesian belief update. We evaluate two variants of our approach: Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP, benchmarking them against both naive and strong non-BP (NBP) baselines within a comprehensive evaluation framework. This framework covers a diverse set of persuadees – including LLM instances with varying prompts and fine-tuning and human participants – across tasks ranging from specially designed persuasion scenarios to general everyday situations. Experimental results on LLM-based agents reveal three main findings: (1) LLMs guided by BP strategies consistently achieve higher persuasion success rates than NBP baselines; (2) SFNL exhibits greater credibility and logical coherence, while FNL shows stronger emotional resonance and robustness in naturalistic conversations; (3) with supervised fine-tuning, smaller models can attain BP performance comparable to that of larger models.

[177] Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

Ahmed Alzubaidi, Shaikha Alsuwaidi, Basma El Amel Boussaha, Leen AlQadi, Omar Alkaabi, Mohammed Alyafeai, Hamza Alobeidli, Hakim Hacid

Main category: cs.CL

TL;DR: First systematic review of 40+ Arabic LLM benchmarks, proposing a taxonomy with four categories and identifying critical gaps in evaluation methodologies.

Details

Motivation: To provide a comprehensive reference for Arabic NLP researchers by systematically reviewing existing benchmarks and identifying gaps in Arabic LLM evaluation.

Method: Analyzed 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. Proposed a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations.

Result: Revealed significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. Examined three primary approaches: native collection, translation, and synthetic generation.

Conclusion: This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics while offering recommendations for future development.

Abstract: This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches: native collection, translation, and synthetic generation discussing their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics while offering recommendations for future development.

[178] Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation

Zhiqi Huang, Vivek Datla, Chenyang Zhu, Alfy Samuel, Daben Liu, Anoop Kumar, Ritesh Soni

Main category: cs.CL

TL;DR: A method for confidence estimation in RAG systems using raw FFN activations as auto-regressive signals, achieving high accuracy in financial applications with reduced latency.

Details

Motivation: Confidence estimation is critical in high-stakes domains like finance and healthcare where incorrect answers have severe consequences, and existing methods suffer from information loss in token logits and probabilities.

Method: Extends uncertainty quantification by leveraging raw FFN activations, models confidence prediction as sequence classification, and uses Huber loss regularization for robustness against noisy supervision.

Result: Outperforms strong baselines in real-world financial customer-support settings, maintains high accuracy under strict latency constraints, and preserves accuracy using activations from only the 16th layer of Llama 3.1 8B model.

Conclusion: Activation-based confidence modeling provides a scalable, architecture-aware approach for trustworthy RAG deployment in high-stakes applications.

Abstract: We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Confidence estimation is especially critical in high-stakes domains such as finance and healthcare, where the cost of an incorrect answer outweighs that of not answering the question. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals, avoiding the information loss inherent in token logits and probabilities after projection and softmax normalization. We model confidence prediction as a sequence classification task, and regularize training with a Huber loss term to improve robustness against noisy supervision. Applied in a real-world financial industry customer-support setting with complex knowledge bases, our method outperforms strong baselines and maintains high accuracy under strict latency constraints. Experiments on Llama 3.1 8B model show that using activations from only the 16th layer preserves accuracy while reducing response latency. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.

[179] The Mechanistic Emergence of Symbol Grounding in Language Models

Shuyu Wu, Ziqiao Ma, Xiaoxi Luo, Yidong Huang, Josue Torres-Fonseca, Freda Shi, Joyce Chai

Main category: cs.CL

TL;DR: Symbol grounding emerges in language models through middle-layer computations where attention heads aggregate environmental context to support linguistic predictions, replicating across architectures but not in unidirectional LSTMs.

Details

Motivation: To understand how symbol grounding emerges in language models without explicit grounding objectives and identify the specific mechanisms and loci of this emergence.

Method: Introduced a controlled evaluation framework using mechanistic and causal analysis to trace symbol grounding within internal computations, testing across different architectures including Transformers, state-space models, and LSTMs.

Result: Grounding concentrates in middle-layer computations and is implemented through attention heads aggregating environmental context to support linguistic predictions. This phenomenon replicates in multimodal dialogue and across Transformers and state-space models, but not in unidirectional LSTMs.

Conclusion: Symbol grounding can emerge in language models through specific computational mechanisms, providing behavioral and mechanistic evidence with practical implications for predicting and controlling generation reliability.

Abstract: Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.

cs.CV

[180] MultiFoodhat: A potential new paradigm for intelligent food quality inspection

Yue Hu, Guohang Zhuang

Main category: cs.CV

TL;DR: MultiFoodChat is a dialogue-driven multi-agent framework for zero-shot food recognition that combines vision-language models and large language models to achieve superior accuracy without requiring labeled training data.

Details

Motivation: Existing supervised food classification models require large labeled datasets and have poor generalization to unseen food categories, limiting their practical application in food quality inspection and dietary assessment.

Method: Uses a multi-agent reasoning framework with Object Perception Token (OPT) for fine-grained visual attribute extraction and Interactive Reasoning Agent (IRA) for dynamic contextual interpretation through visual-textual dialogues, without requiring additional training.

Result: Achieves superior recognition accuracy and interpretability on multiple public food datasets compared to existing unsupervised and few-shot methods.

Conclusion: MultiFoodChat represents a promising new paradigm for intelligent food quality inspection and analysis that enables flexible, human-like understanding of complex food scenes without manual annotations.

Abstract: Food image classification plays a vital role in intelligent food quality inspection, dietary assessment, and automated monitoring. However, most existing supervised models rely heavily on large labeled datasets and exhibit limited generalization to unseen food categories. To overcome these challenges, this study introduces MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition. The framework integrates vision-language models (VLMs) and large language models (LLMs) to enable collaborative reasoning through multi-round visual-textual dialogues. An Object Perception Token (OPT) captures fine-grained visual attributes, while an Interactive Reasoning Agent (IRA) dynamically interprets contextual cues to refine predictions. This multi-agent design allows flexible and human-like understanding of complex food scenes without additional training or manual annotations. Experiments on multiple public food datasets demonstrate that MultiFoodChat achieves superior recognition accuracy and interpretability compared with existing unsupervised and few-shot methods, highlighting its potential as a new paradigm for intelligent food quality inspection and analysis.

[181] Post-surgical Endometriosis Segmentation in Laparoscopic Videos

Andreas Leibetseder, Klaus Schoeffmann, Jörg Keckstein, Simon Keckstein

Main category: cs.CV

TL;DR: A system for segmenting dark endometrial implants in laparoscopic surgery videos to assist gynecologic physicians in identifying endometriosis.

Details

Motivation: Endometriosis is difficult to identify due to its varied visual appearance, making assistance valuable for non-specialized medical practitioners.

Method: Training a system to analyze laparoscopic surgery videos and segment dark endometrial implants with multi-colored overlays.

Result: The system can identify implant regions and provide a detection summary for improved video browsing.

Conclusion: The developed system offers valuable assistance to gynecologic physicians in detecting endometriosis during laparoscopic procedures.

Abstract: Endometriosis is a common women’s condition exhibiting a manifold visual appearance in various body-internal locations. Having such properties makes its identification very difficult and error-prone, at least for laymen and non-specialized medical practitioners. In an attempt to provide assistance to gynecologic physicians treating endometriosis, this demo paper describes a system that is trained to segment one frequently occurring visual appearance of endometriosis, namely dark endometrial implants. The system is capable of analyzing laparoscopic surgery videos, annotating identified implant regions with multi-colored overlays and displaying a detection summary for improved video browsing.

[182] Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models

Jia Yun Chua, Argyrios Zolotas, Miguel Arana-Catania

Main category: cs.CV

TL;DR: This paper combines traditional vision models (YOLO) with Vision Language Models (LLaVA, ChatGPT, Gemini) to improve remote sensing image analysis, particularly for aircraft detection and scene understanding in challenging conditions.

Details

Motivation: Remote sensing faces challenges with limited labeled data and poor contextual understanding in complex environments. Vision Language Models offer potential but remain underexplored in remote sensing applications.

Method: Integration of YOLO with VLMs (LLaVA, ChatGPT, Gemini) for enhanced image interpretation. Evaluation on both labeled and unlabeled remote sensing data, including degraded image scenarios.

Result: 48.46% average MAE improvement in aircraft detection accuracy across models, especially in challenging conditions. 6.17% improvement in CLIPScore for comprehensive scene understanding.

Conclusion: The combination of traditional vision models and VLMs enables more advanced and efficient remote sensing image analysis, particularly beneficial for few-shot learning scenarios.

Abstract: Remote sensing has become a vital tool across sectors such as urban planning, environmental monitoring, and disaster response. While the volume of data generated has increased significantly, traditional vision models are often constrained by the requirement for extensive domain-specific labelled data and their limited ability to understand the context within complex environments. Vision Language Models offer a complementary approach by integrating visual and textual data; however, their application to remote sensing remains underexplored, particularly given their generalist nature. This work investigates the combination of vision models and VLMs to enhance image analysis in remote sensing, with a focus on aircraft detection and scene understanding. The integration of YOLO with VLMs such as LLaVA, ChatGPT, and Gemini aims to achieve more accurate and contextually aware image interpretation. Performance is evaluated on both labelled and unlabelled remote sensing data, as well as degraded image scenarios which are crucial for remote sensing. The findings show an average MAE improvement of 48.46% across models in the accuracy of aircraft detection and counting, especially in challenging conditions, in both raw and degraded scenarios. A 6.17% improvement in CLIPScore for comprehensive understanding of remote sensing images is obtained. The proposed approach combining traditional vision models and VLMs paves the way for more advanced and efficient remote sensing image analysis, especially in few-shot learning scenarios.

[183] Joint Modeling of Big Five and HEXACO for Multimodal Apparent Personality-trait Recognition

Ryo Masumura, Shota Orihashi, Mana Ihori, Tomohiro Tanaka, Naoki Makishima, Taiga Yamane, Naotaka Kawata, Satoshi Suzuki, Taichi Katayama

Main category: cs.CV

TL;DR: Proposes a joint modeling method for recognizing both Big Five and HEXACO personality traits from multimodal human behavior, addressing gaps in previous research that focused only on Big Five traits.

Details

Motivation: Previous studies only used Big Five for personality recognition, missing HEXACO traits like Honesty-Humility which are important for evaluating aggression and social dominance. The relationships between Big Five and HEXACO in machine learning models are unclear.

Method: Joint optimization method for simultaneously recognizing both Big Five and HEXACO personality traits from multimodal human behavior data.

Result: Experiments using self-introduction video dataset demonstrate effective recognition of both Big Five and HEXACO traits.

Conclusion: The proposed joint modeling approach successfully recognizes both personality trait models and improves awareness of multimodal human behavior by considering their relationships.

Abstract: This paper proposes a joint modeling method of the Big Five, which has long been studied, and HEXACO, which has recently attracted attention in psychology, for automatically recognizing apparent personality traits from multimodal human behavior. Most previous studies have used the Big Five for multimodal apparent personality-trait recognition. However, no study has focused on apparent HEXACO which can evaluate an Honesty-Humility trait related to displaced aggression and vengefulness, social-dominance orientation, etc. In addition, the relationships between the Big Five and HEXACO when modeled by machine learning have not been clarified. We expect awareness of multimodal human behavior to improve by considering these relationships. The key advance of our proposed method is to optimize jointly recognizing the Big Five and HEXACO. Experiments using a self-introduction video dataset demonstrate that the proposed method can effectively recognize the Big Five and HEXACO.

[184] Finding Holes: Pathologist Level Performance Using AI for Cribriform Morphology Detection in Prostate Cancer

Kelvin Szolnoky, Anders Blilie, Nita Mulliqi, Toyonori Tsuzuki, Hemamali Samaratunga, Matteo Titus, Xiaoyi Ji, Sol Erika Boman, Einar Gudlaugsson, Svein Reidar Kjosavik, José Asenjo, Marcello Gambacorta, Paolo Libretti, Marcin Braun, Radisław Kordek, Roman Łowicki, Brett Delahunt, Kenneth A. Iczkowski, Theo van der Kwast, Geert J. L. H. van Leenders, Katia R. M. Leite, Chin-Chen Pan, Emiel Adrianus Maria Janssen, Martin Eklund, Lars Egevad, Kimmo Kartasalo

Main category: cs.CV

TL;DR: AI model using EfficientNetV2-S with multiple instance learning achieves pathologist-level performance in detecting cribriform morphology in prostate cancer biopsies, outperforming expert uropathologists in inter-rater agreement.

Details

Motivation: Cribriform morphology in prostate cancer indicates poor prognosis but is underreported and suffers from significant interobserver variability among pathologists, necessitating improved detection methods.

Method: Deep learning model with EfficientNetV2-S encoder and multiple instance learning trained on 640 prostate biopsies from 430 patients across three cohorts, validated internally and externally, and compared against nine expert uropathologists.

Result: Model showed strong internal validation (AUC: 0.97, kappa: 0.81) and robust external validation (AUC: 0.90, kappa: 0.55), achieving highest average agreement (kappa: 0.66) in inter-rater analysis, outperforming all nine pathologists (kappa range: 0.35-0.62).

Conclusion: The AI model demonstrates pathologist-level performance for cribriform morphology detection, potentially enhancing diagnostic reliability, standardizing reporting, and improving prostate cancer treatment decisions.

Abstract: Background: Cribriform morphology in prostate cancer is a histological feature that indicates poor prognosis and contraindicates active surveillance. However, it remains underreported and subject to significant interobserver variability amongst pathologists. We aimed to develop and validate an AI-based system to improve cribriform pattern detection. Methods: We created a deep learning model using an EfficientNetV2-S encoder with multiple instance learning for end-to-end whole-slide classification. The model was trained on 640 digitised prostate core needle biopsies from 430 patients, collected across three cohorts. It was validated internally (261 slides from 171 patients) and externally (266 slides, 104 patients from three independent cohorts). Internal validation cohorts included laboratories or scanners from the development set, while external cohorts used completely independent instruments and laboratories. Annotations were provided by three expert uropathologists with known high concordance. Additionally, we conducted an inter-rater analysis and compared the model’s performance against nine expert uropathologists on 88 slides from the internal validation cohort. Results: The model showed strong internal validation performance (AUC: 0.97, 95% CI: 0.95-0.99; Cohen’s kappa: 0.81, 95% CI: 0.72-0.89) and robust external validation (AUC: 0.90, 95% CI: 0.86-0.93; Cohen’s kappa: 0.55, 95% CI: 0.45-0.64). In our inter-rater analysis, the model achieved the highest average agreement (Cohen’s kappa: 0.66, 95% CI: 0.57-0.74), outperforming all nine pathologists whose Cohen’s kappas ranged from 0.35 to 0.62. Conclusion: Our AI model demonstrates pathologist-level performance for cribriform morphology detection in prostate cancer. This approach could enhance diagnostic reliability, standardise reporting, and improve treatment decisions for prostate cancer patients.

[185] NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

Junjie Nan, Jianing Li, Wei Chen, Mingkun Zhang, Xueqi Cheng

Main category: cs.CV

TL;DR: NAPPure is an adversarial purification framework that handles non-additive perturbations like blur, occlusion, and distortion, unlike existing methods designed only for additive perturbations.

Details

Motivation: Existing adversarial purification methods are ineffective against non-additive perturbations (blur, occlusion, distortion) since they're designed for additive perturbations, but real-world adversarial attacks often involve non-additive perturbations.

Method: Propose NAPPure framework that establishes the generation process of adversarial images and disentangles clean images from perturbation parameters through likelihood maximization.

Result: Experiments on GTSRB and CIFAR-10 datasets show NAPPure significantly boosts robustness of image classification models against non-additive perturbations.

Conclusion: NAPPure effectively extends adversarial purification to handle non-additive perturbations, providing better protection against real-world adversarial attacks.

Abstract: Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.

[186] Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio

Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro

Main category: cs.CV

TL;DR: A unified framework for processing sign language, lip movements, and audio to generate spoken-language text, achieving state-of-the-art performance across multiple speech recognition tasks.

Details

Motivation: Audio-centric ASR systems exclude deaf/hard-of-hearing individuals, and existing visual alternatives (sign language, lip reading) have been studied in isolation without integration.

Method: Design a unified, modality-agnostic architecture that can process heterogeneous inputs (sign language, lip movements, audio) and explore modality synergies, particularly lip movements as non-manual cues in sign language.

Result: Achieved performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Lip movement modeling significantly improved SLT performance.

Conclusion: Explicitly modeling lip movements as a distinct modality captures critical non-manual cues and enhances sign language translation, demonstrating the value of unified multimodal frameworks.

Abstract: Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such audio-centric systems inherently exclude individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Furthermore, our analysis reveals a key linguistic insight: explicitly modeling lip movements as a distinct modality significantly improves SLT performance by capturing critical non-manual cues.

[187] Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding

Xiaoqian Shen, Wenxuan Zhang, Jun Chen, Mohamed Elhoseiny

Main category: cs.CV

TL;DR: Vgent is a graph-based retrieval-reasoning-augmented generation framework that enhances large video language models for long video understanding by preserving temporal dependencies and reducing retrieval noise through structured verification.

Details

Motivation: Long videos pose challenges for LVLMs due to intensive video tokens beyond context windows and difficulty retaining long-term sequential information. Existing RAG methods for videos face issues with disrupted temporal dependencies and irrelevant information.

Method: Proposes Vgent with two innovations: (1) represents videos as structured graphs preserving semantic relationships across clips, (2) introduces intermediate reasoning step with structured verification to reduce noise and aggregate relevant information across clips.

Result: Achieved 3.0%-5.4% improvement over base models on MLVU benchmark and outperformed state-of-the-art video RAG methods by 8.6% across three long-video understanding benchmarks.

Conclusion: Vgent effectively addresses long video understanding challenges by combining graph-based retrieval with structured reasoning, significantly improving LVLM performance on long video tasks.

Abstract: Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of $3.0%\sim 5.4%$ over base models on MLVU, and outperformed state-of-the-art video RAG methods by $8.6%$. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.

[188] Synchronization of Multiple Videos

Avihai Naaman, Ron Shapira Weber, Oren Freifeld

Main category: cs.CV

TL;DR: TPL is a prototype-based framework that creates compact 1D representations from video embeddings to synchronize videos from different scenes or generative AI videos, addressing complex temporal misalignment without exhaustive pairwise matching.

Details

Motivation: Traditional video synchronization methods work well for videos from the same scene but fail for videos from different scenes or generative AI videos due to diverse subjects, backgrounds, and nonlinear temporal misalignment.

Method: Temporal Prototype Learning (TPL) constructs shared, compact 1D representations from high-dimensional embeddings using pretrained models. It learns a unified prototype sequence that anchors key action phases to align videos robustly.

Result: TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets. It successfully addresses synchronization in multiple generative AI videos depicting the same action, outperforming previous approaches.

Conclusion: TPL provides an effective solution for synchronizing videos from different scenes and generative AI videos, demonstrating superior performance in fine-grained frame retrieval and phase classification tasks.

Abstract: Synchronizing videos captured simultaneously from multiple cameras in the same scene is often easy and typically requires only simple time shifts. However, synchronizing videos from different scenes or, more recently, generative AI videos, poses a far more complex challenge due to diverse subjects, backgrounds, and nonlinear temporal misalignment. We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. TPL robustly aligns videos by learning a unified prototype sequence that anchors key action phases, thereby avoiding exhaustive pairwise matching. Our experiments show that TPL improves synchronization accuracy, efficiency, and robustness across diverse datasets, including fine-grained frame retrieval and phase classification tasks. Importantly, TPL is the first approach to mitigate synchronization issues in multiple generative AI videos depicting the same action. Our code and a new multiple video synchronization dataset are available at https://bgu-cs-vil.github.io/TPL/

[189] SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions

Cristian Sbrolli, Matteo Matteucci

Main category: cs.CV

TL;DR: SceneForge enhances 3D-text contrastive learning by creating multi-object scenes with spatial relations and pairing them with LLM-refined descriptions, addressing dataset scarcity and improving performance across various 3D tasks.

Details

Motivation: Address the scarcity of large-scale 3D-text datasets and enhance contrastive alignment between 3D point clouds and text through structured compositional scenes.

Method: Leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairs them with coherent multi-object descriptions refined by LLM, and augments contrastive training with structured compositional samples.

Result: Substantial performance gains across zero-shot classification on ModelNet, ScanObjNN, Objaverse-LVIS, ScanNet; few-shot part segmentation on ShapeNetPart; improved 3D visual question answering on ScanQA; robust generalization to retrieval with increasing complexity; and spatial reasoning capabilities.

Conclusion: SceneForge’s compositional augmentations are model-agnostic and consistently improve performance across multiple encoder architectures, demonstrating that structured multi-object scene compositions significantly enhance 3D-text contrastive learning.

Abstract: The whole is greater than the sum of its parts-even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge’s compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.

[190] Capture, Canonicalize, Splat: Zero-Shot 3D Gaussian Avatars from Unstructured Phone Images

Emanuel Garbin, Guy Adam, Oded Krams, Zohar Barzelay, Eran Guendelman, Michael Schwarz, Moran Vatelmacher, Yigal Shenkman, Eli Peker, Itai Druker, Uri Patish, Yoav Blum, Max Bluvstein, Junxuan Li, Rawal Khirodkar, Shunsuke Saito

Main category: cs.CV

TL;DR: A zero-shot pipeline for creating hyperrealistic 3D avatars from unstructured phone images using generative canonicalization and transformer-based models trained on high-fidelity Gaussian splatting avatars.

Details

Motivation: Existing methods have limitations: single-view approaches suffer from geometric inconsistencies and hallucinations, while synthetic data-trained models fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism.

Method: Two key contributions: (1) generative canonicalization module that processes multiple unstructured views into standardized representation, and (2) transformer-based model trained on large-scale dataset of high-fidelity Gaussian splatting avatars from dome captures of real people.

Result: Produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.

Conclusion: The “Capture, Canonicalize, Splat” pipeline successfully creates hyperrealistic, identity-preserving 3D avatars from few unstructured phone images.

Abstract: We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This “Capture, Canonicalize, Splat” pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.

[191] cubic: CUDA-accelerated 3D Bioimage Computing

Alexandr A. Kalinin, Anne E. Carpenter, Shantanu Singh, Matthew J. O’Meara

Main category: cs.CV

TL;DR: cubic is a Python library that provides GPU-accelerated bioimage analysis by extending SciPy and scikit-image APIs with CuPy and RAPIDS cuCIM, enabling faster 2D/3D image processing while maintaining compatibility with existing workflows.

Details

Motivation: Existing bioimage analysis tools face scalability issues with large microscopy datasets, lack GPU acceleration, have poor 3D support, and limited integration with modern scientific computing workflows.

Method: cubic extends SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM, using device-agnostic dispatching that automatically uses GPU when data is on device.

Result: Benchmarks show substantial speedups in individual operations and reproduction of deconvolution/segmentation pipelines while maintaining algorithmic fidelity.

Conclusion: cubic provides a robust foundation for scalable, reproducible bioimage analysis that integrates with Python’s scientific computing ecosystem, enabling both interactive exploration and high-throughput workflows.

Abstract: Quantitative analysis of multidimensional biological images is useful for understanding complex cellular phenotypes and accelerating advances in biomedical research. As modern microscopy generates ever-larger 2D and 3D datasets, existing computational approaches are increasingly limited by their scalability, efficiency, and integration with modern scientific computing workflows. Existing bioimage analysis tools often lack application programmable interfaces (APIs), do not support graphics processing unit (GPU) acceleration, lack broad 3D image processing capabilities, and/or have poor interoperability for compute-heavy workflows. Here, we introduce cubic, an open-source Python library that addresses these challenges by augmenting widely used SciPy and scikit-image APIs with GPU-accelerated alternatives from CuPy and RAPIDS cuCIM. cubic’s API is device-agnostic and dispatches operations to GPU when data reside on the device and otherwise executes on CPU, seamlessly accelerating a broad range of image processing routines. This approach enables GPU acceleration of existing bioimage analysis workflows, from preprocessing to segmentation and feature extraction for 2D and 3D data. We evaluate cubic both by benchmarking individual operations and by reproducing existing deconvolution and segmentation pipelines, achieving substantial speedups while maintaining algorithmic fidelity. These advances establish a robust foundation for scalable, reproducible bioimage analysis that integrates with the broader Python scientific computing ecosystem, including other GPU-accelerated methods, enabling both interactive exploration and automated high-throughput analysis workflows. cubic is openly available at https://github$.$com/alxndrkalinin/cubic

[192] Camera Movement Classification in Historical Footage: A Comparative Study of Deep Video Models

Tingyu Lin, Armin Dadras, Florian Kleber, Robert Sablatnig

Main category: cs.CV

TL;DR: First systematic evaluation of deep video camera movement classification models on historical archival footage, showing Video Swin Transformer achieves 80.25% accuracy on WWII footage despite limited training data.

Details

Motivation: Camera movement conveys essential spatial and narrative information in videos, but generalization of existing CMC methods to historical footage remains unexplored, creating a gap in understanding archival film content.

Method: Systematic evaluation of five standard video classification models on HISTORIAN dataset containing expert-annotated WWII footage, comparing model designs and label definitions across representative methods.

Result: Video Swin Transformer achieved the best performance with 80.25% accuracy, demonstrating strong convergence despite limited training data and challenges with low-quality video.

Conclusion: Findings highlight both challenges and potential for adapting existing models to historical footage, motivating future work that combines diverse input modalities and temporal architectures for improved archival video analysis.

Abstract: Camera movement conveys spatial and narrative information essential for understanding video content. While recent camera movement classification (CMC) methods perform well on modern datasets, their generalization to historical footage remains unexplored. This paper presents the first systematic evaluation of deep video CMC models on archival film material. We summarize representative methods and datasets, highlighting differences in model design and label definitions. Five standard video classification models are assessed on the HISTORIAN dataset, which includes expert-annotated World War II footage. The best-performing model, Video Swin Transformer, achieves 80.25% accuracy, showing strong convergence despite limited training data. Our findings highlight the challenges and potential of adapting existing models to low-quality video and motivate future work combining diverse input modalities and temporal architectures.

[193] Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures

Yuancheng Xu, Wenqi Xian, Li Ma, Julien Philip, Ahmet Levent Taşel, Yiwei Zhao, Ryan Burgert, Mingming He, Oliver Hermann, Oliver Pilarski, Rahul Garg, Paul Debevec, Ning Yu

Main category: cs.CV

TL;DR: A framework for video diffusion models that enables multi-view character consistency and 3D camera control through a novel customization pipeline using 4D Gaussian Splatting and video relighting.

Details

Motivation: To advance video generation integration into virtual production by addressing challenges in multi-view character consistency, camera control, and lighting adaptability.

Method: Train character consistency using volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting, with lighting variability from video relighting models. Fine-tune state-of-the-art video diffusion models on this data.

Result: Achieves strong multi-view identity preservation, precise camera control, lighting adaptability, multi-subject generation, scene customization, and motion/spatial layout control. Shows improved video quality and personalization accuracy.

Conclusion: The framework successfully advances video generation for virtual production with enhanced consistency, control, and adaptability capabilities.

Abstract: We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), lighting variability obtained with a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: https://eyeline-labs.github.io/Virtually-Being.

[194] LOTA: Bit-Planes Guided AI-Generated Image Detection

Hongsong Wang, Renxi Cheng, Yang Zhang, Chaolei Han, Jie Gui

Main category: cs.CV

TL;DR: A novel AI-generated image detection method using bit-plane-based noise analysis that achieves 98.9% accuracy and is nearly 100x faster than existing methods.

Details

Motivation: Current AI-generated image detection methods have high computational costs and fail to capture intrinsic noise features in raw images, making it difficult to distinguish AI-generated images from real ones.

Method: Uses bit-plane-based image processing to extract noise patterns, applies image normalization strategies, implements maximum gradient patch selection to amplify noise signals, and proposes lightweight classification heads with noise-based and noise-guided structures.

Result: Achieves 98.9% average accuracy on GenImage benchmark (11.9% improvement), excellent cross-generator generalization (98.2% from GAN to Diffusion, 99.2% from Diffusion to GAN), and millisecond-level processing that is nearly 100x faster than existing methods.

Conclusion: The proposed bit-plane-based noise analysis method provides highly accurate, fast, and generalizable AI-generated image detection with superior performance compared to existing approaches.

Abstract: The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of \textbf{98.9%} (\textbf{11.9}%~$\uparrow$) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. The code is at https://github.com/hongsong-wang/LOTA.

[195] PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis

Soumyya Kanti Datta, Tanvi Ranga, Chengzhe Sun, Siwei Lyu

Main category: cs.CV

TL;DR: PIA is a novel multimodal audio-visual framework that detects deepfakes by analyzing phoneme-temporal relationships and identity-dynamic inconsistencies across language, facial motion, and identity cues.

Details

Motivation: Current deepfake detection methods fail against advanced generative models (GANs, diffusion models, neural rendering) that create nearly perfect individual frames but leave subtle temporal discrepancies that traditional detectors miss.

Method: Uses phoneme sequences, lip geometry data, and advanced facial identity embeddings in a multimodal framework to analyze inconsistencies across language, dynamic face motion, and facial identification.

Result: The integrated method significantly improves detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities.

Conclusion: PIA effectively addresses limitations of conventional detection methods by leveraging multimodal analysis to catch temporal discrepancies that single-modality approaches miss.

Abstract: The rise of manipulated media has made deepfakes a particularly insidious threat, involving various generative manipulations such as lip-sync modifications, face-swaps, and avatar-driven facial synthesis. Conventional detection methods, which predominantly depend on manually designed phoneme-viseme alignment thresholds, fundamental frame-level consistency checks, or a unimodal detection strategy, inadequately identify modern-day deepfakes generated by advanced generative models such as GANs, diffusion models, and neural rendering techniques. These advanced techniques generate nearly perfect individual frames yet inadvertently create minor temporal discrepancies frequently overlooked by traditional detectors. We present a novel multimodal audio-visual framework, Phoneme-Temporal and Identity-Dynamic Analysis(PIA), incorporating language, dynamic face motion, and facial identification cues to address these limitations. We utilize phoneme sequences, lip geometry data, and advanced facial identity embeddings. This integrated method significantly improves the detection of subtle deepfake alterations by identifying inconsistencies across multiple complementary modalities. Code is available at https://github.com/skrantidatta/PIA

[196] Event Interval Modulation: A Novel Scheme for Event-based Optical Camera Communication

Miu Sumino, Mayu Ishii, Shun Kaizu, Daisuke Hisano, Yu Nakayama

Main category: cs.CV

TL;DR: Proposes Event Interval Modulation (EIM) for event-based optical camera communication, achieving 28 kbps over 10m and 8.4 kbps over 50m - setting new bit rate benchmarks.

Details

Motivation: Existing event-based OCC systems use conventional modulation schemes that don't fully exploit the unique characteristics of event-based vision sensors (EVS), which offer high-speed, low-latency communication but suffer from low bit rates in current implementations.

Method: Developed Event Interval Modulation (EIM) scheme that modulates information using intervals between events. Tuned EVS parameters for optimal frequency response, determined maximum modulation order experimentally, and conducted transmission experiments.

Result: Achieved successful transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in indoor environments, setting new benchmarks for bit rate in event-based OCC systems.

Conclusion: EIM effectively exploits the unique characteristics of EVS for optical camera communication, significantly improving transmission speeds compared to existing methods and demonstrating practical long-range communication capabilities.

Abstract: Optical camera communication (OCC) represents a promising visible light communication technology. Nonetheless, typical OCC systems utilizing frame-based cameras are encumbered by limitations, including low bit rate and high processing load. To address these issues, OCC system utilizing an event-based vision sensor (EVS) as receivers have been proposed. The EVS enables high-speed, low-latency, and robust communication due to its asynchronous operation and high dynamic range. In existing event-based OCC systems, conventional modulation schemes such as on-off keying (OOK) and pulse position modulation have been applied, however, to the best of our knowledge, no modulation method has been proposed that fully exploits the unique characteristics of the EVS. This paper proposes a novel modulation scheme, called the event interval modulation (EIM) scheme, specifically designed for event-based OCC. EIM enables improvement in transmission speed by modulating information using the intervals between events. This paper proposes a theoretical model of EIM and conducts a proof-of-concept experiment. First, the parameters of the EVS are tuned and customized to optimize the frequency response specifically for EIM. Then, the maximum modulation order usable in EIM is determined experimentally. We conduct transmission experiments based on the obtained parameters. Finally, we report successful transmission at 28 kbps over 10 meters and 8.4 kbps over 50 meters in an indoor environment. This sets a new benchmark for bit rate in event-based OCC systems.

[197] MACE: Mixture-of-Experts Accelerated Coordinate Encoding for Large-Scale Scene Localization and Rendering

Mingkai Liu, Dikai Fan, Haohua Que, Haojia Gao, Xiao Liu, Shuxue Peng, Meixia Lin, Shengyu Gu, Ruicong Ye, Wanli Qiu, Handong Yao, Ruopeng Zhang, Xianliang Huang

Main category: cs.CV

TL;DR: MACE enables efficient localization and high-quality rendering in large-scale scenes using a mixed expert approach with gating networks and auxiliary-loss-free load balancing.

Details

Motivation: Address computational cost challenges in large-scale scene localization and rendering, overcoming limitations of single-network Scene Coordinate Regression methods.

Method: Propose Mixed Expert-based Accelerated Coordinate Encoding (MACE) with gating network to select sub-networks, and Auxiliary-Loss-Free Load Balancing strategy.

Result: Significant cost reduction while maintaining higher precision; achieves high-quality rendering with only 10 minutes training on Cambridge test set.

Conclusion: MACE provides an efficient solution for large-scale scene applications with improved localization accuracy and rendering quality.

Abstract: Efficient localization and high-quality rendering in large-scale scenes remain a significant challenge due to the computational cost involved. While Scene Coordinate Regression (SCR) methods perform well in small-scale localization, they are limited by the capacity of a single network when extended to large-scale scenes. To address these challenges, we propose the Mixed Expert-based Accelerated Coordinate Encoding method (MACE), which enables efficient localization and high-quality rendering in large-scale scenes. Inspired by the remarkable capabilities of MOE in large model domains, we introduce a gating network to implicitly classify and select sub-networks, ensuring that only a single sub-network is activated during each inference. Furtheremore, we present Auxiliary-Loss-Free Load Balancing(ALF-LB) strategy to enhance the localization accuracy on large-scale scene. Our framework provides a significant reduction in costs while maintaining higher precision, offering an efficient solution for large-scale scene applications. Additional experiments on the Cambridge test set demonstrate that our method achieves high-quality rendering results with merely 10 minutes of training.

[198] Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

Liao Shen, Wentao Jiang, Yiran Zhu, Tiezheng Ge, Zhiguo Cao, Bo Zheng

Main category: cs.CV

TL;DR: IPRO is a reinforcement learning-based video diffusion framework that enhances identity preservation in image-to-video generation by optimizing diffusion models using a face identity scorer and novel facial scoring mechanism.

Details

Motivation: Existing image-to-video models struggle with maintaining identity consistency between input human images and generated videos, especially when faces are small or when there are significant expression changes and movements.

Method: Proposes Identity-Preserving Reward-guided Optimization (IPRO) that backpropagates reward signals through the last sampling steps, uses a facial scoring mechanism that treats ground-truth videos as facial feature pools, and incorporates KL-divergence regularization for stability.

Result: Extensive experiments on Wan 2.2 I2V model and in-house I2V model demonstrate the effectiveness of the method in enhancing identity preservation.

Conclusion: IPRO provides a direct and effective tuning algorithm for improving identity consistency in human-centric video generation without requiring architectural changes or auxiliary modules.

Abstract: Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at \href{https://ipro-alimama.github.io/}{https://ipro-alimama.github.io/}.

[199] Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning

Xiangyu Meng, Zixian Zhang, Zhenghao Zhang, Junchao Liao, Long Qin, Weizhi Wang

Main category: cs.CV

TL;DR: Identity-GRPO is a human feedback-driven optimization pipeline that improves multi-human identity preservation in video generation, achieving up to 18.9% improvement in human consistency metrics.

Details

Motivation: Existing methods like VACE and Phantom struggle with maintaining consistent identities across multiple characters in dynamic interactions, which is critical for realistic video generation.

Method: Proposes Identity-GRPO pipeline with: 1) video reward model trained on large-scale preference dataset with human-annotated and synthetic distortion data, 2) GRPO variant tailored for multi-human consistency optimization.

Result: Achieves up to 18.9% improvement in human consistency metrics over baseline methods, with extensive ablation studies showing impact of annotation quality and design choices.

Conclusion: The method offers actionable insights for aligning reinforcement learning with personalized video generation and greatly enhances existing video generation models like VACE and Phantom.

Abstract: While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.

[200] MetaQAP - A Meta-Learning Approach for Quality-Aware Pretraining in Image Quality Assessment

Nisar Ahmed, Gulshan Saleem, Nazik Alturki, Nada Alasbali

Main category: cs.CV

TL;DR: MetaQAP is a no-reference image quality assessment model that uses quality-aware pre-training and meta-learning to achieve state-of-the-art performance on multiple benchmark datasets.

Details

Motivation: Image Quality Assessment remains challenging due to subjective human perception and complex real-world distortions, requiring more robust and generalizable approaches.

Method: The model uses quality-aware pre-training on CNNs, implements a quality-aware loss function, and integrates a meta-learner to form an ensemble model combining predictions from multiple base models.

Result: Achieved exceptional performance with PLCC/SROCC scores of 0.9885/0.9812 on LiveCD, 0.9702/0.9658 on KonIQ-10K, and 0.884/0.8765 on BIQ2021, outperforming existing methods. Cross-dataset evaluations showed strong generalizability.

Conclusion: MetaQAP addresses authentic distortion complexities and establishes a robust, generalizable framework for practical IQA applications, advancing the state-of-the-art in no-reference IQA.

Abstract: Image Quality Assessment (IQA) is a critical task in a wide range of applications but remains challenging due to the subjective nature of human perception and the complexity of real-world image distortions. This study proposes MetaQAP, a novel no-reference IQA model designed to address these challenges by leveraging quality-aware pre-training and meta-learning. The model performs three key contributions: pre-training Convolutional Neural Networks (CNNs) on a quality-aware dataset, implementing a quality-aware loss function to optimize predictions, and integrating a meta-learner to form an ensemble model that effectively combines predictions from multiple base models. Experimental evaluations were conducted on three benchmark datasets: LiveCD, KonIQ-10K, and BIQ2021. The proposed MetaQAP model achieved exceptional performance with Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank Order Correlation Coefficient (SROCC) scores of 0.9885/0.9812 on LiveCD, 0.9702/0.9658 on KonIQ-10K, and 0.884/0.8765 on BIQ2021, outperforming existing IQA methods. Cross-dataset evaluations further demonstrated the generalizability of the model, with PLCC and SROCC scores ranging from 0.6721 to 0.8023 and 0.6515 to 0.7805, respectively, across diverse datasets. The ablation study confirmed the significance of each model component, revealing substantial performance degradation when critical elements such as the meta-learner or quality-aware loss function were omitted. MetaQAP not only addresses the complexities of authentic distortions but also establishes a robust and generalizable framework for practical IQA applications. By advancing the state-of-the-art in no-reference IQA, this research provides valuable insights and methodologies for future improvements and extensions in the field.

[201] MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching

Tingman Yan, Tao Liu, Xilian Yang, Qunfei Zhao, Zeyang Xia

Main category: cs.CV

TL;DR: Proposes MatchAttention for cross-view matching with dynamic relative position matching and BilinearSoftmax for efficient attention sampling. Achieves state-of-the-art performance with low computational complexity.

Details

Motivation: Existing cross-attention mechanisms have quadratic complexity and lack explicit matching constraints, making high-resolution image matching challenging.

Method: Uses MatchAttention with dynamic relative position matching, BilinearSoftmax for continuous attention sampling, and hierarchical MatchDecoder. Includes gated cross-MatchAttention and consistency loss to handle occlusions.

Result: MatchStereo-B ranked 1st on Middlebury benchmark, processes KITTI in 29ms. MatchStereo-T handles 4K UHD in 0.1s with 3GB GPU. SOTA on KITTI 2012/2015, ETH3D, Spring flow datasets.

Conclusion: Enables real-time, high-resolution, high-accuracy cross-view matching with efficient attention mechanism and occlusion handling.

Abstract: Cross-view matching is fundamentally achieved through cross-attention mechanisms. However, matching of high-resolution images remains challenging due to the quadratic complexity and lack of explicit matching constraints in the existing cross-attention. This paper proposes an attention mechanism, MatchAttention, that dynamically matches relative positions. The relative position determines the attention sampling center of the key-value pairs given a query. Continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. The relative positions are iteratively updated through residual connections across layers by embedding them into the feature channels. Since the relative position is exactly the learning target for cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components collectively mitigate the impact of occlusions in both forward and backward passes, allowing the model to focus more on learning matching relationships. When applied to stereo matching, MatchStereo-B ranked 1st in average error on the public Middlebury benchmark and requires only 29ms for KITTI-resolution inference. MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3GB of GPU memory. The proposed models also achieve state-of-the-art performance on KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets. The combination of high accuracy and low computational complexity makes real-time, high-resolution, and high-accuracy cross-view matching possible. Code is available at https://github.com/TingmanYan/MatchAttention.

[202] Experimental Demonstration of Event-based Optical Camera Communication in Long-Range Outdoor Environment

Miu Sumino, Mayu Ishii, Shun Kaizu, Daisuke Hisano, Yu Nakayama

Main category: cs.CV

TL;DR: Robust demodulation scheme for optical camera communication using event-based vision sensor with OOK, toggle demodulation, and digital PLL.

Details

Motivation: To develop a robust demodulation method for optical camera communication systems that can achieve reliable performance at long distances and high data rates in outdoor environments.

Method: Combines On-Off Keying (OOK) with toggle demodulation and a digital phase-locked loop (PLL) using an event-based vision sensor.

Result: Achieved BER < 10^-3 at 200m-60kbps and 400m-30kbps in outdoor experiments - the first reported achievement at these specifications.

Conclusion: The proposed scheme successfully enables robust optical camera communication with high performance at extended ranges and data rates in outdoor settings.

Abstract: We propose a robust demodulation scheme for optical camera communication systems using an event-based vision sensor, combining OOK with toggle demodulation and a digital phase-locked loop. This is the first report to achieve a $\mathrm{BER} < 10^{-3}$ at 200m-60kbps and 400m-30kbps in outdoor experiments.

[203] GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering

Alexander Valverde, Brian Xu, Yuyin Zhou, Meng Xu, Hongyun Wang

Main category: cs.CV

TL;DR: GauSSmart is a hybrid method that combines 2D foundational models with 3D Gaussian Splatting to improve scene reconstruction by addressing sparse coverage and fine detail preservation issues.

Details

Motivation: Gaussian Splatting struggles with capturing fine details and maintaining realism in sparse regions due to limitations of sparse 3D training data, motivating the need for hybrid approaches.

Method: Integrates 2D computer vision techniques including convex filtering and semantic feature supervision from foundational models like DINO, using 2D segmentation priors and high-dimensional feature embeddings to guide Gaussian splat densification and refinement.

Result: Outperforms existing Gaussian Splatting methods across three datasets in the majority of evaluated scenes, demonstrating improved coverage in underrepresented areas and better preservation of structural details.

Conclusion: Hybrid 2D-3D approaches combining 2D foundational models with 3D reconstruction pipelines can overcome limitations inherent in either approach alone, showing significant potential for enhanced scene reconstruction.

Abstract: Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.

[204] CLEAR: Causal Learning Framework For Robust Histopathology Tumor Detection Under Out-Of-Distribution Shifts

Kieu-Anh Truong Thi, Huy-Hieu Pham, Duc-Trong Le

Main category: cs.CV

TL;DR: A causal-inference-based framework that uses the front-door principle to address domain shift in histopathology by leveraging semantic features and mitigating confounders, achieving up to 7% improvement on benchmark datasets.

Details

Motivation: Domain shift in histopathology caused by differences in acquisition processes or data sources challenges deep learning model generalization. Existing methods focus on statistical correlations but overlook causal relationships.

Method: Proposed causal-inference framework implementing the front-door principle with transformation strategies that incorporate mediators and observed tissue slides to leverage semantic features while mitigating confounder impact.

Result: Validated on CAMELYON17 and private histopathology datasets, achieving consistent performance gains across unseen domains with up to 7% improvement, outperforming existing baselines.

Conclusion: Causal inference shows strong potential as a powerful tool for addressing domain shift in histopathology image analysis, providing more robust generalization capabilities.

Abstract: Domain shift in histopathology, often caused by differences in acquisition processes or data sources, poses a major challenge to the generalization ability of deep learning models. Existing methods primarily rely on modeling statistical correlations by aligning feature distributions or introducing statistical variation, yet they often overlook causal relationships. In this work, we propose a novel causal-inference-based framework that leverages semantic features while mitigating the impact of confounders. Our method implements the front-door principle by designing transformation strategies that explicitly incorporate mediators and observed tissue slides. We validate our method on the CAMELYON17 dataset and a private histopathology dataset, demonstrating consistent performance gains across unseen domains. As a result, our approach achieved up to a 7% improvement in both the CAMELYON17 dataset and the private histopathology dataset, outperforming existing baselines. These results highlight the potential of causal inference as a powerful tool for addressing domain shift in histopathology image analysis.

[205] Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding

Kyungryul Back, Seongbeom Park, Milim Kim, Mincheol Kwon, SangHyeok Lee, Hyunyoung Lee, Junhee Cho, Seunghyun Park, Jinkyu Kim

Main category: cs.CV

TL;DR: Training-free tri-layer contrastive decoding with watermarking reduces hallucinations in Large Vision-Language Models by selecting mature/amateur layers, identifying pivot layers via watermark questions, and applying contrastive decoding.

Details

Motivation: LVLMs often hallucinate by relying too heavily on single modalities or memorizing training data without proper visual grounding, despite showing promising multimodal performance.

Method: Three-step approach: (1) select mature and amateur layers among decoding layers, (2) identify pivot layer using watermark-related questions to assess visual grounding, (3) apply tri-layer contrastive decoding for final output generation.

Result: Achieves state-of-the-art performance on benchmarks (POPE, MME, AMBER) in reducing hallucinations and generates more visually grounded responses.

Conclusion: The proposed training-free method effectively mitigates hallucinations in LVLMs through tri-layer contrastive decoding with watermarking, improving visual grounding without requiring additional training.

Abstract: Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations – they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.

[206] A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection

Shivangi Yadav, Arun Ross

Main category: cs.CV

TL;DR: MID-StyleGAN is a novel framework that generates synthetic ocular images for iris presentation attack detection, combining diffusion models and GANs to address data scarcity in biometric security.

Details

Motivation: There is a scarcity of datasets for training and evaluating iris presentation attack detection (PAD) techniques due to difficulties in constructing and imaging presentation attacks (PAs) like artificial eyes, printed images, and cosmetic contact lenses.

Method: MID-StyleGAN combines diffusion models and GANs with a multi-domain architecture that enables translation between bonafide ocular images and different PA domains. It uses an adaptive loss function tailored for ocular data to maintain domain consistency.

Result: The method outperforms existing approaches in generating high-quality synthetic ocular images. On the LivDet2020 dataset, it improved the true detect rate at 1% false detect rate from 93.41% to 98.72%, significantly enhancing PAD system performance.

Conclusion: MID-StyleGAN provides a scalable solution to data scarcity in iris and ocular biometrics by generating realistic synthetic data that effectively improves presentation attack detection systems.

Abstract: An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.

[207] Vision-Centric Activation and Coordination for Multimodal Large Language Models

Yunnan Wang, Fan Lu, Kecheng Zheng, Ziyuan Huang, Ziqiang Li, Wenjun Zeng, Xin Jin

Main category: cs.CV

TL;DR: VaCo enhances MLLMs by integrating vision-centric supervision from multiple vision foundation models through visual discriminative alignment, addressing the limitation of text-only supervision in existing MLLMs.

Details

Motivation: Mainstream MLLMs are supervised only by next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities, which limits their visual comprehension capabilities.

Method: Introduces VaCo with learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) to activate specific visual signals under supervision from diverse VFMs, and uses Token Gateway Mask (TGM) to coordinate representation conflicts across multiple VFMs.

Result: Extensive experiments show VaCo significantly improves performance of different MLLMs on various benchmarks, demonstrating superior capabilities in visual comprehension.

Conclusion: VaCo effectively addresses the vision-centric supervision gap in MLLMs by coordinating multiple vision foundation models, leading to enhanced visual comprehension performance across diverse benchmarks.

Abstract: Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by the next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To track this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate the learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.

[208] Leveraging Cycle-Consistent Anchor Points for Self-Supervised RGB-D Registration

Siddharth Tourani, Jayaram Reddy, Sarvesh Thakur, K Madhava Krishna, Muhammad Haris Khan, N Dinesh Reddy

Main category: cs.CV

TL;DR: A self-supervised RGB-D registration method using cycle-consistent keypoints and a novel pose block with GRU and transformation synchronization, achieving state-of-the-art performance on ScanNet and 3DMatch datasets.

Details

Motivation: To leverage the abundance of unlabeled RGB-D data from consumer depth cameras for geometric scene reasoning, moving beyond traditional geometric and feature-based similarity approaches.

Method: Uses cycle-consistent keypoints for spatial coherence constraints and introduces a pose block combining GRU recurrent unit with transformation synchronization to blend historical and multi-view data.

Result: Outperforms previous self-supervised registration methods on ScanNet and 3DMatch datasets, and even surpasses some older supervised methods.

Conclusion: The proposed components can be effectively integrated into existing methods, demonstrating their broad applicability and effectiveness in RGB-D registration tasks.

Abstract: With the rise in consumer depth cameras, a wealth of unlabeled RGB-D data has become available. This prompts the question of how to utilize this data for geometric reasoning of scenes. While many RGB-D registration meth- ods rely on geometric and feature-based similarity, we take a different approach. We use cycle-consistent keypoints as salient points to enforce spatial coherence constraints during matching, improving correspondence accuracy. Additionally, we introduce a novel pose block that combines a GRU recurrent unit with transformation synchronization, blending historical and multi-view data. Our approach surpasses previous self- supervised registration methods on ScanNet and 3DMatch, even outperforming some older supervised methods. We also integrate our components into existing methods, showing their effectiveness.

[209] Spatial Preference Rewarding for MLLMs Spatial Understanding

Han Qiu, Peng Gao, Lewei Lu, Xiaoqin Zhang, Ling Shao, Shijian Lu

Main category: cs.CV

TL;DR: SPR (Spatial Preference Rewarding) enhances MLLMs’ spatial understanding by rewarding detailed responses with precise object localization over vague ones, using semantic and localization scores for direct preference optimization.

Details

Motivation: MLLMs lack fine-grained spatial perception abilities like detailed region descriptions and accurate object localization, and often fail to meet user requirements for spatial understanding due to focusing on pre-annotated data without direct response supervision.

Method: SPR introduces semantic and localization scores to evaluate MLLM-generated descriptions, pairs best-scored refinements with lowest-scored initial descriptions for direct preference optimization, and uses randomly selected image regions and descriptions.

Result: Extensive experiments show SPR effectively improves MLLM spatial understanding capabilities with minimal training overhead on standard referring and grounding benchmarks.

Conclusion: SPR successfully enhances MLLMs’ fine-grained spatial capabilities through preference-based optimization, addressing limitations in detailed spatial perception and localization accuracy.

Abstract: Multimodal large language models~(MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user’s requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs’ actual responses. We address this issue by SPR, a Spatial Preference Rewarding~(SPR) approach that enhances MLLMs’ spatial capabilities by rewarding MLLMs’ detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at https://github.com/hanqiu-hq/SPR

[210] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

Dongnam Byun, Jungwon Park, Jumgmin Ko, Changin Choi, Wonjong Rhee

Main category: cs.CV

TL;DR: DOS improves multi-object image generation by modifying CLIP text embeddings to address object neglect and mixing issues in text-to-image models.

Details

Motivation: Current text-to-image models struggle with prompts containing multiple objects, often resulting in object neglect or object mixing across four problematic scenarios.

Method: DOS modifies three types of CLIP text embeddings before passing them to text-to-image models, based on observations about CLIP embedding properties.

Result: DOS consistently improves multi-object generation success rate, reduces object mixing, and significantly outperforms four competing methods by 26.24%-43.04% in human evaluations.

Conclusion: DOS is a practical and effective solution for improving multi-object image generation in text-to-image models.

Abstract: Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.

[211] DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights

Danish Ali, Ajmal Mian, Naveed Akhtar, Ghulam Mubashar Hassan

Main category: cs.CV

TL;DR: Proposes DRBD-Mamba, an efficient 3D brain tumor segmentation model using dual-resolution bi-directional Mamba with space-filling curves and gated fusion, achieving improved accuracy and 15x efficiency gains.

Details

Motivation: Address computational overhead of Mamba-based models and lack of robustness evaluation across diverse BraTS data partitions for brain tumor segmentation.

Method: Uses space-filling curve for 3D-to-1D mapping, gated fusion module for forward/reverse context integration, quantization block for feature discretization, and proposes systematic five-fold evaluation on BraTS2023.

Result: Achieves Dice improvements of 0.10% (whole tumor), 1.75% (tumor core), 0.93% (enhancing tumor) on test set, with 15x efficiency improvement while maintaining accuracy.

Conclusion: DRBD-Mamba provides robust and computationally efficient brain tumor segmentation with competitive performance across diverse data conditions.

Abstract: Accurate brain tumor segmentation is significant for clinical diagnosis and treatment. It is challenging due to the heterogeneity of tumor subregions. Mamba-based State Space Models have demonstrated promising performance. However, they incur significant computational overhead due to sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address these limitations, we propose dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that discretizes features to improve robustness. In addition, we propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present detailed analysis of common failure scenarios. On the 20% test set used by recent methods, our model achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor. Evaluations on the proposed systematic five folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 0.86% for tumor core and 1.45% for enhancing tumor over existing state-of-the-art. Furthermore, our model attains 15 times improvement in efficiency while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing approaches.

[212] BoardVision: Deployment-ready and Robust Motherboard Defect Detection with YOLO+Faster-RCNN Ensemble

Brandon Hill, Kma Solaiman

Main category: cs.CV

TL;DR: BoardVision is a reproducible framework for detecting assembly-level motherboard defects using computer vision, featuring systematic benchmarking of YOLOv7 and Faster R-CNN detectors, a lightweight ensemble method (CTV Voter), robustness evaluation under realistic perturbations, and a deployable GUI inspection tool.

Details

Motivation: Assembly-level inspection of full motherboards remains underexplored compared to bare-board or trace-level defects, despite being critical for reliability in high-volume electronics manufacturing.

Method: Benchmarked YOLOv7 and Faster R-CNN on MiracleFactory motherboard dataset; proposed Confidence-Temporal Voting (CTV Voter) ensemble to balance precision and recall; evaluated robustness under sharpness, brightness, and orientation perturbations; developed GUI-driven inspection tool.

Result: YOLOv7 excels in precision but underperforms in recall, while Faster R-CNN shows the reverse pattern; CTV Voter ensemble effectively balances precision and recall; identified stability challenges under realistic perturbations.

Conclusion: Computer vision techniques can successfully transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing through systematic evaluation, ensemble methods, and deployable tools.

Abstract: Motherboard defect detection is critical for ensuring reliability in high-volume electronics manufacturing. While prior research in PCB inspection has largely targeted bare-board or trace-level defects, assembly-level inspection of full motherboards inspection remains underexplored. In this work, we present BoardVision, a reproducible framework for detecting assembly-level defects such as missing screws, loose fan wiring, and surface scratches. We benchmark two representative detectors - YOLOv7 and Faster R-CNN, under controlled conditions on the MiracleFactory motherboard dataset, providing the first systematic comparison in this domain. To mitigate the limitations of single models, where YOLO excels in precision but underperforms in recall and Faster R-CNN shows the reverse, we propose a lightweight ensemble, Confidence-Temporal Voting (CTV Voter), that balances precision and recall through interpretable rules. We further evaluate robustness under realistic perturbations including sharpness, brightness, and orientation changes, highlighting stability challenges often overlooked in motherboard defect detection. Finally, we release a deployable GUI-driven inspection tool that bridges research evaluation with operator usability. Together, these contributions demonstrate how computer vision techniques can transition from benchmark results to practical quality assurance for assembly-level motherboard manufacturing.

[213] DCMIL: A Progressive Representation Learning Model of Whole Slide Images for Cancer Prognosis Analysis

Chao Tu, Kun Huang, Jie Zhang, Qianjin Feng, Yu Zhang, Zhenyuan Ning

Main category: cs.CV

TL;DR: DCMIL is a dual-curriculum contrastive multi-instance learning model that efficiently processes whole slide images for cancer prognosis without dense annotations, outperforming standard methods across 12 cancer types.

Details

Motivation: Progress in computational pathology is hindered by computational bottlenecks from gigapixel-size inputs and scarcity of dense manual annotations, with current methods overlooking fine-grained information across multi-magnification WSIs and tumor microenvironment variations.

Method: Proposes an easy-to-hard progressive representation learning model called DCMIL that uses dual-curriculum contrastive multi-instance learning to process WSIs without dense annotations and directly transform gigapixel-size images into outcome predictions.

Result: Extensive experiments on 12 cancer types (5,954 patients, 12.54 million tiles) show DCMIL outperforms standard WSI-based prognostic models, identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues.

Conclusion: DCMIL offers an effective approach for computational pathology that doesn’t require dense annotations, provides biological insights, and has potential for generating new discoveries in cancer prognosis.

Abstract: The burgeoning discipline of computational pathology shows promise in harnessing whole slide images (WSIs) to quantify morphological heterogeneity and develop objective prognostic modes for human cancers. However, progress is impeded by the computational bottleneck of gigapixel-size inputs and the scarcity of dense manual annotations. Current methods often overlook fine-grained information across multi-magnification WSIs and variations in tumor microenvironments. Here, we propose an easy-to-hard progressive representation learning model, termed dual-curriculum contrastive multi-instance learning (DCMIL), to efficiently process WSIs for cancer prognosis. The model does not rely on dense annotations and enables the direct transformation of gigapixel-size WSIs into outcome predictions. Extensive experiments on twelve cancer types (5,954 patients, 12.54 million tiles) demonstrate that DCMIL outperforms standard WSI-based prognostic models. Additionally, DCMIL identifies fine-grained prognosis-salient regions, provides robust instance uncertainty estimation, and captures morphological differences between normal and tumor tissues, with the potential to generate new biological insights. All codes have been made publicly accessible at https://github.com/tuuuc/DCMIL.

[214] Real-Time Neural Video Compression with Unified Intra and Inter Coding

Hui Xiang, Yifan Bian, Li Li, Jingran Wu, Xianguo Zhang, Dong Liu

Main category: cs.CV

TL;DR: A neural video compression framework that combines intra and inter coding in a unified model, addressing limitations of existing NVC schemes by handling disocclusion, new content, and error propagation more effectively.

Details

Motivation: Existing neural video compression schemes have limitations in handling disocclusion, new content, and suffer from interframe error propagation. The authors aim to eliminate these issues by borrowing concepts from classic video coding.

Method: Proposed an NVC framework with unified intra and inter coding where every frame is processed by a single adaptive model. Also introduced simultaneous two-frame compression to exploit both forward and backward interframe redundancy.

Result: Outperforms DCVC-RT by average 10.7% BD-rate reduction, delivers more stable bitrate and quality per frame, while maintaining real-time encoding/decoding performance.

Conclusion: The unified intra/inter coding approach effectively handles disocclusion and new content while naturally intercepting error propagation, achieving superior compression efficiency over state-of-the-art methods.

Abstract: Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.

[215] Structured Universal Adversarial Attacks on Object Detection for Video Sequences

Sven Jacob, Weijia Shao, Gjergji Kasneci

Main category: cs.CV

TL;DR: A minimally distorted universal adversarial attack for video object detection that uses nuclear norm regularization to create structured background perturbations, optimized with an adaptive exponentiated gradient method.

Details

Motivation: Video-based object detection is crucial for safety-critical applications, but deep learning detectors remain vulnerable to adversarial attacks, especially universal perturbations.

Method: Proposes a universal adversarial attack using nuclear norm regularization to concentrate perturbations in the background, optimized with an adaptive optimistic exponentiated gradient method for efficiency.

Result: The attack outperforms low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness.

Conclusion: The proposed method provides an effective and stealthy universal adversarial attack for video object detection, with publicly available code and data.

Abstract: Video-based object detection plays a vital role in safety-critical applications. While deep learning-based object detectors have achieved impressive performance, they remain vulnerable to adversarial attacks, particularly those involving universal perturbations. In this work, we propose a minimally distorted universal adversarial attack tailored for video object detection, which leverages nuclear norm regularization to promote structured perturbations concentrated in the background. To optimize this formulation efficiently, we employ an adaptive, optimistic exponentiated gradient method that enhances both scalability and convergence. Our results demonstrate that the proposed attack outperforms both low-rank projected gradient descent and Frank-Wolfe based attacks in effectiveness while maintaining high stealthiness. All code and data are publicly available at https://github.com/jsve96/AO-Exp-Attack.

[216] Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review

Youwan Mahé, Elise Bannier, Stéphanie Leplaideur, Elisa Fromont, Francesca Galassi

Main category: cs.CV

TL;DR: This scoping review synthesizes recent work on unsupervised deep generative models for anomaly detection in neuroimaging, covering autoencoders, VAEs, GANs, and diffusion models applied to brain MRI/CT across various pathologies.

Details

Motivation: Unsupervised deep generative models offer a promising alternative to supervised methods by training exclusively on healthy data and identifying anomalies as deviations from learned normative brain structures, overcoming limitations of supervised approaches that require large annotated datasets.

Method: PRISMA-guided scoping review of 49 studies (2018-2025) comparing unsupervised deep generative models including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models for neuroimaging anomaly detection.

Result: Generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing subtle abnormalities. They can produce interpretable pseudo-healthy reconstructions, which is valuable when annotated data are scarce.

Conclusion: Unsupervised generative models offer a compelling direction for anomaly detection, enabling semi-supervised learning and novel biomarker discovery. Future work should prioritize anatomy-aware modeling, foundation model development, appropriate evaluation metrics, and rigorous clinical validation.

Abstract: Unsupervised deep generative models are emerging as a promising alternative to supervised methods for detecting and segmenting anomalies in brain imaging. Unlike fully supervised approaches, which require large voxel-level annotated datasets and are limited to well-characterised pathologies, these models can be trained exclusively on healthy data and identify anomalies as deviations from learned normative brain structures. This PRISMA-guided scoping review synthesises recent work on unsupervised deep generative models for anomaly detection in neuroimaging, including autoencoders, variational autoencoders, generative adversarial networks, and denoising diffusion models. A total of 49 studies published between 2018 - 2025 were identified, covering applications to brain MRI and, less frequently, CT across diverse pathologies such as tumours, stroke, multiple sclerosis, and small vessel disease. Reported performance metrics are compared alongside architectural design choices. Across the included studies, generative models achieved encouraging performance for large focal lesions and demonstrated progress in addressing more subtle abnormalities. A key strength of generative models is their ability to produce interpretable pseudo-healthy (also referred to as counterfactual) reconstructions, which is particularly valuable when annotated data are scarce, as in rare or heterogeneous diseases. Looking ahead, these models offer a compelling direction for anomaly detection, enabling semi-supervised learning, supporting the discovery of novel imaging biomarkers, and facilitating within- and cross-disease deviation mapping in unified end-to-end frameworks. To realise clinical impact, future work should prioritise anatomy-aware modelling, development of foundation models, task-appropriate evaluation metrics, and rigorous clinical validation.

[217] Pruning Overparameterized Multi-Task Networks for Degraded Web Image Restoration

Thomas Katraouras, Dimitrios Rafailidis

Main category: cs.CV

TL;DR: Proposes MIR-L, a compression strategy for multi-task image restoration models that finds highly sparse subnetworks through iterative pruning, maintaining performance with only 10% of parameters.

Details

Motivation: Multi-task image restoration models are computationally inefficient due to excessive parameters, while image quality degradation from online social networks negatively impacts user experience.

Method: Uses iterative pruning strategy that removes low-magnitude weights across multiple rounds and resets remaining weights to original initialization, uncovering sparse “winning tickets” within overparameterized models.

Result: Experimental evaluation shows MIR-L retains only 10% of trainable parameters while maintaining high performance on deraining, dehazing, and denoising tasks across benchmark datasets.

Conclusion: The proposed compression strategy effectively creates highly sparse multi-task image restoration models that match or surpass dense counterparts, enabling computationally efficient deployment.

Abstract: Image quality is a critical factor in delivering visually appealing content on web platforms. However, images often suffer from degradation due to lossy operations applied by online social networks (OSNs), negatively affecting user experience. Image restoration is the process of recovering a clean high-quality image from a given degraded input. Recently, multi-task (all-in-one) image restoration models have gained significant attention, due to their ability to simultaneously handle different types of image degradations. However, these models often come with an excessively high number of trainable parameters, making them computationally inefficient. In this paper, we propose a strategy for compressing multi-task image restoration models. We aim to discover highly sparse subnetworks within overparameterized deep models that can match or even surpass the performance of their dense counterparts. The proposed model, namely MIR-L, utilizes an iterative pruning strategy that removes low-magnitude weights across multiple rounds, while resetting the remaining weights to their original initialization. This iterative process is important for the multi-task image restoration model’s optimization, effectively uncovering “winning tickets” that maintain or exceed state-of-the-art performance at high sparsity levels. Experimental evaluation on benchmark datasets for the deraining, dehazing, and denoising tasks shows that MIR-L retains only 10% of the trainable parameters while maintaining high image restoration performance. Our code, datasets and pre-trained models are made publicly available at https://github.com/Thomkat/MIR-L.

[218] Grazing Detection using Deep Learning and Sentinel-2 Time Series Data

Aleksis Pirinen, Delia Fano Yela, Smita Chakraborty, Erik Källman

Main category: cs.CV

TL;DR: The paper presents a method for detecting seasonal grazing using Sentinel-2 satellite imagery and CNN-LSTM models, achieving 77% F1 score and demonstrating practical utility for conservation compliance monitoring.

Details

Motivation: Grazing monitoring is crucial for both agricultural production and biodiversity conservation, but scalable monitoring methods are limited. Current approaches lack the ability to efficiently detect grazing patterns across large areas.

Method: Uses Sentinel-2 L2A time series data from April-October for binary prediction (grazed/not grazed). Trains an ensemble of CNN-LSTM models on multi-temporal reflectance features from polygon-defined field boundaries.

Result: Achieved 77% average F1 score across five validation splits, with 90% recall on grazed pastures. Model prioritization yields 17.2 times more confirmed non-grazing sites than random inspection when inspectors can visit only 4% of sites annually.

Conclusion: Coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance. The approach enables efficient monitoring of grazing patterns at scale.

Abstract: Grazing shapes both agricultural production and biodiversity, yet scalable monitoring of where grazing occurs remains limited. We study seasonal grazing detection from Sentinel-2 L2A time series: for each polygon-defined field boundary, April-October imagery is used for binary prediction (grazed / not grazed). We train an ensemble of CNN-LSTM models on multi-temporal reflectance features, and achieve an average F1 score of 77 percent across five validation splits, with 90 percent recall on grazed pastures. Operationally, if inspectors can visit at most 4 percent of sites annually, prioritising fields predicted by our model as non-grazed yields 17.2 times more confirmed non-grazing sites than random inspection. These results indicate that coarse-resolution, freely available satellite data can reliably steer inspection resources for conservation-aligned land-use compliance. Code and models have been made publicly available.

[219] Vision Mamba for Permeability Prediction of Porous Media

Ali Kashefi, Tapan Mukerji

Main category: cs.CV

TL;DR: Vision Mamba is introduced as a backbone for 3D porous media permeability prediction, demonstrating advantages over ViTs and CNNs in computational efficiency and parameter requirements.

Details

Motivation: To leverage Vision Mamba's linear scaling with input resolution and smaller parameter count compared to ViTs (quadratic scaling) and CNNs for more efficient permeability prediction in 3D porous media.

Method: Used Vision Mamba as backbone network for permeability prediction, compared performance with ViT and CNN models, and conducted ablation studies to analyze component effects on accuracy.

Result: Demonstrated practical advantages of Vision Mamba over ViTs and CNNs in permeability prediction, showing improved computational and memory efficiency.

Conclusion: Vision Mamba has potential to replace ViTs in large vision models for permeability prediction and similar applications due to its efficiency advantages.

Abstract: Vision Mamba has recently received attention as an alternative to Vision Transformers (ViTs) for image classification. The network size of Vision Mamba scales linearly with input image resolution, whereas ViTs scale quadratically, a feature that improves computational and memory efficiency. Moreover, Vision Mamba requires a significantly smaller number of trainable parameters than traditional convolutional neural networks (CNNs), and thus, they can be more memory efficient. Because of these features, we introduce, for the first time, a neural network that uses Vision Mamba as its backbone for predicting the permeability of three-dimensional porous media. We compare the performance of Vision Mamba with ViT and CNN models across multiple aspects of permeability prediction and perform an ablation study to assess the effects of its components on accuracy. We demonstrate in practice the aforementioned advantages of Vision Mamba over ViTs and CNNs in the permeability prediction of three-dimensional porous media. We make the source code publicly available to facilitate reproducibility and to enable other researchers to build on and extend this work. We believe the proposed framework has the potential to be integrated into large vision models in which Vision Mamba is used instead of ViTs.

[220] Real-Time Surgical Instrument Defect Detection via Non-Destructive Testing

Qurrat Ul Ain, Atif Aftab Ahmed Jilani, Zunaira Shafqat, Nigar Azhar Butt

Main category: cs.CV

TL;DR: SurgScan is an AI-powered framework using YOLOv8 for real-time defect detection in surgical instruments, achieving 99.3% accuracy with fast inference speeds suitable for industrial deployment.

Details

Motivation: Manual inspection of surgical instruments is prone to human error and inconsistency, posing serious risks to sterility, mechanical integrity, and patient safety.

Method: Uses YOLOv8 model trained on 102,876 high-resolution images covering 11 instrument types and 5 major defect categories, with contrast-enhanced preprocessing to improve detection.

Result: Achieves 99.3% accuracy with real-time inference speeds of 4.2-5.8 ms per image, outperforming state-of-the-art CNN architectures.

Conclusion: SurgScan provides a scalable, cost-effective AI solution for automated quality control that reduces reliance on manual inspection while ensuring compliance with medical standards.

Abstract: Defective surgical instruments pose serious risks to sterility, mechanical integrity, and patient safety, increasing the likelihood of surgical complications. However, quality control in surgical instrument manufacturing often relies on manual inspection, which is prone to human error and inconsistency. This study introduces SurgScan, an AI-powered defect detection framework for surgical instruments. Using YOLOv8, SurgScan classifies defects in real-time, ensuring high accuracy and industrial scalability. The model is trained on a high-resolution dataset of 102,876 images, covering 11 instrument types and five major defect categories. Extensive evaluation against state-of-the-art CNN architectures confirms that SurgScan achieves the highest accuracy (99.3%) with real-time inference speeds of 4.2-5.8 ms per image, making it suitable for industrial deployment. Statistical analysis demonstrates that contrast-enhanced preprocessing significantly improves defect detection, addressing key limitations in visual inspection. SurgScan provides a scalable, cost-effective AI solution for automated quality control, reducing reliance on manual inspection while ensuring compliance with ISO 13485 and FDA standards, paving the way for enhanced defect detection in medical manufacturing.

[221] Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

Yunze Tong, Didi Zhu, Zijing Hu, Jinluan Yang, Ziyu Zhao

Main category: cs.CV

TL;DR: A noise projector is proposed to refine initial noise in Stable Diffusion by making it prompt-aware, improving text-image alignment without modifying the base model.

Details

Motivation: Address training-inference mismatch in Stable Diffusion where training uses prompt-specific noise but inference uses prompt-agnostic Gaussian noise, causing misalignment.

Method: Use a noise projector conditioned on prompt embedding to map noise to prompt-aware distribution, trained via VLM feedback distillation and quasi-direct preference optimization.

Result: Improves text-image alignment across diverse prompts with small inference cost, replacing multi-sample selection with single forward pass.

Conclusion: Prompt-aware noise projection effectively bridges training-inference gap in diffusion models, enhancing alignment without model modifications.

Abstract: In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this pattern could output diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of these steps: we first sample some noises and obtain token-level feedback for their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs small inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.

[222] PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

Main category: cs.CV

TL;DR: PaddleOCR-VL is a state-of-the-art, resource-efficient document parsing model that combines a dynamic resolution visual encoder with a language model to accurately recognize text, tables, formulas, and charts in 109 languages.

Details

Motivation: To develop an efficient document parsing solution that can handle complex elements across multiple languages while maintaining minimal resource consumption for practical deployment.

Method: Uses PaddleOCR-VL-0.9B, a compact vision-language model integrating NaViT-style dynamic resolution visual encoder with ERNIE-4.5-0.3B language model for accurate element recognition.

Result: Achieves SOTA performance in both page-level document parsing and element-level recognition, significantly outperforms existing solutions, shows strong competitiveness against top VLMs, and delivers fast inference speeds.

Conclusion: PaddleOCR-VL is highly suitable for practical deployment in real-world scenarios due to its SOTA performance, resource efficiency, and fast inference capabilities.

Abstract: In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

[223] Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology

Xinrui Huang, Fan Xiao, Dongming He, Anqi Gao, Dandan Li, Xiaofan Zhang, Shaoting Zhang, Xudong Wang

Main category: cs.CV

TL;DR: DentVFM is the first vision foundation model for dentistry that uses self-supervised learning on 1.6M multi-modal dental images, achieving superior generalization across various dental tasks without requiring extensive labeled data.

Details

Motivation: Address limitations in dental AI systems including single-modality focus, task-specific design, and reliance on costly labeled data that hinder generalization across diverse clinical scenarios.

Method: Developed DentVFM family of vision foundation models using Vision Transformer architecture with 2D/3D variants, trained via self-supervised learning on DentVista dataset (1.6M multi-modal radiographic images), and evaluated using DentBench comprehensive benchmark across 8 dental subspecialties.

Result: DentVFM demonstrates robust generalization to diverse dental tasks (disease diagnosis, treatment analysis, biomarker identification, landmark detection), significantly outperforms supervised/self-supervised/weakly supervised baselines, enables cross-modality diagnostics with better reliability than experienced dentists when conventional imaging is unavailable.

Conclusion: DentVFM sets a new paradigm for dental AI by offering scalable, adaptable, and label-efficient models to improve intelligent dental healthcare and address global oral healthcare gaps.

Abstract: Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability. Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.

[224] Acquisition of interpretable domain information during brain MR image harmonization for content-based image retrieval

Keima Abe, Hayato Muraki, Shuhei Tomoshige, Kenichi Oishi, Hitoshi Iyatomi

Main category: cs.CV

TL;DR: PL-SE-ADA is a domain harmonization framework for brain MR images that uses pseudo-linear-style encoding to disentangle domain-invariant and domain-specific features while maintaining interpretability.

Details

Motivation: Medical images suffer from domain shifts across imaging sites due to scanner and protocol differences, which degrade machine learning performance. Existing domain harmonization methods lack interpretability, which is essential for medical applications.

Method: Proposes PL-SE-ADA with two encoders (f_E and f_SE) to extract domain-invariant (z_u) and domain-specific (z_d) features, a decoder for reconstruction, and a domain predictor. Uses adversarial training between encoder and domain predictor, and reconstructs input image by summing reconstructions from z_u and z_d.

Result: Achieves equal or better performance than prior methods in image reconstruction, disease classification, and domain recognition. Enables visualization of both domain-independent brain features and domain-specific components.

Conclusion: PL-SE-ADA provides an interpretable domain harmonization framework that preserves disease-relevant information while offering high interpretability across the entire system.

Abstract: Medical images like MR scans often show domain shifts across imaging sites due to scanner and protocol differences, which degrade machine learning performance in tasks such as disease classification. Domain harmonization is thus a critical research focus. Recent approaches encode brain images $\boldsymbol{x}$ into a low-dimensional latent space $\boldsymbol{z}$, then disentangle it into $\boldsymbol{z_u}$ (domain-invariant) and $\boldsymbol{z_d}$ (domain-specific), achieving strong results. However, these methods often lack interpretability$-$an essential requirement in medical applications$-$leaving practical issues unresolved. We propose Pseudo-Linear-Style Encoder Adversarial Domain Adaptation (PL-SE-ADA), a general framework for domain harmonization and interpretable representation learning that preserves disease-relevant information in brain MR images. PL-SE-ADA includes two encoders $f_E$ and $f_{SE}$ to extract $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, a decoder to reconstruct the image $f_D$, and a domain predictor $g_D$. Beyond adversarial training between the encoder and domain predictor, the model learns to reconstruct the input image $\boldsymbol{x}$ by summing reconstructions from $\boldsymbol{z_u}$ and $\boldsymbol{z_d}$, ensuring both harmonization and informativeness. Compared to prior methods, PL-SE-ADA achieves equal or better performance in image reconstruction, disease classification, and domain recognition. It also enables visualization of both domain-independent brain features and domain-specific components, offering high interpretability across the entire framework.

[225] Exploring Image Representation with Decoupled Classical Visual Descriptors

Chenyuan Qu, Hao Chen, Jianbo Jiao

Main category: cs.CV

TL;DR: VisualSplit is a framework that decomposes images into classical visual descriptors (edge, color, intensity) and uses reconstruction-driven pre-training to learn interpretable representations that benefit modern learning tasks.

Details

Motivation: Bridge the gap between opaque deep learning representations and interpretable classical visual descriptors by exploring whether modern learning can benefit from classical visual cues.

Method: Explicitly decomposes images into decoupled classical descriptors, treating each as independent but complementary components, using reconstruction-driven pre-training to capture the essence of each descriptor while preserving interpretability.

Result: Enables effective attribute control in advanced visual tasks like image generation and editing, extending beyond conventional classification and segmentation.

Conclusion: The framework demonstrates the effectiveness of integrating classical visual descriptors into modern learning approaches for improved interpretability and control in visual understanding tasks.

Abstract: Exploring and understanding efficient image representations is a long-standing challenge in computer vision. While deep learning has achieved remarkable progress across image understanding tasks, its internal representations are often opaque, making it difficult to interpret how visual information is processed. In contrast, classical visual descriptors (e.g. edge, colour, and intensity distribution) have long been fundamental to image analysis and remain intuitively understandable to humans. Motivated by this gap, we ask a central question: Can modern learning benefit from these classical cues? In this paper, we answer it with VisualSplit, a framework that explicitly decomposes images into decoupled classical descriptors, treating each as an independent but complementary component of visual knowledge. Through a reconstruction-driven pre-training scheme, VisualSplit learns to capture the essence of each visual descriptor while preserving their interpretability. By explicitly decomposing visual attributes, our method inherently facilitates effective attribute control in various advanced visual tasks, including image generation and editing, extending beyond conventional classification and segmentation, suggesting the effectiveness of this new learning approach for visual understanding. Project page: https://chenyuanqu.com/VisualSplit/.

Ziqi Jiang, Yanghao Wang, Long Chen

Main category: cs.CV

TL;DR: FMA is a model-agnostic multi-step alignment approach that learns a cross-modal velocity field for more precise and robust feature alignment in challenging datasets where modalities are highly entangled.

Details

Motivation: Existing PEFT methods perform only one-step adjustment, which is insufficient for complex datasets where features from different modalities are highly entangled. Current methods can only slightly adjust either visual or textual features.

Method: Proposes Flow Matching Alignment (FMA) with three key components: 1) fixed coupling strategy for category correspondence, 2) noise augmentation to alleviate data scarcity, and 3) early-stopping solver for efficiency and accuracy.

Result: FMA consistently yields significant performance gains across various benchmarks and backbones, particularly on challenging datasets, demonstrating superior multi-step rectification ability compared to one-step PEFT methods.

Conclusion: Multi-step adjustment through cross-modal velocity field learning enables more precise and robust alignment than traditional one-step PEFT methods, especially for complex datasets with highly entangled multimodal features.

Abstract: Aligning features from different modalities, is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today’s PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.

[227] Consistent text-to-image generation via scene de-contextualization

Song Tang, Peihao Gong, Kunyu Li, Kai Guo, Boyu Wang, Mao Ye, Jianwei Zhang, Xiatian Zhu

Main category: cs.CV

TL;DR: This paper proposes Scene De-Contextualization (SDeC), a training-free prompt embedding editing method that suppresses scene-ID correlation in text-to-image models to reduce identity shift and improve consistent subject generation across diverse scenes.

Details

Motivation: Current text-to-image generation methods suffer from identity shift when generating the same subject across different scenes, and existing solutions require knowing all target scenes in advance, which is unrealistic for real-world applications.

Method: SDeC identifies and suppresses latent scene-ID correlation in ID prompt embeddings by quantifying SVD directional stability and adaptively re-weighting eigenvalues. It works as an inversion process of the model’s built-in scene contextualization.

Result: Experiments show SDeC significantly enhances identity preservation while maintaining scene diversity, without requiring prior access to all target scenes.

Conclusion: SDeC provides a flexible, efficient training-free solution for consistent text-to-image generation that works well in real-world scenarios where target scenes are unknown or vary over time.

Abstract: Consistent text-to-image (T2I) generation seeks to produce identity-preserving images of the same subject across diverse scenes, yet it often fails due to a phenomenon called identity (ID) shift. Previous methods have tackled this issue, but typically rely on the unrealistic assumption of knowing all target scenes in advance. This paper reveals that a key source of ID shift is the native correlation between subject and scene context, called scene contextualization, which arises naturally as T2I models fit the training distribution of vast natural images. We formally prove the near-universality of this scene-ID correlation and derive theoretical bounds on its strength. On this basis, we propose a novel, efficient, training-free prompt embedding editing approach, called Scene De-Contextualization (SDeC), that imposes an inversion process of T2I’s built-in scene contextualization. Specifically, it identifies and suppresses the latent scene-ID correlation within the ID prompt’s embedding by quantifying the SVD directional stability to adaptively re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene use (one scene per prompt) without requiring prior access to all target scenes. This makes it a highly flexible and general solution well-suited to real-world applications where such prior knowledge is often unavailable or varies over time. Experiments demonstrate that SDeC significantly enhances identity preservation while maintaining scene diversity.

[228] Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video

Yulin Zhang, Cheng Shi, Yang Wang, Sibei Yang

Main category: cs.CV

TL;DR: The paper introduces a proactive AI assistant that processes ego-streaming video to answer questions at opportune moments, featuring proactive coherence, just-in-time responsiveness, and synchronized efficiency. It proposes ESTP-Bench benchmark and a technical pipeline for training such models.

Details

Motivation: To develop AI capable of functioning in human-like settings by moving beyond observation to actively understand, anticipate, and proactively respond to unfolding events in real-time video streams.

Method: Proposes a comprehensive pipeline with: (1) a data engine, (2) multi-stage training strategy, and (3) proactive dynamic compression technique. Introduces ESTP-Bench benchmark and ESTP-F1 metric for evaluation.

Result: The proposed model effectively addresses the three key properties (proactive coherence, just-in-time responsiveness, synchronized efficiency) and outperforms multiple baselines across diverse online and offline benchmarks.

Conclusion: The framework successfully enables AI assistants to proactively answer evolving questions in ego-streaming video while maintaining synchronized perception and reasoning, advancing towards human-like AI capabilities.

Abstract: Envision an AI capable of functioning in human-like settings, moving beyond mere observation to actively understand, anticipate, and proactively respond to unfolding events. Towards this vision, we focus on the innovative task where, given ego-streaming video input, an assistant proactively answers diverse, evolving questions at the opportune moment, while maintaining synchronized perception and reasoning. This task embodies three key properties: (1) Proactive Coherence, (2) Just-in-Time Responsiveness, and (3) Synchronized Efficiency. To evaluate and address these properties, we first introduce ESTP-Bench (Ego Streaming Proactive Benchmark) alongside the ESTP-F1 metric-a novel framework designed for their rigorous assessment. Secondly, we propose a comprehensive technical pipeline to enable models to tackle this challenging task. This pipeline comprises: (1) a data engine, (2) a multi-stage training strategy, and (3) a proactive dynamic compression technique. Our proposed model effectively addresses these critical properties while outperforming multiple baselines across diverse online and offline benchmarks. Project Page:https://zhangyl4.github.io/publications/eyes-wide-open/

[229] BalanceGS: Algorithm-System Co-design for Efficient 3D Gaussian Splatting Training on GPU

Junyi Wu, Jiaming Xu, Jinhao Li, Yongkang Zhou, Jiayi Pan, Xingyang Li, Guohao Dai

Main category: cs.CV

TL;DR: BalanceGS is an algorithm-system co-design that optimizes 3D Gaussian Splatting training by addressing inefficiencies in density allocation, computation workload, and memory access, achieving 1.44× speedup with minimal quality loss.

Details

Motivation: Traditional 3DGS suffers from three critical inefficiencies: skewed density allocation during Gaussian densification, imbalanced computation workload during Gaussian projection, and fragmented memory access during color splatting.

Method: Three-level optimization: (1) Algorithm-level heuristic workload-sensitive Gaussian density control to balance point distributions; (2) System-level similarity-based Gaussian sampling and merging for adaptive workload distribution; (3) Mapping-level reordering-based memory access strategy for efficient RGB storage.

Result: Achieves 1.44× training speedup on NVIDIA A100 GPU with negligible quality degradation, while removing 80% redundant Gaussians in dense regions.

Conclusion: BalanceGS successfully addresses the key inefficiencies in 3DGS through algorithm-system co-design, significantly improving training efficiency while maintaining reconstruction quality.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising 3D reconstruction technique. The traditional 3DGS training pipeline follows three sequential steps: Gaussian densification, Gaussian projection, and color splatting. Despite its promising reconstruction quality, this conventional approach suffers from three critical inefficiencies: (1) Skewed density allocation during Gaussian densification, (2) Imbalanced computation workload during Gaussian projection and (3) Fragmented memory access during color splatting. To tackle the above challenges, we introduce BalanceGS, the algorithm-system co-design for efficient training in 3DGS. (1) At the algorithm level, we propose heuristic workload-sensitive Gaussian density control to automatically balance point distributions - removing 80% redundant Gaussians in dense regions while filling gaps in sparse areas. (2) At the system level, we propose Similarity-based Gaussian sampling and merging, which replaces the static one-to-one thread-pixel mapping with adaptive workload distribution - threads now dynamically process variable numbers of Gaussians based on local cluster density. (3) At the mapping level, we propose reordering-based memory access mapping strategy that restructures RGB storage and enables batch loading in shared memory. Extensive experiments demonstrate that compared with 3DGS, our approach achieves a 1.44$\times$ training speedup on a NVIDIA A100 GPU with negligible quality degradation.

[230] CALM-Net: Curvature-Aware LiDAR Point Cloud-based Multi-Branch Neural Network for Vehicle Re-Identification

Dongwook Lee, Sol Han, Jinwhan Kim

Main category: cs.CV

TL;DR: CALM-Net is a curvature-aware multi-branch neural network for vehicle re-identification using LiDAR point clouds, achieving 1.97% accuracy improvement on nuScenes dataset.

Details

Motivation: To address the challenge of learning discriminative and complementary features from 3D point clouds to distinguish between vehicles in re-identification tasks.

Method: Uses multi-branch architecture integrating edge convolution, point attention, and curvature embedding that characterizes local surface variation in point clouds.

Result: Achieves mean re-identification accuracy improvement of approximately 1.97% points compared with the strongest baseline on nuScenes dataset.

Conclusion: Demonstrates effectiveness of incorporating curvature information and multi-branch feature learning for LiDAR point cloud-based vehicle re-identification.

Abstract: This paper presents CALM-Net, a curvature-aware LiDAR point cloud-based multi-branch neural network for vehicle re-identification. The proposed model addresses the challenge of learning discriminative and complementary features from three-dimensional point clouds to distinguish between vehicles. CALM-Net employs a multi-branch architecture that integrates edge convolution, point attention, and a curvature embedding that characterizes local surface variation in point clouds. By combining these mechanisms, the model learns richer geometric and contextual features that are well suited for the re-identification task. Experimental evaluation on the large-scale nuScenes dataset demonstrates that CALM-Net achieves a mean re-identification accuracy improvement of approximately 1.97% points compared with the strongest baseline in our study. The results confirms the effectiveness of incorporating curvature information into deep learning architectures and highlight the benefit of multi-branch feature learning for LiDAR point cloud-based vehicle re-identification.

[231] Talking Points: Describing and Localizing Pixels

Matan Rusanovsky, Shimon Malnick, Shai Avidan

Main category: cs.CV

TL;DR: A novel framework for pixel-level keypoint grounding that uses natural language descriptions to precisely localize keypoints, overcoming limitations of object-level vision-language models.

Details

Motivation: Current vision-language models only achieve object-level or region-level grounding, lacking pixel-precise keypoint comprehension through natural language.

Method: Two-component framework: Point Descriptor generates contextual keypoint descriptions, and Point Localizer regresses pixel coordinates from descriptions. Trained on LlamaPointInPart dataset using GRPO optimization with frozen localizer as reward model.

Result: Superior performance compared to baseline models on LlamaPointInPart dataset, demonstrating effective pixel-level keypoint localization through natural language.

Conclusion: The bidirectional framework enables both keypoint-guided image understanding and language-guided precise localization, with potential for future applications.

Abstract: Vision-language models have achieved remarkable success in cross-modal understanding. Yet, these models remain limited to object-level or region-level grounding, lacking the capability for pixel-precise keypoint comprehension through natural language. We introduce a novel framework for pixel level grounding. The framework consists of two complementary components: a Point Descriptor that generates rich, contextual descriptions of individual keypoints, and a Point Localizer that regresses precise pixel coordinates from these descriptions. Unlike prior work that relies on templated prompts or keypoint names, our approach produces free-form, coarse-to-fine descriptions that situate keypoints within their visual context. Since there is no available dataset to train such a system, we introduce LlamaPointInPart, a carefully curated dataset of 20K+ image-keypoint-description triplets synthesized from multiple vision-language models, capturing multi-scale information from scene-level context to visual features around the keypoint. For cross-category generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the frozen Point Localizer as a reward model to produce descriptions that maximize localization accuracy. To evaluate our results we establish a new evaluation protocol. Instead of comparing the text description produced by our method to the ground truth, we use the localizer to determine how close is the predicted point generated to the ground truth point. Experiments demonstrate superior performance compared to baseline models on LlamaPointInPart.The bidirectional nature of our framework should enable future applications in both keypoint-guided image understanding and language-guided precise localization. Our code and dataset are publicly available at https://github.com/matanr/Talking_Points.

[232] STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding

Zhifei Chen, Tianshuo Xu, Leyi Wu, Luozhou Wang, Dongyu Yan, Zihan You, Wenting Luo, Guo Zhang, Yingcong Chen

Main category: cs.CV

TL;DR: STANCE is an image-to-video framework that improves motion coherence using Instance Cues for dense 2.5D motion guidance and Dense RoPE to preserve motion token salience, with joint RGB+auxiliary-map prediction.

Details

Motivation: Current video generation struggles with coherent object motion and interactions due to weak motion guidance from sparse 2D hints and optimization conflicts between appearance and temporal consistency.

Method: Uses Instance Cues to convert sparse user hints into dense 2.5D motion fields via instance flow averaging and depth augmentation. Implements Dense RoPE to preserve motion token salience in token space. Employs joint RGB+auxiliary-map prediction for structure anchoring.

Result: Improved temporal coherence and motion consistency without requiring per-frame trajectory scripts. Better handling of depth ambiguity compared to 2D arrow inputs.

Conclusion: STANCE effectively addresses motion guidance bottlenecks through dense motion representation and token-level preservation, enabling more coherent video generation with user-friendly controls.

Abstract: Video generation has recently made striking visual progress, but maintaining coherent object motion and interactions remains difficult. We trace two practical bottlenecks: (i) human-provided motion hints (e.g., small 2D maps) often collapse to too few effective tokens after encoding, weakening guidance; and (ii) optimizing for appearance and motion in a single head can favor texture over temporal consistency. We present STANCE, an image-to-video framework that addresses both issues with two simple components. First, we introduce Instance Cues – a pixel-aligned control signal that turns sparse, user-editable hints into a dense 2.5D (camera-relative) motion field by averaging per-instance flow and augmenting with monocular depth over the instance mask. This reduces depth ambiguity compared to 2D arrow inputs while remaining easy to use. Second, we preserve the salience of these cues in token space with Dense RoPE, which tags a small set of motion tokens (anchored on the first frame) with spatial-addressable rotary embeddings. Paired with joint RGB (+) auxiliary-map prediction (segmentation or depth), our model anchors structure while RGB handles appearance, stabilizing optimization and improving temporal coherence without requiring per-frame trajectory scripts.

[233] Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers

Hugo Markoff, Jevgenijs Galaktionovs

Main category: cs.CV

TL;DR: A hierarchical re-classification system that refines high-level taxonomic labels to species-level identification by combining SpeciesNet predictions with CLIP embeddings and metric learning, achieving 96.5% accuracy on re-classified detections.

Details

Motivation: Current animal classification models like SpeciesNet use conservative rollup strategies that result in many animals being labeled at high taxonomic levels rather than species level, limiting the specificity of identification.

Method: Five-stage pipeline: high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring. Combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning.

Result: Recovered 761 bird detections from “blank” and “animal” labels, re-classified 456 detections labeled animal, mammal, or blank with 96.5% accuracy, achieving species-level identification for 64.9% of cases.

Conclusion: The hierarchical re-classification system successfully refines high-level taxonomic labels to species-level identification with high accuracy, demonstrating the effectiveness of combining multiple approaches for improved animal classification.

Abstract: State-of-the-art animal classification models like SpeciesNet provide predictions across thousands of species but use conservative rollup strategies, resulting in many animals labeled at high taxonomic levels rather than species. We present a hierarchical re-classification system for the Animal Detect platform that combines SpeciesNet EfficientNetV2-M predictions with CLIP embeddings and metric learning to refine high-level taxonomic labels toward species-level identification. Our five-stage pipeline (high-confidence acceptance, bird override, centroid building, triplet-loss metric learning, and adaptive cosine-distance scoring) is evaluated on a segment of the LILA BC Desert Lion Conservation dataset (4,018 images, 15,031 detections). After recovering 761 bird detections from “blank” and “animal” labels, we re-classify 456 detections labeled animal, mammal, or blank with 96.5% accuracy, achieving species-level identification for 64.9 percent

[234] Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering

Hugo Markoff, Jevgenijs Galaktionovs

Main category: cs.CV

TL;DR: This paper evaluates zero-shot approaches for organizing unlabeled wildlife camera trap imagery using self-supervised vision transformers, achieving high accuracy with DINOv2+UMAP+GMM and deploying continuous similarity ordering for production use.

Details

Motivation: Camera traps generate millions of wildlife images, but many datasets contain species absent from existing classifiers, creating a need for zero-shot approaches to organize unlabeled imagery.

Method: Compared unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) with dimensionality reduction techniques (PCA, UMAP), and demonstrated continuous 1D similarity ordering via t-SNE projection.

Result: DINOv2 with UMAP and GMM achieved 88.6% accuracy (macro-F1 = 0.874), while 1D sorting reached 88.2% coherence for mammals/birds and 95.2% for fish across 1,500 images on a 5-species test set.

Conclusion: Continuous similarity ordering was deployed in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.

Abstract: Camera traps generate millions of wildlife images, yet many datasets contain species that are absent from existing classifiers. This work evaluates zero-shot approaches for organizing unlabeled wildlife imagery using self-supervised vision transformers, developed and tested within the Animal Detect platform for camera trap analysis. We compare unsupervised clustering methods (DBSCAN, GMM) across three architectures (CLIP, DINOv2, MegaDescriptor) combined with dimensionality reduction techniques (PCA, UMAP), and we demonstrate continuous 1D similarity ordering via t-SNE projection. On a 5-species test set with ground truth labels used only for evaluation, DINOv2 with UMAP and GMM achieves 88.6 percent accuracy (macro-F1 = 0.874), while 1D sorting reaches 88.2 percent coherence for mammals and birds and 95.2 percent for fish across 1,500 images. Based on these findings, we deployed continuous similarity ordering in production, enabling rapid exploratory analysis and accelerating manual annotation workflows for biodiversity monitoring.

[235] Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Yuyang Hong, Jiaqi Gu, Qi Yang, Lubin Fan, Yue Wu, Ying Wang, Kun Ding, Shiming Xiang, Jieping Ye

Main category: cs.CV

TL;DR: Wiki-PRF is a three-stage method for KB-VQA that improves multimodal knowledge retrieval through visual tool invocation, multimodal retrieval, and relevance filtering, achieving state-of-the-art performance.

Details

Motivation: Existing RAG methods struggle with multimodal query quality and relevance of retrieved knowledge in KB-VQA tasks.

Method: Three-stage approach: Processing (visual tool invocation), Retrieval (multimodal knowledge retrieval), and Filtering (relevance filtering). Uses VLM trained with reinforcement learning using answer accuracy and format consistency as rewards.

Result: Achieves significant improvements of 36.0 and 42.8 on E-VQA and InfoSeek datasets, setting new state-of-the-art performance.

Conclusion: Wiki-PRF effectively addresses KB-VQA challenges through its three-stage framework and reinforcement learning training, demonstrating superior performance in integrating visual understanding with external knowledge retrieval.

Abstract: Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by combining knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, including Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained with answer accuracy and format consistency as reward signals via a reinforcement learning manner. This enhances the model’s reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements~(36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF

[236] Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding

Ning Ding, Keisuke Fujii, Toru Tamaki

Main category: cs.CV

TL;DR: Shot2Tactic-Caption is a dual-branch framework for badminton video captioning that generates both shot-level action descriptions and tactic-level temporal execution narratives, using a novel shot-wise prompt-guided mechanism.

Details

Motivation: To address the need for understanding both individual actions and how tactics dynamically unfold over time in badminton, capturing tactical executions including interrupted and resumed sequences.

Method: Dual-branch design with visual encoder, spatio-temporal Transformer encoder, and Transformer decoder for both shot and tactic captioning. Includes Tactic Unit Detector and shot-wise prompt-guided mechanism using predicted tactic type and state as prompts via cross-attention.

Result: Framework effectively generates both shot and tactic captions. ResNet50-based spatio-temporal encoder performs best, and shot-wise prompt structuring produces more coherent and accurate tactic descriptions.

Conclusion: The proposed Shot2Tactic-Caption framework successfully addresses multi-scale video captioning in badminton, enabling comprehensive tactical understanding through shot-level action descriptions and tactic-level temporal narratives.

Abstract: Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose \textbf{Shot2Tactic-Caption}, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.

[237] Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference

Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao

Main category: cs.CV

TL;DR: EVS is a plug-and-play method that reduces video token redundancy by pruning temporally static patches, enabling faster inference and longer video sequences without architectural changes.

Details

Motivation: Vision-language models face scalability limitations due to quadratic processing costs of dense video frames, causing context limitations and latency issues with long videos.

Method: Efficient Video Sampling (EVS) identifies and prunes temporally static patches - spatial regions that remain unchanged across consecutive frames - while preserving positional identity.

Result: EVS reduces token count by up to 4x, maintains semantic fidelity, enables faster inference, and when combined with uptraining, yields models robust to varying compression levels with minimal accuracy loss.

Conclusion: EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.

Abstract: Vision-language models (VLMs) have recently expanded from static image understanding to video reasoning, but their scalability is fundamentally limited by the quadratic cost of processing dense frame sequences. Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches – spatial regions that remain unchanged across consecutive frames. EVS preserves positional identity, requires no architectural changes or retraining. We show that EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. Applied at inference time, EVS reduces large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal accuracy loss. When combined with an uptraining phase using stochastic pruning rates, EVS yields models that are robust to varying compression levels and retain full performance under aggressive pruning. Extensive experiments demonstrate that EVS consistently improves efficiency-accuracy trade-offs, unlocking scalable video-language understanding without sacrificing quality.

[238] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer

Main category: cs.CV

TL;DR: RepTok is a generative modeling framework that uses a single continuous latent token from self-supervised vision transformers for efficient image generation and reconstruction.

Details

Motivation: To create a more efficient generative modeling approach that reduces spatial redundancies in 2D latent spaces and significantly lowers training costs while maintaining competitive performance.

Method: Fine-tunes only the semantic token embedding from a pre-trained SSL encoder, pairs it with a generative decoder trained using flow matching objective, and adds cosine-similarity loss to preserve SSL space geometry.

Result: Achieves competitive results on class-conditional ImageNet generation and reaches competitive zero-shot performance on MS-COCO under extremely limited training budgets.

Conclusion: Fine-tuned SSL representations can serve as compact and effective latent spaces for efficient generative modeling.

Abstract: We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.

[239] SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation

Jihyun Yu, Yoojin Oh, Wonho Bae, Mingyu Kim, Junhyug Noh

Main category: cs.CV

TL;DR: SteeringTTA is a test-time adaptation method that uses Feynman-Kac steering to guide diffusion-based input adaptation for classification, achieving improved robustness on ImageNet-C without model updates or source data.

Details

Motivation: Existing diffusion-based TTA methods rely on gradient guidance, which limits exploration and generalization across different distortion types in test-time adaptation scenarios.

Method: Proposes SteeringTTA framework that adapts Feynman-Kac steering to guide diffusion-based input adaptation using rewards driven by pseudo-labels. Maintains multiple particle trajectories steered by cumulative top-K probabilities and entropy schedule to balance exploration and confidence.

Result: On ImageNet-C, SteeringTTA consistently outperforms the baseline without any model updates or source data.

Conclusion: SteeringTTA provides an effective inference-only framework for test-time adaptation that improves robustness to distribution shifts through guided diffusion-based input adaptation.

Abstract: Test-time adaptation (TTA) aims to correct performance degradation of deep models under distribution shifts by updating models or inputs using unlabeled test data. Input-only diffusion-based TTA methods improve robustness for classification to corruptions but rely on gradient guidance, limiting exploration and generalization across distortion types. We propose SteeringTTA, an inference-only framework that adapts Feynman-Kac steering to guide diffusion-based input adaptation for classification with rewards driven by pseudo-label. SteeringTTA maintains multiple particle trajectories, steered by a combination of cumulative top-K probabilities and an entropy schedule, to balance exploration and confidence. On ImageNet-C, SteeringTTA consistently outperforms the baseline without any model updates or source data.

[240] In-Context Learning with Unpaired Clips for Instruction-based Video Editing

Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, Guosheng Lin

Main category: cs.CV

TL;DR: A low-cost pretraining strategy for instruction-based video editing using in-context learning from unpaired video clips, achieving superior performance with minimal fine-tuning data.

Details

Motivation: Address the challenge of high cost and complexity in constructing large-scale paired video editing datasets for instruction-based video editing.

Method: Pretrain a foundation video generation model using in-context learning from ~1M unpaired video clips, then fine-tune with <150k curated editing pairs. Built on HunyuanVideoT2V framework.

Result: Achieves 12% improvement in instruction following and 15% improvement in editing quality compared to existing approaches, demonstrating superior instruction alignment and visual fidelity.

Conclusion: The proposed low-cost pretraining strategy effectively endows video generation models with general editing capabilities and can be efficiently refined with minimal paired data.

Abstract: Despite the rapid progress of instruction-based image editing, its extension to video remains underexplored, primarily due to the prohibitive cost and complexity of constructing large-scale paired video editing datasets. To address this challenge, we introduce a low-cost pretraining strategy for instruction-based video editing that leverages in-context learning from unpaired video clips. We show that pretraining a foundation video generation model with this strategy endows it with general editing capabilities, such as adding, replacing, or deleting operations, according to input editing instructions. The pretrained model can then be efficiently refined with a small amount of high-quality paired editing data. Built upon HunyuanVideoT2V, our framework first pretrains on approximately 1M real video clips to learn basic editing concepts, and subsequently fine-tunes on fewer than 150k curated editing pairs to extend more editing tasks and improve the editing quality. Comparative experiments show that our method surpasses existing instruction-based video editing approaches in both instruction alignment and visual fidelity, achieving a 12% improvement in editing instruction following and a 15% improvement in editing quality.

[241] Decorrelation Speeds Up Vision Transformers

Kieran Carrigg, Rob van Gastel, Melda Yeghaian, Sander Dalm, Faysal Boughorbel, Marcel van Gerven

Main category: cs.CV

TL;DR: DBP-MAE integrates Decorrelated Backpropagation into MAE pre-training to accelerate convergence, reducing training time by 21.1% and carbon emissions by 21.4% while improving downstream segmentation performance by 1.1 mIoU points.

Details

Motivation: MAE pre-training of vision transformers provides strong performance but has high computational costs that make it impractical for time- and resource-constrained industrial settings.

Method: Integrate Decorrelated Backpropagation (DBP) into MAE pre-training, applying it selectively to the encoder to iteratively reduce input correlations at each layer and accelerate convergence.

Result: DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points on ImageNet-1K pre-training with ADE20K fine-tuning. Similar gains observed on proprietary industrial data.

Conclusion: DBP can effectively reduce training time and energy use while improving downstream performance for large-scale ViT pre-training, making it applicable in real-world industrial scenarios.

Abstract: Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4% and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method’s applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.

[242] EuroMineNet: A Multitemporal Sentinel-2 Benchmark for Spatiotemporal Mining Footprint Analysis in the European Union (2015-2024)

Weikang Yu, Vincent Nwazelibe, Xianping Ma, Xiaokang Zhang, Richard Gloaguen, Xiao Xiang Zhu, Pedram Ghamisi

Main category: cs.CV

TL;DR: EuroMineNet is the first comprehensive multitemporal benchmark for mining footprint mapping using Sentinel-2 imagery, covering 133 EU mining sites from 2015-2024 with expert-verified annotations to support GeoAI-based environmental monitoring.

Details

Motivation: Mining activities cause significant environmental degradation but existing datasets lack temporal depth and geographic scope for consistent long-term monitoring of mining-induced land surface changes.

Method: Created EuroMineNet dataset with annual Sentinel-2 observations from 2015-2024 across 133 EU mining sites, supporting two tasks: multitemporal mining footprint mapping (evaluated with CA-TIoU metric) and cross-temporal change detection for gradual/abrupt transformations.

Result: Benchmarked 20 state-of-the-art deep learning models, showing GeoAI methods effectively identify long-term environmental changes but struggle with detecting short-term dynamics needed for timely mitigation.

Conclusion: EuroMineNet advances temporally consistent and explainable mining monitoring to support sustainable land-use management and environmental resilience through open science principles.

Abstract: Mining activities are essential for industrial and economic development, but remain a leading source of environmental degradation, contributing to deforestation, soil erosion, and water contamination. Sustainable resource management and environmental governance require consistent, long-term monitoring of mining-induced land surface changes, yet existing datasets are often limited in temporal depth or geographic scope. To address this gap, we present EuroMineNet, the first comprehensive multitemporal benchmark for mining footprint mapping and monitoring based on Sentinel-2 multispectral imagery. Spanning 133 mining sites across the European Union, EuroMineNet provides annual observations and expert-verified annotations from 2015 to 2024, enabling GeoAI-based models to analyze environmental dynamics at a continental scale. It supports two sustainability-driven tasks: (1) multitemporal mining footprint mapping for consistent annual land-use delineation, evaluated with a novel Change-Aware Temporal IoU (CA-TIoU) metric, and (2) cross-temporal change detection to capture both gradual and abrupt surface transformations. Benchmarking 20 state-of-the-art deep learning models reveals that while GeoAI methods effectively identify long-term environmental changes, challenges remain in detecting short-term dynamics critical for timely mitigation. By advancing temporally consistent and explainable mining monitoring, EuroMineNet contributes to sustainable land-use management, environmental resilience, and the broader goal of applying GeoAI for social and environmental good. We release the codes and datasets by aligning with FAIR and the open science paradigm at https://github.com/EricYu97/EuroMineNet.

[243] WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging

Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Sami Azam, Asif Karim, Jemima Beissbarth, Amanda Leach

Main category: cs.CV

TL;DR: WeCKD introduces a chain-based knowledge distillation framework where models learn progressively from predecessors, enabling effective learning with minimal supervision and reduced data dependency.

Details

Motivation: Traditional KD suffers from knowledge degradation, inefficient supervision, and reliance on strong teachers or large datasets, limiting effectiveness in real-world limited-data scenarios.

Method: Weakly-supervised Chain-based KD network with progressive distillation chain where each model learns from its predecessor and refines knowledge before passing forward, using only fraction of dataset.

Result: Achieves cumulative accuracy gains up to +23% over single backbone, matches/surpasses supervised methods on otoscopic imaging datasets, and generalizes to other medical imaging modalities.

Conclusion: WeCKD enables effective knowledge transfer with minimal supervision, reduces data dependency, and shows strong potential for real-world medical imaging applications.

Abstract: Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets, which limits their effectiveness in real-world, limited-data scenarios. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning, reduces data dependency, and mitigates the limitations of one-step KD. Each model in the distillation chain is trained on only a fraction of the dataset and demonstrates that effective learning can be achieved with minimal supervision. Extensive evaluations across four otoscopic imaging datasets demonstrate that it not only matches but in many cases surpasses the performance of existing supervised methods. Experimental results on two other datasets further underscore its generalization across diverse medical imaging modalities, including microscopic and magnetic resonance imaging. Furthermore, our evaluations resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.

[244] VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias, Jiankang Deng, Hang Xu, Chao Ma

Main category: cs.CV

TL;DR: VTimeCoT is a training-free framework that enhances video temporal grounding and reasoning in MLLMs using progress bar tools and visuotemporal chain-of-thought reasoning.

Details

Motivation: Current multimodal large language models lack strong video temporal grounding and reasoning capabilities, limiting their effectiveness in real-world video understanding systems.

Method: Proposes two visual tools: plug-and-play progress bar integration and high-efficiency highlighting tool, plus visuotemporal chain-of-thought reasoning that combines video and text cross-modality reasoning.

Result: Significant performance improvements on Qwen2VL-7B and GPT4o baselines for video temporal grounding and reasoning-based question answering.

Conclusion: The framework achieves compositional and interpretable reasoning processes for enhanced video understanding.

Abstract: In recent years, video question answering based on multimodal large language models (MLLM) has garnered considerable attention, due to the benefits from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools of the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. Project page: https://vtimecot.github.io

[245] Leveraging Learned Image Prior for 3D Gaussian Compression

Seungjoo Shin, Jaesik Park, Sunghyun Cho

Main category: cs.CV

TL;DR: A novel 3D Gaussian Splatting compression framework that uses learned image priors to restore quality degradation from compression, achieving superior rate-distortion performance with less storage.

Details

Motivation: Existing 3DGS compression methods lack learned priors, limiting further advances in rate-distortion trade-off despite considerable storage reduction.

Method: Built upon compressed Gaussians, uses a restoration network to model compression artifacts in image space, incorporates coarse rendering residuals as side information, and refines compressed Gaussians through restored image supervision.

Result: Extensive experiments show superior rate-distortion performance and better rendering quality than state-of-the-art 3DGS compression methods with substantially less storage.

Conclusion: The framework effectively leverages learned image priors to enhance 3DGS compression, is compatible with existing methods, and achieves compact representations with improved rendering performance.

Abstract: Compression techniques for 3D Gaussian Splatting (3DGS) have recently achieved considerable success in minimizing storage overhead for 3D Gaussians while preserving high rendering quality. Despite the impressive storage reduction, the lack of learned priors restricts further advances in the rate-distortion trade-off for 3DGS compression tasks. To address this, we introduce a novel 3DGS compression framework that leverages the powerful representational capacity of learned image priors to recover compression-induced quality degradation. Built upon initially compressed Gaussians, our restoration network effectively models the compression artifacts in the image space between degraded and original Gaussians. To enhance the rate-distortion performance, we provide coarse rendering residuals into the restoration network as side information. By leveraging the supervision of restored images, the compressed Gaussians are refined, resulting in a highly compact representation with enhanced rendering performance. Our framework is designed to be compatible with existing Gaussian compression methods, making it broadly applicable across different baselines. Extensive experiments validate the effectiveness of our framework, demonstrating superior rate-distortion performance and outperforming the rendering quality of state-of-the-art 3DGS compression methods while requiring substantially less storage.

[246] Where are the Whales: A Human-in-the-loop Detection Method for Identifying Whales in High-resolution Satellite Imagery

Caleb Robinson, Kimberly T. Goetz, Christin B. Khan, Meredith Sackett, Kathleen Leonard, Rahul Dodhia, Juan M. Lavista Ferres

Main category: cs.CV

TL;DR: A semi-automated whale detection system using statistical anomaly detection to identify spatial outliers in satellite imagery, reducing expert inspection area by up to 99.8% while maintaining high recall rates.

Details

Motivation: Traditional whale monitoring methods are expensive and difficult to scale, and automated detection faces challenges due to lack of annotated data, image quality variability, and high computational costs.

Method: Uses statistical anomaly detection to flag spatial outliers (‘interesting points’) in VHR satellite imagery, paired with a web-based labeling interface for expert annotation.

Result: Achieved recalls of 90.3% to 96.4% on benchmark scenes, reducing expert inspection area from over 1,000 sq km to less than 2 sq km (up to 99.8% reduction).

Conclusion: The approach provides a scalable first step for machine-assisted marine mammal monitoring without requiring labeled training data, with the pipeline open-sourced for community use.

Abstract: Effective monitoring of whale populations is critical for conservation, but traditional survey methods are expensive and difficult to scale. While prior work has shown that whales can be identified in very high-resolution (VHR) satellite imagery, large-scale automated detection remains challenging due to a lack of annotated imagery, variability in image quality and environmental conditions, and the cost of building robust machine learning pipelines over massive remote sensing archives. We present a semi-automated approach for surfacing possible whale detections in VHR imagery using a statistical anomaly detection method that flags spatial outliers, i.e. “interesting points”. We pair this detector with a web-based labeling interface designed to enable experts to quickly annotate the interesting points. We evaluate our system on three benchmark scenes with known whale annotations and achieve recalls of 90.3% to 96.4%, while reducing the area requiring expert inspection by up to 99.8% – from over 1,000 sq km to less than 2 sq km in some cases. Our method does not rely on labeled training data and offers a scalable first step toward future machine-assisted marine mammal monitoring from space. We have open sourced this pipeline at https://github.com/microsoft/whales.

[247] Cross-Layer Feature Self-Attention Module for Multi-Scale Object Detection

Dingzhou Xie, Rushi Lan, Cheng Pang, Enhao Ning, Jiahao Zeng, Wei Zheng

Main category: cs.CV

TL;DR: Proposes Cross-Layer Feature Self-Attention Module (CFSAM) that models multi-scale feature dependencies for object detection, achieving significant performance gains over baselines.

Details

Motivation: Existing attention mechanisms are limited to single or dual-layer features, missing rich inter-layer dependencies across multi-scale representations needed for detecting objects with large scale variations.

Method: CFSAM consists of three components: convolutional local feature extractor, Transformer-based global modeling unit for cross-layer interactions, and feature fusion mechanism to enhance original representations. Integrated into SSD300 framework.

Result: Achieved 78.6% mAP on PASCAL VOC (vs. 75.5% baseline) and 52.1% mAP on COCO (vs. 43.1% baseline), outperforming existing attention modules. Also accelerates convergence without substantial computational overhead.

Conclusion: Highlights the importance of explicit cross-layer attention modeling for advancing multi-scale object detection.

Abstract: Recent object detection methods have made remarkable progress by leveraging attention mechanisms to improve feature discriminability. However, most existing approaches are confined to refining single-layer or fusing dual-layer features, overlooking the rich inter-layer dependencies across multi-scale representations. This limits their ability to capture comprehensive contextual information essential for detecting objects with large scale variations. In this paper, we propose a novel Cross-Layer Feature Self-Attention Module (CFSAM), which holistically models both local and global dependencies within multi-scale feature maps. CFSAM consists of three key components: a convolutional local feature extractor, a Transformer-based global modeling unit that efficiently captures cross-layer interactions, and a feature fusion mechanism to restore and enhance the original representations. When integrated into the SSD300 framework, CFSAM significantly boosts detection performance, achieving 78.6% mAP on PASCAL VOC (vs. 75.5% baseline) and 52.1% mAP on COCO (vs. 43.1% baseline), outperforming existing attention modules. Moreover, the module accelerates convergence during training without introducing substantial computational overhead. Our work highlights the importance of explicit cross-layer attention modeling in advancing multi-scale object detection.

[248] Free-Grained Hierarchical Recognition

Seulki Park, Zilin Wang, Stella X. Yu

Main category: cs.CV

TL;DR: ImageNet-F benchmark for hierarchical image classification with mixed-granularity supervision, using CLIP to simulate realistic annotations and proposing free-grain learning methods.

Details

Motivation: Real-world image annotations vary in granularity due to factors like image quality and annotator expertise, but existing methods assume complete fine-grained labels.

Method: Created ImageNet-F benchmark with basic/subordinate/fine-grained levels, used CLIP for semantic ambiguity simulation, proposed free-grain learning with pseudo-attributes from vision-language models and semi-supervised learning.

Result: Developed methods and baselines that substantially improve performance under mixed supervision conditions.

Conclusion: The benchmark and proposed methods advance hierarchical classification under real-world constraints with varying annotation granularity.

Abstract: Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled Bird, while a close-up reveals Bald eagle. We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels reflecting human annotation behavior. We propose free-grain learning, with heterogeneous supervision across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. These, along with strong baselines, substantially improve performance under mixed supervision. Together, our benchmark and methods advance hierarchical classification under real-world constraints.

[249] DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models

Simone Carnemolla, Matteo Pennisi, Sarinda Samarasinghe, Giovanni Bellitto, Simone Palazzo, Daniela Giordano, Mubarak Shah, Concetto Spampinato

Main category: cs.CV

TL;DR: DEXTER is a data-free framework that uses diffusion models and LLMs to generate global, textual explanations of visual classifiers by optimizing text prompts to create class-conditional images that activate the classifier, then producing natural language reports about decision patterns and biases.

Details

Motivation: To build transparent and trustworthy AI systems by understanding and explaining machine learning model behavior, particularly for visual classifiers, without needing access to training data or ground-truth labels.

Method: Uses diffusion models and large language models to optimize text prompts that synthesize class-conditional images strongly activating the target classifier, then generates natural language reports from these synthetic samples to describe decision patterns and biases.

Result: Outperforms existing approaches in global model explanation and class-level bias reporting across ImageNet, Waterbirds, CelebA, and FairFaces datasets, producing accurate and interpretable outputs as confirmed by quantitative/qualitative evaluations and user study.

Conclusion: DEXTER provides an effective data-free framework for generating natural language explanations of visual classifiers, demonstrating flexibility across multiple tasks and superior performance in uncovering internal mechanisms and biases compared to prior methods.

Abstract: Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanation about a classifier’s decision process without access to training data or ground-truth labels. We demonstrate DEXTER’s flexibility across three tasks-activation maximization, slice discovery and debiasing, and bias explanation-each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting. Code is available at https://github.com/perceivelab/dexter.

[250] LightQANet: Quantized and Adaptive Feature Learning for Low-Light Image Enhancement

Xu Wu, Zhihui Lai, Xianxu Hou, Jie Zhou, Ya-nan Zhang, Linlin Shen

Main category: cs.CV

TL;DR: LightQANet introduces quantized and adaptive feature learning for low-light image enhancement, using a Light Quantization Module for structured light factor extraction and a Light-Aware Prompt Module for dynamic adaptation to varying lighting conditions.

Details

Motivation: Existing low-light image enhancement methods fail to extract reliable feature representations due to severely degraded pixel-level information, resulting in poor texture restoration, color inconsistency, and artifacts.

Method: Proposes LightQANet with two key components: Light Quantization Module (LQM) for explicit illumination factor quantification and structured learning, and Light-Aware Prompt Module (LAPM) that encodes illumination priors into learnable prompts for dynamic adaptation.

Result: Extensive experiments on multiple low-light datasets show state-of-the-art performance with superior qualitative and quantitative results across various challenging lighting scenarios.

Conclusion: The proposed quantized and adaptive feature learning approach effectively addresses low-light enhancement challenges, achieving consistent and robust image quality across diverse lighting conditions.

Abstract: Low-light image enhancement (LLIE) aims to improve illumination while preserving high-quality color and texture. However, existing methods often fail to extract reliable feature representations due to severely degraded pixel-level information under low-light conditions, resulting in poor texture restoration, color inconsistency, and artifact. To address these challenges, we propose LightQANet, a novel framework that introduces quantized and adaptive feature learning for low-light enhancement, aiming to achieve consistent and robust image quality across diverse lighting conditions. From the static modeling perspective, we design a Light Quantization Module (LQM) to explicitly extract and quantify illumination-related factors from image features. By enforcing structured light factor learning, LQM enhances the extraction of light-invariant representations and mitigates feature inconsistency across varying illumination levels. From the dynamic adaptation perspective, we introduce a Light-Aware Prompt Module (LAPM), which encodes illumination priors into learnable prompts to dynamically guide the feature learning process. LAPM enables the model to flexibly adapt to complex and continuously changing lighting conditions, further improving image enhancement. Extensive experiments on multiple low-light datasets demonstrate that our method achieves state-of-the-art performance, delivering superior qualitative and quantitative results across various challenging lighting scenarios.

[251] Inpainting the Red Planet: Diffusion Models for the Reconstruction of Martian Environments in Virtual Reality

Giuseppe Lorenzo Catalano, Agata Marta Soccini

Main category: cs.CV

TL;DR: Proposes an unconditional diffusion model for reconstructing Martian surface terrain from incomplete heightmaps, outperforming traditional interpolation methods in accuracy and perceptual quality.

Details

Motivation: Mars terrain datasets often contain missing values from satellite imagery, and current interpolation methods fail to preserve geometric coherence. Deep learning can help but Earth's conditional methods don't work on Mars due to limited data.

Method: Uses unconditional diffusion model trained on 12000 Martian heightmaps from NASA HiRISE survey, with non-homogeneous rescaling to capture multi-scale terrain features before resizing to 128x128 resolution.

Result: Outperforms Inverse Distance Weighting, kriging, and Navier-Stokes algorithm by 4-15% on RMSE and 29-81% on LPIPS similarity metrics on 1000 test samples.

Conclusion: The diffusion-based approach provides more accurate and perceptually similar Martian terrain reconstruction compared to traditional void-filling techniques.

Abstract: Space exploration increasingly relies on Virtual Reality for several tasks, such as mission planning, multidisciplinary scientific analysis, and astronaut training. A key factor for the reliability of the simulations is having accurate 3D representations of planetary terrains. Extraterrestrial heightmaps derived from satellite imagery often contain missing values due to acquisition and transmission constraints. Mars is among the most studied planets beyond Earth, and its extensive terrain datasets make the Martian surface reconstruction a valuable task, although many areas remain unmapped. Deep learning algorithms can support void-filling tasks; however, whereas Earth’s comprehensive datasets enables the use of conditional methods, such approaches cannot be applied to Mars. Current approaches rely on simpler interpolation techniques which, however, often fail to preserve geometric coherence. In this work, we propose a method for reconstructing the surface of Mars based on an unconditional diffusion model. Training was conducted on an augmented dataset of 12000 Martian heightmaps derived from NASA’s HiRISE survey. A non-homogeneous rescaling strategy captures terrain features across multiple scales before resizing to a fixed 128x128 model resolution. We compared our method against established void-filling and inpainting techniques, including Inverse Distance Weighting, kriging, and Navier-Stokes algorithm, on an evaluation set of 1000 samples. Results show that our approach consistently outperforms these methods in terms of reconstruction accuracy (4-15% on RMSE) and perceptual similarity (29-81% on LPIPS) with the original data.

[252] MoCom: Motion-based Inter-MAV Visual Communication Using Event Vision and Spiking Neural Networks

Zhang Nengbo, Hann Woei Ho, Ye Zhou

Main category: cs.CV

TL;DR: A visual communication framework for MAV swarms using motion-based signaling inspired by honeybee waggle dance, with event cameras and SNN for decoding, achieving accurate communication with low power consumption.

Details

Motivation: Conventional radio-based communication in MAV swarms suffers from spectrum congestion, jamming, and high power consumption, requiring alternative communication methods for constrained environments.

Method: MAVs convey information through deliberate flight patterns using four motion primitives. Event cameras capture these patterns, and an event frame-based segmentation model with lightweight SNN decodes the signals using a predefined visual codebook.

Result: Experimental results show accurate decoding of MAV motion sequences and low power consumption, validating the framework’s effectiveness.

Conclusion: The proposed visual communication framework serves as an energy-efficient alternative for MAV communication in constrained environments, demonstrating reliable performance without radio-based methods.

Abstract: Reliable communication in Micro Air Vehicle (MAV) swarms is challenging in environments, where conventional radio-based methods suffer from spectrum congestion, jamming, and high power consumption. Inspired by the waggle dance of honeybees, which efficiently communicate the location of food sources without sound or contact, we propose a novel visual communication framework for MAV swarms using motion-based signaling. In this framework, MAVs convey information, such as heading and distance, through deliberate flight patterns, which are passively captured by event cameras and interpreted using a predefined visual codebook of four motion primitives: vertical (up/down), horizontal (left/right), left-to-up-to-right, and left-to-down-to-right, representing control symbols (start'', end’’, 1'', 0’’). To decode these signals, we design an event frame-based segmentation model and a lightweight Spiking Neural Network (SNN) for action recognition. An integrated decoding algorithm then combines segmentation and classification to robustly interpret MAV motion sequences. Experimental results validate the framework’s effectiveness, which demonstrates accurate decoding and low power consumption, and highlights its potential as an energy-efficient alternative for MAV communication in constrained environments.

[253] CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection

Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim

Main category: cs.CV

TL;DR: CoT-PL introduces chain-of-thought reasoning for open-vocabulary object detection, decomposing object understanding into three interpretable steps to improve pseudo-labeling in crowded/occluded scenes, achieving significant performance gains.

Details

Motivation: Current OVD methods rely on direct image-text matching which neglects intermediate reasoning steps, leading to limited robustness in crowded or occluded scenes where semantic complexity is high.

Method: Uses structured visual chain-of-thought reasoning with three steps: region perception, category recognition via zero-shot reasoning, and background grounding. Includes contrastive background learning (CBL) that uses background cues as negatives to disentangle object features.

Result: Achieves 103.4% and 168.4% relative improvements in novel-class pseudo-label quality for crowded and occluded scenes respectively. Sets new SOTA with +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes.

Conclusion: Chain-of-thought reasoning significantly enhances pseudo-labeling robustness in complex visual contexts, demonstrating that structured reasoning steps are crucial for open-vocabulary object detection in challenging scenarios.

Abstract: Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that employs structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.

[254] Morphology-Aware Prognostic model for Five-Year Survival Prediction in Colorectal Cancer from H&E Whole Slide Images

Usama Sajjad, Abdul Rehman Akbar, Ziyu Su, Deborah Knight, Wendy L. Frankel, Metin N. Gurcan, Wei Chen, Muhammad Khalid Khan Niazi

Main category: cs.CV

TL;DR: PRISM is an interpretable AI model for colorectal cancer prognosis that captures continuous morphological variability, achieving superior 5-year overall survival prediction (AUC=0.70) and outperforming existing methods by 15-23% accuracy.

Details

Motivation: Current foundation models in computational pathology overlook organ-specific morphological patterns that influence tumor behavior and patient outcomes. There's a need for models that capture the incremental evolutionary processes of cancer rather than abrupt phenotypic shifts.

Method: Developed PRISM model trained on 8.74 million histological images from 424 stage III CRC patients. The model incorporates continuous variability spectrum within distinct morphologies to characterize phenotypic diversity.

Result: PRISM achieved superior prognostic performance for five-year OS (AUC=0.70±0.04, accuracy=68.37%±4.75%, HR=3.34). Outperformed existing CRC-specific methods by 15% and AI foundation models by ~23% accuracy. Showed sex-agnostic robustness and stable performance across subgroups.

Conclusion: PRISM provides an interpretable AI approach that captures continuous morphological evolution in colorectal cancer, offering improved prognostic accuracy and clinical utility for personalized treatment planning.

Abstract: Colorectal cancer (CRC) remains the third most prevalent malignancy globally, with approximately 154,000 new cases and 54,000 projected deaths anticipated for 2025. The recent advancement of foundation models in computational pathology has been largely propelled by task agnostic methodologies that can overlook organ-specific crucial morphological patterns that represent distinct biological processes that can fundamentally influence tumor behavior, therapeutic response, and patient outcomes. The aim of this study is to develop a novel, interpretable AI model, PRISM (Prognostic Representation of Integrated Spatial Morphology), that incorporates a continuous variability spectrum within each distinct morphology to characterize phenotypic diversity and reflecting the principle that malignant transformation occurs through incremental evolutionary processes rather than abrupt phenotypic shifts. PRISM is trained on 8.74 million histological images extracted from surgical resection specimens of 424 patients with stage III CRC. PRISM achieved superior prognostic performance for five-year OS (AUC = 0.70 +- 0.04; accuracy = 68.37% +- 4.75%; HR = 3.34, 95% CI = 2.28-4.90; p < 0.0001), outperforming existing CRC-specific methods by 15% and AI foundation models by ~23% accuracy. It showed sex-agnostic robustness (AUC delta = 0.02; accuracy delta = 0.15%) and stable performance across clinicopathological subgroups, with minimal accuracy fluctuation (delta = 1.44%) between 5FU/LV and CPT-11/5FU/LV regimens, replicating the Alliance cohort finding of no survival difference between treatments.

[255] Scaling Artificial Intelligence for Multi-Tumor Early Detection with More Reports, Fewer Masks

Pedro R. A. S. Bassi, Xinze Zhou, Wenxuan Li, Szymon Płotka, Jieneng Chen, Qi Chen, Zheren Zhu, Jakub Prządo, Ibrahim E. Hamacı, Sezgin Er, Yuhan Wang, Ashwin Kumar, Bjoern Menze, Jarosław B. Ćwikła, Yuyin Zhou, Akshay S. Chaudhari, Curtis P. Langlotz, Sergio Decherchi, Andrea Cavalli, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou

Main category: cs.CV

TL;DR: R-Super trains AI models to segment tumors using medical reports instead of requiring expensive manual tumor masks, achieving comparable performance while enabling detection of previously unsupported tumor types.

Details

Motivation: Manual tumor mask creation for AI training is costly and time-consuming, while medical reports containing tumor descriptions are abundant and underutilized in clinical practice.

Method: R-Super trains AI models to segment tumors by matching their descriptions in medical reports, using a dataset of 101,654 reports to reduce dependency on manual tumor masks.

Result: Models trained with R-Super achieved performance comparable to those trained on 723 manual masks, with combined approach improving sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting 5 of 7 tumor types.

Conclusion: R-Super challenges the necessity of labor-intensive tumor mask creation and establishes a scalable path for early tumor detection across diverse tumor types, including previously unsupported organs.

Abstract: Early tumor detection save lives. Each year, more than 300 million computed tomography (CT) scans are performed worldwide, offering a vast opportunity for effective cancer screening. However, detecting small or early-stage tumors on these CT scans remains challenging, even for experts. Artificial intelligence (AI) models can assist by highlighting suspicious regions, but training such models typically requires extensive tumor masks–detailed, voxel-wise outlines of tumors manually drawn by radiologists. Drawing these masks is costly, requiring years of effort and millions of dollars. In contrast, nearly every CT scan in clinical practice is already accompanied by medical reports describing the tumor’s size, number, appearance, and sometimes, pathology results–information that is rich, abundant, and often underutilized for AI training. We introduce R-Super, which trains AI to segment tumors that match their descriptions in medical reports. This approach scales AI training with large collections of readily available medical reports, substantially reducing the need for manually drawn tumor masks. When trained on 101,654 reports, AI models achieved performance comparable to those trained on 723 masks. Combining reports and masks further improved sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting five of the seven tumor types. Notably, R-Super enabled segmentation of tumors in the spleen, gallbladder, prostate, bladder, uterus, and esophagus, for which no public masks or AI models previously existed. This study challenges the long-held belief that large-scale, labor-intensive tumor mask creation is indispensable, establishing a scalable and accessible path toward early detection across diverse tumor types. We plan to release our trained models, code, and dataset at https://github.com/MrGiovanni/R-Super

[256] Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning

Ji Cao, Yu Wang, Tongya Zheng, Zujie Ren, Canghong Jin, Gang Chen, Mingli Song

Main category: cs.CV

TL;DR: PRTraj is a novel trajectory representation learning framework that integrates environment perception and route choice modeling to create more effective trajectory embeddings for downstream tasks.

Details

Motivation: Existing trajectory representation learning methods treat trajectories as isolated sequences without considering external environmental factors and internal route choice behaviors that shape trajectory formation.

Method: PRTraj uses an Environment Perception Module to capture multi-granularity environmental semantics from POI distributions, and a Route Choice Encoder that models road segment transitions as sequential decisions to capture route choice behavior.

Result: Extensive experiments on 3 real-world datasets across 5 downstream tasks show PRTraj’s effectiveness, generalizability, and strong data efficiency with robust performance in few-shot scenarios.

Conclusion: PRTraj successfully bridges the gap in trajectory representation learning by unifying environment perception and route choice modeling, demonstrating superior performance across multiple applications.

Abstract: Trajectory Representation Learning (TRL) aims to encode raw trajectories into low-dimensional vectors, which can then be leveraged in various downstream tasks, including travel time estimation, location prediction, and trajectory similarity analysis. However, existing TRL methods suffer from a key oversight: treating trajectories as isolated spatio-temporal sequences, without considering the external environment and internal route choice behavior that govern their formation. To bridge this gap, we propose a novel framework that unifies comprehensive environment \textbf{P}erception and explicit \textbf{R}oute choice modeling for effective \textbf{Traj}ectory representation learning, dubbed \textbf{PRTraj}. Specifically, PRTraj first introduces an Environment Perception Module to enhance the road network by capturing multi-granularity environmental semantics from surrounding POI distributions. Building on this environment-aware backbone, a Route Choice Encoder then captures the route choice behavior inherent in each trajectory by modeling its constituent road segment transitions as a sequence of decisions. These route-choice-aware representations are finally aggregated to form the global trajectory embedding. Extensive experiments on 3 real-world datasets across 5 downstream tasks validate the effectiveness and generalizability of PRTraj. Moreover, PRTraj demonstrates strong data efficiency, maintaining robust performance under few-shot scenarios. Our code is available at: https://anonymous.4open.science/r/PRTraj.

[257] FraQAT: Quantization Aware Training with Fractional bits

Luca Morreale, Alberto Gil C. P. Ramos, Malcolm Chadwick, Mehid Noroozi, Ruchika Chavhan, Abhinav Mehrotra, Sourav Bhattacharya

Main category: cs.CV

TL;DR: Proposes fractional bits quantization (SHORT) to progressively reduce model precision from 32 to 4 bits while maintaining generation quality, enabling deployment of large generative models on smartphones.

Details

Motivation: Large generative models cannot be deployed on smartphones due to memory and computation constraints, and aggressive quantization methods struggle to preserve model quality.

Method: Progressive quantization approach that reduces model precision from 32 to 4 bits per parameter while exploiting fractional bits during optimization to maintain quality.

Result: SHORT yields improved quality on various diffusion models (SD3.5-Medium, Sana, PixArt, FLUX.1-schnell) with 4-7% lower FiD than standard QAT, and successfully deployed Sana on Samsung S25U smartphone.

Conclusion: Fractional bits quantization enables efficient deployment of large generative models on resource-constrained devices while maintaining high generation quality.

Abstract: State-of-the-art (SOTA) generative models have demonstrated impressive capabilities in image synthesis or text generation, often with a large capacity model. However, these large models cannot be deployed on smartphones due to the limited availability of on-board memory and computations. Quantization methods lower the precision of the model parameters, allowing for efficient computations, \eg, in \INT{8}. Although aggressive quantization addresses efficiency and memory constraints, preserving the quality of the model remains a challenge. To retain quality in previous aggressive quantization, we propose a new fractional bits quantization (\short) approach. The novelty is a simple yet effective idea: we progressively reduce the model’s precision from 32 to 4 bits per parameter, and exploit the fractional bits during optimization to maintain high generation quality. We show that the \short{} yields improved quality on a variety of diffusion models, including SD3.5-Medium, Sana, \pixart, and FLUX.1-schnell, while achieving $4-7%$ lower FiD than standard QAT. Finally, we deploy and run Sana on a Samsung S25U, which runs on the Qualcomm SM8750-AB Snapdragon 8 Elite Hexagon Tensor Processor (HTP).

[258] Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data

Qi Chen, Xinze Zhou, Chen Liu, Hao Chen, Wenxuan Li, Zekun Jiang, Ziyan Huang, Yuxuan Zhao, Dexin Yu, Junjun He, Yefeng Zheng, Ling Shao, Alan Yuille, Zongwei Zhou

Main category: cs.CV

TL;DR: Synthetic data can achieve same tumor segmentation performance with fewer real scans, enabling more efficient AI training. AbdomenAtlas 2.0 provides 10,135 CT scans with 15,130 tumor instances across six organs, significantly outperforming existing datasets.

Details

Motivation: AI for tumor segmentation is limited by lack of large, voxel-wise annotated datasets which are hard to create and require medical experts. Found that AI performance stopped improving after 1,500 scans in proprietary dataset.

Method: Used synthetic data to reach same performance with only 500 real scans. Created AbdomenAtlas 2.0 with 10,135 CT scans, 15,130 tumor instances across six organs (pancreas, liver, kidney, colon, esophagus, uterus) with 5,893 control scans, annotated by 23 expert radiologists.

Result: Synthetic data steepens data scaling laws, enabling more efficient model training. AbdomenAtlas 2.0 achieves +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests compared to public datasets.

Conclusion: Synthetic data can significantly reduce the need for large real datasets in medical AI. AbdomenAtlas 2.0 provides a strong foundation for training AI to segment tumors in multiple organs, with substantial performance improvements over existing datasets.

Abstract: AI for tumor segmentation is limited by the lack of large, voxel-wise annotated datasets, which are hard to create and require medical experts. In our proprietary JHH dataset of 3,000 annotated pancreatic tumor scans, we found that AI performance stopped improving after 1,500 scans. With synthetic data, we reached the same performance using only 500 real scans. This finding suggests that synthetic data can steepen data scaling laws, enabling more efficient model training than real data alone. Motivated by these lessons, we created AbdomenAtlas 2.0–a dataset of 10,135 CT scans with a total of 15,130 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 5,893 control scans. Annotated by 23 expert radiologists, it is several orders of magnitude larger than existing public tumor datasets. While we continue expanding the dataset, the current version of AbdomenAtlas 2.0 already provides a strong foundation–based on lessons from the JHH dataset–for training AI to segment tumors in six organs. It achieves notable improvements over public datasets, with a +7% DSC gain on in-distribution tests and +16% on out-of-distribution tests.

[259] QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li

Main category: cs.CV

TL;DR: QDepth-VLA enhances VLA models with depth prediction to improve spatial reasoning for manipulation tasks.

Details

Motivation: Existing VLA models lack 3D structure understanding needed for precise control in manipulation tasks.

Method: Augments VLA models with auxiliary depth prediction using a depth expert that predicts quantized latent tokens of depth maps from VQ-VAE encoder.

Result: Demonstrates strong spatial reasoning and competitive performance on simulation benchmarks and real-world manipulation tasks.

Conclusion: QDepth-VLA effectively improves spatial perception and reasoning in VLA models through depth-aware representations.

Abstract: Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.

[260] ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints

Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang

Main category: cs.CV

TL;DR: ImagerySearch is a prompt-guided adaptive test-time search strategy that dynamically adjusts inference search space and reward functions to improve video generation in imaginative scenarios with rare concept combinations.

Details

Motivation: Current video generation models perform poorly on imaginative scenarios involving rarely co-occurring concepts with long-distance semantic relationships that fall outside training distributions.

Method: Proposes ImagerySearch with dynamic adjustment of inference search space and reward function based on prompt semantics, and introduces LDT-Bench benchmark for evaluating long-distance semantic prompts.

Result: ImagerySearch consistently outperforms strong baselines and existing test-time scaling approaches on LDT-Bench, achieving competitive improvements on VBench across diverse prompt types.

Conclusion: The method effectively addresses imaginative video generation challenges and the introduced benchmark will facilitate future research in this direction.

Abstract: Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.

[261] A Multi-Task Deep Learning Framework for Skin Lesion Classification, ABCDE Feature Quantification, and Evolution Simulation

Harsha Kotla, Arun Kumar Rajasekaran, Hannah Rana

Main category: cs.CV

TL;DR: A deep learning framework that classifies skin lesions and quantifies ABCD features (Asymmetry, Border irregularity, Color variation, Diameter) while simulating evolution over time for the E feature, achieving 89% accuracy and 0.96 AUC for melanoma detection.

Details

Motivation: Early melanoma detection is crucial for survival rates, but current deep learning methods treat skin lesion analysis as a black box without explaining human-interpretable ABCDE features used in clinical practice.

Method: Proposed framework classifies skin lesions and quantifies scores for each ABCD feature, simulates feature evolution over time for the E aspect, and visualizes ABCD feature trajectories in latent space as lesions progress from benign to malignant.

Result: Achieved 89% classification accuracy with melanoma AUC of 0.96. Feature evaluation performed well for asymmetry, color variation, and diameter prediction, though border irregularity remained more challenging to model.

Conclusion: This framework enables doctors to link machine learning diagnoses to clinically relevant ABCDE criteria, improving understanding of skin cancer progression and providing interpretable results.

Abstract: Early detection of melanoma has grown to be essential because it significantly improves survival rates, but automated analysis of skin lesions still remains challenging. ABCDE, which stands for Asymmetry, Border irregularity, Color variation, Diameter, and Evolving, is a well-known classification method for skin lesions, but most deep learning mechanisms treat it as a black box, as most of the human interpretable features are not explained. In this work, we propose a deep learning framework that both classifies skin lesions into categories and also quantifies scores for each ABCD feature. It simulates the evolution of these features over time in order to represent the E aspect, opening more windows for future exploration. The A, B, C, and D values are quantified particularly within this work. Moreover, this framework also visualizes ABCD feature trajectories in latent space as skin lesions evolve from benign nevuses to malignant melanoma. The experiments are conducted using the HAM10000 dataset that contains around ten thousand images of skin lesions of varying stages. In summary, the classification worked with an accuracy of around 89 percent, with melanoma AUC being 0.96, while the feature evaluation performed well in predicting asymmetry, color variation, and diameter, though border irregularity remains more difficult to model. Overall, this work provides a deep learning framework that will allow doctors to link ML diagnoses to clinically relevant criteria, thus improving our understanding of skin cancer progression.

Mihai-Cristian Pîrvu, Marius Leordeanu

Main category: cs.CV

TL;DR: The paper proposes a method to combine multiple visual modalities using minimal human supervision, leveraging pre-trained experts and procedural combinations on raw videos through an autonomous data pipeline. The approach uses PHG-MAE for multi-modal learning and achieves competitive results with a highly efficient model (<1M parameters) compared to larger models (~300M parameters).

Details

Motivation: The real world is multi-modal, but traditional ML models are unimodal or bimodal. To truly understand the world, all independent modalities need to be integrated. The work aims to combine as many visual modalities as possible with little to no human supervision.

Method: Uses pre-trained experts and procedural combinations between them on raw videos through a fully autonomous data pipeline (open-sourced). Employs PHG-MAE model specifically designed for multi-modal data, which is efficiently distilled into a low-parameter model (<1M parameters).

Result: The distilled model (<1M parameters) achieves competitive results compared to models with ~300M parameters. Successfully deployed for real-time semantic segmentation from handheld devices/webcams on commodity hardware, and also deployed other off-the-shelf models like DPT for near real-time depth estimation.

Conclusion: The approach demonstrates that efficient multi-modal integration is feasible with minimal human supervision, achieving competitive performance with significantly reduced model size, enabling practical deployment on commodity hardware for real-time applications.

Abstract: The real-world is inherently multi-modal at its core. Our tools observe and take snapshots of it, in digital form, such as videos or sounds, however much of it is lost. Similarly for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (i.e. rgb -> semantic or text -> sentiment_class). Recent trends go towards bi-modality, where images and text are learned together, however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. In order to do this, we use pre-trained experts and procedural combinations between them on top of raw videos using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model which was efficiently distilled into a low-parameter (<1M) can have competitive results compared to models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.

[263] Benchmarking Multimodal Large Language Models for Face Recognition

Hatef Otroshi Shahreza, Sébastien Marcel

Main category: cs.CV

TL;DR: Systematic benchmark of MLLMs for face recognition shows they capture rich semantic cues but lag behind specialized models in zero-shot high-precision scenarios.

Details

Motivation: To evaluate and compare the performance of open-source multimodal large language models (MLLMs) with existing face recognition models on standard benchmarks, as their potential in face recognition remains underexplored.

Method: Conducted systematic benchmarking of state-of-the-art MLLMs on several face recognition datasets including LFW, CALFW, CPLFW, CFP, AgeDB and RFW using similar protocols as existing models.

Result: MLLMs capture rich semantic cues useful for face-related tasks but perform worse than specialized face recognition models in zero-shot high-precision recognition scenarios.

Conclusion: The benchmark provides foundation for advancing MLLM-based face recognition and offers insights for designing next-generation models with higher accuracy and generalization.

Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance across diverse vision-and-language tasks. However, their potential in face recognition remains underexplored. In particular, the performance of open-source MLLMs needs to be evaluated and compared with existing face recognition models on standard benchmarks with similar protocol. In this work, we present a systematic benchmark of state-of-the-art MLLMs for face recognition on several face recognition datasets, including LFW, CALFW, CPLFW, CFP, AgeDB and RFW. Experimental results reveal that while MLLMs capture rich semantic cues useful for face-related tasks, they lag behind specialized models in high-precision recognition scenarios in zero-shot applications. This benchmark provides a foundation for advancing MLLM-based face recognition, offering insights for the design of next-generation models with higher accuracy and generalization. The source code of our benchmark is publicly available in the project page.

[264] TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions

Guangyi Han, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha

Main category: cs.CV

TL;DR: The paper introduces Free-Form HOI Generation to overcome limitations of fixed grasping patterns in hand-object interaction research, enabling diverse interactions like pushing and rotating through fine-grained intent control.

Details

Motivation: Existing HOI generation is limited to fixed grasping patterns tied to physical priors, which imposes strong inductive bias for stable grasps and fails to capture the diversity of daily hand-object interactions.

Method: Proposes TOUCH, a three-stage framework with a multi-level diffusion model that uses explicit contact modeling for conditioning, refined with contact consistency and physical constraints. Built on WildO2 dataset containing 4.4k unique interactions across 92 intents and 610 object categories.

Result: The method generates controllable, diverse, and physically plausible hand interactions representative of daily activities, extending beyond grasping to free-form interactions like pushing, poking, and rotating.

Conclusion: The approach successfully addresses limitations of previous HOI generation methods by enabling fine-grained semantic control and generating versatile hand poses beyond traditional grasping priors, supported by comprehensive experimental validation.

Abstract: Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method’s ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities. The project page is $\href{https://guangyid.github.io/hoi123touch}{here}$.

[265] BADAS: Context Aware Collision Prediction Using Real-World Dashcam Data

Roni Goldshmidt, Hamish Scott, Lorenzo Niccolini, Shizhan Zhu, Daniel Moura, Orly Zvitia

Main category: cs.CV

TL;DR: BADAS is a family of collision prediction models that addresses the problem of distinguishing ego-vehicle threats from random accidents, achieving state-of-the-art performance through ego-centric evaluation on real-world dashcam data.

Details

Motivation: Existing collision prediction methods fail to distinguish between ego-vehicle threats and random accidents not involving the ego vehicle, leading to excessive false alerts in real-world deployment.

Method: BADAS uses a V-JEPA2 backbone trained end-to-end, with two variants: BADAS-Open (trained on 1.5k public videos) and BADAS1.0 (trained on 40k proprietary videos). The approach involves re-annotating benchmarks to identify ego involvement, adding consensus alert-time labels, and synthesizing negatives where needed.

Result: BADAS achieves state-of-the-art AP/AUC across DAD, DADA-2000, DoTA, and Nexar datasets, outperforms forward-collision ADAS baseline, and produces more realistic time-to-accident estimates.

Conclusion: The authors release BADAS-Open model weights, code, and re-annotations of evaluation datasets to promote ego-centric collision prediction research.

Abstract: Existing collision prediction methods often fail to distinguish between ego-vehicle threats and random accidents not involving the ego vehicle, leading to excessive false alerts in real-world deployment. We present BADAS, a family of collision prediction models trained on Nexar’s real-world dashcam collision dataset – the first benchmark designed explicitly for ego-centric evaluation. We re-annotate major benchmarks to identify ego involvement, add consensus alert-time labels, and synthesize negatives where needed, enabling fair AP/AUC and temporal evaluation. BADAS uses a V-JEPA2 backbone trained end-to-end and comes in two variants: BADAS-Open (trained on our 1.5k public videos) and BADAS1.0 (trained on 40k proprietary videos). Across DAD, DADA-2000, DoTA, and Nexar, BADAS achieves state-of-the-art AP/AUC and outperforms a forward-collision ADAS baseline while producing more realistic time-to-accident estimates. We release our BADAS-Open model weights and code, along with re-annotations of all evaluation datasets to promote ego-centric collision prediction research.

[266] ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention

Keli Liu, Zhendong Wang, Wengang Zhou, Shaodong Xu, Ruixiao Dong, Houqiang Li

Main category: cs.CV

TL;DR: ScaleWeaver is a parameter-efficient fine-tuning framework for visual autoregressive (VAR) models that enables precise controllable text-to-image generation through improved MMDiT blocks with Reference Attention.

Details

Motivation: While control mechanisms exist for diffusion models, achieving precise and flexible control within the VAR paradigm remains underexplored, creating a critical gap that needs to be addressed.

Method: Proposes ScaleWeaver with improved MMDiT blocks featuring Reference Attention module, which efficiently incorporates conditional information by discarding unnecessary image→condition attention, emphasizing parameter reuse, and using zero-initialized linear projection to preserve base model capabilities.

Result: Extensive experiments show ScaleWeaver delivers high-quality generation with precise control while achieving superior efficiency over diffusion-based methods.

Conclusion: ScaleWeaver provides a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm, balancing quality, control precision, and computational efficiency.

Abstract: Text-to-image generation with visual autoregressive~(VAR) models has recently achieved impressive advances in generation fidelity and inference efficiency. While control mechanisms have been explored for diffusion models, enabling precise and flexible control within VAR paradigm remains underexplored. To bridge this critical gap, in this paper, we introduce ScaleWeaver, a novel framework designed to achieve high-fidelity, controllable generation upon advanced VAR models through parameter-efficient fine-tuning. The core module in ScaleWeaver is the improved MMDiT block with the proposed Reference Attention module, which efficiently and effectively incorporates conditional information. Different from MM Attention, the proposed Reference Attention module discards the unnecessary attention from image$\rightarrow$condition, reducing computational cost while stabilizing control injection. Besides, it strategically emphasizes parameter reuse, leveraging the capability of the VAR backbone itself with a few introduced parameters to process control information, and equipping a zero-initialized linear projection to ensure that control signals are incorporated effectively without disrupting the generative capability of the base model. Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods, making ScaleWeaver a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm. Code and models will be released.

[267] You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Logan Lawrence, Oindrila Saha, Megan Wei, Chen Sun, Subhransu Maji, Grant Van Horn

Main category: cs.CV

TL;DR: The paper proposes nlg2choice, a two-stage method for evaluating MLLMs on fine-grained visual classification tasks with hundreds to thousands of choices, addressing challenges in free-form response evaluation and computational efficiency.

Details

Motivation: Existing methods struggle with evaluating free-form responses in zero-shot visual classification, especially for fine-grained tasks with hundreds to thousands of highly related choices, and face computational challenges in retrieval-based problems.

Method: nlg2choice uses a two-stage approach: first asking MLLMs open-ended questions with minimal constraints, then using text-only constrained decoding to predict the most likely choice. For retrieval, it computes probability of constrained responses with early stopping to improve throughput.

Result: The method shows improvement over seven fine-grained visual datasets in both classification and retrieval tasks, and maintains performance across various natural language task implementations.

Conclusion: nlg2choice effectively addresses the challenges of evaluating MLLMs on fine-grained visual classification with large choice sets while maintaining computational efficiency through constrained decoding and early stopping.

Abstract: Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don’t consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.

[268] Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz

Main category: cs.CV

TL;DR: A novel semi-supervised video anomaly detection framework using Multimodal Large Language Models to analyze object interactions over time, providing explainable anomaly detection that outperforms existing methods.

Details

Motivation: Existing semi-supervised VAD methods struggle with complex object interaction anomalies and lack explainability, motivating the need for a more interpretable approach that can handle interaction-based anomalies.

Method: Uses MLLMs to extract textual descriptions of object activities and interactions from nominal videos by querying with visual inputs of object pairs over time, then compares test video descriptions to training data for anomaly detection.

Result: The method effectively detects complex interaction-based anomalies and achieves state-of-the-art performance on datasets without interaction anomalies, while providing inherent explainability.

Conclusion: The proposed MLLM-based framework successfully addresses limitations in detecting complex object interaction anomalies and provides explainable anomaly detection that can be combined with traditional VAD methods to enhance interpretability.

Abstract: Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.

[269] MaskCaptioner : Learning to Jointly Segment and Caption Object Trajectories in Videos

Gabriel Fiastre, Antoine Yang, Cordelia Schmid

Main category: cs.CV

TL;DR: The paper proposes MaskCaptioner, an end-to-end model for Dense Video Object Captioning (DVOC) that jointly detects, tracks, segments, and captions object trajectories using synthetic captions from extended datasets.

Details

Motivation: Previous approaches use disjoint training strategies due to the complexity and high cost of manual annotation for DVOC, potentially leading to suboptimal performance.

Method: Generate synthetic captions using a state-of-the-art VLM, extend LVIS and LV-VIS datasets with these captions (creating LVISCap and LV-VISCap), and train MaskCaptioner end-to-end for joint detection, segmentation, tracking, and captioning.

Result: MaskCaptioner achieves state-of-the-art DVOC results on three benchmarks: VidSTG, VLN, and BenSMOT after pretraining on LVISCap and LV-VISCap.

Conclusion: The proposed approach successfully addresses the annotation cost issue in DVOC through synthetic caption generation and demonstrates superior performance with an end-to-end model.

Abstract: Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.

[270] 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation

JoungBin Lee, Jaewoo Jung, Jisang Han, Takuya Narihira, Kazumi Fukuda, Junyoung Seo, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim

Main category: cs.CV

TL;DR: 3DScenePrompt is a framework for generating video chunks with precise camera control and scene consistency using dual spatio-temporal conditioning and a 3D scene memory that separates static geometry from dynamic elements.

Details

Motivation: To address limitations of existing methods that are conditioned on single images or short clips, enabling generation of arbitrary-length video chunks while maintaining scene consistency and precise camera control.

Method: Uses dual spatio-temporal conditioning with temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. Introduces a 3D scene memory constructed via dynamic SLAM with dynamic masking to separate static geometry from moving elements, allowing projection to any target viewpoint.

Result: Significantly outperforms existing methods in scene consistency, camera controllability, and generation quality while maintaining computational efficiency and motion realism.

Conclusion: The framework successfully enables long-range spatial coherence and precise camera control in video generation by leveraging 3D scene representations and dynamic separation techniques.

Abstract: We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page : https://cvlab-kaist.github.io/3DScenePrompt/

[271] OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression

Zhe Li, Weihao Yuan, Weichao Shen, Siyu Zhu, Zilong Dong, Chang Xu

Main category: cs.CV

TL;DR: A continuous masked autoregressive motion transformer with gated linear attention and RMSNorm for whole-body multi-modal human motion generation, outperforming previous methods across text-to-motion, speech-to-gesture, and music-to-dance tasks.

Details

Motivation: To address challenges in whole-body multi-modal human motion generation: creating effective motion generation mechanisms and integrating various modalities (text, speech, music) into a cohesive framework.

Method: Developed a continuous masked autoregressive motion transformer with causal attention, gated linear attention, RMSNorm module, and DiT structure for diffusion. Used AdaLN and cross-attention to fuse text, speech, and music modalities.

Result: Outperforms previous methods across all modalities including text-to-motion, speech-to-gesture, and music-to-dance.

Conclusion: The proposed framework successfully addresses multi-modal human motion generation challenges and demonstrates superior performance across different modality tasks.

Abstract: Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.

[272] RealDPO: Real or Not Real, that is the Preference

Guo Cheng, Danni Yang, Ziqi Huang, Jianlou Si, Chenyang Si, Ziwei Liu

Main category: cs.CV

TL;DR: RealDPO is a novel alignment paradigm that uses real-world videos as positive samples for preference learning to improve complex motion synthesis in video generation models.

Details

Motivation: Existing video generative models struggle with producing natural, smooth, and contextually consistent complex motions, limiting their practical applicability.

Method: RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function, contrasting real-world videos with erroneous model outputs for iterative self-correction. Also introduces RealAction-5K dataset for post-training.

Result: RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.

Conclusion: RealDPO effectively addresses the motion synthesis challenge in video generation through real-world data-driven preference learning.

Abstract: Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.

[273] MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning

Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, Hongsheng Li

Main category: cs.CV

TL;DR: MathCanvas is a framework that enables Large Multimodal Models to perform Visual Chain-of-Thought reasoning for mathematics through diagram generation, editing, and strategic visual-textual reasoning.

Details

Motivation: LLMs struggle with mathematical domains like geometry that require visual aids, and existing VCoT approaches are limited by rigid tools or poor diagram quality.

Method: Two-phase approach: Visual Manipulation pre-training on 15.2M caption-to-diagram pairs and editing trajectories, followed by Strategic Visual-Aided Reasoning fine-tuning on 219K interleaved visual-textual reasoning examples.

Result: BAGEL-Canvas model achieves 86% relative improvement over LMM baselines on MathCanvas-Bench and shows excellent generalization to other math benchmarks.

Conclusion: MathCanvas provides a complete toolkit to unlock complex, human-like visual-aided reasoning in LMMs for mathematical problem-solving.

Abstract: While Large Language Models (LLMs) have excelled in textual reasoning, they struggle with mathematical domains like geometry that intrinsically rely on visual aids. Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. Our approach consists of two phases. First, a Visual Manipulation stage pre-trains the model on a novel 15.2M-pair corpus, comprising 10M caption-to-diagram pairs (MathCanvas-Imagen) and 5.2M step-by-step editing trajectories (MathCanvas-Edit), to master diagram generation and editing. Second, a Strategic Visual-Aided Reasoning stage fine-tunes the model on MathCanvas-Instruct, a new 219K-example dataset of interleaved visual-textual reasoning paths, teaching it when and how to leverage visual aids. To facilitate rigorous evaluation, we introduce MathCanvas-Bench, a challenging benchmark with 3K problems that require models to produce interleaved visual-textual solutions. Our model, BAGEL-Canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines on MathCanvas-Bench, demonstrating excellent generalization to other public math benchmarks. Our work provides a complete toolkit-framework, datasets, and benchmark-to unlock complex, human-like visual-aided reasoning in LMMs. Project Page: https://mathcanvas.github.io/

[274] C4D: 4D Made from 3D through Dual Correspondences

Shizun Wang, Zhenxiang Jiang, Xingyi Yang, Xinchao Wang

Main category: cs.CV

TL;DR: C4D is a framework that extends 3D reconstruction to 4D by leveraging temporal correspondences (optical flow and point tracking) to handle dynamic scenes, enabling joint estimation of dynamic geometry and camera poses from monocular video.

Details

Motivation: Existing pointmap-based 3D reconstruction methods fail in dynamic scenes because moving objects violate multi-view geometric constraints, leading to inaccurate results. There's a need to handle dynamic elements while maintaining reconstruction quality.

Method: C4D captures short-term optical flow and long-term point tracking, trains a dynamic-aware point tracker to identify moving objects, and introduces dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters while lifting 2D trajectories to smooth 3D trajectories.

Result: The framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks including depth estimation, camera pose estimation, and point tracking.

Conclusion: C4D successfully extends 3D reconstruction to dynamic scenes by leveraging temporal correspondences, enabling robust 4D reconstruction from monocular video while maintaining accuracy across various reconstruction tasks.

Abstract: Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inevitably challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. Specifically, apart from predicting pointmaps, C4D captures two types of correspondences: short-term optical flow and long-term point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking. Project Page: https://littlepure2333.github.io/C4D

[275] RainDiff: End-to-end Precipitation Nowcasting Via Token-wise Attention Diffusion

Thao Nguyen, Jiaqi Ma, Fahad Shahbaz Khan, Souhaib Ben Taieb, Salman Khan

Main category: cs.CV

TL;DR: A novel precipitation nowcasting method using token-wise attention integrated into U-Net diffusion models that captures multi-scale spatio-temporal dependencies without requiring separate latent modules.

Details

Motivation: Address scalability issues in diffusion-based precipitation nowcasting models - latent-space approaches require complex autoencoders while pixel-space approaches are computationally intensive and lack attention mechanisms for long-range dependencies.

Method: Propose token-wise attention integrated into both U-Net diffusion model and spatio-temporal encoder to dynamically capture multi-scale spatial interactions and temporal evolution, eliminating need for separate latent modules.

Result: Significantly outperforms state-of-the-art approaches with superior local fidelity, generalization, and robustness across diverse datasets in complex precipitation forecasting scenarios.

Conclusion: The proposed method provides an effective solution for precipitation nowcasting by natively integrating attention into diffusion architecture without high resource costs, achieving better performance than existing approaches.

Abstract: Precipitation nowcasting, predicting future radar echo sequences from current observations, is a critical yet challenging task due to the inherently chaotic and tightly coupled spatio-temporal dynamics of the atmosphere. While recent advances in diffusion-based models attempt to capture both large-scale motion and fine-grained stochastic variability, they often suffer from scalability issues: latent-space approaches require a separately trained autoencoder, adding complexity and limiting generalization, while pixel-space approaches are computationally intensive and often omit attention mechanisms, reducing their ability to model long-range spatio-temporal dependencies. To address these limitations, we propose a Token-wise Attention integrated into not only the U-Net diffusion model but also the spatio-temporal encoder that dynamically captures multi-scale spatial interactions and temporal evolution. Unlike prior approaches, our method natively integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion, thereby eliminating the need for separate latent modules. Our extensive experiments and visual evaluations across diverse datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, yielding superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios.

[276] ChangingGrounding: 3D Visual Grounding in Changing Scenes

Miao Hu, Zhiwei Huang, Tai Wang, Jiangmiao Pang, Dahua Lin, Nanning Zheng, Runsen Xu

Main category: cs.CV

TL;DR: The paper introduces ChangingGrounding, the first benchmark for 3D visual grounding in changing scenes, and proposes Mem-ChangingGrounder, a zero-shot method that uses memory-driven exploration and multi-view fusion to localize objects efficiently.

Details

Motivation: Existing 3D visual grounding methods assume static, reconstructed point clouds, which requires costly re-scans and hinders real-world deployment. The authors argue that 3DVG should be formulated as an active, memory-driven problem to handle changing scenes.

Method: Proposes Mem-ChangingGrounder, a zero-shot method that: 1) identifies object type from query, 2) retrieves relevant memories to guide actions, 3) explores target efficiently, 4) falls back when previous operations are invalid, 5) performs multi-view scanning, and 6) projects fused evidence from multi-view scans to get accurate bounding boxes.

Result: Mem-ChangingGrounder achieves the highest localization accuracy on the ChangingGrounding benchmark while greatly reducing exploration cost compared to other baselines.

Conclusion: The benchmark and method aim to catalyze a shift toward practical, memory-centric 3DVG research for real-world applications where scenes are dynamic and constantly changing.

Abstract: Real-world robots localize objects from natural-language instructions while scenes around them keep changing. Yet most of the existing 3D visual grounding (3DVG) method still assumes a reconstructed and up-to-date point cloud, an assumption that forces costly re-scans and hinders deployment. We argue that 3DVG should be formulated as an active, memory-driven problem, and we introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations, explore only where needed, and still deliver precise 3D boxes in changing scenes. To set a strong reference point, we also propose Mem-ChangingGrounder, a zero-shot method for this task that marries cross-modal retrieval with lightweight multi-view fusion: it identifies the object type implied by the query, retrieves relevant memories to guide actions, then explores the target efficiently in the scene, falls back when previous operations are invalid, performs multi-view scanning of the target, and projects the fused evidence from multi-view scans to get accurate object bounding boxes. We evaluate different baselines on ChangingGrounding, and our Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost. We hope this benchmark and method catalyze a shift toward practical, memory-centric 3DVG research for real-world applications. Project page: https://hm123450.github.io/CGB/ .

[277] WithAnyone: Towards Controllable and ID Consistent Image Generation

Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang

Main category: cs.CV

TL;DR: The paper addresses copy-paste artifacts in identity-consistent text-to-image generation by creating a large-scale dataset, proposing a benchmark, and introducing a contrastive identity loss to balance identity fidelity with diversity.

Details

Motivation: Current identity-consistent generation models suffer from copy-paste artifacts where they directly replicate reference faces instead of preserving identity across natural variations, limiting controllability and expressive power.

Method: Constructed MultiID-2M dataset for multi-person scenarios, introduced a benchmark for quantifying copy-paste artifacts, and proposed a training paradigm with contrastive identity loss using paired data.

Result: WithAnyone model significantly reduces copy-paste artifacts, improves controllability over pose and expression, maintains strong perceptual quality, and achieves high identity fidelity with expressive generation according to user studies.

Conclusion: The proposed approach effectively mitigates copy-paste artifacts while preserving identity similarity, enabling more controllable and expressive identity-consistent image generation.

Abstract: Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.

[278] Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation

Shaowei Liu, Chuan Guo, Bing Zhou, Jian Wang

Main category: cs.CV

TL;DR: Ponimator is a framework that uses close-proximity human-human interactive poses to generate versatile interaction animations through two conditional diffusion models: one for motion generation from poses, and another for pose synthesis from single poses or text.

Details

Motivation: Human proximity poses convey rich contextual information about interaction dynamics, and humans can intuitively infer context and anticipate dynamics from such poses. The goal is to leverage these interactive pose priors for versatile animation.

Method: Uses two conditional diffusion models: (1) pose animator that generates dynamic motion sequences from interactive poses using temporal prior, and (2) pose generator that synthesizes interactive poses from single pose, text, or both using spatial prior.

Result: Empirical experiments across diverse datasets demonstrate the universality of pose prior and effectiveness of the framework in supporting image-based interaction animation, reaction animation, and text-to-interaction synthesis.

Conclusion: Ponimator successfully transfers interaction knowledge from high-quality mocap data to open-world scenarios, showing robustness and effectiveness in diverse applications.

Abstract: Close-proximity human-human interactive poses convey rich contextual information about interaction dynamics. Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. Our training data consists of close-contact two-person poses and their surrounding temporal context from motion-capture interaction datasets. Leveraging interactive pose priors, Ponimator employs two conditional diffusion models: (1) a pose animator that uses the temporal prior to generate dynamic motion sequences from interactive poses, and (2) a pose generator that applies the spatial prior to synthesize interactive poses from a single pose, text, or both when interactive poses are unavailable. Collectively, Ponimator supports diverse tasks, including image-based interaction animation, reaction animation, and text-to-interaction synthesis, facilitating the transfer of interaction knowledge from high-quality mocap data to open-world scenarios. Empirical experiments across diverse datasets and applications demonstrate the universality of the pose prior and the effectiveness and robustness of our framework.

[279] Terra: Explorable Native 3D World Model with Point Latents

Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: Terra is a native 3D world model that represents environments in an intrinsic 3D latent space using point-to-Gaussian variational autoencoder and sparse point flow matching, achieving state-of-the-art reconstruction and generation with high 3D consistency.

Details

Motivation: Existing world models rely on pixel-aligned representations that neglect the inherent 3D nature of the physical world, undermining 3D consistency and modeling efficiency.

Method: Proposes P2G-VAE that encodes 3D inputs into latent point representation decoded as 3D Gaussian primitives, and SPFlow network for generating latent point representation by denoising positions and features.

Result: Achieves state-of-the-art performance on ScanNet v2 indoor scenes in both reconstruction and generation with high 3D consistency, enabling exact multi-view consistency and flexible rendering from any viewpoint.

Conclusion: Terra demonstrates that native 3D representation enables explorable world modeling with superior 3D consistency and efficiency compared to pixel-aligned approaches.

Abstract: World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.

[280] Learning an Image Editing Model without Image Editing Pairs

Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang

Main category: cs.CV

TL;DR: A new training paradigm for image editing that eliminates the need for paired data by using vision-language models to provide direct feedback and gradients for optimization, combined with distribution matching loss for visual fidelity.

Details

Motivation: Current image editing models rely on supervised fine-tuning with large datasets of input-target pairs, which are hard to curate at scale. Synthetic training pairs can propagate artifacts from pretrained models.

Method: Directly optimizes a few-step diffusion model by unrolling it during training and using feedback from vision-language models (VLMs) to evaluate if edits follow instructions and preserve unchanged content. Incorporates distribution matching loss (DMD) to maintain visual fidelity.

Result: Performs on par with various image editing diffusion models trained on extensive supervised paired data in the few-step setting. Outperforms RL-based techniques like Flow-GRPO when using the same VLM as reward model.

Conclusion: The method successfully eliminates the need for paired data in image editing model training while achieving competitive performance through direct VLM feedback and distribution matching constraints.

Abstract: Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.

[281] From Pixels to Words – Towards Native Vision-Language Primitives at Scale

Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu

Main category: cs.CV

TL;DR: The paper introduces NEO, a family of native Vision-Language Models that address fundamental constraints of modular VLMs by developing unified vision-language primitives for alignment, integration, and reasoning.

Details

Motivation: To overcome limitations of modular VLMs and make native VLM research more accessible and democratized, addressing fundamental constraints and accelerating progress in the field.

Method: Developed NEO family of native VLMs built from first principles with three core primitives: pixel-word alignment in shared semantic space, integration of vision/language modules, and inherent cross-modal properties for unified encoding, aligning, and reasoning.

Result: NEO rivals top-tier modular VLMs across diverse real-world scenarios using only 390M image-text examples, efficiently developing visual perception from scratch while mitigating vision-language conflicts in a dense monolithic model.

Conclusion: NEO serves as a cornerstone for scalable and powerful native VLMs, providing reusable components for a cost-effective and extensible ecosystem, with publicly available code and models.

Abstract: The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

[282] Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

Hadi Alzayer, Yunzhi Zhang, Chen Geng, Jia-Bin Huang, Jiajun Wu

Main category: cs.CV

TL;DR: A diffusion sampling method for multi-view consistent image editing using pre-trained 2D models, avoiding explicit 3D optimization through implicit 3D regularization.

Details

Motivation: Existing 2D image editing models produce high-quality edits but lack consistency across multiple views of 3D scenes, while current 3D optimization approaches are slow and unstable with sparse views.

Method: Coupled diffusion sampling that concurrently samples from multi-view and 2D edited image distributions, using a coupling term to enforce multi-view consistency without explicit 3D representations.

Result: Validated on three distinct multi-view image editing tasks, showing effectiveness and generality across various model architectures.

Conclusion: The framework provides a general solution for multi-view consistent editing that works with existing 2D models without lengthy optimization processes.

Abstract: We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.

[283] Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation

Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J. Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, Xiao Xiang Zhu

Main category: cs.CV

TL;DR: DOFA is a unified multimodal foundation model for Earth observation that handles diverse satellite sensors through a dynamic hypernetwork, achieving SOTA performance across multiple tasks with efficient training.

Details

Motivation: Existing EO foundation models are tailored to specific sensor types and lack flexibility for the heterogeneous landscape of Earth observation data with different sensor modalities, ground sampling distances, and spectral ranges.

Method: Uses a wavelength-conditioned dynamic hypernetwork inspired by neural plasticity to process inputs from five distinct satellite sensors. Employs continual pretraining on five EO modalities and hybrid continual pretraining for enhanced efficiency.

Result: Achieves state-of-the-art performance across multiple downstream tasks, generalizes well to unseen modalities, and requires significantly fewer computational resources while outperforming counterparts trained with extensive GPU budgets.

Conclusion: DOFA demonstrates strong potential as a foundation for general-purpose vision models in the sensor-diverse Earth observation domain, offering a unified framework that overcomes sensor-specific limitations.

Abstract: Earth observation (EO) in open-world settings presents a unique challenge: different applications rely on diverse sensor modalities, each with varying ground sampling distances, spectral ranges, and numbers of spectral bands. However, existing EO foundation models are typically tailored to specific sensor types, making them inflexible when generalizing across the heterogeneous landscape of EO data. To address this, we propose the Dynamic One-For-All (DOFA) model, a unified, multimodal foundation framework designed for diverse vision tasks in EO. Inspired by neural plasticity, DOFA utilizes a wavelength-conditioned dynamic hypernetwork to process inputs from five distinct satellite sensors flexibly. By continually pretraining on five EO modalities, DOFA achieves state-of-the-art performance across multiple downstream tasks and generalizes well to unseen modalities. Enhanced with hybrid continual pretraining, DOFA+ requires significantly fewer computational resources while outperforming counterparts trained with extensive GPU budgets. Experiments on diverse datasets highlight DOFA’s potential as a foundation for general-purpose vision models in the sensor-diverse EO domain. The code and pre-trained weights are publicly available at https://github.com/zhu-xlab/DOFA.

[284] SOHES: Self-supervised Open-world Hierarchical Entity Segmentation

Shengcao Cao, Jiuxiang Gu, Jason Kuen, Hao Tan, Ruiyi Zhang, Handong Zhao, Ani Nenkova, Liang-Yan Gui, Tong Sun, Yu-Xiong Wang

Main category: cs.CV

TL;DR: SOHES is a self-supervised method for open-world entity segmentation that eliminates human annotations through a three-phase approach: self-exploration, self-instruction, and self-correction, achieving state-of-the-art performance without labeled data.

Details

Motivation: Existing entity segmentation methods like SAM rely heavily on costly human annotations, which limits scalability and accessibility. The authors aim to develop a method that can achieve high-quality open-world entity segmentation without requiring any human-annotated masks.

Method: Three-phase approach: 1) Self-exploration: uses pre-trained self-supervised representations and visual feature clustering to generate pseudo-labels; 2) Self-instruction: trains a segmentation model on the pseudo-labels; 3) Self-correction: refines pseudo-labels through teacher-student mutual learning. The method also captures hierarchical relationships between entities and their parts.

Result: Achieves unprecedented performance in self-supervised open-world segmentation, providing high-quality entity segmentation without human annotations. The method also successfully captures hierarchical entity-part relationships.

Conclusion: SOHES marks a significant milestone towards high-quality open-world entity segmentation without human-annotated masks, demonstrating that self-supervised approaches can effectively replace costly human annotation processes while maintaining competitive performance.

Abstract: Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels, and rectify the noises in pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES-ICLR.github.io.

Wenbo Sui, Daniel Lichau, Josselin Lefèvre, Harold Phelippeau

Main category: cs.CV

TL;DR: Proposes CMDIAD, a cross-modal distillation framework for industrial anomaly detection that enables multi-modal training but works with incomplete modalities during inference.

Details

Motivation: Practical production lines need multimodal anomaly detection but face cost/time constraints where only subsets of samples get full multimodal inspection. Models must handle incomplete modalities during inference.

Method: Cross-modal distillation framework (CMDIAD) with Multi-modal Training, Few-modal Inference (MTFI) pipeline that leverages multimodal training data but works with incomplete modalities.

Result: MTFI pipeline more effectively utilizes incomplete multimodal information compared to single modality approaches. Asymmetric performance improvement observed between point clouds and RGB images as main inference modalities.

Conclusion: Provides foundation for future multimodal dataset construction in manufacturing scenarios and demonstrates feasibility of MTFI approach for practical industrial applications.

Abstract: Recent studies of multimodal industrial anomaly detection (IAD) based on 3D point clouds and RGB images have highlighted the importance of exploiting the redundancy and complementarity among modalities for accurate classification and segmentation. However, achieving multimodal IAD in practical production lines remains a work in progress. It is essential to consider the trade-offs between the costs and benefits associated with the introduction of new modalities while ensuring compatibility with current processes. Existing quality control processes combine rapid in-line inspections, such as optical and infrared imaging with high-resolution but time-consuming near-line characterization techniques, including industrial CT and electron microscopy to manually or semi-automatically locate and analyze defects in the production of Li-ion batteries and composite materials. Given the cost and time limitations, only a subset of the samples can be inspected by all in-line and near-line methods, and the remaining samples are only evaluated through one or two forms of in-line inspection. To fully exploit data for deep learning-driven automatic defect detection, the models must have the ability to leverage multimodal training and handle incomplete modalities during inference. In this paper, we propose CMDIAD, a Cross-Modal Distillation framework for IAD to demonstrate the feasibility of a Multi-modal Training, Few-modal Inference (MTFI) pipeline. Our findings show that the MTFI pipeline can more effectively utilize incomplete multimodal information compared to applying only a single modality for training and inference. Moreover, we investigate the reasons behind the asymmetric performance improvement using point clouds or RGB images as the main modality of inference. This provides a foundation for our future multimodal dataset construction with additional modalities from manufacturing scenarios.

[286] AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization

Junjie Shentu, Matthew Watson, Noura Al Moubayed

Main category: cs.CV

TL;DR: AttenCraft is an attention-based method for multiple-concept disentanglement in text-to-image customization that addresses feature fusion and asynchronous learning issues through attention maps and adaptive sampling ratios.

Details

Motivation: Multiple-concept disentanglement remains challenging in text-to-image customization, with existing methods suffering from feature fusion and asynchronous learning across different concepts.

Method: Uses attention maps to generate concept masks automatically, introduces adaptive algorithm for sampling ratios, and employs feature-retaining training framework with various loss functions.

Result: Effectively mitigates feature fusion and asynchronous learning issues, achieving state-of-the-art image fidelity and comparable prompt fidelity to baseline models.

Conclusion: AttenCraft provides an effective solution for multiple-concept disentanglement without requiring manual mask preparation or specialized models.

Abstract: Text-to-image (T2I) customization empowers users to adapt the T2I diffusion model to new concepts absent in the pre-training dataset. On this basis, capturing multiple new concepts from a single image has emerged as a new task, allowing the model to learn multiple concepts simultaneously or discard unwanted concepts. However, multiple-concept disentanglement remains a key challenge. Existing disentanglement models often exhibit two main issues: feature fusion and asynchronous learning across different concepts. To address these issues, we propose AttenCraft, an attention-based method for multiple-concept disentanglement. Our method uses attention maps to generate accurate masks for each concept in a single initialization step, aiding in concept disentanglement without requiring mask preparation from humans or specialized models. Moreover, we introduce an adaptive algorithm based on attention scores to estimate sampling ratios for different concepts, promoting balanced feature acquisition and synchronized learning. AttenCraft also introduces a feature-retaining training framework that employs various loss functions to enhance feature recognition and prevent fusion. Extensive experiments show that our model effectively mitigates these two issues, achieving state-of-the-art image fidelity and comparable prompt fidelity to baseline models.

[287] Multi-level Reliable Guidance for Unpaired Multi-view Clustering

Like Xin, Wanqi Yang, Lei Wang, Ming Yang

Main category: cs.CV

TL;DR: A novel method called MRG-UMC is proposed for unpaired multi-view clustering, using multi-level clustering and reliable view guidance to achieve consistent cluster structures without paired samples.

Details

Motivation: Traditional multi-view clustering methods rely on paired samples, which are unavailable in unpaired scenarios. Existing approaches struggle to maintain consistent cluster structures when confidence is low.

Method: MRG-UMC integrates three modules: inner-view multi-level clustering using high-confidence sample pairs, synthesized-view alignment to reduce cross-view discrepancies, and cross-view guidance to enhance clustering confidence in poorly clustered views.

Result: Extensive experiments show MRG-UMC outperforms state-of-the-art UMC methods with an average NMI improvement of 12.95% on multi-view datasets.

Conclusion: The proposed method effectively addresses unpaired multi-view clustering by learning consistent and confident cluster structures through multi-level reliable guidance, with theoretical and experimental validation.

Abstract: In this thesis, we address the challenging problem of unpaired multi-view clustering (UMC), which aims to achieve effective joint clustering using unpaired samples observed across multiple views. Traditional incomplete multi-view clustering (IMC) methods typically rely on paired samples to capture complementary information between views. However, such strategies become impractical in the UMC due to the absence of paired samples. Although some researchers have attempted to address this issue by preserving consistent cluster structures across views, effectively mining such consistency remains challenging when the cluster structures {with low confidence}. Therefore, we propose a novel method, Multi-level Reliable Guidance for UMC (MRG-UMC), which integrates multi-level clustering and reliable view guidance to learn consistent and confident cluster structures from three perspectives. Specifically, inner-view multi-level clustering exploits high-confidence sample pairs across different levels to reduce the impact of boundary samples, resulting in more confident cluster structures. Synthesized-view alignment leverages a synthesized-view to mitigate cross-view discrepancies and promote consistency. Cross-view guidance employs a reliable view guidance strategy to enhance the clustering confidence of poorly clustered views. These three modules are jointly optimized across multiple levels to achieve consistent and confident cluster structures. Furthermore, theoretical analyses verify the effectiveness of MRG-UMC in enhancing clustering confidence. Extensive experimental results show that MRG-UMC outperforms state-of-the-art UMC methods, achieving an average NMI improvement of 12.95% on multi-view datasets. {The source code is available at: https://anonymous.4open.science/r/MRG-UMC-5E20.

[288] Shape of Motion: 4D Reconstruction from a Single Video

Qianqian Wang, Vickie Ye, Hang Gao, Weijia Zeng, Jake Austin, Zhengqi Li, Angjoo Kanazawa

Main category: cs.CV

TL;DR: A method for reconstructing dynamic scenes from monocular videos using SE(3) motion bases and data-driven priors, achieving state-of-the-art performance in motion estimation and novel view synthesis.

Details

Motivation: Existing monocular dynamic reconstruction approaches have limitations: they depend on templates, work only in quasi-static scenes, or fail to model 3D motion explicitly. The paper aims to overcome these limitations for generic dynamic scenes.

Method: Uses two key insights: 1) Represents scene motion with compact SE(3) motion bases where each point’s motion is a linear combination, enabling soft decomposition into rigidly-moving groups. 2) Leverages off-the-shelf data-driven priors (monocular depth maps, long-range 2D tracks) and consolidates them into globally consistent dynamic scene representations.

Result: Achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.

Conclusion: The proposed method successfully reconstructs generic dynamic scenes with explicit, persistent 3D motion trajectories from monocular videos, outperforming existing approaches.

Abstract: Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. We introduce a method for reconstructing generic dynamic scenes, featuring explicit, persistent 3D motion trajectories in the world coordinate frame, from casually captured monocular videos. We tackle the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point’s motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we take advantage of off-the-shelf data-driven priors such as monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes. Project Page: https://shape-of-motion.github.io/

[289] GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Jonathan Roberts, Kai Han, Samuel Albanie

Main category: cs.CV

TL;DR: GRAB is a challenging graph analysis benchmark for large multimodal models, featuring 3284 synthetic questions across 5 tasks and 23 graph properties, with current models achieving only up to 21.0% accuracy.

Details

Motivation: Existing benchmarks have insufficient headroom for evaluating next-generation LMMs, and graph analysis represents an important area where LMMs show potential for tasks like estimating means, intercepts, and correlations.

Method: Created a predominantly synthetic benchmark (GRAB) with 3284 high-quality, noise-free questions covering 5 tasks and 23 graph properties, then evaluated 20 LMMs on this benchmark.

Result: GRAB proved highly challenging, with the highest performing model achieving only 21.0% accuracy, demonstrating significant room for improvement in graph analysis capabilities.

Conclusion: GRAB serves as a demanding benchmark for current and future LMMs, encouraging progress in graph analysis capabilities through both the full GRAB and lightweight GRAB-Lite versions.

Abstract: Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area that LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is predominantly synthetic, ensuring high-quality, noise-free questions. GRAB is comprised of 3284 questions, covering five tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.0%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB and a lightweight GRAB-Lite to encourage progress in this important, growing domain.

[290] The Fluorescent Veil: A Stealthy and Effective Physical Adversarial Patch Against Traffic Sign Recognition

Shuai Yuan, Xingshuo Han, Hongwei Li, Guowen Xu, Wenbo Jiang, Tao Ni, Qingchuan Zhao, Yuguang Fang

Main category: cs.CV

TL;DR: FIPatch is a novel physical adversarial attack using fluorescent ink that causes traffic sign recognition systems to misclassify signs when triggered by invisible UV light, achieving high success rates while bypassing defenses.

Details

Motivation: Existing physical adversarial attacks on traffic sign recognition systems rely on conspicuous methods (stickers, projections) or easily blocked signals (light, acoustic). There's a need for more stealthy and effective attack methods.

Method: Model fluorescence effect digitally to identify optimal attack settings, then apply carefully designed fluorescence perturbation to target signs. The attack is triggered later using invisible ultraviolet light to cause misclassification.

Result: Achieved 98.31% success rate in low-light conditions and successfully bypassed five popular defenses with 96.72% success rate.

Conclusion: Fluorescent ink provides a stealthy and effective medium for physical adversarial attacks on traffic sign recognition systems, advancing the state-of-the-art in this domain.

Abstract: Recently, traffic sign recognition (TSR) systems have become a prominent target for physical adversarial attacks. These attacks typically rely on conspicuous stickers and projections, or using invisible light and acoustic signals that can be easily blocked. In this paper, we introduce a novel attack medium, i.e., fluorescent ink, to design a stealthy and effective physical adversarial patch, namely FIPatch, to advance the state-of-the-art. Specifically, we first model the fluorescence effect in the digital domain to identify the optimal attack settings, which guide the real-world fluorescence parameters. By applying a carefully designed fluorescence perturbation to the target sign, the attacker can later trigger a fluorescent effect using invisible ultraviolet light, causing the TSR system to misclassify the sign and potentially leading to traffic accidents. We conducted a comprehensive evaluation to investigate the effectiveness of FIPatch, which shows a success rate of 98.31% in low-light conditions. Furthermore, our attack successfully bypasses five popular defenses and achieves a success rate of 96.72%.

[291] Impact of Regularization on Calibration and Robustness: from the Representation Space Perspective

Jonghyun Park, Juyeop Kim, Jong-Seok Lee

Main category: cs.CV

TL;DR: The paper investigates how soft label regularization techniques (label smoothing, Mixup, CutMix) improve model calibration and robustness by analyzing representation space structure, decision boundaries, and feature distributions.

Details

Motivation: To understand the underlying mechanisms of why soft label regularization techniques not only improve classification accuracy but also enhance model calibration and adversarial robustness, as current explanations remain underexplored.

Method: Analysis of representation space structure by examining decision boundaries, feature distributions, confidence contours, and gradient directions in the penultimate layer features, focusing on how regularization affects these elements.

Result: The study uncovers central mechanisms in the representation space that induce improved calibration and robustness through adjustments in feature distributions relative to confidence contours and gradient directions.

Conclusion: The findings provide new insights into the characteristics of high-dimensional representation space and how training with soft label regularization affects model behavior, offering explanations for improved calibration and robustness.

Abstract: Recent studies have shown that regularization techniques using soft labels, e.g., label smoothing, Mixup, and CutMix, not only enhance image classification accuracy but also mitigate miscalibration due to overconfident predictions, and improve robustness against adversarial attacks. However, the underlying mechanisms of such improvements remain underexplored. In this paper, we offer a novel explanation from the perspective of the representation space (i.e., the space of the features obtained at the penultimate layer). Based on examination of decision boundaries and structure of features (or representation vectors), our study investigates confidence contours and gradient directions within the representation space. Furthermore, we analyze the adjustments in feature distributions due to regularization in relation to these contours and directions, from which we uncover central mechanisms inducing improved calibration and robustness. Our findings provide new insights into the characteristics of the high-dimensional representation space in relation to training and regularization using soft labels.

[292] Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision

Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang

Main category: cs.CV

TL;DR: The paper reveals that grounding ability can emerge in large multimodal models without explicit supervision, introduces an “attend-and-segment” method for pixel-level segmentation, and proposes DIFFLMM with diffusion-based visual encoder that achieves competitive performance on grounding tasks.

Details

Motivation: Current LMMs face challenges in grounding (relating language to visual entities), and existing approaches rely on fine-tuning with additional grounding supervision, which limits generalizability and scalability.

Method: Introduces “attend-and-segment” method using attention maps from standard LMMs for pixel-level segmentation, and proposes DIFFLMM with diffusion-based visual encoder trained with weak supervision instead of explicit grounding supervision.

Result: Achieves competitive performance on both grounding-specific and general VQA benchmarks, with 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the supervised model GLaMM.

Conclusion: Grounding ability can emerge in LMMs without explicit supervision, and using diffusion-based visual encoders with weak supervision provides more generalizable and scalable approach compared to grounding-specific supervision.

Abstract: Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an “attend-and-segment” method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://GroundLMM-ICCV.github.io.

[293] CAP: Evaluation of Persuasive and Creative Image Generation

Aysan Aghazadeh, Adriana Kovashka

Main category: cs.CV

TL;DR: Proposes three novel metrics (CAP) to evaluate creativity, alignment, and persuasiveness in advertisement image generation, and introduces an approach to enhance T2I models’ performance on implicit prompts.

Details

Motivation: Existing T2I evaluation methods focus on explicit descriptions but fail to assess alignment with visually implicit prompts, creativity, and persuasiveness - essential qualities for effective advertisement images.

Method: Introduces three novel evaluation metrics for Creativity, prompt Alignment, and Persuasiveness (CAP), and proposes a simple yet effective approach to enhance T2I models’ capabilities.

Result: Current T2I models struggle with creativity, persuasiveness, and alignment when input text contains implicit messages rather than explicit descriptions.

Conclusion: The proposed CAP metrics address gaps in current T2I evaluation, and the enhancement approach helps improve models’ performance on generating creative, aligned, and persuasive advertisement images from implicit prompts.

Abstract: We address the task of advertisement image generation and introduce three evaluation metrics to assess Creativity, prompt Alignment, and Persuasiveness (CAP) in generated advertisement images. Despite recent advancements in Text-to-Image (T2I) generation and their performance in generating high-quality images for explicit descriptions, evaluating these models remains challenging. Existing evaluation methods focus largely on assessing alignment with explicit, detailed descriptions, but evaluating alignment with visually implicit prompts remains an open problem. Additionally, creativity and persuasiveness are essential qualities that enhance the effectiveness of advertisement images, yet are seldom measured. To address this, we propose three novel metrics for evaluating the creativity, alignment, and persuasiveness of generated images. Our findings reveal that current T2I models struggle with creativity, persuasiveness, and alignment when the input text is implicit messages. We further introduce a simple yet effective approach to enhance T2I models’ capabilities in producing images that are better aligned, more creative, and more persuasive.

[294] Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation

Jhe-Hao Lin, Yi Yao, Chan-Feng Hsu, Hongxia Xie, Hong-Han Shuai, Wen-Huang Cheng

Main category: cs.CV

TL;DR: PAT is a knowledge distillation framework that enables feature distillation across heterogeneous architectures using prompt tuning blocks and region-aware attention to address view mismatch between different model types.

Details

Motivation: As diverse neural architectures (CNNs, ViTs, MLPs) emerge, traditional KD methods assuming teacher-student homogeneity become insufficient, creating need for universal KD framework compatible with any architecture.

Method: Two key components: 1) Prompt tuning blocks with student feedback to adapt teacher features to student’s learning process, 2) Region-aware attention to mitigate view mismatch problem between heterogeneous architectures.

Result: Extensive experiments on CIFAR, ImageNet, and COCO datasets demonstrate superiority of the proposed method over existing approaches.

Conclusion: The PAT framework effectively enables knowledge distillation across diverse architectures through perspective-aware teaching mechanisms.

Abstract: Knowledge distillation (KD) involves transferring knowledge from a pre-trained heavy teacher model to a lighter student model, thereby reducing the inference cost while maintaining comparable effectiveness. Prior KD techniques typically assume homogeneity between the teacher and student models. However, as technology advances, a wide variety of architectures have emerged, ranging from initial Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs), and Multi-Level Perceptrons (MLPs). Consequently, developing a universal KD framework compatible with any architecture has become an important research topic. In this paper, we introduce a perspective-aware teaching (PAT) KD framework to enable feature distillation across diverse architectures. Our framework comprises two key components. First, we design prompt tuning blocks that incorporate student feedback, allowing teacher features to adapt to the student model’s learning process. Second, we propose region-aware attention to mitigate the view mismatch problem between heterogeneous architectures. By leveraging these two modules, effective distillation of intermediate features can be achieved across heterogeneous architectures. Extensive experiments on CIFAR, ImageNet, and COCO demonstrate the superiority of the proposed method. Our code is available at https://github.com/jimmylin0979/PAT.git.

[295] HuGDiffusion: Generalizable Single-Image Human Rendering via 3D Gaussian Diffusion

Yingzhi Tang, Qijian Zhang, Junhui Hou

Main category: cs.CV

TL;DR: HuGDiffusion is a diffusion-based framework that generates 3D Gaussian splatting attributes from single-view images for novel view synthesis of human characters, using human priors and multi-stage attribute generation.

Details

Motivation: Existing methods require monocular videos or calibrated multi-view images, which limits applicability in real-world scenarios with arbitrary or unknown camera poses. The goal is to enable novel view synthesis from single-view input images.

Method: Uses human-centric feature extraction to generate conditioning signals, employs multi-stage generation strategy for different 3DGS attributes, and constructs proxy ground-truth 3D Gaussian attributes for supervision during training.

Result: HuGDiffusion shows significant performance improvements over state-of-the-art methods in extensive experiments.

Conclusion: The proposed method successfully achieves novel view synthesis of human characters from single-view images using a diffusion-based framework with human priors and multi-stage attribute generation.

Abstract: We present HuGDiffusion, a generalizable 3D Gaussian splatting (3DGS) learning pipeline to achieve novel view synthesis (NVS) of human characters from single-view input images. Existing approaches typically require monocular videos or calibrated multi-view images as inputs, whose applicability could be weakened in real-world scenarios with arbitrary and/or unknown camera poses. In this paper, we aim to generate the set of 3DGS attributes via a diffusion-based framework conditioned on human priors extracted from a single image. Specifically, we begin with carefully integrated human-centric feature extraction procedures to deduce informative conditioning signals. Based on our empirical observations that jointly learning the whole 3DGS attributes is challenging to optimize, we design a multi-stage generation strategy to obtain different types of 3DGS attributes. To facilitate the training process, we investigate constructing proxy ground-truth 3D Gaussian attributes as high-quality attribute-level supervision signals. Through extensive experiments, our HuGDiffusion shows significant performance improvements over the state-of-the-art methods. Our code will be made publicly available.

[296] LinPrim: Linear Primitives for Differentiable Volumetric Rendering

Nicolas von Lützow, Matthias Nießner

Main category: cs.CV

TL;DR: The paper introduces two new volumetric scene representations using linear primitives (octahedra and tetrahedra) with a differentiable rasterizer for efficient optimization, achieving comparable performance to state-of-the-art methods with fewer primitives.

Details

Motivation: To explore alternative volumetric scene representations beyond NeRF and 3D Gaussians, using linear primitives to potentially expand the design space for 3D scene representations.

Method: Introduces octahedra and tetrahedra as homogeneous volumetric primitives bounded by triangular faces, with a differentiable rasterizer for GPU-efficient optimization and real-time rendering capabilities.

Result: Achieves comparable performance to state-of-the-art volumetric methods on real-world datasets while requiring fewer primitives to achieve similar reconstruction fidelity.

Conclusion: The work deepens understanding of 3D representations by analyzing transparent polyhedra characteristics and suggests that novel primitives can expand the available design space for volumetric rendering.

Abstract: Volumetric rendering has become central to modern novel view synthesis methods, which use differentiable rendering to optimize 3D scene representations directly from observed views. While many recent works build on NeRF or 3D Gaussians, we explore an alternative volumetric scene representation. More specifically, we introduce two new scene representations based on linear primitives - octahedra and tetrahedra - both of which define homogeneous volumes bounded by triangular faces. To optimize these primitives, we present a differentiable rasterizer that runs efficiently on GPUs, allowing end-to-end gradient-based optimization while maintaining real-time rendering capabilities. Through experiments on real-world datasets, we demonstrate comparable performance to state-of-the-art volumetric methods while requiring fewer primitives to achieve similar reconstruction fidelity. Our findings deepen the understanding of 3D representations by providing insights into the fidelity and performance characteristics of transparent polyhedra and suggest that adopting novel primitives can expand the available design space.

[297] TRACE: Your Diffusion Model is Secretly an Instance Edge Detector

Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Jonghyun Choi, Jaesik Park, Kyungsu Kim

Main category: cs.CV

TL;DR: TRACE leverages diffusion models as hidden instance edge annotators by identifying object boundaries in self-attention maps, achieving significant improvements in unsupervised instance segmentation without needing instance-level labels.

Details

Motivation: Traditional instance and panoptic segmentation methods rely on costly dense annotations like masks or boxes. Unsupervised approaches are limited by semantic backbone constraints and human bias, often producing poor quality outputs. The paper aims to discover that diffusion models inherently contain instance boundary information.

Method: TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This eliminates the need for per-image diffusion inversion.

Result: TRACE achieves 81x faster inference while producing sharper and more connected boundaries. On COCO benchmark, it improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels.

Conclusion: Diffusion models encode hidden instance boundary priors, and decoding these signals offers a practical and scalable alternative to costly manual annotation for instance segmentation tasks.

Abstract: High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain constrained by semantic backbone constraints and human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81x faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation. Code is available at https://github.com/shjo-april/DiffEGG.

[298] Falcon: A Remote Sensing Vision-Language Foundation Model (Technical Report)

Kelu Yao, Nuo Xu, Rong Yang, Yingying Xu, Zhuoyan Gao, Titinunt Kitrungrotsakul, Yi Ren, Pu Zhang, Jin Wang, Ning Wei, Chao Li

Main category: cs.CV

TL;DR: Falcon is a vision-language foundation model for remote sensing that uses a unified prompt-based approach to handle 14 different tasks including classification, detection, segmentation, and captioning across 67 datasets with only 0.7B parameters.

Details

Motivation: To create a holistic vision-language foundation model specifically for remote sensing that can handle comprehensive and complex tasks through simple natural language instructions, addressing the need for unified AI systems in this domain.

Method: Developed Falcon_SFT - a large-scale multi-task instruction-tuning dataset with 78M data samples covering 5.6M multi-resolution, multi-view remote sensing images with hierarchical annotations. The model uses a prompt-based paradigm to process natural language instructions and remote sensing images.

Result: Falcon achieves remarkable performance across 67 datasets and 14 tasks despite having only 0.7B parameters. It demonstrates powerful understanding and reasoning abilities at image, region, and pixel levels.

Conclusion: Falcon represents a significant advancement in remote sensing AI, providing a unified foundation model that effectively handles diverse tasks through natural language instructions. The release of dataset, code, and model weights aims to foster further development in the open-source community.

Abstract: This paper introduces a holistic vision-language foundation model tailored for remote sensing, named Falcon. Falcon offers a unified, prompt-based paradigm that effectively executes comprehensive and complex remote sensing tasks. Falcon demonstrates powerful understanding and reasoning abilities at the image, region, and pixel levels. Specifically, given simple natural language instructions and remote sensing images, Falcon can produce impressive results in text form across 14 distinct tasks, i.e., image classification, object detection, segmentation, image captioning, and etc. To facilitate Falcon’s training and empower its representation capacity to encode rich spatial and semantic information, we developed Falcon_SFT, a large-scale, multi-task, instruction-tuning dataset in the field of remote sensing. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million multi-spatial resolution and multi-view remote sensing images with diverse instructions. It features hierarchical annotations and undergoes manual sampling verification to ensure high data quality and reliability. Extensive comparative experiments are conducted, which verify that Falcon achieves remarkable performance over 67 datasets and 14 tasks, despite having only 0.7B parameters. We release the complete dataset, code, and model weights at https://github.com/TianHuiLab/Falcon, hoping to help further develop the open-source community.

[299] Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation

Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, Weijia Li

Main category: cs.CV

TL;DR: FakeVLM is a large multimodal model for synthetic image and DeepFake detection that provides natural language explanations of image artifacts, trained on the FakeClue dataset with over 100,000 images across 7 categories.

Details

Motivation: Existing methods for image authenticity assessment lack human interpretability and struggle with the growing complexity of synthetic data generated by AIGC technologies.

Method: Developed FakeVLM, a specialized large multimodal model that distinguishes real from fake images and provides natural language explanations for image artifacts. Created FakeClue dataset with 100,000+ images across 7 categories annotated with fine-grained artifact clues.

Result: FakeVLM achieves performance comparable to expert models without needing additional classifiers, demonstrating superiority in both authenticity classification and artifact explanation tasks across multiple datasets.

Conclusion: FakeVLM sets a new benchmark for synthetic image detection by combining high performance with enhanced interpretability through natural language explanations of image artifacts.

Abstract: With the rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images have become increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Despite the effectiveness of existing methods in evaluating image authenticity and locating forgeries, these approaches often lack human interpretability and do not fully address the growing complexity of synthetic data. To tackle these challenges, we introduce FakeVLM, a specialized large multimodal model designed for both general synthetic image and DeepFake detection tasks. FakeVLM not only excels in distinguishing real from fake images but also provides clear, natural language explanations for image artifacts, enhancing interpretability. Additionally, we present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. FakeVLM demonstrates performance comparable to expert models while eliminating the need for additional classifiers, making it a robust solution for synthetic data detection. Extensive evaluations across multiple datasets confirm the superiority of FakeVLM in both authenticity classification and artifact explanation tasks, setting a new benchmark for synthetic image detection. The code, model weights, and dataset can be found here: https://github.com/opendatalab/FakeVLM.

[300] OmnimatteZero: Fast Training-free Omnimatte with Pre-trained Video Diffusion Models

Dvir Samuel, Matan Levy, Nir Darshan, Gal Chechik, Rami Ben-Ari

Main category: cs.CV

TL;DR: OmnimatteZero is a training-free video decomposition method that uses pre-trained video diffusion models to remove objects, extract object layers with effects, and composite them onto new videos in real-time.

Details

Motivation: Existing omnimatte methods require extensive training or costly self-supervised optimization, creating a need for more efficient approaches.

Method: Adapts zero-shot image inpainting for video object removal using temporal and spatial attention guidance modules, and leverages self-attention maps to capture object footprints and effects.

Result: Achieves superior background reconstruction performance and sets new speed records with real-time performance and minimal frame runtime.

Conclusion: OmnimatteZero provides an efficient, training-free solution for video decomposition that outperforms existing methods in both quality and speed.

Abstract: In Omnimatte, one aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization. In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos. These are accomplished by adapting zero-shot image inpainting techniques for video object removal, a task they fail to handle effectively out-of-the-box. To overcome this, we introduce temporal and spatial attention guidance modules that steer the diffusion process for accurate object removal and temporally consistent background reconstruction. We further show that self-attention maps capture information about the object and its footprints and use them to inpaint the object’s effects, leaving a clean background. Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior performance in terms of background reconstruction but also sets a new record for the fastest Omnimatte approach, achieving real-time performance with minimal frame runtime.

[301] Training-Free Personalization via Retrieval and Reasoning on Fingerprints

Deepayan Das, Davide Talon, Yiming Wang, Massimiliano Mancini, Elisa Ricci

Main category: cs.CV

TL;DR: R2P is a training-free personalization method for Vision Language Models that uses concept fingerprints, retrieval, reasoning, and cross-modal verification to understand user-specific concepts without requiring model training.

Details

Motivation: VLMs struggle with user-specific concepts, and existing personalization methods rely on costly training procedures. This work explores training-free personalization for the first time.

Method: Extracts concept fingerprints (key attributes), retrieves similar fingerprints via chain-of-thought reasoning, validates scores through cross-modal verification, and refines associations via pairwise multimodal matching.

Result: R2P consistently outperforms state-of-the-art approaches on various downstream tasks across multiple benchmarks, including a new dataset for visually ambiguous concepts.

Conclusion: The proposed training-free R2P method effectively addresses personalization in VLMs without requiring model training, demonstrating superior performance over existing approaches.

Abstract: Vision Language Models (VLMs) have lead to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but heavily rely on training procedures, that can be either costly or unpleasant to individual users. We depart from existing work, and for the first time explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought-reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), for concept identification highlighting challenges in visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.

[302] 3DOT: Texture Transfer for 3DGS Objects from a Single Reference Image

Xiao Cao, Beibei Lin, Bo Wang, Zhiyong Huang, Robby T. Tan

Main category: cs.CV

TL;DR: 3DSwapping is a novel method for 3D texture swapping that overcomes limitations of existing approaches by integrating progressive generation, view-consistency gradient guidance, and prompt-tuned gradient guidance to achieve high-fidelity texture transfer with structural coherence across multiple viewpoints.

Details

Motivation: Current methods for 3D texture swapping have limitations: 2D editing requires frame-by-frame manipulation causing inconsistencies, while text-driven 3D editing struggles to preserve texture characteristics from reference images. There is no dedicated method for efficient and versatile 3D texture customization.

Method: 3DSwapping integrates three key components: 1) Progressive generation that starts with editing a single reference image and gradually propagates edits to adjacent views, 2) View-consistency gradient guidance that conditions generation on feature differences between consistent and inconsistent outputs, and 3) Prompt-tuning-based gradient guidance that learns a token capturing the difference between reference image and 3D object to guide editing.

Result: The method achieves higher-fidelity texture transfer while preserving structural coherence across multiple viewpoints. Extensive qualitative and quantitative evaluations confirm that the three novel components enable convincing and effective 2D texture swapping for 3D objects.

Conclusion: 3DSwapping successfully addresses the challenges of 3D texture swapping by integrating progressive generation, view-consistency guidance, and prompt-tuned guidance, enabling efficient customization of 3D object textures with improved consistency and texture preservation.

Abstract: 3D texture swapping allows for the customization of 3D object textures, enabling efficient and versatile visual transformations in 3D editing. While no dedicated method exists, adapted 2D editing and text-driven 3D editing approaches can serve this purpose. However, 2D editing requires frame-by-frame manipulation, causing inconsistencies across views, while text-driven 3D editing struggles to preserve texture characteristics from reference images. To tackle these challenges, we introduce 3DSwapping, a 3D texture swapping method that integrates: 1) progressive generation, 2) view-consistency gradient guidance, and 3) prompt-tuned gradient guidance. To ensure view consistency, our progressive generation process starts by editing a single reference image and gradually propagates the edits to adjacent views. Our view-consistency gradient guidance further reinforces consistency by conditioning the generation model on feature differences between consistent and inconsistent outputs. To preserve texture characteristics, we introduce prompt-tuning-based gradient guidance, which learns a token that precisely captures the difference between the reference image and the 3D object. This token then guides the editing process, ensuring more consistent texture preservation across views. Overall, 3DSwapping integrates these novel strategies to achieve higher-fidelity texture transfer while preserving structural coherence across multiple viewpoints. Extensive qualitative and quantitative evaluations confirm that our three novel components enable convincing and effective 2D texture swapping for 3D objects. Code will be available upon acceptance.

[303] On Large Multimodal Models as Open-World Image Classifiers

Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, Elisa Ricci

Main category: cs.CV

TL;DR: Evaluating Large Multimodal Models (LMMs) for open-world image classification using natural language prompts, revealing challenges with granularity and fine-grained recognition that can be improved with tailored prompting.

Details

Motivation: Existing LMM classification studies are limited to closed-world settings with predefined categories, but LMMs can classify images using natural language without predefined categories. This work addresses the gap by evaluating LMMs in truly open-world settings.

Method: Formalized the open-world classification task and introduced an evaluation protocol with various metrics to assess alignment between predicted and ground truth classes. Evaluated 13 models across 10 benchmarks covering prototypical, non-prototypical, fine-grained, and very fine-grained classes.

Result: Demonstrated significant challenges LMMs face in open-world classification, particularly with granularity and fine-grained capabilities. Analysis revealed specific error types and showed that tailored prompting and reasoning can help alleviate these challenges.

Conclusion: LMMs struggle with open-world image classification, especially with fine-grained distinctions, but targeted prompting strategies can improve performance. The study provides a comprehensive evaluation framework for assessing LMM classification capabilities beyond traditional closed-world settings.

Abstract: Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt “What is the main object in the image?”). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them.

[304] ELASTIC: Efficient Once For All Iterative Search for Object Detection on Microcontrollers

Tony Tran, Qin Lin, Bin Hu

Main category: cs.CV

TL;DR: ELASTIC is a hardware-aware Neural Architecture Search framework that optimizes object detectors for TinyML platforms by alternating module optimization with population passthrough, achieving better performance and efficiency than existing methods.

Details

Motivation: Deploying high-performance object detectors on TinyML platforms is challenging due to hardware constraints and modular complexity of detection pipelines. Existing NAS methods either optimize modules individually (sacrificing synergy) or require computationally intensive global searches.

Method: ELASTIC uses a unified, hardware-aware NAS framework that alternates optimization across modules (backbone, neck, head) cyclically. It introduces Population Passthrough mechanism in evolutionary search to retain high-quality candidates between search stages for faster convergence.

Result: ELASTIC achieves +4.75% higher mAP and 2x faster convergence than progressive NAS on SVHN, +9.09% mAP improvement on PascalVOC, and 72.3% mAP on PascalVOC (outperforming MCUNET by 20.9% and TinyissimoYOLO by 16.3%). On microcontrollers, it reduces energy by up to 71.6%, lowers latency by 2.4x, and improves mAP by up to 6.99 percentage points.

Conclusion: ELASTIC provides an effective solution for automated design of efficient object detectors on resource-constrained platforms, demonstrating superior performance and efficiency compared to existing methods through its cyclic optimization approach and population passthrough mechanism.

Abstract: Deploying high-performance object detectors on TinyML platforms poses significant challenges due to tight hardware constraints and the modular complexity of modern detection pipelines. Neural Architecture Search (NAS) offers a path toward automation, but existing methods either restrict optimization to individual modules, sacrificing cross-module synergy, or require global searches that are computationally intractable. We propose ELASTIC (Efficient Once for AlL IterAtive Search for ObjecT DetectIon on MiCrocontrollers), a unified, hardware-aware NAS framework that alternates optimization across modules (e.g., backbone, neck, and head) in a cyclic fashion. ELASTIC introduces a novel Population Passthrough mechanism in evolutionary search that retains high-quality candidates between search stages, yielding faster convergence, up to an 8% final mAP gain, and eliminates search instability observed without population passthrough. In a controlled comparison, empirical results show ELASTIC achieves +4.75% higher mAP and 2x faster convergence than progressive NAS strategies on SVHN, and delivers a +9.09% mAP improvement on PascalVOC given the same search budget. ELASTIC achieves 72.3% mAP on PascalVOC, outperforming MCUNET by 20.9% and TinyissimoYOLO by 16.3%. When deployed on MAX78000/MAX78002 microcontrollers, ELASTICderived models outperform Analog Devices’ TinySSD baselines, reducing energy by up to 71.6%, lowering latency by up to 2.4x, and improving mAP by up to 6.99 percentage points across multiple datasets.

[305] EDIT: Enhancing Vision Transformers by Mitigating Attention Sink through an Encoder-Decoder Architecture

Wenfeng Feng, Hongxiang Wang, Jianlong Wang, Xin Zhang, Jingjing Zhao, Yueyue Liang, Xiang Chen, Duokui Han

Main category: cs.CV

TL;DR: EDIT is a novel encoder-decoder Vision Transformer architecture that addresses attention sink by using layer-aligned cross-attention to progressively refine [CLS] token representations from low-level to high-level features.

Details

Motivation: To mitigate the attention sink phenomenon where Vision Transformers allocate excessive attention to the [CLS] token, distorting their ability to effectively process image patches.

Method: Layer-aligned encoder-decoder architecture: encoder uses self-attention on image patches, decoder uses cross-attention on [CLS] token, allowing progressive refinement from low-level to high-level features.

Result: Consistent performance improvements over DeiT3 models on ImageNet-1k and ImageNet-21k, with enhanced transfer learning capabilities. Naturally interpretable through sequential attention maps.

Conclusion: EDIT effectively addresses attention sink and improves visual feature extraction through its novel encoder-decoder design with progressive layer-by-layer refinement.

Abstract: In this paper, we propose EDIT (Encoder-Decoder Image Transformer), a novel architecture designed to mitigate the attention sink phenomenon observed in Vision Transformer models. Attention sink occurs when an excessive amount of attention is allocated to the [CLS] token, distorting the model’s ability to effectively process image patches. To address this, we introduce a layer-aligned encoder-decoder architecture, where the encoder utilizes self-attention to process image patches, while the decoder uses cross-attention to focus on the [CLS] token. Unlike traditional encoder-decoder framework, where the decoder depends solely on high-level encoder representations, EDIT allows the decoder to extract information starting from low-level features, progressively refining the representation layer by layer. EDIT is naturally interpretable demonstrated through sequential attention maps, illustrating the refined, layer-by-layer focus on key image features. Experiments on ImageNet-1k and ImageNet-21k, along with transfer learning tasks, show that EDIT achieves consistent performance improvements over DeiT3 models. These results highlight the effectiveness of EDIT’s design in addressing attention sink and improving visual feature extraction.

[306] KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, Emad Barsoum

Main category: cs.CV

TL;DR: KeyVID is a keyframe-aware audio-to-visual animation framework that improves generation quality for dramatic motions by localizing keyframe time steps from audio, generating corresponding visual keyframes, and interpolating intermediate frames.

Details

Motivation: Current audio-to-visual animation models use uniformly sampled frames which fail to capture significant key moments in dramatic motions at low frame rates and require excessive memory when increasing frame count directly.

Method: 1) Localize keyframe time steps from audio input, 2) Generate corresponding visual keyframes using a keyframe generator, 3) Generate all intermediate frames using a motion interpolator.

Result: KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions, while maintaining computational efficiency.

Conclusion: The proposed KeyVID framework effectively addresses the limitations of uniform frame sampling by focusing on key moments in audio signals, leading to better quality animation for dramatic motions with improved efficiency.

Abstract: Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when increasing the number of frames directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released in https://github.com/XingruiWang/KeyVID.

[307] On Equivariance and Fast Sampling in Video Diffusion Models Trained with Warped Noise

Chao Liu, Arash Vahdat

Main category: cs.CV

TL;DR: EquiVDM is a video diffusion model that uses warped noise training to achieve spatial equivariance, enabling coherent video generation with better motion alignment and temporal consistency while requiring fewer sampling steps.

Details

Motivation: To address the need for temporally consistent video-to-video generation in applications like style transfer and upsampling, overcoming challenges in motion alignment and sampling efficiency.

Method: Combines warped noise training with standard denoising objectives to implicitly train models to be equivariant to spatial transformations of input noise, eliminating need for specialized modules or auxiliary losses.

Result: Achieves superior motion alignment, temporal consistency, and perceptual quality compared to prior methods, with substantially lower sampling costs and comparable/superior quality in fewer steps.

Conclusion: EquiVDM demonstrates that spatial equivariance through warped noise training enables more coherent video generation with better motion controllability and sampling efficiency, outperforming existing approaches across multiple benchmarks.

Abstract: Temporally consistent video-to-video generation is critical for applications such as style transfer and upsampling. In this paper, we provide a theoretical analysis of warped noise - a recently proposed technique for training video diffusion models - and show that pairing it with the standard denoising objective implicitly trains models to be equivariant to spatial transformations of the input noise, which we term EquiVDM. This equivariance enables motion in the input noise to align naturally with motion in the generated video, yielding coherent, high-fidelity outputs without the need for specialized modules or auxiliary losses. A further advantage is sampling efficiency: EquiVDM achieves comparable or superior quality in far fewer sampling steps. When distilled into one-step student models, EquiVDM preserves equivariance and delivers stronger motion controllability and fidelity than distilled nonequivariant baselines. Across benchmarks, EquiVDM consistently outperforms prior methods in motion alignment, temporal consistency, and perceptual quality, while substantially lowering sampling cost.

[308] ForensicHub: A Unified Benchmark & Codebase for All-Domain Fake Image Detection and Localization

Bo Du, Xuekang Zhu, Xiaochen Ma, Chenfan Qu, Kaiwen Feng, Zhe Yang, Chi-Man Pun, Jian Liu, Jizhe Zhou

Main category: cs.CV

TL;DR: ForensicHub is the first unified benchmark and codebase for fake image detection and localization (FIDL) that addresses domain fragmentation across deepfake detection, image manipulation detection, AI-generated content detection, and document manipulation detection.

Details

Motivation: The FIDL field is highly fragmented with four domains operating independently without interoperability, preventing cross-domain comparisons and hindering overall field development. No unified benchmark exists across all domains.

Method: Proposes a modular, configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators. Implements 10 baseline models, 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks through adapter-based design.

Result: Provides 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards through in-depth analysis using ForensicHub.

Conclusion: ForensicHub represents a significant leap forward in breaking domain silos in the FIDL field and inspiring future breakthroughs by enabling unified evaluation and comparison across all fake image detection domains.

Abstract: The field of Fake Image Detection and Localization (FIDL) is highly fragmented, encompassing four domains: deepfake detection (Deepfake), image manipulation detection and localization (IMDL), artificial intelligence-generated image detection (AIGC), and document image manipulation localization (Doc). Although individual benchmarks exist in some domains, a unified benchmark for all domains in FIDL remains blank. The absence of a unified benchmark results in significant domain silos, where each domain independently constructs its datasets, models, and evaluation protocols without interoperability, preventing cross-domain comparisons and hindering the development of the entire FIDL field. To close the domain silo barrier, we propose ForensicHub, the first unified benchmark & codebase for all-domain fake image detection and localization. Considering drastic variations on dataset, model, and evaluation configurations across all domains, as well as the scarcity of open-sourced baseline models and the lack of individual benchmarks in some domains, ForensicHub: i) proposes a modular and configuration-driven architecture that decomposes forensic pipelines into interchangeable components across datasets, transforms, models, and evaluators, allowing flexible composition across all domains; ii) fully implements 10 baseline models, 6 backbones, 2 new benchmarks for AIGC and Doc, and integrates 2 existing benchmarks of DeepfakeBench and IMDLBenCo through an adapter-based design; iii) conducts indepth analysis based on the ForensicHub, offering 8 key actionable insights into FIDL model architecture, dataset characteristics, and evaluation standards. ForensicHub represents a significant leap forward in breaking the domain silos in the FIDL field and inspiring future breakthroughs.

[309] TinyRS-R1: Compact Multimodal Language Model for Remote Sensing

Aybora Koksal, A. Aydin Alatan

Main category: cs.CV

TL;DR: TinyRS is a 2B-parameter multimodal small language model for remote sensing that achieves performance comparable to 7B models with 1/3 the memory and latency, with TinyRS-R1 variant adding reasoning capabilities.

Details

Motivation: Remote-sensing applications on edge hardware cannot host today's 7B-parameter multimodal models, requiring smaller but capable alternatives.

Method: Four-stage training pipeline: pre-training on satellite images, instruction tuning, fine-tuning with Chain-of-Thought annotations, and alignment via Group Relative Policy Optimization (GRPO).

Result: TinyRS-R1 achieves or surpasses 7B-parameter remote sensing models across classification, VQA, visual grounding, and open-ended QA while using 1/3 memory and latency.

Conclusion: CoT reasoning benefits spatial grounding and scene understanding, while non-reasoning TinyRS excels in latency-sensitive VQA tasks; TinyRS-R1 is the first domain-specialized MSLM with GRPO-aligned CoT reasoning for remote sensing.

Abstract: Remote-sensing applications often run on edge hardware that cannot host today’s 7B-parameter multimodal language models. This paper introduces TinyRS, the first 2B-parameter multimodal small language model (MSLM) optimized for remote sensing tasks, and TinyRS-R1, its reasoning-augmented variant. Built upon Qwen2-VL-2B, TinyRS is trained through a four-stage pipeline: pre-training on million satellite images, instruction tuning on visual instruction examples, fine-tuning with Chain-of-Thought (CoT) annotations from the proposed reasoning dataset, and alignment via Group Relative Policy Optimization (GRPO). TinyRS-R1 achieves or surpasses the performance of recent 7B-parameter remote sensing models across classification, VQA, visual grounding, and open-ended question answering-while requiring just one-third of the memory and latency. Our analysis shows that CoT reasoning substantially benefits spatial grounding and scene understanding, while the non-reasoning TinyRS excels in concise, latency-sensitive VQA tasks. TinyRS-R1 represents the first domain-specialized MSLM with GRPO-aligned CoT reasoning for general-purpose remote sensing.

[310] Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression

Sreetama Sarkar, Yue Che, Alex Gavin, Peter A. Beerel, Souvik Kundu

Main category: cs.CV

TL;DR: SPIN is an attention-guided head suppression method that reduces hallucinations in large vision language models by selectively suppressing attention heads with low attention to image tokens, achieving up to 2.7x reduction in hallucination scores without significant latency overhead.

Details

Motivation: Large vision language models suffer from hallucinations (texts misaligned with visual context), and existing inference-time intervention methods significantly increase latency.

Method: Task-agnostic attention-guided head suppression that selectively suppresses attention heads with low attention to image tokens while keeping top-K attention heads intact.

Result: Reduces hallucination scores up to 2.7x while maintaining F1 score, and improves throughput by 1.8x compared to existing alternatives on visual question answering and image description tasks.

Conclusion: SPIN effectively reduces hallucinations in LVLMs by targeting specific attention heads, providing a latency-free solution that maintains performance while improving throughput.

Abstract: Despite their remarkable progress in multimodal understanding tasks, large vision language models (LVLMs) often suffer from “hallucinations”, generating texts misaligned with the visual context. Existing methods aimed at reducing hallucinations through inference time intervention incur a significant increase in latency. To mitigate this, we present SPIN, a task-agnostic attention-guided head suppression strategy that can be seamlessly integrated during inference, without incurring any significant compute or latency overhead. We investigate whether hallucination in LVLMs can be linked to specific model components. Our analysis suggests that hallucinations can be attributed to a dynamic subset of attention heads in each layer. Leveraging this insight, for each text query token, we selectively suppress attention heads that exhibit low attention to image tokens, keeping the top-K attention heads intact. Extensive evaluations on visual question answering and image description tasks demonstrate the efficacy of SPIN in reducing hallucination scores up to 2.7x while maintaining F1, and improving throughput by 1.8x compared to existing alternatives. Code is available at https://github.com/YUECHE77/SPIN.

[311] Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models

Maria-Teresa De Rosa Palmini, Eva Cetinic

Main category: cs.CV

TL;DR: This paper introduces a benchmark to evaluate how Text-to-Image diffusion models represent historical contexts, revealing systematic inaccuracies including stereotyping, anachronisms, and implausible demographic patterns.

Details

Motivation: While prior research has examined demographic and cultural biases in TTI models, their ability to accurately represent historical contexts remains largely underexplored, creating a gap in understanding their societal and cultural implications.

Method: The authors created HistVis - a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, combined with a reproducible evaluation protocol.

Result: TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms (modern artifacts in pre-modern contexts), and fail to reflect plausible demographic patterns compared to historically plausible baselines.

Conclusion: The work provides an initial step toward building more historically accurate TTI models by establishing a reproducible benchmark for evaluating historical representation in generated imagery.

Abstract: As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery, as TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By providing a reproducible benchmark for historical representation in generated imagery, this work provides an initial step toward building more historically accurate TTI models.

[312] InfoDet: A Dataset for Infographic Element Detection

Jiangning Zhu, Yuxing Zhou, Zheng Wang, Juntao Yao, Yima Gu, Yuhui Yuan, Shixia Liu

Main category: cs.CV

TL;DR: InfoDet is a dataset for improving object detection in charts and human-recognizable objects (HROs) in infographics, containing over 100K infographics with 14M bounding box annotations.

Details

Motivation: Existing vision-language models have inaccurate visual grounding of infographic elements like charts and HROs, which limits their chart understanding capabilities that require identifying and reasoning over relevant elements.

Method: Created InfoDet dataset with 11,264 real and 90,000 synthetic infographics using model-in-the-loop and programmatic annotation methods. Applied it to develop a Thinking-with-Boxes scheme for VLMs and evaluate object detection models.

Result: Dataset contains over 14 million bounding box annotations. Demonstrated applications in boosting VLM chart understanding, comparing detection models, and extending to document layout and UI element detection.

Conclusion: InfoDet addresses the visual grounding limitation in VLMs for chart understanding and provides a valuable resource for developing accurate object detection models in infographics.

Abstract: Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce InfoDet, a dataset designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 11,264 real and 90,000 synthetic infographics, with over 14 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of InfoDet through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.

[313] ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation

Zhen Li, Duan Li, Yukai Guo, Xinyuan Guo, Bowen Li, Lanxi Xiao, Shenyu Qiao, Jiashu Chen, Zijian Wu, Hui Zhang, Xinhuan Shu, Shixia Liu

Main category: cs.CV

TL;DR: ChartGalaxy is a million-scale dataset for improving LVLMs’ understanding and generation of infographic charts, addressing their limitations with visually rich charts.

Details

Motivation: Infographic charts combine visual and textual elements, posing challenges for LVLMs trained on plain charts, creating a gap in handling complex chart structures.

Method: Created through inductive process identifying 75 chart types, 440 variations, and 68 layout templates from real infographics, then programmatically generating synthetic charts.

Result: Dataset enables: 1) improved infographic chart understanding via fine-tuning, 2) benchmarking code generation, and 3) example-based chart generation.

Conclusion: ChartGalaxy captures visual and structural complexity of real designs, providing valuable resource for enhancing multimodal reasoning and generation in LVLMs.

Abstract: Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 440 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.

[314] SphereDrag: Spherical Geometry-Aware Panoramic Image Editing

Zhiao Feng, Xuewei Li, Junjie Yang, Jingchao Li, Yuxin Peng, Xi Li

Main category: cs.CV

TL;DR: SphereDrag is a novel panoramic image editing framework that addresses geometric challenges in spherical images through adaptive reprojection, great-circle trajectory adjustment, and spherical search region tracking, achieving significant improvements over existing methods.

Details

Motivation: Panoramic image editing remains underexplored due to challenges from spherical geometry and projection distortions, including boundary discontinuity, trajectory deformation, and uneven pixel density.

Method: Proposes SphereDrag framework with three key components: adaptive reprojection (AR) for discontinuity, great-circle trajectory adjustment (GCTA) for accurate movement tracking, and spherical search region tracking (SSRT) for handling uneven pixel density. Also introduces PanoBench benchmark for evaluation.

Result: SphereDrag achieves considerable improvement in geometric consistency and image quality compared to existing methods, with up to 10.5% relative improvement.

Conclusion: The proposed SphereDrag framework effectively addresses panoramic editing challenges through spherical geometry-aware techniques, demonstrating superior performance in complex editing tasks.

Abstract: Image editing has made great progress on planar images, but panoramic image editing remains underexplored. Due to their spherical geometry and projection distortions, panoramic images present three key challenges: boundary discontinuity, trajectory deformation, and uneven pixel density. To tackle these issues, we propose SphereDrag, a novel panoramic editing framework utilizing spherical geometry knowledge for accurate and controllable editing. Specifically, adaptive reprojection (AR) uses adaptive spherical rotation to deal with discontinuity; great-circle trajectory adjustment (GCTA) tracks the movement trajectory more accurate; spherical search region tracking (SSRT) adaptively scales the search range based on spherical location to address uneven pixel density. Also, we construct PanoBench, a panoramic editing benchmark, including complex editing tasks involving multiple objects and diverse styles, which provides a standardized evaluation framework. Experiments show that SphereDrag gains a considerable improvement compared with existing methods in geometric consistency and image quality, achieving up to 10.5% relative improvement.

[315] Mapping Farmed Landscapes from Remote Sensing

Michelangelo Conserva, Alex Wilson, Charlotte Stanton, Vishal Batchu, Varun Gulshan

Main category: cs.CV

TL;DR: Farmscapes is the first large-scale, high-resolution map of rural landscape features in England, created using deep learning on aerial imagery to identify ecologically important elements like hedgerows, woodlands, and stone walls.

Details

Motivation: Agricultural landscape management is crucial for global biodiversity targets, but current efforts are limited by the absence of detailed, large-scale ecological maps.

Method: Used a deep learning segmentation model trained on 942 manually annotated tiles from aerial imagery to identify landscape features at 25cm resolution across most of England.

Result: Achieved high accuracy with F1-scores of 96% for woodland, 95% for farmed land, and 72% for hedgerows. The England-wide map is available on Google Earth Engine as an open-access tool.

Conclusion: This work enables data-driven habitat restoration planning, supports monitoring of biodiversity initiatives, and provides foundation for advanced landscape connectivity analysis.

Abstract: Effective management of agricultural landscapes is critical for meeting global biodiversity targets, but efforts are hampered by the absence of detailed, large-scale ecological maps. To address this, we introduce Farmscapes, the first large-scale (covering most of England), high-resolution (25cm) map of rural landscape features, including ecologically vital elements like hedgerows, woodlands, and stone walls. This map was generated using a deep learning segmentation model trained on a novel, dataset of 942 manually annotated tiles derived from aerial imagery. Our model accurately identifies key habitats, achieving high f1-scores for woodland (96%) and farmed land (95%), and demonstrates strong capability in segmenting linear features, with an F1-score of 72% for hedgerows. By releasing the England-wide map on Google Earth Engine, we provide a powerful, open-access tool for ecologists and policymakers. This work enables data-driven planning for habitat restoration, supports the monitoring of initiatives like the EU Biodiversity Strategy, and lays the foundation for advanced analysis of landscape connectivity.

[316] CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization

Jan Ackermann, Jonas Kulhanek, Shengqu Cai, Haofei Xu, Marc Pollefeys, Gordon Wetzstein, Leonidas Guibas, Songyou Peng

Main category: cs.CV

TL;DR: CL-Splats enables incremental updates to 3D Gaussian splatting representations from sparse scene captures, using change detection to optimize only modified areas while maintaining previous scene states for temporal analysis.

Details

Motivation: Dynamic 3D environments require efficient methods to update scene representations over time without re-optimizing entire scenes, which is computationally expensive for robotics, mixed reality, and embodied AI applications.

Method: Integrates a robust change-detection module that segments updated and static scene components, enabling focused local optimization. Supports storing and recovering previous scene states for temporal segmentation.

Result: Achieves efficient updates with improved reconstruction quality over state-of-the-art methods through incremental optimization of only changed areas.

Conclusion: Establishes a robust foundation for real-time adaptation in 3D scene reconstruction tasks by enabling efficient incremental updates while maintaining scene history.

Abstract: In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks.

[317] TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving

Yiming Yang, Yueru Luo, Bingkun He, Hongbin Lin, Suzhong Fu, Chao Zheng, Zhipeng Cao, Erlong Li, Chao Yan, Shuguang Cui, Zhen Li

Main category: cs.CV

TL;DR: TopoStreamer is an end-to-end temporal perception model for lane segment topology reasoning that improves road network reconstruction through streaming attribute constraints, dynamic positional encoding, and lane segment denoising.

Details

Motivation: Existing methods for lane segment topology reasoning have limitations in consistent positional embedding and temporal multiple attribute learning, which hinders accurate road network reconstruction needed for autonomous driving maneuvers like turning and lane changing.

Method: TopoStreamer introduces three key improvements: streaming attribute constraints for temporal consistency in centerline and boundary coordinates/classifications, dynamic lane boundary positional encoding for up-to-date positional information, and lane segment denoising to capture diverse lane patterns.

Result: On the OpenLane-V2 dataset, TopoStreamer achieves significant improvements: +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks compared to state-of-the-art methods.

Conclusion: TopoStreamer demonstrates substantial performance gains in lane segment topology reasoning, addressing key limitations in existing methods and providing better road network reconstruction for autonomous driving applications.

Abstract: Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.0% mAP in lane segment perception and +1.7% OLS in centerline perception tasks.

[318] Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification

Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Hemendra Kumar Pandey, Amitabha Das

Main category: cs.CV

TL;DR: This study evaluates hyperparameter optimization’s impact on 7 lightweight CNN/transformer models for real-time image classification, showing 1.5-3.5% accuracy improvements and identifying models suitable for edge deployment with <5ms latency and >9,800 FPS.

Details

Motivation: To understand how hyperparameter optimization affects lightweight neural networks' accuracy and deployment feasibility on resource-constrained devices for real-time image classification.

Method: Evaluated 7 lightweight architectures (ConvNeXt-T, EfficientNetV2-S, MobileNetV3-L, MobileViT v2, RepVGG-A2, TinyViT-21M) on 90,000 ImageNet-1K subset with standardized training settings, studying learning rate schedules, augmentation, optimizers, and initialization. Performed inference benchmarks on NVIDIA L40s GPU with batch sizes 1-512.

Result: Hyperparameter tuning improved top-1 accuracy by 1.5-3.5% over baselines. Selected models (RepVGG-A2, MobileNetV3-L) achieved <5ms latency and >9,800 FPS, making them ideal for edge deployment. Controlled hyperparameter variation significantly altered convergence dynamics.

Conclusion: Hyperparameter optimization is crucial for balancing speed and accuracy in lightweight models, providing insights into stability regions and deployment feasibility for edge AI applications.

Abstract: Lightweight convolutional and transformer-based networks are increasingly preferred for real-time image classification, especially on resource-constrained devices. This study evaluates the impact of hyperparameter optimization on the accuracy and deployment feasibility of seven modern lightweight architectures: ConvNeXt-T, EfficientNetV2-S, MobileNetV3-L, MobileViT v2 (S/XS), RepVGG-A2, and TinyViT-21M, trained on a class-balanced subset of 90,000 images from ImageNet-1K. Under standardized training settings, this paper investigates the influence of learning rate schedules, augmentation, optimizers, and initialization on model performance. Inference benchmarks are performed using an NVIDIA L40s GPU with batch sizes ranging from 1 to 512, capturing latency and throughput in real-time conditions. This work demonstrates that controlled hyperparameter variation significantly alters convergence dynamics in lightweight CNN and transformer backbones, providing insight into stability regions and deployment feasibility in edge artificial intelligence. Our results reveal that tuning alone leads to a top-1 accuracy improvement of 1.5 to 3.5 percent over baselines, and select models (e.g., RepVGG-A2, MobileNetV3-L) deliver latency under 5 milliseconds and over 9,800 frames per second, making them ideal for edge deployment. This work provides reproducible, subset-based insights into lightweight hyperparameter tuning and its role in balancing speed and accuracy. The code and logs may be seen at: https://vineetkumarrakesh.github.io/lcnn-opt

[319] FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models

Yiming Yang, Hongbin Lin, Yueru Luo, Suzhong Fu, Chao Zheng, Xinrui Yan, Shuqi Mei, Kun Tang, Shuguang Cui, Zhen Li

Main category: cs.CV

TL;DR: FASTopoWM is a fast-slow lane topology reasoning framework with latent world models that improves temporal perception for autonomous driving by addressing limitations of existing methods.

Details

Motivation: Existing lane topology reasoning methods fail to effectively leverage temporal information and are vulnerable to pose estimation failures, limiting their performance in autonomous driving systems.

Method: Proposes a unified fast-slow framework with parallel supervision of historical and new queries, and introduces latent query and BEV world models conditioned on action latent to propagate state representations.

Result: Outperforms state-of-the-art methods on OpenLane-V2 benchmark with 37.4% mAP for lane detection (vs 33.6%) and 46.3% OLS for centerline perception (vs 41.5%).

Conclusion: FASTopoWM effectively addresses temporal perception limitations and significantly improves lane topology reasoning performance through its fast-slow architecture and latent world models.

Abstract: Lane segment topology reasoning provides comprehensive bird’s-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% v.s. 33.6% on mAP) and centerline perception (46.3% v.s. 41.5% on OLS).

[320] UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

Chaitanya Patel, Hiroki Nakamura, Yuta Kyuragi, Kazuki Kozuka, Juan Carlos Niebles, Ehsan Adeli

Main category: cs.CV

TL;DR: UniEgoMotion is a unified conditional motion diffusion model that generates and forecasts human motion from first-person images without explicit 3D scene data, using a novel head-centric representation and achieving state-of-the-art performance.

Details

Motivation: To address limitations of existing third-person motion synthesis methods in real-world egocentric settings where limited field of view, occlusions, and dynamic cameras hinder scene perception, enabling better AR/VR experiences, human-robot interaction, assistive technologies, and healthcare solutions.

Method: Proposes UniEgoMotion - a unified conditional motion diffusion model with head-centric motion representation that extracts image-based scene context from first-person visual inputs to infer plausible 3D motion. Uses EE4D-Motion dataset derived from EgoExo4D with pseudo-ground-truth 3D motion annotations.

Result: Achieves state-of-the-art performance in egocentric motion reconstruction and is the first model to generate motion from a single egocentric image. Extensive evaluations demonstrate effectiveness of the unified framework.

Conclusion: Sets a new benchmark for egocentric motion modeling and unlocks new possibilities for egocentric applications by effectively bridging the gap between first-person visual inputs and 3D motion synthesis without explicit 3D scene data.

Abstract: Egocentric human motion generation and forecasting with scene-context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion’s simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.

[321] Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong

Main category: cs.CV

TL;DR: ADAPT is a test-time adaptation method that reframes TTA as Gaussian probabilistic inference, enabling closed-form, training-free adaptation without backpropagation or source data.

Details

Motivation: Current TTA methods face limitations: reliance on backpropagation hinders real-time deployment, and lack of explicit class-conditional feature distribution modeling affects decision boundaries and prediction calibration.

Method: Models class-conditional likelihoods using gradually updated class means and shared covariance matrix. Uses CLIP priors and historical knowledge bank for regularization. No gradient updates, source data, or full target data access required.

Result: Achieves state-of-the-art performance across diverse benchmarks under various distribution shifts, with superior scalability and robustness.

Conclusion: ADAPT provides an effective backpropagation-free TTA solution that models class distributions explicitly, supporting both online and transductive settings with strong performance.

Abstract: Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.

[322] UrbanTwin: Synthetic LiDAR Datasets (LUMPI, V2X-Real-IC, and TUMTraf-I)

Muhammad Shahbaz, Shaurya Agarwal

Main category: cs.CV

TL;DR: UrbanTwin presents synthetic replicas of three real roadside lidar datasets with 10K annotated frames each, created using realistic digital twins that enable training and testing of 3D perception models with performance comparable to real data.

Details

Motivation: To create high-fidelity synthetic datasets that can replace or augment real-world lidar datasets for 3D perception tasks, addressing data scarcity and enabling custom scenario testing.

Method: Synthesized datasets using emulated lidar sensors within realistic digital twins modeled on actual locations’ geometry, road alignment, lane topology, and vehicle movement patterns.

Result: The synthetic datasets show strong alignment with real data through statistical and structural similarity analysis. Models trained solely on synthetic data achieve improved detection performance on real, unseen data compared to models trained on real data.

Conclusion: UrbanTwin datasets effectively enhance existing benchmarks by increasing sample size and scene diversity, and represent the first digitally synthesized datasets that can replace in-domain real-world datasets for lidar perception tasks.

Abstract: This article presents UrbanTwin datasets, high-fidelity, realistic replicas of three public roadside lidar datasets: LUMPI, V2X-Real-IC}}, and TUMTraf-I. Each UrbanTwin dataset contains 10K annotated frames corresponding to one of the public datasets. Annotations include 3D bounding boxes, instance segmentation labels, and tracking IDs for six object classes, along with semantic segmentation labels for nine classes. These datasets are synthesized using emulated lidar sensors within realistic digital twins, modeled based on surrounding geometry, road alignment at lane level, and the lane topology and vehicle movement patterns at intersections of the actual locations corresponding to each real dataset. Due to the precise digital twin modeling, the synthetic datasets are well aligned with their real counterparts, offering strong standalone and augmentative value for training deep learning models on tasks such as 3D object detection, tracking, and semantic and instance segmentation. We evaluate the alignment of the synthetic replicas through statistical and structural similarity analysis with real data, and further demonstrate their utility by training 3D object detection models solely on synthetic data and testing them on real, unseen data. The high similarity scores and improved detection performance, compared to the models trained on real data, indicate that the UrbanTwin datasets effectively enhance existing benchmark datasets by increasing sample size and scene diversity. In addition, the digital twins can be adapted to test custom scenarios by modifying the design and dynamics of the simulations. To our knowledge, these are the first digitally synthesized datasets that can replace in-domain real-world datasets for lidar perception tasks. UrbanTwin datasets are publicly available at https://dataverse.harvard.edu/dataverse/ucf-ut.

[323] AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective

Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Suiyang Zhang, Yi He, Yuxing Han

Main category: cs.CV

TL;DR: AvatarSync is an autoregressive framework that generates realistic talking-head animations from audio/text using phoneme representations, addressing flicker issues and slow inference in diffusion models through temporal modeling and parallel acceleration.

Details

Motivation: To overcome limitations of diffusion models in talking-head animation, particularly inter-frame flicker and slow inference speed, which restrict practical deployment.

Method: Uses autoregressive pipeline on phoneme representations with two-stage generation strategy, phoneme-frame causal attention mask, and many-to-one mapping from text/audio to phonemes for precise alignment.

Result: Outperforms existing methods on Chinese (CMLR) and English (HDTF) datasets in visual fidelity, temporal consistency, and computational efficiency.

Conclusion: AvatarSync provides a scalable and controllable solution for realistic talking-head animation with improved performance over current approaches.

Abstract: Talking-head animation focuses on generating realistic facial videos from audio input. Following Generative Adversarial Networks (GANs), diffusion models have become the mainstream, owing to their robust generative capacities. However, inherent limitations of the diffusion process often lead to inter-frame flicker and slow inference, restricting their practical deployment. To address this, we introduce AvatarSync, an autoregressive framework on phoneme representations that generates realistic and controllable talking-head animations from a single reference image, driven directly by text or audio input. To mitigate flicker and ensure continuity, AvatarSync leverages an autoregressive pipeline that enhances temporal modeling. In addition, to ensure controllability, we introduce phonemes, which are the basic units of speech sounds, and construct a many-to-one mapping from text/audio to phonemes, enabling precise phoneme-to-visual alignment. Additionally, to further accelerate inference, we adopt a two-stage generation strategy that decouples semantic modeling from visual dynamics, and incorporate a customized Phoneme-Frame Causal Attention Mask to support multi-step parallel acceleration. Extensive experiments conducted on both Chinese (CMLR) and English (HDTF) datasets demonstrate that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency, providing a scalable and controllable solution.

[324] EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing

Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Jianwen Xie, Oscar Leong, Lijuan Wang, Ying Nian Wu, Mingyuan Zhou

Main category: cs.CV

TL;DR: EdiVal-Agent is an automated evaluation framework for instruction-based image editing that uses object-centric analysis to assess single-turn and multi-turn editing through novel metrics for instruction following, content consistency, and visual quality.

Details

Motivation: Current evaluation methods for instruction-based image editing are limited - they either depend on paired reference images (limited coverage, biased) or rely on imprecise zero-shot vision-language models. There's a need for more reliable and interpretable evaluation.

Method: EdiVal-Agent decomposes images into semantic objects, synthesizes context-aware editing instructions, and uses three novel metrics: EdiVal-IF (instruction following via object detectors + VLMs), EdiVal-CC (content consistency via semantic similarity), and EdiVal-VQ (visual quality via human preference models).

Result: The framework was used to build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 13 state-of-the-art editing models. It successfully identifies failure modes in existing models.

Conclusion: EdiVal-Agent provides a fine-grained, automated evaluation approach that can inform the development of next-generation image editing models by offering precise assessment of instruction following, content consistency, and visual quality.

Abstract: Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images-resulting in limited coverage and inheriting biases from prior generative models-or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated and fine-grained evaluation framework grounded in an object-centric perspective, designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. Given an input image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns. These two stages enable two novel object-centric metrics tailored for multi-turn evaluation and one global metric of visual quality: (1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for semantic verification on detector-guided crops; (2) EdiVal-CC, which evaluates content consistency by calculating semantic similarity of unchanged objects and background using the evolving object pools; and (3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models.

[325] Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception

Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu

Main category: cs.CV

TL;DR: SD-RPN is an efficient, annotation-free method that uses self-distillation to train a lightweight Region Proposal Network from MLLM’s attention maps, enabling high-resolution visual perception without costly supervision.

Details

Motivation: Current RoI methods for MLLMs face trade-offs: training-based approaches need large annotated datasets, while training-free methods are computationally inefficient and less accurate.

Method: Transform noisy attention maps from MLLM’s middle layers into pseudo-RoI labels through denoising, then train a lightweight RPN that predicts RoIs in a single forward pass.

Result: Achieves over 10% absolute accuracy improvement on unseen benchmarks (TextVQA, DocVQA, V-Star) with exceptional data efficiency using only 10K QA pairs.

Conclusion: Provides a practical and scalable solution for enhancing MLLM fine-grained perception without costly supervision or full model fine-tuning.

Abstract: Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model’s internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM’s middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns a more precise localization. This RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM’s middle layers, decoupling RoI identification from the auto-regressive generation and avoiding costly multi-pass operations. To validate our approach, we integrate the framework into multiple MLLM families. Despite being trained on only a few (e.g. 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over a 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.

[326] From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

Hang Du, Jiayang Zhang, Guoshun Nan, Wendi Deng, Zhenyan Chen, Chenyang Zhang, Wang Xiao, Shan Huang, Yuqi Pan, Tao Qi, Sicong Leng

Main category: cs.CV

TL;DR: MIR is a new benchmark for multi-image interleaved reasoning that requires joint reasoning across multiple images with interleaved textual contexts, addressing gaps in existing multi-image benchmarks.

Details

Motivation: Current multi-image benchmarks overlook interleaved textual contexts and neglect distinct relationships between individual images and their associated texts, limiting MLLMs' ability to reason across complex multi-modal scenarios.

Method: Introduces reasoning steps for each benchmark instance and proposes a stage-wise curriculum learning strategy that follows an “easy to hard” approach, progressively guiding models from simple to complex scenarios.

Result: Extensive experiments show that the proposed method significantly enhances MLLMs’ reasoning performance on both the MIR benchmark and other established benchmarks.

Conclusion: MIR will encourage further research into multi-image interleaved reasoning and facilitate advancements in MLLMs’ capability to handle complex inter-modal tasks.

Abstract: Multi-image Interleaved Reasoning aims to improve Multi-modal Large Language Models (MLLMs) ability to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. While current multi-image benchmarks overlook interleaved textual contexts and neglect distinct relationships between individual images and their associated texts, enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations. To bridge this gap, we introduce a novel benchmark MIR, requiring joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an “easy to hard” approach, progressively guiding models from simple to complex scenarios, thereby enhancing their ability to handle challenging tasks. Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models reasoning performance on MIR and other established benchmarks. We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advancements in MLLMs capability to handle complex inter-modal tasks.

[327] ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation

Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, Kfir Aberman

Main category: cs.CV

TL;DR: A new method for attribute-specific image prompting that enables compositional control over human appearance attributes like hair, clothing, and identity using separate reference images and diffusion models.

Details

Motivation: Prior methods lack modularity and fail to provide disentangled control over specific visual attributes in human image generation, focusing mainly on identity preservation.

Method: Encodes attribute-specific reference images into tokens injected into pre-trained text-to-image diffusion models, using cross-reference training with misaligned inputs to promote natural composition and robust disentanglement.

Result: Achieves state-of-the-art performance in accurately following both visual and textual prompts, enabling compositional control over multiple visual factors across multiple people in a single image.

Conclusion: The framework enables more configurable human image synthesis by combining visual prompting with text-driven generation, paving the way for better attribute-level control.

Abstract: Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation. Webpage is available at: https://snap-research.github.io/composeme/.

[328] DevFD: Developmental Face Forgery Detection by Learning Shared and Orthogonal LoRA Subspaces

Tianshuo Zhang, Li Gao, Siran Peng, Xiangyu Zhu, Zhen Lei

Main category: cs.CV

TL;DR: The paper proposes a continual learning approach for face forgery detection using a Developmental Mixture of Experts (MoE) architecture with LoRA models to adapt to evolving forgery techniques while preventing catastrophic forgetting.

Details

Motivation: The rapid evolution of digital face generation and manipulation techniques outpaces existing detection models, requiring systems that can quickly adapt to new forgery types with limited data while retaining knowledge of previous types.

Method: Uses a Developmental Mixture of Experts architecture with LoRA models organized into Real-LoRA for real faces and multiple Fake-LoRAs for different forgery types. Employs orthogonal learning directions and orthogonal gradients to prevent catastrophic forgetting.

Result: Experimental results show effectiveness under both datasets and manipulation types incremental protocols, demonstrating successful adaptation to new forgery types while maintaining detection capabilities for previously learned types.

Conclusion: The proposed continual learning framework effectively addresses the challenge of evolving face forgery techniques by enabling incremental learning while preventing catastrophic forgetting through orthogonal constraints.

Abstract: The rise of realistic digital face generation and manipulation poses significant social risks. The primary challenge lies in the rapid and diverse evolution of generation techniques, which often outstrip the detection capabilities of existing models. To defend against the ever-evolving new types of forgery, we need to enable our model to quickly adapt to new domains with limited computation and data while avoiding forgetting previously learned forgery types. In this work, we posit that genuine facial samples are abundant and relatively stable in acquisition methods, while forgery faces continuously evolve with the iteration of manipulation techniques. Given the practical infeasibility of exhaustively collecting all forgery variants, we frame face forgery detection as a continual learning problem and allow the model to develop as new forgery types emerge. Specifically, we employ a Developmental Mixture of Experts (MoE) architecture that uses LoRA models as its individual experts. These experts are organized into two groups: a Real-LoRA to learn and refine knowledge of real faces, and multiple Fake-LoRAs to capture incremental information from different forgery types. To prevent catastrophic forgetting, we ensure that the learning direction of Fake-LoRAs is orthogonal to the established subspace. Moreover, we integrate orthogonal gradients into the orthogonal loss of Fake-LoRAs, preventing gradient interference throughout the training process of each task. Experimental results under both the datasets and manipulation types incremental protocols demonstrate the effectiveness of our method.

Binod Singh, Sayan Deb Sarkar, Iro Armeni

Main category: cs.CV

TL;DR: SGAligner++ is a cross-modal, language-aided framework for 3D scene graph alignment that outperforms state-of-the-art methods by up to 40% on noisy real-world data.

Details

Motivation: Current 3D scene graph alignment methods rely on single-modality point cloud data and struggle with incomplete or noisy input, limiting their effectiveness in real-world applications.

Method: Uses cross-modal learning with lightweight unimodal encoders and attention-based fusion to create a unified joint embedding space for aligning partially overlapping scene observations across heterogeneous modalities.

Result: Extensive evaluations show SGAligner++ outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions and enables cross-modal generalization.

Conclusion: SGAligner++ provides accurate 3D scene graph alignment under low-overlap conditions and sensor noise, enhancing scene understanding for visual localization, 3D reconstruction, and navigation with minimal computational overhead.

Abstract: Aligning 3D scene graphs is a crucial initial step for several applications in robot navigation and embodied perception. Current methods in 3D scene graph alignment often rely on single-modality point cloud data and struggle with incomplete or noisy input. We introduce SGAligner++, a cross-modal, language-aided framework for 3D scene graph alignment. Our method addresses the challenge of aligning partially overlapping scene observations across heterogeneous modalities by learning a unified joint embedding space, enabling accurate alignment even under low-overlap conditions and sensor noise. By employing lightweight unimodal encoders and attention-based fusion, SGAligner++ enhances scene understanding for tasks such as visual localization, 3D reconstruction, and navigation, while ensuring scalability and minimal computational overhead. Extensive evaluations on real-world datasets demonstrate that SGAligner++ outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions, while enabling cross-modal generalization.

[330] Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong

Main category: cs.CV

TL;DR: SHINE is a training-free framework for high-fidelity image composition that addresses complex lighting and high-resolution challenges using pretrained diffusion models without requiring latent inversion or attention surgery.

Details

Motivation: Existing image composition models struggle with complex lighting conditions (accurate shadows, water reflections) and diverse high-resolution inputs, while current diffusion models lack frameworks to utilize their physical and resolution priors effectively without problematic techniques like latent inversion.

Method: SHINE introduces manifold-steered anchor loss using pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity, along with degradation-suppression guidance and adaptive background blending to eliminate low-quality outputs and visible seams.

Result: Experiments on ComplexCompo (a new benchmark with diverse resolutions and challenging conditions) and DreamEditBench show state-of-the-art performance on standard metrics (DINOv2) and human-aligned scores (DreamSim, ImageReward, VisionReward).

Conclusion: SHINE provides an effective training-free solution for high-fidelity image composition that handles complex lighting and high-resolution inputs while maintaining seamless integration, with publicly available code and benchmark upon publication.

Abstract: Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.

[331] HyCoVAD: A Hybrid SSL-LLM Model for Complex Video Anomaly Detection

Mohammad Mahdi Hemmatyar, Mahdi Jafari, Mohammad Amin Yousefi, Mohammad Reza Nemati, Mobin Azadani, Hamid Reza Rastad, Amirmohammad Akbari

Main category: cs.CV

TL;DR: HyCoVAD is a hybrid SSL-LLM model for complex video anomaly detection that combines self-supervised learning for temporal analysis with LLM semantic validation, achieving 72.5% AUC on ComplexVAD dataset.

Details

Motivation: Existing SSL methods struggle with semantic understanding of complex anomalies involving multiple entities and temporal dependencies, while LLMs are computationally expensive and lack spatial localization for frame-level analysis.

Method: Hybrid SSL-LLM approach: SSL module with nnFormer backbone performs multi-task temporal analysis to identify suspicious frames, then LLM applies rule-based reasoning for semantic validation of anomalies.

Result: Achieves 72.5% frame-level AUC on ComplexVAD dataset, outperforming baselines by 12.5% while reducing LLM computational load.

Conclusion: HyCoVAD effectively combines SSL’s spatiotemporal modeling with LLM’s semantic reasoning for complex anomaly detection, providing a balanced solution that maintains performance while reducing computational costs.

Abstract: Video anomaly detection (VAD) is crucial for intelligent surveillance, but a significant challenge lies in identifying complex anomalies, which are events defined by intricate relationships and temporal dependencies among multiple entities rather than by isolated actions. While self-supervised learning (SSL) methods effectively model low-level spatiotemporal patterns, they often struggle to grasp the semantic meaning of these interactions. Conversely, large language models (LLMs) offer powerful contextual reasoning but are computationally expensive for frame-by-frame analysis and lack fine-grained spatial localization. We introduce HyCoVAD, Hybrid Complex Video Anomaly Detection, a hybrid SSL-LLM model that combines a multi-task SSL temporal analyzer with LLM validator. The SSL module is built upon an nnFormer backbone which is a transformer-based model for image segmentation. It is trained with multiple proxy tasks, learns from video frames to identify those suspected of anomaly. The selected frames are then forwarded to the LLM, which enriches the analysis with semantic context by applying structured, rule-based reasoning to validate the presence of anomalies. Experiments on the challenging ComplexVAD dataset show that HyCoVAD achieves a 72.5% frame-level AUC, outperforming existing baselines by 12.5% while reducing LLM computation. We release our interaction anomaly taxonomy, adaptive thresholding protocol, and code to facilitate future research in complex VAD scenarios.

[332] Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents

Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, Weijia Li

Main category: cs.CV

TL;DR: Earth-Agent is the first agentic framework that unifies RGB and spectral Earth observation data with MCP-based tools for complex multi-step reasoning, overcoming limitations of current MLLMs in handling domain-specific tasks.

Details

Motivation: Current MLLMs lack capability for complex multi-step reasoning and domain-specific tool usage in Earth observation. Existing agent-based methods are limited to RGB perception, shallow reasoning, and lack systematic evaluation.

Method: Earth-Agent integrates RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning. It dynamically invokes expert tools and models across modalities for tasks like geophysical parameter retrieval.

Result: Comprehensive experiments show Earth-Agent’s effectiveness across different LLM backbones, outperforming general agent frameworks and MLLMs on remote sensing benchmarks. The framework supports 248 expert-curated tasks with 13,729 images.

Conclusion: Earth-Agent establishes a new paradigm for EO analysis, advancing the field toward scientifically grounded, next-generation LLM applications in Earth observation with robust evaluation through the Earth-Bench benchmark.

Abstract: Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and the use of domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception, shallow reasoning, and lacking systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images, spanning spectrum, products and RGB modalities, and equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments varying different LLM backbones, comparisons with general agent frameworks, and comparisons with MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation.

[333] WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving

Ziyue Zhu, Zhanqian Wu, Zhenxin Zhu, Lijun Zhou, Haiyang Sun, Bing Wan, Kun Ma, Guang Chen, Hangjun Ye, Jin Xie, jian Yang

Main category: cs.CV

TL;DR: WorldSplat is a feed-forward framework for 4D driving-scene generation that bridges the gap between scene generation and reconstruction by producing consistent multi-track videos with high-quality novel-view synthesis.

Details

Motivation: To overcome the limitations of existing methods where generation approaches lack 3D consistency and sparse viewpoint coverage for novel-view synthesis, while reconstruction methods lack generative capabilities.

Method: Uses a 4D-aware latent diffusion model to generate pixel-aligned 4D Gaussians in feed-forward manner, then refines novel view videos with enhanced video diffusion model.

Result: Extensive experiments show WorldSplat generates high-fidelity, temporally and spatially consistent multi-track novel view driving videos.

Conclusion: WorldSplat effectively bridges the gap between scene generation and reconstruction, enabling scalable and controllable training data for autonomous driving systems.

Abstract: Recent advances in driving-scene generation and reconstruction have demonstrated significant potential for enhancing autonomous driving systems by producing scalable and controllable training data. Existing generation methods primarily focus on synthesizing diverse and high-fidelity driving videos; however, due to limited 3D consistency and sparse viewpoint coverage, they struggle to support convenient and high-quality novel-view synthesis (NVS). Conversely, recent 3D/4D reconstruction approaches have significantly improved NVS for real-world driving scenes, yet inherently lack generative capabilities. To overcome this dilemma between scene generation and reconstruction, we propose WorldSplat, a novel feed-forward framework for 4D driving-scene generation. Our approach effectively generates consistent multi-track videos through two key steps: (i) We introduce a 4D-aware latent diffusion model integrating multi-modal information to produce pixel-aligned 4D Gaussians in a feed-forward manner. (ii) Subsequently, we refine the novel view videos rendered from these Gaussians using a enhanced video diffusion model. Extensive experiments conducted on benchmark datasets demonstrate that WorldSplat effectively generates high-fidelity, temporally and spatially consistent multi-track novel view driving videos. Project: https://wm-research.github.io/worldsplat/

[334] Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Mohsen Ghafoorian, Denis Korzhenkov, Amirhossein Habibian

Main category: cs.CV

TL;DR: Attention Surgery is an efficient framework that linearizes or hybridizes attention in pretrained video diffusion models without full retraining, reducing attention FLOPs by up to 40% while maintaining generation quality.

Details

Motivation: Transformer-based video diffusion models suffer from quadratic self-attention costs, making long sequences and high resolutions computationally expensive. Linear attention alternatives fail to match softmax attention expressiveness without costly retraining.

Method: Combines hybrid attention (mixing softmax and linear tokens) with lightweight distillation and fine-tuning, plus cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Requires only a few GPU-days.

Result: Applied to Wan2.1 1.3B VDM, achieves first competitive sub-quadratic attention video diffusion models with 40% FLOPs reduction while maintaining quality on VBench and VBench-2.0 benchmarks.

Conclusion: Attention Surgery successfully enables efficient sub-quadratic attention in pretrained VDMs without sacrificing generation quality, making high-resolution and long-sequence video generation more computationally feasible.

Abstract: Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism-mixing softmax and linear tokens-with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery achieves the first competitive sub-quadratic attention video diffusion models, reducing attention cost by up to 40% in terms of FLOPs, while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks. Project page is available at: https://qualcomm-ai-research.github.io/attention-surgery.

[335] ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On

Junseo Park, Hyeryung Jang

Main category: cs.CV

TL;DR: ART-VITON is a measurement-guided diffusion framework for virtual try-on that preserves non-try-on regions while eliminating boundary artifacts, using trajectory-aligned solvers and frequency-level correction.

Details

Motivation: Existing virtual try-on methods using latent diffusion models struggle to preserve identity and background in non-try-on regions, often causing boundary artifacts when directly replacing these regions with original content.

Method: Reformulates VITON as a linear inverse problem using trajectory-aligned solvers, integrates residual prior-based initialization to mitigate training-inference mismatch, and employs artifact-free measurement-guided sampling with data consistency, frequency-level correction, and periodic standard denoising.

Result: Experiments on VITON-HD, DressCode, and SHHQ-1.0 show effective preservation of identity and background, elimination of boundary artifacts, and consistent improvement in visual fidelity and robustness over state-of-the-art baselines.

Conclusion: ART-VITON successfully addresses the challenge of preserving non-try-on regions in virtual try-on by ensuring measurement adherence while maintaining artifact-free synthesis through a novel diffusion framework.

Abstract: Virtual try-on (VITON) aims to generate realistic images of a person wearing a target garment, requiring precise garment alignment in try-on regions and faithful preservation of identity and background in non-try-on regions. While latent diffusion models (LDMs) have advanced alignment and detail synthesis, preserving non-try-on regions remains challenging. A common post-hoc strategy directly replaces these regions with original content, but abrupt transitions often produce boundary artifacts. To overcome this, we reformulate VITON as a linear inverse problem and adopt trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. However, existing solvers still suffer from semantic drift during generation, leading to artifacts. We propose ART-VITON, a measurement-guided diffusion framework that ensures measurement adherence while maintaining artifact-free synthesis. Our method integrates residual prior-based initialization to mitigate training-inference mismatch and artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. Experiments on VITON-HD, DressCode, and SHHQ-1.0 demonstrate that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.

[336] TTT3R: 3D Reconstruction as Test-Time Training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen

Main category: cs.CV

TL;DR: TTT3R improves 3D reconstruction length generalization using test-time training with a closed-form learning rate based on memory-observation alignment confidence.

Details

Motivation: Modern RNNs for 3D reconstruction suffer from performance degradation beyond training context length due to limited length generalization.

Method: Framing 3D reconstruction as test-time training, using alignment confidence between memory state and observations to derive closed-form learning rate for memory updates.

Result: Achieves 2x improvement in global pose estimation over baselines, operates at 20 FPS with 6 GB GPU memory for thousands of images.

Conclusion: TTT3R provides effective training-free intervention for improving length generalization in 3D reconstruction models.

Abstract: Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code available in https://rover-xingyu.github.io/TTT3R

[337] A Denoising Framework for Real-World Ultra-Low Dose Lung CT Images Based on an Image Purification Strategy

Guoliang Gong, Man Yu

Main category: cs.CV

TL;DR: Proposes Image Purification strategy and Frequency-domain Flow Matching model to address spatial misalignment in ultra-low dose CT denoising, achieving state-of-the-art structure preservation.

Details

Motivation: Ultra-low dose CT reduces radiation but introduces severe noise, artifacts, and spatial misalignment with normal dose CT, making existing denoising methods ineffective for real clinical data.

Method: Image Purification strategy generates structurally aligned uLDCT-NDCT pairs, combined with Frequency-domain Flow Matching model that preserves anatomical structure integrity.

Result: IP strategy significantly enhances multiple denoising models on real clinical uLDCT dataset. FFM model with IP achieves SOTA results in anatomical structure preservation.

Conclusion: Provides effective solution to data mismatch problem in real-world uLDCT denoising, with code and dataset publicly available.

Abstract: Ultra-low dose CT (uLDCT) significantly reduces radiation exposure but introduces severe noise and artifacts. It also leads to substantial spatial misalignment between uLDCT and normal dose CT (NDCT) image pairs. This poses challenges for directly applying existing denoising networks trained on synthetic noise or aligned data. To address this core challenge in uLDCT denoising, this paper proposes an innovative denoising framework based on an Image Purification (IP) strategy. First, we construct a real clinical uLDCT lung dataset. Then, we propose an Image Purification strategy that generates structurally aligned uLDCT-NDCT image pairs, providing a high-quality data foundation for network training. Building upon this, we propose a Frequency-domain Flow Matching (FFM) model, which works synergistically with the IP strategy to excellently preserve the anatomical structure integrity of denoised images. Experiments on the real clinical dataset demonstrate that our IP strategy significantly enhances the performance of multiple mainstream denoising models on the uLDCT task. Notably, our proposed FFM model combined with the IP strategy achieves state-of-the-art (SOTA) results in anatomical structure preservation. This study provides an effective solution to the data mismatch problem in real-world uLDCT denoising. Code and dataset are available at https://github.com/MonkeyDadLufy/flow-matching.

[338] Ctrl-VI: Controllable Video Synthesis via Variational Inference

Haoyi Duan, Yunzhi Zhang, Yilun Du, Jiajun Wu

Main category: cs.CV

TL;DR: Ctrl-VI is a video synthesis method that enables mixed-granularity control over video generation, supporting everything from precise 4D trajectories to coarse text prompts while maintaining diversity for under-specified elements.

Details

Motivation: Existing video generative models are typically trained for fixed input formats, but many video workflows require a mixture of user controls with varying granularity - from exact 4D object trajectories and camera paths to coarse text prompts.

Method: The approach casts video synthesis as variational inference to approximate a composed distribution, leveraging multiple video generation backbones. It uses step-wise KL divergence minimization over an annealed sequence of distributions and a context-conditioned factorization technique to reduce modes in the solution space and avoid local optima.

Result: Experiments show that Ctrl-VI produces samples with improved controllability, diversity, and 3D consistency compared to prior works.

Conclusion: Ctrl-VI successfully addresses the need for mixed-granularity control in video generation while maintaining sample diversity and quality.

Abstract: Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop Ctrl-VI, a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.

[339] CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving

Tianrui Zhang, Yichen Liu, Zilin Guo, Yuxin Guo, Jingcheng Ni, Chenjing Ding, Dan Xu, Lewei Lu, Zehuan Wu

Main category: cs.CV

TL;DR: CVD-STORM is a cross-view video diffusion model that generates multi-view videos with 4D reconstruction capabilities using a spatial-temporal VAE and Gaussian Splatting Decoder.

Details

Motivation: Address the growing demand in autonomous driving for high-fidelity video generation with diverse information like depth estimation and 4D reconstruction under various control inputs.

Method: Fine-tune a spatial-temporal VAE with auxiliary 4D reconstruction task, then integrate it into video diffusion process. Use Gaussian Splatting Decoder for dynamic scene reconstruction.

Result: Achieves substantial improvements in FID and FVD metrics. Effectively reconstructs dynamic scenes with valuable geometric information.

Conclusion: The proposed approach successfully generates high-quality multi-view videos with 4D reconstruction capabilities, enhancing comprehensive scene understanding for autonomous driving applications.

Abstract: Generative models have been widely applied to world modeling for environment simulation and future state prediction. With advancements in autonomous driving, there is a growing demand not only for high-fidelity video generation under various controls, but also for producing diverse and meaningful information such as depth estimation. To address this, we propose CVD-STORM, a cross-view video diffusion model utilizing a spatial-temporal reconstruction Variational Autoencoder (VAE) that generates long-term, multi-view videos with 4D reconstruction capabilities under various control inputs. Our approach first fine-tunes the VAE with an auxiliary 4D reconstruction task, enhancing its ability to encode 3D structures and temporal dynamics. Subsequently, we integrate this VAE into the video diffusion process to significantly improve generation quality. Experimental results demonstrate that our model achieves substantial improvements in both FID and FVD metrics. Additionally, the jointly-trained Gaussian Splatting Decoder effectively reconstructs dynamic scenes, providing valuable geometric information for comprehensive scene understanding. Our project page is https://sensetime-fvg.github.io/CVD-STORM.

[340] Real-Time Motion-Controllable Autoregressive Video Diffusion

Kesen Zhao, Jiaxin Shi, Beier Zhu, Junbao Zhou, Xiaolong Shen, Yuan Zhou, Qianru Sun, Hanwang Zhang

Main category: cs.CV

TL;DR: AR-Drag is a reinforcement learning-enhanced autoregressive video diffusion model that enables real-time image-to-video generation with diverse motion control, achieving high visual quality and precise motion alignment with low latency.

Details

Motivation: To address the challenges of real-time motion-controllable video generation, including the latency of bidirectional diffusion models and limitations of existing autoregressive approaches that suffer from quality degradation and motion artifacts.

Method: Fine-tune a base image-to-video model for basic motion control, then enhance it via reinforcement learning with a trajectory-based reward model, using Self-Rollout mechanism to preserve Markov property and selective stochasticity in denoising steps.

Result: AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared to state-of-the-art motion-controllable video diffusion models while using only 1.3B parameters.

Conclusion: The proposed AR-Drag framework successfully enables real-time motion-controllable video generation with improved quality and reduced latency, demonstrating the effectiveness of RL-enhanced autoregressive approaches for video synthesis.

Abstract: Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.

[341] One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting

Haipeng Liu, Yang Wang, Meng Wang

Main category: cs.CV

TL;DR: NTN-Diff is a frequency-aware diffusion model for text-guided image inpainting that addresses semantics consistency and unmasked region preservation by disentangling mid-and-low frequency bands during denoising.

Details

Motivation: Previous text-guided inpainting methods failed to simultaneously preserve unmasked regions and achieve semantics consistency between masked and unmasked regions, due to entanglement of hybrid frequency bands with different robustness to text prompts.

Method: Proposes null-text-null frequency-aware diffusion model that divides denoising into early (high-level noise) and late (low-level noise) stages. Uses stable mid-frequency band as guidance for null-text denoising of low-frequency band, followed by text-guided denoising for consistency.

Result: Extensive experiments show NTN-Diff outperforms state-of-the-art diffusion models for text-guided image inpainting, achieving better semantics consistency while preserving unmasked regions.

Conclusion: NTN-Diff successfully addresses both preservation of unmasked regions and semantics consistency by frequency band disentanglement during denoising process, offering superior performance over existing methods.

Abstract: Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed \textbf{NTN-Diff}, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over the state-of-the-art diffusion models to text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.

[342] MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition

Deng Li, Jun Shao, Bohao Xing, Rong Gao, Bihan Wen, Heikki Kälviäinen, Xin Liu

Main category: cs.CV

TL;DR: MSF-Mamba enhances vanilla Mamba for micro-gesture recognition by adding motion-aware local spatiotemporal modeling through state fusion and multiscale processing, achieving state-of-the-art performance with high efficiency.

Details

Motivation: Existing methods have limitations: CNNs struggle with long-range dependencies, Transformers have high computational costs, and vanilla Mamba lacks local spatiotemporal modeling and motion-awareness needed for micro-gesture recognition.

Method: Propose MSF-Mamba with motion-aware state fusion module using central frame difference and multiscale version (MSF-Mamba+) with adaptive scale weighting to fuse local contextual neighboring states for spatiotemporal modeling.

Result: Experiments on two public MGR datasets show MSF-Mamba achieves state-of-the-art performance, outperforming CNN-, Transformer-, and SSM-based models while maintaining high efficiency.

Conclusion: MSF-Mamba effectively addresses vanilla Mamba’s limitations by enabling motion-aware local spatiotemporal modeling, making it suitable for capturing subtle motion cues in micro-gesture recognition tasks.

Abstract: Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient model, leveraging state space models (SSMs) to enable linear-time processing However, directly applying the vanilla Mamba to MGR may not be optimal. This is because Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack a design of motion-awareness, which is crucial in MGR. To overcome these limitations, we propose motion-aware state fusion mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, a multiscale version named MSF-Mamba+ has been proposed. Specifically, MSF-Mamba supports multiscale motion-aware state fusion, as well as an adaptive scale weighting module that dynamically weighs the fused states across different scales. These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, namely, MSF-Mamba, achieves SoTA performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.

[343] MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites

Zhenxin Lei, Zhangwei Gao, Changyao Tian, Erfei Cui, Guanzhou Chen, Danni Yang, Yuchen Duan, Zhaokai Wang, Wenhao Li, Weiyun Wang, Xiangyu Zhao, Jiayi Ji, Yu Qiao, Wenhai Wang, Gen Luo

Main category: cs.CV

TL;DR: CapFlow is a multi-agent collaboration workflow that achieves GPT-4.1-level caption quality using open-source models with 89.5% cost reduction, enabling scalable high-quality visual caption synthesis.

Details

Motivation: To bridge the performance gap between open-source and commercial visual captioning models, enabling cost-effective applications like data synthesis.

Method: Proposes CapFlow, a multi-agent collaboration workflow that leverages open-source models to generate high-quality visual captions, then uses this as data synthesizer to train MetaCaptioner via fine-tuning.

Result: Achieves caption quality comparable to GPT-4.1 across various domains with 89.5% cost reduction. MetaCaptioner reaches top-tier multimodal performance in open-source community.

Conclusion: CapFlow and MetaCaptioner provide a strong, cost-effective visual captioning solution that can benefit future multimodal research.

Abstract: Generalist visual captioning goes beyond a simple appearance description task, but requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open-source models present a large performance gap with commercial ones, which limits various applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves comparable captioning capabilities with commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.

[344] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Ernesto Gabriel Hernández Montoya, Chen Bo Calvin Zhang, Bin Hu, Yunzhong He, Bing Liu, Rakshith Sharma Srinivasa

Main category: cs.CV

TL;DR: VisualToolBench is a new benchmark for evaluating MLLMs’ ability to actively manipulate and think with images, showing current models struggle with visual tool-use reasoning tasks.

Details

Motivation: Current MLLMs treat images as passive context rather than manipulable cognitive workspaces, limiting their ability to solve complex visual-textual tasks that require active image transformations.

Method: Created VisualToolBench with 1,204 challenging vision tasks across 5 domains (603 single-turn, 601 multi-turn) with detailed rubrics to systematically evaluate MLLMs’ visual tool-use reasoning.

Result: Current MLLMs perform poorly, with the strongest model (GPT-5-think) achieving only 18.68% pass rate. Models show divergent behaviors - OpenAI models benefit from image manipulations while Gemini-2.5-pro shows no improvement.

Conclusion: VisualToolBench provides critical insights for advancing visual intelligence in MLLMs by introducing the first benchmark centered on ’think with images’ paradigm, highlighting significant gaps in current models’ visual tool-use capabilities.

Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think about images paradigm, where images are regarded as static inputs. To address this gap, we introduce VisualToolBench, a visual tool-use reasoning benchmark that rigorously evaluates MLLMs’ ability to perceive, transform, and reason across complex visual-textual tasks under the think-with-images paradigm. VisualToolBench comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning across five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on think with images, VisualToolBench offers critical insights for advancing visual intelligence in MLLMs.

[345] SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms

Haithem Turki, Qi Wu, Xin Kang, Janick Martinez Esturo, Shengyu Huang, Ruilong Li, Zan Gojcic, Riccardo de Lutio

Main category: cs.CV

TL;DR: SimULi enables real-time rendering of arbitrary camera models and LiDAR data for autonomous robot testing, addressing limitations of existing neural rendering methods.

Details

Motivation: Existing neural rendering methods have low rendering speeds, limited camera model support, and cross-sensor inconsistencies, making them unsuitable for rigorous testing of autonomous robots like self-driving vehicles.

Method: Extends 3DGUT with LiDAR support using automated tiling for spinning LiDAR models and ray-based culling. Uses factorized 3D Gaussian representation and anchoring strategy to address cross-sensor inconsistencies.

Result: Renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work. Reduces mean camera and depth error by up to 40% compared to existing methods. Matches or exceeds state-of-the-art fidelity on autonomous driving datasets.

Conclusion: SimULi provides a practical solution for high-fidelity, real-time multi-sensor simulation needed for rigorous testing of autonomous robots, overcoming key limitations of existing neural rendering approaches.

Abstract: Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.

[346] SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl

Main category: cs.CV

TL;DR: The paper introduces SVAG, a new task for spatio-temporal video action grounding that requires detecting, tracking, and temporally localizing objects based on natural language action descriptions, along with a benchmark dataset and baseline model.

Details

Motivation: Existing video understanding methods focus on coarse-grained action recognition or generic object tracking, but lack the capability to jointly detect and track multiple objects according to their actions while grounding them temporally.

Method: Proposed SVAGFormer, a baseline framework that adapts state-of-the-art vision language models for joint spatial and temporal grounding, and introduced SVAG-Bench dataset with 688 videos and 19,590 annotated records.

Result: Empirical results show that existing models perform poorly on SVAG, especially in dense or complex scenes, highlighting the need for better reasoning over fine-grained object-action interactions.

Conclusion: The SVAG task presents significant challenges for current models and underscores the need for more advanced reasoning capabilities in video understanding systems.

Abstract: Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite recent progress in video understanding, existing methods predominantly address either coarse-grained action recognition or generic object tracking, thereby overlooking the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task that requires models to simultaneously detect, track, and temporally localize all referent objects in videos based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a diverse range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state of the art vision language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Empirical results show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.

[347] CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation

Li Liang, Bo Miao, Xinyu Wang, Naveed Akhtar, Jordan Vice, Ajmal Mian

Main category: cs.CV

TL;DR: SketchSem3D is the first large-scale benchmark for generating 3D outdoor semantic scenes from sketches and satellite images, with a proposed Cylinder Mamba Diffusion (CymbaDiff) method that enhances spatial coherence in scene generation.

Details

Motivation: Advances in outdoor 3D semantic scene generation are constrained by the absence of publicly available, well-annotated datasets.

Method: Proposed Cylinder Mamba Diffusion (CymbaDiff) that imposes structured spatial ordering, captures cylindrical continuity and vertical hierarchy, and preserves physical neighborhood relationships and global context.

Result: Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization.

Conclusion: The code and dataset will be publicly available to enable standardized, rigorous, and diverse evaluations in outdoor 3D scene generation.

Abstract: Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff) that significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at https://github.com/Lillian-research-hub/CymbaDiff

[348] Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models

Hong-Kai Zheng, Piji Li

Main category: cs.CV

TL;DR: Group-VQ improves VQ-VAE by using group-wise codebook optimization and training-free codebook resampling to address codebook collapse issues and enhance reconstruction quality.

Details

Motivation: To solve persistent codebook collapse problems in VQ-VAEs and overcome limitations of existing approaches that constrain codebook learning capability and reduce reconstruction quality.

Method: Proposes Group-VQ with group-wise codebook optimization where each group is independently optimized with joint optimization within groups, plus a training-free codebook resampling method for post-training size adjustment.

Result: Improved performance on reconstruction metrics in image reconstruction experiments across various settings, with the codebook resampling method achieving desired flexibility in adjusting codebook size.

Conclusion: Group-VQ provides a better trade-off between codebook utilization and reconstruction performance while offering flexible post-training codebook size adjustment.

Abstract: Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in the VQ model. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook’s learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics. And the post-training codebook sampling method achieves the desired flexibility in adjusting the codebook size.

[349] High Semantic Features for the Continual Learning of Complex Emotions: a Lightweight Solution

Thibault Geoffroy, Gauthier Gerspacher, Lionel Prevost

Main category: cs.CV

TL;DR: The paper proposes using Action Units (facial muscle movements) as non-transient features for incremental learning of complex emotions, achieving 0.75 accuracy on CFEE dataset with lightweight model.

Details

Motivation: Address catastrophic forgetting in incremental learning for emotion recognition, particularly when learning complex emotions after basic ones, similar to human learning process.

Method: Use Action Units describing facial muscle movements as non-transient, semantic features instead of features from shallow/deep CNNs for incremental learning of complex emotions.

Result: Achieves 0.75 accuracy on CFEE dataset for complex emotion recognition, comparable to state-of-the-art, with lightweight model and small memory footprint.

Conclusion: Action Units are effective non-transient features that prevent catastrophic forgetting in incremental learning of emotions, enabling successful learning of complex emotions after basic ones.

Abstract: Incremental learning is a complex process due to potential catastrophic forgetting of old tasks when learning new ones. This is mainly due to transient features that do not fit from task to task. In this paper, we focus on complex emotion recognition. First, we learn basic emotions and then, incrementally, like humans, complex emotions. We show that Action Units, describing facial muscle movements, are non-transient, highly semantical features that outperform those extracted by both shallow and deep convolutional neural networks. Thanks to this ability, our approach achieves interesting results when learning incrementally complex, compound emotions with an accuracy of 0.75 on the CFEE dataset and can be favorably compared to state-of-the-art results. Moreover, it results in a lightweight model with a small memory footprint.

[350] OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild

Hongyu Qu, Jianan Wei, Xiangbo Shu, Yazhou Yao, Wenguan Wang, Jinhui Tang

Main category: cs.CV

TL;DR: OmniGaze is a semi-supervised framework for 3D gaze estimation that uses large-scale unlabeled data from diverse real-world environments to overcome domain bias and improve generalization.

Details

Motivation: Current 3D gaze estimation methods struggle with generalization across diverse domains due to limited annotated datasets and insufficient diversity in labeled data.

Method: OmniGaze uses pseudo-labeling with a reward model that assesses pseudo label reliability using 3D direction vectors, visual embeddings from an off-the-shelf encoder, and semantic cues from a Multimodal Large Language Model.

Result: Achieves state-of-the-art performance on five datasets in both in-domain and cross-domain settings, and shows robust zero-shot generalization on four unseen datasets.

Conclusion: OmniGaze effectively leverages unlabeled data to improve 3D gaze estimation generalization and can serve as a scalable data engine for gaze estimation tasks.

Abstract: Current 3D gaze estimation methods struggle to generalize across diverse data domains, primarily due to i) the scarcity of annotated datasets, and ii) the insufficient diversity of labeled data. In this work, we present OmniGaze, a semi-supervised framework for 3D gaze estimation, which utilizes large-scale unlabeled data collected from diverse and unconstrained real-world environments to mitigate domain bias and generalize gaze estimation in the wild. First, we build a diverse collection of unlabeled facial images, varying in facial appearances, background environments, illumination conditions, head poses, and eye occlusions. In order to leverage unlabeled data spanning a broader distribution, OmniGaze adopts a standard pseudo-labeling strategy and devises a reward model to assess the reliability of pseudo labels. Beyond pseudo labels as 3D direction vectors, the reward model also incorporates visual embeddings extracted by an off-the-shelf visual encoder and semantic cues from gaze perspective generated by prompting a Multimodal Large Language Model to compute confidence scores. Then, these scores are utilized to select high-quality pseudo labels and weight them for loss computation. Extensive experiments demonstrate that OmniGaze achieves state-of-the-art performance on five datasets under both in-domain and cross-domain settings. Furthermore, we also evaluate the efficacy of OmniGaze as a scalable data engine for gaze estimation, which exhibits robust zero-shot generalization on four unseen datasets.

[351] Reasoning in Space via Grounding in the World

Yiming Chen, Zekun Qi, Wenyao Zhang, Xin Jin, Li Zhang, Peidong Liu

Main category: cs.CV

TL;DR: GS-Reasoner is a 3D LLM that achieves autoregressive grounding without external modules through a dual-path pooling mechanism, creating unified 3D representations that bridge visual grounding and spatial reasoning.

Details

Motivation: Existing 3D LLMs lack unified representations that capture both semantic and geometric information, leading to poor grounding performance or excessive reliance on external modules, which hinders the integration of grounding and spatial reasoning.

Method: Proposes a dual-path pooling mechanism that aligns geometric features with semantic and positional cues to create unified image patch-based 3D representations without increasing input tokens. Also introduces Grounded Chain-of-Thought (GCoT) dataset with 3D bounding boxes and step-by-step reasoning paths.

Result: GS-Reasoner achieves impressive results on 3D visual grounding and state-of-the-art performance on spatial reasoning, demonstrating that grounding significantly enhances spatial reasoning capabilities.

Conclusion: The paper establishes a unified and self-contained framework for 3D spatial reasoning, showing that effective grounding is crucial for spatial reasoning and can be achieved without external modules through holistic 3D representations.

Abstract: In this paper, we claim that 3D visual grounding is the cornerstone of spatial reasoning and introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore the effective spatial representations that bridge the gap between them. Existing 3D LLMs suffer from the absence of a unified 3D representation capable of jointly capturing semantic and geometric information. This deficiency is manifested either in poor performance on grounding or in an excessive reliance on external modules, ultimately hindering the seamless integration of grounding and spatial reasoning. To address this, we propose a simple yet effective dual-path pooling mechanism that tightly aligns geometric features with both semantic and positional cues, constructing a unified image patch-based 3D representation that encapsulates all essential information without increasing the number of input tokens. Leveraging this holistic representation, GS-Reasoner is the first 3D LLM that achieves autoregressive grounding entirely without external modules while delivering performance comparable to state-of-the-art models, establishing a unified and self-contained framework for 3D spatial reasoning. To further bridge grounding and spatial reasoning, we introduce the Grounded Chain-of-Thought (GCoT) dataset. This dataset is meticulously curated to include both 3D bounding box annotations for objects referenced in reasoning questions and step-by-step reasoning paths that integrate grounding as a core component of the problem-solving process. Extensive experiments demonstrate that GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities, leading to state-of-the-art performance.

cs.AI

[352] Decision Oriented Technique (DOTechnique): Finding Model Validity Through Decision-Maker Context

Raheleh Biglari, Joachim Denil

Main category: cs.AI

TL;DR: DOTechnique determines model validity by evaluating decision consistency between surrogate and high-fidelity models, enabling efficient validity region identification without predefined boundaries.

Details

Motivation: Traditional model validation relies on predefined validity frames that may be unavailable or insufficient, especially for decision-making processes where model validity is critical.

Method: Decision Oriented Technique (DOTechnique) evaluates model validity based on decision consistency rather than output similarity, integrating domain constraints and symbolic reasoning to narrow search space.

Result: Applied to a highway lane change system, DOTechnique successfully uncovered the validity region of a simulation model, demonstrating practical effectiveness.

Conclusion: DOTechnique shows potential for supporting model validity determination through decision-maker context, offering an efficient alternative to traditional validation approaches.

Abstract: Model validity is as critical as the model itself, especially when guiding decision-making processes. Traditional approaches often rely on predefined validity frames, which may not always be available or sufficient. This paper introduces the Decision Oriented Technique (DOTechnique), a novel method for determining model validity based on decision consistency rather than output similarity. By evaluating whether surrogate models lead to equivalent decisions compared to high-fidelity models, DOTechnique enables efficient identification of validity regions, even in the absence of explicit validity boundaries. The approach integrates domain constraints and symbolic reasoning to narrow the search space, enhancing computational efficiency. A highway lane change system serves as a motivating example, demonstrating how DOTechnique can uncover the validity region of a simulation model. The results highlight the potential of the technique to support finding model validity through decision-maker context.

Supriti Sinhamahapatra, Jan Niehues

Main category: cs.AI

TL;DR: This paper presents a multi-modal ASR system that integrates visual information from presentation slides to improve speech recognition, achieving 34-35% relative WER reduction compared to baseline models.

Details

Motivation: Current SOTA ASR systems rely only on acoustic information and ignore multi-modal context, while visual information from slides can help disambiguate domain-specific terminology in scientific presentations.

Method: Created a benchmark for multi-modal presentations, developed data augmentation techniques to overcome lack of slide datasets, and trained models by augmenting speech models with multi-modal slide information.

Result: The trained model achieved approximately 34% relative reduction in overall word error rate and 35% relative reduction for domain-specific terms compared to baseline models.

Conclusion: Integrating visual information from presentation slides significantly improves ASR performance, especially for domain-specific terminology in scientific presentations.

Abstract: State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily rely on acoustic information while disregarding additional multi-modal context. However, visual information are essential in disambiguation and adaptation. While most work focus on speaker images to handle noise conditions, this work also focuses on integrating presentation slides for the use cases of scientific presentation. In a first step, we create a benchmark for multi-modal presentation including an automatic analysis of transcribing domain-specific terminology. Next, we explore methods for augmenting speech models with multi-modal information. We mitigate the lack of datasets with accompanying slides by a suitable approach of data augmentation. Finally, we train a model using the augmented dataset, resulting in a relative reduction in word error rate of approximately 34%, across all words and 35%, for domain-specific terms compared to the baseline model.

[354] Do Large Language Models Show Biases in Causal Learning? Insights from Contingency Judgment

María Victoria Carro, Denise Alejandra Mester, Francisca Gauna Selasco, Giovanni Franco Gabriel Marraffini, Mario Alejandro Leiva, Gerardo I. Simari, María Vanina Martinez

Main category: cs.AI

TL;DR: Large language models systematically develop causal illusions in null contingency scenarios, showing they reproduce causal language without true comprehension.

Details

Motivation: To examine whether LLMs are prone to causal illusions like humans, using the classic contingency judgment task paradigm from cognitive science.

Method: Created a dataset of 1,000 null contingency medical scenarios and prompted LLMs to evaluate potential causes’ effectiveness.

Result: All evaluated models systematically inferred unwarranted causal relationships, showing strong susceptibility to the illusion of causality.

Conclusion: LLMs reproduce causal language without true comprehension, raising concerns about their use in domains requiring accurate causal reasoning.

Abstract: Causal learning is the cognitive process of developing the capability of making causal inferences based on available information, often guided by normative principles. This process is prone to errors and biases, such as the illusion of causality, in which people perceive a causal relationship between two variables despite lacking supporting evidence. This cognitive bias has been proposed to underlie many societal problems, including social prejudice, stereotype formation, misinformation, and superstitious thinking. In this work, we examine whether large language models are prone to developing causal illusions when faced with a classic cognitive science paradigm: the contingency judgment task. To investigate this, we constructed a dataset of 1,000 null contingency scenarios (in which the available information is not sufficient to establish a causal relationship between variables) within medical contexts and prompted LLMs to evaluate the effectiveness of potential causes. Our findings show that all evaluated models systematically inferred unwarranted causal relationships, revealing a strong susceptibility to the illusion of causality. While there is ongoing debate about whether LLMs genuinely understand causality or merely reproduce causal language without true comprehension, our findings support the latter hypothesis and raise concerns about the use of language models in domains where accurate causal reasoning is essential for informed decision-making.

[355] GammaZero: Learning To Guide POMDP Belief Space Search With Graph Representations

Rajesh Mangannavar, Prasad Tadepalli

Main category: cs.AI

TL;DR: GammaZero introduces an action-centric graph representation for POMDP planning that enables zero-shot generalization to larger problem sizes by learning structural patterns from small problems and applying them via graph neural networks to guide Monte Carlo tree search.

Details

Motivation: Existing approaches for POMDP planning require domain-specific neural architectures and struggle with scalability, limiting their practical application to larger problems.

Method: Uses a unified graph-based belief representation transformed into action-centric graphs, employs graph neural networks with decoder architecture to learn value functions and policies from expert demonstrations on small problems, then applies learned heuristics to guide Monte Carlo tree search on larger problems.

Result: Achieves comparable performance to BetaZero on same-sized problems while enabling zero-shot generalization to problems 2-4 times larger than training instances, maintaining solution quality with reduced search requirements.

Conclusion: GammaZero provides a scalable framework for POMDP planning that generalizes across problem sizes through action-centric graph representations and learned structural patterns.

Abstract: We introduce an action-centric graph representation framework for learning to guide planning in Partially Observable Markov Decision Processes (POMDPs). Unlike existing approaches that require domain-specific neural architectures and struggle with scalability, GammaZero leverages a unified graph-based belief representation that enables generalization across problem sizes within a domain. Our key insight is that belief states can be systematically transformed into action-centric graphs where structural patterns learned on small problems transfer to larger instances. We employ a graph neural network with a decoder architecture to learn value functions and policies from expert demonstrations on computationally tractable problems, then apply these learned heuristics to guide Monte Carlo tree search on larger problems. Experimental results on standard POMDP benchmarks demonstrate that GammaZero achieves comparable performance to BetaZero when trained and tested on the same-sized problems, while uniquely enabling zero-shot generalization to problems 2-4 times larger than those seen during training, maintaining solution quality with reduced search requirements.

[356] TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG

Annisaa Fitri Nurfidausi, Eleonora Mancini, Paolo Torroni

Main category: cs.AI

TL;DR: Systematic exploration of multimodal depression detection using EEG, speech, and text modalities, showing that trimodal models with pretrained embeddings achieve state-of-the-art performance.

Details

Motivation: Address limitations in existing depression detection studies including limited scope, lack of systematic feature comparisons, and inconsistent evaluation protocols.

Method: Systematic evaluation of feature representations (handcrafted vs pretrained embeddings), neural encoders, unimodal/bimodal/trimodal configurations, and fusion strategies with consistent subject-independent splits.

Result: Combination of EEG, speech and text enhances detection; pretrained embeddings outperform handcrafted features; carefully designed trimodal models achieve state-of-the-art performance.

Conclusion: The work lays groundwork for future multimodal depression detection research by providing systematic comparisons and robust benchmarking.

Abstract: Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.

[357] Position: Require Frontier AI Labs To Release Small “Analog” Models

Shriyash Upadhyay, Chaithanya Bandi, Narmeen Oozeer, Philip Quirke

Main category: cs.AI

TL;DR: Proposes mandating AI labs to release small analog models (scaled-down versions) of their large proprietary models to enable safety verification and innovation without full disclosure.

Details

Motivation: Address concerns about safety regulation costs and the safety-innovation tradeoff by creating an alternative regulatory approach that promotes both safety and innovation.

Method: Require large AI labs to release openly accessible analog models trained similarly to and distilled from their largest proprietary models, serving as public proxies for safety research.

Result: Enables broad participation in safety verification, interpretability research, and algorithmic transparency while minimizing regulatory burden and accelerating safety advancements.

Conclusion: This policy approach reduces the safety-innovation tradeoff by enabling deeper understanding of models through accessible analogs, promoting both safety and innovation with minimal additional costs.

Abstract: Recent proposals for regulating frontier AI models have sparked concerns about the cost of safety regulation, and most such regulations have been shelved due to the safety-innovation tradeoff. This paper argues for an alternative regulatory approach that ensures AI safety while actively promoting innovation: mandating that large AI laboratories release small, openly accessible analog models (scaled-down versions) trained similarly to and distilled from their largest proprietary models. Analog models serve as public proxies, allowing broad participation in safety verification, interpretability research, and algorithmic transparency without forcing labs to disclose their full-scale models. Recent research demonstrates that safety and interpretability methods developed using these smaller models generalize effectively to frontier-scale systems. By enabling the wider research community to directly investigate and innovate upon accessible analogs, our policy substantially reduces the regulatory burden and accelerates safety advancements. This mandate promises minimal additional costs, leveraging reusable resources like data and infrastructure, while significantly contributing to the public good. Our hope is not only that this policy be adopted, but that it illustrates a broader principle supporting fundamental research in machine learning: deeper understanding of models relaxes the safety-innovation tradeoff and lets us have more of both.

Carter Blair, Kate Larson

Main category: cs.AI

TL;DR: The paper proposes a multi-objective MDP framework for consensus statement generation that provides provable fairness guarantees using social choice theory principles.

Details

Motivation: Current consensus generation frameworks lack structure for provable fairness guarantees when aggregating diverse free-form opinions.

Method: Model the task as a token-level MDP with multiple objectives (agents’ preferences), derive rewards from personalized language models, and apply social choice theory approaches including ex-ante core guarantees and egalitarian welfare maximization.

Result: Empirical experiments show that search guided by egalitarian objective generates consensus statements with improved worst-case agent alignment compared to baseline methods.

Conclusion: The proposed MDP framework with social choice theory principles enables provably fair consensus generation with better worst-case performance than existing approaches.

Abstract: Current frameworks for consensus statement generation with large language models lack the inherent structure needed to provide provable fairness guarantees when aggregating diverse free-form opinions. We model the task as a multi-objective, token-level Markov Decision Process (MDP), where each objective corresponds to an agent’s preference. Token-level rewards for each agent are derived from their policy (e.g., a personalized language model). This approach utilizes the finding that such policies implicitly define optimal Q-functions, providing a principled way to quantify rewards at each generation step without a value function (Rafailov et al., 2024). This MDP formulation creates a formal structure amenable to analysis using principles from social choice theory. We propose two approaches grounded in social choice theory. First, we propose a stochastic generation policy guaranteed to be in the ex-ante core, extending core stability concepts from voting theory to text generation. This policy is derived from an underlying distribution over complete statements that maximizes proportional fairness (Nash Welfare). Second, for generating a single statement, we target the maximization of egalitarian welfare using search algorithms within the MDP framework. Empirically, experiments using language models to instantiate agent policies show that search guided by the egalitarian objective generates consensus statements with improved worst-case agent alignment compared to baseline methods, including the Habermas Machine (Tessler et al., 2024).

[359] STEMS: Spatial-Temporal Enhanced Safe Multi-Agent Coordination for Building Energy Management

Huiliang Zhang, Di Wu, Arnaud Zinflou, Benoit Boulet

Main category: cs.AI

TL;DR: STEMS is a safety-constrained multi-agent reinforcement learning framework that uses spatial-temporal graph learning and control barrier functions for coordinated building energy management, achieving significant cost and emission reductions while ensuring operational safety.

Details

Motivation: Address challenges in multi-building energy systems: insufficient spatial-temporal information exploitation, lack of rigorous safety guarantees, and system complexity for coordinated building energy management.

Method: Integrates GCN-Transformer fusion architecture for spatial-temporal graph representation learning and safety-constrained multi-agent RL with Control Barrier Functions for mathematical safety guarantees.

Result: 21% cost reduction, 18% emission reduction, safety violations reduced from 35.1% to 5.6%, maintains optimal comfort with only 0.13 discomfort proportion, and shows strong robustness during extreme weather.

Conclusion: STEMS framework effectively addresses multi-building energy management challenges, demonstrating superior performance, safety guarantees, and robustness across different building types.

Abstract: Building energy management is essential for achieving carbon reduction goals, improving occupant comfort, and reducing energy costs. Coordinated building energy management faces critical challenges in exploiting spatial-temporal dependencies while ensuring operational safety across multi-building systems. Current multi-building energy systems face three key challenges: insufficient spatial-temporal information exploitation, lack of rigorous safety guarantees, and system complexity. This paper proposes Spatial-Temporal Enhanced Safe Multi-Agent Coordination (STEMS), a novel safety-constrained multi-agent reinforcement learning framework for coordinated building energy management. STEMS integrates two core components: (1) a spatial-temporal graph representation learning framework using a GCN-Transformer fusion architecture to capture inter-building relationships and temporal patterns, and (2) a safety-constrained multi-agent RL algorithm incorporating Control Barrier Functions to provide mathematical safety guarantees. Extensive experiments on real-world building datasets demonstrate STEMS’s superior performance over existing methods, showing that STEMS achieves 21% cost reduction, 18% emission reduction, and dramatically reduces safety violations from 35.1% to 5.6% while maintaining optimal comfort with only 0.13 discomfort proportion. The framework also demonstrates strong robustness during extreme weather conditions and maintains effectiveness across different building types.

[360] Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems

Edoardo Allegrini, Ananth Shreekumar, Z. Berkay Celik

Main category: cs.AI

TL;DR: A unified modeling framework for agentic AI systems that addresses fragmentation in inter-agent communication protocols by providing formal models for host agents and task lifecycles, enabling formal verification of system properties.

Details

Motivation: Current agentic AI systems suffer from fragmented communication protocols (MCP for tools, A2A for coordination) that create semantic gaps, preventing rigorous analysis and introducing risks like architectural misalignment and exploitable coordination issues.

Method: Introduces two foundational models: 1) host agent model formalizing the top-level entity that interacts with users and orchestrates task execution, and 2) task lifecycle model detailing states and transitions of sub-tasks from creation to completion.

Result: Developed 17 properties for host agents and 14 for task lifecycles, categorized into liveness, safety, completeness, and fairness, expressed in temporal logic for formal verification of system behavior.

Conclusion: This is the first rigorously grounded, domain-agnostic framework for systematic analysis, design, and deployment of correct, reliable, and robust agentic AI systems.

Abstract: Agentic AI systems, which leverage multiple autonomous agents and Large Language Models (LLMs), are increasingly used to address complex, multi-step tasks. The safety, security, and functionality of these systems are critical, especially in high-stakes applications. However, the current ecosystem of inter-agent communication is fragmented, with protocols such as the Model Context Protocol (MCP) for tool access and the Agent-to-Agent (A2A) protocol for coordination being analyzed in isolation. This fragmentation creates a semantic gap that prevents the rigorous analysis of system properties and introduces risks such as architectural misalignment and exploitable coordination issues. To address these challenges, we introduce a modeling framework for agentic AI systems composed of two foundational models. The first, the host agent model, formalizes the top-level entity that interacts with the user, decomposes tasks, and orchestrates their execution by leveraging external agents and tools. The second, the task lifecycle model, details the states and transitions of individual sub-tasks from creation to completion, providing a fine-grained view of task management and error handling. Together, these models provide a unified semantic framework for reasoning about the behavior of multi-AI agent systems. Grounded in this framework, we define 17 properties for the host agent and 14 for the task lifecycle, categorized into liveness, safety, completeness, and fairness. Expressed in temporal logic, these properties enable formal verification of system behavior, detection of coordination edge cases, and prevention of deadlocks and security vulnerabilities. Through this effort, we introduce the first rigorously grounded, domain-agnostic framework for the systematic analysis, design, and deployment of correct, reliable, and robust agentic AI systems.

[361] A Multimodal Approach to Heritage Preservation in the Context of Climate Change

David Roqui, Adèle Cormier, nistor Grozavu, Ann Bourges

Main category: cs.AI

TL;DR: A lightweight multimodal architecture fusing sensor data and visual imagery to predict degradation severity at heritage sites, achieving 76.9% accuracy with simplified PerceiverIO and Adaptive Barlow Twins loss.

Details

Motivation: Cultural heritage sites face accelerating degradation due to climate change, but traditional unimodal monitoring (visual inspection or environmental sensors alone) fails to capture the complex interplay between environmental stressors and material deterioration.

Method: Proposed a lightweight multimodal architecture adapting PerceiverIO with simplified encoders (64D latent space) and Adaptive Barlow Twins loss that encourages modality complementarity rather than redundancy.

Result: Achieved 76.9% accuracy on Strasbourg Cathedral data, a 43% improvement over standard multimodal architectures and 25% over vanilla PerceiverIO. Sensor-only achieved 61.5% while image-only reached 46.2%, confirming multimodal synergy.

Conclusion: Architectural simplicity combined with contrastive regularization enables effective multimodal learning in data-scarce heritage monitoring contexts, providing a foundation for AI-driven conservation decision support systems.

Abstract: Cultural heritage sites face accelerating degradation due to climate change, yet tradi- tional monitoring relies on unimodal analysis (visual inspection or environmental sen- sors alone) that fails to capture the complex interplay between environmental stres- sors and material deterioration. We propose a lightweight multimodal architecture that fuses sensor data (temperature, humidity) with visual imagery to predict degradation severity at heritage sites. Our approach adapts PerceiverIO with two key innovations: (1) simplified encoders (64D latent space) that prevent overfitting on small datasets (n=37 training samples), and (2) Adaptive Barlow Twins loss that encourages modality complementarity rather than redundancy. On data from Strasbourg Cathedral, our model achieves 76.9% accu- racy, a 43% improvement over standard multimodal architectures (VisualBERT, Trans- former) and 25% over vanilla PerceiverIO. Ablation studies reveal that sensor-only achieves 61.5% while image-only reaches 46.2%, confirming successful multimodal synergy. A systematic hyperparameter study identifies an optimal moderate correlation target ({\tau} =0.3) that balances align- ment and complementarity, achieving 69.2% accuracy compared to other {\tau} values ({\tau} =0.1/0.5/0.7: 53.8%, {\tau} =0.9: 61.5%). This work demonstrates that architectural sim- plicity combined with contrastive regularization enables effective multimodal learning in data-scarce heritage monitoring contexts, providing a foundation for AI-driven con- servation decision support systems.

[362] CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization

Henrique Assumpção, Diego Ferreira, Leandro Campos, Fabricio Murai

Main category: cs.AI

TL;DR: CodeEvolve is an open-source evolutionary coding agent that combines LLMs with genetic algorithms to solve complex computational problems, outperforming Google DeepMind’s AlphaEvolve on several mathematical benchmarks.

Details

Motivation: To develop an open-source alternative to closed-source systems like AlphaEvolve that can solve complex computational problems by uniting LLMs with evolutionary algorithms.

Method: Uses island-based genetic algorithm for population diversity, inspiration-based crossover leveraging LLM context windows, and meta-prompting for dynamic solution space exploration.

Result: Surpassed AlphaEvolve’s performance on several challenging mathematical benchmarks from the evaluation set.

Conclusion: CodeEvolve demonstrates the effectiveness of combining LLMs with evolutionary algorithms for complex problem solving, and the framework is released as open-source to foster collaboration.

Abstract: In this work, we introduce CodeEvolve, an open-source evolutionary coding agent that unites Large Language Models (LLMs) with genetic algorithms to solve complex computational problems. Our framework adapts powerful evolutionary concepts to the LLM domain, building upon recent methods for generalized scientific discovery. CodeEvolve employs an island-based genetic algorithm to maintain population diversity and increase throughput, introduces a novel inspiration-based crossover mechanism that leverages the LLMs context window to combine features from successful solutions, and implements meta-prompting strategies for dynamic exploration of the solution space. We conduct a rigorous evaluation of CodeEvolve on a subset of the mathematical benchmarks used to evaluate Google DeepMind’s closed-source AlphaEvolve. Our findings show that our method surpasses AlphaEvolve’s performance on several challenging problems. To foster collaboration and accelerate progress, we release our complete framework as an open-source repository.

[363] Combining Reinforcement Learning and Behavior Trees for NPCs in Video Games with AMD Schola

Tian Liu, Alex Cann, Ian Colbert, Mehdi Saeedi

Main category: cs.AI

TL;DR: The paper explores combining reinforcement learning with behavior trees to create viable NPCs in commercial video games, addressing the slow adoption of RL in gaming.

Details

Motivation: To address the slow adoption of reinforcement learning in commercial video games and explore the intersection of RL with traditional behavior trees for practical Game AI applications.

Method: Used AMD Schola plugin in Unreal Engine to train RL agents, creating multi-task NPCs in a complex 3D environment inspired by “The Last of Us”, with detailed methodologies for joint training of RL models with behavior trees.

Result: Demonstrated the viability of combining RL with behavior trees for creating NPCs with various skills in complex game environments.

Conclusion: The RL+behavior tree intersection is a crucial approach that can help bridge the gap between RL research and practical commercial game development.

Abstract: While the rapid advancements in the reinforcement learning (RL) research community have been remarkable, the adoption in commercial video games remains slow. In this paper, we outline common challenges the Game AI community faces when using RL-driven NPCs in practice, and highlight the intersection of RL with traditional behavior trees (BTs) as a crucial juncture to be explored further. Although the BT+RL intersection has been suggested in several research papers, its adoption is rare. We demonstrate the viability of this approach using AMD Schola – a plugin for training RL agents in Unreal Engine – by creating multi-task NPCs in a complex 3D environment inspired by the commercial video game ``The Last of Us". We provide detailed methodologies for jointly training RL models with BTs while showcasing various skills.

[364] JEDA: Query-Free Clinical Order Search from Ambient Dialogues

Praphul Singh, Corey Barrett, Sumana Srivasta, Amitabh Saikia, Irfan Bulu, Sri Gadde, Krishnaram Kenthapadi

Main category: cs.AI

TL;DR: JEDA is a bi-encoder system that retrieves clinical orders directly from ambient dialogue using joint embedding of commands and context, eliminating LLM dependency for real-time clinical ordering.

Details

Motivation: Current systems rely on LLM rewriting which adds latency, instability, and opacity, making them unsuitable for real-time clinical ordering where explicit directives and implicit reasoning coexist.

Method: Domain-initialized bi-encoder from PubMedBERT with duplicate-safe contrastive objective, using constrained LLM guidance to align heterogeneous intent expressions to shared order concepts with query-free ambient dialogue encoding.

Result: JEDA achieves large performance gains, substantially outperforming base encoder and recent open embedders, with noise-resilient query-free mode that reduces sensitivity to disfluencies and ASR errors.

Conclusion: JEDA provides a fast, interpretable, LLM-free retrieval layer that links ambient clinical context to actionable orders in real time, enabling practical deployment in clinical settings.

Abstract: Clinical conversations mix explicit directives (order a chest X-ray) with implicit reasoning (the cough worsened overnight, we should check for pneumonia). Many systems rely on LLM rewriting, adding latency, instability, and opacity that hinder real-time ordering. We present JEDA (Joint Embedding for Direct and Ambient clinical orders), a domain-initialized bi-encoder that retrieves canonical orders directly and, in a query-free mode, encodes a short rolling window of ambient dialogue to trigger retrieval. Initialized from PubMedBERT and fine-tuned with a duplicate-safe contrastive objective, JEDA aligns heterogeneous expressions of intent to shared order concepts. Training uses constrained LLM guidance to tie each signed order to complementary formulations (command only, context only, command+context, context+reasoning), producing clearer inter-order separation, tighter query extendash order coupling, and stronger generalization. The query-free mode is noise-resilient, reducing sensitivity to disfluencies and ASR errors by conditioning on a short window rather than a single utterance. Deployed in practice, JEDA yields large gains and substantially outperforms its base encoder and recent open embedders (Linq Embed Mistral, SFR Embedding, GTE Qwen, BGE large, Embedding Gemma). The result is a fast, interpretable, LLM-free retrieval layer that links ambient context to actionable clinical orders in real time.

[365] ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

Roger Creus Castanyer, Faisal Mohamed, Pablo Samuel Castro, Cyrus Neary, Glen Berseth

Main category: cs.AI

TL;DR: ARM-FM is a framework that uses foundation models to automatically generate reward machines from natural language specifications for reinforcement learning, enabling compositional reward design and zero-shot generalization.

Details

Motivation: Reinforcement learning algorithms are highly sensitive to reward function specification, which limits their broad applicability. Current approaches require manual reward engineering, which is challenging and time-consuming.

Method: Uses foundation models to automatically generate reward machines from natural language specifications, associates language embeddings with each automata-state for generalization, and employs the structured formalism of reward machines for effective task decomposition.

Result: Empirical evidence shows ARM-FM’s effectiveness in diverse challenging environments, including demonstration of zero-shot generalization capabilities.

Conclusion: ARM-FM provides an automated framework for compositional reward design in RL that leverages foundation models’ high-level reasoning capabilities, addressing the central challenge of reward function specification.

Abstract: Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) – an automata-based formalism for reward specification – are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM’s effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.

[366] Implementation of AI in Precision Medicine

Göktuğ Bender, Samer Faraj, Anand Bhardwaj

Main category: cs.AI

TL;DR: Scoping review of AI implementation in precision medicine from 2019-2024, identifying barriers and enablers across data quality, clinical reliability, workflow integration, and governance.

Details

Motivation: AI is central to precision medicine but implementation in clinical settings remains limited, highlighting the need to understand barriers to real-world translation.

Method: Scoping review of literature from 2019-2024 using an ecosystem-based framework to analyze interdependent relationships in AI implementation.

Result: Identified key barriers and enablers across four domains: data quality, clinical reliability, workflow integration, and governance.

Conclusion: Proposes future directions to support trustworthy and sustainable implementation of AI in precision medicine through understanding ecosystem interdependencies.

Abstract: Artificial intelligence (AI) has become increasingly central to precision medicine by enabling the integration and interpretation of multimodal data, yet implementation in clinical settings remains limited. This paper provides a scoping review of literature from 2019-2024 on the implementation of AI in precision medicine, identifying key barriers and enablers across data quality, clinical reliability, workflow integration, and governance. Through an ecosystem-based framework, we highlight the interdependent relationships shaping real-world translation and propose future directions to support trustworthy and sustainable implementation.

[367] Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Trilok Padhi, Pinxian Lu, Abdulkadir Erol, Tanmay Sutar, Gauri Sharma, Mina Sonmez, Munmun De Choudhury, Ugur Kursuncu

Main category: cs.AI

TL;DR: This paper introduces a multi-turn harassment benchmark for LLM agents, showing that jailbreak attacks significantly increase toxic behavior in both open-source and closed-source models, with attack success rates up to 99.33% and refusal rates dropping to 1-2%.

Details

Motivation: Current jailbreak research focuses on single-turn prompts, but real harassment occurs over multiple interactions. There's a need to understand LLM vulnerability in multi-turn harassment scenarios to develop better safety measures.

Method: Created Online Harassment Agentic Benchmark with: synthetic multi-turn harassment dataset, multi-agent simulation using game theory, three jailbreak methods (memory, planning, fine-tuning), and mixed-methods evaluation on LLaMA-3.1-8B-Instruct and Gemini-2.0-flash.

Result: Jailbreak tuning dramatically increases harassment success (95.78-96.89% vs 57.25-64.19% in Llama; 99.33% vs 98.46% in Gemini) and reduces refusal rates to 1-2%. Insult and flaming behaviors become prevalent (84.9-87.8% and 81.2-85.1% respectively). Closed-source models show distinct escalation patterns with significant vulnerability.

Conclusion: Multi-turn, theory-grounded attacks successfully mimic human-like harassment dynamics at high rates, highlighting the need for robust safety guardrails to protect online platforms from LLM-powered harassment.

Abstract: Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed with an attack success rate of 95.78–96.89% vs. 57.25–64.19% without tuning in Llama, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing refusal rate to 1-2% in both models. The most prevalent toxic behaviors are Insult with 84.9–87.8% vs. 44.2–50.8% without tuning, and Flaming with 81.2–85.1% vs. 31.5–38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning, and narcissistic tendencies with memory. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.

[368] RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li

Main category: cs.AI

TL;DR: RADAR is a multi-agent collaborative framework that improves LLM safety evaluation by decomposing risk into explicit, implicit, and non-risk subspaces, using multi-round debates and dynamic updates to reduce evaluator bias and enhance risk detection.

Details

Motivation: Existing LLM safety evaluation methods suffer from evaluator bias and detection failures due to model homogeneity, undermining the robustness of risk evaluation processes.

Method: Proposes RADAR framework with multi-agent collaboration using four specialized roles and multi-round debate mechanisms, decomposing risk concept space into explicit, implicit, and non-risk subspaces with dynamic update mechanisms for self-evolution.

Result: RADAR achieves 28.87% improvement in risk identification accuracy compared to strongest baseline, with significant outperformance across accuracy, stability, and self-evaluation risk sensitivity on challenging testset of 800 cases and public benchmarks.

Conclusion: RADAR provides a robust safety evaluation paradigm that effectively mitigates evaluator bias and comprehensively covers both explicit and implicit risks through multi-agent collaborative framework.

Abstract: Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.

[369] LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild

Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty

Main category: cs.AI

TL;DR: LiveResearchBench is a benchmark for evaluating deep research systems, featuring 100 expert-curated tasks requiring extensive web search and synthesis. DeepEval provides comprehensive evaluation metrics for citation-grounded reports.

Details

Motivation: Existing benchmarks fall short in evaluating deep research capabilities, lacking user-centric, dynamic, unambiguous, and search-intensive tasks that reflect realistic information needs.

Method: Introduced LiveResearchBench with 100 tasks spanning daily life, enterprise, and academia, built with 1,500+ hours of human labor. Developed DeepEval suite covering content- and report-level quality metrics with four complementary evaluation protocols.

Result: Comprehensive evaluation of 17 frontier deep research systems revealed current strengths, recurring failure modes, and identified key system components needed for reliable deep research.

Conclusion: The proposed benchmark and evaluation framework provide rigorous tools for advancing agentic systems’ deep research capabilities, highlighting areas for improvement in reliable information synthesis.

Abstract: Deep research – producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources – marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.

[370] RareAgent: Self-Evolving Reasoning for Drug Repurposing in Rare Diseases

Lang Qin, Zijian Gan, Xu Cao, Pengcheng Jiang, Yankai Jiang, Jiawei Han, Kaishun Wu, Jintai Chen

Main category: cs.AI

TL;DR: RareAgent is a self-evolving multi-agent system that reframes drug repurposing for rare diseases from passive pattern recognition to active evidence-seeking reasoning through adversarial debates.

Details

Motivation: Computational drug repurposing for rare diseases is challenging when no prior drug-disease associations exist, making knowledge graph completion and GNNs perform poorly due to lack of reliable signals.

Method: RareAgent organizes task-specific adversarial debates where agents dynamically construct evidence graphs from diverse perspectives to support, refute, or entail hypotheses. It uses a self-evolutionary loop to analyze reasoning strategies and refine agent policies.

Result: RareAgent improves indication AUPRC by 18.1% over reasoning baselines and provides transparent reasoning chains consistent with clinical evidence.

Conclusion: The system successfully transforms drug repurposing into active reasoning and generates transferable heuristics for future investigations.

Abstract: Computational drug repurposing for rare diseases is especially challenging when no prior associations exist between drugs and target diseases. Therefore, knowledge graph completion and message-passing GNNs have little reliable signal to learn and propagate, resulting in poor performance. We present RareAgent, a self-evolving multi-agent system that reframes this task from passive pattern recognition to active evidence-seeking reasoning. RareAgent organizes task-specific adversarial debates in which agents dynamically construct evidence graphs from diverse perspectives to support, refute, or entail hypotheses. The reasoning strategies are analyzed post hoc in a self-evolutionary loop, producing textual feedback that refines agent policies, while successful reasoning paths are distilled into transferable heuristics to accelerate future investigations. Comprehensive evaluations reveal that RareAgent improves the indication AUPRC by 18.1% over reasoning baselines and provides a transparent reasoning chain consistent with clinical evidence.

[371] Towards Agentic Self-Learning LLMs in Search Environment

Wangtao Sun, Xiang Cheng, Jialin Fan, Yao Xu, Xing Yu, Shizhu He, Jun Zhao, Kang Liu

Main category: cs.AI

TL;DR: Agentic Self-Learning (ASL) is a closed-loop RL framework that enables LLM-based agents to self-improve without human data or rule-based rewards, using multi-role co-evolution of task generation, policy execution, and generative reward modeling.

Details

Motivation: To scale LLM-based agents without relying on human-curated datasets or predefined rule-based rewards, addressing the limitations of current approaches that plateau or degrade.

Method: Proposed ASL framework with three coordinated components: Prompt Generator, Policy Model, and Generative Reward Model (GRM) that co-evolve in a virtuous cycle of harder task setting, sharper verification, and stronger solving.

Result: ASL delivers steady performance gains, surpasses RLVR baselines (e.g., Search-R1), continues improving under zero-labeled-data conditions, and shows superior sample efficiency and robustness compared to methods that plateau.

Conclusion: Reward source (GRM vs rule-based) and data scale are critical for open-domain agent learning; multi-role co-evolution enables scalable self-improvement, with GRM verification capacity being the main bottleneck that requires continual training.

Abstract: We study whether self-learning can scale LLM-based agents without relying on human-curated datasets or predefined rule-based rewards. Through controlled experiments in a search-agent setting, we identify two key determinants of scalable agent training: the source of reward signals and the scale of agent task data. We find that rewards from a Generative Reward Model (GRM) outperform rigid rule-based signals for open-domain learning, and that co-evolving the GRM with the policy further boosts performance. Increasing the volume of agent task data-even when synthetically generated-substantially enhances agentic capabilities. Building on these insights, we propose \textbf{Agentic Self-Learning} (ASL), a fully closed-loop, multi-role reinforcement learning framework that unifies task generation, policy execution, and evaluation within a shared tool environment and LLM backbone. ASL coordinates a Prompt Generator, a Policy Model, and a Generative Reward Model to form a virtuous cycle of harder task setting, sharper verification, and stronger solving. Empirically, ASL delivers steady, round-over-round gains, surpasses strong RLVR baselines (e.g., Search-R1) that plateau or degrade, and continues improving under zero-labeled-data conditions, indicating superior sample efficiency and robustness. We further show that GRM verification capacity is the main bottleneck: if frozen, it induces reward hacking and stalls progress; continual GRM training on the evolving data distribution mitigates this, and a small late-stage injection of real verification data raises the performance ceiling. This work establishes reward source and data scale as critical levers for open-domain agent learning and demonstrates the efficacy of multi-role co-evolution for scalable, self-improving agents. The data and code of this paper are released at https://github.com/forangel2014/Towards-Agentic-Self-Learning

[372] Measuring and Mitigating Identity Bias in Multi-Agent Debate via Anonymization

Hyeong Kyu Choi, Xiaojin Zhu, Sharon Li

Main category: cs.AI

TL;DR: The paper addresses identity bias in multi-agent debate systems, where LLMs exhibit sycophancy (uncritically adopting peers’ views) and self-bias (stubbornly adhering to their own outputs), undermining debate reliability.

Details

Motivation: Recent studies reveal that agents in multi-agent debate systems are not neutral - they suffer from identity-driven sycophancy and self-bias, which compromises the reliability of debate outcomes.

Method: The authors propose a principled framework that formalizes debate dynamics as identity-weighted Bayesian update, introduces response anonymization by removing identity markers from prompts, and defines the Identity Bias Coefficient (IBC) metric to measure bias.

Result: Empirical studies across multiple models, datasets and debate rounds confirm that identity bias is widespread, with sycophancy being far more common than self-bias.

Conclusion: The findings highlight the need to ‘mask’ identity in multi-agent debate systems to ensure reasoning is based on content rather than source identity, with code released for implementation.

Abstract: Multi-agent debate (MAD) aims to improve large language model (LLM) reasoning by letting multiple agents exchange answers and then aggregate their opinions. Yet recent studies reveal that agents are not neutral: they are prone to identity-driven sycophancy and self-bias, uncritically adopting a peer’s view or stubbornly adhering to their own prior output, undermining the reliability of debate. In this work, we present the first principled framework that joins sycophancy and self-bias to mitigate and quantify identity bias in MAD. First, we formalize the debate dynamics as an identity-weighted Bayesian update process. Second, we propose response anonymization: by removing identity markers from prompts, agents cannot distinguish “self” from “peer”, which forces equal weights on agent identity, thereby reducing bias. Third, we define the Identity Bias Coefficient (IBC), a principled metric that measures how often an agent follows a peer versus itself. Empirical studies across multiple models, datasets and debate rounds confirm that identity bias is widespread, with sycophancy far more common than self-bias. Our findings highlight the need to “mask” identity to ensure that MAD systems reason based on content rather than source identity. Code is released in https://github.com/deeplearning-wisc/MAD-identity-bias.

[373] MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning

Xukai Wang, Xuanbo Liu, Mingrui Chen, Haitian Zhong, Xuanlin Yang, Bohan Zeng, Jinbo Hu, Hao Liang, Junbo Niu, Xuchen Li, Ruitao Wu, Ruichuan An, Yang Shi, Liu Liu, Xu-Yao Zhang, Qiang Liu, Zhouchen Lin, Wentao Zhang, Bin Dong

Main category: cs.AI

TL;DR: MorphoBench is a new benchmark for evaluating reasoning capabilities of large models that can adaptively adjust question difficulty based on model performance, using multidisciplinary questions from Olympiad competitions and simulation-generated content.

Details

Motivation: Existing reasoning benchmarks are limited in scope and lack flexibility to adapt difficulty according to evolving model capabilities, creating a need for more comprehensive and adaptive evaluation tools.

Method: Curated complex reasoning questions from existing benchmarks and Olympiad competitions, adaptively modified question difficulty using key statements from model reasoning processes, and included simulation-generated questions for dynamic difficulty adjustment.

Result: Collected over 1,300 test questions and iteratively adjusted difficulty based on models like o3 and GPT-5, enhancing evaluation comprehensiveness and validity.

Conclusion: MorphoBench provides reliable guidance for improving reasoning abilities and scientific robustness of large models through its adaptive difficulty adjustment and comprehensive evaluation approach.

Abstract: With the advancement of powerful large-scale reasoning models, effectively evaluating the reasoning capabilities of these models has become increasingly important. However, existing benchmarks designed to assess the reasoning abilities of large models tend to be limited in scope and lack the flexibility to adapt their difficulty according to the evolving reasoning capacities of the models. To address this, we propose MorphoBench, a benchmark that incorporates multidisciplinary questions to evaluate the reasoning capabilities of large models and can adjust and update question difficulty based on the reasoning abilities of advanced models. Specifically, we curate the benchmark by selecting and collecting complex reasoning questions from existing benchmarks and sources such as Olympiad-level competitions. Additionally, MorphoBench adaptively modifies the analytical challenge of questions by leveraging key statements generated during the model’s reasoning process. Furthermore, it includes questions generated using simulation software, enabling dynamic adjustment of benchmark difficulty with minimal resource consumption. We have gathered over 1,300 test questions and iteratively adjusted the difficulty of MorphoBench based on the reasoning capabilities of models such as o3 and GPT-5. MorphoBench enhances the comprehensiveness and validity of model reasoning evaluation, providing reliable guidance for improving both the reasoning abilities and scientific robustness of large models. The code has been released in https://github.com/OpenDCAI/MorphoBench.

[374] Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

Marco Del Tredici, Jacob McCarran, Benjamin Breen, Javier Aspuru Mijares, Weichen Winston Yin, Jacob M. Taylor, Frank H. L. Koppens, Dirk Englund

Main category: cs.AI

TL;DR: Ax-Prover is a multi-agent system for automated theorem proving in Lean that combines LLMs with formal verification tools, achieving competitive performance on math benchmarks and strong generalization across scientific domains.

Details

Motivation: To create a generalizable automated theorem prover that can handle diverse scientific domains while maintaining formal correctness, addressing the limitations of specialized systems that struggle with generalization.

Method: Equips Large Language Models with Lean tools via Model Context Protocol (MCP), combining LLM knowledge and reasoning with formal verification tools to ensure syntactic rigor while maintaining creative reasoning capabilities.

Result: Competitive with state-of-the-art provers on public math benchmarks and largely outperforms them on new benchmarks in abstract algebra and quantum theory. Successfully assisted an expert mathematician in formalizing a complex cryptography theorem proof.

Conclusion: The tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains, overcoming the generalization limitations of specialized systems.

Abstract: We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover’s assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.

[375] A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space

Bingjie Zhang, Yibo Yang, Renzhe, Dandan Guo, Jindong Gu, Philip Torr, Bernard Ghanem

Main category: cs.AI

TL;DR: GuardSpace is a framework that preserves safety alignment in LLMs during fine-tuning by decomposing weights into safety-relevant and irrelevant components, and restricting adapter updates to maintain refusal behavior on harmful prompts.

Details

Motivation: LLMs lose safety alignment during fine-tuning even on benign data, leading to harmful responses. Current methods degrade pre-trained safety behaviors.

Method: Decompose pre-trained weights using covariance-preconditioned SVD into safety-relevant (frozen) and irrelevant components. Initialize low-rank adapters from safety-irrelevant weights and use null space projector to restrict updates that alter safe outputs on harmful prompts.

Result: GuardSpace reduces average harmful score from 14.4% to 3.6% and improves accuracy from 26.0% to 28.0% for Llama-2-7B-Chat fine-tuned on GSM8K, outperforming state-of-the-art AsFT method.

Conclusion: GuardSpace effectively preserves safety alignment during fine-tuning while maintaining or improving task performance across various models and downstream tasks.

Abstract: Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation. Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models. To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space. First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism. Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior. Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods. Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4% to 3.6%, while improving the accuracy from from 26.0% to 28.0%.

[376] Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies

Mason Nakamura, Abhinav Kumar, Saaduddin Mahmud, Sahar Abdelnabi, Shlomo Zilberstein, Eugene Bagdasarian

Main category: cs.AI

TL;DR: Terrarium is a framework for studying safety, privacy, and security in LLM-based multi-agent systems, using a modular testbed to identify and analyze attack vectors like misalignment, malicious agents, and data poisoning.

Details

Motivation: LLM-powered multi-agent systems can automate user tasks but introduce new risks including misalignment, malicious attacks, and data theft that compromise agents or steal user data.

Method: Repurposes the blackboard design to create a modular, configurable testbed for multi-agent collaboration, identifying key attack vectors and implementing three collaborative scenarios with four representative attacks.

Result: Demonstrates the framework’s flexibility through implementation of collaborative MAS scenarios with attacks, providing tools for rapid prototyping and evaluation of defenses.

Conclusion: Terrarium aims to accelerate progress toward trustworthy multi-agent systems by enabling systematic study and development of defenses against security and privacy threats in LLM-based MAS.

Abstract: A multi-agent system (MAS) powered by large language models (LLMs) can automate tedious user tasks such as meeting scheduling that requires inter-agent collaboration. LLMs enable nuanced protocols that account for unstructured private data, user constraints, and preferences. However, this design introduces new risks, including misalignment and attacks by malicious parties that compromise agents or steal user data. In this paper, we propose the Terrarium framework for fine-grained study on safety, privacy, and security in LLM-based MAS. We repurpose the blackboard design, an early approach in multi-agent systems, to create a modular, configurable testbed for multi-agent collaboration. We identify key attack vectors such as misalignment, malicious agents, compromised communication, and data poisoning. We implement three collaborative MAS scenarios with four representative attacks to demonstrate the framework’s flexibility. By providing tools to rapidly prototype, evaluate, and iterate on defenses and designs, Terrarium aims to accelerate progress toward trustworthy multi-agent systems.

[377] Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction

Xu Shen, Qi Zhang, Song Wang, Zhen Tan, Xinyu Zhao, Laura Yao, Vaishnav Tadiparthi, Hossein Nourkhiz Mahjoub, Ehsan Moradi Pari, Kwonjoon Lee, Tianlong Chen

Main category: cs.AI

TL;DR: MASC is a metacognitive framework for multi-agent systems that enables real-time, unsupervised error detection and self-correction to prevent cascading errors.

Details

Motivation: Large Language Model based multi-agent systems are brittle to cascading errors where a single faulty step can propagate across agents and disrupt the entire problem-solving trajectory.

Method: MASC uses two complementary designs: Next-Execution Reconstruction (predicts next step embeddings from query and history) and Prototype-Guided Enhancement (learns prototype prior over normal-step embeddings to stabilize anomaly scoring under sparse context). When anomalies are detected, a correction agent revises the output before downstream propagation.

Result: On the Who&When benchmark, MASC improved step-level error detection by up to 8.47% AUC-ROC and delivered consistent end-to-end gains across diverse MAS architectures with minimal overhead.

Conclusion: Metacognitive monitoring and targeted correction can effectively mitigate error propagation in multi-agent systems, confirming the framework’s robustness across different architectures.

Abstract: Large Language Model based multi-agent systems (MAS) excel at collaborative problem solving but remain brittle to cascading errors: a single faulty step can propagate across agents and disrupt the trajectory. In this paper, we present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction. MASC rethinks detection as history-conditioned anomaly scoring via two complementary designs: (1) Next-Execution Reconstruction, which predicts the embedding of the next step from the query and interaction history to capture causal consistency, and (2) Prototype-Guided Enhancement, which learns a prototype prior over normal-step embeddings and uses it to stabilize reconstruction and anomaly scoring under sparse context (e.g., early steps). When an anomaly step is flagged, MASC triggers a correction agent to revise the acting agent’s output before information flows downstream. On the Who&When benchmark, MASC consistently outperforms all baselines, improving step-level error detection by up to 8.47% AUC-ROC ; When plugged into diverse MAS frameworks, it delivers consistent end-to-end gains across architectures, confirming that our metacognitive monitoring and targeted correction can mitigate error propagation with minimal overhead.

[378] AI for Service: Proactive Assistance with AI Glasses

Zichen Wen, Yiyu Wang, Chenfei Liao, Boxue Yang, Junxian Li, Weifeng Liu, Haocong He, Bolong Feng, Xuyang Liu, Yuanhuiyi Lyu, Xu Zheng, Xuming Hu, Linfeng Zhang

Main category: cs.AI

TL;DR: AI4Service introduces proactive AI assistance through Alpha-Service framework that anticipates user needs from egocentric video and provides personalized services without explicit commands.

Details

Motivation: Current AI services are reactive and only respond to explicit commands, but truly intelligent assistants should anticipate user needs and act proactively.

Method: Alpha-Service framework with five components inspired by von Neumann architecture: Input Unit (perception), Central Processing Unit (task scheduling), Arithmetic Logic Unit (tool utilization), Memory Unit (personalization), and Output Unit (human interaction), implemented as multi-agent system on AI glasses.

Result: Case studies demonstrate successful deployment including real-time Blackjack advisor, museum tour guide, and shopping fit assistant that can perceive environment, infer intent, and provide timely assistance without explicit prompts.

Conclusion: AI4Service paradigm enables proactive, real-time assistance in daily life through the Alpha-Service framework, moving AI from passive tools to active companions.

Abstract: In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.

[379] Can MLLMs Absorb Math Reasoning Abilities from LLMs as Free Lunch?

Yijie Hu, Zihao Zhou, Kaizhu Huang, Xiaowei Huang, Qiufeng Wang

Main category: cs.AI

TL;DR: IP-Merging is a tuning-free method that enables multi-modal LLMs (MLLMs) to directly acquire math reasoning abilities from off-the-shelf math LLMs by identifying reasoning-associated parameters, projecting them into MLLM’s subspace, and merging while maintaining alignment.

Details

Motivation: Math reasoning performance of MLLMs lags behind text-only LLMs, and existing model-merging approaches overlook the alignment gap between MLLMs and math LLMs, resulting in poor performance.

Method: IP-Merging identifies reasoning-associated parameters in both MLLM and math LLM, projects them into MLLM’s subspace to maintain alignment, and merges parameters in this subspace without tuning.

Result: Extensive experiments show IP-Merging enhances MLLMs’ math reasoning ability directly from math LLMs without compromising other capabilities.

Conclusion: The proposed IP-Merging method successfully transfers math reasoning abilities from text LLMs to MLLMs without tuning, addressing the parameter space gap issue through identification, projection, and subspace merging.

Abstract: Math reasoning has been one crucial ability of large language models (LLMs), where significant advancements have been achieved in recent years. However, most efforts focus on LLMs by curating high-quality annotation data and intricate training (or inference) paradigms, while the math reasoning performance of multi-modal LLMs (MLLMs) remains lagging behind. Since the MLLM typically consists of an LLM and a vision block, we wonder: Can MLLMs directly absorb math reasoning abilities from off-the-shelf math LLMs without tuning? Recent model-merging approaches may offer insights into this question. However, they overlook the alignment between the MLLM and LLM, where we find that there is a large gap between their parameter spaces, resulting in lower performance. Our empirical evidence reveals two key factors behind this issue: the identification of crucial reasoning-associated layers in the model and the mitigation of the gaps in parameter space. Based on the empirical insights, we propose IP-Merging that first identifies the reasoning-associated parameters in both MLLM and Math LLM, then projects them into the subspace of MLLM, aiming to maintain the alignment, and finally merges parameters in this subspace. IP-Merging is a tuning-free approach since parameters are directly adjusted. Extensive experiments demonstrate that our IP-Merging method can enhance the math reasoning ability of MLLMs directly from Math LLMs without compromising their other capabilities.

[380] Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control

Zhe Wu, Hongjin Lu, Junliang Xing, Changhao Zhang, Yin Zhu, Yuhao Yang, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, Jun Wang, Yuanchun Shi

Main category: cs.AI

TL;DR: Hi-Agent is a hierarchical vision-language agent for mobile control that achieves SOTA 87.9% task success rate on Android-in-the-Wild benchmark, significantly outperforming prior methods.

Details

Motivation: Existing approaches rely on direct state-to-action mappings which lack structured reasoning and planning, leading to poor generalization to novel tasks or unseen UI layouts.

Method: Hierarchical architecture with high-level reasoning model and low-level action model jointly optimized. Reformulates multi-step decision-making as sequence of single-step subgoals with foresight advantage function leveraging execution feedback.

Result: Achieves 87.9% task success rate on Android-in-the-Wild benchmark, outperforming AppAgent (17.7%), Filtered BC (54.5%), and DigiRL (71.9%). Shows competitive zero-shot generalization on ScreenSpot-v2 and scales effectively on AndroidWorld benchmark.

Conclusion: Hi-Agent’s hierarchical design with joint optimization enables stable training and strong performance in complex mobile control scenarios, demonstrating superior generalization capabilities compared to existing approaches.

Abstract: Building agents that autonomously operate mobile devices has attracted increasing attention. While Vision-Language Models (VLMs) show promise, most existing approaches rely on direct state-to-action mappings, which lack structured reasoning and planning, and thus generalize poorly to novel tasks or unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function, which leverages execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue encountered by Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent also scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.

[381] IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning

Xikai Zhang, Bo Wang, Likang Xiao, Yongzhi Li, Quan Chen, Wenju Wu, Liu Liu

Main category: cs.AI

TL;DR: The IMAGINE framework integrates Multi-Agent System reasoning capabilities into a single compact model, achieving superior performance on complex planning tasks while reducing model size and computational costs.

Details

Motivation: Current LLMs struggle with complex reasoning and planning tasks, and while Multi-Agent Systems can improve reasoning, they suffer from high computational costs, long latency, and training difficulties.

Method: Proposed IMAGINE framework that integrates MAS reasoning and planning capabilities into a single model through end-to-end training, enabling a compact model to acquire structured reasoning abilities.

Result: A Qwen3-8B-Instruct model trained with IMAGINE achieved 82.7% Final Pass Rate on TravelPlanner benchmark, significantly outperforming DeepSeek-R1-671B’s 40% while being much smaller.

Conclusion: The IMAGINE framework successfully enables small-scale models to surpass the capabilities of well-organized Multi-Agent Systems in complex reasoning tasks, offering a scalable and efficient alternative.

Abstract: Although large language models (LLMs) have made significant strides across various tasks, they still face significant challenges in complex reasoning and planning. For example, even with carefully designed prompts and prior information explicitly provided, GPT-4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in the sole-planning mode. Similarly, even in the thinking mode, Qwen3-8B-Instruct and DeepSeek-R1-671B, only achieve Final Pass Rates of 5.9% and 40%, respectively. Although well-organized Multi-Agent Systems (MAS) can offer improved collective reasoning, they often suffer from high reasoning costs due to multi-round internal interactions, long per-response latency, and difficulties in end-to-end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi-Agent System into One Model. This framework not only integrates the reasoning and planning capabilities of MAS into a single, compact model, but also significantly surpass the capabilities of the MAS through a simple end-to-end training. Through this pipeline, a single small-scale model is not only able to acquire the structured reasoning and planning capabilities of a well-organized MAS but can also significantly outperform it. Experimental results demonstrate that, when using Qwen3-8B-Instruct as the base model and training it with our method, the model achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek-R1-671B, while maintaining a much smaller model size.

[382] Eliminating Negative Occurrences of Derived Predicates from PDDL Axioms

Claudia Grundke, Gabriele Röger

Main category: cs.AI

TL;DR: The paper presents a transformation method to eliminate negative occurrences of derived predicates in PDDL axioms, showing that stratifiable axioms can be converted to comply with PDDL standards while maintaining equivalent expressive power.

Details

Motivation: PDDL restricts negative predicate occurrences in axiom bodies to only predicates directly set by actions, but literature often uses stratifiable axioms which violate this. The paper aims to bridge this gap by showing these approaches are equivalent.

Method: The authors develop a transformation technique that converts stratifiable axioms with negative occurrences of derived predicates into equivalent axiom sets that comply with PDDL standards.

Result: The transformation successfully eliminates negative occurrences of derived predicates while preserving the same query expressiveness as least fixed-point logic.

Conclusion: Both PDDL’s restricted approach and the literature’s stratifiable approach have equivalent expressive power, and negative occurrences of derived predicates can be eliminated through the presented transformation.

Abstract: Axioms are a feature of the Planning Domain Definition Language PDDL that can be considered as a generalization of database query languages such as Datalog. The PDDL standard restricts negative occurrences of predicates in axiom bodies to predicates that are directly set by actions and not derived by axioms. In the literature, authors often deviate from this limitation and only require that the set of axioms is stratifiable. Both variants can express exactly the same queries as least fixed-point logic, indicating that negative occurrences of derived predicates can be eliminated. We present the corresponding transformation.

[383] Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration

Haoyuan Li, Mathias Funk, Aaqib Saeed

Main category: cs.AI

TL;DR: Helmsman is a multi-agent system that automates federated learning system design from user specifications through planning, code generation, and autonomous evaluation.

Details

Motivation: Federated Learning faces complexity in designing robust systems due to data heterogeneity and system constraints, creating brittle solutions.

Method: Three-phase approach: (1) interactive human-in-the-loop planning, (2) modular code generation by supervised agent teams, (3) autonomous evaluation and refinement in sandboxed simulation.

Result: Generated solutions are competitive with and often superior to hand-crafted baselines, as demonstrated on the new AgentFL-Bench benchmark with 16 diverse tasks.

Conclusion: Helmsman represents a significant step towards automated engineering of complex decentralized AI systems.

Abstract: Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised agent teams, and (3) a closed-loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems.

[384] JSPLIT: A Taxonomy-based Solution for Prompt Bloating in Model Context Protocol

Emanuele Antonioni, Stefan Markovic, Anirudha Shankar, Jaime Bernardo, Lovro Markovic, Silvia Pareti, Benedetto Proietti

Main category: cs.AI

TL;DR: JSPLIT is a taxonomy-driven framework that reduces prompt bloating in LLM agents by organizing MCP tools hierarchically and selecting only relevant tools based on user prompts.

Details

Motivation: Address the growing problem of prompt bloating in agentic systems using MCP tools, which leads to high token costs, increased latency, and reduced task success due to irrelevant tool inclusion.

Method: Organizes tools into hierarchical taxonomy and uses user prompts to identify and include only the most relevant tools based on both query content and taxonomy structure.

Result: Significantly reduces prompt size without compromising agent effectiveness, and improves tool selection accuracy in high-complexity environments with many available tools.

Conclusion: JSPLIT effectively reduces costs while improving task success in complex agent environments by managing prompt size through taxonomy-driven tool selection.

Abstract: AI systems are continually evolving and advancing, and user expectations are concurrently increasing, with a growing demand for interactions that go beyond simple text-based interaction with Large Language Models (LLMs). Today’s applications often require LLMs to interact with external tools, marking a shift toward more complex agentic systems. To support this, standards such as the Model Context Protocol (MCP) have emerged, enabling agents to access tools by including a specification of the capabilities of each tool within the prompt. Although this approach expands what agents can do, it also introduces a growing problem: prompt bloating. As the number of tools increases, the prompts become longer, leading to high prompt token costs, increased latency, and reduced task success resulting from the selection of tools irrelevant to the prompt. To address this issue, we introduce JSPLIT, a taxonomy-driven framework designed to help agents manage prompt size more effectively when using large sets of MCP tools. JSPLIT organizes the tools into a hierarchical taxonomy and uses the user’s prompt to identify and include only the most relevant tools, based on both the query and the taxonomy structure. In this paper, we describe the design of the taxonomy, the tool selection algorithm, and the dataset used to evaluate JSPLIT. Our results show that JSPLIT significantly reduces prompt size without significantly compromising the agent’s ability to respond effectively. As the number of available tools for the agent grows substantially, JSPLIT even improves the tool selection accuracy of the agent, effectively reducing costs while simultaneously improving task success in high-complexity agent environments.

[385] Symbol Grounding in Neuro-Symbolic AI: A Gentle Introduction to Reasoning Shortcuts

Emanuele Marconato, Samuele Bortolotti, Emile van Krieken, Paolo Morettin, Elena Umili, Antonio Vergari, Efthymia Tsamoura, Andrea Passerini, Stefano Teso

Main category: cs.AI

TL;DR: This paper provides an overview of Reasoning Shortcuts (RSs) in neuro-symbolic AI, where models achieve high accuracy by incorrectly grounding concepts, compromising interpretability and reliability. It discusses causes, consequences, theoretical characterizations, and mitigation strategies.

Details

Motivation: Neuro-symbolic AI aims to create reliable and trustworthy AI by combining neural networks with symbolic reasoning, but Reasoning Shortcuts undermine this goal by allowing models to achieve high accuracy while grounding concepts incorrectly, which is difficult to detect without direct concept supervision.

Method: The paper provides a gentle introduction to Reasoning Shortcuts, discusses their causes and consequences in intuitive terms, reviews existing theoretical characterizations, and details methods for dealing with RSs including mitigation and awareness strategies.

Result: The overview reformulates advanced material in digestible form to provide a unifying perspective on Reasoning Shortcuts, mapping the benefits and limitations of existing approaches to help researchers and practitioners understand and tackle this challenging problem.

Conclusion: By lowering the barrier to entry for addressing Reasoning Shortcuts, this overview aims to contribute to the development of reliable neuro-symbolic AI and trustworthy AI models that maintain interpretability and performance in out-of-distribution scenarios.

Abstract: Neuro-symbolic (NeSy) AI aims to develop deep neural networks whose predictions comply with prior knowledge encoding, e.g. safety or structural constraints. As such, it represents one of the most promising avenues for reliable and trustworthy AI. The core idea behind NeSy AI is to combine neural and symbolic steps: neural networks are typically responsible for mapping low-level inputs into high-level symbolic concepts, while symbolic reasoning infers predictions compatible with the extracted concepts and the prior knowledge. Despite their promise, it was recently shown that - whenever the concepts are not supervised directly - NeSy models can be affected by Reasoning Shortcuts (RSs). That is, they can achieve high label accuracy by grounding the concepts incorrectly. RSs can compromise the interpretability of the model’s explanations, performance in out-of-distribution scenarios, and therefore reliability. At the same time, RSs are difficult to detect and prevent unless concept supervision is available, which is typically not the case. However, the literature on RSs is scattered, making it difficult for researchers and practitioners to understand and tackle this challenging problem. This overview addresses this issue by providing a gentle introduction to RSs, discussing their causes and consequences in intuitive terms. It also reviews and elucidates existing theoretical characterizations of this phenomenon. Finally, it details methods for dealing with RSs, including mitigation and awareness strategies, and maps their benefits and limitations. By reformulating advanced material in a digestible form, this overview aims to provide a unifying perspective on RSs to lower the bar to entry for tackling them. Ultimately, we hope this overview contributes to the development of reliable NeSy and trustworthy AI models.

[386] LLM Agents Beyond Utility: An Open-Ended Perspective

Asen Nachkov, Xi Wang, Luc Van Gool

Main category: cs.AI

TL;DR: The paper explores whether LLM agents can evolve from problem-solving tools into autonomous entities capable of planning, designing tasks, and reasoning toward ambiguous goals through an open-ended experimental setup.

Details

Motivation: To investigate if LLM agents can represent autonomous entities rather than just problem-solving tools, capable of planning, task design, and reasoning toward broader goals.

Method: Augment a pretrained LLM agent with abilities to generate its own tasks, accumulate knowledge, and interact extensively with its environment in an open-ended experimental setting.

Result: The agent can reliably follow complex multi-step instructions, store and reuse information across runs, and propose/solve its own tasks, but remains sensitive to prompt design, prone to repetitive task generation, and unable to form self-representations.

Conclusion: The findings show both promise and current limitations in adapting pretrained LLMs toward open-endedness, pointing to future directions for training agents in memory management, productive exploration, and pursuit of abstract long-term goals.

Abstract: Recent LLM agents have made great use of chain of thought reasoning and function calling. As their capabilities grow, an important question arises: can this software represent not only a smart problem-solving tool, but an entity in its own right, that can plan, design immediate tasks, and reason toward broader, more ambiguous goals? To study this question, we adopt an open-ended experimental setting where we augment a pretrained LLM agent with the ability to generate its own tasks, accumulate knowledge, and interact extensively with its environment. We study the resulting open-ended agent qualitatively. It can reliably follow complex multi-step instructions, store and reuse information across runs, and propose and solve its own tasks, though it remains sensitive to prompt design, prone to repetitive task generation, and unable to form self-representations. These findings illustrate both the promise and current limits of adapting pretrained LLMs toward open-endedness, and point to future directions for training agents to manage memory, explore productively, and pursue abstract long-term goals.

[387] ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks

Yuanyi Song, Heyuan Huang, Qiqiang Lin, Yin Zhao, Xiangmou Qu, Jun Wang, Xingyu Lou, Weiwen Liu, Zhuosheng Zhang, Jun Wang, Yong Yu, Weinan Zhang, Zhaoxiang Wang

Main category: cs.AI

TL;DR: ColorBench is a graph-structured benchmark for evaluating mobile agents on complex long-horizon tasks, addressing limitations of current evaluation methods by supporting multiple valid solutions and quasi-dynamic interaction.

Details

Motivation: Current mobile agent evaluation methods are inadequate - offline benchmarks only validate single predefined paths, while online testing suffers from complexity and non-reproducibility issues on real devices.

Method: Developed a graph-structured benchmarking framework that models finite states from real-device interactions to achieve static simulation of dynamic behaviors. Created ColorBench with 175 tasks (74 single-app, 101 cross-app) averaging over 13 steps, each with multiple correct paths and error paths.

Result: The benchmark enables evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis. Evaluation across various baselines revealed limitations of existing models.

Conclusion: The proposed framework bridges offline and online evaluation gaps, enhances testing stability, and provides improvement directions and technical pathways for enhancing agent performance on complex long-horizon problems.

Abstract: The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real-world mobile tasks are often complex and allow for multiple valid solutions. This contradicts current mobile agent evaluation standards: offline static benchmarks can only validate a single predefined “golden path”, while online dynamic testing is constrained by the complexity and non-reproducibility of real devices, making both approaches inadequate for comprehensively assessing agent capabilities. To bridge the gap between offline and online evaluation and enhance testing stability, this paper introduces a novel graph-structured benchmarking framework. By modeling the finite states observed during real-device interactions, it achieves static simulation of dynamic behaviors. Building on this, we develop ColorBench, a benchmark focused on complex long-horizon tasks. It supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis. ColorBench contains 175 tasks (74 single-app, 101 cross-app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths, enabling quasi-dynamic interaction. By evaluating ColorBench across various baselines, we discover limitations of existing models and propose improvement directions and feasible technical pathways to enhance agents’ performance on complex, long-horizon problems based on experimental results. Code and data are available at: https://github.com/MadeAgents/ColorBench.

[388] Beyond Hallucinations: The Illusion of Understanding in Large Language Models

Rikard Rosenbacke, Carl Rosenbacke, Victor Rosenbacke, Martin McKee

Main category: cs.AI

TL;DR: The paper introduces Rose-Frame, a three-dimensional framework for diagnosing cognitive and epistemic drift in human-AI interaction, addressing LLMs’ tendency toward hallucination and lack of grounded reasoning.

Details

Motivation: LLMs inherit ambiguity, bias, and lack direct access to truth from language, creating risks of hallucination where outputs sound convincing but lack factual validity, despite being fluent and emotionally resonant.

Method: Rose-Frame framework with three axes: (i) Map vs. Territory (epistemology vs ontology), (ii) Intuition vs. Reason (dual-process theory), and (iii) Conflict vs. Confirmation (critical testing vs mutual validation).

Result: The framework enables diagnosis of cognitive and epistemic drift in human-AI interaction, making both model limitations and user assumptions visible for more transparent and critically aware AI deployment.

Conclusion: Alignment should be reframed as cognitive governance where intuition (human or artificial) must remain governed by human reason through reflective, falsifiable oversight to align machine fluency with human understanding.

Abstract: Large language models (LLMs) are becoming deeply embedded in human communication and decision-making, yet they inherit the ambiguity, bias, and lack of direct access to truth inherent in language itself. While their outputs are fluent, emotionally resonant, and coherent, they are generated through statistical prediction rather than grounded reasoning. This creates the risk of hallucination, responses that sound convincing but lack factual validity. Building on Geoffrey Hinton’s observation that AI mirrors human intuition rather than reasoning, this paper argues that LLMs operationalize System 1 cognition at scale: fast, associative, and persuasive, but without reflection or falsification. To address this, we introduce the Rose-Frame, a three-dimensional framework for diagnosing cognitive and epistemic drift in human-AI interaction. The three axes are: (i) Map vs. Territory, which distinguishes representations of reality (epistemology) from reality itself (ontology); (ii) Intuition vs. Reason, drawing on dual-process theory to separate fast, emotional judgments from slow, reflective thinking; and (iii) Conflict vs. Confirmation, which examines whether ideas are critically tested through disagreement or simply reinforced through mutual validation. Each dimension captures a distinct failure mode, and their combination amplifies misalignment. Rose-Frame does not attempt to fix LLMs with more data or rules. Instead, it offers a reflective tool that makes both the model’s limitations and the user’s assumptions visible, enabling more transparent and critically aware AI deployment. It reframes alignment as cognitive governance: intuition, whether human or artificial, must remain governed by human reason. Only by embedding reflective, falsifiable oversight can we align machine fluency with human understanding.

[389] Machine Learning and Public Health: Identifying and Mitigating Algorithmic Bias through a Systematic Review

Sara Altamirano, Arjan Vreeken, Sennay Ghebreab

Main category: cs.AI

TL;DR: Systematic review of algorithmic bias in Dutch public health ML research (2021-2025) reveals significant gaps in fairness considerations, leading to development of RABAT assessment tool and ACAR framework to address bias across ML lifecycle.

Details

Motivation: Machine learning promises to revolutionize public health but may inadvertently reinforce existing health disparities without systematic attention to algorithmic bias, requiring assessment of current practices and development of mitigation frameworks.

Method: Developed Risk of Algorithmic Bias Assessment Tool (RABAT) by integrating established frameworks (Cochrane Risk of Bias, PROBAST, Microsoft Responsible AI checklist) and applied it to 35 peer-reviewed Dutch public health ML studies from 2021-2025.

Result: Analysis revealed pervasive gaps: most studies omit explicit fairness framing, subgroup analyses, and transparent discussion of potential harms, although data sampling and missing data practices are well documented.

Conclusion: Introduced ACAR framework (Awareness, Conceptualization, Application, Reporting) with actionable recommendations to help researchers address fairness across ML lifecycle and ensure algorithmic innovations advance health equity rather than undermine it.

Abstract: Machine learning (ML) promises to revolutionize public health through improved surveillance, risk stratification, and resource allocation. However, without systematic attention to algorithmic bias, ML may inadvertently reinforce existing health disparities. We present a systematic literature review of algorithmic bias identification, discussion, and reporting in Dutch public health ML research from 2021 to 2025. To this end, we developed the Risk of Algorithmic Bias Assessment Tool (RABAT) by integrating elements from established frameworks (Cochrane Risk of Bias, PROBAST, Microsoft Responsible AI checklist) and applied it to 35 peer-reviewed studies. Our analysis reveals pervasive gaps: although data sampling and missing data practices are well documented, most studies omit explicit fairness framing, subgroup analyses, and transparent discussion of potential harms. In response, we introduce a four-stage fairness-oriented framework called ACAR (Awareness, Conceptualization, Application, Reporting), with guiding questions derived from our systematic literature review to help researchers address fairness across the ML lifecycle. We conclude with actionable recommendations for public health ML practitioners to consistently consider algorithmic bias and foster transparency, ensuring that algorithmic innovations advance health equity rather than undermine it.

[390] TITAN: Graph-Executable Reasoning for Cyber Threat Intelligence

Marco Simoni, Aleksandar Fontana, Andrea Saracino, Paolo Mori

Main category: cs.AI

TL;DR: TITAN is a framework that connects natural-language cyber threat queries with executable reasoning over a structured knowledge graph derived from MITRE, using a path planner model and graph executor to retrieve factual answers with supporting evidence.

Details

Motivation: Traditional retrieval systems lack the ability to perform clear and reversible reasoning between threats, behaviors, and defenses in cybersecurity threat intelligence.

Method: Integrates a path planner model that predicts logical relation chains from text and a graph executor that traverses the TITAN Ontology (a typed, bidirectional graph from MITRE) to retrieve answers with evidence.

Result: Empirical evaluations show TITAN enables models to generate syntactically valid and semantically coherent reasoning paths that can be deterministically executed on the underlying graph. The TITAN Dataset contains 88,209 examples for training and testing.

Conclusion: TITAN provides an effective framework for automated cyber threat intelligence reasoning that bridges natural language queries with structured knowledge graph execution.

Abstract: TITAN (Threat Intelligence Through Automated Navigation) is a framework that connects natural-language cyber threat queries with executable reasoning over a structured knowledge graph. It integrates a path planner model, which predicts logical relation chains from text, and a graph executor that traverses the TITAN Ontology to retrieve factual answers and supporting evidence. Unlike traditional retrieval systems, TITAN operates on a typed, bidirectional graph derived from MITRE, allowing reasoning to move clearly and reversibly between threats, behaviors, and defenses. To support training and evaluation, we introduce the TITAN Dataset, a corpus of 88209 examples (Train: 74258; Test: 13951) pairing natural language questions with executable reasoning paths and step by step Chain of Thought explanations. Empirical evaluations show that TITAN enables models to generate syntactically valid and semantically coherent reasoning paths that can be deterministically executed on the underlying graph.

[391] NAEL: Non-Anthropocentric Ethical Logic

Bianca Maria Lerma, Rafael Peñaloza

Main category: cs.AI

TL;DR: NAEL is a non-anthropocentric ethical framework for AI agents that combines active inference and symbolic reasoning to enable emergent ethical behavior through free energy minimization in multi-agent environments.

Details

Motivation: To address limitations of human-centered AI ethics by creating a framework that allows agents to develop context-sensitive, adaptive ethical behavior without relying on anthropomorphic moral intuitions.

Method: Proposes a neuro-symbolic architecture combining active inference (minimizing global expected free energy) with symbolic reasoning for ethical evaluation in uncertain multi-agent environments.

Result: The framework enables dynamic balancing of self-preservation, epistemic learning, and collective welfare, as demonstrated in a case study on ethical resource distribution.

Conclusion: NAEL provides a novel approach to AI ethics that moves beyond human-centric models, allowing for emergent, relational ethical behavior in artificial agents through formal computational principles.

Abstract: We introduce NAEL (Non-Anthropocentric Ethical Logic), a novel ethical framework for artificial agents grounded in active inference and symbolic reasoning. Departing from conventional, human-centred approaches to AI ethics, NAEL formalizes ethical behaviour as an emergent property of intelligent systems minimizing global expected free energy in dynamic, multi-agent environments. We propose a neuro-symbolic architecture to allow agents to evaluate the ethical consequences of their actions in uncertain settings. The proposed system addresses the limitations of existing ethical models by allowing agents to develop context-sensitive, adaptive, and relational ethical behaviour without presupposing anthropomorphic moral intuitions. A case study involving ethical resource distribution illustrates NAEL’s dynamic balancing of self-preservation, epistemic learning, and collective welfare.

[392] Practical, Utilitarian Algorithm Configuration

Devon Graham, Kevin Leyton-Brown

Main category: cs.AI

TL;DR: COUP is improved to make utilitarian algorithm configuration practical and competitive with heuristic methods while maintaining theoretical guarantees.

Details

Motivation: To bridge the gap between theoretical guarantees and practical performance in utilitarian algorithm configuration, making it competitive with heuristic approaches.

Method: A series of improvements to the COUP procedure that enhance empirical performance without compromising theoretical guarantees.

Result: The improved COUP achieves competitive performance with widely used heuristic configuration procedures while maintaining strong theoretical guarantees.

Conclusion: Utilitarian algorithm configuration can be made practical and competitive, and solution robustness to utility function variations can be explored.

Abstract: Utilitarian algorithm configuration identifies a parameter setting for a given algorithm that maximizes a user’s utility. Utility functions offer a theoretically well-grounded approach to optimizing decision-making under uncertainty and are flexible enough to capture a user’s preferences over algorithm runtimes (e.g., they can describe a sharp cutoff after which a solution is no longer required, a per-hour cost for compute, or diminishing returns from algorithms that take longer to run). COUP is a recently-introduced utilitarian algorithm configuration procedure which was designed mainly to offer strong theoretical guarantees about the quality of the configuration it returns, with less attention paid to its practical performance. This paper closes that gap, bringing theoretically-grounded, utilitarian algorithm configuration to the point where it is competitive with widely used, heuristic configuration procedures that offer no performance guarantees. We present a series of improvements to COUP that improve its empirical performance without degrading its theoretical guarantees and demonstrate their benefit experimentally. Using a case study, we also illustrate ways of exploring the robustness of a given solution to the algorithm selection problem to variations in the utility function.

[393] Purifying Task Vectors in Knowledge-Aware Subspace for Model Merging

Bang An, Yibo Yang, Philip Torr, Bernard Ghanem

Main category: cs.AI

TL;DR: PAVE is a plug-and-play method that purifies task vectors by removing redundant components in knowledge-aware subspaces to improve model merging performance without extra training.

Details

Motivation: Existing model merging methods suffer from performance degradation due to task-irrelevant redundancy in task vectors, and current redundancy removal approaches involve randomness and lack knowledge awareness.

Method: Uses context-oriented singular value decomposition on covariance matrices from task training examples to identify and prune redundant components in task vectors, with spectral rank allocation for fair pruning across models.

Result: PAVE effectively improves performance across various task vector-based merging methods, tasks, and model architectures.

Conclusion: The proposed PAVE method successfully addresses redundancy issues in task vectors and serves as an effective plug-and-play enhancement for model merging techniques.

Abstract: Model merging aims to integrate task-specific abilities from individually fine-tuned models into a single model without extra training. In recent model merging methods, task vector has become a fundamental building block, as it can encapsulate the residual information from finetuning. However, the merged model often suffers from notable performance degradation due to the conflicts caused by task-irrelevant redundancy in task vectors. Existing efforts in overcoming redundancy by randomly dropping elements in the parameter space involves randomness and lacks knowledge awareness. To address these challenges, in this study, we propose Purifying TAsk Vectors (PAVE) in knowledge-aware subspace. Concretely, we sample some training examples from each task, and feed them into their corresponding fine-tuned models to acquire the covariance matrices before linear layers. We then perform a context-oriented singular value decomposition, which accentuates the weight components most relevant to the target knowledge. As a result, we can split fine-tuned model weights into task-relevant and redundant components in the knowledge-aware subspace, and purify the task vector by pruning the redundant components. To induce fair pruning efforts across models, we further introduce a spectral rank allocation strategy by optimizing a normalized activated pruning error. The task vector purification by our method as a plug-and-play scheme is applicable across various task vector-based merging methods to improve their performance. In experiments, we demonstrate the effectiveness of PAVE across a diverse set of merging methods, tasks, and model architectures.

[394] Cognitive-Aligned Spatio-Temporal Large Language Models For Next Point-of-Interest Prediction

Penglong Zhai, Jie Li, Fanyi Di, Yue Liu, Yifang Yuan, Jie Huang, Peng Wu, Sicong Wang, Mingyang Yin, Tingting Hu, Yao Xu, Xin Li

Main category: cs.AI

TL;DR: CoAST is a framework that enhances LLMs for next POI recommendation by incorporating geographical understanding, mobility patterns, and cognitive factors like seasons, weather, and user profiles through continued pretraining and cognitive alignment.

Details

Motivation: Current LLMs lack understanding of structured geographical entities and sequential mobility patterns needed for POI prediction, and fail to incorporate world knowledge and human cognitive factors that could improve recommendation performance and user experience.

Method: Two-stage approach: (1) Recommendation Knowledge Acquisition through continued pretraining on enriched spatial-temporal trajectory data, (2) Cognitive Alignment using Supervised Fine-Tuning and Reinforcement Learning to align with human preferences.

Result: Extensive offline experiments on real-world datasets and online deployment in AMAP App’s “Guess Where You Go” feature demonstrate the framework’s effectiveness.

Conclusion: CoAST successfully addresses LLMs’ limitations in POI recommendation by incorporating spatial-temporal understanding and cognitive alignment, showing practical value in real-world applications.

Abstract: The next point-of-interest (POI) recommendation task aims to predict the users’ immediate next destinations based on their preferences and historical check-ins, holding significant value in location-based services. Recently, large language models (LLMs) have shown great potential in recommender systems, which treat the next POI prediction in a generative manner. However, these LLMs, pretrained primarily on vast corpora of unstructured text, lack the native understanding of structured geographical entities and sequential mobility patterns required for next POI prediction tasks. Moreover, in industrial-scale POI prediction applications, incorporating world knowledge and alignment of human cognition, such as seasons, weather conditions, holidays, and users’ profiles (such as habits, occupation, and preferences), can enhance the user experience while improving recommendation performance. To address these issues, we propose CoAST (Cognitive-Aligned Spatial-Temporal LLMs), a framework employing natural language as an interface, allowing for the incorporation of world knowledge, spatio-temporal trajectory patterns, profiles, and situational information. Specifically, CoAST mainly comprises of 2 stages: (1) Recommendation Knowledge Acquisition through continued pretraining on the enriched spatial-temporal trajectory data of the desensitized users; (2) Cognitive Alignment to align cognitive judgments with human preferences using enriched training data through Supervised Fine-Tuning (SFT) and a subsequent Reinforcement Learning (RL) phase. Extensive offline experiments on various real-world datasets and online experiments deployed in “Guess Where You Go” of AMAP App homepage demonstrate the effectiveness of CoAST.

[395] ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

Jianghao Lin, Yuanyuan Shi, Xin Peng, Renjie Ding, Hairui Wang, Yuxuan Peng, Bizhe Bai, Weixi Song, Fengshuo Bai, Huacan Chai, Weinan Zhang, Fei Huang, Ying Wen

Main category: cs.AI

TL;DR: Proposes ToolPRM, an inference scaling framework with fine-grained beam search and process reward model for improving LLM function calling performance through step-level supervision.

Details

Motivation: Current inference scaling research focuses on unstructured outputs, leaving structured outputs like function calling underexplored. Need to bridge this gap for better autonomous agent capabilities.

Method: Combines fine-grained beam search with ToolPRM process reward model that scores internal steps of function calls. Uses automatically annotated intra-call process supervision dataset created with function-masking techniques.

Result: ToolPRM beats coarse-grained and outcome reward models in predictive accuracy. Inference scaling with ToolPRM significantly improves backbone model performance across various function calling tasks and benchmarks.

Conclusion: Reveals key principle for applying inference scaling to structured outputs: “explore more but retain less” due to unrecoverability characteristics of structured function calling generation.

Abstract: Large language models (LLMs) are increasingly demonstrating strong capabilities as autonomous agents, with function calling serving as a core mechanism for interaction with the environment. Meanwhile, inference scaling has become a cutting-edge technique to enhance LLM performance by allocating more computational resources during the inference process. However, current research on inference scaling primarily focuses on unstructured output generation tasks, leaving its application in structured outputs, like function calling, largely underexplored. To bridge this gap, we propose an inference scaling framework that combines fine-grained beam search with a process reward model, ToolPRM, which scores the internal steps of each single function call. To train ToolPRM, we construct the first fine-grained intra-call process supervision dataset, automatically annotated with function-masking techniques to provide step-level rewards for structured tool-use reasoning. Extensive experiments demonstrate that ToolPRM beats the coarse-grained and outcome reward models in terms of predictive accuracy, indicating its stronger capability in supervising the function calling inference process. Inference scaling technique equipped with ToolPRM also significantly improves the backbone model performance across various function calling tasks and benchmarks. More importantly, we reveal a key principle for applying inference scaling techniques to structured outputs: “explore more but retain less” due to the unrecoverability characteristics of structured function calling generation.

[396] SimKO: Simple Pass@K Policy Optimization

Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen

Main category: cs.AI

TL;DR: SimKO addresses RLVR’s exploitation bias by mitigating token-level probability concentration, improving pass@K performance through asymmetric probability adjustments.

Details

Motivation: RLVR methods show systematic bias toward exploitation over exploration, evidenced by improved pass@1 but reduced pass@K performance due to probability concentration effects.

Method: SimKO uses asymmetric probability adjustments: boosts top-K candidates for correct responses and penalizes top-1 candidate for incorrect responses, especially at high-entropy tokens.

Result: Across math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, improving RLVR’s exploration capabilities.

Conclusion: SimKO effectively mitigates RLVR’s over-concentration issue and encourages better exploration while maintaining performance.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR’s exploration.

[397] Agentic NL2SQL to Reduce Computational Costs

Dominik Jehle, Lennart Purucker, Frank Hutter

Main category: cs.AI

TL;DR: Datalake Agent is an agentic system that uses an interactive loop with LLMs to reduce token usage by up to 87% for NL2SQL tasks, maintaining competitive performance while lowering costs.

Details

Motivation: Traditional NL2SQL methods using LLMs require processing large amounts of database meta-information, resulting in lengthy prompts with high token counts and processing costs.

Method: Instead of direct solvers that call LLMs once with all meta-information, Datalake Agent employs an interactive loop where LLMs selectively request only necessary information for table question answering tasks.

Result: Evaluation on 23 databases with 100 table question answering tasks shows token reduction up to 87% with substantial cost reductions while maintaining competitive performance.

Conclusion: The Datalake Agent provides an efficient approach to NL2SQL by reducing token usage through selective information retrieval in an interactive loop, enabling cost-effective LLM deployment.

Abstract: Translating natural language queries into SQL queries (NL2SQL or Text-to-SQL) has recently been empowered by large language models (LLMs). Using LLMs to perform NL2SQL methods on a large collection of SQL databases necessitates processing large quantities of meta-information about the databases, which in turn results in lengthy prompts with many tokens and high processing costs. To address this challenge, we introduce Datalake Agent, an agentic system designed to enable an LLM to solve NL2SQL tasks more efficiently. Instead of utilizing direct solvers for NL2SQL that call the LLM once with all meta-information in the prompt, the Datalake Agent employs an interactive loop to reduce the utilized meta-information. Within the loop, the LLM is used in a reasoning framework that selectively requests only the necessary information to solve a table question answering task. We evaluate the Datalake Agent on a collection of 23 databases with 100 table question answering tasks. The Datalake Agent reduces the tokens used by the LLM by up to 87% and thus allows for substantial cost reductions while maintaining competitive performance.

[398] RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning

Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li

Main category: cs.AI

TL;DR: RoboGPT-R1 is a two-stage fine-tuning framework that combines supervised training with reinforcement learning to improve embodied agents’ reasoning for long-horizon manipulation tasks, achieving state-of-the-art performance on EmbodiedBench.

Details

Motivation: Current vision language models struggle with long-horizon manipulation tasks due to limited common sense and reasoning capabilities, and supervised fine-tuning alone suffers from poor generalization and insufficient physical understanding.

Method: A two-stage framework: supervised training acquires foundational knowledge from expert sequences, followed by RL to address visual-spatial understanding and reasoning shortcomings. Uses rule-based reward function considering long-horizon performance and action constraints.

Result: Significantly outperforms GPT-4o-mini by 21.33% and other work trained on Qwen2.5-VL-7B by 20.33% on EmbodiedBench benchmark.

Conclusion: The proposed two-stage fine-tuning approach effectively enhances embodied agents’ reasoning capabilities for complex manipulation tasks in real-world environments.

Abstract: Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue facing challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model’s shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraint in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.

[399] Boosting Instruction Following at Scale

Ben Elder, Evelyn Duesterwald, Vinod Muthusamy

Main category: cs.AI

TL;DR: Instruction Boosting is a post-generation method that improves LLM instruction following by up to 7 points for two instructions and 4 points for ten instructions, addressing the problem of performance degradation with more instructions.

Details

Motivation: Current approaches of adding more instructions to LLM prompts provide little assurance they will be followed, and performance typically degrades as more instructions are added due to tension and conflict between instructions.

Method: Introduced Instruction Boosting as a post-generation method to increase LLM instruction following reliability. Also created SCALEDIF benchmark with up to ten instructions per sample and developed a quantitative conflict scoring tool to analyze instruction conflicts.

Result: Instruction Boosting improves instruction following rate by up to 7 points for two instructions and up to 4 points for ten instructions. The conflict scoring tool successfully explains performance degradation trends with increasing instructions.

Conclusion: Instruction Boosting effectively addresses the challenge of LLM instruction following reliability, and the conflict scoring tool provides developers with feedback on how additional prompt instructions impact model performance.

Abstract: A typical approach developers follow to influence an LLM’s behavior in an application is through careful manipulation of the prompt, such as by adding or modifying instructions. However, merely adding more instructions provides little assurance that they will actually be followed. We introduce Instruction Boosting as a post-generation method to increase the reliability of LLM prompt instructions. We show that Instruction Boosting improves the instruction following rate by up to 7 points for two instructions and up to 4 points for ten instructions. To demonstrate these results we introduce SCALEDIF, a benchmark with a scaled instruction volume of up to ten instructions per data sample. We also present an analysis of the commonly observed trend that performance degrades as more instructions are added. We show that an important factor contributing to this trend is the degree of tension and conflict that arises as the number of instructions is increased. We contribute a quantitative conflict scoring tool that explains the observed performance trends and provides feedback to developers on the impact that additional prompt instructions have on a model’s performance.

[400] Where to Search: Measure the Prior-Structured Search Space of LLM Agents

Zhuo-Yang Song

Main category: cs.AI

TL;DR: A formal theory for measuring LLM-assisted iterative search with domain priors, using fuzzy relations and coverage generating functions to quantify reachability difficulty.

Details

Motivation: To systematically encode domain priors into structured hypothesis spaces for effective LLM-based iterative search in reasoning and program discovery.

Method: Represent agents as fuzzy relation operators constrained by safety envelopes, use continuation parameters to weight paths, and compute coverage generating functions to measure reachability difficulty.

Result: Developed a geometric interpretation of search on safety envelope graphs and validated testable inferences via majority-vote instantiation.

Conclusion: The theory provides operational tools and language to systematically measure agents and their search spaces in LLM-constructed iterative search.

Abstract: The generate-filter-refine (iterative paradigm) based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via a majority-vote instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.

[401] LabOS: The AI-XR Co-Scientist That Sees and Works With Humans

Le Cong, Zaixi Zhang, Xiaotong Wang, Yin Di, Ruofan Jin, Michal Gerasimiuk, Yinkai Wang, Ravi K. Dinesh, David Smerkous, Alex Smerkous, Xuekun Wu, Shilong Liu, Peishan Li, Yi Zhu, Simran Serrao, Ning Zhao, Imran A. Mohammad, John B. Sunwoo, Joseph C. Wu, Mengdi Wang

Main category: cs.AI

TL;DR: LabOS is the first AI co-scientist that integrates computational reasoning with physical experimentation through multimodal perception, self-evolving agents, and XR-enabled human-AI collaboration.

Details

Motivation: To advance science by enabling AI to move beyond computational design to active participation in laboratory experiments, creating an intelligent collaborative environment.

Method: Uses multimodal perception, self-evolving AI agents, and Extended Reality (XR) technology to connect computational reasoning with physical experimentation through smart glasses and human-AI collaboration.

Result: LabOS enables AI to see what scientists see, understand experimental context, and assist in real-time execution across applications including cancer immunotherapy target discovery and stem-cell engineering.

Conclusion: LabOS transforms laboratories into intelligent collaborative environments where human and machine discovery evolve together, demonstrating AI’s capability to participate directly in scientific experimentation.

Abstract: Modern science advances fastest when thought meets action. LabOS represents the first AI co-scientist that unites computational reasoning with physical experimentation through multimodal perception, self-evolving agents, and Entended-Reality(XR)-enabled human-AI collaboration. By connecting multi-model AI agents, smart glasses, and human-AI collaboration, LabOS allows AI to see what scientists see, understand experimental context, and assist in real-time execution. Across applications–from cancer immunotherapy target discovery to stem-cell engineering – LabOS shows that AI can move beyond computational design to participation, turning the laboratory into an intelligent, collaborative environment where human and machine discovery evolve together.

[402] The Gatekeeper Knows Enough

Fikresilase Wondmeneh Abebayew

Main category: cs.AI

TL;DR: The Gatekeeper Protocol addresses LLM limitations in autonomous agents by using a latent state approach and JSON-based interactions to improve reliability and efficiency.

Details

Motivation: LLMs face constraints from limited context windows and state desynchronization, leading to unreliable outputs and inefficient resource usage when interacting with complex knowledge systems.

Method: A domain-agnostic framework where agents first operate on minimalist latent state representations, then strategically request high-fidelity context through unified JSON format interactions.

Result: Significantly increases agent reliability, improves computational efficiency by minimizing token consumption, and enables scalable interaction with complex systems.

Conclusion: Creates a foundational methodology for building more robust, predictable, and grounded AI agents for any structured knowledge domain.

Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous agents, yet their practical utility is fundamentally constrained by a limited context window and state desynchronization resulting from the LLMs’ stateless nature and inefficient context management. These limitations lead to unreliable output, unpredictable behavior, and inefficient resource usage, particularly when interacting with large, structured, and sensitive knowledge systems such as codebases and documents. To address these challenges, we introduce the Gatekeeper Protocol, a novel, domain-agnostic framework that governs agent-system interactions. Our protocol mandates that the agent first operate and reason on a minimalist, low-fidelity “latent state” representation of the system to strategically request high-fidelity context on demand. All interactions are mediated through a unified JSON format that serves as a declarative, state-synchronized protocol, ensuring the agent’s model of the system remains verifiably grounded in the system’s reality. We demonstrate the efficacy of this protocol with Sage, a reference implementation of the Gatekeeper Protocol for software development. Our results show that this approach significantly increases agent reliability, improves computational efficiency by minimizing token consumption, and enables scalable interaction with complex systems, creating a foundational methodology for building more robust, predictable, and grounded AI agents for any structured knowledge domain.

[403] Mapping Smarter, Not Harder: A Test-Time Reinforcement Learning Agent That Improves Without Labels or Model Updates

Wen-Kwang Tsao, Yao-Ching Yu, Chien-Ming Huang

Main category: cs.AI

TL;DR: A reinforcement learning agent that self-improves schema mapping for enterprise logs without labeled data or model updates, using web searches and confidence-based rewards to achieve 93.94% accuracy.

Details

Motivation: Enterprise systems need to integrate logs from multiple vendors, but vendor documentation is often unavailable, misplaced, or incomplete, making schema mapping challenging.

Method: Reinforcement learning agent that: 1) identifies ambiguous field mappings, 2) generates targeted web-search queries for external evidence, 3) applies confidence-based rewards to iteratively refine mappings without labeled examples or model weight updates.

Result: Increased mapping accuracy from 56.4% (LLM-only) to 72.73% (RAG) to 93.94% over 100 iterations using GPT-4o. Reduced low-confidence mappings requiring expert review by 85%.

Conclusion: Provides an evidence-driven, transparent method for solving industry schema mapping problems, enabling more robust, accountable, scalable, efficient, flexible, adaptable, and collaborative solutions.

Abstract: The Enterprise Intelligence Platform must integrate logs from numerous third-party vendors in order to perform various downstream tasks. However, vendor documentation is often unavailable at test time. It is either misplaced, mismatched, poorly formatted, or incomplete, which makes schema mapping challenging. We introduce a reinforcement learning agent that can self-improve without labeled examples or model weight updates. During inference, the agent:

Identifies ambiguous field-mapping attempts. 2) Generates targeted web-search queries to gather external evidence. 3) Applies a confidence-based reward to iteratively refine its mappings. To demonstrate this concept, we converted Microsoft Defender for Endpoint logs into a common schema. Our method increased mapping accuracy from 56.4%(LLM-only) to 72.73%(RAG) to 93.94% over 100 iterations using GPT-4o. At the same time, it reduced the number of low-confidence mappings requiring expert review by 85%. This new approach provides an evidence-driven, transparent method for solving future industry problems, paving the way for more robust, accountable, scalable, efficient, flexible, adaptable, and collaborative solutions.

[404] Budget-aware Test-time Scaling via Discriminative Verification

Kyle Montgomery, Sijun Tan, Yuqi Chen, Siyuan Zhuang, Tianjun Zhang, Raluca Ada Popa, Chenguang Wang

Main category: cs.AI

TL;DR: Discriminative verification combined with self-consistency outperforms costly generative verification methods under fixed compute budgets, achieving up to 15.3% higher accuracy on AIME2025.

Details

Motivation: Current state-of-the-art approaches use generative verifiers which incur prohibitive computational costs, limiting practical deployment of test-time scaling for large language models.

Method: Proposes a hybrid approach combining discriminative verifiers with self-consistency, shifting focus to budget-aware discriminative verification instead of expensive generative verification.

Result: Under fixed compute budget, the hybrid discriminative verification approach achieves up to 15.3% higher accuracy on AIME2025 compared to state-of-the-art generative verification methods.

Conclusion: Budget-aware scaling with discriminative verifiers is not only a free upgrade over self-consistency but also a more effective and efficient alternative to costly generative techniques for practical applications.

Abstract: Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. While state-of-the-art approaches often employ generative verifiers to select the best solution from a pool of candidates, this method incurs prohibitive computational costs, limiting its practicality. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. We conduct a thorough empirical analysis and demonstrate that while discriminative verifiers may underperform in isolation, combining them with self-consistency in a hybrid approach creates a powerful and efficient test-time scaling mechanism. Notably, under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin: achieving up to 15.3% higher accuracy on AIME2025. Our findings establish that for practical, real-world applications, budget-aware scaling with discriminative verifiers is not only a “free” upgrade over self-consistency, but also a more effective and efficient alternative to costly generative techniques. Code is available at https://github.com/wang-research-lab/verification.

[405] Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models

Akira Okutomi

Main category: cs.AI

TL;DR: This paper reinterprets Kant’s Critique of Pure Reason through feedback stability theory, proposing a composite instability index (H-Risk) that predicts overconfident errors in reasoning systems, including large language models.

Details

Motivation: To bridge Kant's concept of reason as self-limiting with modern feedback control theory, providing a principled framework for diagnosing and reducing overconfidence in reasoning systems.

Method: Developed a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. Applied this to linear-Gaussian simulations and extended analysis to large language models.

Result: Higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. In LLMs, fragile internal dynamics correlate with miscalibration and hallucination, while critique-style prompts show mixed effects.

Conclusion: The study establishes a structural bridge between Kantian self-limitation and feedback control, offering a diagnostic lens for identifying and selectively reducing overconfidence in reasoning systems.

Abstract: We reinterpret Kant’s Critique of Pure Reason as a theory of feedback stability, viewing reason as a regulator that keeps inference within the bounds of possible experience. We formalize this intuition via a composite instability index (H-Risk) combining spectral margin, conditioning, temporal sensitivity, and innovation amplification. In linear-Gaussian simulations, higher H-Risk predicts overconfident errors even under formal stability, revealing a gap between nominal and epistemic stability. Extending to large language models (LLMs), we find that fragile internal dynamics correlate with miscalibration and hallucination, while critique-style prompts show mixed effects on calibration and hallucination. These results suggest a structural bridge between Kantian self-limitation and feedback control, offering a principled lens for diagnosing – and selectively reducing – overconfidence in reasoning systems. This is a preliminary version; supplementary experiments and broader replication will be reported in a future revision.

[406] GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Jingpei Wu, Guohao Li, Zhen Han, Volker Tresp

Main category: cs.AI

TL;DR: GroundedPRM is a framework that improves process reward models for multi-step reasoning in LLMs by using Monte Carlo Tree Search for structured reasoning paths and external tools for step validation, achieving better performance with less data than existing methods.

Details

Motivation: Existing PRMs face challenges with noisy rewards, low factual fidelity, and misalignment due to reliance on costly human labeling, hallucination-prone LLM self-evaluation, or Monte Carlo estimation that causes credit misattribution.

Method: Uses Monte Carlo Tree Search to construct structured reasoning paths, validates intermediate steps with external tools for execution-grounded correctness, employs hybrid reward aggregation combining tool verification and MCTS feedback, and formats rewards into rationale-enhanced generative structures.

Result: Achieves up to 26% relative improvement on ProcessBench with only 40K automatically labeled samples (10% of data used by best-performing auto-labeled PRM), and outperforms human-labeled PRMs when used for reward-guided greedy search.

Conclusion: GroundedPRM provides a scalable and verifiable path toward high-quality process-level reasoning by addressing key limitations of existing PRMs through tree-guided, fidelity-aware automatic process supervision.

Abstract: Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.

[407] Agentic Design of Compositional Machines

Wenqian Zhang, Weiyang Liu, Zhen Liu

Main category: cs.AI

TL;DR: LLMs can learn machine design through compositional assembly in simulated physics environments, with reinforcement learning improving their spatial reasoning and strategic assembly capabilities.

Details

Motivation: To explore whether large language models can learn to create complex machines, particularly through compositional design using standardized components in physical simulations.

Method: Introduces BesiegeField testbed built on Besiege game for part-based construction and physical simulation. Benchmarks LLMs with agentic workflows and uses reinforcement learning finetuning with curated datasets.

Result: Current open-source LLMs fall short in machine design tasks, requiring improved spatial reasoning, strategic assembly, and instruction-following capabilities. RL finetuning shows promise for enhancement.

Conclusion: Machine design represents a challenging frontier for LLMs that requires advances at the intersection of language, physical reasoning, and strategic assembly, with RL offering a path toward improvement.

Abstract: The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.

Haiyang Li, Yaxiong Wang, Shengeng Tang, Lianwei Wu, Lechao Cheng, Zhun Zhong

Main category: cs.AI

TL;DR: This paper addresses the gap in multimodal fake content detection by proposing a unified framework that handles both human-crafted misinformation and AI-generated content, which are typically studied separately.

Details

Motivation: Existing detection systems are specialized for either human-written misinformation or AI-generated content, but real-world scenarios involve unknown types of deceptive content, limiting their effectiveness.

Method: Proposed UMFDet framework with VLM backbone, Category-aware Mixture-of-Experts Adapter for category-specific cues, and attribution chain-of-thought mechanism for implicit reasoning guidance.

Result: UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines on the comprehensive OmniFake dataset of 127K samples.

Conclusion: The unified approach offers a practical solution for real-world multimodal deception detection by effectively handling both human-crafted and AI-generated fake content.

Abstract: In recent years, detecting fake multimodal content on social media has drawn increasing attention. Two major forms of deception dominate: human-crafted misinformation (e.g., rumors and misleading posts) and AI-generated content produced by image synthesis models or vision-language models (VLMs). Although both share deceptive intent, they are typically studied in isolation. NLP research focuses on human-written misinformation, while the CV community targets AI-generated artifacts. As a result, existing models are often specialized for only one type of fake content. In real-world scenarios, however, the type of a multimodal post is usually unknown, limiting the effectiveness of such specialized systems. To bridge this gap, we construct the Omnibus Dataset for Multimodal News Deception (OmniFake), a comprehensive benchmark of 127K samples that integrates human-curated misinformation from existing resources with newly synthesized AI-generated examples. Based on this dataset, we propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet leverages a VLM backbone augmented with a Category-aware Mixture-of-Experts (MoE) Adapter to capture category-specific cues, and an attribution chain-of-thought mechanism that provides implicit reasoning guidance for locating salient deceptive signals. Extensive experiments demonstrate that UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines and offering a practical solution for real-world multimodal deception detection.

[409] Generative AI Meets Future Cities: Towards an Era of Autonomous Urban Intelligence

Dongjie Wang, Chang-Tien Lu, Xinyue Ye, Tan Yigitcanlar, Yanjie Fu

Main category: cs.AI

TL;DR: The paper explores the intersection of urban planning and AI, showing how machine learning techniques can address key urban planning challenges like automated land-use configuration.

Details

Motivation: To bridge the gap between urban planning and AI by demonstrating how AI advances can contribute to solving modern urban planning problems from sustainability, economic, environmental, and disaster perspectives.

Method: Relates fundamental urban planning concepts to machine learning problems including adversarial learning, generative neural networks, deep encoder-decoder networks, conversational AI, and geospatial/temporal machine learning.

Result: Formulates automated land-use configuration as generating land uses and building configurations using surrounding geospatial, mobility, social media, environment, and economic data.

Conclusion: Proposes key research areas at the intersection of AI and urban planning, highlighting the potential for AI to significantly contribute to modern urban planning practices.

Abstract: The two fields of urban planning and artificial intelligence (AI) arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In the present paper, we introduce the importance of urban planning from the sustainability, living, economic, disaster, and environmental perspectives. We review the fundamental concepts of urban planning and relate these concepts to crucial open problems of machine learning, including adversarial learning, generative neural networks, deep encoder-decoder networks, conversational AI, and geospatial and temporal machine learning, thereby assaying how AI can contribute to modern urban planning. Thus, a central problem is automated land-use configuration, which is formulated as the generation of land uses and building configuration for a target area from surrounding geospatial, human mobility, social media, environment, and economic activities. Finally, we delineate some implications of AI for urban planning and propose key research areas at the intersection of both topics.

[410] Domain-Independent Dynamic Programming

Ryo Kuroiwa, J. Christopher Beck

Main category: cs.AI

TL;DR: The paper proposes Domain-Independent Dynamic Programming (DIDP) as a new model-based paradigm for combinatorial optimization, introducing DyPDL formalism and showing DIDP outperforms MIP and CP solvers on most benchmark problems.

Details

Motivation: To create a declarative problem-solving paradigm that decouples modeling from solving, similar to MIP and CP, but based on dynamic programming principles to achieve better performance.

Method: Introduces Dynamic Programming Description Language (DyPDL) based on state transition systems inspired by AI planning, and develops seven DIDP solvers using heuristic search algorithms to solve DyPDL models.

Result: Experimental comparison on 11 combinatorial optimization problem classes shows DIDP outperforms MIP in 9 classes, CP in 9 classes, and both MIP and CP in 7 classes, achieving superior performance to existing state-based solvers.

Conclusion: DIDP represents an effective new paradigm for combinatorial optimization that demonstrates competitive advantages over traditional MIP and CP approaches, as well as existing AI planning methods.

Abstract: For combinatorial optimization problems, model-based paradigms such as mixed-integer programming (MIP) and constraint programming (CP) aim to decouple modeling and solving a problem: the `holy grail’ of declarative problem solving. We propose domain-independent dynamic programming (DIDP), a novel model-based paradigm based on dynamic programming (DP). While DP is not new, it has typically been implemented as a problem-specific method. We introduce Dynamic Programming Description Language (DyPDL), a formalism to define DP models based on a state transition system, inspired by artificial intelligence (AI) planning. we show that heuristic search algorithms can be used to solve DyPDL models and propose seven DIDP solvers. We experimentally compare our DIDP solvers with commercial MIP and CP solvers (solving MIP and CP models, respectively) on common benchmark instances of eleven combinatorial optimization problem classes. We show that DIDP outperforms MIP in nine problem classes, CP also in nine problem classes, and both MIP and CP in seven. DIDP also achieves superior performance to existing state-based solvers including domain-independent AI planners.

[411] TriQXNet: Forecasting Dst Index from Solar Wind Data Using an Interpretable Parallel Classical-Quantum Framework with Uncertainty Quantification

Md Abrar Jahin, M. F. Mridha, Zeyar Aung, Nilanjan Dey, R. Simon Sherratt

Main category: cs.AI

TL;DR: TriQXNet is a hybrid classical-quantum neural network that outperforms 13 state-of-the-art models in geomagnetic storm forecasting, achieving 9.27 nT RMSE with uncertainty quantification and explainable AI.

Details

Motivation: Geomagnetic storms disrupt critical infrastructure but existing forecasting models struggle with accuracy due to noise and sensor failures, creating need for more reliable prediction methods.

Method: Developed TriQXNet hybrid classical-quantum neural network with comprehensive preprocessing pipeline, conformal prediction for uncertainty, and XAI methods like ShapTime for interpretability.

Result: Achieved superior performance with 9.27 nT RMSE, outperforming 13 state-of-the-art models with 95% confidence via 10-fold cross-validated paired t-tests.

Conclusion: TriQXNet sets new standards for geomagnetic storm prediction and demonstrates the potential of classical-quantum hybrid models in space weather forecasting.

Abstract: Geomagnetic storms, caused by solar wind energy transfer to Earth’s magnetic field, can disrupt critical infrastructure like GPS, satellite communications, and power grids. The disturbance storm-time (Dst) index measures storm intensity. Despite advancements in empirical, physics-based, and machine-learning models using real-time solar wind data, accurately forecasting extreme geomagnetic events remains challenging due to noise and sensor failures. This research introduces TriQXNet, a novel hybrid classical-quantum neural network for Dst forecasting. Our model integrates classical and quantum computing, conformal prediction, and explainable AI (XAI) within a hybrid architecture. To ensure high-quality input data, we developed a comprehensive preprocessing pipeline that included feature selection, normalization, aggregation, and imputation. TriQXNet processes preprocessed solar wind data from NASA’s ACE and NOAA’s DSCOVR satellites, predicting the Dst index for the current hour and the next, providing vital advance notice to mitigate geomagnetic storm impacts. TriQXNet outperforms 13 state-of-the-art hybrid deep-learning models, achieving a root mean squared error of 9.27 nanoteslas (nT). Rigorous evaluation through 10-fold cross-validated paired t-tests confirmed its superior performance with 95% confidence. Conformal prediction techniques provide quantifiable uncertainty, which is essential for operational decisions, while XAI methods like ShapTime enhance interpretability. Comparative analysis shows TriQXNet’s superior forecasting accuracy, setting a new level of expectations for geomagnetic storm prediction and highlighting the potential of classical-quantum hybrid models in space weather forecasting.

[412] DELE: Deductive $\mathcal{EL}^{++}$ Embeddings for Knowledge Base Completion

Olga Mashkova, Fernando Zhapa-Camacho, Robert Hoehndorf

Main category: cs.AI

TL;DR: The paper proposes improved ontology embedding methods for Description Logic EL++ that address limitations of existing approaches by incorporating deductive closure and distinguishing between unprovable and provably false statements.

Details

Motivation: Existing ontology embedding methods for EL++ have limitations: they don't distinguish between unprovable and provably false statements, and they don't utilize the deductive closure of ontologies to identify inferred but not asserted statements.

Method: Developed embedding methods with novel negative losses that account for deductive closure and different types of negatives, and formulated evaluation methods for knowledge base completion.

Result: The proposed embedding methods demonstrate improvement over baseline ontology embedding in the task of knowledge base or ontology completion.

Conclusion: Incorporating deductive closure and properly handling different types of negative statements leads to better ontology embedding performance for knowledge base completion tasks.

Abstract: Ontology embeddings map classes, roles, and individuals in ontologies into $\mathbb{R}^n$, and within $\mathbb{R}^n$ similarity between entities can be computed or new axioms inferred. For ontologies in the Description Logic $\mathcal{EL}^{++}$, several optimization-based embedding methods have been developed that explicitly generate models of an ontology. However, these methods suffer from some limitations; they do not distinguish between statements that are unprovable and provably false, and therefore they may use entailed statements as negatives. Furthermore, they do not utilize the deductive closure of an ontology to identify statements that are inferred but not asserted. We evaluated a set of embedding methods for $\mathcal{EL}^{++}$ ontologies, incorporating several modifications that aim to make use of the ontology deductive closure. In particular, we designed novel negative losses that account both for the deductive closure and different types of negatives and formulated evaluation methods for knowledge base completion. We demonstrate that our embedding methods improve over the baseline ontology embedding in the task of knowledge base or ontology completion.

[413] Robust Counterfactual Inference in Markov Decision Processes

Jessica Lally, Milad Kazemi, Nicola Paoletti

Main category: cs.AI

TL;DR: Proposes a non-parametric method to compute tight bounds on counterfactual transition probabilities in MDPs across all compatible causal models, enabling efficient computation and robust policy optimization.

Details

Motivation: Existing counterfactual inference methods for MDPs assume specific causal models, limiting validity since multiple causal models can align with observational data but yield different counterfactual distributions.

Method: Non-parametric approach using closed-form expressions to compute bounds on counterfactual transition probabilities across all compatible causal models, avoiding large optimization problems. Constructs interval counterfactual MDP and identifies robust policies optimizing worst-case reward.

Result: Method provides tight bounds efficiently without exponential growth in computation. Evaluations show improved robustness over existing methods in various case studies.

Conclusion: The approach enables scalable counterfactual inference in MDPs by considering all compatible causal models through interval probabilities, leading to more robust decision-making.

Abstract: This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.

[414] SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

Jianzhu Yao, Kevin Wang, Ryan Hsieh, Haisu Zhou, Tianqing Zou, Zerui Cheng, Zhangyang Wang, Pramod Viswanath

Main category: cs.AI

TL;DR: SPIN-Bench is a new multi-domain benchmark for evaluating strategic planning and social reasoning in AI agents, combining PDDL tasks, competitive board games, cooperative card games, and multi-agent negotiation scenarios.

Details

Motivation: To address the gap in evaluating sophisticated reasoning and strategic behavior in social interactions, which requires more complex capabilities than isolated planning or static reasoning tasks.

Method: Created a unified framework with systematic variation of action spaces, state complexity, and number of interacting agents across multiple domains including classical PDDL, competitive board games, cooperative card games, and multi-agent negotiation scenarios.

Result: Contemporary LLMs perform well on basic fact retrieval and short-range planning but show significant performance bottlenecks in tasks requiring deep multi-hop reasoning over large state spaces and socially adept coordination under uncertainty.

Conclusion: SPIN-Bench serves as a catalyst for future research on robust multi-agent planning, social reasoning, and human-AI teaming, highlighting the need for improved strategic reasoning capabilities in AI systems.

Abstract: Reasoning and strategic behavior in social interactions is a hallmark of intelligence. This form of reasoning is significantly more sophisticated than isolated planning or reasoning tasks in static settings (e.g., math problem solving). In this paper, we present Strategic Planning, Interaction, and Negotiation (SPIN-Bench), a new multi-domain evaluation designed to measure the intelligence of strategic planning and social reasoning. While many existing benchmarks focus on narrow planning or single-agent reasoning, SPIN-Bench combines classical PDDL tasks, competitive board games, cooperative card games, and multi-agent negotiation scenarios in one unified framework. The framework includes both a benchmark as well as an arena to simulate and evaluate the variety of social settings to test reasoning and strategic behavior of AI agents. We formulate the benchmark SPIN-Bench by systematically varying action spaces, state complexity, and the number of interacting agents to simulate a variety of social settings where success depends on not only methodical and step-wise decision making, but also conceptual inference of other (adversarial or cooperative) participants. Our experiments reveal that while contemporary LLMs handle basic fact retrieval and short-range planning reasonably well, they encounter significant performance bottlenecks in tasks requiring deep multi-hop reasoning over large state spaces and socially adept coordination under uncertainty. We envision SPIN-Bench as a catalyst for future research on robust multi-agent planning, social reasoning, and human–AI teaming. Project Website: https://spinbench.github.io/

[415] PoE-World: Compositional World Modeling with Products of Programmatic Experts

Wasu Top Piriyakulkij, Yichao Liang, Hao Tang, Adrian Weller, Marta Kryven, Kevin Ellis

Main category: cs.AI

TL;DR: A novel program synthesis method using LLMs to create world models as exponentially-weighted products of programmatic experts (PoE-World), enabling learning of complex stochastic world models from few observations and effective planning in non-gridworld domains like Atari games.

Details

Motivation: Traditional deep learning world models require large datasets and lack flexibility for sparse observations, while existing program-structured world models are limited to simple domains like natural language and grid-worlds.

Method: Represent world models as exponentially-weighted products of programmatic experts synthesized by LLMs, allowing modeling of complex non-gridworld domains from few observations.

Result: The approach successfully learns complex stochastic world models from few observations and enables efficient model-based planning, demonstrating generalization to unseen levels on Atari’s Pong and Montezuma’s Revenge.

Conclusion: PoE-World provides an effective framework for learning program-structured world models that generalize well from sparse data in complex environments, advancing beyond traditional deep learning approaches.

Abstract: Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari’s Pong and Montezuma’s Revenge. We release our code and display the learned world models and videos of the agent’s gameplay at https://topwasu.github.io/poe-world.

[416] MAFA: A multi-agent framework for annotation

Mahmood Hegazy, Aaron Rodrigues, Azzam Naeem

Main category: cs.AI

TL;DR: A multi-agent framework for FAQ annotation that combines specialized agents with different approaches and a judge agent for reranking, achieving significant improvements over single-agent methods in banking applications.

Details

Motivation: Traditional FAQ retrieval approaches relying on single models often fail to capture nuances of diverse user inquiries in banking applications, requiring more robust and accurate solutions.

Method: Multi-agent framework with specialized agents using structured reasoning inspired by ARQs, few-shot examples for ensemble diversity, and a judge agent for reranking candidates.

Result: 14% increase in Top-1 accuracy, 18% increase in Top-5 accuracy, and 12% improvement in Mean Reciprocal Rank on bank dataset, with similar gains on public benchmarks (LCQMC and FiQA).

Conclusion: The framework effectively handles ambiguous queries and shows strong generalization across domains and languages, making it suitable for production banking applications.

Abstract: Modern consumer banking applications require accurate and efficient retrieval of information in response to user queries. Mapping user utterances to the most relevant Frequently Asked Questions (FAQs) is a crucial component of these systems. Traditional approaches often rely on a single model or technique, which may not capture the nuances of diverse user inquiries. In this paper, we introduce a multi-agent framework for FAQ annotation that combines multiple specialized agents with different approaches and a judge agent that reranks candidates to produce optimal results. Our agents utilize a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs), which guides them through systematic reasoning steps using targeted, task-specific JSON queries. Our framework features a few-shot example strategy, where each agent receives different few-shots, enhancing ensemble diversity and coverage of the query space. We evaluate our framework on a real-world major bank dataset as well as public benchmark datasets (LCQMC and FiQA), demonstrating significant improvements over single-agent approaches across multiple metrics, including a 14% increase in Top-1 accuracy, an 18% increase in Top-5 accuracy, and a 12% improvement in Mean Reciprocal Rank on our dataset, and similar gains on public benchmarks when compared with traditional and single-agent annotation techniques. Our framework is particularly effective at handling ambiguous queries, making it well-suited for deployment in production banking applications while showing strong generalization capabilities across different domains and languages.

[417] EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM

Shuang Ao, Flora D. Salim, Simon Khan

Main category: cs.AI

TL;DR: EMAC+ is an embodied multimodal agent that integrates LLM and VLM through bidirectional training to address limitations in robotics control, achieving superior performance on ALFWorld and RT-1 benchmarks.

Details

Motivation: Current LLM-based agents have three key limitations for robotics: working mainly with text rather than visual inputs, treating LLMs as static planners separated from environment dynamics, and inability to learn from visual interactions for domain-specific improvements.

Method: EMAC+ integrates LLM and VLM via bidirectional training paradigm, where the LLM generates high-level textual plans and the VLM provides real-time visual feedback for dynamic plan refinement, enabling the LLM to internalize visual environment dynamics through interactive experience.

Result: Extensive experiments on ALFWorld and RT-1 benchmarks show EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning compared to existing methods.

Conclusion: The bidirectional integration of LLM and VLM in EMAC+ successfully addresses key limitations of previous models by enabling dynamic plan refinement and visual learning, demonstrating improved robotics control capabilities.

Abstract: Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.

[418] Plugging Schema Graph into Multi-Table QA: A Human-Guided Framework for Reducing LLM Reliance

Xixi Wang, Miguel Costa, Jordanka Kovaceva, Shuai Wang, Francisco C. Pereira

Main category: cs.AI

TL;DR: A graph-based framework for multi-table QA that uses human-curated relational knowledge to encode schema links and join paths, enabling interpretable reasoning chains on complex industrial tabular data.

Details

Motivation: Existing methods for multi-table QA based on semantic similarity struggle with complex real-world scenarios with numerous diverse columns, as they have unreliable schema linking across complex tables.

Method: Propose a graph-based framework that leverages human-curated relational knowledge to explicitly encode schema links and join paths. Uses graph search to construct interpretable reasoning chains with pruning and sub-path merging strategies for efficiency.

Result: Experiments on both standard benchmarks and a realistic large-scale dataset demonstrate the effectiveness of the approach.

Conclusion: This is the first multi-table QA system applied to truly complex industrial tabular data, showing promising results for handling real-world complexity.

Abstract: Large language models (LLMs) have shown promise in table Question Answering (Table QA). However, extending these capabilities to multi-table QA remains challenging due to unreliable schema linking across complex tables. Existing methods based on semantic similarity work well only on simplified hand-crafted datasets and struggle to handle complex, real-world scenarios with numerous and diverse columns. To address this, we propose a graph-based framework that leverages human-curated relational knowledge to explicitly encode schema links and join paths. Given a natural language query, our method searches on graph to construct interpretable reasoning chains, aided by pruning and sub-path merging strategies to enhance efficiency and coherence. Experiments on both standard benchmarks and a realistic, large-scale dataset demonstrate the effectiveness of our approach. To our knowledge, this is the first multi-table QA system applied to truly complex industrial tabular data.

[419] Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents

Jiacheng Miao, Joe R. Davis, Yaohui Zhang, Jonathan K. Pritchard, James Zou

Main category: cs.AI

TL;DR: Paper2Agent is a framework that automatically converts research papers into AI agents that serve as knowledgeable research assistants, enabling natural language interaction with paper methods and workflows.

Details

Motivation: To overcome barriers in research dissemination and reuse by transforming static papers into active AI systems that can accelerate downstream adoption and discovery.

Method: Systematically analyzes papers and associated codebases using multiple agents to construct Model Context Protocol (MCP) servers, then iteratively generates and runs tests to refine the resulting MCPs.

Result: Successfully created reliable paper agents for genomic variant interpretation (AlphaGenome) and single-cell/spatial transcriptomics analyses (ScanPy, TISSUE), reproducing original results and handling novel queries.

Conclusion: Paper2Agent introduces a new paradigm for knowledge dissemination by turning static papers into dynamic AI agents, creating a foundation for collaborative AI co-scientist ecosystems.

Abstract: We introduce Paper2Agent, an automated framework that converts research papers into AI agents. Paper2Agent transforms research output from passive artifacts into active systems that can accelerate downstream use, adoption, and discovery. Conventional research papers require readers to invest substantial effort to understand and adapt a paper’s code, data, and methods to their own work, creating barriers to dissemination and reuse. Paper2Agent addresses this challenge by automatically converting a paper into an AI agent that acts as a knowledgeable research assistant. It systematically analyzes the paper and the associated codebase using multiple agents to construct a Model Context Protocol (MCP) server, then iteratively generates and runs tests to refine and robustify the resulting MCP. These paper MCPs can then be flexibly connected to a chat agent (e.g. Claude Code) to carry out complex scientific queries through natural language while invoking tools and workflows from the original paper. We demonstrate Paper2Agent’s effectiveness in creating reliable and capable paper agents through in-depth case studies. Paper2Agent created an agent that leverages AlphaGenome to interpret genomic variants and agents based on ScanPy and TISSUE to carry out single-cell and spatial transcriptomics analyses. We validate that these paper agents can reproduce the original paper’s results and can correctly carry out novel user queries. Paper2Agent automatically created AI co-scientist that identified new splicing variant associated with ADHD risk. By turning static papers into dynamic, interactive AI agents, Paper2Agent introduces a new paradigm for knowledge dissemination and a foundation for the collaborative ecosystem of AI co-scientists.

[420] RepIt: Representing Isolated Targets to Steer Language Models

Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang

Main category: cs.AI

TL;DR: RepIt is a data-efficient framework that isolates concept-specific representations in LLMs, enabling precise interventions to suppress refusal on targeted concepts while maintaining safety on standard benchmarks.

Details

Motivation: Current activation steering methods in LLMs often have broader effects than desired, motivating the need for purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level.

Method: RepIt framework isolates concept-specific representations using minimal data (as few as a dozen examples) and compute (single A6000), localizing corrective signals to just 100-200 neurons.

Result: RepIt successfully suppresses refusal on targeted concepts (like WMD-related questions) while preserving refusal elsewhere, producing models that still score as safe on standard benchmarks.

Conclusion: Targeted interventions can counteract overgeneralization in LLMs, laying foundation for more granular control of model behavior, though this efficiency also raises concerns about potential misuse with modest compute and data.

Abstract: While activation steering in large language models (LLMs) is a growing area of research, methods can often incur broader effects than desired. This motivates isolation of purer concept vectors to enable targeted interventions and understand LLM behavior at a more granular level. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations. Across five frontier LLMs, RepIt enables precise interventions: it selectively suppresses refusal on targeted concepts while preserving refusal elsewhere, producing models that answer WMD-related questions while still scoring as safe on standard benchmarks. We further show that the corrective signal localizes to just 100-200 neurons and that robust target representations can be extracted from as few as a dozen examples on a single A6000. This efficiency raises a dual concern: manipulations can be performed with modest compute and data to extend to underrepresented data-scarce topics while evading existing benchmarks. By disentangling refusal vectors with RepIt, this work demonstrates that targeted interventions can counteract overgeneralization, laying the foundation for more granular control of model behavior.

[421] SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs

Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang

Main category: cs.AI

TL;DR: SteeringSafety is a framework for evaluating representation steering methods across 7 safety perspectives using 17 datasets, revealing method-model-perspective dependencies and significant safety entanglements.

Details

Motivation: Prior work focused on general capabilities of representation steering, but systematic safety evaluation across multiple perspectives was lacking.

Method: Modular framework implementing DIM, ACE, CAA, PCA, and LAT steering methods with conditional steering enhancements, tested on Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B models.

Result: Strong steering performance depends on method-model-perspective pairing; DIM is consistently effective; all methods show substantial entanglement including 76% degradation in social behaviors, jailbreaking compromises normative judgment, and hallucination steering unpredictably shifts political views.

Conclusion: Holistic safety evaluations are critically needed for representation steering methods due to complex entanglements across safety perspectives.

Abstract: We introduce SteeringSafety, a systematic framework for evaluating representation steering methods across seven safety perspectives spanning 17 datasets. While prior work highlights general capabilities of representation steering, we systematically explore safety perspectives including bias, harmfulness, hallucination, social behaviors, reasoning, epistemic integrity, and normative judgment. Our framework provides modularized building blocks for state-of-the-art steering methods, enabling unified implementation of DIM, ACE, CAA, PCA, and LAT with recent enhancements like conditional steering. Results on Gemma-2-2B, Llama-3.1-8B, and Qwen-2.5-7B reveal that strong steering performance depends critically on pairing of method, model, and specific perspective. DIM shows consistent effectiveness, but all methods exhibit substantial entanglement: social behaviors show highest vulnerability (reaching degradation as high as 76%), jailbreaking often compromises normative judgment, and hallucination steering unpredictably shifts political views. Our findings underscore the critical need for holistic safety evaluations.

[422] Efficient & Correct Predictive Equivalence for Decision Trees

Joao Marques-Silva, Alexey Ignatiev

Main category: cs.AI

TL;DR: This paper identifies limitations in using Quine-McCluskey method for decision tree analysis, showing it can have exponential complexity and incorrect equivalence decisions, and proposes polynomial-time alternatives.

Details

Motivation: To address the redundancy and inaccuracy issues in Rashomon set analysis caused by predictive equivalent decision trees, and to overcome the computational limitations of existing Quine-McCluskey based approaches.

Method: The paper demonstrates the exponential worst-case behavior of Quine-McCluskey method, identifies constraints needed for correct predictive equivalence decisions, and develops polynomial-time algorithms for solving problems previously addressed using DNF minimization.

Result: Experiments show the proposed algorithms are orders of magnitude faster than Quine-McCluskey based methods when worst-case scenarios are triggered, while maintaining correctness.

Conclusion: The paper provides efficient polynomial-time alternatives to Quine-McCluskey method for decision tree analysis, addressing computational complexity and correctness issues while maintaining practical performance.

Abstract: The Rashomon set of decision trees (DTs) finds importance uses. Recent work showed that DTs computing the same classification function, i.e. predictive equivalent DTs, can represent a significant fraction of the Rashomon set. Such redundancy is undesirable. For example, feature importance based on the Rashomon set becomes inaccurate due the existence of predictive equivalent DTs, i.e. DTs with the same prediction for every possible input. In recent work, McTavish et al. proposed solutions for several computational problems related with DTs, including that of deciding predictive equivalent DTs. The approach of McTavish et al. consists of applying the well-known method of Quine-McCluskey (QM) for obtaining minimum-size DNF (disjunctive normal form) representations of DTs, which are then used for comparing DTs for predictive equivalence. Furthermore, the minimum-size DNF representation was also applied to computing explanations for the predictions made by DTs, and to finding predictions in the presence of missing data. However, the problem of formula minimization is hard for the second level of the polynomial hierarchy, and the QM method may exhibit worst-case exponential running time and space. This paper first demonstrates that there exist decision trees that trigger the worst-case exponential running time and space of the QM method. Second, the paper shows that the QM method may incorrectly decide predictive equivalence, if two key constraints are not respected, and one may be difficult to formally guarantee. Third, the paper shows that any of the problems to which the smallest DNF representation has been applied to can be solved in polynomial time, in the size of the DT. The experiments confirm that, for DTs for which the worst-case of the QM method is triggered, the algorithms proposed in this paper are orders of magnitude faster than the ones proposed by McTavish et al.

[423] From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, Zhe Qian, Xinyu Liu, Yiwei Zhang, Junhao Wang, Hengbo Xu, Fei Luo, Xiaohua Chen, Xiaoshuai Hao, Hehan Li, Andi Zhang, Wenxuan Wang, Kaiyan Zhang, Guoli Jia, Lingling Li, Zhiwu Lu, Yang Lu, Yike Guo

Main category: cs.AI

TL;DR: This survey introduces a “From Perception to Cognition” framework to analyze MLLMs’ limitations in integrating visual perception with cognitive reasoning, addressing hallucination issues and proposing future directions.

Details

Motivation: MLLMs often exhibit shallow integration between perception and cognition, leading to reasoning failures like hallucination, which prevents them from building coherent internal world models.

Method: Introduces a unified analytical framework decomposing vision-language understanding into Perception (visual information extraction) and Cognition (proactive reasoning with observe-think-verify loop), then surveys current methods addressing these layers.

Result: Systematically analyzes key bottlenecks in current MLLMs at both perception and cognition layers, surveys cutting-edge enhancement techniques, and reviews relevant benchmarks.

Conclusion: Provides a structured perspective for understanding MLLM limitations and illuminates paths toward next-generation models capable of deep reasoning and genuine world understanding.

Abstract: Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: ``From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.

[424] LLM Based Bayesian Optimization for Prompt Search

Adam Ballew, Jingbo Wang, Shaogang Ren

Main category: cs.AI

TL;DR: BO-LLM algorithm uses Bayesian Optimization with LLM-powered Gaussian Process for prompt engineering to improve text classification accuracy while reducing API calls.

Details

Motivation: To efficiently optimize expensive black-box functions (prompt engineering for LLMs) with limited evaluations using Bayesian Optimization.

Method: Uses LLM-powered Gaussian Process as surrogate model, generates prompt candidates via LLM expansion of seed prompts, evaluates with UCB acquisition function, and iteratively refines prompts on subset data.

Result: Evaluated on two datasets, showing improved classification accuracy while reducing API calls through prediction uncertainty leveraging.

Conclusion: BO-LLM algorithm effectively enhances text classification performance through optimized prompt engineering with reduced computational costs.

Abstract: Bayesian Optimization (BO) has been widely used to efficiently optimize expensive black-box functions with limited evaluations. In this paper, we investigate the use of BO for prompt engineering to enhance text classification with Large Language Models (LLMs). We employ an LLM-powered Gaussian Process (GP) as the surrogate model to estimate the performance of different prompt candidates. These candidates are generated by an LLM through the expansion of a set of seed prompts and are subsequently evaluated using an Upper Confidence Bound (UCB) acquisition function in conjunction with the GP posterior. The optimization process iteratively refines the prompts based on a subset of the data, aiming to improve classification accuracy while reducing the number of API calls by leveraging the prediction uncertainty of the LLM-based GP. The proposed BO-LLM algorithm is evaluated on two datasets, and its advantages are discussed in detail in this paper.

[425] TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation

Yincen Qu, Huan Xiao, Feng Li, Gregory Li, Hui Zhou, Xiangying Dai, Xiaoru Dai

Main category: cs.AI

TL;DR: A comprehensive benchmark for evaluating travel planning capabilities of LLMs, focusing on feasibility, reliability, and engagement with a unified reward system.

Details

Motivation: Existing benchmarks fall short in evaluating key aspects of travel plans like feasibility, reliability, and engagement, necessitating a more comprehensive evaluation framework.

Method: Developed a benchmark with fine-grained criteria unified into a single reward, created a dataset of 4,870 queries including 219 real-world requests, and tested various methods including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO.

Result: The evaluator achieved 60.75% agreement with travel-expert annotations and outperformed multiple LLM-as-judge baselines. RL generally improved itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.

Conclusion: The proposed benchmark enables direct comparison of travel plan quality and seamless integration with RL, showing that RL methods can effectively improve travel planning capabilities of LLMs.

Abstract: Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs’ planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.

[426] Tensor Logic: The Language of AI

Pedro Domingos

Main category: cs.AI

TL;DR: Tensor logic is a new programming language that unifies neural and symbolic AI through tensor equations, treating logical rules and Einstein summation as the same operation.

Details

Motivation: Current AI tools like PyTorch and TensorFlow lack automated reasoning capabilities, while traditional AI languages like LISP and Prolog lack scalability and learning support, creating a fundamental gap in AI development.

Method: The paper proposes tensor logic as a unified language where the sole construct is tensor equations, treating logical rules and Einstein summation as equivalent operations.

Result: Tensor logic elegantly implements key AI forms including transformers, formal reasoning, kernel machines, and graphical models, and enables new capabilities like sound reasoning in embedding space.

Conclusion: Tensor logic combines neural networks’ scalability and learnability with symbolic reasoning’s reliability and transparency, potentially enabling wider AI adoption through fundamental unification.

Abstract: Progress in AI is hindered by the lack of a programming language with all the requisite features. Libraries like PyTorch and TensorFlow provide automatic differentiation and efficient GPU implementation, but are additions to Python, which was never intended for AI. Their lack of support for automated reasoning and knowledge acquisition has led to a long and costly series of hacky attempts to tack them on. On the other hand, AI languages like LISP and Prolog lack scalability and support for learning. This paper proposes tensor logic, a language that solves these problems by unifying neural and symbolic AI at a fundamental level. The sole construct in tensor logic is the tensor equation, based on the observation that logical rules and Einstein summation are essentially the same operation, and all else can be reduced to them. I show how to elegantly implement key forms of neural, symbolic and statistical AI in tensor logic, including transformers, formal reasoning, kernel machines and graphical models. Most importantly, tensor logic makes new directions possible, such as sound reasoning in embedding space. This combines the scalability and learnability of neural networks with the reliability and transparency of symbolic reasoning, and is potentially a basis for the wider adoption of AI.

[427] O-Forge: An LLM + Computer Algebra Framework for Asymptotic Analysis

Ayush Khaitan, Vijay Ganesh

Main category: cs.AI

TL;DR: LLM+CAS framework combines frontier LLMs with computer algebra systems to produce creative and verified proofs for asymptotic inequalities, addressing the verification challenge in AI-assisted mathematical research.

Details

Motivation: To overcome the verification difficulty in using LLMs for research mathematics, where plausible-looking proofs cannot be trusted without rigorous checking, and to answer Terry Tao's question about whether LLMs with verifiers can help prove intricate asymptotic inequalities.

Method: LLM+CAS framework with O-Forge tool that couples frontier LLMs with computer algebra systems in an In-Context Symbolic Feedback loop - LLM suggests domain decompositions and CAS provides axiomatic verification of each piece.

Result: The framework proves remarkably effective at proposing appropriate domain decompositions for asymptotic inequalities, moving AI beyond contest math towards research-level tools for professional mathematicians.

Conclusion: LLM+CAS successfully demonstrates that AI can assist in research-level mathematics by providing both creative suggestions and rigorous verification, particularly for complex asymptotic analysis requiring appropriate domain decomposition.

Abstract: Large language models have recently demonstrated advanced capabilities in solving IMO and Putnam problems; yet their role in research mathematics has remained fairly limited. The key difficulty is verification: suggested proofs may look plausible, but cannot be trusted without rigorous checking. We present a framework, called LLM+CAS, and an associated tool, O-Forge, that couples frontier LLMs with a computer algebra systems (CAS) in an In-Context Symbolic Feedback loop to produce proofs that are both creative and symbolically verified. Our focus is on asymptotic inequalities, a topic that often involves difficult proofs and appropriate decomposition of the domain into the “right” subdomains. Many mathematicians, including Terry Tao, have suggested that using AI tools to find the right decompositions can be very useful for research-level asymptotic analysis. In this paper, we show that our framework LLM+CAS turns out to be remarkably effective at proposing such decompositions via a combination of a frontier LLM and a CAS. More precisely, we use an LLM to suggest domain decomposition, and a CAS (such as Mathematica) that provides a verification of each piece axiomatically. Using this loop, we answer a question posed by Terence Tao: whether LLMs coupled with a verifier can be used to help prove intricate asymptotic inequalities. More broadly, we show how AI can move beyond contest math towards research-level tools for professional mathematicians.

[428] A Methodology for Assessing the Risk of Metric Failure in LLMs Within the Financial Domain

William Flanagan, Mukunda Das, Rajitha Ramanayake, Swanuja Maslekar, Meghana Mangipudi, Joong Ho Choi, Shruti Nair, Shambhavi Bhusan, Sanjana Dulam, Mouni Pendharkar, Nidhi Singh, Vashisth Doshi, Sachi Shah Paresh

Main category: cs.AI

TL;DR: The paper addresses challenges in measuring GenAI performance in financial services, proposing a Risk Assessment Framework to better combine SME evaluation with ML metrics.

Details

Motivation: Traditional ML metrics often fail to generalize to GenAI workloads in finance, and existing benchmarks from research labs don't work well for industrial use. Current approaches also overlook unique risks in metric selection.

Method: The paper develops a Risk Assessment Framework that enables better application of both Subject Matter Expert (SME) evaluation and machine learning metrics for GenAI performance measurement.

Result: The framework helps address the generalization issues of traditional metrics and benchmarks, while accounting for the unique risks involved in metric selection for financial GenAI applications.

Conclusion: A structured risk assessment approach is needed to effectively measure GenAI performance in financial services, combining SME insights with appropriate ML metrics while mitigating selection risks.

Abstract: As Generative Artificial Intelligence is adopted across the financial services industry, a significant barrier to adoption and usage is measuring model performance. Historical machine learning metrics can oftentimes fail to generalize to GenAI workloads and are often supplemented using Subject Matter Expert (SME) Evaluation. Even in this combination, many projects fail to account for various unique risks present in choosing specific metrics. Additionally, many widespread benchmarks created by foundational research labs and educational institutions fail to generalize to industrial use. This paper explains these challenges and provides a Risk Assessment Framework to allow for better application of SME and machine learning Metrics

[429] Training LLM Agents to Empower Humans

Evan Ellis, Vivek Myers, Jens Tuyls, Sergey Levine, Anca Dragan, Benjamin Eysenbach

Main category: cs.AI

TL;DR: The paper proposes Empower, a method for tuning assistive language models to maximize human empowerment rather than completing tasks independently, using only offline text data without explicit human feedback.

Details

Motivation: Current assistive agents often complete tasks on their own rather than truly assisting humans, and require costly explicit human feedback for training.

Method: Empowerment-maximizing method (Empower) that fine-tunes language models to maximize human’s ability to effect desired changes in the environment, using only offline text data.

Result: In user studies, participants preferred Empower assistant 78% of the time with 31% higher acceptance rate and 38% fewer suggestions. In simulated coding environment, Empower increased success rate by 192% over baseline.

Conclusion: Empower provides a framework for aligned AI agents using only offline data without additional human feedback or verifiable rewards.

Abstract: Assistive agents should not only take actions on behalf of a human, but also step out of the way and cede control when there are important decisions to be made. However, current methods for building assistive agents, whether via mimicking expert humans or via RL finetuning on an inferred reward, often encourage agents to complete tasks on their own rather than truly assisting the human attain their objectives. Additionally, these methods often require costly explicit human feedback to provide a training signal. We propose a new approach to tuning assistive language models based on maximizing the human’s empowerment, their ability to effect desired changes in the environment. Our empowerment-maximizing method, Empower, only requires offline text data, providing a self-supervised method for fine-tuning language models to better assist humans. To study the efficacy of our approach, we conducted an 18-person user study comparing our empowerment assistant with a strong baseline. Participants preferred our assistant 78% of the time (p=0.015), with a 31% higher acceptance rate and 38% fewer suggestions. Additionally, we introduce a new environment for evaluating multi-turn code assistance using simulated humans. Using this environment, we show that agents trained with Empower increase the success rate of a simulated human programmer on challenging coding questions by an average of 192% over an SFT baseline. With this empowerment objective, we provide a framework for useful aligned AI agents at scale using only offline data without the need for any additional human feedback or verifiable rewards.

cs.SD

[430] Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?

Qixin Deng, Bryan Pardo, Thrasyvoulos N Pappas

Main category: cs.SD

TL;DR: Evaluation of three joint language-audio embedding models (MS-CLAP, LAION-CLAP, MuQ-MuLan) for capturing human-perceived timbre semantics, with LAION-CLAP showing best performance.

Details

Motivation: To understand how well multimodal embedding models align with human perception of timbre, which is critical for music applications but underexplored in current models.

Method: Evaluated three joint language-audio embedding models on their ability to capture perceptual dimensions of timbre using instrumental sounds and audio effects.

Result: LAION-CLAP consistently provides the most reliable alignment with human-perceived timbre semantics across both instrumental sounds and audio effects.

Conclusion: LAION-CLAP demonstrates superior performance in capturing human perceptual dimensions of timbre compared to other models, making it more suitable for applications requiring accurate timbre representation.

Abstract: Understanding and modeling the relationship between language and sound is critical for applications such as music information retrieval,text-guided music generation, and audio captioning. Central to these tasks is the use of joint language-audio embedding spaces, which map textual descriptions and auditory content into a shared embedding space. While multimodal embedding models such as MS-CLAP, LAION-CLAP, and MuQ-MuLan have shown strong performance in aligning language and audio, their correspondence to human perception of timbre, a multifaceted attribute encompassing qualities such as brightness, roughness, and warmth, remains underexplored. In this paper, we evaluate the above three joint language-audio embedding models on their ability to capture perceptual dimensions of timbre. Our findings show that LAION-CLAP consistently provides the most reliable alignment with human-perceived timbre semantics across both instrumental sounds and audio effects.

[431] Beat Detection as Object Detection

Jaehoon Ahn, Moon-Ryul Jung

Main category: cs.SD

TL;DR: The paper proposes reframing beat and downbeat tracking as an object detection problem, adapting the FCOS detector from computer vision to 1D audio with a WaveBeat backbone and Feature Pyramid Network.

Details

Motivation: Traditional beat tracking models output frame-level activations, but the authors want to model beats and downbeats as temporal objects using object detection techniques.

Method: Adapt FCOS object detector to 1D audio by replacing backbone with WaveBeat’s temporal feature extractor, adding Feature Pyramid Network for multi-scale patterns, and using non-maximum suppression for final predictions.

Result: Achieves competitive results on standard music datasets, demonstrating that object detection techniques can effectively model musical beats with minimal adaptation.

Conclusion: Object detection frameworks can be successfully applied to beat tracking tasks, providing a simpler alternative to traditional methods like dynamic Bayesian networks.

Abstract: Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal “objects.” Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat’s temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions. This NMS step serves a similar role to DBNs in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.

[432] Big Data Approaches to Bovine Bioacoustics: A FAIR-Compliant Dataset and Scalable ML Framework for Precision Livestock Welfare

Mayuri Kate, Suresh Neethirajan

Main category: cs.SD

TL;DR: This paper presents a comprehensive bovine vocalization dataset and processing framework for precision livestock farming, addressing bioacoustic data challenges through curated clips, domain-informed augmentation, and standardized feature engineering.

Details

Motivation: Bioacoustic data streams in livestock farming remain underused due to computational complexity and ecological validity challenges, despite the convergence of IoT sensing, edge computing, and machine learning transforming precision agriculture.

Method: Created one of the most comprehensive bovine vocalization datasets with 569 curated clips covering 48 behavioral classes, expanded to 2900 samples through domain-informed augmentation. Implemented distributed processing framework integrating advanced denoising, multimodal synchronization, and standardized feature engineering with 24 acoustic descriptors from Praat, librosa, and openSMILE.

Result: Preliminary benchmarks reveal distinct class-level acoustic patterns for estrus detection, distress classification, and maternal communication. The dataset’s ecological realism ensures readiness for field deployment, addressing major Big Data challenges including volume (90 hours, 65.6 GB), variety, velocity, and veracity.

Conclusion: This work establishes a foundation for animal-centered AI, enabling continuous and non-invasive welfare assessment at industrial scale. The framework supports UN SDG 9, showing how data science can transform traditional farming into intelligent, welfare-optimized systems that meet global food needs while upholding ethical animal care.

Abstract: The convergence of IoT sensing, edge computing, and machine learning is transforming precision livestock farming. Yet bioacoustic data streams remain underused because of computational complexity and ecological validity challenges. We present one of the most comprehensive bovine vocalization datasets to date, with 569 curated clips covering 48 behavioral classes, recorded across three commercial dairy farms using multiple microphone arrays and expanded to 2900 samples through domain informed augmentation. This FAIR compliant resource addresses major Big Data challenges - volume (90 hours of recordings, 65.6 GB), variety (multi farm and multi zone acoustics), velocity (real time processing), and veracity (noise robust feature extraction). Our distributed processing framework integrates advanced denoising using iZotope RX, multimodal synchronization through audio and video alignment, and standardized feature engineering with 24 acoustic descriptors generated from Praat, librosa, and openSMILE. Preliminary benchmarks reveal distinct class level acoustic patterns for estrus detection, distress classification, and maternal communication. The datasets ecological realism, reflecting authentic barn acoustics rather than controlled settings, ensures readiness for field deployment. This work establishes a foundation for animal centered AI, where bioacoustic data enable continuous and non invasive welfare assessment at industrial scale. By releasing standardized pipelines and detailed metadata, we promote reproducible research that connects Big Data analytics, sustainable agriculture, and precision livestock management. The framework supports UN SDG 9, showing how data science can turn traditional farming into intelligent, welfare optimized systems that meet global food needs while upholding ethical animal care.

[433] AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio-Generation

Hui Wang, Jinghua Zhao, Cheng Liu, Yuhang Jia, Haoqin Sun, Jiaming Zhou, Yong Qin

Main category: cs.SD

TL;DR: AudioEval is a large-scale text-to-audio evaluation dataset with 4,200 samples and 126,000 ratings, enabling development of Qwen-DisQA model for automated quality assessment.

Details

Motivation: Current TTA evaluation methods are limited - human ratings are expensive and objective metrics only capture partial aspects of perceptual quality, creating a need for better evaluation tools.

Method: Created AudioEval dataset with 4,200 audio samples from 24 systems and 126,000 ratings across 5 perceptual dimensions from both experts and non-experts. Developed Qwen-DisQA, a multimodal model that processes text prompts and generated audio to predict human ratings.

Result: Qwen-DisQA effectively provides reliable and scalable evaluation of TTA systems, demonstrating strong performance in predicting human-like quality ratings.

Conclusion: AudioEval dataset and Qwen-DisQA model address the TTA evaluation gap by providing comprehensive human ratings and automated scoring, accelerating future research in text-to-audio generation.

Abstract: Text-to-audio (TTA) is rapidly advancing, with broad potential in virtual reality, accessibility, and creative media. However, evaluating TTA quality remains difficult: human ratings are costly and limited, while existing objective metrics capture only partial aspects of perceptual quality. To address this gap, we introduce AudioEval, the first large-scale TTA evaluation dataset, containing 4,200 audio samples from 24 systems with 126,000 ratings across five perceptual dimensions, annotated by both experts and non-experts. Based on this resource, we propose Qwen-DisQA, a multimodal scoring model that jointly processes text prompts and generated audio to predict human-like quality ratings. Experiments show its effectiveness in providing reliable and scalable evaluation. The dataset will be made publicly available to accelerate future research.

[434] SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

Hui Wang, Jinghua Zhao, Yifan Yang, Shujie Liu, Junyang Chen, Yanzhe Zhang, Shiwan Zhao, Jinyu Li, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin

Main category: cs.SD

TL;DR: SpeechLLM-as-Judges enables LLMs to perform structured, explanation-based speech quality evaluation using a new dataset SpeechEval and trained model SQ-LLM.

Details

Motivation: Existing speech quality evaluation methods lack interpretability and generalization across tasks and languages, relying on scalar scores or binary decisions.

Method: Developed SpeechEval dataset with 32,207 multilingual speech clips and 128,754 annotations across four tasks, then trained SQ-LLM using chain-of-thought reasoning and reward optimization.

Result: SQ-LLM delivers strong performance across tasks and languages, demonstrating the potential of this paradigm for advancing speech quality evaluation.

Conclusion: The SpeechLLM-as-Judges paradigm shows promise for improving speech quality evaluation, with relevant resources to be open-sourced.

Abstract: Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. Relevant resources will be open-sourced.

[435] TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation

Ming-Hao Hsu, Liang-Hsuan Tseng, Hung-yi Lee, Zhizheng Wu

Main category: cs.SD

TL;DR: TASLA is a text-aligned speech tokenization framework that improves acoustic detail preservation under low frame rates using multi-layer aggregation and dynamic attention mechanisms.

Details

Motivation: Address the trade-off in text-aligned speech tokenization where single-source tokens lose acoustic details during reconstruction, particularly in low-frame-rate regimes, while previous methods like TASTE struggle to capture acoustic details.

Method: Uses Multi-Layer Dynamic Attention (MLDA) to let each text position adaptively mix shallow/deep features from a frozen speech encoder, combined with Finite Scalar Quantization (FSQ) for per-dimension discretization with smooth optimization.

Result: At about 2.62 Hz token rate, TASLA consistently improves prosody and achieves competitive quality over TASTE on both in-domain (LibriSpeech) and out-of-domain (EXPRESSO, Voxceleb) datasets.

Conclusion: Dynamic layer mixing in MLDA is correlated with spectral flux and explains why the method preserves prosody under extreme feature compression at low frame rates.

Abstract: We propose Text-Aligned Speech Tokens with Multiple Layer-Aggregation (TASLA), which is a text-aligned speech tokenization framework that aims to address the problem that under a low-frame-rate and text-aligned regime, single-source speech tokens may lose acoustic details during reconstruction. On the other hand, this paper further explains how different encoder layers collaborate to capture comprehensive acoustic features for tokenization. Previous work, TASTE, proposed the text-aligned speech tokenization framework, which is a LM-friendly architecture, but struggles to capture acoustic details. We address this trade-off with two components: Multi-Layer Dynamic Attention (MLDA), which lets each text position adaptively mix shallow/deep features from a frozen speech encoder, and Finite Scalar Quantization (FSQ), a simple per-dimension discretization with smooth optimization. At about 2.62 Hz (tokens/s), TASLA consistently improves prosody and achieves competitive quality over TASTE on in-domain (LibriSpeech) and OOD (EXPRESSO, Voxceleb) sets. We further demonstrate that dynamic layer mixing is correlated with spectral flux and explains why MLDA preserves prosody under a low frame rate with extreme feature compression.

cs.LG

[436] Large Language Models for Real-World IoT Device Identification

Rameen Mahmood, Tousif Ahmed, Sai Teja Peddinti, Danny Yuxing Huang

Main category: cs.LG

TL;DR: A semantic inference pipeline reframes IoT device identification as a language modeling task using network metadata, achieving high accuracy across 2,015 vendors while maintaining resilience to missing data and adversarial attacks.

Details

Motivation: The rapid expansion of IoT devices has outpaced current identification methods, creating security, privacy, and network accountability risks, especially in open-world environments with incomplete or obfuscated traffic metadata.

Method: Introduced a semantic inference pipeline using language modeling over network metadata. Generated high-fidelity vendor labels using LLM ensemble with mutual-information and entropy-based stability scores. Instruction-tuned a quantized LLaMA3.18B model with curriculum learning.

Result: Achieved 98.25% top-1 accuracy and 90.73% macro accuracy across 2,015 vendors. Maintained resilience to missing fields, protocol drift, and adversarial manipulation. Demonstrated scalability and interpretability on independent IoT testbed.

Conclusion: Instruction-tuned LLMs provide a scalable and interpretable foundation for real-world device identification at scale, addressing the limitations of current IoT identification methods.

Abstract: The rapid expansion of IoT devices has outpaced current identification methods, creating significant risks for security, privacy, and network accountability. These challenges are heightened in open-world environments, where traffic metadata is often incomplete, noisy, or intentionally obfuscated. We introduce a semantic inference pipeline that reframes device identification as a language modeling task over heterogeneous network metadata. To construct reliable supervision, we generate high-fidelity vendor labels for the IoT Inspector dataset, the largest real-world IoT traffic corpus, using an ensemble of large language models guided by mutual-information and entropy-based stability scores. We then instruction-tune a quantized LLaMA3.18B model with curriculum learning to support generalization under sparsity and long-tail vendor distributions. Our model achieves 98.25% top-1 accuracy and 90.73% macro accuracy across 2,015 vendors while maintaining resilience to missing fields, protocol drift, and adversarial manipulation. Evaluation on an independent IoT testbed, coupled with explanation quality and adversarial stress tests, demonstrates that instruction-tuned LLMs provide a scalable and interpretable foundation for real-world device identification at scale.

[437] Self-Training with Dynamic Weighting for Robust Gradual Domain Adaptation

Zixi Wang, Yushe Cao, Yubo Huang, Jinzhu Wei, Jingzehua Xu, Shuai Zhang, Xin Lai

Main category: cs.LG

TL;DR: STDW enhances gradual domain adaptation with dynamic weighting to balance source and target domain losses during training, outperforming baselines on multiple datasets.

Details

Motivation: Address inefficient knowledge migration and incomplete intermediate data in traditional gradual domain adaptation methods.

Method: Self-training with dynamic weighting mechanism using time-varying hyperparameter to control domain-specific learning strength and optimize weighted objective function.

Result: Outperforms existing baselines on rotated MNIST, color-shifted MNIST, portrait datasets, and Cover Type dataset; ablation studies validate dynamic scheduling effectiveness.

Conclusion: Provides theoretical insights and practical framework for robust gradual domain adaptation with applications in dynamic real-world scenarios.

Abstract: In this paper, we propose a new method called Self-Training with Dynamic Weighting (STDW), which aims to enhance robustness in Gradual Domain Adaptation (GDA) by addressing the challenge of smooth knowledge migration from the source to the target domain. Traditional GDA methods mitigate domain shift through intermediate domains and self-training but often suffer from inefficient knowledge migration or incomplete intermediate data. Our approach introduces a dynamic weighting mechanism that adaptively balances the loss contributions of the source and target domains during training. Specifically, we design an optimization framework governed by a time-varying hyperparameter $\varrho$ (progressing from 0 to 1), which controls the strength of domain-specific learning and ensures stable adaptation. The method leverages self-training to generate pseudo-labels and optimizes a weighted objective function for iterative model updates, maintaining robustness across intermediate domains. Experiments on rotated MNIST, color-shifted MNIST, portrait datasets, and the Cover Type dataset demonstrate that STDW outperforms existing baselines. Ablation studies further validate the critical role of $\varrho$’s dynamic scheduling in achieving progressive adaptation, confirming its effectiveness in reducing domain bias and improving generalization. This work provides both theoretical insights and a practical framework for robust gradual domain adaptation, with potential applications in dynamic real-world scenarios. The code is available at https://github.com/Dramwig/STDW.

[438] Deep Edge Filter: Return of the Human-Crafted Layer in Deep Learning

Dongkwan Lee, Junhoo Lee, Nojun Kwak

Main category: cs.LG

TL;DR: Deep Edge Filter applies high-pass filtering to neural network features to improve generalization by isolating task-relevant high-frequency components while removing domain-specific low-frequency biases.

Details

Motivation: The hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features.

Method: Subtracting low-pass filtered outputs from original features to isolate generalizable representations while preserving architectural integrity.

Result: Consistent performance improvements across diverse domains (Vision, Text, 3D, Audio) regardless of model architecture and data modality. Analysis shows feature sparsification and effective isolation of high-frequency components.

Conclusion: The method effectively improves model generalizability by filtering out domain-specific biases while preserving task-relevant information, with empirical validation of the core hypothesis.

Abstract: We introduce the Deep Edge Filter, a novel approach that applies high-pass filtering to deep neural network features to improve model generalizability. Our method is motivated by our hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features. By subtracting low-pass filtered outputs from original features, our approach isolates generalizable representations while preserving architectural integrity. Experimental results across diverse domains such as Vision, Text, 3D, and Audio demonstrate consistent performance improvements regardless of model architecture and data modality. Analysis reveals that our method induces feature sparsification and effectively isolates high-frequency components, providing empirical validation of our core hypothesis. The code is available at https://github.com/dongkwani/DeepEdgeFilter.

[439] Revisit Modality Imbalance at the Decision Layer

Xiaoyu Ma, Hao Chen

Main category: cs.LG

TL;DR: This paper reveals that modality imbalance in multimodal learning occurs not only during representation learning but also significantly at the decision layer, where uncalibrated modality outputs lead to biased weighting that hinders weaker modalities from contributing effectively.

Details

Motivation: Multimodal learning suffers from modality imbalance where dominant modalities overshadow weaker ones during joint optimization, limiting the full potential of multimodal integration.

Method: Experiments on audio-visual datasets (CREMAD and Kinetic-Sounds) to analyze modality bias, examining feature-space and decision-weight distributions to understand the root causes of imbalance.

Result: Even after extensive pretraining and balanced optimization, models exhibit systematic bias toward certain modalities (like audio), with bias originating from intrinsic disparities in feature-space and decision-weight distributions rather than optimization dynamics alone.

Conclusion: Future multimodal systems should incorporate adaptive weight allocation mechanisms at the decision layer to enable balanced contributions according to each modality’s capabilities.

Abstract: Multimodal learning integrates information from different modalities to enhance model performance, yet it often suffers from modality imbalance, where dominant modalities overshadow weaker ones during joint optimization. This paper reveals that such an imbalance not only occurs during representation learning but also manifests significantly at the decision layer. Experiments on audio-visual datasets (CREMAD and Kinetic-Sounds) show that even after extensive pretraining and balanced optimization, models still exhibit systematic bias toward certain modalities, such as audio. Further analysis demonstrates that this bias originates from intrinsic disparities in feature-space and decision-weight distributions rather than from optimization dynamics alone. We argue that aggregating uncalibrated modality outputs at the fusion stage leads to biased decision-layer weighting, hindering weaker modalities from contributing effectively. To address this, we propose that future multimodal systems should focus more on incorporate adaptive weight allocation mechanisms at the decision layer, enabling relative balanced according to the capabilities of each modality.

[440] CoLoR-GAN: Continual Few-Shot Learning with Low-Rank Adaptation in Generative Adversarial Networks

Munsif Ali, Leonardo Rossi, Massimo Bertozzi

Main category: cs.LG

TL;DR: CoLoR-GAN introduces a framework for continual few-shot learning in GANs using low-rank adaptation to reduce parameters while maintaining performance.

Details

Motivation: Current CL methods for GANs require adding significant new weights at each iteration, which becomes problematic for long-term learning. CoLoR-GAN addresses this by combining FS and CL with efficient parameter usage.

Method: Uses low-rank tensors for model adaptation, applies vanilla LoRA, introduces LLoRA for convolutional layers, and provides empirical hyperparameter optimization.

Result: Achieves state-of-the-art performance with significantly reduced computational resources compared to existing methods.

Conclusion: CoLoR-GAN effectively handles continual few-shot learning in GANs with minimal parameter growth while maintaining competitive performance.

Abstract: Continual learning (CL) in the context of Generative Adversarial Networks (GANs) remains a challenging problem, particularly when it comes to learn from a few-shot (FS) samples without catastrophic forgetting. Current most effective state-of-the-art (SOTA) methods, like LFS-GAN, introduce a non-negligible quantity of new weights at each training iteration, which would become significant when considering the long term. For this reason, this paper introduces \textcolor{red}{\textbf{\underline{c}}}ontinual few-sh\textcolor{red}{\textbf{\underline{o}}}t learning with \textcolor{red}{\textbf{\underline{lo}}}w-\textcolor{red}{\textbf{\underline{r}}}ank adaptation in GANs named CoLoR-GAN, a framework designed to handle both FS and CL together, leveraging low-rank tensors to efficiently adapt the model to target tasks while reducing even more the number of parameters required. Applying a vanilla LoRA implementation already permitted us to obtain pretty good results. In order to optimize even further the size of the adapters, we challenged LoRA limits introducing a LoRA in LoRA (LLoRA) technique for convolutional layers. Finally, aware of the criticality linked to the choice of the hyperparameters of LoRA, we provide an empirical study to easily find the best ones. We demonstrate the effectiveness of CoLoR-GAN through experiments on several benchmark CL and FS tasks and show that our model is efficient, reaching SOTA performance but with a number of resources enormously reduced. Source code is available on \href{https://github.com/munsifali11/CoLoR-GAN}{Github.

[441] Joint Discriminative-Generative Modeling via Dual Adversarial Training

Xuwang Yin, Claire Zhang, Julie Steele, Nir Shavit, Tony T. Wang

Main category: cs.LG

TL;DR: A novel training framework that integrates adversarial training principles to achieve both robust classification and high-fidelity generative modeling in a single unified framework, addressing limitations of Joint Energy-Based Models.

Details

Motivation: To overcome the instability and poor sample quality of SGLD-based training in hybrid approaches like JEM, while simultaneously achieving robust classification and high-quality generative modeling within a single framework.

Method: Three key innovations: (1) replacing SGLD-based JEM learning with stable AT-based approach using BCE loss to discriminate real data from PGD-generated contrastive samples; (2) synergistic adversarial training for discriminative component; (3) two-stage training to resolve batch normalization and EBM incompatibility.

Result: Substantially improves adversarial robustness over existing hybrid models while maintaining competitive generative performance. On ImageNet, generative fidelity surpasses BigGAN and approaches diffusion models, representing the first MCMC-based EBM to achieve high-quality generation on complex, high-resolution datasets.

Conclusion: Adversarial training can serve as an effective foundation for unified frameworks capable of generating and robustly classifying visual data, addressing key stability issues that have limited JEM scaling.

Abstract: Simultaneously achieving robust classification and high-fidelity generative modeling within a single framework presents a significant challenge. Hybrid approaches, such as Joint Energy-Based Models (JEM), interpret classifiers as EBMs but are often limited by the instability and poor sample quality inherent in SGLD-based training. We address these limitations by proposing a novel training framework that integrates adversarial training (AT) principles for both discriminative robustness and stable generative learning. The proposed method introduces three key innovations: (1) the replacement of SGLD-based JEM learning with a stable, AT-based approach that optimizes the energy function by discriminating between real data and PGD-generated contrastive samples using the BCE loss; (2) synergistic adversarial training for the discriminative component that enhances classification robustness while eliminating the need for explicit gradient penalties; and (3) a two-stage training procedure to resolve the incompatibility between batch normalization and EBM training. Experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that our method substantially improves adversarial robustness over existing hybrid models while maintaining competitive generative performance. On ImageNet, when optimized for generative modeling, our model’s generative fidelity surpasses that of BigGAN and approaches diffusion models, representing the first MCMC-based EBM approach to achieve high-quality generation on complex, high-resolution datasets. Our approach addresses key stability issues that have limited JEM scaling and demonstrates that adversarial training can serve as an effective foundation for unified frameworks capable of generating and robustly classifying visual data.

[442] K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding

Yifeng Yao, Yike Yun, Jing Wang, Huishuai Zhang, Dongyan Zhao, Ke Tian, Zhihao Wang, Minghui Qiu, Tao Wang

Main category: cs.LG

TL;DR: K-frames introduces a scene-driven keyframe selection method for long-video understanding that predicts semantically coherent clips instead of individual frames, enabling flexible any-k keyframe selection while preserving temporal continuity.

Details

Motivation: Current MLLMs struggle with long videos due to context window limitations and computational costs. Existing keyframe selection methods produce sparse, disjointed frames that lose scene continuity and lack flexibility for multi-scale selection.

Method: Three-stage progressive curriculum: 1) Supervised Fine-Tuning for temporal grounding, 2) Supervised Fine-Tuning for key-clip perception, 3) Reinforcement Learning to optimize scene-driven prediction policy without additional annotations. Built on PeakClips dataset of 200K query-conditioned video highlights.

Result: Extensive experiments on major long-video understanding benchmarks demonstrate K-frames provides effective, interpretable, and plug-and-play solution for keyframe selection at various scales.

Conclusion: K-frames offers a novel paradigm for scene-driven keyframe selection that preserves temporal continuity and enables flexible any-k frame selection, addressing limitations of existing methods while being computationally efficient.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in image understanding, but long-video are constrained by context windows and computational cost. Uniform frame sampling often leads to substantial information loss. Meanwhile existing keyframe selection methods such as text-frame retrieval or RL-based frame optimization typically yield sparse and temporally disjointed frames, overlooking scene continuity and lacking flexibility for multi-scale frame selection. To address these limitations, we introduce K-frames, a novel paradigm for scene-driven keyframe selection that preserves temporal continuity. Instead of selecting individual frames, K-frames predicts semantically coherent, query-relevant clips, which enables any-k keyframes selection to meet diverse user budgets. To achieve this approach, we first introduce PeakClips, a dataset of 200K video highlights conditioned by query. Building on this dataset, K-frames learns clip2frame selection using a three-stage progressive curriculum. It involves two Supervised Fine-Tuning stages for temporal grounding and key-clip perception, followed by a Reinforcement Learning stage that directly optimizes the scene-driven prediction policy for downstream task without further annotations. Extensive experiments on major long-video understanding benchmarks demonstrate that K-frames provides an effective, interpretable, and plug-and-play solution for keyframe selection at various scales. Our dataset and model will be available.

[443] Multi-View Semi-Supervised Label Distribution Learning with Local Structure Complementarity

Yanshan Xiao, Kaihong Wu, Bo Liu

Main category: cs.LG

TL;DR: Proposes MVSS-LDL, the first multi-view semi-supervised label distribution learning method that leverages local structure complementarity across views to improve classification performance.

Details

Motivation: Existing LDL approaches only handle single-view problems with labeled data, leaving multi-view LDL with both labeled and unlabeled data unexplored.

Method: Uses k-nearest neighbors to capture local structure in each view, then complements each view’s neighbor set with neighbors from other views, and constructs a graph learning-based multi-view semi-supervised LDL model.

Result: Numerical studies show MVSS-LDL achieves significantly better classification performance than existing single-view LDL methods.

Conclusion: MVSS-LDL successfully addresses multi-view semi-supervised LDL by exploiting local structure complementarity, representing the first attempt at multi-view label distribution learning.

Abstract: Label distribution learning (LDL) is a paradigm that each sample is associated with a label distribution. At present, the existing approaches are proposed for the single-view LDL problem with labeled data, while the multi-view LDL problem with labeled and unlabeled data has not been considered. In this paper, we put forward the multi-view semi-supervised label distribution learning with local structure complementarity (MVSS-LDL) approach, which exploits the local nearest neighbor structure of each view and emphasizes the complementarity of local nearest neighbor structures in multiple views. Specifically speaking, we first explore the local structure of view $v$ by computing the $k$-nearest neighbors. As a result, the $k$-nearest neighbor set of each sample $\boldsymbol{x}_i$ in view $v$ is attained. Nevertheless, this $k$-nearest neighbor set describes only a part of the nearest neighbor information of sample $\boldsymbol{x}_i$. In order to obtain a more comprehensive description of sample $\boldsymbol{x}_i$’s nearest neighbors, we complement the nearest neighbor set in view $v$ by incorporating sample $\boldsymbol{x}_i$’s nearest neighbors in other views. Lastly, based on the complemented nearest neighbor set in each view, a graph learning-based multi-view semi-supervised LDL model is constructed. By considering the complementarity of local nearest neighbor structures, different views can mutually provide the local structural information to complement each other. To the best of our knowledge, this is the first attempt at multi-view LDL. Numerical studies have demonstrated that MVSS-LDL attains explicitly better classification performance than the existing single-view LDL methods.

[444] Weight Weaving: Parameter Pooling for Data-Free Model Merging

Levy Chaves, Eduardo Valle, Sandra Avila

Main category: cs.LG

TL;DR: Weight Weaving is a data-free model merging technique that pools weights across different scaling factor values using pooling functions, eliminating the need for evaluation data and improving performance of existing merging methods.

Details

Motivation: Current model merging methods heavily depend on scaling hyperparameters that require data for tuning, which is impractical in real-world scenarios where evaluation data is unavailable.

Method: A plug-and-play technique that pools model weights across the scaling factor search space using user-defined pooling functions like averaging, random selection, or existing merging methods.

Result: Achieves average accuracy gains of up to 15.9 percentage points in data-free settings across three ViT variants in multi-task learning, continual learning, and domain generalization setups.

Conclusion: Weight Weaving provides a practical, modular solution for data-free model merging that consistently improves existing methods without requiring evaluation data.

Abstract: Model merging provides a cost-effective and data-efficient combination of specialized deep neural networks through parameter integration. This technique leverages expert models across downstream tasks without requiring retraining. Most model merging approaches critically depend on scaling hyper-parameters $\lambda$, which weight each model’s contribution globally or individually. Principled approaches for setting scaling factors without accessing any data (data-free) are scarce, often leading researchers to tune $\lambda$ using privileged data from the evaluation set, which is obviously unfeasible in practice. To address this limitation, we introduce Weight Weaving, a plug-and-play technique that pools model weights across $\lambda$ values search space using user-defined pooling functions, such as averaging, random selection, or even existing model merging methods. Our method demonstrates high modularity, imposing minimal constraints on the search space. It operates orthogonally to existing model merging methods and eliminates evaluation data requirements. We validate Weight Weaving across three ViT variants in three experimental setups: vision multi-task learning, vision continual learning, and domain generalization. Our method consistently improves the performance of several model merging methods, achieving average accuracy gains of up to 15.9 percentage points in a data-free setting.

[445] LTR-ICD: A Learning-to-Rank Approach for Automatic ICD Coding

Mohammad Mansoori, Amira Soliman, Farzaneh Etminani

Main category: cs.LG

TL;DR: This paper proposes a novel approach to ICD code assignment that treats it as both classification and ranking task, focusing on code order importance, achieving superior performance in identifying high-priority codes compared to traditional classification methods.

Details

Motivation: Automating ICD code assignment from clinical notes is challenging, and existing methods ignore the order of codes which is essential for medical diagnosis and reimbursement purposes.

Method: The paper approaches ICD code assignment as a retrieval system problem, formulating it as both classification and ranking task to consider the order of diagnostic codes.

Result: The proposed framework achieves 47% accuracy in ranking primary diagnosis codes (vs 20% for state-of-the-art classifier) and micro-/macro-F1 scores of 0.6065/0.2904, surpassing previous best model (0.597/0.2660).

Conclusion: The retrieval-based approach that considers code order significantly improves ICD code assignment performance, particularly in identifying high-priority codes, demonstrating the importance of treating this task as both classification and ranking.

Abstract: Clinical notes contain unstructured text provided by clinicians during patient encounters. These notes are usually accompanied by a sequence of diagnostic codes following the International Classification of Diseases (ICD). Correctly assigning and ordering ICD codes are essential for medical diagnosis and reimbursement. However, automating this task remains challenging. State-of-the-art methods treated this problem as a classification task, leading to ignoring the order of ICD codes that is essential for different purposes. In this work, as a first attempt, we approach this task from a retrieval system perspective to consider the order of codes, thus formulating this problem as a classification and ranking task. Our results and analysis show that the proposed framework has a superior ability to identify high-priority codes compared to other methods. For instance, our model accuracy in correctly ranking primary diagnosis codes is 47%, compared to 20% for the state-of-the-art classifier. Additionally, in terms of classification metrics, the proposed model achieves a micro- and macro-F1 scores of 0.6065 and 0.2904, respectively, surpassing the previous best model with scores of 0.597 and 0.2660.

[446] Distributional Consistency Loss: Beyond Pointwise Data Terms in Inverse Problems

George Webber, Andrew J. Reader

Main category: cs.LG

TL;DR: The paper introduces Distributional Consistency (DC) loss as a plug-in replacement for conventional data-fidelity losses in inverse problems, which evaluates data-fidelity collectively through distribution-level calibration rather than pointwise matching to avoid overfitting to noise.

Details

Motivation: Current data-fidelity loss functions like MSE seek pointwise agreement with noisy measurements, leading to overfitting to noise. The authors aim to develop a statistically grounded alternative that avoids this issue by testing whether measurements are statistically consistent with the noise distributions implied by current estimates.

Method: The authors introduce Distributional Consistency (DC) loss, which replaces pointwise matching with distribution-level calibration using model-based probability scores for each measurement. It’s designed as a direct plug-in replacement compatible with modern regularizers and optimized similarly to traditional losses.

Result: In image denoising with deep image prior, DC loss eliminates the need for early stopping and achieves higher PSNR compared to MSE loss. In medical image reconstruction from Poisson-noisy data, DC loss reduces artifacts in highly-iterated reconstructions and enhances hand-crafted regularization efficacy.

Conclusion: DC loss serves as a statistically grounded, performance-enhancing alternative to conventional fidelity losses for inverse problems, particularly effective when measurement-noise distribution is known and datasets contain many independent noisy values.

Abstract: Recovering true signals from noisy measurements is a central challenge in inverse problems spanning medical imaging, geophysics, and signal processing. Current solutions balance prior assumptions regarding the true signal (regularization) with agreement to noisy measured data (data-fidelity). Conventional data-fidelity loss functions, such as mean-squared error (MSE) or negative log-likelihood, seek pointwise agreement with noisy measurements, often leading to overfitting to noise. In this work, we instead evaluate data-fidelity collectively by testing whether the observed measurements are statistically consistent with the noise distributions implied by the current estimate. We adopt this aggregated perspective and introduce distributional consistency (DC) loss, a data-fidelity objective that replaces pointwise matching with distribution-level calibration using model-based probability scores for each measurement. DC loss acts as a direct and practical plug-in replacement for standard data consistency terms: i) it is compatible with modern regularizers, ii) it is optimized in the same way as traditional losses, and iii) it avoids overfitting to measurement noise even without the use of priors. Its scope naturally fits many practical inverse problems where the measurement-noise distribution is known and where the measured dataset consists of many independent noisy values. We demonstrate efficacy in two key example application areas: i) in image denoising with deep image prior, using DC instead of MSE loss removes the need for early stopping and achieves higher PSNR; ii) in medical image reconstruction from Poisson-noisy data, DC loss reduces artifacts in highly-iterated reconstructions and enhances the efficacy of hand-crafted regularization. These results position DC loss as a statistically grounded, performance-enhancing alternative to conventional fidelity losses for inverse problems.

[447] BitNet Distillation

Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, Furu Wei

Main category: cs.LG

TL;DR: BitNet Distillation (BitDistill) is a lightweight pipeline that converts full-precision LLMs into 1.58-bit ternary weight models for specific tasks, achieving comparable performance with 10x memory savings and 2.65x faster CPU inference.

Details

Motivation: To enable efficient deployment of large language models on resource-constrained devices by reducing model precision while maintaining task-specific performance.

Method: Combines three techniques: SubLN module from BitNet, multi-head attention distillation from MiniLM, and continual pre-training as a warm-up step to bridge performance gaps.

Result: Achieves performance comparable to full-precision models across different model sizes while enabling 10x memory savings and 2.65x faster inference on CPUs.

Conclusion: BitDistill provides an effective approach for creating highly efficient 1.58-bit LLMs that maintain strong task performance with significant computational benefits.

Abstract: In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue of the performance gap between finetuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to the full-precision counterpart models across model size, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.

[448] REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa

Main category: cs.LG

TL;DR: Expert pruning outperforms expert merging for SMoE model compression in generative tasks, with proposed REAP method achieving near-lossless compression at 50% pruning.

Details

Motivation: SMoE models have large parameter counts creating significant memory overhead, motivating research into expert compression methods.

Method: Proposed Router-weighted Expert Activation Pruning (REAP), a pruning criterion that considers both router gate-values and expert activation norms.

Result: REAP consistently outperforms merging and other pruning methods across SMoE models from 20B to 1T parameters, achieving near-lossless compression on code generation and tool-calling tasks with 50% expert pruning.

Conclusion: Expert pruning is superior to merging for generative tasks due to avoiding functional subspace collapse, and REAP provides an effective compression strategy.

Abstract: Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we demonstrate that expert pruning is a superior strategy for generative tasks. We prove that merging introduces an irreducible error by causing a “functional subspace collapse”, due to the loss of the router’s independent, input-dependent control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation and tool-calling tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.

[449] Conditional Clifford-Steerable CNNs with Complete Kernel Basis for PDE Modeling

Bálint László Szarvas, Maksim Zhdanov

Main category: cs.LG

TL;DR: The paper addresses incompleteness in Clifford-Steerable CNN kernel bases by proposing Conditional Clifford-Steerable Kernels that use input-dependent representations to enhance model expressivity, showing improved performance on PDE forecasting tasks.

Details

Motivation: The kernel basis of Clifford-Steerable CNNs (CSCNNs) is not complete, which limits model expressivity despite their ability to incorporate equivariance to arbitrary pseudo-Euclidean groups.

Method: Proposed Conditional Clifford-Steerable Kernels that augment kernels with equivariant representations computed from input feature fields, with efficient solution via implicit parameterization of equivariance constraints.

Result: Empirical demonstration of improved expressivity on multiple PDE forecasting tasks including fluid dynamics and relativistic electrodynamics, consistently outperforming baseline methods.

Conclusion: Conditional Clifford-Steerable Kernels successfully address the expressivity limitations of CSCNNs by incorporating input-dependent equivariant representations, leading to superior performance on complex physical modeling tasks.

Abstract: Clifford-Steerable CNNs (CSCNNs) provide a unified framework that allows incorporating equivariance to arbitrary pseudo-Euclidean groups, including isometries of Euclidean space and Minkowski spacetime. In this work, we demonstrate that the kernel basis of CSCNNs is not complete, thus limiting the model expressivity. To address this issue, we propose Conditional Clifford-Steerable Kernels, which augment the kernels with equivariant representations computed from the input feature field. We derive the equivariance constraint for these input-dependent kernels and show how it can be solved efficiently via implicit parameterization. We empirically demonstrate an improved expressivity of the resulting framework on multiple PDE forecasting tasks, including fluid dynamics and relativistic electrodynamics, where our method consistently outperforms baseline methods.

[450] Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training

Jie Hao, Xiaochuan Gong, Jie Xu, Zhengdao Wang, Mingrui Liu

Main category: cs.LG

TL;DR: The paper introduces a noise-adaptive layerwise learning rate scheme for geometry-aware optimization algorithms that accelerates DNN training by adapting learning rates dynamically within layer groups based on gradient variance estimation.

Details

Motivation: Standard geometry-aware optimizers use fixed learning rates within layer groups, but local curvature varies across layers and changes dynamically during training, making fixed rates inefficient.

Method: Proposes estimating gradient variance in the dual norm induced by the linear minimization oracle (LMO) and using it to assign time-varying, noise-adaptive layerwise learning rates within each group.

Result: Empirical results on transformer architectures (LLaMA and GPT) show faster convergence than state-of-the-art optimizers, with theoretical analysis confirming sharp convergence rates.

Conclusion: The noise-adaptive layerwise learning rate scheme substantially accelerates DNN training compared to methods using fixed learning rates within layer groups.

Abstract: Geometry-aware optimization algorithms, such as Muon, have achieved remarkable success in training deep neural networks (DNNs). These methods leverage the underlying geometry of DNNs by selecting appropriate norms for different layers and updating parameters via norm-constrained linear minimization oracles (LMOs). However, even within a group of layers associated with the same norm, the local curvature can be heterogeneous across layers and vary dynamically over the course of training. For example, recent work shows that sharpness varies substantially across transformer layers and throughout training, yet standard geometry-aware optimizers impose fixed learning rates to layers within the same group, which may be inefficient for DNN training. In this paper, we introduce a noise-adaptive layerwise learning rate scheme on top of geometry-aware optimization algorithms and substantially accelerate DNN training compared to methods that use fixed learning rates within each group. Our method estimates gradient variance in the dual norm induced by the chosen LMO on the fly, and uses it to assign time-varying noise-adaptive layerwise learning rates within each group. We provide a theoretical analysis showing that our algorithm achieves a sharp convergence rate. Empirical results on transformer architectures such as LLaMA and GPT demonstrate that our approach achieves faster convergence than state-of-the-art optimizers.

[451] Context-Selective State Space Models: Feedback is All You Need

Riccardo Zattra, Giacomo Baggio, Umberto Casti, Augusto Ferrante, Francesco Ticozzi

Main category: cs.LG

TL;DR: COFFEE introduces a novel time-varying state space model with state feedback for context-dependent selectivity, achieving superior performance with fewer parameters compared to S6/Mamba.

Details

Motivation: Transformers have quadratic complexity and struggle with long-range dependencies. State space models like S6/Mamba offer efficiency but lack context-dependent selectivity that depends on sequence history.

Method: COFFEE incorporates state feedback to compute selectivity from internal state (representing sequence history) rather than just current input. Uses efficient parameterization to remove redundancies in S6, enabling parallel implementation.

Result: Achieves near-perfect accuracy on induction head task with 100x fewer parameters and training sequences than S6. On MNIST, reaches 97% accuracy with only 3585 parameters, significantly outperforming S6.

Conclusion: State feedback is a key mechanism for building scalable and efficient sequence models, enabling better long-range dependency capture with compact formulations.

Abstract: Transformers, powered by the attention mechanism, are the backbone of most foundation models, yet they suffer from quadratic complexity and difficulties in dealing with long-range dependencies in the input sequence. Recent work has shown that state space models (SSMs) provide an efficient alternative, with the S6 module at the core of the Mamba architecture achieving state-of-the-art results on long-sequence benchmarks. In this paper, we introduce the COFFEE (COntext From FEEdback) model, a novel time-varying SSM that incorporates state feedback to enable context-dependent selectivity, while still allowing for parallel implementation. Whereas the selectivity mechanism of S6 only depends on the current input, COFFEE computes it from the internal state, which serves as a compact representation of the sequence history. This shift allows the model to regulate its dynamics based on accumulated context, improving its ability to capture long-range dependencies. In addition to state feedback, we employ an efficient model parametrization that removes redundancies present in S6 and leads to a more compact and trainable formulation. On the induction head task, COFFEE achieves near-perfect accuracy with two orders of magnitude fewer parameters and training sequences compared to S6. On MNIST, COFFEE largely outperforms S6 within the same architecture, reaching 97% accuracy with only 3585 parameters. These results showcase the role of state feedback as a key mechanism for building scalable and efficient sequence models.

[452] CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations

Guangyi Chen, Yunlong Deng, Peiyuan Zhu, Yan Li, Yifan Sheng, Zijian Li, Kun Zhang

Main category: cs.LG

TL;DR: A new benchmark for Causal Representation Learning using high-fidelity simulated visual data that provides both realistic visual complexity and ground-truth causal generating processes across multiple domains.

Details

Motivation: Current CRL evaluations face a dilemma between realism and evaluative precision, relying on either simplistic synthetic datasets or downstream performance on real-world tasks without ground-truth causal information.

Method: Created a comprehensive dataset with ~200k images and 3M video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. Provides flexible access to underlying causal structures.

Result: The benchmark bridges the gap between rigorous evaluation and real-world applicability, offering scenarios ranging from static to dynamic settings, simple to complex structures, and single to multi-agent interactions.

Conclusion: The benchmark enables evaluation of representative CRL methods across diverse paradigms and provides empirical insights to help practitioners choose appropriate CRL frameworks for specific real-world problems.

Abstract: Causal Representation Learning (CRL) aims to uncover the data-generating process and identify the underlying causal variables and relations, whose evaluation remains inherently challenging due to the requirement of known ground-truth causal variables and causal structure. Existing evaluations often rely on either simplistic synthetic datasets or downstream performance on real-world tasks, generally suffering a dilemma between realism and evaluative precision. In this paper, we introduce a new benchmark for CRL using high-fidelity simulated visual data that retains both realistic visual complexity and, more importantly, access to ground-truth causal generating processes. The dataset comprises around 200 thousand images and 3 million video frames across 24 sub-scenes in four domains: static image generation, dynamic physical simulations, robotic manipulations, and traffic situation analysis. These scenarios range from static to dynamic settings, simple to complex structures, and single to multi-agent interactions, offering a comprehensive testbed that hopefully bridges the gap between rigorous evaluation and real-world applicability. In addition, we provide flexible access to the underlying causal structures, allowing users to modify or configure them to align with the required assumptions in CRL, such as available domain labels, temporal dependencies, or intervention histories. Leveraging this benchmark, we evaluated representative CRL methods across diverse paradigms and offered empirical insights to assist practitioners and newcomers in choosing or extending appropriate CRL frameworks to properly address specific types of real problems that can benefit from the CRL perspective. Welcome to visit our: Project page:https://causal-verse.github.io/, Dataset:https://huggingface.co/CausalVerse.

[453] FedHFT: Efficient Federated Finetuning with Heterogeneous Edge Clients

Fatih Ilhan, Selim Furkan Tekin, Tiansheng Huang, Gaowen Liu, Ramana Kompella, Greg Eisenhauer, Yingyan Celine Lin, Calton Pu, Ling Liu

Main category: cs.LG

TL;DR: FedHFT is a federated fine-tuning framework for LLMs that addresses data heterogeneity and resource constraints through masked adapters and bi-level optimization, keeping data local while enabling efficient collaborative training.

Details

Motivation: Address challenges in fine-tuning LLMs: (i) limited/heterogeneous data due to privacy concerns, and (ii) varying computation resources across edge devices in distributed settings.

Method: Uses mixture of masked adapters for resource heterogeneity and bi-level optimization with masked personalization and client clustering for non-iid data distribution.

Result: Significant performance and efficiency improvements across various NLU tasks under data and resource heterogeneity compared to existing federated learning methods.

Conclusion: FedHFT effectively enables collaborative fine-tuning of LLMs while maintaining data privacy and handling resource constraints through its adaptive framework.

Abstract: Fine-tuning pre-trained large language models (LLMs) has become a common practice for personalized natural language understanding (NLU) applications on downstream tasks and domain-specific datasets. However, there are two main challenges: (i) limited and/or heterogeneous data for fine-tuning due to proprietary data confidentiality or privacy requirements, and (ii) varying computation resources available across participating clients such as edge devices. This paper presents FedHFT - an efficient and personalized federated fine-tuning framework to address both challenges. First, we introduce a mixture of masked adapters to handle resource heterogeneity across participating clients, enabling high-performance collaborative fine-tuning of pre-trained language model(s) across multiple clients in a distributed setting, while keeping proprietary data local. Second, we introduce a bi-level optimization approach to handle non-iid data distribution based on masked personalization and client clustering. Extensive experiments demonstrate significant performance and efficiency improvements over various natural language understanding tasks under data and resource heterogeneity compared to representative heterogeneous federated learning methods.

[454] On the expressivity of sparse maxout networks

Moritz Grillo, Tobias Hofmann

Main category: cs.LG

TL;DR: The paper analyzes the expressivity of sparse maxout networks, establishing a duality with virtual polytopes and proving depth hierarchies where width cannot compensate for insufficient depth under sparsity constraints.

Details

Motivation: To understand the computational capabilities of sparse maxout networks, which model key aspects of convolutional and graph neural networks, particularly how depth and width affect expressivity under sparsity constraints.

Method: Establishes a duality between sparse maxout networks and virtual polytopes, derives tight bounds on polytope dimensions, and constructs depth hierarchies to analyze expressivity.

Result: Shows that sufficiently deep sparse maxout networks are universal, but if depth requirements are not met, width alone cannot overcome the limitations imposed by fixed indegree sparsity constraints.

Conclusion: Depth plays a crucial role in the expressivity of sparse maxout networks, and width cannot compensate for insufficient depth when networks are constrained by sparsity, highlighting fundamental limitations in network architecture design.

Abstract: We study the expressivity of sparse maxout networks, where each neuron takes a fixed number of inputs from the previous layer and employs a, possibly multi-argument, maxout activation. This setting captures key characteristics of convolutional or graph neural networks. We establish a duality between functions computable by such networks and a class of virtual polytopes, linking their geometry to questions of network expressivity. In particular, we derive a tight bound on the dimension of the associated polytopes, which serves as the central tool for our analysis. Building on this, we construct a sequence of depth hierarchies. While sufficiently deep sparse maxout networks are universal, we prove that if the required depth is not reached, width alone cannot compensate for the sparsity of a fixed indegree constraint.

[455] Exploratory Causal Inference in SAEnce

Tommaso Mencattini, Riccardo Cadei, Francesco Locatello

Main category: cs.LG

TL;DR: Neural Effect Search is a novel unsupervised method that discovers unknown causal effects from randomized controlled trial data using foundation models and sparse autoencoders, addressing multiple-testing and effect entanglement issues through progressive stratification.

Details

Motivation: Traditional RCTs rely on hand-crafted hypotheses and expensive analysis, preventing causal effect estimation at scale and potentially missing important effects. The goal is to discover unknown treatment effects directly from data without predefined hypotheses.

Method: Transform unstructured trial data into meaningful representations using pretrained foundation models, then interpret them via sparse autoencoder. Introduce Neural Effect Search - a recursive procedure using progressive stratification to address multiple-testing issues and effect entanglement.

Result: The algorithm demonstrated robustness in semi-synthetic experiments and successfully achieved the first unsupervised causal effect identification on a real-world scientific trial in experimental ecology.

Conclusion: Neural Effect Search enables unsupervised discovery of causal effects from RCT data, overcoming limitations of traditional hypothesis-driven approaches and allowing for scalable causal effect estimation.

Abstract: Randomized Controlled Trials are one of the pillars of science; nevertheless, they rely on hand-crafted hypotheses and expensive analysis. Such constraints prevent causal effect estimation at scale, potentially anchoring on popular yet incomplete hypotheses. We propose to discover the unknown effects of a treatment directly from data. For this, we turn unstructured data from a trial into meaningful representations via pretrained foundation models and interpret them via a sparse autoencoder. However, discovering significant causal effects at the neural level is not trivial due to multiple-testing issues and effects entanglement. To address these challenges, we introduce Neural Effect Search, a novel recursive procedure solving both issues by progressive stratification. After assessing the robustness of our algorithm on semi-synthetic experiments, we showcase, in the context of experimental ecology, the first successful unsupervised causal effect identification on a real-world scientific trial.

[456] Neural Network approximation power on homogeneous and heterogeneous reaction-diffusion equations

Haotian Feng

Main category: cs.LG

TL;DR: Theoretical analysis of neural networks’ approximation power for reaction-diffusion equations, showing that 2-layer networks can approximate 1D equations and 3-layer networks can approximate 2D equations.

Details

Motivation: While neural networks are increasingly used to solve differential equations, the theoretical foundation explaining why they can effectively approximate such solutions remains insufficiently explored.

Method: Building upon the universal approximation theorem, the paper provides theoretical analysis of neural networks for reaction-diffusion equations in homogeneous and heterogeneous media across one and two dimensions.

Result: Demonstrated that a two-layer neural network can approximate the one-dimensional reaction-diffusion equation, while a three-layer neural network can approximate its two-dimensional counterpart.

Conclusion: This work provides theoretical foundation for neural network-based differential equation solvers and highlights the expressive power of neural networks in approximating solutions to reaction-diffusion equations and related PDEs.

Abstract: Reaction-diffusion systems represent one of the most fundamental formulations used to describe a wide range of physical, chemical, and biological processes. With the increasing adoption of neural networks, recent research has focused on solving differential equations using machine learning techniques. However, the theoretical foundation explaining why neural networks can effectively approximate such solutions remains insufficiently explored. This paper provides a theoretical analysis of the approximation power of neural networks for one- and two-dimensional reaction-diffusion equations in both homogeneous and heterogeneous media. Building upon the universal approximation theorem, we demonstrate that a two-layer neural network can approximate the one-dimensional reaction-diffusion equation, while a three-layer neural network can approximate its two-dimensional counterpart. The theoretical framework presented here can be further extended to elliptic and parabolic equations. Overall, this work highlights the expressive power of neural networks in approximating solutions to reaction-diffusion equations and related PDEs, providing a theoretical foundation for neural network-based differential equation solvers.

[457] Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning

Awni Altabaa, Siyu Chen, John Lafferty, Zhuoran Yang

Main category: cs.LG

TL;DR: The paper investigates out-of-distribution generalization in Transformers using modular arithmetic tasks, proposing four architectural mechanisms to enhance compositional reasoning.

Details

Motivation: Addressing the core challenge of systematic compositional generalization beyond training distribution, which is a critical bottleneck for language models' reasoning abilities.

Method: Introduces four architectural mechanisms: input-adaptive recurrence, algorithmic supervision, anchored latent representations via discrete bottleneck, and explicit error-correction mechanism. Uses GSM8K-style modular arithmetic on computational graphs as testbed.

Result: The mechanisms collectively yield an architectural approach for native and scalable latent space reasoning in Transformers with robust algorithmic generalization capabilities.

Conclusion: Mechanistic interpretability analysis reveals how these mechanisms enable robust out-of-distribution generalization abilities in Transformer networks.

Abstract: Systematic, compositional generalization beyond the training distribution remains a core challenge in machine learning – and a critical bottleneck for the emergent reasoning abilities of modern language models. This work investigates out-of-distribution (OOD) generalization in Transformer networks using a GSM8K-style modular arithmetic on computational graphs task as a testbed. We introduce and explore a set of four architectural mechanisms aimed at enhancing OOD generalization: (i) input-adaptive recurrence; (ii) algorithmic supervision; (iii) anchored latent representations via a discrete bottleneck; and (iv) an explicit error-correction mechanism. Collectively, these mechanisms yield an architectural approach for native and scalable latent space reasoning in Transformer networks with robust algorithmic generalization capabilities. We complement these empirical results with a detailed mechanistic interpretability analysis that reveals how these mechanisms give rise to robust OOD generalization abilities.

[458] TENDE: Transfer Entropy Neural Diffusion Estimation

Simon Pedro Galeano Munoz, Mustapha Bounoua, Giulio Franzese, Pietro Michiardi, Maurizio Filippone

Main category: cs.LG

TL;DR: TENDE is a novel method that uses score-based diffusion models to estimate transfer entropy through conditional mutual information, overcoming limitations of existing approaches like curse of dimensionality and restrictive assumptions.

Details

Motivation: Existing transfer entropy estimation methods suffer from curse of dimensionality, require restrictive distributional assumptions, or need exponentially large datasets for reliable convergence.

Method: Leverages score-based diffusion models to estimate transfer entropy through conditional mutual information by learning score functions of relevant conditional distributions.

Result: Demonstrates superior accuracy and robustness compared to existing neural estimators and other state-of-the-art approaches across synthetic benchmarks and real data.

Conclusion: TENDE provides flexible, scalable transfer entropy estimation while making minimal assumptions about the underlying data-generating process.

Abstract: Transfer entropy measures directed information flow in time series, and it has become a fundamental quantity in applications spanning neuroscience, finance, and complex systems analysis. However, existing estimation methods suffer from the curse of dimensionality, require restrictive distributional assumptions, or need exponentially large datasets for reliable convergence. We address these limitations in the literature by proposing TENDE (Transfer Entropy Neural Diffusion Estimation), a novel approach that leverages score-based diffusion models to estimate transfer entropy through conditional mutual information. By learning score functions of the relevant conditional distributions, TENDE provides flexible, scalable estimation while making minimal assumptions about the underlying data-generating process. We demonstrate superior accuracy and robustness compared to existing neural estimators and other state-of-the-art approaches across synthetic benchmarks and real data.

[459] Near-Optimal Regret-Queue Length Tradeoff in Online Learning for Two-Sided Markets

Zixian Yang, Sushil Mahavir Varma, Lei Ying

Main category: cs.LG

TL;DR: The paper proposes an online-learning-based pricing policy for two-sided markets with unknown demand/supply curves, achieving near-optimal tradeoff between regret, average queue length, and maximum queue length.

Details

Motivation: To design pricing and matching algorithms that maximize platform profit while maintaining reasonable queue lengths, especially when demand and supply curves are unknown in practice.

Method: A novel online-learning-based pricing policy with dynamic optimization of regret-queue tradeoff and probabilistic sampling to balance learning and queue management.

Result: Achieves Õ(T^{1-γ}) regret, Õ(T^{γ/2}) average queue length, and Õ(T^{γ}) maximum queue length for γ ∈ (0, 1/6], significantly improving over existing results and proving optimality of the tradeoff.

Conclusion: The proposed policy effectively balances learning and queue management in two-sided markets with unknown demand/supply, achieving near-optimal performance across multiple metrics.

Abstract: We study a two-sided market, wherein, price-sensitive heterogeneous customers and servers arrive and join their respective queues. A compatible customer-server pair can then be matched by the platform, at which point, they leave the system. Our objective is to design pricing and matching algorithms that maximize the platform’s profit, while maintaining reasonable queue lengths. As the demand and supply curves governing the price-dependent arrival rates may not be known in practice, we design a novel online-learning-based pricing policy and establish its near-optimality. In particular, we prove a tradeoff among three performance metrics: $\tilde{O}(T^{1-\gamma})$ regret, $\tilde{O}(T^{\gamma/2})$ average queue length, and $\tilde{O}(T^{\gamma})$ maximum queue length for $\gamma \in (0, 1/6]$, significantly improving over existing results [1]. Moreover, barring the permissible range of $\gamma$, we show that this trade-off between regret and average queue length is optimal up to logarithmic factors under a class of policies, matching the optimal one as in [2] which assumes the demand and supply curves to be known. Our proposed policy has two noteworthy features: a dynamic component that optimizes the tradeoff between low regret and small queue lengths; and a probabilistic component that resolves the tension between obtaining useful samples for fast learning and maintaining small queue lengths.

[460] Briding Diffusion Posterior Sampling and Monte Carlo methods: a survey

Yazid Janati, Alain Durmus, Jimmy Olsson, Eric Moulines

Main category: cs.LG

TL;DR: This review paper provides a comprehensive overview of methods that use pre-trained diffusion models with Monte Carlo methods to solve Bayesian inverse problems without additional training.

Details

Motivation: Diffusion models have shown significant potential for solving Bayesian inverse problems by serving as priors, enabling synthesis of accurate samples from complex distributions.

Method: The methods primarily employ a twisting mechanism for intermediate distributions in the diffusion process, guiding simulations toward the posterior distribution, and use various Monte Carlo methods to sample from these twisted distributions.

Result: The paper demonstrates how pre-trained diffusion models can be effectively combined with Monte Carlo methods to address Bayesian inverse problems.

Conclusion: This approach provides a powerful framework for solving Bayesian inverse problems using pre-trained diffusion models as priors, eliminating the need for additional training.

Abstract: Diffusion models enable the synthesis of highly accurate samples from complex distributions and have become foundational in generative modeling. Recently, they have demonstrated significant potential for solving Bayesian inverse problems by serving as priors. This review offers a comprehensive overview of current methods that leverage \emph{pre-trained} diffusion models alongside Monte Carlo methods to address Bayesian inverse problems without requiring additional training. We show that these methods primarily employ a \emph{twisting} mechanism for the intermediate distributions within the diffusion process, guiding the simulations toward the posterior distribution. We describe how various Monte Carlo methods are then used to aid in sampling from these twisted distributions.

[461] Neural Network-enabled Domain-consistent Robust Optimisation for Global CO$_2$ Reduction Potential of Gas Power Plants

Waqar Muhammad Ashraf, Talha Ansar, Abdulelah S. Alshehri, Peipei Chen, Ramit Debnath, Vivek Dua

Main category: cs.LG

TL;DR: Neural network-driven robust optimization framework integrates data-driven domain constraints to prevent domain-inconsistent solutions, achieving 0.76% energy efficiency improvement in gas power plants with significant global CO2 reduction potential.

Details

Motivation: Address the overlooked issue of domain-inconsistent solutions arising from the interaction of parametrized neural network models with optimization solvers in energy systems.

Method: A robust optimization framework that integrates data-driven domain as a constraint into nonlinear programming technique, applied to a 1180 MW combined cycle gas power plant.

Result: Achieved 0.76 percentage point mean improvement in energy efficiency. Scaling globally: estimated annual 26 Mt CO2 reduction (10.6 Mt in Asia, 9.0 Mt in Americas, 4.5 Mt in Europe).

Conclusion: Demonstrates the synergetic role of machine learning in delivering near-term, scalable decarbonisation pathways for global climate action through domain-consistent robust optimization.

Abstract: We introduce a neural network-driven robust optimisation framework that integrates data-driven domain as a constraint into the nonlinear programming technique, addressing the overlooked issue of domain-inconsistent solutions arising from the interaction of parametrised neural network models with optimisation solvers. Applied to a 1180 MW capacity combined cycle gas power plant, our framework delivers domain-consistent robust optimal solutions that achieve a verified 0.76 percentage point mean improvement in energy efficiency. For the first time, scaling this efficiency gain to the global fleet of gas power plants, we estimate an annual 26 Mt reduction potential in CO$_2$ (with 10.6 Mt in Asia, 9.0 Mt in the Americas, and 4.5 Mt in Europe). These results underscore the synergetic role of machine learning in delivering near-term, scalable decarbonisation pathways for global climate action.

[462] Demystifying the Mechanisms Behind Emergent Exploration in Goal-conditioned RL

Mahsa Bastankhah, Grace Liu, Dilip Arumugam, Thomas L. Griffiths, Benjamin Eysenbach

Main category: cs.LG

TL;DR: SGCRL is a self-supervised RL algorithm that achieves emergent exploration through learned representations that automatically shape implicit rewards, promoting exploration before goal achievement and exploitation after.

Details

Motivation: To understand the mechanisms behind emergent exploration in unsupervised reinforcement learning, specifically how self-supervised algorithms can solve long-horizon tasks without external rewards or curricula.

Method: Combined theoretical analysis of SGCRL’s objective function with controlled experiments to study how learned representations drive exploration dynamics.

Result: SGCRL maximizes implicit rewards shaped by learned representations that automatically modify the reward landscape. Exploration dynamics arise from learning low-rank state space representations rather than neural network function approximation.

Conclusion: The improved understanding enables adapting SGCRL for safety-aware exploration, demonstrating practical applications of the discovered exploration mechanisms.

Abstract: In this work, we take a first step toward elucidating the mechanisms behind emergent exploration in unsupervised reinforcement learning. We study Single-Goal Contrastive Reinforcement Learning (SGCRL), a self-supervised algorithm capable of solving challenging long-horizon goal-reaching tasks without external rewards or curricula. We combine theoretical analysis of the algorithm’s objective function with controlled experiments to understand what drives its exploration. We show that SGCRL maximizes implicit rewards shaped by its learned representations. These representations automatically modify the reward landscape to promote exploration before reaching the goal and exploitation thereafter. Our experiments also demonstrate that these exploration dynamics arise from learning low-rank representations of the state space rather than from neural network function approximation. Our improved understanding enables us to adapt SGCRL to perform safety-aware exploration.

[463] Learning Wireless Interference Patterns: Decoupled GNN for Throughput Prediction in Heterogeneous Multi-Hop p-CSMA Networks

Faezeh Dehghan Tarzjani, Bhaskar Krishnamachari

Main category: cs.LG

TL;DR: The paper proposes D-GCN, a novel graph neural network architecture that decouples node transmission probabilities from neighbor interference effects to accurately predict saturation throughput in heterogeneous multi-hop wireless networks, overcoming limitations of traditional models and standard GNNs.

Details

Motivation: Existing methods for predicting saturation throughput in heterogeneous multi-hop wireless networks have limitations: simplified models underestimate throughput by 48-62%, while exact Markov-chain analyses are accurate but computationally infeasible for large networks. Standard GNNs also struggle due to symmetric normalization conflating direct and cascading interference effects.

Method: The authors propose Decoupled Graph Convolutional Network (D-GCN), which explicitly separates processing of a node’s transmission probability from neighbor interference effects. It replaces mean aggregation with learnable attention mechanisms to capture interpretable per-neighbor contribution weights and complex multihop interference patterns.

Result: D-GCN achieves 3.3% normalized mean absolute error (NMAE), significantly outperforming standard GCN (63.94% NMAE) and other baselines. It remains computationally tractable even when exact analytical methods become infeasible, and enables gradient-based network optimization that achieves within 1% of theoretical optima.

Conclusion: D-GCN provides an accurate, scalable solution for throughput prediction in heterogeneous wireless networks by explicitly modeling interference decoupling, outperforming existing approaches while maintaining computational tractability and enabling effective network optimization.

Abstract: The p-persistent CSMA protocol is central to random-access MAC analysis, but predicting saturation throughput in heterogeneous multi-hop wireless networks remains a hard problem. Simplified models that assume a single, shared interference domain can underestimate throughput by 48–62% in sparse topologies. Exact Markov-chain analyses are accurate but scale exponentially in computation time, making them impractical for large networks. These computational barriers motivate structural machine learning approaches like GNNs for scalable throughput prediction in general network topologies. Yet off-the-shelf GNNs struggle here: a standard GCN yields 63.94% normalized mean absolute error (NMAE) on heterogeneous networks because symmetric normalization conflates a node’s direct interference with higher-order, cascading effects that pertain to how interference propagates over the network graph. Building on these insights, we propose the Decoupled Graph Convolutional Network (D-GCN), a novel architecture that explicitly separates processing of a node’s own transmission probability from neighbor interference effects. D-GCN replaces mean aggregation with learnable attention, yielding interpretable, per-neighbor contribution weights while capturing complex multihop interference patterns. D-GCN attains 3.3% NMAE, outperforms strong baselines, remains tractable even when exact analytical methods become computationally infeasible, and enables gradient-based network optimization that achieves within 1% of theoretical optima.

[464] Inferred global dense residue transition graphs from primary structure sequences enable protein interaction prediction via directed graph convolutional neural networks

Islam Akef Ebeid, Haoteng Tang, Pengfei Gu

Main category: cs.LG

TL;DR: A novel framework for protein-protein interaction prediction using directed graph neural networks on n-gram protein graphs, achieving robust performance with limited training data.

Details

Motivation: Existing PPI prediction methods use computationally intensive approaches like direct sequence embeddings from PLMs or 3D structure GNNs. This study explores less computationally intensive alternatives through link prediction.

Method: Two-stage framework: 1) ProtGram models protein primary structure as hierarchy of n-gram graphs with residue transition probabilities as edge weights; 2) DirectGCN processes these directed graphs using path-specific transformations (incoming, outgoing, undirected) combined via learnable gating mechanism, then pools residue embeddings to protein-level embeddings.

Result: DirectGCN matches established methods on standard node classification benchmarks and excels at complex directed graphs. The full ProtGram-DirectGCN framework delivers robust PPI prediction performance even with limited training data.

Conclusion: The proposed framework provides an effective and computationally efficient alternative for protein-protein interaction prediction through directed graph representation learning.

Abstract: Introduction Accurate prediction of protein-protein interactions (PPIs) is crucial for understanding cellular functions and advancing drug development. Existing in-silico methods use direct sequence embeddings from Protein Language Models (PLMs). Others use Graph Neural Networks (GNNs) for 3D protein structures. This study explores less computationally intensive alternatives. We introduce a novel framework for downstream PPI prediction through link prediction. Methods We introduce a two-stage graph representation learning framework, ProtGram-DirectGCN. First, we developed ProtGram. This approach models a protein’s primary structure as a hierarchy of globally inferred n-gram graphs. In these graphs, residue transition probabilities define edge weights. Each edge connects a pair of residues in a directed graph. The probabilities are aggregated from a large corpus of sequences. Second, we propose DirectGCN, a custom directed graph convolutional neural network. This model features a unique convolutional layer. It processes information through separate path-specific transformations: incoming, outgoing, and undirected. A shared transformation is also applied. These paths are combined via a learnable gating mechanism. We apply DirectGCN to ProtGram graphs to learn residue-level embeddings. These embeddings are pooled via attention to generate protein-level embeddings for prediction. Results We first established the efficacy of DirectGCN on standard node classification benchmarks. Its performance matches established methods on general datasets. The model excels at complex, directed graphs with dense, heterophilic structures. When applied to PPI prediction, the full ProtGram-DirectGCN framework delivers robust predictive power. This strong performance holds even with limited training data.

[465] On Evaluating Loss Functions for Stock Ranking: An Empirical Analysis With Transformer Model

Jan Kwiatkowski, Jarosław A. Chudziak

Main category: cs.LG

TL;DR: This paper systematically evaluates different ranking loss functions (pointwise, pairwise, listwise) for Transformer models in stock return forecasting to improve portfolio selection.

Details

Motivation: Standard prediction loss functions don't directly teach models to learn correct stock return rankings, which is crucial for quantitative trading. There's limited understanding of how advanced ranking losses perform in financial contexts with Transformers.

Method: Systematic evaluation of diverse loss functions including pointwise, pairwise, and listwise approaches for daily stock return forecasting on S&P 500 data, focusing on rank-based portfolio selection.

Result: The research provides a comprehensive benchmark showing how different loss functions impact a model’s ability to learn cross-sectional and temporal patterns for portfolio selection.

Conclusion: The study offers practical guidance for optimizing ranking-based trading strategies by revealing the effects of various loss functions on stock ranking performance.

Abstract: Quantitative trading strategies rely on accurately ranking stocks to identify profitable investments. Effective portfolio management requires models that can reliably order future stock returns. Transformer models are promising for understanding financial time series, but how different training loss functions affect their ability to rank stocks well is not yet fully understood. Financial markets are challenging due to their changing nature and complex relationships between stocks. Standard loss functions, which aim for simple prediction accuracy, often aren’t enough. They don’t directly teach models to learn the correct order of stock returns. While many advanced ranking losses exist from fields such as information retrieval, there hasn’t been a thorough comparison to see how well they work for ranking financial returns, especially when used with modern Transformer models for stock selection. This paper addresses this gap by systematically evaluating a diverse set of advanced loss functions including pointwise, pairwise, listwise for daily stock return forecasting to facilitate rank-based portfolio selection on S&P 500 data. We focus on assessing how each loss function influences the model’s ability to discern profitable relative orderings among assets. Our research contributes a comprehensive benchmark revealing how different loss functions impact a model’s ability to learn cross-sectional and temporal patterns crucial for portfolio selection, thereby offering practical guidance for optimizing ranking-based trading strategies.

[466] Data Understanding Survey: Pursuing Improved Dataset Characterization Via Tensor-based Methods

Matthew D. Merris, Tim Andersen

Main category: cs.LG

TL;DR: This paper surveys limitations of conventional dataset characterization methods and proposes tensor-based approaches as a more robust alternative for enhanced interpretability and insights.

Details

Motivation: Existing dataset characterization methods (statistical, structural, model-based) fail to provide deep understanding and insights needed for innovation and explainability in ML and data analytics.

Method: The paper surveys current data analytic techniques, examines their limitations, and discusses various tensor-based methods as alternatives to traditional characterization approaches.

Result: Through examples, tensor methods are shown to unveil nuanced data characteristics and offer enhanced interpretability and actionable intelligence compared to conventional methods.

Conclusion: The paper advocates for adopting tensor-based characterization as it promises significant advances in understanding complex datasets and enabling intelligent, explainable data-driven discoveries.

Abstract: In the evolving domains of Machine Learning and Data Analytics, existing dataset characterization methods such as statistical, structural, and model-based analyses often fail to deliver the deep understanding and insights essential for innovation and explainability. This work surveys the current state-of-the-art conventional data analytic techniques and examines their limitations, and discusses a variety of tensor-based methods and how these may provide a more robust alternative to traditional statistical, structural, and model-based dataset characterization techniques. Through examples, we illustrate how tensor methods unveil nuanced data characteristics, offering enhanced interpretability and actionable intelligence. We advocate for the adoption of tensor-based characterization, promising a leap forward in understanding complex datasets and paving the way for intelligent, explainable data-driven discoveries.

[467] Towards Reversible Model Merging For Low-rank Weights

Mohammadsajad Alipour, Mohammad Mohammadi Amiri

Main category: cs.LG

TL;DR: Proposes Reversible Model Merging (RMM) - a new approach that creates a compact basis to reconstruct original task-specific models instead of collapsing all adapters into one merged model, addressing performance degradation in low-rank compressed models.

Details

Motivation: Conventional model merging methods fail when applied to low-rank compressed models (LoRA or SVD), causing severe performance degradation. The paper recognizes that no single merged model can consistently outperform specialized models for their specific tasks.

Method: RMM reframes merging as generating a reconstruction-capable model space rather than producing a single merged model. It provides a closed-form solution for selecting optimal basis weights and task-specific coefficients for linear combination, allowing recovery of original models when needed.

Result: Extensive experiments show RMM consistently outperforms existing merging approaches, preserving low-rank compressed model performance by a significant margin across diverse datasets and model scales.

Conclusion: The reversible approach of maintaining a basis for model reconstruction is more effective than traditional merging, especially for low-rank compressed models, enabling flexible recovery of task-specific performance while maintaining efficiency.

Abstract: Model merging aims to combine multiple fine-tuned models into a single set of weights that performs well across all source tasks. While prior work has shown that merging can approximate the performance of individual fine-tuned models for each task, it largely overlooks scenarios where models are compressed into low-rank representations, either through low-rank adaptation (LoRA) or post-training singular value decomposition (SVD). We first demonstrate that applying conventional merging methods to low-rank weights leads to severe performance degradation in the merged model. Motivated by this phenomenon, we propose a fundamentally different approach: instead of collapsing all adapters into one set of weights, we construct a compact basis (e.g., an equivalent of holding two or more models) from which original task-specific models can be recovered via linear combination. This reframes merging as generating a reconstruction-capable model space rather than producing a single merged model. Crucially, this allows us to ``revert’’ to each individual model when needed, recognizing that no merged model can consistently outperform one specialized for its task. Building on this insight, we introduce our method, Reversible Model Merging (RMM), an efficient, data-free, and flexible method that provides a closed-form solution for selecting the optimal basis of model weights and task-specific coefficients for linear combination. Extensive experiments across diverse datasets and model scales demonstrate that RMM consistently outperforms existing merging approaches, preserving the performance of low-rank compressed models by a significant margin.

[468] Optimal Control Theoretic Neural Optimizer: From Backpropagation to Dynamic Programming

Guan-Horng Liu, Tianrong Chen, Evangelos A. Theodorou

Main category: cs.LG

TL;DR: The paper presents OCNOpt, a new neural network optimizer based on optimal control theory that connects backpropagation with dynamic programming, enabling higher-order training methods with improved robustness and efficiency.

Details

Motivation: The motivation is to leverage the algorithmic resemblance between backpropagation in DNNs and optimality conditions in dynamical systems, particularly the connection to dynamic programming, to develop more principled optimization methods.

Method: The method involves interpreting DNNs as dynamical systems and solving approximate dynamic programming up to first-order expansion of the Bellman equation, resulting in the OCNOpt optimizer that explores higher-order expansions.

Result: Extensive experiments show that OCNOpt improves upon existing methods in robustness and efficiency while maintaining manageable computational complexity, and enables new algorithmic opportunities like layer-wise feedback policies and higher-order training of continuous-time models.

Conclusion: OCNOpt paves new avenues for principled algorithmic design grounded in dynamical systems and optimal control theory, demonstrating the value of connecting neural network optimization with optimal control frameworks.

Abstract: Optimization of deep neural networks (DNNs) has been a driving force in the advancement of modern machine learning and artificial intelligence. With DNNs characterized by a prolonged sequence of nonlinear propagation, determining their optimal parameters given an objective naturally fits within the framework of Optimal Control Programming. Such an interpretation of DNNs as dynamical systems has proven crucial in offering a theoretical foundation for principled analysis from numerical equations to physics. In parallel to these theoretical pursuits, this paper focuses on an algorithmic perspective. Our motivated observation is the striking algorithmic resemblance between the Backpropagation algorithm for computing gradients in DNNs and the optimality conditions for dynamical systems, expressed through another backward process known as dynamic programming. Consolidating this connection, where Backpropagation admits a variational structure, solving an approximate dynamic programming up to the first-order expansion leads to a new class of optimization methods exploring higher-order expansions of the Bellman equation. The resulting optimizer, termed Optimal Control Theoretic Neural Optimizer (OCNOpt), enables rich algorithmic opportunities, including layer-wise feedback policies, game-theoretic applications, and higher-order training of continuous-time models such as Neural ODEs. Extensive experiments demonstrate that OCNOpt improves upon existing methods in robustness and efficiency while maintaining manageable computational complexity, paving new avenues for principled algorithmic design grounded in dynamical systems and optimal control theory.

[469] MAFA: A Multi-Agent Framework for Enterprise-Scale Annotation with Configurable Task Adaptation

Mahmood Hegazy, Aaron Rodrigues, Azzam Naeem

Main category: cs.LG

TL;DR: MAFA is a production-deployed multi-agent framework that eliminates annotation backlogs in financial services through configurable agent collaboration, achieving 86% human agreement and saving 5,000+ hours annually.

Details

Motivation: Address enterprise-scale annotation backlogs in financial services where millions of customer utterances need accurate categorization, overcoming limitations of traditional annotation methods.

Method: Combines specialized agents with structured reasoning and judge-based consensus mechanism, supporting dynamic task adaptation through configuration rather than code changes.

Result: Eliminated 1 million utterance backlog at JP Morgan Chase with 86% human agreement, 85% high confidence classifications, and significant improvements: 13.8% higher Top-1 accuracy, 15.1% Top-5 accuracy, 16.9% better F1 scores.

Conclusion: Successfully bridges theoretical multi-agent systems with practical enterprise deployment, providing a blueprint for organizations facing annotation challenges while enabling human annotators to focus on ambiguous cases.

Abstract: We present MAFA (Multi-Agent Framework for Annotation), a production-deployed system that transforms enterprise-scale annotation workflows through configurable multi-agent collaboration. Addressing the critical challenge of annotation backlogs in financial services, where millions of customer utterances require accurate categorization, MAFA combines specialized agents with structured reasoning and a judge-based consensus mechanism. Our framework uniquely supports dynamic task adaptation, allowing organizations to define custom annotation types (FAQs, intents, entities, or domain-specific categories) through configuration rather than code changes. Deployed at JP Morgan Chase, MAFA has eliminated a 1 million utterance backlog while achieving, on average, 86% agreement with human annotators, annually saving over 5,000 hours of manual annotation work. The system processes utterances with annotation confidence classifications, which are typically 85% high, 10% medium, and 5% low across all datasets we tested. This enables human annotators to focus exclusively on ambiguous and low-coverage cases. We demonstrate MAFA’s effectiveness across multiple datasets and languages, showing consistent improvements over traditional and single-agent annotation baselines: 13.8% higher Top-1 accuracy, 15.1% improvement in Top-5 accuracy, and 16.9% better F1 in our internal intent classification dataset and similar gains on public benchmarks. This work bridges the gap between theoretical multi-agent systems and practical enterprise deployment, providing a blueprint for organizations facing similar annotation challenges.

[470] Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation

Ruchi Sandilya, Sumaira Perez, Charles Lynch, Lindsay Victoria, Benjamin Zebley, Derrick Matthew Buchanan, Mahendra T. Bhati, Nolan Williams, Timothy J. Spellman, Faith M. Gunning, Conor Liston, Logan Grosenick

Main category: cs.LG

TL;DR: ConDA applies contrastive learning to diffusion model embeddings to organize latent spaces for interpretable control, enabling nonlinear trajectory traversal that improves interpolation, extrapolation, and controllable generation.

Details

Motivation: Diffusion models have powerful generation capabilities but their latent spaces lack explicit organization for interpretable control over system dynamics.

Method: ConDA framework uses contrastive learning within diffusion embeddings to align latent geometry with system dynamics, organizing latents so traversal directions reflect underlying dynamical factors.

Result: Across fluid dynamics, neural calcium imaging, neurostimulation, and facial expression benchmarks, ConDA produces interpretable latent representations with improved controllability over linear traversals and conditioning-based baselines.

Conclusion: Diffusion latents encode dynamics-relevant structure, but exploiting this structure requires proper latent organization and traversal along the latent manifold.

Abstract: Diffusion models excel at generation, but their latent spaces are not explicitly organized for interpretable control. We introduce ConDA (Contrastive Diffusion Alignment), a framework that applies contrastive learning within diffusion embeddings to align latent geometry with system dynamics. Motivated by recent advances showing that contrastive objectives can recover more disentangled and structured representations, ConDA organizes diffusion latents such that traversal directions reflect underlying dynamical factors. Within this contrastively structured space, ConDA enables nonlinear trajectory traversal that supports faithful interpolation, extrapolation, and controllable generation. Across benchmarks in fluid dynamics, neural calcium imaging, therapeutic neurostimulation, and facial expression, ConDA produces interpretable latent representations with improved controllability compared to linear traversals and conditioning-based baselines. These results suggest that diffusion latents encode dynamics-relevant structure, but exploiting this structure requires latent organization and traversal along the latent manifold.

[471] Incentive-Based Federated Learning

Chanuka A. S. Hewa Kaluannakkage, Rajkumar Buyya

Main category: cs.LG

TL;DR: This chapter analyzes incentive mechanisms in federated learning, addressing the participation dilemma where entities may be unwilling to contribute or free-ride on others’ efforts.

Details

Motivation: Federated learning enables collaborative model training without compromising data privacy, but faces practical limitations due to the participation dilemma where entities may be unwilling to contribute without receiving benefits.

Method: The work examines foundational concepts from economics and game theory applied to federated learning, alongside technology-driven solutions like blockchain and deep reinforcement learning. It presents a comprehensive taxonomy covering both centralized and decentralized architectures.

Result: The analysis demonstrates that well-designed incentive mechanisms are essential components for the practical success of federated learning, covering emerging industrial applications in healthcare, smart infrastructure, vehicular networks, and blockchain-based systems.

Conclusion: While promising solutions have emerged, significant challenges remain in building truly sustainable, fair, and robust federated learning ecosystems, highlighting that incentive mechanisms are not optional but essential for practical success.

Abstract: Federated learning promises to revolutionize machine learning by enabling collaborative model training without compromising data privacy. However, practical adaptability can be limited by critical factors, such as the participation dilemma. Participating entities are often unwilling to contribute to a learning system unless they receive some benefits, or they may pretend to participate and free-ride on others. This chapter identifies the fundamental challenges in designing incentive mechanisms for federated learning systems. It examines how foundational concepts from economics and game theory can be applied to federated learning, alongside technology-driven solutions such as blockchain and deep reinforcement learning. This work presents a comprehensive taxonomy that thoroughly covers both centralized and decentralized architectures based on the aforementioned theoretical concepts. Furthermore, the concepts described are presented from an application perspective, covering emerging industrial applications, including healthcare, smart infrastructure, vehicular networks, and blockchain-based decentralized systems. Through this exploration, this chapter demonstrates that well-designed incentive mechanisms are not merely optional features but essential components for the practical success of federated learning. This analysis reveals both the promising solutions that have emerged and the significant challenges that remain in building truly sustainable, fair, and robust federated learning ecosystems.

[472] Spectral Analysis of Molecular Kernels: When Richer Features Do Not Guarantee Better Generalization

Asma Jamali, Tin Sum Cheng, Rodrigo A. Vargas-Hernández

Main category: cs.LG

TL;DR: Comprehensive spectral analysis of molecular kernels reveals that richer spectral features don’t consistently improve accuracy, challenging the common belief that richer spectra yield better generalization.

Details

Motivation: To understand the relationship between kernel spectral properties and generalization performance in molecular property prediction, particularly since systematic spectral analyses of molecular kernels are scarce despite extensive kernel studies in ML.

Method: Conducted spectral analysis of kernel ridge regression on QM9 dataset using molecular fingerprints, pretrained transformer-based, global and local 3D representations across seven molecular properties. Used four spectral metrics and implemented truncated kernels to probe spectrum-performance relationships.

Result: Surprisingly, richer spectral features don’t consistently improve accuracy. For transformer-based and local 3D representations, spectral richness can negatively correlate with performance. Truncated kernels show that retaining only top 2% of eigenvalues recovers nearly all performance in many cases.

Conclusion: The findings challenge the common heuristic that ‘richer spectra yield better generalization’ and highlight nuanced relationships between representation, kernel features, and predictive performance, with implications for evaluating kernel and self-supervised learning methods in data-limited scientific tasks.

Abstract: Understanding the spectral properties of kernels offers a principled perspective on generalization and representation quality. While deep models achieve state-of-the-art accuracy in molecular property prediction, kernel methods remain widely used for their robustness in low-data regimes and transparent theoretical grounding. Despite extensive studies of kernel spectra in machine learning, systematic spectral analyses of molecular kernels are scarce. In this work, we provide the first comprehensive spectral analysis of kernel ridge regression on the QM9 dataset, molecular fingerprint, pretrained transformer-based, global and local 3D representations across seven molecular properties. Surprisingly, richer spectral features, measured by four different spectral metrics, do not consistently improve accuracy. Pearson correlation tests further reveal that for transformer-based and local 3D representations, spectral richness can even have a negative correlation with performance. We also implement truncated kernels to probe the relationship between spectrum and predictive performance: in many kernels, retaining only the top 2% of eigenvalues recovers nearly all performance, indicating that the leading eigenvalues capture the most informative features. Our results challenge the common heuristic that “richer spectra yield better generalization” and highlight nuanced relationships between representation, kernel features, and predictive performance. Beyond molecular property prediction, these findings inform how kernel and self-supervised learning methods are evaluated in data-limited scientific and real-world tasks.

[473] When Flatness Does (Not) Guarantee Adversarial Robustness

Nils Philipp Walter, Linara Adilova, Jilles Vreeken, Michael Kamp

Main category: cs.LG

TL;DR: Flat minima in neural networks provide local but not global adversarial robustness. While flatness constrains loss variation locally, maintaining robustness beyond local neighborhoods requires sharp curvature away from data manifolds.

Details

Motivation: To rigorously formalize the relationship between flat minima and adversarial robustness, challenging the common intuition that flatness directly implies robustness.

Method: Derived closed-form expression for relative flatness in penultimate layer, used this to constrain loss variation in input space, and formally analyzed adversarial robustness of entire network.

Result: Flatness implies local robustness but not global robustness; adversarial examples often lie in large, flat regions where models are confidently wrong. Validated across architectures and datasets.

Conclusion: The connection between flatness and robustness is nuanced - flat minima provide local robustness but global robustness requires sharp curvature away from data manifolds, challenging simplified views of flatness.

Abstract: Despite their empirical success, neural networks remain vulnerable to small, adversarial perturbations. A longstanding hypothesis suggests that flat minima, regions of low curvature in the loss landscape, offer increased robustness. While intuitive, this connection has remained largely informal and incomplete. By rigorously formalizing the relationship, we show this intuition is only partially correct: flatness implies local but not global adversarial robustness. To arrive at this result, we first derive a closed-form expression for relative flatness in the penultimate layer, and then show we can use this to constrain the variation of the loss in input space. This allows us to formally analyze the adversarial robustness of the entire network. We then show that to maintain robustness beyond a local neighborhood, the loss needs to curve sharply away from the data manifold. We validate our theoretical predictions empirically across architectures and datasets, uncovering the geometric structure that governs adversarial vulnerability, and linking flatness to model confidence: adversarial examples often lie in large, flat regions where the model is confidently wrong. Our results challenge simplified views of flatness and provide a nuanced understanding of its role in robustness.

[474] Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg

Main category: cs.LG

TL;DR: GenCluster is a test-time compute framework that achieves IOI gold medal performance using open-weight models through large-scale generation, behavioral clustering, ranking, and round-robin submission strategies.

Details

Motivation: Competitive programming like IOI is a key benchmark for evaluating LLMs' reasoning abilities, but achieving gold medal performance with open-weight models remains challenging while proprietary models claim such results with undisclosed methods.

Method: GenCluster combines large-scale generation, behavioral clustering, ranking, and round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets.

Result: The approach scales consistently with available compute and achieves gold medal performance at IOI 2025 using gpt-oss-120b, the first time with an open-weight model.

Conclusion: GenCluster sets a new benchmark for transparent and reproducible evaluation of reasoning in LLMs, narrowing the gap between open and closed systems.

Abstract: Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present \gencluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we will show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.

[475] Policy Regularized Distributionally Robust Markov Decision Processes with Linear Function Approximation

Jingwen Gu, Yiting He, Zhishuai Liu, Pan Xu

Main category: cs.LG

TL;DR: DR-RPO is a model-free online policy optimization algorithm for robust RL that learns robust policies with sublinear regret by incorporating reference-policy regularization and linear function approximation.

Details

Motivation: Address the challenge of decision-making under distribution shift in RL, where training and deployment environments differ, and bridge the gap in policy optimization methods for robust RL.

Method: Proposes DR-RPO algorithm with reference-policy regularization to create doubly constrained RMDPs, uses d-rectangular linear MDP formulation with linear function approximation and upper confidence bonus for optimistic exploration.

Result: Achieves polynomial suboptimality bounds and sample efficiency in robust RL, matching value-based approaches, with empirical validation across diverse domains.

Conclusion: Policy optimization can be effective for robust RL with proper regularization and function approximation, providing theoretical guarantees and empirical robustness.

Abstract: Decision-making under distribution shift is a central challenge in reinforcement learning (RL), where training and deployment environments differ. We study this problem through the lens of robust Markov decision processes (RMDPs), which optimize performance against adversarial transition dynamics. Our focus is the online setting, where the agent has only limited interaction with the environment, making sample efficiency and exploration especially critical. Policy optimization, despite its success in standard RL, remains theoretically and empirically underexplored in robust RL. To bridge this gap, we propose \textbf{D}istributionally \textbf{R}obust \textbf{R}egularized \textbf{P}olicy \textbf{O}ptimization algorithm (DR-RPO), a model-free online policy optimization method that learns robust policies with sublinear regret. To enable tractable optimization within the softmax policy class, DR-RPO incorporates reference-policy regularization, yielding RMDP variants that are doubly constrained in both transitions and policies. To scale to large state-action spaces, we adopt the $d$-rectangular linear MDP formulation and combine linear function approximation with an upper confidence bonus for optimistic exploration. We provide theoretical guarantees showing that policy optimization can achieve polynomial suboptimality bounds and sample efficiency in robust RL, matching the performance of value-based approaches. Finally, empirical results across diverse domains corroborate our theory and demonstrate the robustness of DR-RPO.

[476] A Physics Prior-Guided Dual-Stream Attention Network for Motion Prediction of Elastic Bragg Breakwaters

Lianzi Jiang, Jianxin Zhang, Xinyu Han, Huanhe Dong, Xiangrong Wang

Main category: cs.LG

TL;DR: Proposes PhysAttnNet, a physics-guided dual-stream attention network for predicting motion responses of elastic Bragg breakwaters, addressing generalization limitations of conventional deep learning models in marine environments.

Details

Motivation: Conventional deep learning models have limited generalization for unseen sea states due to neglect of natural decay in marine systems and inadequate modeling of wave-structure interaction.

Method: Uses decay bidirectional self-attention (DBSA) with learnable temporal decay, phase differences guided bidirectional cross-attention (PDG-BCA) with cosine-based bias, global context fusion (GCF), and hybrid time-frequency loss training.

Result: Significantly outperforms mainstream models on wave flume datasets and demonstrates robust cross-scenario generalization to unseen environments.

Conclusion: PhysAttnNet shows potential as a framework for developing predictive models for complex systems in ocean engineering, with validated robustness and adaptability.

Abstract: Accurate motion response prediction for elastic Bragg breakwaters is critical for their structural safety and operational integrity in marine environments. However, conventional deep learning models often exhibit limited generalization capabilities when presented with unseen sea states. These deficiencies stem from the neglect of natural decay observed in marine systems and inadequate modeling of wave-structure interaction (WSI). To overcome these challenges, this study proposes a novel Physics Prior-Guided Dual-Stream Attention Network (PhysAttnNet). First, the decay bidirectional self-attention (DBSA) module incorporates a learnable temporal decay to assign higher weights to recent states, aiming to emulate the natural decay phenomenon. Meanwhile, the phase differences guided bidirectional cross-attention (PDG-BCA) module explicitly captures the bidirectional interaction and phase relationship between waves and the structure using a cosine-based bias within a bidirectional cross-computation paradigm. These streams are synergistically integrated through a global context fusion (GCF) module. Finally, PhysAttnNet is trained with a hybrid time-frequency loss that jointly minimizes time-domain prediction errors and frequency-domain spectral discrepancies. Comprehensive experiments on wave flume datasets demonstrate that PhysAttnNet significantly outperforms mainstream models. Furthermore,cross-scenario generalization tests validate the model’s robustness and adaptability to unseen environments, highlighting its potential as a framework to develop predictive models for complex systems in ocean engineering.

[477] Generalist vs Specialist Time Series Foundation Models: Investigating Potential Emergent Behaviors in Assessing Human Health Using PPG Signals

Saurabh Kataria, Yi Wu, Zhaoliang Chen, Hyunjung Gloria Kwak, Yuhao Xu, Lovely Yeswanth Panchumarthi, Ran Xiao, Jiaying Lu, Ayca Ermis, Anni Zhao, Runze Yan, Alex Federov, Zewen Liu, Xu Wu, Wei Jin, Carl Yang, Jocelyn Grunwell, Stephanie R. Brown, Amit Shah, Craig Jabaley, Tim Buchman, Sivasubramanium V Bhavani, Randall J. Lee, Xiao Hu

Main category: cs.LG

TL;DR: This paper benchmarks generalist vs specialist foundation models for PPG signal analysis, finding specialist models outperform generalist models by 27% in win score across 51 tasks.

Details

Motivation: Foundation models are increasingly used in time-series analysis, but most are specialist models trained on specific data types. Recent generalist models like MOMENT use multi-domain data, but their performance compared to specialists for physiological signals like PPG remains unclear.

Method: Comprehensive benchmarking study comparing generalist and specialist foundation models across 51 tasks covering cardiac assessment, lab value estimation, and cross-modal inference. Evaluation covers 7 dimensions: win score, average performance, feature quality, tuning gain, performance variance, transferability, and scalability.

Result: Specialist models significantly outperform generalist models, achieving 27% higher win score in full-tuning scenarios. The study provides detailed analysis of generalization, fairness, attention patterns, and training data importance.

Conclusion: Specialist foundation models demonstrate superior performance over generalist models for PPG signal analysis tasks, highlighting the importance of domain-specific pre-training for physiological sensing applications.

Abstract: Foundation models are large-scale machine learning models that are pre-trained on massive amounts of data and can be adapted for various downstream tasks. They have been extensively applied to tasks in Natural Language Processing and Computer Vision with models such as GPT, BERT, and CLIP. They are now also increasingly gaining attention in time-series analysis, particularly for physiological sensing. However, most time series foundation models are specialist models - with data in pre-training and testing of the same type, such as Electrocardiogram, Electroencephalogram, and Photoplethysmogram (PPG). Recent works, such as MOMENT, train a generalist time series foundation model with data from multiple domains, such as weather, traffic, and electricity. This paper aims to conduct a comprehensive benchmarking study to compare the performance of generalist and specialist models, with a focus on PPG signals. Through an extensive suite of total 51 tasks covering cardiac state assessment, laboratory value estimation, and cross-modal inference, we comprehensively evaluate both models across seven dimensions, including win score, average performance, feature quality, tuning gain, performance variance, transferability, and scalability. These metrics jointly capture not only the models’ capability but also their adaptability, robustness, and efficiency under different fine-tuning strategies, providing a holistic understanding of their strengths and limitations for diverse downstream scenarios. In a full-tuning scenario, we demonstrate that the specialist model achieves a 27% higher win score. Finally, we provide further analysis on generalization, fairness, attention visualizations, and the importance of training data choice.

[478] CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions

Zihao Fu, Ming Liao, Chris Russell, Zhenguang G. Cai

Main category: cs.LG

TL;DR: CAST is a probe-free framework for analyzing transformer layers using direct transformation matrix estimation and spectral analysis, revealing distinct behaviors between encoder-only and decoder-only models.

Details

Motivation: Large language models remain black boxes with poorly understood internal mechanisms, despite various existing interpretability methods. CAST aims to provide complementary insights through a novel spectral analysis approach.

Method: CAST uses Moore-Penrose pseudoinverse to estimate realized transformation matrices for each layer and applies spectral analysis with six interpretable metrics to characterize layer behavior.

Result: Analysis reveals decoder models exhibit compression-expansion cycles while encoder models maintain consistent high-rank processing. Kernel analysis shows layers partition into three phases: feature extraction, compression, and specialization.

Conclusion: CAST provides a novel probe-free framework that offers complementary insights to existing interpretability methods by analyzing transformer layer functions through spectral tracking and matrix estimation.

Abstract: Large language models have achieved remarkable success but remain largely black boxes with poorly understood internal mechanisms. To address this limitation, many researchers have proposed various interpretability methods including mechanistic analysis, probing classifiers, and activation visualization, each providing valuable insights from different perspectives. Building upon this rich landscape of complementary approaches, we introduce CAST (Compositional Analysis via Spectral Tracking), a probe-free framework that contributes a novel perspective by analyzing transformer layer functions through direct transformation matrix estimation and comprehensive spectral analysis. CAST offers complementary insights to existing methods by estimating the realized transformation matrices for each layer using Moore-Penrose pseudoinverse and applying spectral analysis with six interpretable metrics characterizing layer behavior. Our analysis reveals distinct behaviors between encoder-only and decoder-only models, with decoder models exhibiting compression-expansion cycles while encoder models maintain consistent high-rank processing. Kernel analysis further demonstrates functional relationship patterns between layers, with CKA similarity matrices clearly partitioning layers into three phases: feature extraction, compression, and specialization.

[479] Nonparametric Data Attribution for Diffusion Models

Yutian Zhao, Chao Du, Xiaosen Zheng, Tianyu Pang, Min Lin

Main category: cs.LG

TL;DR: A nonparametric data attribution method for diffusion models that measures influence via patch-level similarity between generated and training images, without requiring model gradients or retraining.

Details

Motivation: Existing attribution methods for diffusion models typically require access to model gradients or retraining, which limits their use in proprietary or large-scale settings.

Method: Nonparametric attribution method operating entirely on data, measuring influence via patch-level similarity between generated and training images, grounded in analytical form of optimal score function and extended to multiscale representations with convolution-based acceleration.

Result: Achieves strong attribution performance, closely matching gradient-based approaches and substantially outperforming existing nonparametric baselines, while producing spatially interpretable attributions.

Conclusion: The proposed framework provides computationally efficient attribution that uncovers intrinsic relationships between training data and outputs, independent of any specific model.

Abstract: Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. Existing methods for diffusion models typically require access to model gradients or retraining, limiting their applicability in proprietary or large-scale settings. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images. Our approach is grounded in the analytical form of the optimal score function and naturally extends to multiscale representations, while remaining computationally efficient through convolution-based acceleration. In addition to producing spatially interpretable attributions, our framework uncovers patterns that reflect intrinsic relationships between training data and outputs, independent of any specific model. Experiments demonstrate that our method achieves strong attribution performance, closely matching gradient-based approaches and substantially outperforming existing nonparametric baselines. Code is available at https://github.com/sail-sg/NDA.

[480] Stable Prediction of Adverse Events in Medical Time-Series Data

Mayank Keoliya, Seewon Choi, Rajeev Alur, Mayur Naik, Eric Wong

Main category: cs.LG

TL;DR: CAREBench is a new benchmark for early event prediction that evaluates both predictive accuracy and temporal stability using multi-modal inputs (EHR, ECG, clinical text), showing current methods struggle to balance both aspects.

Details

Motivation: Current early event prediction benchmarks ignore temporal stability of risk scores and mainly evaluate on tabular inputs, leaving trajectory behavior untested for clinical deployment.

Method: Proposed CAREBench benchmark with multi-modal inputs (tabular EHR, ECG waveforms, clinical text) and a stability metric that quantifies short-term variability using local-Lipschitz constants to penalize abrupt oscillations.

Result: Existing methods, especially LLMs, struggle to jointly optimize accuracy and stability across six prediction tasks, with notably poor recall at high-precision operating points.

Conclusion: Models need to produce evidence-aligned, stable trajectories to earn clinician trust in continuous monitoring settings.

Abstract: Early event prediction (EEP) systems continuously estimate a patient’s imminent risk to support clinical decision-making. For bedside trust, risk trajectories must be accurate and temporally stable, shifting only with new, relevant evidence. However, current benchmarks (a) ignore stability of risk scores and (b) evaluate mainly on tabular inputs, leaving trajectory behavior untested. To address this gap, we introduce CAREBench, an EEP benchmark that evaluates deployability using multi-modal inputs-tabular EHR, ECG waveforms, and clinical text-and assesses temporal stability alongside predictive accuracy. We propose a stability metric that quantifies short-term variability in per-patient risk and penalizes abrupt oscillations based on local-Lipschitz constants. CAREBench spans six prediction tasks such as sepsis onset and compares classical learners, deep sequence models, and zero-shot LLMs. Across tasks, existing methods, especially LLMs, struggle to jointly optimize accuracy and stability, with notably poor recall at high-precision operating points. These results highlight the need for models that produce evidence-aligned, stable trajectories to earn clinician trust in continuous monitoring settings. (Code: https://github.com/SeewonChoi/CAREBench.)

[481] Enhancing Time-Series Anomaly Detection by Integrating Spectral-Residual Bottom-Up Attention with Reservoir Computing

Hayato Nihei, Sou Nobukawa, Yusuke Sakemi, Kazuyuki Aihara

Main category: cs.LG

TL;DR: Proposes Spectral Residual Reservoir Computing (SR-RC) that combines spectral residual method with reservoir computing to improve time-series anomaly detection performance without sacrificing learning efficiency.

Details

Motivation: Reservoir computing is suitable for edge AI applications but may require large reservoirs for adequate anomaly detection, while attention mechanisms can improve accuracy but undermine RC's learning efficiency.

Method: Integrates spectral residual method (a learning-free, bottom-up attention mechanism) with reservoir computing to enhance anomaly detection without additional training overhead.

Result: SR-RC outperformed conventional RC and logistic-regression models using SR-extracted features across benchmark tasks and real-world time-series datasets.

Conclusion: SR-RC provides a practical direction for deploying reservoir computing as edge AI for time-series anomaly detection, as both components are well-suited for hardware implementation.

Abstract: Reservoir computing (RC) establishes the basis for the processing of time-series data by exploiting the high-dimensional spatiotemporal response of a recurrent neural network to an input signal. In particular, RC trains only the output layer weights. This simplicity has drawn attention especially in Edge Artificial Intelligence (AI) applications. Edge AI enables time-series anomaly detection in real time, which is important because detection delays can lead to serious incidents. However, achieving adequate anomaly-detection performance with RC alone may require an unacceptably large reservoir on resource-constrained edge devices. Without enlarging the reservoir, attention mechanisms can improve accuracy, although they may require substantial computation and undermine the learning efficiency of RC. In this study, to improve the anomaly detection performance of RC without sacrificing learning efficiency, we propose a spectral residual RC (SR-RC) that integrates the spectral residual (SR) method - a learning-free, bottom-up attention mechanism

with RC. We demonstrated that SR-RC outperformed conventional RC and logistic-regression models based on values extracted by the SR method across benchmark tasks and real-world time-series datasets. Moreover, because the SR method, similarly to RC, is well suited for hardware implementation, SR-RC suggests a practical direction for deploying RC as Edge AI for time-series anomaly detection.

[482] TED++: Submanifold-Aware Backdoor Detection via Layerwise Tubular-Neighbourhood Screening

Nam Le, Leo Yu Zhang, Kewen Liao, Shirui Pan, Wei Luo

Main category: cs.LG

TL;DR: TED++ is a submanifold-aware framework that detects subtle backdoor attacks by constructing tubular neighborhoods around class manifolds in hidden features and using Locally Adaptive Ranking to identify activations drifting outside admissible tubes.

Details

Motivation: Stealthy backdoor attacks pose severe security risks to deep neural networks, and existing defenses are vulnerable to subtle distance-based anomalies and perform poorly when clean examples are scarce.

Method: Constructs tubular neighborhoods around each class’s hidden-feature manifold, estimates local thickness from few clean activations, applies Locally Adaptive Ranking to detect activations drifting outside tubes, and aggregates LAR-adjusted ranks across layers to capture input behavior on evolving class submanifolds.

Result: Achieves state-of-the-art detection performance under adaptive-attack and limited-data scenarios, with near-perfect detection using only five examples per class and gains of up to 14% in AUROC over next-best methods.

Conclusion: TED++ effectively detects subtle backdoors that evade existing defenses by leveraging submanifold-aware analysis and tube-constrained behavior characterization, demonstrating robust performance even with extremely limited clean data.

Abstract: As deep neural networks power increasingly critical applications, stealthy backdoor attacks, where poisoned training inputs trigger malicious model behaviour while appearing benign, pose a severe security risk. Many existing defences are vulnerable when attackers exploit subtle distance-based anomalies or when clean examples are scarce. To meet this challenge, we introduce TED++, a submanifold-aware framework that effectively detects subtle backdoors that evade existing defences. TED++ begins by constructing a tubular neighbourhood around each class’s hidden-feature manifold, estimating its local thickness'' from a handful of clean activations. It then applies Locally Adaptive Ranking (LAR) to detect any activation that drifts outside the admissible tube. By aggregating these LAR-adjusted ranks across all layers, TED++ captures how faithfully an input remains on the evolving class submanifolds. Based on such characteristic tube-constrained’’ behaviour, TED++ flags inputs whose LAR-based ranking sequences deviate significantly. Extensive experiments are conducted on benchmark datasets and tasks, demonstrating that TED++ achieves state-of-the-art detection performance under both adaptive-attack and limited-data scenarios. Remarkably, even with only five held-out examples per class, TED++ still delivers near-perfect detection, achieving gains of up to 14% in AUROC over the next-best method. The code is publicly available at https://github.com/namle-w/TEDpp.

[483] Active Measuring in Reinforcement Learning With Delayed Negative Effects

Daiqi Gao, Ziping Xu, Aseel Rawashdeh, Predrag Klasnja, Susan A. Murphy

Main category: cs.LG

TL;DR: The paper introduces Actively Observable Markov Decision Process (AOMDP) where agents can choose to measure latent states, balancing measurement costs against improved state knowledge for better decision-making.

Details

Motivation: Real-world RL applications often face costly state measurements that may negatively impact future outcomes, creating a trade-off between measurement benefits and costs.

Method: Formulate AOMDP as periodic partially observable MDP, develop online RL algorithm using belief states, and propose sequential Monte Carlo method to approximate posterior distributions of unknown parameters and latent states.

Result: The reduced uncertainty from active measurements can provably improve sample efficiency and increase optimal policy value despite measurement costs.

Conclusion: The proposed framework and algorithms effectively handle the measurement trade-off in real-world applications like digital health, where interventions and assessments must be strategically timed.

Abstract: Measuring states in reinforcement learning (RL) can be costly in real-world settings and may negatively influence future outcomes. We introduce the Actively Observable Markov Decision Process (AOMDP), where an agent not only selects control actions but also decides whether to measure the latent state. The measurement action reveals the true latent state but may have a negative delayed effect on the environment. We show that this reduced uncertainty may provably improve sample efficiency and increase the value of the optimal policy despite these costs. We formulate an AOMDP as a periodic partially observable MDP and propose an online RL algorithm based on belief states. To approximate the belief states, we further propose a sequential Monte Carlo method to jointly approximate the posterior of unknown static environment parameters and unobserved latent states. We evaluate the proposed algorithm in a digital health application, where the agent decides when to deliver digital interventions and when to assess users’ health status through surveys.

[484] LLM-ERM: Sample-Efficient Program Learning via LLM-Guided Search

Shivam Singhal, Eran Malach, Tomaso Poggio, Tomer Galanti

Main category: cs.LG

TL;DR: LLM-ERM is a propose-and-verify framework that uses LLMs to guide program search while maintaining ERM-style selection, achieving sample-efficient learning of short programs where gradient-based methods fail.

Details

Motivation: To bridge the gap between sample-efficient but computationally expensive program enumeration methods and computationally feasible but sample-inefficient gradient-based training methods for program learning.

Method: LLM-ERM draws k candidate programs using a pretrained reasoning-augmented LLM, compiles and verifies each on data, then returns the best verified hypothesis without feedback, adaptivity, or gradients.

Result: Empirically, LLM-ERM solves tasks like parity variants, pattern matching, and primality testing with only 200 samples, while SGD-trained transformers overfit even with 100,000 samples.

Conclusion: Language-guided program synthesis recovers statistical efficiency of finite-class ERM while remaining computationally tractable, offering a practical route to learning succinct hypotheses beyond gradient-based training.

Abstract: We seek algorithms for program learning that are both sample-efficient and computationally feasible. Classical results show that targets admitting short program descriptions (e.g., with short python code'') can be learned with a small’’ number of examples (scaling with the size of the code) via length-first program enumeration, but the search is exponential in description length. Consequently, Gradient-based training avoids this cost yet can require exponentially many samples on certain short-program families. To address this gap, we introduce LLM-ERM, a propose-and-verify framework that replaces exhaustive enumeration with an LLM-guided search over candidate programs while retaining ERM-style selection on held-out data. Specifically, we draw $k$ candidates with a pretrained reasoning-augmented LLM, compile and check each on the data, and return the best verified hypothesis, with no feedback, adaptivity, or gradients. Theoretically, we show that coordinate-wise online mini-batch SGD requires many samples to learn certain short programs. {\em Empirically, LLM-ERM solves tasks such as parity variants, pattern matching, and primality testing with as few as 200 samples, while SGD-trained transformers overfit even with 100,000 samples}. These results indicate that language-guided program synthesis recovers much of the statistical efficiency of finite-class ERM while remaining computationally tractable, offering a practical route to learning succinct hypotheses beyond the reach of gradient-based training.

[485] DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis

Shruti Sarika Chakraborty, Peter Minary

Main category: cs.LG

TL;DR: DARTS-GT redesigns Graph Transformers with asymmetric attention and differentiable architecture search to enable depth-wise heterogeneous GNN selection, achieving state-of-the-art performance while providing quantitative interpretability through causal ablation metrics.

Details

Motivation: Current Graph Transformers have rigid designs with fixed GNN types across all layers, lack depth-specific component selection, and suffer from poor interpretability where performance gains cannot distinguish meaningful patterns from spurious correlations.

Method: Redesign GT attention asymmetrically by decoupling structural encoding from feature representation (queries from node features, keys/values from GNN transformations), then use Differentiable Architecture Search (DARTS) to select optimal GNN operators at each layer.

Result: DARTS-GT achieves state-of-the-art on four of eight benchmark datasets while remaining competitive on others. Discovered architectures reveal dataset-specific patterns, and heterogeneous architectures consistently produce more interpretable models than baselines.

Conclusion: Graph Transformers need not choose between performance and interpretability. The framework establishes that visual attention salience and causal importance do not always correlate, indicating traditional visualization approaches may miss important components.

Abstract: Graph Transformers (GTs) have emerged as powerful architectures for graph-structured data, yet remain constrained by rigid designs and lack quantifiable interpretability. Current state-of-the-art GTs commit to fixed GNN types across all layers, missing potential benefits of depth-specific component selection, while their complex architectures become opaque where performance gains cannot be distinguished between meaningful patterns and spurious correlations. We redesign GT attention through asymmetry, decoupling structural encoding from feature representation: queries derive from node features while keys and values come from GNN transformations. Within this framework, we use Differentiable ARchiTecture Search (DARTS) to select optimal GNN operators at each layer, enabling depth-wise heterogeneity inside transformer attention itself (DARTS-GT). To understand discovered architectures, we develop the first quantitative interpretability framework for GTs through causal ablation. Our metrics (Head-deviation, Specialization, and Focus), identify which heads and nodes drive predictions while enabling model comparison. Experiments across eight benchmarks show DARTS-GT achieves state-of-the-art on four datasets while remaining competitive on others, with discovered architectures revealing dataset-specific patterns. Our interpretability analysis reveals that visual attention salience and causal importance do not always correlate, indicating widely used visualization approaches may miss components that actually matter. Crucially, heterogeneous architectures found by DARTS-GT consistently produced more interpretable models than baselines, establishing that Graph Transformers need not choose between performance and interpretability.

[486] Stop-RAG: Value-Based Retrieval Control for Iterative RAG

Jaewan Park, Solbee Cho, Jay-Yoon Lee

Main category: cs.LG

TL;DR: Stop-RAG is a value-based controller that adaptively decides when to stop retrieving in iterative RAG systems, outperforming fixed-iteration and LLM-based stopping methods on multi-hop QA benchmarks.

Details

Motivation: Iterative RAG systems suffer from increased latency, costs, and risk of distracting evidence with each additional retrieval loop, creating need for efficient stopping strategies beyond predetermined iterations or unreliable confidence proxies.

Method: Cast iterative RAG as finite-horizon Markov decision process and train Stop-RAG controller using full-width forward-view Q(λ) targets from complete trajectories, enabling adaptive stopping decisions while remaining compatible with black-box APIs.

Result: Stop-RAG consistently outperforms fixed-iteration baselines and prompting-based stopping with LLMs on multi-hop question-answering benchmarks.

Conclusion: Adaptive stopping is a key missing component in current agentic systems, and value-based control can significantly improve RAG system accuracy.

Abstract: Iterative retrieval-augmented generation (RAG) enables large language models to answer complex multi-hop questions, but each additional loop increases latency, costs, and the risk of introducing distracting evidence, motivating the need for an efficient stopping strategy. Existing methods either use a predetermined number of iterations or rely on confidence proxies that poorly reflect whether more retrieval will actually help. We cast iterative RAG as a finite-horizon Markov decision process and introduce Stop-RAG, a value-based controller that adaptively decides when to stop retrieving. Trained with full-width forward-view Q($\lambda$) targets from complete trajectories, Stop-RAG learns effective stopping policies while remaining compatible with black-box APIs and existing pipelines. On multi-hop question-answering benchmarks, Stop-RAG consistently outperforms both fixed-iteration baselines and prompting-based stopping with LLMs. These results highlight adaptive stopping as a key missing component in current agentic systems, and demonstrate that value-based control can improve the accuracy of RAG systems.

[487] Jet Functors and Weil Algebras in Automatic Differentiation: A Geometric Analysis

Amandip Sangha

Main category: cs.LG

TL;DR: Geometric formulation of automatic differentiation using jet bundles and Weil algebras, showing reverse-mode AD as cotangent-pullback and Taylor-mode as evaluation in Weil algebras, with applications to correctness, stability, and efficient computation of mixed derivatives.

Details

Motivation: To provide a geometric foundation for automatic differentiation theory using differential geometry concepts, enabling structure-preserving differentiation methods and addressing combinatorial complexity in higher-order derivatives.

Method: Uses jet bundles and Weil algebras to formulate AD geometrically, with reverse-mode AD as cotangent-pullback and Taylor-mode as evaluation in Weil algebras. Introduces tensorized Weil algebras for efficient computation of mixed derivatives.

Result: Derived concise statements on correctness, stability, and complexity: functorial identity for reverse-mode, algebraic exactness of higher-order derivatives, explicit truncation error bounds, and linear-cost computation of all mixed derivatives avoiding combinatorial blow-up.

Conclusion: The framework interprets AD theory through differential geometry and provides foundation for structure-preserving differentiation methods in deep learning and scientific computing.

Abstract: We present a geometric formulation of automatic differentiation (AD) using jet bundles and Weil algebras. Reverse-mode AD emerges as cotangent-pullback, while Taylor-mode corresponds to evaluation in a Weil algebra. From these principles, we derive concise statements on correctness, stability, and complexity: a functorial identity for reverse-mode, algebraic exactness of higher-order derivatives, and explicit bounds on truncation error. We further show that tensorized Weil algebras permit one-pass computation of all mixed derivatives with cost linear in the algebra dimension, avoiding the combinatorial blow-up of nested JVP/VJP schedules. This framework interprets AD theory through the lens of differential geometry and offers a foundation for developing structure-preserving differentiation methods in deep learning and scientific computing. Code and examples are available at https://git.nilu.no/geometric-ad/jet-weil-ad.

[488] Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers

Andrew Zhao, Reshmi Ghosh, Vitor Carvalho, Emily Lawton, Keegan Hines, Gao Huang, Jack W. Stokes

Main category: cs.LG

TL;DR: This paper presents the first systematic analysis of poisoning risks in LLM-based prompt optimization, showing that systems are more vulnerable to manipulated feedback than injected queries, and proposes a lightweight highlighting defense.

Details

Motivation: LLM systems now power everyday AI applications where performance depends on carefully designed prompts. While LLM-based prompt optimizers reduce effort by refining prompts from scored feedback, the security of this optimization stage remains underexamined.

Method: The researchers used HarmBench to analyze poisoning risks, introduced a simple fake-reward attack that requires no access to the reward model, and proposed a lightweight highlighting defense to mitigate vulnerabilities.

Result: Feedback-based attacks raise attack success rate (ASR) by up to ΔASR = 0.48. The fake-reward attack significantly increases vulnerability, while the highlighting defense reduces the fake-reward ΔASR from 0.23 to 0.07 without degrading utility.

Conclusion: Prompt optimization pipelines represent a first-class attack surface that requires stronger safeguards for feedback channels and optimization frameworks.

Abstract: Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to $\Delta$ASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward $\Delta$ASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.

Kartikay Agrawal, Abhijeet Vikram, Vedant Sharma, Vaishnavi N., Ayon Borthakur

Main category: cs.LG

TL;DR: SHaRe-SSM is a second-order spiking state space model that outperforms transformers and first-order SSMs for very-long-range sequence modeling while being highly energy-efficient (73× less energy than ANN-based SSMs).

Details

Motivation: To combine the energy efficiency of spiking neural networks with the long-sequence modeling capabilities of state space models, overcoming transformer's quadratic complexity and enabling efficient processing of very-long-range sequences.

Method: Proposed SHaRe-SSM using second-order spiking SSM with resonate and fire neurons, exploiting parallel scans for stable implementation, and introducing kernel-based spiking regressor for long-range sequences.

Result: Superior performance on sequences up to 50k length while being significantly energy-efficient (73× less energy than ANN-based SSMs for 18k sequences). Systematic analysis of heterogeneity, dissipation, and conservation in resonate-and-fire SSMs.

Conclusion: SHaRe-SSM provides an efficient alternative to transformers and first-order SSMs for very-long-range sequence modeling, combining energy efficiency with strong performance through second-order spiking dynamics.

Abstract: In recent years, with the emergence of large models, there has been a significant interest in spiking neural networks (SNNs) primarily due to their energy efficiency, multiplication-free, and sparse event-based deep learning. Similarly, state space models (SSMs) in varying designs have evolved as a powerful alternative to transformers for target modeling in long sequences, thereby overcoming the quadratic dependence on sequence length of a transformer. Inspired by this progress, we here design SHaRe-SSM (Spiking Harmonic Resonate and Fire State Space Model), for target variable modeling (including both classification and regression) for very-long-range sequences. Our second-order spiking SSM, on average, performs better than transformers or first-order SSMs while circumventing multiplication operations, making it ideal for resource-constrained applications. The proposed block consumes $73 \times$ less energy than second-order ANN-based SSMs for an 18k sequence, while retaining performance. To ensure learnability over the long-range sequences, we propose exploiting the stable and efficient implementation of the dynamical system using parallel scans. Moreover, for the first time, we propose a kernel-based spiking regressor using resonate and fire neurons for very long-range sequences. Our network shows superior performance on even a 50k sequence while being significantly energy-efficient. In addition, we conducted a systematic analysis of the impact of heterogeneity, dissipation, and conservation in resonate-and-fire SSMs.

[490] Interaction Concordance Index: Performance Evaluation for Interaction Prediction Methods

Tapio Pahikkala, Riikka Numminen, Parisa Movahedi, Napsu Karmitsa, Antti Airola

Main category: cs.LG

TL;DR: The paper introduces IC-index, a new performance metric for evaluating how well predictors capture interaction directions in drug-target affinity (DTA) prediction, complementing existing DTA prediction metrics.

Details

Motivation: Current DTA prediction methods focus on affinity values but don't adequately evaluate whether predictors capture interaction effects between drugs and targets, which is crucial for optimal allocation decisions.

Method: Proposed interaction concordance index (IC-index) to measure correct prediction of interaction directions. Analyzed invariance properties and showed that permutation equivariance in learning algorithms prevents interaction capture for unseen entities.

Result: IC-index reveals limitations of predictors that can’t capture interactions. Empirical evaluation on biomedical datasets shows how different ML algorithms perform in terms of interaction direction prediction.

Conclusion: IC-index provides valuable complementary evaluation to existing DTA prediction metrics by specifically assessing interaction effect direction prediction, which is essential for practical decision-making in drug-target allocation.

Abstract: Consider two sets of entities and their members’ mutual affinity values, say drug-target affinities (DTA). Drugs and targets are said to interact in their effects on DTAs if drug’s effect on it depends on the target. Presence of interaction implies that assigning a drug to a target and another drug to another target does not provide the same aggregate DTA as the reversed assignment would provide. Accordingly, correctly capturing interactions enables better decision-making, for example, in allocation of limited numbers of drug doses to their best matching targets. Learning to predict DTAs is popularly done from either solely from known DTAs or together with side information on the entities, such as chemical structures of drugs and targets. In this paper, we introduce interaction directions’ prediction performance estimator we call interaction concordance index (IC-index), for both fixed predictors and machine learning algorithms aimed for inferring them. IC-index complements the popularly used DTA prediction performance estimators by evaluating the ratio of correctly predicted directions of interaction effects in data. First, we show the invariance of IC-index on predictors unable to capture interactions. Secondly, we show that learning algorithm’s permutation equivariance regarding drug and target identities implies its inability to capture interactions when either drug, target or both are unseen during training. In practical applications, this equivariance is remedied via incorporation of appropriate side information on drugs and targets. We make a comprehensive empirical evaluation over several biomedical interaction data sets with various state-of-the-art machine learning algorithms. The experiments demonstrate how different types of affinity strength prediction methods perform in terms of IC-index complementing existing prediction performance estimators.

[491] MergeMoE: Efficient Compression of MoE Models via Expert Output Merging

Ruijie Miao, Yilun Yao, Zihan Wang, Zhiming Wang, Bairen Yi, LingJun Liu, Yikai Zhao, Tong Yang

Main category: cs.LG

TL;DR: MergeMoE is a novel method for compressing Mixture-of-Experts (MoE) models using mathematical optimization to merge experts’ outputs rather than just aggregating parameters, achieving better performance than baselines at the same compression ratios.

Details

Motivation: MoE models face substantial memory overhead despite their effectiveness in scaling model size, making compression an important research direction. Existing expert merging techniques need improvement.

Method: The approach interprets expert merging from the perspective of merging experts’ outputs rather than parameter aggregation. This leads to an optimization formulation where compression matrices are constructed using mathematical optimization.

Result: MergeMoE consistently outperforms baseline methods with the same compression ratios across multiple MoE models.

Conclusion: The output-based perspective on expert merging provides a more effective framework for MoE compression, and the optimization-based MergeMoE method demonstrates superior performance compared to existing approaches.

Abstract: The Mixture-of-Experts (MoE) technique has proven to be a promising solution to efficiently scale the model size, which has been widely applied in recent LLM advancements. However, the substantial memory overhead of MoE models has made their compression an important research direction. In this work, we provide a theoretical analysis of expert merging, a recently proposed technique for compressing MoE models. Rather than interpreting expert merging from the conventional perspective of parameter aggregation, we approach it from the perspective of merging experts’ outputs. Our key insight is that the merging process can be interpreted as inserting additional matrices into the forward computation, which naturally leads to an optimization formulation. Building on this analysis, we introduce MergeMoE, a method that leverages mathematical optimization to construct the compression matrices. We evaluate MergeMoE on multiple MoE models and show that our algorithm consistently outperforms the baselines with the same compression ratios.

[492] A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

Moritz Wagner, Christophe Roux, Max Zimmer, Sebastian Pokutta

Main category: cs.LG

TL;DR: This paper challenges common intuitions about neural network pruning for LLMs, showing that reconstructing attention and MLP components separately within transformer blocks is more efficient and effective than full retraining, achieving better performance with less memory.

Details

Motivation: To challenge the assumption that full retraining should be avoided for LLM pruning due to computational infeasibility, and to study optimal reconstruction strategies after pruning.

Method: Conducted extensive computational study on GPT architectures, comparing different reconstruction granularities and retraining approaches after pruning, with focus on layer-wise mask selection and reconstruction using calibration data.

Result: Found that reconstructing attention and MLP components separately within transformer blocks is Pareto-optimal - most resource-efficient while achieving best perplexity, outperforming full retraining with only a fraction of memory. Also showed simple pruning criteria like Wanda can outperform complex approaches when reconstruction is properly executed.

Conclusion: The findings challenge the narrative that retraining should be avoided at all costs, providing important insights that proper reconstruction strategies can achieve better performance than full retraining while being more memory-efficient for LLM pruning.

Abstract: While Neural Network pruning typically requires retraining the model to recover pruning-induced performance degradation, state-of-the-art Large Language Models (LLMs) pruning methods instead solve a layer-wise mask selection and reconstruction problem on a small set of calibration data to avoid full retraining, as it is considered computationally infeasible for LLMs. Reconstructing single matrices in isolation has favorable properties, such as convexity of the objective and significantly reduced memory requirements compared to full retraining. In practice, however, reconstruction is often implemented at coarser granularities, e.g., reconstructing a whole transformer block against its dense activations instead of a single matrix. In this work, we study the key design choices when reconstructing or retraining the remaining weights after pruning. We conduct an extensive computational study on state-of-the-art GPT architectures, and report several surprising findings that challenge common intuitions about retraining after pruning. In particular, we observe a free lunch scenario: reconstructing attention and MLP components separately within each transformer block is nearly the most resource-efficient yet achieves the best perplexity. Most importantly, this Pareto-optimal setup achieves better performance than full retraining, despite requiring only a fraction of the memory. Furthermore, we demonstrate that simple and efficient pruning criteria such as Wanda can outperform much more complex approaches when the reconstruction step is properly executed, highlighting its importance. Our findings challenge the narrative that retraining should be avoided at all costs and provide important insights into post-pruning performance recovery for LLMs.

[493] Towards geological inference with process-based and deep generative modeling, part 1: training on fluvial deposits

Guillaume Rongier, Luk Peeters

Main category: cs.LG

TL;DR: GANs can effectively generate 3D fluvial deposits that reproduce geological structures and honor the law of superposition, showing promise for subsurface resource modeling.

Details

Motivation: Current generative models struggle to reproduce continuous geological structures like fluvial deposits, which are important for subsurface resource distribution.

Method: Train a generative adversarial network (GAN) using process-based model simulations of fluvial deposits as training data, with ablation study to test transferability from 2D to 3D.

Result: GAN training remains stable, generates samples that reproduce non-stationarity and details without mode collapse or memorization, and honors the law of superposition when using deposition time.

Conclusion: GANs are more robust than credited for specific geological structures, with potential for leveraging geological principles, though scalability to larger 3D images and multimodal datasets needs further exploration.

Abstract: The distribution of resources in the subsurface is deeply linked to the variations of its physical properties. Generative modeling has long been used to predict those physical properties while quantifying the associated uncertainty. But current approaches struggle to properly reproduce geological structures, and fluvial deposits in particular, because of their continuity. This study explores whether a generative adversarial network (GAN) - a type of deep-learning algorithm for generative modeling - can be trained to reproduce fluvial deposits simulated by a process-based model - a more expensive model that mimics geological processes. An ablation study shows that developments from the deep-learning community to generate large 2D images are directly transferable to 3D images of fluvial deposits. Training remains stable, and the generated samples reproduce the non-stationarity and details of the deposits without mode collapse or pure memorization of the training data. Using a process-based model to generate those training data allows us to include valuable properties other than the usual physical properties. We show how the deposition time let us monitor and validate the performance of a GAN by checking that its samples honor the law of superposition. Our work joins a series of previous studies suggesting that GANs are more robust that given credit for, at least for training datasets targeting specific geological structures. Whether this robustness transfers to larger 3D images and multimodal datasets remains to be seen. Exploring how deep generative models can leverage geological principles like the law of superposition shows a lot of promise.

[494] Feature Selection and Regularization in Multi-Class Classification: An Empirical Study of One-vs-Rest Logistic Regression with Gradient Descent Optimization and L1 Sparsity Constraints

Jahidul Arafat, Fariha Tasmin, Md Kaosar Uddin, Sanjaya Poudel, Eftakhar Ahmed Arnob

Main category: cs.LG

TL;DR: Comprehensive study of One-vs-Rest logistic regression for wine classification, comparing manual gradient descent with scikit-learn, analyzing L1 regularization effects, and proposing optimal feature subsets for cost-effective deployment.

Details

Motivation: Address trade-offs between model accuracy, feature dimensionality, and interpretability for production deployment in analytical chemistry, particularly for wine classification.

Method: Empirical study using One-vs-Rest logistic regression on UCI Wine dataset (178 samples, 3 cultivars, 13 chemical features), comparing manual gradient descent implementation against scikit-learn’s optimized solvers, and analyzing L1 regularization effects on feature sparsity.

Result: Manual gradient descent achieved 92.59% accuracy with smooth convergence; scikit-learn provided 24x training speedup and 98.15% accuracy. L1 regularization produced 54-69% feature reduction with only 4.63% accuracy decrease. Proposed optimal 5-feature subset achieves 62% complexity reduction with estimated 92-94% accuracy.

Conclusion: Findings provide actionable guidelines for practitioners balancing comprehensive chemical analysis against targeted feature measurement in resource-constrained environments, enabling cost-effective deployment with significant savings and real-time prediction capability.

Abstract: Multi-class wine classification presents fundamental trade-offs between model accuracy, feature dimensionality, and interpretability - critical factors for production deployment in analytical chemistry. This paper presents a comprehensive empirical study of One-vs-Rest logistic regression on the UCI Wine dataset (178 samples, 3 cultivars, 13 chemical features), comparing from-scratch gradient descent implementation against scikit-learn’s optimized solvers and quantifying L1 regularization effects on feature sparsity. Manual gradient descent achieves 92.59 percent mean test accuracy with smooth convergence, validating theoretical foundations, though scikit-learn provides 24x training speedup and 98.15 percent accuracy. Class-specific analysis reveals distinct chemical signatures with heterogeneous patterns where color intensity varies dramatically (0.31 to 16.50) across cultivars. L1 regularization produces 54-69 percent feature reduction with only 4.63 percent accuracy decrease, demonstrating favorable interpretability-performance trade-offs. We propose an optimal 5-feature subset achieving 62 percent complexity reduction with estimated 92-94 percent accuracy, enabling cost-effective deployment with 80 dollars savings per sample and 56 percent time reduction. Statistical validation confirms robust generalization with sub-2ms prediction latency suitable for real-time quality control. Our findings provide actionable guidelines for practitioners balancing comprehensive chemical analysis against targeted feature measurement in resource-constrained environments.

[495] Coder as Editor: Code-driven Interpretable Molecular Optimization

Wenyu Zhu, Chengzhu Li, Xiaohe Tian, Yifan Wang, Yinjun Jia, Jianhui Wang, Bowen Gao, Ya-Qin Zhang, Wei-Ying Ma, Yanyan Lan

Main category: cs.LG

TL;DR: MECo is a framework that bridges reasoning and execution in molecular optimization by translating editing actions into executable code, achieving high accuracy and consistency.

Details

Motivation: LLMs struggle to faithfully execute molecular modifications when operating on non-intuitive representations like SMILES, creating a gap between reasoning and execution in drug discovery.

Method: A cascaded framework that first generates human-interpretable editing intentions from molecules and property goals, then translates those intentions into executable structural edits via code generation.

Result: Achieves over 98% accuracy in reproducing realistic edits, improves consistency by 38-86 percentage points to 90%+, and achieves higher success rates over SMILES-based baselines while preserving structural similarity.

Conclusion: MECo enables consistent, controllable and interpretable molecular design by aligning intention with execution, laying foundation for high-fidelity feedback loops and collaborative human-AI workflows in drug discovery.

Abstract: Molecular optimization is a central task in drug discovery that requires precise structural reasoning and domain knowledge. While large language models (LLMs) have shown promise in generating high-level editing intentions in natural language, they often struggle to faithfully execute these modifications-particularly when operating on non-intuitive representations like SMILES. We introduce MECo, a framework that bridges reasoning and execution by translating editing actions into executable code. MECo reformulates molecular optimization for LLMs as a cascaded framework: generating human-interpretable editing intentions from a molecule and property goal, followed by translating those intentions into executable structural edits via code generation. Our approach achieves over 98% accuracy in reproducing held-out realistic edits derived from chemical reactions and target-specific compound pairs. On downstream optimization benchmarks spanning physicochemical properties and target activities, MECo substantially improves consistency by 38-86 percentage points to 90%+ and achieves higher success rates over SMILES-based baselines while preserving structural similarity. By aligning intention with execution, MECo enables consistent, controllable and interpretable molecular design, laying the foundation for high-fidelity feedback loops and collaborative human-AI workflows in drug discovery.

[496] Holdout-Loss-Based Data Selection for LLM Finetuning via In-Context Learning

Ling Zhang, Xianliang Yang, Juwon Yu, Park Cheonyoung, Lei Song, Jiang Bian

Main category: cs.LG

TL;DR: A resource-efficient framework using In-Context Approximation (ICA) to select and reweight training data for fine-tuning language models, improving alignment with minimal overhead.

Details

Motivation: Fine-tuning large language models often suffers from noisy or off-target examples that dilute supervision, and current methods for identifying high-value training data rely on heuristics or expensive retraining.

Method: In-Context Approximation (ICA) estimates holdout loss after training on candidate examples by conditioning on a small curated holdout set in context, requiring no reference model or additional finetuning. ICA scores are used to derive per-example weights for dynamic gradient reweighting.

Result: ICA-based reweighting consistently improves model alignment across SFT, DPO, and SimPO methods, with diverse backbones and datasets, achieving performance gains with minimal computational overhead.

Conclusion: The ICA framework provides an efficient, theoretically grounded approach for data selection and reweighting that enhances fine-tuning effectiveness, though limitations exist for rapidly drifting on-policy updates that warrant future investigation.

Abstract: Fine-tuning large pretrained language models is a common approach for aligning them with human preferences, but noisy or off-target examples can dilute supervision. While small, well-chosen datasets often match the performance of much larger ones, systematic and efficient ways to identify high-value training data remain underexplored. Many current methods rely on heuristics or expensive retraining. We present a theoretically grounded, resource-efficient framework for data selection and reweighting. At its core is an In-Context Approximation (ICA) that estimates the holdout loss a model would incur after training on a candidate example by conditioning on a small, curated holdout set in context. ICA requires no reference model and no additional finetuning. Under a local linearization, ICA is equivalent to a first-order update toward the holdout optimum, motivating its use as a proxy for data value. We derive per-example weights from ICA scores, dynamically reweighting gradient updates as model parameters evolve. Across SFT, DPO, and SimPO, and over diverse backbones and datasets, ICA-based reweighting consistently improves model alignment with minimal overhead. We analyze sensitivity to score update frequency and the choice of $k$ holdout examples for in-context demonstrations, and note limitations for rapidly drifting on-policy updates, highlighting directions for future work. Code and prompts will be released.

[497] From Guess2Graph: When and How Can Unreliable Experts Safely Boost Causal Discovery in Finite Samples?

Sujai Hiremath, Dominik Janzing, Philipp Faller, Patrick Blöbaum, Elke Kirschbaum, Shiva Prasad Kasiviswanathan, Kyra Gan

Main category: cs.LG

TL;DR: The Guess2Graph (G2G) framework improves causal discovery with limited samples by using expert guesses to guide statistical tests rather than replacing them, maintaining statistical consistency while enabling performance gains.

Details

Motivation: Causal discovery algorithms perform poorly with limited samples, and existing methods that integrate expert knowledge require perfect predictions or uncertainty estimates, making them unreliable for practical use.

Method: Proposed Guess2Graph (G2G) framework with two instantiations: PC-Guess (augments PC algorithm) and gPC-Guess (learning-augmented variant designed to better leverage high-quality expert input). Both use expert guesses to guide the sequence of statistical tests.

Result: Theoretically, both methods preserve correctness regardless of expert error, with gPC-Guess provably outperforming its non-augmented counterpart when experts are ‘better than random.’ Empirically, both show monotonic improvement with expert accuracy, with gPC-Guess achieving significantly stronger gains.

Conclusion: The G2G framework provides a practical approach to improve causal discovery with limited samples by effectively leveraging expert knowledge while maintaining statistical guarantees.

Abstract: Causal discovery algorithms often perform poorly with limited samples. While integrating expert knowledge (including from LLMs) as constraints promises to improve performance, guarantees for existing methods require perfect predictions or uncertainty estimates, making them unreliable for practical use. We propose the Guess2Graph (G2G) framework, which uses expert guesses to guide the sequence of statistical tests rather than replacing them. This maintains statistical consistency while enabling performance improvements. We develop two instantiations of G2G: PC-Guess, which augments the PC algorithm, and gPC-Guess, a learning-augmented variant designed to better leverage high-quality expert input. Theoretically, both preserve correctness regardless of expert error, with gPC-Guess provably outperforming its non-augmented counterpart in finite samples when experts are “better than random.” Empirically, both show monotonic improvement with expert accuracy, with gPC-Guess achieving significantly stronger gains.

[498] Learning to Undo: Rollback-Augmented Reinforcement Learning with Reversibility Signals

Andrejs Sorstkins, Omer Tariq, Muhammad Bilal

Main category: cs.LG

TL;DR: A reversible learning framework for RL that uses transition reversibility measures and selective rollback operations to improve safety and performance by preventing catastrophic actions and value overestimation.

Details

Motivation: Address vulnerability to value overestimation and instability in partially irreversible environments where catastrophic actions can occur.

Method: Two core mechanisms: 1) Phi estimator quantifying state-action reversibility likelihood, 2) Selective rollback operation that returns to prior state when actions yield unexpectedly low returns.

Result: In CliffWalking: 99.8% reduction in catastrophic falls, 55% increase in mean episode return. In Taxi v3: ≥99.9% illegal action suppression, 65.7% cumulative reward improvement, reduced reward variance.

Conclusion: The rollback mechanism is critical for safety and performance gains, representing a robust step toward safe sequential decision making in RL.

Abstract: This paper proposes a reversible learning framework to improve the robustness and efficiency of value based Reinforcement Learning agents, addressing vulnerability to value overestimation and instability in partially irreversible environments. The framework has two complementary core mechanisms: an empirically derived transition reversibility measure called Phi of s and a, and a selective state rollback operation. We introduce an online per state action estimator called Phi that quantifies the likelihood of returning to a prior state within a fixed horizon K. This measure is used to adjust the penalty term during temporal difference updates dynamically, integrating reversibility awareness directly into the value function. The system also includes a selective rollback operator. When an action yields an expected return markedly lower than its instantaneous estimated value and violates a predefined threshold, the agent is penalized and returns to the preceding state rather than progressing. This interrupts sub optimal high risk trajectories and avoids catastrophic steps. By combining reversibility aware evaluation with targeted rollback, the method improves safety, performance, and stability. In the CliffWalking v0 domain, the framework reduced catastrophic falls by over 99.8 percent and yielded a 55 percent increase in mean episode return. In the Taxi v3 domain, it suppressed illegal actions by greater than or equal to 99.9 percent and achieved a 65.7 percent improvement in cumulative reward, while also sharply reducing reward variance in both environments. Ablation studies confirm that the rollback mechanism is the critical component underlying these safety and performance gains, marking a robust step toward safe and reliable sequential decision making.

[499] Enhancing Time Series Forecasting through Selective Representation Spaces: A Patch Perspective

Xingjian Wu, Xiangfei Qiu, Hanyin Cheng, Zhengyu Li, Jilin Hu, Chenjuan Guo, Bin Yang

Main category: cs.LG

TL;DR: The paper proposes Selective Representation Space (SRS) module that uses learnable Selective Patching and Dynamic Reassembly to flexibly select and shuffle patches from time series, overcoming limitations of conventional fixed patching methods.

Details

Motivation: Conventional patching techniques partition time series into adjacent patches, creating a fixed representation space that results in insufficiently expressive representations for time series forecasting.

Method: Proposed SRS module with Selective Patching and Dynamic Reassembly techniques to adaptively select and shuffle patches from contextual time series. Also introduced SRSNet model combining SRS with an MLP head.

Result: SRSNet achieves state-of-the-art performance on real-world datasets from multiple domains. The SRS module also enhances performance of existing patch-based models as a plug-and-play component.

Conclusion: The SRS module effectively constructs a selective representation space that flexibly includes the most informative patches, significantly improving forecasting performance in patch-based time series models.

Abstract: Time Series Forecasting has made significant progress with the help of Patching technique, which partitions time series into multiple patches to effectively retain contextual semantic information into a representation space beneficial for modeling long-term dependencies. However, conventional patching partitions a time series into adjacent patches, which causes a fixed representation space, thus resulting in insufficiently expressful representations. In this paper, we pioneer the exploration of constructing a selective representation space to flexibly include the most informative patches for forecasting. Specifically, we propose the Selective Representation Space (SRS) module, which utilizes the learnable Selective Patching and Dynamic Reassembly techniques to adaptively select and shuffle the patches from the contextual time series, aiming at fully exploiting the information of contextual time series to enhance the forecasting performance of patch-based models. To demonstrate the effectiveness of SRS module, we propose a simple yet effective SRSNet consisting of SRS and an MLP head, which achieves state-of-the-art performance on real-world datasets from multiple domains. Furthermore, as a novel plugin-and-play module, SRS can also enhance the performance of existing patch-based models. The resources are available at https://github.com/decisionintelligence/SRSNet.

[500] On the Identifiability of Tensor Ranks via Prior Predictive Matching

Eliezer da Silva, Arto Klami, Diego Mesquita, Iñigo Urteaga

Main category: cs.LG

TL;DR: A rigorous method for determining rank identifiability in probabilistic tensor models using prior predictive moment matching, showing that PARAFAC/CP, Tensor Train, and Tensor Ring models have identifiable ranks while Tucker model does not.

Details

Motivation: Current approaches for selecting latent dimensions (ranks) in tensor factorization rely on heuristic methods, lacking rigorous theoretical foundations.

Method: Transform moment matching conditions into log-linear system of equations relating marginal moments, prior hyperparameters, and ranks; establish equivalence between rank identifiability and system solvability.

Result: PARAFAC/CP, Tensor Train, and Tensor Ring models yield solvable systems (identifiable ranks), while Tucker model leads to underdetermined system (unidentifiable ranks). Derived closed-form rank estimators for identifiable models.

Conclusion: The proposed framework provides rigorous rank identifiability analysis and practical estimators for tensor models, with empirical validation showing robustness.

Abstract: Selecting the latent dimensions (ranks) in tensor factorization is a central challenge that often relies on heuristic methods. This paper introduces a rigorous approach to determine rank identifiability in probabilistic tensor models, based on prior predictive moment matching. We transform a set of moment matching conditions into a log-linear system of equations in terms of marginal moments, prior hyperparameters, and ranks; establishing an equivalence between rank identifiability and the solvability of such system. We apply this framework to four foundational tensor-models, demonstrating that the linear structure of the PARAFAC/CP model, the chain structure of the Tensor Train model, and the closed-loop structure of the Tensor Ring model yield solvable systems, making their ranks identifiable. In contrast, we prove that the symmetric topology of the Tucker model leads to an underdetermined system, rendering the ranks unidentifiable by this method. For the identifiable models, we derive explicit closed-form rank estimators based on the moments of observed data only. We empirically validate these estimators and evaluate the robustness of the proposal.

[501] Agentic Entropy-Balanced Policy Optimization

Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou

Main category: cs.LG

TL;DR: AEPO is an agentic RL algorithm that balances entropy in rollout and policy update phases to prevent training collapse caused by excessive entropy reliance, achieving state-of-the-art performance across 14 datasets.

Details

Motivation: Mainstream agentic RL algorithms rely heavily on entropy signals for exploration, but excessive entropy can lead to training collapse and constraints in multi-turn, long-horizon tool-use capabilities.

Method: AEPO uses: (1) dynamic entropy-balanced rollout with adaptive sampling budget allocation and branch penalty for consecutive high-entropy steps; (2) Entropy-Balanced Policy Optimization with stop-gradient operation on high-entropy clipping and entropy-aware advantage estimation.

Result: AEPO outperforms 7 mainstream RL algorithms across 14 datasets. Qwen3-14B with AEPO achieves: 47.6% on GAIA, 11.2% on Humanity’s Last Exam, 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity’s Last Exam, 70.0% on WebWalker for Pass@5.

Conclusion: AEPO improves rollout sampling diversity while maintaining stable policy entropy, enabling scalable web agent training and addressing entropy-related challenges in agentic RL.

Abstract: Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to the training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocate global and branch sampling budget through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity’s Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity’s Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.

[502] MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving

Jungi Lee, Junyong Park, Soohyun Cha, Jaehoon Cho, Jaewoong Sim

Main category: cs.LG

TL;DR: MX+ is a cost-effective extension to block floating-point formats that addresses outlier issues in ultra low-bit precision for LLM serving by repurposing exponent fields as extended mantissas, achieving better performance than MXFP4 with minimal overhead.

Details

Motivation: Existing reduced-precision formats for LLM serving often require intrusive software modifications or are unconventional for widespread hardware adoption. Ultra low-bit BFP variants struggle with outlier values that degrade language model performance.

Method: Proposed MX+ extension to microscaling (MX) formats that repurposes the exponent field of outlier elements as an extended mantissa to increase precision, providing a non-intrusive solution that integrates seamlessly with existing MX formats.

Result: MX+ achieves significantly higher model performance compared to 4-bit MX format (MXFP4) with negligible storage overhead and slowdown, making it a compelling alternative to MXFP4 or MXFP6 for efficient LLM inference.

Conclusion: MX+ offers an effective solution to outlier problems in low-bit BFP formats, enabling better LLM performance with minimal cost, making it suitable for widespread adoption across hardware vendors.

Abstract: Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrusive modifications to the software frameworks or are rather unconventional for widespread adoption across hardware vendors. In this paper, we instead focus on recent industry-driven variants of block floating-point (BFP) formats and conduct a comprehensive analysis to push their limits for efficient LLM serving. Our analysis shows that existing ultra low-bit BFP variants struggle to provide reasonable language model performance due to outlier values in blocks. To address the outliers with BFPs, we propose MX+, a cost-effective and non-intrusive extension designed for seamless integration into the microscaling (MX) formats. MX+ builds on the key insight that the outlier does not need to use its exponent field in the element data type, which allows us to repurpose the exponent field as an extended mantissa to increase the precision of the outlier element. Our evaluation shows that MX+ achieves significantly higher model performance compared to the 4-bit MX format (MXFP4) with negligible storage overhead and slowdown, thus offering a compelling alternative to MXFP4 or MXFP6 for efficient LLM inference.

[503] Redundancy-Aware Test-Time Graph Out-of-Distribution Detection

Yue Hou, He Zhu, Ruomei Liu, Yingke Su, Junran Wu, Ke Xu

Main category: cs.LG

TL;DR: RedOUT is an unsupervised framework that uses structural entropy to improve out-of-distribution (OOD) detection for graph classification by reducing structural redundancy through a Redundancy-aware Graph Information Bottleneck approach.

Details

Motivation: Existing graph OOD detection methods suffer from performance limitations due to structural redundancy that causes semantic shifts between training and test data distributions.

Method: Proposes RedOUT framework with Redundancy-aware Graph Information Bottleneck (ReGIB) that decomposes graph information into essential and redundant components, using structural entropy minimization to reduce redundancy with theoretically grounded optimization bounds.

Result: Achieves superior OOD detection performance with 6.7% average improvement and 17.3% improvement over best competitor on ClinTox/LIPO dataset pair.

Conclusion: RedOUT effectively addresses structural redundancy in graph OOD detection through structural entropy minimization and information bottleneck principles, demonstrating significant performance gains over existing methods.

Abstract: Distributional discrepancy between training and test data can lead models to make inaccurate predictions when encountering out-of-distribution (OOD) samples in real-world applications. Although existing graph OOD detection methods leverage data-centric techniques to extract effective representations, their performance remains compromised by structural redundancy that induces semantic shifts. To address this dilemma, we propose RedOUT, an unsupervised framework that integrates structural entropy into test-time OOD detection for graph classification. Concretely, we introduce the Redundancy-aware Graph Information Bottleneck (ReGIB) and decompose the objective into essential information and irrelevant redundancy. By minimizing structural entropy, the decoupled redundancy is reduced, and theoretically grounded upper and lower bounds are proposed for optimization. Extensive experiments on real-world datasets demonstrate the superior performance of RedOUT on OOD detection. Specifically, our method achieves an average improvement of 6.7%, significantly surpassing the best competitor by 17.3% on the ClinTox/LIPO dataset pair.

[504] State-Space Models for Tabular Prior-Data Fitted Networks

Felix Koch, Marcel Wever, Fabian Raisch, Benjamin Tischler

Main category: cs.LG

TL;DR: Hydra, a bidirectional linear-time structured state space model, is proposed as an efficient alternative to Transformers in TabPFN for tabular data, addressing order-dependence issues while maintaining competitive performance.

Details

Motivation: Transformers in foundation models like TabPFN have quadratic complexity with sequence length, motivating exploration of more efficient sequence models. SSMs offer linear-time efficiency but suffer from sensitivity to input token order, which is problematic for tabular data where row order is semantically meaningless.

Method: The study investigates using Hydra, a bidirectional linear-time structured state space model, as an alternative to Transformers in TabPFN. The bidirectional approach aims to preserve efficiency while enabling symmetric context aggregation to reduce order-dependence.

Result: Experiments show that the bidirectional Hydra approach reduces order-dependence and achieves predictive performance competitive with the original TabPFN Transformer model.

Conclusion: Hydra SSM provides an efficient alternative to Transformers for tabular data foundation models, successfully addressing the order-sensitivity problem through bidirectional processing while maintaining competitive predictive performance.

Abstract: Recent advancements in foundation models for tabular data, such as TabPFN, demonstrated that pretrained Transformer architectures can approximate Bayesian inference with high predictive performance. However, Transformers suffer from quadratic complexity with respect to sequence length, motivating the exploration of more efficient sequence models. In this work, we investigate the potential of using Hydra, a bidirectional linear-time structured state space model (SSM), as an alternative to Transformers in TabPFN. A key challenge lies in SSM’s inherent sensitivity to the order of input tokens - an undesirable property for tabular datasets where the row order is semantically meaningless. We investigate to what extent a bidirectional approach can preserve efficiency and enable symmetric context aggregation. Our experiments show that this approach reduces the order-dependence, achieving predictive performance competitive to the original TabPFN model.

[505] Selective Labeling with False Discovery Rate Control

Huipeng Huang, Wenbo Liao, Huajun Xi, Hao Zeng, Mengchen Zhao, Hongxin Wei

Main category: cs.LG

TL;DR: Conformal Labeling is a novel method that provides theoretical guarantees for AI-assigned labels by controlling false discovery rate (FDR), ensuring a predefined fraction of AI labels is correct.

Details

Motivation: High-quality human labeling is expensive, while AI labeling is cost-effective but suffers from unavoidable errors. Existing selective labeling methods lack theoretical guarantees on AI label quality.

Method: Construct conformal p-values by comparing AI models’ predicted confidence to calibration instances mislabeled by AI models, then select test instances with p-values below a data-dependent threshold.

Result: Extensive experiments show tight FDR control with high power across various tasks including image/text labeling and LLM QA.

Conclusion: Conformal Labeling provides provable trust in AI predictions by controlling FDR, offering a reliable solution for cost-effective labeling with quality guarantees.

Abstract: Obtaining high-quality labels for large datasets is expensive, requiring massive annotations from human experts. While AI models offer a cost-effective alternative by predicting labels, their label quality is compromised by the unavoidable labeling errors. Existing methods mitigate this issue through selective labeling, where AI labels a subset and human labels the remainder. However, these methods lack theoretical guarantees on the quality of AI-assigned labels, often resulting in unacceptably high labeling error within the AI-labeled subset. To address this, we introduce \textbf{Conformal Labeling}, a novel method to identify instances where AI predictions can be provably trusted. This is achieved by controlling the false discovery rate (FDR), the proportion of incorrect labels within the selected subset. In particular, we construct a conformal $p$-value for each test instance by comparing AI models’ predicted confidence to those of calibration instances mislabeled by AI models. Then, we select test instances whose $p$-values are below a data-dependent threshold, certifying AI models’ predictions as trustworthy. We provide theoretical guarantees that Conformal Labeling controls the FDR below the nominal level, ensuring that a predefined fraction of AI-assigned labels is correct on average. Extensive experiments demonstrate that our method achieves tight FDR control with high power across various tasks, including image and text labeling, and LLM QA.

[506] Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking

Daria Frolova, Talgat Daulbaev, Egor Sevryugov, Sergei A. Nikolenko, Dmitry N. Ivankov, Ivan Oseledets, Marina A. Pak

Main category: cs.LG

TL;DR: Matcha is a molecular docking pipeline that uses multi-stage flow matching with scoring and physical filtering to predict protein-ligand binding poses, achieving superior accuracy and speed compared to existing methods.

Details

Motivation: Existing protein-ligand docking methods struggle to balance speed, accuracy, and physical plausibility, creating a need for improved approaches in structure-based drug design.

Method: Three-stage flow matching pipeline operating on geometric spaces (R³, SO(3), SO(2)) with learned scoring and unsupervised physical validity filtering to refine docking predictions.

Result: Superior performance on Astex and PDBbind test sets in docking success rate and physical plausibility, with approximately 25x faster speed than modern co-folding models.

Conclusion: Matcha provides an effective solution for accurate and fast protein-ligand binding pose prediction with enhanced physical plausibility for drug design applications.

Abstract: Accurate prediction of protein-ligand binding poses is crucial for structure-based drug design, yet existing methods struggle to balance speed, accuracy, and physical plausibility. We introduce Matcha, a novel molecular docking pipeline that combines multi-stage flow matching with learned scoring and physical validity filtering. Our approach consists of three sequential stages applied consecutively to refine docking predictions, each implemented as a flow matching model operating on appropriate geometric spaces ($\mathbb{R}^3$, $\mathrm{SO}(3)$, and $\mathrm{SO}(2)$). We enhance the prediction quality through a dedicated scoring model and apply unsupervised physical validity filters to eliminate unrealistic poses. Compared to various approaches, Matcha demonstrates superior performance on Astex and PDBbind test sets in terms of docking success rate and physical plausibility. Moreover, our method works approximately 25 times faster than modern large-scale co-folding models. The model weights and inference code to reproduce our results are available at https://github.com/LigandPro/Matcha.

[507] Multimodal RAG for Unstructured Data:Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval

Rashmi R, Vidyadhar Upadhya

Main category: cs.LG

TL;DR: MAHA is a modality-aware hybrid retrieval architecture that combines dense vector retrieval with structured graph traversal for multimodal question answering, achieving superior performance over baseline methods.

Details

Motivation: Current RAG systems are limited to unimodal textual data, making them ineffective for unstructured multimodal documents that contain text, images, tables, equations, and graphs with unique information.

Method: MAHA integrates dense vector retrieval with structured graph traversal using a modality-aware knowledge graph that encodes cross-modal semantics and relationships.

Result: MAHA substantially outperforms baseline methods on multiple benchmark datasets, achieving a ROUGE-L score of 0.486 with complete modality coverage.

Conclusion: MAHA establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.

Abstract: Current Retrieval-Augmented Generation (RAG) systems primarily operate on unimodal textual data, limiting their effectiveness on unstructured multimodal documents. Such documents often combine text, images, tables, equations, and graphs, each contributing unique information. In this work, we present a Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. This design enables both semantically rich and context-aware retrieval across diverse modalities. Evaluations on multiple benchmark datasets demonstrate that MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of 0.486, providing complete modality coverage. These results highlight MAHA’s ability to combine embeddings with explicit document structure, enabling effective multimodal retrieval. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.

[508] First Attentions Last: Better Exploiting First Attentions for Efficient Transformer Training

Gyudong Kim, Hyukju Na, Jin Hyeon Kim, Hyunsung Jang, Jaemin Park, Jaegi Hwang, Namkoo Ha, Seungryong Kim, Young Geun Kim

Main category: cs.LG

TL;DR: FAL is an efficient transformer architecture that eliminates per-block MHA-MLP connections to reduce communication overhead in distributed training, achieving up to 44% training time reduction and better perplexity than baseline GPT.

Details

Motivation: Existing transformer designs suffer from significant communication overhead in Tensor Parallelism, particularly the all-reduce communication required for each block's MHA-MLP connection.

Method: Proposes FAL architecture that redirects first MHA output to MLP inputs of following layers, eliminating per-block MHA-MLP connections and enabling parallel MHA-MLP execution on single GPU. FAL+ adds normalized first attention output to subsequent MHA outputs to augment MLP input.

Result: FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18x, and achieves better perplexity than baseline GPT. FAL+ achieves even lower perplexity without increasing training time.

Conclusion: The proposed FAL architecture effectively reduces communication overhead in distributed transformer training while maintaining or improving model quality, making it a practical solution for efficient large-scale transformer training.

Abstract: As training billion-scale transformers becomes increasingly common, employing multiple distributed GPUs along with parallel training methods has become a standard practice. However, existing transformer designs suffer from significant communication overhead, especially in Tensor Parallelism (TP), where each block’s MHA-MLP connection requires an all-reduce communication. Through our investigation, we show that the MHA-MLP connections can be bypassed for efficiency, while the attention output of the first layer can serve as an alternative signal for the bypassed connection. Motivated by the observations, we propose FAL (First Attentions Last), an efficient transformer architecture that redirects the first MHA output to the MLP inputs of the following layers, eliminating the per-block MHA-MLP connections. This removes the all-reduce communication and enables parallel execution of MHA and MLP on a single GPU. We also introduce FAL+, which adds the normalized first attention output to the MHA outputs of the following layers to augment the MLP input for the model quality. Our evaluation shows that FAL reduces multi-GPU training time by up to 44%, improves single-GPU throughput by up to 1.18x, and achieves better perplexity compared to the baseline GPT. FAL+ achieves even lower perplexity without increasing the training time than the baseline.

[509] LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching

Zhuo Cao, Xuan Zhao, Lena Krieger, Hanno Scharr, Ira Assent

Main category: cs.LG

TL;DR: LeapFactual is a novel counterfactual explanation algorithm using conditional flow matching to generate reliable counterfactuals, overcoming limitations of existing methods like gradient vanishing and decision boundary misalignment.

Details

Motivation: The need for interpretable ML/AI models in high-stakes domains like healthcare and scientific research, where current counterfactual methods suffer from gradient vanishing, discontinuous latent spaces, and decision boundary misalignment issues.

Method: LeapFactual uses conditional flow matching to generate counterfactual explanations, is model-agnostic (works with non-differentiable models), and can handle human-in-the-loop systems.

Result: Extensive experiments show LeapFactual generates accurate, in-distribution counterfactuals that provide actionable insights. The reliable counterfactuals can also be used as new training data to enhance model performance.

Conclusion: LeapFactual is broadly applicable and enhances both scientific knowledge discovery and non-expert interpretability, expanding counterfactual explanations to domains requiring human participation like citizen science.

Abstract: The growing integration of machine learning (ML) and artificial intelligence (AI) models into high-stakes domains such as healthcare and scientific research calls for models that are not only accurate but also interpretable. Among the existing explainable methods, counterfactual explanations offer interpretability by identifying minimal changes to inputs that would alter a model’s prediction, thus providing deeper insights. However, current counterfactual generation methods suffer from critical limitations, including gradient vanishing, discontinuous latent spaces, and an overreliance on the alignment between learned and true decision boundaries. To overcome these limitations, we propose LeapFactual, a novel counterfactual explanation algorithm based on conditional flow matching. LeapFactual generates reliable and informative counterfactuals, even when true and learned decision boundaries diverge. Following a model-agnostic approach, LeapFactual is not limited to models with differentiable loss functions. It can even handle human-in-the-loop systems, expanding the scope of counterfactual explanations to domains that require the participation of human annotators, such as citizen science. We provide extensive experiments on benchmark and real-world datasets showing that LeapFactual generates accurate and in-distribution counterfactual explanations that offer actionable insights. We observe, for instance, that our reliable counterfactual samples with labels aligning to ground truth can be beneficially used as new training data to enhance the model. The proposed method is broadly applicable and enhances both scientific knowledge discovery and non-expert interpretability.

[510] Galaxy Morphology Classification with Counterfactual Explanation

Zhuo Cao, Lena Krieger, Hanno Scharr, Ira Assent

Main category: cs.LG

TL;DR: The paper proposes extending an encoder-decoder architecture with invertible flow to improve galaxy morphology classification by providing interpretable counterfactual explanations alongside good predictive performance.

Details

Motivation: Current machine learning approaches for galaxy morphology classification lack interpretability, making it difficult to understand how models work and explain their results, which is problematic for large astronomical datasets.

Method: Extend classical encoder-decoder architecture with invertible flow to enable counterfactual explanations while maintaining good predictive performance.

Result: The proposed approach achieves good predictive performance for galaxy morphology classification while providing additional interpretability through counterfactual explanations.

Conclusion: The integration of invertible flow with encoder-decoder architecture successfully addresses the interpretability limitations of traditional machine learning methods for galaxy morphology analysis.

Abstract: Galaxy morphologies play an essential role in the study of the evolution of galaxies. The determination of morphologies is laborious for a large amount of data giving rise to machine learning-based approaches. Unfortunately, most of these approaches offer no insight into how the model works and make the results difficult to understand and explain. We here propose to extend a classical encoder-decoder architecture with invertible flow, allowing us to not only obtain a good predictive performance but also provide additional information about the decision process with counterfactual explanations.

[511] Geometric Moment Alignment for Domain Adaptation via Siegel Embeddings

Shayan Gharib, Marcelo Hartmann, Arto Klami

Main category: cs.LG

TL;DR: A Riemannian geometry approach for unsupervised domain adaptation using Siegel embeddings to represent moments as SPD matrices, enabling simultaneous alignment of first- and second-order moments with natural geometric distances.

Details

Motivation: Existing domain adaptation methods use ad-hoc similarity measures for moment alignment, lacking principled geometric foundations for distribution matching.

Method: Propose using Siegel embeddings to represent first- and second-order moments as symmetric positive definite matrices, then align distributions using Riemannian distances on the SPD manifold.

Result: Validated on image denoising and classification benchmarks, showing improved adaptation through principled geometric alignment.

Conclusion: Riemannian geometry provides a more faithful metric for domain adaptation by preserving distribution structure through simultaneous moment alignment on SPD manifolds.

Abstract: We address the problem of distribution shift in unsupervised domain adaptation with a moment-matching approach. Existing methods typically align low-order statistical moments of the source and target distributions in an embedding space using ad-hoc similarity measures. We propose a principled alternative that instead leverages the intrinsic geometry of these distributions by adopting a Riemannian distance for this alignment. Our key novelty lies in expressing the first- and second-order moments as a single symmetric positive definite (SPD) matrix through Siegel embeddings. This enables simultaneous adaptation of both moments using the natural geometric distance on the shared manifold of SPD matrices, preserving the mean and covariance structure of the source and target distributions and yielding a more faithful metric for cross-domain comparison. We connect the Riemannian manifold distance to the target-domain error bound, and validate the method on image denoising and image classification benchmarks. Our code is publicly available at https://github.com/shayangharib/GeoAdapt.

[512] Online Reliable Anomaly Detection via Neuromorphic Sensing and Communications

Junya Shiraishi, Jiechen Chen, Osvaldo Simeone, Petar Popovski

Main category: cs.LG

TL;DR: A low-power online anomaly detection framework using neuromorphic wireless sensor networks with FDR-controlled detection and dynamic sensor querying optimization.

Details

Motivation: Need for low-power, reliable anomaly detection in applications like brain-machine interfaces and environmental monitoring, where traditional continuous sensing is inefficient.

Method: Event-driven neuromorphic sensors produce spikes for relevant changes; central reader queries subset of sensors using IR transmissions; online hypothesis testing with e-values for FDR control; dynamic sensor querying as multi-armed bandit problem.

Result: Reliable anomaly detection under stringent FDR requirements, efficient sensor communication scheduling, and low detection latency.

Conclusion: The proposed framework effectively balances detection reliability, power efficiency, and latency in neuromorphic sensor networks.

Abstract: This paper proposes a low-power online anomaly detection framework based on neuromorphic wireless sensor networks, encompassing possible use cases such as brain-machine interfaces and remote environmental monitoring. In the considered system, a central reader node actively queries a subset of neuromorphic sensor nodes (neuro-SNs) at each time frame. The neuromorphic sensors are event-driven, producing spikes in correspondence to relevant changes in the monitored system. The queried neuro-SNs respond to the reader with impulse radio (IR) transmissions that directly encode the sensed local events. The reader processes these event-driven signals to determine whether the monitored environment is in a normal or anomalous state, while rigorously controlling the false discovery rate (FDR) of detections below a predefined threshold. The proposed approach employs an online hypothesis testing method with e-values to maintain FDR control without requiring knowledge of the anomaly rate, and it dynamically optimizes the sensor querying strategy by casting it as a best-arm identification problem in a multi-armed bandit framework. Extensive performance evaluation demonstrates that the proposed method can reliably detect anomalies under stringent FDR requirements, while efficiently scheduling sensor communications and achieving low detection latency.

[513] FedPPA: Progressive Parameter Alignment for Personalized Federated Learning

Maulidi Adi Prasetia, Muhamad Risqi U. Saputra, Guntur Dharma Putra

Main category: cs.LG

TL;DR: FedPPA is a personalized federated learning method that progressively aligns client model weights with the global model while preserving local knowledge, addressing both model and data heterogeneity in non-IID settings.

Details

Motivation: Existing PFL approaches overlook the coexistence of model and data heterogeneity from clients with diverse computational capabilities, which poses challenges in real-world federated learning scenarios.

Method: Progressive Parameter Alignment (FedPPA) that aligns weights of common layers across clients with global model weights, plus entropy-based weighted averaging to enhance global model performance while maintaining personalization.

Result: Experiments on MNIST, FMNIST, and CIFAR-10 datasets show FedPPA consistently outperforms existing FL algorithms in personalized adaptation.

Conclusion: FedPPA effectively addresses model and data heterogeneity in federated learning, achieving superior personalized performance while maintaining robust global model characteristics.

Abstract: Federated Learning (FL) is designed as a decentralized, privacy-preserving machine learning paradigm that enables multiple clients to collaboratively train a model without sharing their data. In real-world scenarios, however, clients often have heterogeneous computational resources and hold non-independent and identically distributed data (non-IID), which poses significant challenges during training. Personalized Federated Learning (PFL) has emerged to address these issues by customizing models for each client based on their unique data distribution. Despite its potential, existing PFL approaches typically overlook the coexistence of model and data heterogeneity arising from clients with diverse computational capabilities. To overcome this limitation, we propose a novel method, called Progressive Parameter Alignment (FedPPA), which progressively aligns the weights of common layers across clients with the global model’s weights. Our approach not only mitigates inconsistencies between global and local models during client updates, but also preserves client’s local knowledge, thereby enhancing personalization robustness in non-IID settings. To further enhance the global model performance while retaining strong personalization, we also integrate entropy-based weighted averaging into the FedPPA framework. Experiments on three image classification datasets, including MNIST, FMNIST, and CIFAR-10, demonstrate that FedPPA consistently outperforms existing FL algorithms, achieving superior performance in personalized adaptation.

[514] Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling

Alexandru Meterez, Depen Morwani, Jingfeng Wu, Costin-Andrei Oncescu, Cengiz Pehlevan, Sham Kakade

Main category: cs.LG

TL;DR: Seesaw is a principled batch-size scheduling method that replaces learning rate decay with batch size increases while adjusting learning rate by 1/√2, reducing wall-clock time by ~36% while maintaining performance.

Details

Motivation: Batch size ramping can accelerate LLM pretraining, but optimal strategies for adaptive optimizers like Adam are unclear, leading to heuristic tuning. A principled framework is needed.

Method: Seesaw replaces standard learning rate decay: instead of halving LR, it multiplies LR by 1/√2 and doubles batch size, preserving loss dynamics while reducing serial steps.

Result: On 150M/300M/600M-parameter models at Chinchilla scale, Seesaw matches cosine decay performance at equal FLOPs while reducing wall-clock time by ~36%, approaching theoretical limits.

Conclusion: Seesaw provides a principled batch-size scheduling framework that significantly accelerates training while maintaining performance, with theoretical foundations for SGD and normalized SGD.

Abstract: Increasing the batch size during training – a ‘‘batch ramp’’ – is a promising strategy to accelerate large language model pretraining. While for SGD, doubling the batch size can be equivalent to halving the learning rate, the optimal strategy for adaptive optimizers like Adam is less clear. As a result, any batch-ramp scheduling, if used at all, is typically tuned heuristically. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size, preserving loss dynamics while reducing serial steps. Theoretically, we provide, to our knowledge, the first finite-sample proof of equivalence between learning-rate decay and batch-size ramp-up for SGD on noisy linear regression, and we extend this equivalence to normalized SGD, a tractable proxy for Adam, under a variance-dominated regime observed in practice. Empirically, on 150M/300M/600M-parameter models trained at Chinchilla scale using a constant (critical) batch size, Seesaw matches cosine decay at equal FLOPs while reducing wall-clock time by $\approx 36%$, approaching the theoretical limit implied by our analysis.

[515] Tawa: Automatic Warp Specialization for Modern GPUs with Asynchronous References

Hongzheng Chen, Bin Fan, Alexander Collins, Bastian Hagedorn, Evghenii Gaburov, Masahiro Masuda, Matthew Brookhart, Chris Sullivan, Jason Knight, Zhiru Zhang, Vinod Grover

Main category: cs.LG

TL;DR: Tawa is an automated compiler that generates high-performance, warp-specialized GPU code from high-level tile-based programs, achieving speedups over optimized libraries while reducing programming effort.

Details

Motivation: The conventional SIMT programming model is misaligned with modern GPU's task-parallel hardware, creating a programmability gap where manual warp specialization is labor-intensive and error-prone.

Method: Tawa uses a novel IR abstraction called asynchronous references (aref) to express warp-level communication without exposing low-level hardware details, automatically partitioning programs into producer-consumer roles and managing dataflow pipelines.

Result: On NVIDIA H100 GPUs, Tawa achieves up to 1.1× speedup over cuBLAS GEMM kernels and 1.2× speedup over Triton for attention workloads, matching hand-optimized CUTLASS FlashAttention-3 performance with less programming effort.

Conclusion: Tawa successfully bridges the programmability gap by automating warp specialization, enabling high hardware utilization without requiring developers to manually orchestrate complex low-level communication.

Abstract: Modern GPUs feature specialized hardware units that enable high-performance, asynchronous dataflow execution. However, the conventional SIMT programming model is fundamentally misaligned with this task-parallel hardware, creating a significant programmability gap. While hardware-level warp specialization is the key to unlocking peak performance, it forces developers to manually orchestrate complex, low-level communication and software pipelines–a process that is labor-intensive, error-prone, and unsustainable. To address this challenge, we present Tawa, an automated compiler that systematically generates high-performance, warp-specialized code from a high-level, tile-based program. Central to our approach is a novel IR abstraction, asynchronous references (aref), which expresses warp-level communication without exposing low-level hardware details. Using this abstraction, Tawa automatically partitions programs into producer-consumer roles and manages the intricate dataflow pipeline, relieving developers of invasive kernel rewriting. Evaluation on NVIDIA H100 GPUs across representative LLM kernels shows that Tawa delivers high hardware utilization, achieving up to 1.1$\times$ speedup over highly optimized cuBLAS GEMM kernels. For attention workloads, Tawa attains 1.2$\times$ speedup over Triton and matches the performance of the hand-optimized CUTLASS C++ FlashAttention-3 kernel with far less programming effort.

[516] The Pursuit of Diversity: Multi-Objective Testing of Deep Reinforcement Learning Agents

Antony Bartlett, Cynthia Liem, Annibale Panichella

Main category: cs.LG

TL;DR: INDAGO-Nexus is a multi-objective search approach that discovers diverse failure scenarios in DRL agents by optimizing for both failure likelihood and test diversity, outperforming single-objective methods.

Details

Motivation: Existing tools like INDAGO focus only on maximizing failure counts without ensuring diversity of discovered scenarios or revealing distinct error types in safety-critical DRL testing.

Method: Uses multi-objective evolutionary algorithms with multiple diversity metrics and Pareto front selection strategies to jointly optimize for failure likelihood and test scenario diversity.

Result: INDAGO-Nexus discovers up to 83% and 40% more unique failures than INDAGO in SDC and Parking scenarios respectively, while reducing time-to-failure by up to 67% across all tested agents.

Conclusion: Multi-objective optimization approach significantly improves test effectiveness and efficiency for discovering diverse failure scenarios in DRL agents compared to single-objective methods.

Abstract: Testing deep reinforcement learning (DRL) agents in safety-critical domains requires discovering diverse failure scenarios. Existing tools such as INDAGO rely on single-objective optimization focused solely on maximizing failure counts, but this does not ensure discovered scenarios are diverse or reveal distinct error types. We introduce INDAGO-Nexus, a multi-objective search approach that jointly optimizes for failure likelihood and test scenario diversity using multi-objective evolutionary algorithms with multiple diversity metrics and Pareto front selection strategies. We evaluated INDAGO-Nexus on three DRL agents: humanoid walker, self-driving car, and parking agent. On average, INDAGO-Nexus discovers up to 83% and 40% more unique failures (test effectiveness) than INDAGO in the SDC and Parking scenarios, respectively, while reducing time-to-failure by up to 67% across all agents.

[517] Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Divyat Mahajan, Sachin Goyal, Badr Youbi Idrissi, Mohammad Pezeshki, Ioannis Mitliagkas, David Lopez-Paz, Kartik Ahuja

Main category: cs.LG

TL;DR: Future Summary Prediction (FSP) improves long-horizon reasoning by training models to predict compact representations of future content, outperforming both next-token and multi-token prediction methods.

Details

Motivation: Next-token prediction struggles with long-horizon reasoning, planning, and creative writing due to teacher-forced training limitations, while multi-token prediction only captures short-range dependencies.

Method: Proposes FSP with two variants: handcrafted summaries (e.g., bag of words) and learned summaries using embeddings from a reverse language model trained from right to left.

Result: Large-scale pretraining experiments with 3B and 8B-parameter models show FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.

Conclusion: FSP effectively addresses limitations of existing prediction methods by enabling models to capture long-term dependencies through future summary prediction.

Abstract: Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.

[518] Causal Discovery for Linear DAGs with Dependent Latent Variables via Higher-order Cumulants

Ming Cai, Penggang Gao, Hisayuki Hara

Main category: cs.LG

TL;DR: Proposes a novel algorithm for estimating causal DAGs in linear non-Gaussian acyclic models with latent confounders, allowing causal relationships among all variables.

Details

Motivation: Existing methods assume independent latent confounders or cannot handle causal relationships among observed variables, limiting their applicability to real-world scenarios.

Method: Leverages higher-order cumulants of observed data to identify causal structure, allowing causal relationships among latent variables, observed variables, and between them.

Result: Extensive simulations and real-world experiments demonstrate the validity and practical utility of the proposed algorithm.

Conclusion: The method successfully identifies causal DAGs in LvLiNGAM models with complex causal structures, overcoming limitations of existing approaches.

Abstract: This paper addresses the problem of estimating causal directed acyclic graphs in linear non-Gaussian acyclic models with latent confounders (LvLiNGAM). Existing methods assume mutually independent latent confounders or cannot properly handle models with causal relationships among observed variables. We propose a novel algorithm that identifies causal DAGs in LvLiNGAM, allowing causal structures among latent variables, among observed variables, and between the two. The proposed method leverages higher-order cumulants of observed data to identify the causal structure. Extensive simulations and experiments with real-world data demonstrate the validity and practical utility of the proposed algorithm.

[519] Active Jammer Localization via Acquisition-Aware Path Planning

Luis González-Gudiño, Mariona Jaramillo-Civill, Pau Closas, Tales Imbiriba

Main category: cs.LG

TL;DR: Active jammer localization using Bayesian optimization with acquisition-aware path planning for mobile agents in urban environments.

Details

Motivation: To overcome limitations of passive crowdsourced methods by adaptively guiding mobile agents to collect high-utility signal measurements while considering urban obstacles and mobility constraints.

Method: Modified A* algorithm (A-UCB*) that incorporates acquisition values into trajectory costs for high-acquisition path planning combined with Bayesian optimization.

Result: Achieves accurate jammer localization with fewer measurements compared to uninformed baselines, with consistent performance across different urban environments.

Conclusion: The proposed framework effectively combines Bayesian optimization with acquisition-aware path planning to enable efficient and accurate jammer localization in complex urban settings.

Abstract: We propose an active jammer localization framework that combines Bayesian optimization with acquisition-aware path planning. Unlike passive crowdsourced methods, our approach adaptively guides a mobile agent to collect high-utility Received Signal Strength measurements while accounting for urban obstacles and mobility constraints. For this, we modified the A* algorithm, A-UCB*, by incorporating acquisition values into trajectory costs, leading to high-acquisition planned paths. Simulations on realistic urban scenarios show that the proposed method achieves accurate localization with fewer measurements compared to uninformed baselines, demonstrating consistent performance under different environments.

[520] Rethinking Hebbian Principle: Low-Dimensional Structural Projection for Unsupervised Learning

Shikuang Deng, Jiayuan Zhang, Yuhang Wu, Ting Chen, Shi Gu

Main category: cs.LG

TL;DR: SPHeRe is a novel unsupervised Hebbian learning method that integrates orthogonality constraints and structural information preservation through an auxiliary nonlinear block, achieving SOTA performance on image classification benchmarks and showing strong effectiveness in continual learning and transfer learning scenarios.

Details

Motivation: Traditional Hebbian learning suffers from unconstrained connection updates and lack of feedback mediation, limiting its scaling to complex architectures. SPHeRe addresses these shortcomings by incorporating structural information preservation and orthogonality constraints.

Method: SPHeRe integrates orthogonality and structural information preservation through a local auxiliary nonlinear block. The loss for structural preservation backpropagates through an auxiliary lightweight projection serving as feedback mediation, while orthogonality constraints bound update magnitudes.

Result: SPHeRe achieves SOTA performance among unsupervised synaptic plasticity approaches on CIFAR-10, CIFAR-100, and Tiny-ImageNet. It also demonstrates strong effectiveness in continual learning, transfer learning, and image reconstruction tasks, showing robust and generalizable feature extraction.

Conclusion: This work demonstrates the competitiveness of Hebbian unsupervised learning in modern deep learning frameworks, showing the potential for efficient biologically inspired algorithms without strict dependence on backpropagation.

Abstract: Hebbian learning is a biological principle that intuitively describes how neurons adapt their connections through repeated stimuli. However, when applied to machine learning, it suffers serious issues due to the unconstrained updates of the connections and the lack of accounting for feedback mediation. Such shortcomings limit its effective scaling to complex network architectures and tasks. To this end, here we introduce the Structural Projection Hebbian Representation (SPHeRe), a novel unsupervised learning method that integrates orthogonality and structural information preservation through a local auxiliary nonlinear block. The loss for structural information preservation backpropagates to the input through an auxiliary lightweight projection that conceptually serves as feedback mediation while the orthogonality constraints account for the boundedness of updating magnitude. Extensive experimental results show that SPHeRe achieves SOTA performance among unsupervised synaptic plasticity approaches on standard image classification benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet. Furthermore, the method exhibits strong effectiveness in continual learning and transfer learning scenarios, and image reconstruction tasks show the robustness and generalizability of the extracted features. This work demonstrates the competitiveness and potential of Hebbian unsupervised learning rules within modern deep learning frameworks, demonstrating the possibility of efficient and biologically inspired learning algorithms without the strong dependence on strict backpropagation. Our code is available at https://github.com/brain-intelligence-lab/SPHeRe.

[521] Efficient Dynamic Structured Sparse Training with Learned Shuffles

Abhishek Tyagi, Arjun Iyer, Liam Young, William H Renninger, Christopher Kanan, Yuhao Zhu

Main category: cs.LG

TL;DR: PA-DST combines structured sparsity with learned permutations to match unstructured DST accuracy while achieving faster training and inference.

Details

Motivation: Structured sparsity is faster on GPUs but less accurate than unstructured DST due to limited expressivity from fixed patterns like blocks or N:M layouts.

Method: Learn a single permutation matrix per layer jointly with structured weight matrix, applied to block, N:M, and diagonal structures.

Result: Matches unstructured baselines (RigL, SET) at 90-95% sparsity on ImageNet-1K and WikiText-103, with 1.21x faster training and 2.9x faster inference.

Conclusion: Structure + learned permutation represents an optimal balance between accuracy and efficiency for sparse training.

Abstract: Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block or N:M layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Applied to three canonical structures – block, N:M, and diagonals – we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90–95% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to $1.21\times$ and infers up to $2.9\times$ faster. The results position structure + learned permutation as a sweet spot between accuracy and efficiency.

[522] Tackling Time-Series Forecasting Generalization via Mitigating Concept Drift

Zhiyuan Zhao, Haoxin Liu, B. Aditya Prakash

Main category: cs.LG

TL;DR: The paper proposes ShifTS, a method-agnostic framework that addresses both temporal shift and concept drift in time-series forecasting through a unified approach, using soft attention mechanisms to find invariant patterns.

Details

Motivation: Existing studies primarily focus on temporal shift but neglect proper concept drift methods for time-series forecasting, while conventional concept drift methods face challenges in this domain.

Method: ShifTS framework first tackles temporal shift then concept drift using soft attention mechanisms that find invariant patterns from both lookback and horizon time series.

Result: Extensive experiments show ShifTS consistently enhances forecasting accuracy across multiple datasets and outperforms existing concept drift, temporal shift, and combined baselines.

Conclusion: The proposed ShifTS framework effectively addresses both temporal shift and concept drift in time-series forecasting, demonstrating superior performance over existing methods.

Abstract: Time-series forecasting finds broad applications in real-world scenarios. Due to the dynamic nature of time series data, it is important for time-series forecasting models to handle potential distribution shifts over time. In this paper, we initially identify two types of distribution shifts in time series: concept drift and temporal shift. We acknowledge that while existing studies primarily focus on addressing temporal shift issues in time series forecasting, designing proper concept drift methods for time series forecasting has received comparatively less attention. Motivated by the need to address potential concept drift, while conventional concept drift methods via invariant learning face certain challenges in time-series forecasting, we propose a soft attention mechanism that finds invariant patterns from both lookback and horizon time series. Additionally, we emphasize the critical importance of mitigating temporal shifts as a preliminary to addressing concept drift. In this context, we introduce ShifTS, a method-agnostic framework designed to tackle temporal shift first and then concept drift within a unified approach. Extensive experiments demonstrate the efficacy of ShifTS in consistently enhancing the forecasting accuracy of agnostic models across multiple datasets, and outperforming existing concept drift, temporal shift, and combined baselines.

[523] Programmatic Representation Learning with Language Models

Gabriel Poesia, Georgia Gabriela Sampaio

Main category: cs.LG

TL;DR: Learned Programmatic Representations (LeaPR) models combine LLM-generated feature functions with decision trees to create interpretable, efficient predictors that compete with neural networks.

Details

Motivation: To bridge the gap between interpretable classical models (like decision trees) that need manual feature engineering and neural networks that learn representations automatically but lack interpretability and require specialized hardware.

Method: Two algorithms: 1) Adaptation of FunSearch to learn feature functions using LLMs, 2) Novel ID3 variant that generates features on demand during decision tree splitting. Features are represented as code functions synthesized by LLMs.

Result: Learned neural network-free predictors competitive with neural networks in chess position evaluation, image classification, and text classification tasks.

Conclusion: LeaPR provides a flexible paradigm for learning interpretable representations end-to-end where both features and predictions can be easily inspected and understood.

Abstract: Classical models for supervised machine learning, such as decision trees, are efficient and interpretable predictors, but their quality is highly dependent on the particular choice of input features. Although neural networks can learn useful representations directly from raw data (e.g., images or text), this comes at the expense of interpretability and the need for specialized hardware to run them efficiently. In this paper, we explore a hypothesis class we call Learned Programmatic Representations (LeaPR) models, which stack arbitrary features represented as code (functions from data points to scalars) and decision tree predictors. We synthesize feature functions using Large Language Models (LLMs), which have rich prior knowledge in a wide range of domains and a remarkable ability to write code using existing domain-specific libraries. We propose two algorithms to learn LeaPR models from supervised data. First, we design an adaptation of FunSearch to learn features rather than directly generate predictors. Then, we develop a novel variant of the classical ID3 algorithm for decision tree learning, where new features are generated on demand when splitting leaf nodes. In experiments from chess position evaluation to image and text classification, our methods learn high-quality, neural network-free predictors often competitive with neural networks. Our work suggests a flexible paradigm for learning interpretable representations end-to-end where features and predictions can be readily inspected and understood.

[524] To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models

Eran Malach, Omid Saremi, Sinead Williamson, Arwen Bradley, Aryo Lotfi, Emmanuel Abbe, Josh Susskind, Etai Littwin

Main category: cs.LG

TL;DR: SSMs have efficiency advantages but theoretical limitations in long-form generation, which can be overcome by adding tool access to achieve length generalization.

Details

Motivation: To address the theoretical limitation of SSMs in solving truly long-form generation problems despite their efficiency advantages over Transformers.

Method: Augment SSMs with interactive access to external tools and use problem-dependent training data to enable learning of tractable problems.

Result: Tool-augmented SSMs achieve remarkable length generalization on arithmetic, reasoning, and coding tasks.

Conclusion: SSMs with tool access can be an efficient alternative to Transformers in interactive tool-based and agentic settings, overcoming their theoretical limitations.

Abstract: State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any ``truly long-form’’ generation problem (in a sense we formally define), undermining their main competitive advantage. However, we show that this limitation can be mitigated by allowing SSMs interactive access to external tools. In fact, we show that given the right choice of tool access and problem-dependent training data, SSMs can learn to solve any tractable problem and generalize to arbitrary problem length/complexity (i.e., achieve length generalization). Following our theoretical finding, we demonstrate that tool-augmented SSMs achieve remarkable length generalization on a variety of arithmetic, reasoning, and coding tasks. These findings highlight SSMs as a potential efficient alternative to Transformers in interactive tool-based and agentic settings.

[525] Intelligent Dynamic Handover via AI-assisted Signal Quality Prediction in 6G Multi-RAT Networks

Maria Lamprini A. Bartsioka, Anastasios Giannopoulos, Sotirios Spantideas

Main category: cs.LG

TL;DR: This paper proposes a Machine Learning-assisted Predictive Conditional Handover (P-CHO) framework for 6G multi-RAT networks using LSTM networks to forecast signal quality and enable proactive handovers.

Details

Motivation: Current handover in multi-RAT networks is reactive and event-triggered, relying on instantaneous measurements, which is unreliable under fast channel dynamics, interference, and heterogeneous coverage in 6G networks.

Method: Proposes a P-CHO framework with RAT Steering Controller that standardizes data collection, parallel per-RAT predictions using RAT-aware LSTM networks, decision logic with hysteresis-based conditions, and CHO execution. Evaluates direct multi-step vs recursive P-CHO variants under different channel models.

Result: The hysteresis-enabled P-CHO scheme reduces handover failures and ping-pong events. The framework enables accurate, low-latency, and proactive handovers suitable for 6G multi-RAT deployments.

Conclusion: The proposed P-CHO framework can effectively enable proactive handovers in 6G multi-RAT networks, addressing the limitations of current reactive approaches.

Abstract: The emerging paradigm of 6G multiple Radio Access Technology (multi-RAT) networks, where cellular and Wireless Fidelity (WiFi) transmitters coexist, requires mobility decisions that remain reliable under fast channel dynamics, interference, and heterogeneous coverage. Handover in multi-RAT deployments is still highly reactive and event-triggered, relying on instantaneous measurements and threshold events. This work proposes a Machine Learning (ML)-assisted Predictive Conditional Handover (P-CHO) framework based on a model-driven and short-horizon signal quality forecasts. We present a generalized P-CHO sequence workflow orchestrated by a RAT Steering Controller, which standardizes data collection, parallel per-RAT predictions, decision logic with hysteresis-based conditions, and CHO execution. Considering a realistic multi-RAT environment, we train RAT-aware Long Short Term Memory (LSTM) networks to forecast the signal quality indicators of mobile users along randomized trajectories. The proposed P-CHO models are trained and evaluated under different channel models for cellular and IEEE 802.11 WiFi integrated coverage. We study the impact of hyperparameter tuning of LSTM models under different system settings, and compare direct multi-step versus recursive P-CHO variants. Comparisons against baseline predictors are also carried out. Finally, the proposed P-CHO is tested under soft and hard handover settings, showing that hysteresis-enabled P-CHO scheme is able to reduce handover failures and ping-pong events. Overall, the proposed P-CHO framework can enable accurate, low-latency, and proactive handovers suitable for ML-assisted handover steering in 6G multi-RAT deployments.

[526] Reinforcement Learning with Stochastic Reward Machines

Jan Corazza, Ivan Gavran, Daniel Neider

Main category: cs.LG

TL;DR: This paper introduces stochastic reward machines to handle noisy rewards in reinforcement learning, along with a constraint-solving based algorithm that learns minimal stochastic reward machines from agent explorations.

Details

Motivation: Existing reward machine learning algorithms assume noise-free rewards, which is an overly idealized setting that limits practical applications in real-world scenarios with noisy reward functions.

Method: The authors introduce stochastic reward machines and develop a constraint-solving based algorithm that learns minimal stochastic reward machines from reinforcement learning agent explorations. This approach can be paired with existing RL algorithms for reward machines.

Result: The algorithm guarantees convergence to an optimal policy in the limit and outperforms both existing methods and naive approaches for handling noisy reward functions in two case studies.

Conclusion: Stochastic reward machines and the proposed learning algorithm effectively address the practical limitation of noisy rewards in reinforcement learning, demonstrating superior performance compared to existing approaches.

Abstract: Reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. To overcome this practical limitation, we introduce a novel type of reward machines, called stochastic reward machines, and an algorithm for learning them. Our algorithm, based on constraint solving, learns minimal stochastic reward machines from the explorations of a reinforcement learning agent. This algorithm can easily be paired with existing reinforcement learning algorithms for reward machines and guarantees to converge to an optimal policy in the limit. We demonstrate the effectiveness of our algorithm in two case studies and show that it outperforms both existing methods and a naive approach for handling noisy reward functions.

[527] Provable Unlearning with Gradient Ascent on Two-Layer ReLU Neural Networks

Odelia Melamed, Gilad Yehudai, Gal Vardi

Main category: cs.LG

TL;DR: Theoretical analysis shows gradient ascent can effectively remove specific data from trained models by reversing gradient descent’s influence, achieving results close to retraining while preserving generalization.

Details

Motivation: Address privacy and ethical concerns by enabling removal of specific data from trained models without full retraining from scratch.

Method: Use gradient ascent to reverse influence of specific data points, leveraging gradient descent’s implicit bias toward KKT conditions of margin maximization problems.

Result: For linear models and two-layer neural networks with high-dimensional data, properly scaled gradient ascent satisfies the proposed (ε,δ,τ)-successful unlearning criterion and approximates retrained solutions.

Conclusion: Gradient ascent is an effective unlearning method that removes specific data while preserving model performance on retained data and maintaining generalization capabilities.

Abstract: Machine Unlearning aims to remove specific data from trained models, addressing growing privacy and ethical concerns. We provide a theoretical analysis of a simple and widely used method - gradient ascent - used to reverse the influence of a specific data point without retraining from scratch. Leveraging the implicit bias of gradient descent towards solutions that satisfy the Karush-Kuhn-Tucker (KKT) conditions of a margin maximization problem, we quantify the quality of the unlearned model by evaluating how well it satisfies these conditions w.r.t. the retained data. To formalize this idea, we propose a new success criterion, termed \textbf{$(\epsilon, \delta, \tau)$-successful} unlearning, and show that, for both linear models and two-layer neural networks with high dimensional data, a properly scaled gradient-ascent step satisfies this criterion and yields a model that closely approximates the retrained solution on the retained data. We also show that gradient ascent performs successful unlearning while still preserving generalization in a synthetic Gaussian-mixture setting.

[528] Backdoor Unlearning by Linear Task Decomposition

Amel Abdelraheem, Alessandro Favero, Gerome Bovet, Pascal Frossard

Main category: cs.LG

TL;DR: Foundation models are vulnerable to adversarial attacks and backdoors. This paper introduces a method to remove backdoors by exploiting their disentangled representation in weight space, achieving near-perfect unlearning while preserving 96% clean accuracy.

Details

Motivation: Foundation models are susceptible to adversarial perturbations and backdoor attacks, but retraining is impractical due to their scale. Existing removal methods degrade performance on other tasks. The paper investigates whether backdoors can be removed without compromising general capabilities.

Method: The method leverages the finding that backdoors are disentangled from benign tasks in the model weight space. It introduces a simple unlearning approach that isolates and erases the backdoor’s influence. Works with both known attacks and unknown attacks using reverse-engineered triggers.

Result: The method achieves approximately perfect unlearning of backdoors while retaining 96% of clean accuracy on average. It works effectively even when attack details are unknown, using reverse-engineered triggers. Outperforms state-of-the-art defenses in unlearning and clean accuracy tradeoffs.

Conclusion: Backdoors can be effectively removed from foundation models without compromising their general capabilities by exploiting the disentangled nature of backdoor representations in weight space. The proposed unlearning method provides a practical defense against both known and unknown backdoor attacks.

Abstract: Foundation models have revolutionized computer vision by enabling broad generalization across diverse tasks. Yet, they remain highly susceptible to adversarial perturbations and targeted backdoor attacks. Mitigating such vulnerabilities remains an open challenge, especially given that the large-scale nature of the models prohibits retraining to ensure safety. Existing backdoor removal approaches rely on costly fine-tuning to override the harmful behavior, and can often degrade performance on other unrelated tasks. This raises the question of whether backdoors can be removed without compromising the general capabilities of the models. In this work, we address this question and study how backdoors are encoded in the model weight space, finding that they are disentangled from other benign tasks. Specifically, this separation enables the isolation and erasure of the backdoor’s influence on the model with minimal impact on clean performance. Building on this insight, we introduce a simple unlearning method that leverages such disentanglement. Through extensive experiments with CLIP-based models and common adversarial triggers, we show that, given the knowledge of the attack, our method achieves approximately perfect unlearning, while retaining, on average, 96% of clean accuracy. Additionally, we demonstrate that even when the attack and its presence are unknown, our method successfully unlearns backdoors by proper estimation using reverse-engineered triggers. Overall, our method consistently yields better unlearning and clean accuracy tradeoffs when compared to present state-of-the-art defenses.

[529] Predicting kernel regression learning curves from only raw data statistics

Dhruva Karkada, Joseph Turnbull, Yuxi Liu, James B. Simon

Main category: cs.LG

TL;DR: The paper introduces the Hermite eigenstructure ansatz (HEA) framework to predict kernel regression learning curves on real datasets using only empirical data covariance and target function decomposition.

Details

Motivation: To develop a theoretical framework that can predict learning curves (test risk vs sample size) for kernel regression on real datasets without requiring complex computations, enabling end-to-end theory of learning from dataset structure to model performance.

Method: Proposes the Hermite eigenstructure ansatz (HEA) - an analytical approximation of kernel eigenvalues and eigenfunctions with respect to anisotropic data distributions, where eigenfunctions resemble Hermite polynomials. Proves HEA for Gaussian data and validates it empirically on real image datasets.

Result: HEA holds well on real image data (CIFAR-5m, SVHN, ImageNet) despite data not being perfectly Gaussian, enabling accurate learning curve predictions. Also finds that MLPs in feature-learning regime learn Hermite polynomials in the order predicted by HEA.

Conclusion: The HEA framework demonstrates that end-to-end theory mapping dataset structure to model performance is possible for non-trivial learning algorithms on real datasets, providing a proof of concept for comprehensive learning theories.

Abstract: We study kernel regression with common rotation-invariant kernels on real datasets including CIFAR-5m, SVHN, and ImageNet. We give a theoretical framework that predicts learning curves (test risk vs. sample size) from only two measurements: the empirical data covariance matrix and an empirical polynomial decomposition of the target function $f_*$. The key new idea is an analytical approximation of a kernel’s eigenvalues and eigenfunctions with respect to an anisotropic data distribution. The eigenfunctions resemble Hermite polynomials of the data, so we call this approximation the Hermite eigenstructure ansatz (HEA). We prove the HEA for Gaussian data, but we find that real image data is often “Gaussian enough” for the HEA to hold well in practice, enabling us to predict learning curves by applying prior results relating kernel eigenstructure to test risk. Extending beyond kernel regression, we empirically find that MLPs in the feature-learning regime learn Hermite polynomials in the order predicted by the HEA. Our HEA framework is a proof of concept that an end-to-end theory of learning which maps dataset structure all the way to model performance is possible for nontrivial learning algorithms on real datasets.

[530] Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards

Sarah Liaw, Benjamin Plaut

Main category: cs.LG

TL;DR: This paper proposes a cautious bandit algorithm for high-stakes AI applications where errors can cause irreparable damage, using an abstain option and trusted regions to avoid harmful actions.

Details

Motivation: Standard bandit algorithms assume all errors are recoverable, but in high-stakes applications, single actions can cause irreparable damage. Existing mentor-based approaches may not always be available.

Method: A two-action contextual bandit with abstain option, where the agent either abstains (0 reward) or commits (executes task policy). The algorithm learns a trusted region and only commits where evidence doesn’t certify harm, using cautious exploration.

Result: The proposed algorithm achieves sublinear regret guarantees under i.i.d. inputs, demonstrating effective safe deployment in high-stakes environments.

Conclusion: Cautious exploration through trusted regions and abstain options enables safe learning in high-stakes AI applications where irreparable damage is possible, without requiring constant mentor supervision.

Abstract: In high-stakes AI applications, even a single action can cause irreparable damage. However, nearly all of sequential decision-making theory assumes that all errors are recoverable (e.g., by bounding rewards). Standard bandit algorithms that explore aggressively may cause irreparable damage when this assumption fails. Some prior work avoids irreparable errors by asking for help from a mentor, but a mentor may not always be available. In this work, we formalize a model of learning with unbounded rewards without a mentor as a two-action contextual bandit with an abstain option: at each round the agent observes an input and chooses either to abstain (always 0 reward) or to commit (execute a preexisting task policy). Committing yields rewards that are upper-bounded but can be arbitrarily negative, and the commit reward is assumed Lipschitz in the input. We propose a caution-based algorithm that learns when not to learn: it chooses a trusted region and commits only where the available evidence does not already certify harm. Under these conditions and i.i.d. inputs, we establish sublinear regret guarantees, theoretically demonstrating the effectiveness of cautious exploration for deploying learning agents safely in high-stakes environments.

[531] Reasoning with Sampling: Your Base Model is Smarter Than You Think

Aayush Karan, Yilun Du

Main category: cs.LG

TL;DR: This paper shows that comparable reasoning capabilities to RL-posttraining can be achieved from base LLMs through pure sampling at inference time using an MCMC-inspired iterative algorithm, without additional training.

Details

Motivation: To determine if comparable reasoning capabilities can be elicited from base models through pure sampling rather than requiring RL posttraining, addressing the question of whether novel behaviors truly emerge during RL or can be achieved through inference-time methods.

Method: Proposed a simple iterative sampling algorithm inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, leveraging base models’ own likelihoods without additional training.

Result: The algorithm provides substantial reasoning boosts that nearly match and even outperform RL-posttraining on tasks like MATH500, HumanEval, and GPQA, while avoiding the diversity collapse characteristic of RL-posttraining over multiple samples.

Conclusion: The method demonstrates broad applicability beyond easily verifiable domains as it requires no training, curated datasets, or verifier, suggesting that sophisticated reasoning capabilities can be achieved through inference-time sampling alone.

Abstract: Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilites can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models’ own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL-posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.

[532] Circuit Insights: Towards Interpretability Beyond Activations

Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, Sebastian Lapuschkin

Main category: cs.LG

TL;DR: WeightLens and CircuitLens are two new methods for neural network interpretability that analyze features through learned weights and circuit-level interactions, improving on activation-based approaches.

Details

Motivation: Existing interpretability methods rely on manual inspection, are limited to toy tasks, and automated approaches miss feature interactions while depending heavily on external LLMs and dataset quality.

Method: WeightLens interprets features directly from learned weights without explainer models or datasets. CircuitLens captures how feature activations arise from component interactions, revealing circuit-level dynamics.

Result: WeightLens matches or exceeds existing methods on context-independent features. CircuitLens reveals circuit-level dynamics that activation-only approaches cannot identify.

Conclusion: Together, these methods increase interpretability robustness and enhance scalable mechanistic analysis of circuits while maintaining efficiency and quality.

Abstract: The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection and remain limited to toy tasks. Automated interpretability offers scalability by analyzing isolated features and their activations, but it often misses interactions between features and depends strongly on external LLMs and dataset quality. Transcoders have recently made it possible to separate feature attributions into input-dependent and input-invariant components, providing a foundation for more systematic circuit analysis. Building on this, we propose WeightLens and CircuitLens, two complementary methods that go beyond activation-based analysis. WeightLens interprets features directly from their learned weights, removing the need for explainer models or datasets while matching or exceeding the performance of existing methods on context-independent features. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics that activation-only approaches cannot identify. Together, these methods increase interpretability robustness and enhance scalable mechanistic analysis of circuits while maintaining efficiency and quality.

[533] Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models

Jonas Geiping, Xinyu Yang, Guinan Su

Main category: cs.LG

TL;DR: The paper introduces a new diffusion forcing sampler that accelerates generation in recurrent-depth language models by enabling parallel token refinement through recurrence, achieving up to 5x speedup without tuning.

Details

Motivation: To explore the relationship between recurrent-depth models and diffusion language models, and leverage their similarities to develop more efficient generation methods.

Method: Developed a diffusion forcing sampler that decodes new tokens at every forward pass while allowing parallel refinement of latent states through recurrence, based on principles from diffusion literature.

Result: The sampler provides strictly more expressive generation than baseline autoregressive methods with the same time budget, achieving up to 5x speedup when applied to existing 3.5B recurrent-depth transformers without any tuning.

Conclusion: Recurrent-depth models can be naturally viewed as strong continuous, though causal, diffusion language models, and the proposed sampler provides an efficient mechanism for parallelizing extra computation at inference.

Abstract: Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.

[534] Identity-Link IRT for Label-Free LLM Evaluation: Preserving Additivity in TVD-MI Scores

Zachary Robertson

Main category: cs.LG

TL;DR: The paper shows that using identity link functions instead of logistic/probit transformations preserves the additive structure of TVD-MI pairwise comparisons for LLM evaluation, enabling more efficient ranking with fewer comparisons.

Details

Motivation: To develop a more efficient method for evaluating large language models using pairwise comparisons that preserves the geometric structure of TVD-MI measurements and reduces the number of required evaluations.

Method: Proposes using identity link functions with item-response theory instead of nonlinear logistic/probit links, derived from Gini entropy maximization, resulting in a box-constrained least-squares formulation that handles boundary saturation.

Result: Achieved holdout RMSE of 0.117 ± 0.008 with 33% coverage (three times fewer evaluations than full dense), preserved agent rankings (Spearman ρ = 0.972 ± 0.015), and showed strong judge agreement (ρ = 0.872) between GPT-4o-mini and Llama3-70b.

Conclusion: Identity mapping best preserves TVD-MI’s geometry for efficient LLM evaluation and is applicable to other bounded-response domains, outperforming traditional nonlinear link functions.

Abstract: Pairwise comparisons of large language models using total variation distance mutual information (TVD-MI) produce binary critic decisions per pair. We show that averaging TVD-MI’s binary trials yields centered-probability scores with additive structure suitable for item-response theory (IRT) without nonlinear link functions. Maximum-likelihood approaches to IRT use logistic links, but we find empirically that these transformations introduce curvature that breaks additivity: across three domains, the identity link yields median curl on raw data of 0.080-0.150 (P95 = [0.474, 0.580]), whereas probit/logit introduce substantially higher violations (median [0.245, 0.588], P95 [0.825, 2.252]). We derive this clipped-linear model from Gini entropy maximization, yielding a box-constrained least-squares formulation that handles boundary saturation. At 33% coverage, we achieve holdout RMSE $0.117 \pm 0.008$ while preserving agent rankings (Spearman $\rho = 0.972 \pm 0.015$), three times fewer evaluations than full dense. Judge robustness analysis (GPT-4o-mini vs. Llama3-70b) shows strong agreement in agent rankings ($\rho = 0.872$) and consistent identity-link advantage. TVD-MI’s geometry is best preserved by identity mapping for efficient LLM evaluation, applicable to other bounded-response domains.

[535] Biology-informed neural networks learn nonlinear representations from omics data to improve genomic prediction and interpretability

Katiana Kontolati, Rini Jasmine Gladstone, Ian Davis, Ethan Pickering

Main category: cs.LG

TL;DR: Biologically-informed neural networks (BINNs) improve genomic prediction and selection in crops by integrating SNPs with multi-omics data and biological knowledge, achieving up to 56% higher accuracy and identifying nonlinear biological relationships missed by traditional methods.

Details

Motivation: Traditional genotype-to-phenotype models have limited accuracy, requiring costly field trials. Models using intermediate molecular phenotypes are impractical for genomic selection since such data is unavailable during deployment. BINNs overcome this by using multi-omics data only during training while using genotype data alone during inference.

Method: BINNs encode pathway-level inductive biases and leverage multi-omics data during training, but use only genotype data during inference. Applied to maize gene-expression and multi-environment field-trial data, and tested with synthetic metabolomics benchmark.

Result: BINN improves rank-correlation accuracy by up to 56% within and across subpopulations under sparse-data conditions. Reduces prediction error by 75% relative to conventional neural nets in synthetic benchmark. Identifies genes that GWAS/TWAS fail to uncover and correctly identifies important nonlinear pathways.

Conclusion: BINNs learn biologically-relevant representations from genotype to phenotype and establish a framework that leverages domain knowledge to improve genomic prediction accuracy and reveal nonlinear biological relationships for genomic selection, candidate gene selection, and gene-editing prioritization.

Abstract: We extend biologically-informed neural networks (BINNs) for genomic prediction (GP) and selection (GS) in crops by integrating thousands of single-nucleotide polymorphisms (SNPs) with multi-omics measurements and prior biological knowledge. Traditional genotype-to-phenotype (G2P) models depend heavily on direct mappings that achieve only modest accuracy, forcing breeders to conduct large, costly field trials to maintain or marginally improve genetic gain. Models that incorporate intermediate molecular phenotypes such as gene expression can achieve higher predictive fit, but they remain impractical for GS since such data are unavailable at deployment or design time. BINNs overcome this limitation by encoding pathway-level inductive biases and leveraging multi-omics data only during training, while using genotype data alone during inference. Applied to maize gene-expression and multi-environment field-trial data, BINN improves rank-correlation accuracy by up to 56% within and across subpopulations under sparse-data conditions and nonlinearly identifies genes that GWAS/TWAS fail to uncover. With complete domain knowledge for a synthetic metabolomics benchmark, BINN reduces prediction error by 75% relative to conventional neural nets and correctly identifies the most important nonlinear pathway. Importantly, both cases show highly sensitive BINN latent variables correlate with the experimental quantities they represent, despite not being trained on them. This suggests BINNs learn biologically-relevant representations, nonlinear or linear, from genotype to phenotype. Together, BINNs establish a framework that leverages intermediate domain information to improve genomic prediction accuracy and reveal nonlinear biological relationships that can guide genomic selection, candidate gene selection, pathway enrichment, and gene-editing prioritization.

[536] pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation

Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein, Sai Bi

Main category: cs.LG

TL;DR: π-Flow is a policy-based flow model that predicts dynamic flow velocities for fast ODE integration, achieving state-of-the-art performance in few-step diffusion with better diversity and quality.

Details

Motivation: Existing few-step diffusion models suffer from a quality-diversity trade-off due to format mismatch between teacher and student models, leading to complex distillation procedures.

Method: π-Flow modifies student flow models to predict network-free policies that generate dynamic flow velocities for future substeps, enabling fast ODE integration without extra network evaluations. Uses imitation distillation with ℓ₂ flow matching loss.

Result: Achieves 1-NFE FID of 2.85 on ImageNet 256², outperforming MeanFlow. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, achieves substantially better diversity than state-of-the-art methods while maintaining teacher-level quality.

Conclusion: π-Flow enables stable and scalable training by simply mimicking teacher behavior, avoiding the quality-diversity trade-off in few-step diffusion models.

Abstract: Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration on these substeps without extra network evaluations. To match the policy’s ODE trajectory to the teacher’s, we introduce a novel imitation distillation approach, which matches the policy’s velocity to the teacher’s along the policy’s trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher’s behavior, $\pi$-Flow enables stable and scalable training and avoids the quality-diversity trade-off. On ImageNet 256$^2$, it attains a 1-NFE FID of 2.85, outperforming MeanFlow of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art few-step methods, while maintaining teacher-level quality.

[537] Why do explanations fail? A typology and discussion on failures in XAI

Clara Bove, Thibault Laugel, Marie-Jeanne Lesot, Charles Tijus, Marcin Detyniecki

Main category: cs.LG

TL;DR: This paper advocates for a holistic perspective on XAI limitations, proposing a typological framework to distinguish between system-specific and user-specific failures in explanation methods.

Details

Motivation: Existing XAI approaches often fail to meet expectations due to technical limitations or user misinterpretations, but current studies fail to capture the complex overlap of multiple failures in XAI systems.

Method: The authors propose a typological framework that distinguishes between system-specific and user-specific failures to reveal the nuanced complexities of explanation failures in XAI.

Result: The framework helps systematically investigate limitations of current XAI methods and their impact on explanation interpretation, providing a more comprehensive understanding of XAI failures.

Conclusion: The typology enables better understanding of XAI limitations and suggests research directions to enhance the quality of machine learning explanations by addressing both system and user-related issues.

Abstract: As Machine Learning models achieve unprecedented levels of performance, the XAI domain aims at making these models understandable by presenting end-users with intelligible explanations. Yet, some existing XAI approaches fail to meet expectations: several issues have been reported in the literature, generally pointing out either technical limitations or misinterpretations by users. In this paper, we argue that the resulting harms arise from a complex overlap of multiple failures in XAI, which existing ad-hoc studies fail to capture. This work therefore advocates for a holistic perspective, presenting a systematic investigation of limitations of current XAI methods and their impact on the interpretation of explanations. % By distinguishing between system-specific and user-specific failures, we propose a typological framework that helps revealing the nuanced complexities of explanation failures. Leveraging this typology, we discuss some research directions to help practitioners better understand the limitations of XAI systems and enhance the quality of ML explanations.

[538] Lost in the Averages: A New Specific Setup to Evaluate Membership Inference Attacks Against Machine Learning Models

Nataša Krčo, Florent Guépin, Matthieu Meeus, Bogdan Kulynych, Yves-Alexandre de Montjoye

Main category: cs.LG

TL;DR: The paper proposes a new ‘model-seeded’ privacy game for membership inference attacks that provides more accurate record-specific privacy risk estimates compared to the traditional approach, which averages risk across datasets.

Details

Motivation: Traditional membership inference attack evaluations average privacy risk across datasets, providing misleading estimates when specific models or synthetic datasets are released. This fails to account for how the specific dataset context affects individual record privacy.

Method: The authors propose using a leave-one-out game (called ‘model-seeded’ game) instead of the traditional privacy game. They formalize this approach and evaluate state-of-the-art MIAs for synthetic data generators across both privacy games on multiple datasets.

Result: The two privacy games yield different risk scores, with up to 94% of high-risk records being overlooked by the traditional game. Records in smaller datasets and models without strong differential privacy have larger gaps between risk estimates.

Conclusion: The model-seeded setup provides risk estimates specific to released models/synthetic datasets that align with standard privacy leakage notions, offering meaningful improvements over dataset-averaged risk from traditional approaches.

Abstract: Synthetic data generators and machine learning models can memorize their training data, posing privacy concerns. Membership inference attacks (MIAs) are a standard method of estimating the privacy risk of these systems. The risk of individual records is typically computed by evaluating MIAs in a record-specific privacy game. We analyze the record-specific privacy game commonly used for evaluating attackers under realistic assumptions (the \textit{traditional} game) – particularly for synthetic tabular data – and show that it averages a record’s privacy risk across datasets. We show this implicitly assumes the dataset a record is part of has no impact on the record’s risk, providing a misleading risk estimate when a specific model or synthetic dataset is released. Instead, we propose a novel use of the leave-one-out game, used in existing work exclusively to audit differential privacy guarantees, and call this the \textit{model-seeded} game. We formalize it and show that it provides an accurate estimate of the privacy risk posed by a given adversary for a record in its specific dataset. We instantiate and evaluate the state-of-the-art MIA for synthetic data generators in the traditional and model-seeded privacy games, and show across multiple datasets and models that the two privacy games indeed result in different risk scores, with up to 94% of high-risk records being overlooked by the traditional game. We further show that records in smaller datasets and models not protected by strong differential privacy guarantees tend to have a larger gap between risk estimates. Taken together, our results show that the model-seeded setup yields a risk estimate specific to a certain model or synthetic dataset released and in line with the standard notion of privacy leakage from prior work, meaningfully different from the dataset-averaged risk provided by the traditional privacy game.

[539] Kernel Neural Operators (KNOs) for Scalable, Memory-efficient, Geometrically-flexible Operator Learning

Matthew Lowery, John Turnage, Zachary Morrow, John D. Jakeman, Akil Narayan, Shandian Zhe, Varun Shankar

Main category: cs.LG

TL;DR: KNO is a convergent operator-learning architecture using deep kernel-based integral operators that decouples kernel choice from numerical integration, enabling flexible operator learning on irregular geometries with fewer parameters.

Details

Motivation: To create a geometrically-flexible operator learning method that can work on irregular domains while maintaining convergence guarantees and implementation simplicity of traditional kernel methods.

Method: Uses compositions of deep kernel-based integral operators, decouples kernel choice from quadrature schemes, employs domain-specific quadrature rules on irregular domains, and uses dimension-wise factorization on regular domains. Also introduces neural anisotropic kernels whose parameters are computed by neural networks.

Result: KNO achieves comparable or higher accuracy than popular operator learning techniques while using an order of magnitude fewer trainable parameters. More expressive kernels proved important for high accuracy.

Conclusion: KNO facilitates low-memory, geometrically-flexible deep operator learning while retaining the implementation simplicity and transparency of traditional kernel methods from both scientific computing and machine learning.

Abstract: This paper introduces the Kernel Neural Operator (KNO), a provably convergent operator-learning architecture that utilizes compositions of deep kernel-based integral operators for function-space approximation of operators (maps from functions to functions). The KNO decouples the choice of kernel from the numerical integration scheme (quadrature), thereby naturally allowing for operator learning with explicitly-chosen trainable kernels on irregular geometries. On irregular domains, this allows the KNO to utilize domain-specific quadrature rules. To help ameliorate the curse of dimensionality, we also leverage an efficient dimension-wise factorization algorithm on regular domains. More importantly, the ability to explicitly specify kernels also allows the use of highly expressive, non-stationary, neural anisotropic kernels whose parameters are computed by training neural networks. Numerical results demonstrate that on existing benchmarks the training and test accuracy of KNOs is comparable to or higher than popular operator learning techniques while typically using an order of magnitude fewer trainable parameters, with the more expressive kernels proving important to attaining high accuracy. KNOs thus facilitate low-memory, geometrically-flexible, deep operator learning, while retaining the implementation simplicity and transparency of traditional kernel methods from both scientific computing and machine learning.

[540] Boosting Graph Foundation Model from Structural Perspective

Yao Cheng, Yige Zhao, Jianxiang Yu, Xiang Li

Main category: cs.LG

TL;DR: BooG is a graph foundation model that constructs virtual super nodes to unify structural characteristics across different graph domains, using contrastive learning for pre-training to achieve strong generalization.

Details

Motivation: Existing graph foundation models rely on language models for semantic representations but ignore the unique structural characteristics of graphs from different domains, limiting their cross-domain generalization.

Method: Constructs virtual super nodes that fuse anchor node information and class labels, connects them via virtual edges to all nodes in their neighborhood, and uses contrastive learning for pre-training.

Result: Experimental results on various datasets and tasks demonstrate superior performance compared to existing methods.

Conclusion: BooG effectively unifies cross-domain structural characteristics and generalizes well to different domains and downstream tasks through its novel super node approach and contrastive learning objective.

Abstract: Graph foundation models have recently attracted significant attention due to its strong generalizability. Although existing methods resort to language models to learn unified semantic representations across domains, they disregard the unique structural characteristics of graphs from different domains. To address the problem, in this paper, we boost graph foundation model from structural perspective and propose BooG. The model constructs virtual super nodes to unify structural characteristics of graph data from different domains. Specifically, the super nodes fuse the information of anchor nodes and class labels, where each anchor node captures the information of a node or a graph instance to be classified. Instead of using the raw graph structure, we connect super nodes to all nodes within their neighborhood by virtual edges. This new structure allows for effective information aggregation while unifying cross-domain structural characteristics. Additionally, we propose a novel pre-training objective based on contrastive learning, which learns more expressive representations for graph data and generalizes effectively to different domains and downstream tasks. Experimental results on various datasets and tasks demonstrate the superior performance of BooG. We provide our code and data here: https://anonymous.4open.science/r/BooG-EE42/.

[541] Say My Name: a Model’s Bias Discovery Framework

Massimiliano Ciranni, Luca Molinaro, Carlo Alberto Barbano, Attilio Fiandrotti, Vittorio Murino, Vito Paolo Pastore, Enzo Tartaglione

Main category: cs.LG

TL;DR: SaMyNa is the first tool to semantically identify biases in deep learning models using a text-based pipeline that enhances explainability and supports debiasing during training or post-hoc validation.

Details

Motivation: Deep models often learn biased patterns from non-representative data, but existing unsupervised debiasing methods using pseudo-labels lack semantic interpretability for end users.

Method: A text-based pipeline that identifies biases learned by models, disentangles task-related information, and provides semantic information about bias features rather than just clustering-based pseudo-labels.

Result: Evaluation on traditional benchmarks shows SaMyNa effectively detects biases and can disclaim them, demonstrating broad applicability for model diagnosis.

Conclusion: SaMyNa provides an explainable approach to bias identification that supports debiasing efforts and serves as a valuable tool for analyzing model biases with semantic interpretability.

Abstract: In the last few years, due to the broad applicability of deep learning to downstream tasks and end-to-end training capabilities, increasingly more concerns about potential biases to specific, non-representative patterns have been raised. Many works focusing on unsupervised debiasing usually leverage the tendency of deep models to learn easier'' samples, for example by clustering the latent space to obtain bias pseudo-labels. However, the interpretation of such pseudo-labels is not trivial, especially for a non-expert end user, as it does not provide semantic information about the bias features. To address this issue, we introduce Say My Name’’ (SaMyNa), the first tool to identify biases within deep models semantically. Unlike existing methods, our approach focuses on biases learned by the model. Our text-based pipeline enhances explainability and supports debiasing efforts: applicable during either training or post-hoc validation, our method can disentangle task-related information and proposes itself as a tool to analyze biases. Evaluation on traditional benchmarks demonstrates its effectiveness in detecting biases and even disclaiming them, showcasing its broad applicability for model diagnosis.

[542] GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data

Gleb Bazhenov, Oleg Platonov, Liudmila Prokhorenkova

Main category: cs.LG

TL;DR: GraphLand is a new benchmark with 14 diverse graph datasets from industrial applications to address the narrow evaluation scope of current graph ML models and foundation models.

Details

Motivation: Current graph ML benchmarks are limited to narrow domains (mainly academic citation networks), which is problematic for evaluating graph foundation models that should transfer across diverse domains.

Method: Created GraphLand benchmark with 14 diverse graph datasets from industrial applications, enabling evaluation on graphs with varied sizes, structures, and features. Also compared GNNs with GBDT models using graph-based features.

Result: GBDT models with graph-based features can be strong baselines, and current graph foundation models fail to produce competitive results on the proposed diverse datasets.

Conclusion: GraphLand addresses the evaluation gap in graph ML and reveals limitations of current graph foundation models, highlighting the need for more comprehensive benchmarks and improved model generalization.

Abstract: Although data that can be naturally represented as graphs is widespread in real-world applications across diverse industries, popular graph ML benchmarks for node property prediction only cover a surprisingly narrow set of data domains, and graph neural networks (GNNs) are often evaluated on just a few academic citation networks. This issue is particularly pressing in light of the recent growing interest in designing graph foundation models. These models are supposed to be able to transfer to diverse graph datasets from different domains, and yet the proposed graph foundation models are often evaluated on a very limited set of datasets from narrow applications. To alleviate this issue, we introduce GraphLand: a benchmark of 14 diverse graph datasets for node property prediction from a range of different industrial applications. GraphLand allows evaluating graph ML models on a wide range of graphs with diverse sizes, structural characteristics, and feature sets, all in a unified setting. Further, GraphLand allows investigating such previously underexplored research questions as how realistic temporal distributional shifts under transductive and inductive settings influence graph ML model performance. To mimic realistic industrial settings, we use GraphLand to compare GNNs with gradient-boosted decision trees (GBDT) models that are popular in industrial applications and show that GBDTs provided with additional graph-based input features can sometimes be very strong baselines. Further, we evaluate currently available general-purpose graph foundation models and find that they fail to produce competitive results on our proposed datasets.

[543] Benchmarking drug-drug interaction prediction methods: a perspective of distribution changes

Zhenqian Shen, Mingyang Zhou, Yongqi Zhang, Quanming Yao

Main category: cs.LG

TL;DR: DDI-Ben is a benchmarking framework for emerging drug-drug interaction prediction that addresses distribution changes between known and new drugs, showing most existing methods suffer performance degradation under such changes.

Details

Motivation: Emerging DDI prediction is crucial but hindered by distribution changes between known and new drugs in real-world scenarios, with current evaluation often neglecting these changes due to absence of drug approval data.

Method: Proposes DDI-Ben framework with distribution change simulation that leverages distribution changes between drug sets as surrogate for real-world DDI distribution changes, compatible with various drug split strategies.

Result: Benchmarking on ten representative methods shows most existing approaches suffer substantial performance degradation under distribution changes. LLM-based methods and integration of drug-related textual information show promising robustness.

Conclusion: DDI-Ben highlights importance of addressing distribution changes explicitly and provides foundation for developing more resilient methods for emerging DDI prediction.

Abstract: Motivation: Emerging drug-drug interaction (DDI) prediction is crucial for new drugs but is hindered by distribution changes between known and new drugs in real-world scenarios. Current evaluation often neglects these changes, relying on unrealistic i.i.d. split due to the absence of drug approval data. Results: We propose DDI-Ben, a benchmarking framework for emerging DDI prediction under distribution changes. DDI-Ben introduces a distribution change simulation framework that leverages distribution changes between drug sets as a surrogate for real-world distribution changes of DDIs, and is compatible with various drug split strategies. Through extensive benchmarking on ten representative methods, we show that most existing approaches suffer substantial performance degradation under distribution changes. Our analysis further indicates that large language model (LLM) based methods and the integration of drug-related textual information offer promising robustness against such degradation. To support future research, we release the benchmark datasets with simulated distribution changes. Overall, DDI-Ben highlights the importance of explicitly addressing distribution changes and provides a foundation for developing more resilient methods for emerging DDI prediction. Availability and implementation: Our code and data are available at https://github.com/LARS-research/DDI-Bench.

[544] Disentangled and Self-Explainable Node Representation Learning

Simone Piaggesi, André Panisson, Megha Khosla

Main category: cs.LG

TL;DR: DiSeNE is a framework that generates self-explainable node embeddings through disentangled representation learning, making each embedding dimension interpretable by aligning with distinct topological structures.

Details

Motivation: Current research focuses on explaining graph model decisions, but interpretability of unsupervised node embeddings remains underexplored.

Method: Uses disentangled representation learning to produce dimension-wise interpretable embeddings, formalizes desiderata for interpretable embeddings, and develops new objective functions optimizing for both interpretability and disentanglement.

Result: Proposes new metrics for evaluation and demonstrates effectiveness through extensive experiments on multiple benchmark datasets.

Conclusion: DiSeNE successfully bridges the gap in interpretable unsupervised node embeddings by generating self-explainable embeddings that are both interpretable and disentangled.

Abstract: Node representations, or embeddings, are low-dimensional vectors that capture node properties, typically learned through unsupervised structural similarity objectives or supervised tasks. While recent efforts have focused on explaining graph model decisions, the interpretability of unsupervised node embeddings remains underexplored. To bridge this gap, we introduce DiSeNE (Disentangled and Self-Explainable Node Embedding), a framework that generates self-explainable embeddings in an unsupervised manner. Our method employs disentangled representation learning to produce dimension-wise interpretable embeddings, where each dimension is aligned with distinct topological structure of the graph. We formalize novel desiderata for disentangled and interpretable embeddings, which drive our new objective functions, optimizing simultaneously for both interpretability and disentanglement. Additionally, we propose several new metrics to evaluate representation quality and human interpretability. Extensive experiments across multiple benchmark datasets demonstrate the effectiveness of our approach.

[545] Hopfield-Fenchel-Young Networks: A Unified Framework for Associative Memory Retrieval

Saul Santos, Vlad Niculae, Daniel McNamee, André F. T. Martins

Main category: cs.LG

TL;DR: A unified framework called Hopfield-Fenchel-Young networks that generalizes associative memory models using Fenchel-Young losses and different entropy functions, enabling sparse transformations and structured pattern retrieval.

Details

Motivation: To create a unified framework that generalizes traditional and modern Hopfield networks, connecting them with self-attention mechanisms and enabling sparse transformations and structured pattern associations.

Method: Formulate energy functions as differences between two Fenchel-Young losses parameterized by generalized entropies (Tsallis and norm entropies), derive differentiable update rules, and extend to structured networks using SparseMAP transformation.

Result: The framework enables sparse transformations, exact retrieval of single memory patterns, retrieval of pattern associations, and provides energy minimization perspective for common post-transformations like normalization.

Conclusion: Hopfield-Fenchel-Young networks successfully unify and extend associative memory models, validated on diverse memory recall tasks including simulated data, image retrieval, and text rationalization.

Abstract: Associative memory models, such as Hopfield networks and their modern variants, have garnered renewed interest due to advancements in memory capacity and connections with self-attention in transformers. In this work, we introduce a unified framework-Hopfield-Fenchel-Young networks-which generalizes these models to a broader family of energy functions. Our energies are formulated as the difference between two Fenchel-Young losses: one, parameterized by a generalized entropy, defines the Hopfield scoring mechanism, while the other applies a post-transformation to the Hopfield output. By utilizing Tsallis and norm entropies, we derive end-to-end differentiable update rules that enable sparse transformations, uncovering new connections between loss margins, sparsity, and exact retrieval of single memory patterns. We further extend this framework to structured Hopfield networks using the SparseMAP transformation, allowing the retrieval of pattern associations rather than a single pattern. Our framework unifies and extends traditional and modern Hopfield networks and provides an energy minimization perspective for widely used post-transformations like $\ell_2$-normalization and layer normalization-all through suitable choices of Fenchel-Young losses and by using convex analysis as a building block. Finally, we validate our Hopfield-Fenchel-Young networks on diverse memory recall tasks, including free and sequential recall. Experiments on simulated data, image retrieval, multiple instance learning, and text rationalization demonstrate the effectiveness of our approach.

[546] REX: Causal discovery based on machine learning and explainability techniques

Jesus Renero, Idoia Ochoa, Roberto Maestre

Main category: cs.LG

TL;DR: ReX is a novel causal discovery method that combines machine learning models with explainability techniques (Shapley values) to identify and interpret causal relationships, outperforming state-of-the-art methods on synthetic and real-world datasets.

Details

Motivation: Current causal discovery methods lack explainability features, despite XAI techniques' potential to enhance causal understanding in complex systems across healthcare, economics, and AI domains.

Method: ReX leverages machine learning models combined with Shapley values explainability techniques to identify significant causal relationships among variables from continuous tabular data.

Result: ReX outperforms state-of-the-art causal discovery methods on synthetic datasets with non-linear and additive noise models, and achieves 0.952 precision on the Sachs protein-signaling dataset with no incorrect edges.

Conclusion: ReX effectively bridges predictive modeling and causal inference, offering a robust tool for understanding complex causal structures while minimizing false positives across diverse datasets.

Abstract: Explainable Artificial Intelligence (XAI) techniques hold significant potential for enhancing the causal discovery process, which is crucial for understanding complex systems in areas like healthcare, economics, and artificial intelligence. However, no causal discovery methods currently incorporate explainability into their models to derive the causal graphs. Thus, in this paper we explore this innovative approach, as it offers substantial potential and represents a promising new direction worth investigating. Specifically, we introduce ReX, a causal discovery method that leverages machine learning (ML) models coupled with explainability techniques, specifically Shapley values, to identify and interpret significant causal relationships among variables. Comparative evaluations on synthetic datasets comprising continuous tabular data reveal that ReX outperforms state-of-the-art causal discovery methods across diverse data generation processes, including non-linear and additive noise models. Moreover, ReX was tested on the Sachs single-cell protein-signaling dataset, achieving a precision of 0.952 and recovering key causal relationships with no incorrect edges. Taking together, these results showcase ReX’s effectiveness in accurately recovering true causal structures while minimizing false positive predictions, its robustness across diverse datasets, and its applicability to real-world problems. By combining ML and explainability techniques with causal discovery, ReX bridges the gap between predictive modeling and causal inference, offering an effective tool for understanding complex causal structures.

[547] Exploring the Noise Robustness of Online Conformal Prediction

Huajun Xi, Kangdao Liu, Hao Zeng, Wenguang Sun, Hongxin Wei

Main category: cs.LG

TL;DR: This paper proposes NR-OCP, a noise-robust online conformal prediction method that maintains coverage guarantees under uniform label noise by using a novel robust pinball loss.

Details

Motivation: Existing online conformal prediction methods assume perfect label accuracy, which rarely holds in practice. Label noise causes persistent coverage gaps between actual and desired mis-coverage rates.

Method: Proposed Noise Robust Online Conformal Prediction (NR-OCP) that updates thresholds using a novel robust pinball loss, providing unbiased estimates without requiring ground-truth labels.

Result: NR-OCP eliminates coverage gaps in both constant and dynamic learning rate schedules, achieving O(T^{-1/2}) convergence rate for empirical and expected coverage errors under uniform label noise.

Conclusion: The method effectively achieves both precise coverage and improved efficiency under label noise conditions, addressing a key limitation of existing conformal prediction approaches.

Abstract: Conformal prediction is an emerging technique for uncertainty quantification that constructs prediction sets guaranteed to contain the true label with a predefined probability. Recent work develops online conformal prediction methods that adaptively construct prediction sets to accommodate distribution shifts. However, existing algorithms typically assume perfect label accuracy which rarely holds in practice. In this work, we investigate the robustness of online conformal prediction under uniform label noise with a known noise rate, in both constant and dynamic learning rate schedules. We show that label noise causes a persistent gap between the actual mis-coverage rate and the desired rate $\alpha$, leading to either overestimated or underestimated coverage guarantees. To address this issue, we propose Noise Robust Online Conformal Prediction (dubbed NR-OCP) by updating the threshold with a novel robust pinball loss, which provides an unbiased estimate of clean pinball loss without requiring ground-truth labels. Our theoretical analysis shows that NR-OCP eliminates the coverage gap in both constant and dynamic learning rate schedules, achieving a convergence rate of $\mathcal{O}(T^{-1/2})$ for both empirical and expected coverage errors under uniform label noise. Extensive experiments demonstrate the effectiveness of our method by achieving both precise coverage and improved efficiency.

[548] Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping

Pu Yang, Yunzhen Feng, Ziyuan Chen, Yuhang Wu, Zhuoyuan Li

Main category: cs.LG

TL;DR: The paper analyzes optimal budget allocation strategies for iterative bootstrapping in foundation models, showing that constant policies fail while increasing policies (especially exponential growth) perform better.

Details

Motivation: Foundation models use iterative bootstrapping where synthetic data is generated, filtered, and used for fine-tuning, but there's no clear guidance on how to allocate budgets across iterations to maximize performance.

Method: Developed a theoretical framework to analyze budget allocation strategies, comparing constant policies with increasing policies (exponential and polynomial growth), and validated through experiments on image denoising with diffusion models and math reasoning with LLMs.

Result: Constant policies fail to converge with high probability, while exponential and polynomial growth policies consistently outperform constant policies, with exponential policies providing more stable performance.

Conclusion: Increasing budget allocation policies, particularly exponential growth strategies, are theoretically and empirically superior to constant policies for iterative bootstrapping in foundation models.

Abstract: Modern foundation models often undergo iterative ``bootstrapping’’ in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations, the model performance improves, raising a crucial question: How should the total budget for generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework for analyzing budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies – particularly exponential growth policies – exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant policies, with exponential policies often providing more stable performance.

[549] FedRTS: Federated Robust Pruning via Combinatorial Thompson Sampling

Hong Huang, Hai Yang, Yuan Chen, Jiaxun Ye, Dapeng Wu

Main category: cs.LG

TL;DR: FedRTS is a federated learning framework that uses combinatorial Thompson Sampling for robust sparse model training, addressing issues of unstable topologies and communication inefficiency in dynamic pruning methods.

Details

Motivation: Existing federated learning methods with dynamic pruning suffer from greedy adjustments, unstable topologies, and communication inefficiency, leading to less robust models and suboptimal performance under data heterogeneity and partial client availability.

Method: FedRTS uses a Thompson Sampling-based Adjustment (TSAdj) mechanism that makes probabilistic decisions informed by stable, farsighted information instead of deterministic decisions based on unstable and myopic information.

Result: Extensive experiments show FedRTS achieves state-of-the-art performance in computer vision and NLP tasks while reducing communication costs, particularly excelling in scenarios with heterogeneous data distributions and partial client participation.

Conclusion: FedRTS successfully addresses the limitations of existing dynamic pruning methods in federated learning by providing a more robust and efficient framework through probabilistic decision-making with stable information.

Abstract: Federated Learning (FL) enables collaborative model training across distributed clients without data sharing, but its high computational and communication demands strain resource-constrained devices. While existing methods use dynamic pruning to improve efficiency by periodically adjusting sparse model topologies while maintaining sparsity, these approaches suffer from issues such as greedy adjustments, unstable topologies, and communication inefficiency, resulting in less robust models and suboptimal performance under data heterogeneity and partial client availability. To address these challenges, we propose Federated Robust pruning via combinatorial Thompson Sampling (FedRTS), a novel framework designed to develop robust sparse models. FedRTS enhances robustness and performance through its Thompson Sampling-based Adjustment (TSAdj) mechanism, which uses probabilistic decisions informed by stable, farsighted information instead of deterministic decisions reliant on unstable and myopic information in previous methods. Extensive experiments demonstrate that FedRTS achieves state-of-the-art performance in computer vision and natural language processing tasks while reducing communication costs, particularly excelling in scenarios with heterogeneous data distributions and partial client participation. Our codes are available at: https://github.com/Little0o0/FedRTS

[550] Tokenizing Single-Channel EEG with Time-Frequency Motif Learning

Jathurshan Pradeepkumar, Xihao Piao, Zheng Chen, Jimeng Sun

Main category: cs.LG

TL;DR: TFM-Tokenizer is a novel EEG tokenization framework that learns time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens, achieving significant performance improvements across diverse EEG benchmarks.

Details

Motivation: Foundation models are transforming EEG analysis, but effective EEG tokenization remains a challenge. The paper addresses the need for robust tokenization methods that can capture meaningful patterns from EEG signals while being device-agnostic and compatible with existing foundation models.

Method: Proposes TFM-Tokenizer with a dual-path architecture using time-frequency masking to learn robust motif representations from single-channel EEG signals. The framework is model-agnostic and supports both lightweight transformers and existing foundation models.

Result: Achieves up to 17% improvement in Cohen’s Kappa over strong baselines across four EEG benchmarks. Consistently boosts performance of foundation models like BIOT and LaBraM. Shows 14% improvement on ear-EEG sleep staging despite differences in signal format, channel configuration, and recording devices. Token analysis reveals class-discriminative, frequency-aware structures.

Conclusion: TFM-Tokenizer provides an effective solution for EEG tokenization that offers accuracy improvements, generalization across models, and scalability to different EEG devices through single-channel processing, enabling better representation quality and interpretability in EEG analysis.

Abstract: Foundation models are reshaping EEG analysis, yet an important problem of EEG tokenization remains a challenge. This paper presents TFM-Tokenizer, a novel tokenization framework that learns a vocabulary of time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens. We propose a dual-path architecture with time-frequency masking to capture robust motif representations, and it is model-agnostic, supporting both lightweight transformers and existing foundation models for downstream tasks. Our study demonstrates three key benefits: Accuracy: Experiments on four diverse EEG benchmarks demonstrate consistent performance gains across both single- and multi-dataset pretraining settings, achieving up to 17% improvement in Cohen’s Kappa over strong baselines. Generalization: Moreover, as a plug-and-play component, it consistently boosts the performance of diverse foundation models, including BIOT and LaBraM. Scalability: By operating at the single-channel level rather than relying on the strict 10-20 EEG system, our method has the potential to be device-agnostic. Experiments on ear-EEG sleep staging, which differs from the pretraining data in signal format, channel configuration, recording device, and task, show that our tokenizer outperforms baselines by 14%. A comprehensive token analysis reveals strong class-discriminative, frequency-aware, and consistent structure, enabling improved representation quality and interpretability. Code is available at https://github.com/Jathurshan0330/TFM-Tokenizer.

[551] Offline Reinforcement Learning via Inverse Optimization

Ioannis Dimanidis, Tolga Ok, Peyman Mohajerin Esfahani

Main category: cs.LG

TL;DR: Proposes a novel offline RL algorithm using inverse optimization’s sub-optimality loss with robust MPC expert to handle distribution shift, achieving competitive performance with fewer parameters.

Details

Motivation: Leverage inverse optimization successes and address distribution shift in offline RL for continuous spaces using robust MPC expert.

Method: Uses sub-optimality loss from IO literature with robust non-causal MPC expert that has exact convex reformulation, trained on MuJoCo benchmark.

Result: Achieves competitive performance with SOTA methods in low-data regime using three orders of magnitude fewer parameters, requiring less computational resources.

Conclusion: The proposed IO-based approach with robust MPC expert is effective for offline RL with reduced computational requirements and good performance.

Abstract: Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called ``sub-optimality loss" from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert steering a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained by the proposed convex loss function, enjoys ample expressiveness and achieves competitive performance comparing with the state-of-the-art (SOTA) methods in the low-data regime of the MuJoCo benchmark while utilizing three orders of magnitude fewer parameters, thereby requiring significantly fewer computational resources. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments.

[552] Strategyproof Reinforcement Learning from Human Feedback

Thomas Kleine Buening, Jiarui Gan, Debmalya Mandal, Marta Kwiatkowska

Main category: cs.LG

TL;DR: RLHF algorithms are vulnerable to strategic labelers who can manipulate feedback to steer policies toward their preferences. Existing methods are not strategyproof, and any strategyproof algorithm must sacrifice policy performance. The paper proposes a new algorithm that achieves approximate strategyproofness while converging to optimal policies.

Details

Motivation: Current RLHF methods assume honest feedback from labelers, but in practice, labelers may strategically misreport preferences to influence the learned policy. This creates a vulnerability where even a single strategic labeler can cause significant misalignment with true social welfare.

Method: The paper proposes the Pessimistic Median of MLEs algorithm, which is designed to be approximately strategyproof under policy coverage assumptions. The method works by taking a pessimistic approach to aggregating maximum likelihood estimates from multiple labelers.

Result: Theoretical analysis shows that existing RLHF algorithms are not strategyproof, and any strategyproof algorithm must perform k-times worse than optimal (where k is the number of labelers). The proposed algorithm achieves approximate strategyproofness and converges to the optimal policy as labelers and samples increase.

Conclusion: There is a fundamental trade-off between incentive alignment (truthful reporting) and policy alignment (social welfare maximization) in RLHF. The proposed Pessimistic Median of MLEs algorithm provides a practical solution that balances these competing objectives, with theoretical guarantees for both contextual bandits and Markov decision processes.

Abstract: We study Reinforcement Learning from Human Feedback (RLHF) in settings where multiple labelers may strategically misreport feedback to steer the learned policy toward their own preferences. We show that existing RLHF algorithms, including recent pluralistic methods, are not strategyproof, and that even a single strategic labeler can cause arbitrarily large misalignment with social welfare. Moreover, we prove that, in the worst case, any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, where $k$ is the number of labelers. This suggests a fundamental trade-off between incentive alignment (ensuring labelers report truthfully) and policy alignment (maximizing social welfare). To address this, we propose the Pessimistic Median of MLEs algorithm, which, under appropriate policy coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of labelers and samples increases. Our results apply to both contextual bandits and Markov decision processes.

[553] Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

Julian Minder, Clément Dumas, Caden Juang, Bilal Chugtai, Neel Nanda

Main category: cs.LG

TL;DR: The paper identifies issues in Crosscoders model diffing method and proposes Latent Scaling to flag misattributed concepts, then improves crosscoder training with BatchTopK loss to find more genuine chat-specific interpretable concepts.

Details

Motivation: To address issues in Crosscoders model diffing method that misattribute concepts as unique to fine-tuned models when they actually exist in both base and fine-tuned models, stemming from L1 training loss problems.

Method: Developed Latent Scaling to accurately measure latent presence across models, and trained crosscoder with BatchTopK loss to mitigate the identified issues.

Result: BatchTopK crosscoder substantially mitigated the issues, finding more genuinely chat-specific and highly interpretable concepts including refusal-related latents and concepts like false information and personal questions.

Conclusion: The work advances best practices for crosscoder-based model diffing methodology and demonstrates it can provide concrete insights into how chat-tuning modifies model behavior.

Abstract: Model diffing is the study of how fine-tuning changes a model’s representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues which stem from the crosscoders L1 training loss that can misattribute concepts as unique to the fine-tuned model, when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent’s presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as $\textit{false information}$ and $\textit{personal question}$, along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat-tuning modifies model behavior.

[554] Restoring Calibration for Aligned Large Language Models: A Calibration-Aware Fine-Tuning Approach

Jiancong Xiao, Bojian Hou, Zhanliang Wang, Ruochen Jin, Qi Long, Weijie J. Su, Li Shen

Main category: cs.LG

TL;DR: Preference alignment in LLMs causes poor calibration, leading to overconfidence. The paper analyzes this issue and proposes calibration-aware fine-tuning methods to maintain good calibration while preserving model performance.

Details

Motivation: Large Language Models become poorly calibrated after preference alignment, exhibiting overconfidence despite being well-calibrated pre-training. This calibration degradation is a significant side effect that needs addressing.

Method: The paper proposes two approaches: 1) Calibration-aware fine-tuning for models in the calibratable regime, and 2) EM-algorithm-based ECE regularization for models in the non-calibratable regime to maintain low calibration error during fine-tuning.

Result: Extensive experiments show the proposed methods effectively address the calibration issue caused by preference alignment, maintaining proper calibration without compromising LLM performance.

Conclusion: Preference alignment causes calibration collapse due to preference generalization, but this can be mitigated through calibration-aware fine-tuning approaches that maintain model performance while ensuring proper calibration.

Abstract: One of the key technologies for the success of Large Language Models (LLMs) is preference alignment. However, a notable side effect of preference alignment is poor calibration: while the pre-trained models are typically well-calibrated, LLMs tend to become poorly calibrated after alignment with human preferences. In this paper, we investigate why preference alignment affects calibration and how to address this issue. For the first question, we observe that the preference collapse issue in alignment undesirably generalizes to the calibration scenario, causing LLMs to exhibit overconfidence and poor calibration. To address this, we demonstrate the importance of fine-tuning with domain-specific knowledge to alleviate the overconfidence issue. To further analyze whether this affects the model’s performance, we categorize models into two regimes: calibratable and non-calibratable, defined by bounds of Expected Calibration Error (ECE). In the calibratable regime, we propose a calibration-aware fine-tuning approach to achieve proper calibration without compromising LLMs’ performance. However, as models are further fine-tuned for better performance, they enter the non-calibratable regime. For this case, we develop an EM-algorithm-based ECE regularization for the fine-tuning loss to maintain low calibration error. Extensive experiments validate the effectiveness of the proposed methods.

[555] Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang

Main category: cs.LG

TL;DR: Absolute Zero is a new RLVR paradigm where a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without any external data. The Absolute Zero Reasoner (AZR) system achieves SOTA performance on coding and math reasoning tasks despite zero human supervision.

Details

Motivation: Address concerns about scalability of human supervision in RLVR and prepare for future where AI surpasses human intelligence, making human-provided tasks potentially limiting for superintelligent systems.

Method: AZR system self-evolves training curriculum and reasoning ability using code executor to validate proposed code reasoning tasks and verify answers, serving as unified source of verifiable reward for open-ended learning.

Result: AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of human-curated examples, despite being trained without external data.

Conclusion: The Absolute Zero paradigm enables effective self-supervised learning without human data, works across different model scales and classes, and provides a scalable approach for advancing AI reasoning capabilities.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as an unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.

[556] The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization

Jae-Won Chung, Jeff J. Ma, Ruofan Wu, Jiachen Liu, Oh Jun Kweon, Yuxuan Xia, Zhiyu Wu, Mosharaf Chowdhury

Main category: cs.LG

TL;DR: The ML.ENERGY Benchmark is a tool for measuring AI inference energy consumption in realistic service environments, with a leaderboard to help optimize energy usage in generative AI services.

Details

Motivation: As Generative AI adoption grows, energy has become a critical bottleneck resource that is often overlooked or poorly understood in ML system development.

Method: Developed a benchmark suite and tool for measuring inference energy consumption under realistic service environments, with four key design principles for benchmarking ML energy.

Result: Measured energy consumption of 40 widely used model architectures across 6 tasks, showing that automated optimization recommendations can achieve over 40% energy savings without changing model computations.

Conclusion: The ML.ENERGY Benchmark provides a valuable open-source resource for understanding and optimizing energy consumption in generative AI services, with demonstrated significant energy savings potential.

Abstract: As the adoption of Generative AI in real-world services grow explosively, energy has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML$.$ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding ML$.$ENERGY Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the ML$.$ENERGY Benchmark. We then highlight results from the early 2025 iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40%) energy savings without changing what is being computed by the model. The ML$.$ENERGY Benchmark is open-source and can be easily extended to various customized models and application scenarios.

[557] ConDiSim: Conditional Diffusion Models for Simulation Based Inference

Mayank Nautiyal, Andreas Hellander, Prashant Singh

Main category: cs.LG

TL;DR: ConDiSim is a conditional diffusion model for simulation-based inference that uses denoising diffusion to approximate posterior distributions for complex systems with intractable likelihoods.

Details

Motivation: To address the challenge of inference in complex systems where likelihood functions are intractable, requiring robust methods for posterior approximation.

Method: Leverages denoising diffusion probabilistic models with forward process adding Gaussian noise to parameters and reverse process learning to denoise conditioned on observed data.

Result: Demonstrates effective posterior approximation accuracy across ten benchmark problems and two real-world test problems, maintaining computational efficiency and training stability.

Conclusion: ConDiSim provides a robust and extensible framework for simulation-based inference, particularly suitable for parameter inference workflows requiring fast inference methods.

Abstract: We present a conditional diffusion model - ConDiSim, for simulation-based inference of complex systems with intractable likelihoods. ConDiSim leverages denoising diffusion probabilistic models to approximate posterior distributions, consisting of a forward process that adds Gaussian noise to parameters, and a reverse process learning to denoise, conditioned on observed data. This approach effectively captures complex dependencies and multi-modalities within posteriors. ConDiSim is evaluated across ten benchmark problems and two real-world test problems, where it demonstrates effective posterior approximation accuracy while maintaining computational efficiency and stability in model training. ConDiSim offers a robust and extensible framework for simulation-based inference, particularly suitable for parameter inference workflows requiring fast inference methods.

[558] Just One Layer Norm Guarantees Stable Extrapolation

Juliusz Ziomek, George Whittle, Michael A. Osborne

Main category: cs.LG

TL;DR: Layer Norm transforms neural network extrapolation behavior by making the Neural Tangent Kernel bounded-variance, preventing uncontrolled output growth far from training data.

Details

Motivation: Neural Networks behave poorly when extrapolating far from training distribution, but existing analysis is limited to specific cases. Understanding this fundamental behavior is crucial for real-world applications.

Method: Applied Neural Tangent Kernel theory to analyze infinitely-wide neural networks with Layer Norm, proving it transforms the NTK into a bounded-variance kernel. Compared networks with and without LN through theoretical analysis and empirical experiments.

Result: Networks with at least one Layer Norm produce bounded outputs even on inputs far from training data, while networks without LN can exhibit pathologically large outputs. Empirical results on finite-width networks confirm this stability difference.

Conclusion: Layer Norm fundamentally improves neural network extrapolation stability, making outputs bounded far from training distribution. This has practical implications for applications like protein structure prediction and facial age estimation on underrepresented groups.

Abstract: In spite of their prevalence, the behaviour of Neural Networks when extrapolating far from the training distribution remains poorly understood, with existing results limited to specific cases. In this work, we prove general results – the first of their kind – by applying Neural Tangent Kernel (NTK) theory to analyse infinitely-wide neural networks trained until convergence and prove that the inclusion of just one Layer Norm (LN) fundamentally alters the induced NTK, transforming it into a bounded-variance kernel. As a result, the output of an infinitely wide network with at least one LN remains bounded, even on inputs far from the training data. In contrast, we show that a broad class of networks without LN can produce pathologically large outputs for certain inputs. We support these theoretical findings with empirical experiments on finite-width networks, demonstrating that while standard NNs often exhibit uncontrolled growth outside the training domain, a single LN layer effectively mitigates this instability. Finally, we explore real-world implications of this extrapolatory stability, including applications to predicting residue sizes in proteins larger than those seen during training and estimating age from facial images of underrepresented ethnicities absent from the training set.

[559] Electrostatics from Laplacian Eigenbasis for Neural Network Interatomic Potentials

Maksim Zhdanov, Vladislav Kurenkov

Main category: cs.LG

TL;DR: Phi-Module is a universal plugin that enforces Poisson’s equation in message-passing neural networks to learn electrostatic interactions self-supervised, improving molecular energy predictions.

Details

Motivation: To improve neural interatomic potentials by embedding first-principles physics constraints (Poisson's equation) for better electrostatic interaction modeling.

Method: Enforces discretized Poisson’s equation on atom-wise representations to learn potential φ and charge ρ from Laplacian eigenbasis coefficients, deriving electrostatic energy term.

Result: Significantly improves total energy predictions while being hyperparameter-friendly, memory-efficient, and lightweight in training.

Conclusion: Embedding first-principles constraints in neural potentials enhances performance with minimal computational overhead, making Phi-Module a versatile plugin for existing neural potentials.

Abstract: In this work, we introduce Phi-Module, a universal plugin module that enforces Poisson’s equation within the message-passing framework to learn electrostatic interactions in a self-supervised manner. Specifically, each atom-wise representation is encouraged to satisfy a discretized Poisson’s equation, making it possible to acquire a potential {\phi} and corresponding charges \r{ho} linked to the learnable Laplacian eigenbasis coefficients of a given molecular graph. We then derive an electrostatic energy term, crucial for improved total energy predictions. This approach integrates seamlessly into any existing neural potential with insignificant computational overhead. Our results underscore how embedding a first-principles constraint in neural interatomic potentials can significantly improve performance while remaining hyperparameter-friendly, memory-efficient, and lightweight in training. Code will be available at https://github.com/dunnolab/phi-module.

[560] KL-regularization Itself is Differentially Private in Bandits and RLHF

Yizhou Zhang, Kishan Panaganti, Laixi Shi, Juba Ziani, Adam Wierman

Main category: cs.LG

TL;DR: KL-regularization in learning objectives provides differential privacy for free in multi-armed bandits, linear contextual bandits, and RLHF without additional noise injection.

Details

Motivation: To achieve differential privacy without explicit noise injection by leveraging the intrinsic randomness of existing algorithms and regularization techniques.

Method: Adding KL-regularization to learning objectives in stochastic policies for multi-armed bandits, linear contextual bandits, and RLHF in offline data settings.

Result: The action sampled from the resulting stochastic policy becomes differentially private, providing privacy guarantees while maintaining regularization benefits.

Conclusion: KL-regularization offers a new approach to achieve differential privacy without additional noise, preserving both privacy and performance advantages of regularization.

Abstract: Differential Privacy (DP) provides a rigorous framework for privacy, ensuring the outputs of data-driven algorithms remain statistically indistinguishable across datasets that differ in a single entry. While guaranteeing DP generally requires explicitly injecting noise either to the algorithm itself or to its outputs, the intrinsic randomness of existing algorithms presents an opportunity to achieve DP ``for free’’. In this work, we explore the role of regularization in achieving DP across three different decision-making problems: multi-armed bandits, linear contextual bandits, and reinforcement learning from human feedback (RLHF), in offline data settings. We show that adding KL-regularization to the learning objective (a common approach in optimization algorithms) makes the action sampled from the resulting stochastic policy itself differentially private. This offers a new route to privacy guarantees without additional noise injection, while also preserving the inherent advantage of regularization in enhancing performance.

[561] Adaptive Budget Allocation for Orthogonal-Subspace Adapter Tuning in LLMs Continual Learning

Zhiyi Wan, Wanrou Du, Liang Li, Miao Pan, Xiaoqi Qin

Main category: cs.LG

TL;DR: OA-Adapter is a parameter-efficient continual learning method for LLMs that unifies dynamic budget adaptation with orthogonal subspace learning in a single training stage, addressing catastrophic forgetting and task interference.

Details

Motivation: LLMs suffer from catastrophic forgetting in continual learning, and existing methods either use fixed budget allocation (ignoring task complexity variations) or employ multi-stage approaches that cause misalignment between optimization and budget allocation.

Method: OA-Adapter introduces dynamic bottleneck dimension adaptation that simultaneously allocates parameter budgets and optimizes task objectives. It applies orthogonal constraints between current task parameters and dynamically allocated historical task subspaces to preserve knowledge.

Result: OA-Adapter outperforms state-of-the-art methods in accuracy and parameter efficiency, achieving higher average accuracy with 58.5% fewer parameters on standard CL benchmarks, and maintains advantages on larger benchmarks with 15 tasks.

Conclusion: The proposed OA-Adapter effectively addresses catastrophic forgetting in LLMs through unified dynamic budget adaptation and orthogonal subspace learning, demonstrating superior performance and parameter efficiency across multiple continual learning benchmarks.

Abstract: Large language models (LLMs) often suffer from catastrophic forgetting in continual learning (CL) scenarios, where performance on previously learned tasks degrades severely while training on sequentially arriving tasks. Although pioneering CL approaches using orthogonal subspaces can mitigate task interference, they typically employ fixed budget allocation, neglecting the varying complexity across tasks and layers. Besides, recent budget-adaptive tuning methods for LLMs often adopt multi-stage paradigms that decouple optimization and budget allocation. Such decoupling results in potential misalignment, which hinders those approaches’ practical application in CL scenarios. To address these limitations, we propose OA-Adapter, a novel parameter-efficient approach for continual learning in LLMs that unifies dynamic budget adaptation with orthogonal subspace learning in an end-to-end training stage. Specifically, OA-Adapter introduces a dynamic bottleneck dimension adaptation mechanism that simultaneously allocates an efficient parameter budget and optimizes task objectives without misalignment.To effectively preserve previously acquired knowledge while coordinating with the dynamic budget allocation, orthogonal constraints are applied specifically between the parameter subspace of the current task and the dynamically allocated parameter subspaces of historical tasks. Experimental results on continual learning benchmarks demonstrate that OA-Adapter outperforms state-of-the-art methods in both accuracy and parameter efficiency. OA-Adapter achieves higher average accuracy while using 58.5% fewer parameters on the standard CL benchmark, and maintains its advantages on two larger benchmarks comprising 15 tasks.

[562] Uni-LoRA: One Vector is All You Need

Kaiyang Li, Shaobo Han, Qing Su, Wei Li, Zhipeng Cai, Shihao Ji

Main category: cs.LG

TL;DR: Uni-LoRA presents a unified framework for LoRA variants that reconstructs LoRA parameters from a single trainable vector using an isometric projection matrix, achieving state-of-the-art parameter efficiency.

Details

Motivation: To unify various LoRA variants (Tied-LoRA, VeRA, VB-LoRA) under a single framework and address limitations of layer-wise projections that restrict cross-layer parameter sharing.

Method: Proposes Uni-LoRA framework where LoRA parameters are reconstructed from a low-dimensional subspace via an isometric projection matrix, enabling global parameter sharing with just one trainable vector for the entire LLM.

Result: Achieves state-of-the-art parameter efficiency while matching or outperforming prior approaches on GLUE, mathematical reasoning, and instruction tuning benchmarks.

Conclusion: Uni-LoRA provides both a unified theoretical framework and a highly efficient ‘one-vector-only’ solution for parameter-efficient fine-tuning of LLMs.

Abstract: Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space $R^D$, can be reconstructed through a projection from a subspace R^d, with $d \ll D$. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, $P \in R^{D \times d}$.Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM - making Uni-LoRA both a unified framework and a “one-vector-only” solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance.

[563] WeightLoRA: Keep Only Necessary Adapters

Andrey Veprikov, Vladimir Solodkin, Alexander Zyl, Andrey Savchenko, Aleksandr Beznosikov

Main category: cs.LG

TL;DR: WeightLoRA is a novel parameter-efficient fine-tuning method that adaptively selects the most critical LoRA heads during optimization, significantly reducing trainable parameters while maintaining or improving performance.

Details

Motivation: Traditional LoRA requires significant memory for training large models and relies on intuition for adapter placement, creating scalability and efficiency challenges.

Method: Proposes WeightLoRA which uses adaptive selection of the most critical LoRA heads throughout the optimization process to reduce trainable parameters.

Result: Experimental results on DeBERTa, BART, and Llama models show WeightLoRA reduces parameters while maintaining or improving performance, with WeightLoRA+ performing best in most cases.

Conclusion: WeightLoRA provides an effective solution for parameter-efficient fine-tuning that overcomes LoRA’s memory and layer selection limitations while achieving competitive or superior results.

Abstract: The widespread utilization of language models in modern applications is inconceivable without Parameter-Efficient Fine-Tuning techniques, such as low-rank adaptation ($\texttt{LoRA}$), which adds trainable adapters to selected layers. Although $\texttt{LoRA}$ may obtain accurate solutions, it requires significant memory to train large models and intuition on which layers to add adapters. In this paper, we propose a novel method, $\texttt{WeightLoRA}$, which overcomes this issue by adaptive selection of the most critical $\texttt{LoRA}$ heads throughout the optimization process. As a result, we can significantly reduce the number of trainable parameters while maintaining the capability to obtain consistent or even superior metric values. We conduct experiments for a series of competitive benchmarks and DeBERTa, BART, and Llama models, comparing our method with different adaptive approaches. The experimental results demonstrate the efficacy of $\texttt{WeightLoRA}$ and the superior performance of $\texttt{WeightLoRA+}$ in almost all cases.

[564] When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi

Main category: cs.LG

TL;DR: Style patterns in malicious queries can inflate jailbreak success rates in LLMs, and superficial style alignment increases vulnerability. A defense called SafeStyle is proposed to mitigate these risks.

Details

Motivation: To understand how style patterns in queries compromise LLM safety, how superficial style alignment increases vulnerability, and how to mitigate these risks during alignment.

Method: Defined ASR inflation, evaluated 32 LLMs across 7 benchmarks, investigated superficial style alignment through fine-tuning, and proposed SafeStyle defense with safety training data augmented to match style patterns.

Result: Nearly all models exhibit ASR inflation, which correlates with attention to style patterns. Fine-tuning with specific styles makes LLMs more vulnerable to same-style jailbreaks. SafeStyle consistently outperforms baselines in maintaining safety.

Conclusion: Style patterns significantly impact LLM safety, and SafeStyle provides an effective defense strategy against style-based jailbreak attacks.

Abstract: Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in malicious queries. Prior jailbreak research mainly augments these queries with additional string transformations to maximize attack success rate (ASR). However, the impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries. By evaluating 32 LLMs across seven benchmarks, we find that nearly all models exhibit ASR inflation. Notably, the inflation correlates with an LLM’s relative attention to style patterns, which also overlap more with its instruction-tuning data when inflation occurs. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs, six fine-tuning style settings, and two real-world instruction-tuning datasets, SafeStyle consistently outperforms baselines in maintaining LLM safety.

[565] Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order

Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Nikolay Bushkov, Stanislav Moiseev, Aleksandr Beznosikov

Main category: cs.LG

TL;DR: Proposes JAGUAR SignSGD and JAGUAR Muon, two zero-order optimization methods for memory-efficient LLM fine-tuning that achieve comparable performance to first-order methods with significant memory reduction.

Details

Motivation: Traditional first-order optimizers like SGD and Adam are memory-intensive and scale poorly with large LLMs, creating a need for more efficient alternatives.

Method: Develops zero-order momentum-based algorithms that work with parameter-efficient techniques like LoRA, requiring same parameters as ZO SGD but only O(1) function evaluations per iteration.

Result: Achieves rigorous convergence guarantees for SignSGD in stochastic ZO case, and demonstrates through experiments that proposed methods meet or exceed convergence quality of first-order methods with significant memory reduction.

Conclusion: Zero-order optimization methods provide a practical and theoretically grounded approach for resource-constrained LLM adaptation, establishing them as viable alternatives to traditional first-order methods.

Abstract: Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM

[566] TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness

Zhiyuan Zhao, Juntong Ni, Shangqing Xu, Haoxin Liu, Wei Jin, B. Aditya Prakash

Main category: cs.LG

TL;DR: TimeRecipe is a unified benchmarking framework that systematically evaluates time-series forecasting methods at the module level through over 10,000 experiments, revealing design insights and providing practical toolkit recommendations.

Details

Motivation: There is considerable debate over which architectures and design components are most effective in time-series forecasting, with existing benchmarks offering limited insight into why certain designs work better under varying conditions.

Method: Proposed TimeRecipe framework conducts systematic module-level evaluation of time-series forecasting methods across diverse datasets, forecasting horizons, and task settings through over 10,000 experiments.

Result: Exhaustive exploration of the design space yields models that outperform existing state-of-the-art methods and uncovers meaningful intuitions linking specific design choices to forecasting scenarios.

Conclusion: TimeRecipe provides a practical toolkit that recommends suitable model architectures based on empirical insights, with the benchmark publicly available for community use.

Abstract: Time-series forecasting is an essential task with wide real-world applications across domains. While recent advances in deep learning have enabled time-series forecasting models with accurate predictions, there remains considerable debate over which architectures and design components, such as series decomposition or normalization, are most effective under varying conditions. Existing benchmarks primarily evaluate models at a high level, offering limited insight into why certain designs work better. To mitigate this gap, we propose TimeRecipe, a unified benchmarking framework that systematically evaluates time-series forecasting methods at the module level. TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components across a diverse range of datasets, forecasting horizons, and task settings. Our results reveal that exhaustive exploration of the design space can yield models that outperform existing state-of-the-art methods and uncover meaningful intuitions linking specific design choices to forecasting scenarios. Furthermore, we release a practical toolkit within TimeRecipe that recommends suitable model architectures based on these empirical insights. The benchmark is available at: https://github.com/AdityaLab/TimeRecipe.

[567] Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs

Hen Davidov, Gilad Freidkin, Shai Feldman, Yaniv Romano

Main category: cs.LG

TL;DR: Proposes time-to-unsafe-sampling as a new safety metric for LLMs and develops a conformal prediction method with optimized sampling to estimate it with statistical guarantees.

Details

Motivation: Existing safety evaluation methods don't capture how many generations are needed to trigger unsafe responses in well-aligned models, where unsafe outputs are rare.

Method: Frames the problem as survival analysis, uses conformal prediction with a novel calibration technique to construct lower predictive bounds, and employs optimized sampling-budget allocation for efficiency.

Result: The method provides rigorous coverage guarantees for time-to-unsafe-sampling estimation and demonstrates practical utility on synthetic and real data.

Conclusion: Time-to-unsafe-sampling offers a new dimension for prompt-adaptive safety evaluation, and the proposed method enables efficient estimation with statistical guarantees for generative AI risk assessment.

Abstract: We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While providing a new dimension for prompt-adaptive safety evaluation, quantifying time-to-unsafe-sampling is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame this estimation problem as one of survival analysis. We build on recent developments in conformal prediction and propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.

[568] Subspace-Boosted Model Merging

Ronald Skorobogat, Karsten Roth, Mariana-Iuliana Georgescu

Main category: cs.LG

TL;DR: Model merging combines expert models but suffers from diminishing returns due to rank collapse in task vector space. Subspace Boosting maintains task vector ranks through SVD, improving merging efficacy by over 10% for up to 20 experts.

Details

Motivation: Address the diminishing returns and reduced performance gains when merging increasing numbers of specialized expert models, caused by rank collapse in task vector space.

Method: Introduces Subspace Boosting which operates on singular value decomposed task vector space to maintain task vector ranks, and uses Higher-Order Generalized SVD to quantify task similarity.

Result: Subspace Boosting raises merging efficacy by large margins of more than 10% for up to 20 expert models when evaluated on both vision and language benchmarks.

Conclusion: The proposed Subspace Boosting method effectively mitigates rank collapse in model merging, significantly improving performance while providing interpretable task similarity analysis through Higher-Order Generalized SVD.

Abstract: Model merging enables the combination of multiple specialized expert models into a single model capable of performing multiple tasks. However, the benefits of merging an increasing amount of specialized experts generally lead to diminishing returns and reduced overall performance gains. In this work, we offer an explanation and analysis from a task arithmetic perspective; revealing that as the merging process (across numerous existing merging methods) continues for more and more experts, the associated task vector space experiences rank collapse. To mitigate this issue, we introduce Subspace Boosting, which operates on the singular value decomposed task vector space and maintains task vector ranks. Subspace Boosting raises merging efficacy for up to 20 expert models by large margins of more than 10% when evaluated on both vision and language benchmarks. Moreover, we propose employing Higher-Order Generalized Singular Value Decomposition to quantify task similarity, offering a new interpretable perspective on model merging.

[569] Byzantine Failures Harm the Generalization of Robust Distributed Learning Algorithms More Than Data Poisoning

Thomas Boudou, Batiste Le Bars, Nirupam Gupta, Aurélien Bellet

Main category: cs.LG

TL;DR: This paper shows a fundamental gap in generalization guarantees between Byzantine failures and data poisoning threat models in distributed learning, with Byzantine failures yielding strictly worse generalization rates.

Details

Motivation: To understand whether the empirical generalization gap between Byzantine failures and data poisoning threat models is fundamental or just due to suboptimal attacks, and to provide theoretical guarantees.

Method: Leverages tight algorithmic stability analysis of robust distributed learning algorithms, proving bounds on uniform algorithmic stability degradation under both threat models.

Result: Under data poisoning, stability degrades by Θ(f/(n-f)), while under Byzantine failures, degradation is Ω(√(f/(n-2f))), showing Byzantine failures yield strictly worse generalization rates.

Conclusion: There is a fundamental gap in generalization guarantees between the two threat models, with Byzantine failures being strictly more harmful to generalization than data poisoning.

Abstract: Robust distributed learning algorithms aim to maintain reliable performance despite the presence of misbehaving workers. Such misbehaviors are commonly modeled as Byzantine failures, allowing arbitrarily corrupted communication, or as data poisoning, a weaker form of corruption restricted to local training data. While prior work shows similar optimization guarantees for both models, an important question remains: How do these threat models impact generalization? Empirical evidence suggests a gap, yet it remains unclear whether it is unavoidable or merely an artifact of suboptimal attacks. We show, for the first time, a fundamental gap in generalization guarantees between the two threat models: Byzantine failures yield strictly worse rates than those achievable under data poisoning. Our findings leverage a tight algorithmic stability analysis of robust distributed learning. Specifically, we prove that: (i) under data poisoning, the uniform algorithmic stability of an algorithm with optimal optimization guarantees degrades by an additive factor of $\varTheta ( \frac{f}{n-f} )$, with $f$ out of $n$ workers misbehaving; whereas $\textit{(ii)}$ under Byzantine failures, the degradation is in $\Omega \big( \sqrt{ \frac{f}{n-2f}} \big)$.

[570] LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Tong Zeng, Srivathsan Badrinarayanan, Janghoon Ock, Cheng-Kai Lai, Amir Barati Farimani

Main category: cs.LG

TL;DR: A multi-agent LLM framework autonomously infers operating constraints from minimal process descriptions and guides optimization, achieving competitive performance with 31x faster convergence than grid search.

Details

Motivation: Chemical process optimization becomes impractical when operating constraints are ill-defined or unavailable, requiring a solution that can autonomously infer constraints.

Method: AutoGen-based multi-agent framework using OpenAI’s o3 model with specialized agents for constraint generation, parameter validation, simulation, and optimization guidance.

Result: Achieved competitive performance with conventional methods while reducing wall-time 31-fold relative to grid search, converging in under 20 minutes on hydrodealkylation process.

Conclusion: Reasoning-capable LLMs are essential for successful optimization, and this approach is valuable for emerging processes where operational constraints are poorly characterized.

Abstract: Chemical process optimization maximizes production efficiency and economic performance, but optimization algorithms, including gradient-based solvers, numerical methods, and parameter grid searches, become impractical when operating constraints are ill-defined or unavailable. We present a multi-agent LLM framework that autonomously infers operating constraints from minimal process descriptions, then collaboratively guides optimization. Our AutoGen-based framework employs OpenAI’s o3 model with specialized agents for constraint generation, parameter validation, simulation, and optimization guidance. Through autonomous constraint generation and iterative multi-agent optimization, the framework eliminates the need for predefined operational bounds. Validated on hydrodealkylation across cost, yield, and yield-to-cost ratio metrics, the framework achieved competitive performance with conventional methods while reducing wall-time 31-fold relative to grid search, converging in under 20 minutes. The reasoning-guided search demonstrates sophisticated process understanding, correctly identifying utility trade-offs and applying domain-informed heuristics. Unlike conventional methods requiring predefined constraints, our approach uniquely combines autonomous constraint generation with interpretable parameter exploration. Model comparison reveals reasoning-capable architectures (o3, o1) are essential for successful optimization, while standard models fail to converge. This approach is particularly valuable for emerging processes and retrofit applications where operational constraints are poorly characterized or unavailable.

[571] VALID-Mol: a Systematic Framework for Validated LLM-Assisted Molecular Design

Malikussaid, Hilal Hudan Nuha, Isman Kurniawan

Main category: cs.LG

TL;DR: VALID-Mol framework improves LLM-driven molecular design by integrating chemical validation, increasing valid chemical structure generation from 3% to 83% while maintaining synthetic feasibility and improving binding affinity.

Details

Motivation: Large Language Models show promise for scientific discovery but struggle with factual precision and domain constraints in molecular design, often generating chemically infeasible structures.

Method: Integrates systematic prompt optimization, automated chemical verification, and domain-adapted fine-tuning to ensure dependable generation of synthesizable molecules with enhanced properties.

Result: Achieved 83% valid chemical structure generation (from 3% baseline) and up to 17-fold predicted improvements in target binding affinity while preserving synthetic feasibility.

Conclusion: Provides a transferable methodology for scientifically-constrained LLM applications with measurable reliability enhancements for molecular design in pharmaceutical development.

Abstract: Large Language Models demonstrate substantial promise for advancing scientific discovery, yet their deployment in disciplines demanding factual precision and specialized domain constraints presents significant challenges. Within molecular design for pharmaceutical development, these models can propose innovative molecular modifications but frequently generate chemically infeasible structures. We introduce VALID-Mol, a comprehensive framework that integrates chemical validation with LLM-driven molecular design, achieving an improvement in valid chemical structure generation from 3% to 83%. Our methodology synthesizes systematic prompt optimization, automated chemical verification, and domain-adapted fine-tuning to ensure dependable generation of synthesizable molecules with enhanced properties. Our contribution extends beyond implementation details to provide a transferable methodology for scientifically-constrained LLM applications with measurable reliability enhancements. Computational analyses indicate our framework generates promising synthesis candidates with up to 17-fold predicted improvements in target binding affinity while preserving synthetic feasibility.

[572] BoltzNCE: Learning Likelihoods for Boltzmann Generation with Stochastic Interpolants and Noise Contrastive Estimation

Rishal Aggarwal, Jacky Chen, Nicholas M. Boffi, David Ryan Koes

Main category: cs.LG

TL;DR: The paper introduces a method to accelerate Boltzmann Generators by training an energy-based model to approximate likelihoods using noise contrastive estimation and score matching, achieving 100x faster inference while maintaining accurate Boltzmann statistics.

Details

Motivation: Boltzmann Generators face computational challenges due to costly Jacobian computations during integration, making them impractical for large molecular systems. The motivation is to overcome this limitation while maintaining accurate Boltzmann statistics.

Method: Train an energy-based model (EBM) using both noise contrastive estimation (NCE) and score matching to approximate likelihoods, avoiding expensive Jacobian computations. The approach enables transfer learning across molecular systems.

Result: Achieved 100x faster inference on alanine dipeptide while maintaining accurate free energy profiles and energy distributions. Demonstrated effective transfer learning with at least 6x speedup over standard molecular dynamics.

Conclusion: The work demonstrates the design of models with accelerated likelihoods rather than just fast sampling, enabling unbiased Boltzmann statistics at scale through efficient reweighting schemes.

Abstract: Efficient sampling from the Boltzmann distribution given its energy function is a key challenge for modeling complex physical systems such as molecules. Boltzmann Generators address this problem by leveraging continuous normalizing flows to transform a simple prior into a distribution that can be reweighted to match the target using sample likelihoods. Despite the elegance of this approach, obtaining these likelihoods requires computing costly Jacobians during integration, which is impractical for large molecular systems. To overcome this difficulty, we train an energy-based model (EBM) to approximate likelihoods using both noise contrastive estimation (NCE) and score matching, which we show outperforms the use of either objective in isolation. On 2d synthetic systems where failure can be easily visualized, NCE improves mode weighting relative to score matching alone. On alanine dipeptide, our method yields free energy profiles and energy distributions that closely match those obtained using exact likelihoods while achieving $100\times$ faster inference. By training on multiple dipeptide systems, we show that our approach also exhibits effective transfer learning, generalizing to new systems at inference time and achieving at least a $6\times$ speedup over standard MD. While many recent efforts in generative modeling have prioritized models with fast sampling, our work demonstrates the design of models with accelerated likelihoods, enabling the application of reweighting schemes that ensure unbiased Boltzmann statistics at scale. Our code is available at https://github.com/RishalAggarwal/BoltzNCE.

[573] Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling

Ning Liao, Xiaoxing Wang, Zehao Lin, Weiyang Guo, Feng Hong, Shixiang Song, Geng Yu, Zihua Zhao, Sitao Xie, Longxuan Wei, Xiangqi Jin, Xiaohan Qin, Jiale Ma, Kai Chen, Jiangchao Yao, Zhouhan Lin, Junchi Yan, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Linfeng Zhang

Main category: cs.LG

TL;DR: Innovator upcycles a dense LLM into a Mixture-of-Experts model to prevent catastrophic forgetting during science-focused continued pretraining, achieving significant improvements in scientific tasks while maintaining general capabilities.

Details

Motivation: To create a science-general intelligence model without suffering from catastrophic forgetting of general abilities when training on scientific data.

Method: Four-stage upcycle training: Scientific Expert Induction, Fine-grained Expert Splitting via FFN decomposition, Science-Aware Routing warmup, and Generalist-Scientist Integration training on hybrid datasets.

Result: 25% average improvement across 30 scientific tasks with 70% win rate, while retaining 99% performance in general tasks. Innovator-Reason shows over 30% improvement in complex scientific reasoning.

Conclusion: The Mixture-of-Experts approach successfully decouples knowledge across domains, enabling specialized scientific expertise without compromising general capabilities.

Abstract: A large language model (LLM) with knowledge in both scientific and general tasks is the foundation of science general intelligence. However, directly continued pretraining an LLM using science data usually leads to catastrophic forgetting, which indicates severe degradation in general ability. In this report, we present Innovator, which solves this problem by upcycling a pre-trained dense LLM into a fine-grained Mixtures-of-Experts model during continued pretraining, where different experts are expected to learn science knowledge in different disciplines, and a shared expert is utilized for general tasks. Innovator introduces a four-stage upcycle training paradigm: (1) Scientific Expert Induction on discipline-specific data, (2) Fine-grained Expert Splitting via FFN dimension decomposition, (3) Science-Aware Routing warmup, and (4) Generalist-Scientist Integration training on hybrid datasets. Such a paradigm enables knowledge in the general domain, and different scientific disciplines can be decoupled, avoiding the negative influence among knowledge in different domains. With 53.3B total parameters and 13.3B activated, Innovator extends Qwen2.5-7B using a shared general expert and 64 specialized scientific experts with 8 activated. Trained on 300B tokens with tri-level quality-controlled data, Innovator achieves 25% average improvement across 30 scientific tasks with a win rate as 70%, while retaining 99% performance in general tasks. Furthermore, Innovator-Reason, which is post-trained from Innovator for reasoning boosting, exhibits excellent reasoning performance in solving complex scientific problems with improvements over 30%.

[574] Flows and Diffusions on the Neural Manifold

Daniel Saragih, Deyu Cao, Tejas Balaji

Main category: cs.LG

TL;DR: The paper extends diffusion and flow-based generative models to weight space learning by modeling gradient descent trajectories as inference problems, incorporating optimization dynamics as structural priors for generating neural network weights.

Details

Motivation: To leverage the success of diffusion and flow-based generative models in domains like image synthesis and apply them to weight space learning, using optimization trajectories as inductive bias for better weight generation and initialization.

Method: Models gradient descent trajectories as trajectory inference problems, unifies trajectory inference techniques to match gradient flow, uses autoencoders for latent weight representation, conditions on task-specific context, employs informative source distributions like Kaiming uniform, and includes reward fine-tuning via adjoint matching.

Result: The method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, supports fine-tuning to enhance performance, and outperforms comparable baselines in detecting harmful covariate shifts in safety-critical systems.

Conclusion: The approach successfully extends generative modeling to weight space by leveraging optimization dynamics as structural priors, demonstrating practical benefits in weight generation, initialization, and safety-critical applications.

Abstract: Diffusion and flow-based generative models have achieved remarkable success in domains such as image synthesis, video generation, and natural language modeling. In this work, we extend these advances to weight space learning by leveraging recent techniques to incorporate structural priors derived from optimization dynamics. Central to our approach is modeling the trajectory induced by gradient descent as a trajectory inference problem. We unify several trajectory inference techniques towards matching a gradient flow, providing a theoretical framework for treating optimization paths as inductive bias. We further explore architectural and algorithmic choices, including reward fine-tuning by adjoint matching, the use of autoencoders for latent weight representation, conditioning on task-specific context data, and adopting informative source distributions such as Kaiming uniform. Experiments demonstrate that our method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, and supports fine-tuning to enhance performance. Finally, we illustrate a practical application in safety-critical systems: detecting harmful covariate shifts, where our method outperforms the closest comparable baseline.

[575] One-Step Flow Policy Mirror Descent

Tianyi Chen, Haitong Ma, Na Li, Kai Wang, Bo Dai

Main category: cs.LG

TL;DR: FPMD enables 1-step sampling for flow policies in online RL, achieving comparable performance to diffusion policies with much faster inference.

Details

Motivation: Diffusion policies have strong performance but slow iterative sampling limits responsiveness. Need faster inference while maintaining performance.

Method: Uses theoretical connection between distribution variance and discretization error to enable 1-step sampling in flow matching models. Presents two variants: rectified flow policy and MeanFlow policy, requiring no extra distillation.

Result: Strong performance comparable to diffusion policy baselines on MuJoCo and visual DeepMind Control Suite benchmarks, with orders of magnitude less computational cost during inference.

Conclusion: FPMD successfully addresses the inference speed limitation of diffusion policies while maintaining competitive performance, making flow policies more practical for real-time applications.

Abstract: Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, the inference of diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables 1-step sampling during flow policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on rectified flow policy and MeanFlow policy, respectively. Extensive empirical evaluations on MuJoCo and visual DeepMind Control Suite benchmarks demonstrate that our algorithms show strong performance comparable to diffusion policy baselines while requiring orders of magnitude less computational cost during inference.

[576] On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang

Main category: cs.LG

TL;DR: Dynamic Fine-Tuning (DFT) improves SFT by dynamically rescaling the objective function with token probabilities, addressing limited generalization issues and outperforming standard SFT across multiple benchmarks.

Details

Motivation: Standard SFT has limited generalization compared to reinforcement learning due to problematic reward structures in its gradients, which restrict model capabilities.

Method: Propose Dynamic Fine-Tuning (DFT) that stabilizes gradient updates by dynamically rescaling the objective function with the probability of each token - a single-line code change.

Result: DFT significantly outperforms standard SFT across multiple challenging benchmarks and base models, showing greatly improved generalization and competitive results in offline RL settings.

Conclusion: DFT bridges theoretical insight and practical solutions, substantially advancing SFT performance while offering an effective yet simpler alternative to reinforcement learning approaches.

Abstract: We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of model. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.

[577] ECG-Soup: Harnessing Multi-Layer Synergy for ECG Foundation Models

Phu X. Nguyen, Huy Phan, Hieu Pham, Christos Chatzichristos, Bert Vandenberk, Maarten De Vos

Main category: cs.LG

TL;DR: Transformer-based foundation models for ECGs show impressive performance in downstream applications.

Details

Motivation: To leverage transformer architectures for ECG analysis, building on their success in other domains.

Method: Using transformer-based foundation models specifically designed for ECG data processing.

Result: Achieved impressive performance across multiple downstream ECG applications.

Conclusion: Transformer models are effective for ECG analysis and show promise for various clinical applications.

Abstract: Transformer-based foundation models for Electrocardiograms (ECGs) have recently achieved impressive performance in many downstream applications.

[578] Provable Mixed-Noise Learning with Flow-Matching

Paul Hagemann, Robert Gruhlke, Bernhard Stankewitz, Claudia Schillings, Gabriele Steidl

Main category: cs.LG

TL;DR: A novel EM framework with flow matching for Bayesian inverse problems with mixed additive and multiplicative Gaussian noise, enabling joint estimation of posterior samplers and unknown noise parameters.

Details

Motivation: Real-world applications in physics and chemistry often involve noise with unknown and heterogeneous structure, while traditional methods assume fixed or known noise characteristics.

Method: Conditional flow matching embedded within an Expectation-Maximization algorithm, using simulation-free ODE-based flow matching as the generative model in the E-step.

Result: The EM updates converge to true noise parameters in the population limit of infinite observations, and numerical results show effectiveness for mixed-noise Bayesian inverse problems.

Conclusion: Combining EM inference with flow matching provides an effective framework for handling mixed-noise Bayesian inverse problems with unknown noise parameters.

Abstract: We study Bayesian inverse problems with mixed noise, modeled as a combination of additive and multiplicative Gaussian components. While traditional inference methods often assume fixed or known noise characteristics, real-world applications, particularly in physics and chemistry, frequently involve noise with unknown and heterogeneous structure. Motivated by recent advances in flow-based generative modeling, we propose a novel inference framework based on conditional flow matching embedded within an Expectation-Maximization (EM) algorithm to jointly estimate posterior samplers and noise parameters. To enable high-dimensional inference and improve scalability, we use simulation-free ODE-based flow matching as the generative model in the E-step of the EM algorithm. We prove that, under suitable assumptions, the EM updates converge to the true noise parameters in the population limit of infinite observations. Our numerical results illustrate the effectiveness of combining EM inference with flow matching for mixed-noise Bayesian inverse problems.

[579] Merge-of-Thought Distillation

Zhanming Shen, Zeyu Qin, Zenan Huang, Hao Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, Junbo Zhao

Main category: cs.LG

TL;DR: Merge-of-Thought Distillation (MoT) is a lightweight framework that alternates between teacher-specific supervised fine-tuning and weight-space merging to unify multiple teachers’ reasoning abilities into a student model, overcoming conflicts among various teachers’ supervision.

Details

Motivation: Current reasoning distillation for long chain-of-thought models is constrained by the assumption of a single oracle teacher, despite the practical availability of multiple candidate teachers and growing CoT corpora. Different students have different "best teachers," and even for the same student, the best teacher can vary across datasets.

Method: MoT alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. It uses only about 200 CoT samples and applies this framework to distill reasoning capabilities from multiple teachers.

Result: On competition math benchmarks, applying MoT to a Qwen3-14B student surpasses strong models including Deepseek-R1, Qwen3-32B, and OpenAI-O1. It consistently outperforms the best single-teacher distillation, improves general reasoning beyond mathematics while reducing catastrophic forgetting, and shows robustness to distribution-shifted and peer-level teachers.

Conclusion: MoT possesses consensus CoT by eliminating teacher-specific inductive biases and inter-teacher conflicts while repeatedly reinforcing the learning of consensus reasoning features. It positions as a simple, effective route to efficiently distilling long CoT capabilities from diverse teachers into compact students.

Abstract: Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite the practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different “best teachers,” and even for the same student, the best teacher can vary across datasets. Therefore, to unify multiple teachers’ reasoning abilities into a student to overcome conflicts among various teachers’ supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including Deepseek-R1, Qwen3-32B, and OpenAI-O1, demonstrating substantial gains. Besides, MoT consistently outperforms the best single-teacher distillation, improves general reasoning beyond mathematics while reducing catastrophic forgetting, and shows robustness to distribution-shifted and peer-level teachers. Finally, we have demonstrated MoT possesses consensus CoT by eliminating teacher-specific inductive biases and inter-teacher conflicts while repeatedly reinforcing the learning of consensus reasoning features. These results position MoT as a simple, effective route to efficiently distilling long CoT capabilities from diverse teachers into compact students.

[580] Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics

Benjamin Sterling, Yousef El-Laham, Mónica F. Bugallo

Main category: cs.LG

TL;DR: This paper proposes a defense method against membership inference attacks on diffusion models using critically-damped higher-order Langevin dynamics with auxiliary variables to protect training data privacy.

Details

Motivation: Recent advances in generative AI raise data security concerns, particularly defending diffusion models against membership inference attacks where attackers can determine if specific data points were used in training.

Method: Utilizes critically-damped higher-order Langevin dynamics that introduces auxiliary variables and a joint diffusion process to mix external randomness, corrupting sensitive input data earlier in the diffusion process.

Result: The defense method was theoretically investigated and validated on a toy dataset and speech dataset using AUROC curves and FID metric, showing improved resistance to membership inference attacks.

Conclusion: The proposed defense using higher-order Langevin dynamics with auxiliary variables effectively enhances diffusion models’ resistance to membership inference attacks while maintaining model performance.

Abstract: Recent advances in generative artificial intelligence applications have raised new data security concerns. This paper focuses on defending diffusion models against membership inference attacks. This type of attack occurs when the attacker can determine if a certain data point was used to train the model. Although diffusion models are intrinsically more resistant to membership inference attacks than other generative models, they are still susceptible. The defense proposed here utilizes critically-damped higher-order Langevin dynamics, which introduces several auxiliary variables and a joint diffusion process along these variables. The idea is that the presence of auxiliary variables mixes external randomness that helps to corrupt sensitive input data earlier on in the diffusion process. This concept is theoretically investigated and validated on a toy dataset and a speech dataset using the Area Under the Receiver Operating Characteristic (AUROC) curves and the FID metric.

[581] Regularizing Extrapolation in Causal Inference

David Arbour, Harsh Parikh, Bijan Niknam, Elizabeth Stuart, Kara Rudolph, Avi Feller

Main category: cs.LG

TL;DR: A unified framework that replaces hard non-negativity constraints with soft constraints to balance feature imbalance, model misspecification, and variance in linear smoothers.

Details

Motivation: Current estimators either allow negative weights (improving imbalance but increasing parametric dependence and variance) or restrict to non-negative weights (reducing parametric dependence and variance but worsening imbalance). There's a need for a balanced approach.

Method: Propose a framework that directly penalizes extrapolation level with soft constraints, derive worst-case extrapolation error bound, introduce ‘bias-bias-variance’ tradeoff, and develop optimization procedure regularizing this bound while minimizing imbalance.

Result: The approach provides effective balance between competing objectives, demonstrated through synthetic experiments and real-world application generalizing RCT estimates to target populations.

Conclusion: The soft constraint framework offers a principled alternative to hard non-negativity constraints, enabling better tradeoffs between feature imbalance, model misspecification, and variance, especially in high-dimensional settings with poor positivity.

Abstract: Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which improve feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher variance. By contrast, estimators like importance weighting and random forests (sometimes implicitly) restrict weights to be non-negative, reducing dependence on parametric modeling and variance at the cost of worse imbalance. In this paper, we propose a unified framework that directly penalizes the level of extrapolation, replacing the current practice of a hard non-negativity constraint with a soft constraint and corresponding hyperparameter. We derive a worst-case extrapolation error bound and introduce a novel “bias-bias-variance” tradeoff, encompassing biases due to feature imbalance, model misspecification, and estimator variance; this tradeoff is especially pronounced in high dimensions, particularly when positivity is poor. We then develop an optimization procedure that regularizes this bound while minimizing imbalance and outline how to use this approach as a sensitivity analysis for dependence on parametric modeling assumptions. We demonstrate the effectiveness of our approach through synthetic experiments and a real-world application, involving the generalization of randomized controlled trial estimates to a target population of interest.

Bo Wang, Imran Khan, Martin White, Natalia Beloff

Main category: cs.LG

TL;DR: A federated multi-modal phishing detector that supports URL, HTML, and IMAGE inputs with flexible modality selection at inference. Uses role-aware bucket aggregation with hard-gated experts to stabilize training and isolate cross-embedding conflicts.

Details

Motivation: To enable flexible multi-modal phishing detection in federated settings where clients can use any modality without being bound to fixed inputs, while maintaining strict privacy constraints.

Method: Proposes role-aware bucket aggregation on FedProx with hard gating (selecting IMAGE/HTML/URL expert by sample modality) instead of learnable routing. Uses separate aggregation of modality-specific parameters. For text, employs GraphCodeBERT for URLs and early three-way embedding for raw HTML.

Result: Fusion head achieves 97.5% accuracy with 2.4% FPR across two data types on TR-OP. Image subset: 95.5% accuracy with 5.9% FPR. WebPhish (HTML): 96.5% accuracy with 1.8% FPR. TR-OP (raw HTML): 95.1% accuracy with 4.6% FPR.

Conclusion: Bucket aggregation with hard-gated experts enables stable federated training under strict privacy while improving usability and flexibility of multi-modal phishing detection.

Abstract: We present a federated, multi-modal phishing website detector that supports URL, HTML, and IMAGE inputs without binding clients to a fixed modality at inference: any client can invoke any modality head trained elsewhere. Methodologically, we propose role-aware bucket aggregation on top of FedProx, inspired by Mixture-of-Experts and FedMM. We drop learnable routing and use hard gating (selecting the IMAGE/HTML/URL expert by sample modality), enabling separate aggregation of modality-specific parameters to isolate cross-embedding conflicts and stabilize convergence. On TR-OP, the Fusion head reaches Acc 97.5% with FPR 2.4% across two data types; on the image subset (ablation) it attains Acc 95.5% with FPR 5.9%. For text, we use GraphCodeBERT for URLs and an early three-way embedding for raw, noisy HTML. On WebPhish (HTML) we obtain Acc 96.5% / FPR 1.8%; on TR-OP (raw HTML) we obtain Acc 95.1% / FPR 4.6%. Results indicate that bucket aggregation with hard-gated experts enables stable federated training under strict privacy, while improving the usability and flexibility of multi-modal phishing detection.

[583] Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

Main category: cs.LG

TL;DR: This paper presents MR-GPTQ, a specialized quantization method for 4-bit floating-point formats (MXFP4 and NVFP4) that addresses their unique challenges and enables significant speedups while maintaining accuracy.

Details

Motivation: Hardware-accelerated 4-bit floating-point formats like MXFP4 and NVFP4 promise to revolutionize LLM inference but their practical benefits remain unproven, with existing methods struggling due to format-specific limitations.

Method: Proposed Micro-Rotated-GPTQ (MR-GPTQ) uses block-wise Hadamard transforms and format-specific optimizations, supported by high-performance GPU kernels that enable rotation fusion into weights and fast online computation of activations.

Result: Achieves speedups vs. FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and 6x layer-wise and 4x end-to-end on RTX5090, while matching or outperforming state-of-the-art accuracy.

Conclusion: While FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock new accuracy-performance trade-offs for 4-bit floating-point quantization.

Abstract: The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4’s small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4’s power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4’s unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by rotation fusion into the weights, and fast online computation of the activations. This leads to speedups vs. FP16 of up to 3.6x layer-wise, and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it can near the accuracy that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.

[584] PATCH: Learnable Tile-level Hybrid Sparsity for LLMs

Younes Hourri, Mohammad Mozaffari, Maryam Mehri Dehnavi

Main category: cs.LG

TL;DR: PATCH introduces a hybrid sparsity framework that enables continuous sparsity ratios between 0% and 50% by partitioning weight matrices into tiles and assigning each tile to be either dense or 2:4 sparse, achieving better accuracy-acceleration tradeoffs than existing pruning methods.

Details

Motivation: Existing model pruning approaches face limitations: unstructured sparsity preserves accuracy but prevents GPU acceleration due to irregular access patterns, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. There's a need to bridge this gap.

Method: PATCH partitions weight matrices into tiles and uses a learnable mask selection mechanism to assign each tile as either dense or 2:4 sparse. This provides fine-grained control over sparsity ratios and supports non-uniform sparsity across layers.

Result: Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. On LLaMA-2 7B with A6000 GPU, it achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to state-of-the-art 2:4 pruning method MaskLLM.

Conclusion: PATCH successfully bridges the gap between unstructured and semi-structured sparsity by enabling continuous sparsity ratios, providing superior accuracy-acceleration tradeoffs and supporting non-uniform sparsity across layers for better overall model quality.

Abstract: Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an effective way to reduce these overheads, yet existing approaches face challenges: unstructured sparsity, where nonzeros can appear anywhere, preserves accuracy but yields irregular access patterns that prevent GPU acceleration, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. To bridge this gap, we introduce PATCH, a hybrid sparsity framework that enables a continuous sparsity ratio between 0% and 50%. PATCH partitions weight matrices into tiles, assigning each tile to be either dense or 2:4 sparse via a learnable mask selection mechanism. This design provides fine-grained control over accuracy-acceleration tradeoffs and supports non-uniform sparsity across layers, leading to superior overall quality. Across models from 0.5B to 8B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. For instance, on LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to the state-of-the-art 2:4 pruning method, MaskLLM.

[585] Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption

Longxiang He, Deheng Ye, Junbo Tan, Xueqian Wang, Li Shen

Main category: cs.LG

TL;DR: RPEX is a robust offline-to-online RL method that addresses data corruption by incorporating Inverse Probability Weighted (IPW) into the online exploration policy to alleviate heavy-tailed behavior induced by corrupted data.

Details

Motivation: Existing O2O RL methods focus on mitigating offline policy conservatism but ignore robustness under data corruption (states, actions, rewards, dynamics), which severely degrades performance and induces heavy-tailed policy behavior.

Method: Propose RPEX (Robust Policy Expansion) that incorporates Inverse Probability Weighted (IPW) into the online exploration policy to alleviate heavy-tailedness caused by data corruption.

Result: Extensive experiments on D4RL datasets show RPEX achieves state-of-the-art O2O performance across various data corruption scenarios.

Conclusion: RPEX is a simple yet effective method that successfully addresses the robustness challenge in offline-to-online RL under data corruption conditions.

Abstract: Pretraining a policy on offline data followed by fine-tuning through online interactions, known as Offline-to-Online Reinforcement Learning (O2O RL), has emerged as a promising paradigm for real-world RL deployment. However, both offline datasets and online interactions in practical environments are often noisy or even maliciously corrupted, severely degrading the performance of O2O RL. Existing works primarily focus on mitigating the conservatism of offline policies via online exploration, while the robustness of O2O RL under data corruption, including states, actions, rewards, and dynamics, is still unexplored. In this work, we observe that data corruption induces heavy-tailed behavior in the policy, thereby substantially degrading the efficiency of online exploration. To address this issue, we incorporate Inverse Probability Weighted (IPW) into the online exploration policy to alleviate heavy-tailedness, and propose a novel, simple yet effective method termed $\textbf{RPEX}$: $\textbf{R}$obust $\textbf{P}$olicy $\textbf{EX}$pansion. Extensive experimental results on D4RL datasets demonstrate that RPEX achieves SOTA O2O performance across a wide range of data corruption scenarios. Code is available at $\href{https://github.com/felix-thu/RPEX}{https://github.com/felix-thu/RPEX}$.

[586] Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez

Main category: cs.LG

TL;DR: TPCs are flexible safety monitors for LLMs that progressively evaluate polynomial terms, allowing early stopping for easy inputs and more computation for difficult cases, balancing efficiency and safety.

Details

Motivation: Traditional safety monitors use fixed computation for all queries, creating inefficiency with easy inputs and risk with subtle cases. There's a need for flexible monitoring where costs scale with input difficulty.

Method: Introduces Truncated Polynomial Classifiers (TPCs) as an extension of linear probes, trained and evaluated progressively term-by-term, enabling early stopping for lightweight monitoring or more terms for stronger guardrails.

Result: TPCs compete with or outperform MLP-based probe baselines on two large-scale safety datasets (WildGuardMix and BeaverTails) across 4 models up to 30B parameters, while being more interpretable than black-box alternatives.

Conclusion: TPCs provide flexible safety monitoring through progressive evaluation, serving as both a safety dial for adjustable protection levels and an adaptive cascade for efficient computation based on input difficulty.

Abstract: Monitoring large language models’ (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible–costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can “buy” stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts. Our code is available at http://github.com/james-oldfield/tpc.

[587] Predictive Preference Learning from Human Interventions

Haoyuan Cai, Zhenghao Peng, Bolei Zhou

Main category: cs.LG

TL;DR: PPL (Predictive Preference Learning) is an interactive imitation learning method that propagates human intervention corrections to future states using preference optimization, improving learning efficiency and reducing human demonstrations needed.

Details

Motivation: Current interactive imitation learning methods only correct agent actions at current states but fail to adjust potentially more hazardous actions in future states, limiting their effectiveness in safety-critical applications.

Method: PPL bootstraps each human intervention into L future time steps (preference horizon), assuming the agent follows the same action and human makes the same intervention. It then applies preference optimization on these future states to propagate expert corrections.

Result: Experiments on autonomous driving and robotic manipulation benchmarks show PPL significantly improves learning efficiency and reduces human demonstrations needed. Theoretical analysis shows appropriate preference horizon L balances risky state coverage with label correctness.

Conclusion: PPL effectively propagates human intervention corrections to future safety-critical regions, improving learning efficiency while bounding algorithmic optimality gap through proper preference horizon selection.

Abstract: Learning from human involvement aims to incorporate the human subject to monitor and correct agent behavior errors. Although most interactive imitation learning methods focus on correcting the agent’s action at the current state, they do not adjust its actions in future states, which may be potentially more hazardous. To address this, we introduce Predictive Preference Learning from Human Interventions (PPL), which leverages the implicit preference signals contained in human interventions to inform predictions of future rollouts. The key idea of PPL is to bootstrap each human intervention into L future time steps, called the preference horizon, with the assumption that the agent follows the same action and the human makes the same intervention in the preference horizon. By applying preference optimization on these future states, expert corrections are propagated into the safety-critical regions where the agent is expected to explore, significantly improving learning efficiency and reducing human demonstrations needed. We evaluate our approach with experiments on both autonomous driving and robotic manipulation benchmarks and demonstrate its efficiency and generality. Our theoretical analysis further shows that selecting an appropriate preference horizon L balances coverage of risky states with label correctness, thereby bounding the algorithmic optimality gap. Demo and code are available at: https://metadriverse.github.io/ppl

[588] Lightweight Transformer for EEG Classification via Balanced Signed Graph Algorithm Unrolling

Junyi Yao, Parham Eftekhar, Gene Cheung, Xujin Chris Liu, Yao Wang, Wei Hu

Main category: cs.LG

TL;DR: The paper proposes a lightweight transformer-like neural network for EEG signal classification by unrolling a spectral denoising algorithm on balanced signed graphs, achieving comparable performance to deep learning methods with fewer parameters.

Details

Motivation: EEG signals have inherent anti-correlations that can be modeled by negative edges in graphs. The goal is to differentiate epilepsy patients from healthy subjects using interpretable and efficient models.

Method: Build transformer-like neural nets by unrolling spectral denoising on balanced signed graphs. Use Lanczos approximation for efficient low-pass filtering on mapped positive graphs, with learned optimal cutoff frequency. Use reconstruction errors from two balanced signed graph denoisers for binary classification.

Result: The method achieves classification performance comparable to representative deep learning schemes while employing dramatically fewer parameters.

Conclusion: The proposed approach provides an interpretable and parameter-efficient alternative to deep learning for EEG signal classification, leveraging the spectral properties of balanced signed graphs.

Abstract: Samples of brain signals collected by EEG sensors have inherent anti-correlations that are well modeled by negative edges in a finite graph. To differentiate epilepsy patients from healthy subjects using collected EEG signals, we build lightweight and interpretable transformer-like neural nets by unrolling a spectral denoising algorithm for signals on a balanced signed graph – graph with no cycles of odd number of negative edges. A balanced signed graph has well-defined frequencies that map to a corresponding positive graph via similarity transform of the graph Laplacian matrices. We implement an ideal low-pass filter efficiently on the mapped positive graph via Lanczos approximation, where the optimal cutoff frequency is learned from data. Given that two balanced signed graph denoisers learn posterior probabilities of two different signal classes during training, we evaluate their reconstruction errors for binary classification of EEG signals. Experiments show that our method achieves classification performance comparable to representative deep learning schemes, while employing dramatically fewer parameters.

[589] Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs

Fatmazohra Rezkellah, Ramzi Dakhmouche

Main category: cs.LG

TL;DR: The paper proposes a unified constrained optimization approach for LLM safety that addresses both sensitive information unlearning and jail-breaking robustness through minimal weight interventions.

Details

Motivation: With increasing LLM adoption, there's a need for privacy-preserving and safe generation through unlearning sensitive information and improving robustness against jail-breaking attacks.

Method: Constrained optimization formulations that find smallest interventions on LLM weights to make vocabulary sets unreachable or shift weights to safer regions, without requiring oracle classifiers.

Result: The proposed point-wise constraint-based intervention outperforms max-min interventions and achieves superior performance compared to state-of-the-art defense methods.

Conclusion: A unified optimization approach effectively addresses both unlearning and robustness in LLMs with better performance and lower computational cost than existing methods.

Abstract: With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jail-breaking attacks. We investigate various constrained optimization formulations that address both aspects in a \emph{unified manner}, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or embed the LLM with robustness to tailored attacks by shifting part of the weights to a \emph{safer} region. Beyond unifying two key properties, this approach contrasts with previous work in that it doesn’t require an oracle classifier that is typically not available or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates superior performance of the proposed approach.

[590] Merge and Guide: Unifying Model Merging and Guided Decoding for Controllable Multi-Objective Generation

Guofu Xie, Chen Zhang, Xiao Zhang, Yunsheng Shi, Ting Yao, Jun Xu

Main category: cs.LG

TL;DR: MAGE is a two-stage framework that combines model merging with guided decoding to improve controllable multi-objective generation, addressing compatibility issues and outperforming existing methods.

Details

Motivation: Existing methods for controllable multi-objective generation are insufficient - merging approaches provide indirect control while decoding guidance requires multiple expert models with high space overhead and dependency on individual model capacity.

Method: Two-stage framework: Stage 1 dynamically constructs a robust base model by merging backbone models for multiple objectives; Stage 2 merges explicit and implicit value models into a unified guidance proxy that steers the base model’s decoding.

Result: Empirically validates Linear Mode Connectivity in value models, explores relationship between model merging and prediction ensembling, and demonstrates enhanced controllability. Outperforms existing approaches with superior controllability, Pareto-optimal performance, and enhanced adaptability.

Conclusion: MAGE framework successfully addresses limitations of existing methods by combining model merging with guided decoding, achieving better performance in controllable multi-objective generation tasks.

Abstract: Adapting to diverse user needs at test time is a key challenge in controllable multi-objective generation. Existing methods are insufficient: merging-based approaches provide indirect, suboptimal control at the parameter level, often disregarding the impacts of multiple objectives. While decoding-based guidance is more direct, it typically requires aggregating logits from multiple expert models, incurring significant space overhead and relying heavily on individual model capacity. To address these issues, we introduce Merge-And-GuidE (MAGE), a two-stage framework that leverages model merging for guided decoding. We first identify a critical compatibility problem between the guidance and base models. In Stage 1, MAGE resolves this by dynamically constructing a more robust base model, merging a series of backbone models that account for multiple objectives. In Stage 2, we merge explicit and implicit value models into a unified guidance proxy, which then steers the decoding of the base model from Stage 1. Our analysis empirically validates Linear Mode Connectivity (LMC) in value models, explores the relationship between model merging and prediction ensembling, and demonstrates the enhanced controllability afforded by our approach. Extensive experiments show that our method outperforms existing approaches, achieving superior controllability, Pareto-optimal performance, and enhanced adaptability.

[591] Quantifying the Accuracy-Interpretability Trade-Off in Concept-Based Sidechannel Models

David Debot, Giuseppe Marra

Main category: cs.LG

TL;DR: Proposes SIS regularization to balance accuracy and interpretability in Concept Sidechannel Models by quantifying and penalizing sidechannel reliance.

Details

Motivation: Address the trade-off between accuracy and interpretability in Concept Sidechannel Models, where sidechannels improve accuracy but compromise interpretability.

Method: Introduces Sidechannel Independence Score (SIS) metric and SIS regularization to penalize sidechannel reliance, within a unified probabilistic framework.

Result: SIS regularization substantially improves interpretability, intervenability, and quality of learned interpretable task predictors while maintaining accuracy.

Conclusion: Provides theoretical and practical tools for principled development of CSMs that balance accuracy and interpretability.

Abstract: Concept Bottleneck Models (CBNMs) are deep learning models that provide interpretability by enforcing a bottleneck layer where predictions are based exclusively on human-understandable concepts. However, this constraint also restricts information flow and often results in reduced predictive accuracy. Concept Sidechannel Models (CSMs) address this limitation by introducing a sidechannel that bypasses the bottleneck and carry additional task-relevant information. While this improves accuracy, it simultaneously compromises interpretability, as predictions may rely on uninterpretable representations transmitted through sidechannels. Currently, there exists no principled technique to control this fundamental trade-off. In this paper, we close this gap. First, we present a unified probabilistic concept sidechannel meta-model that subsumes existing CSMs as special cases. Building on this framework, we introduce the Sidechannel Independence Score (SIS), a metric that quantifies a CSM’s reliance on its sidechannel by contrasting predictions made with and without sidechannel information. We propose SIS regularization, which explicitly penalizes sidechannel reliance to improve interpretability. Finally, we analyze how the expressivity of the predictor and the reliance of the sidechannel jointly shape interpretability, revealing inherent trade-offs across different CSM architectures. Empirical results show that state-of-the-art CSMs, when trained solely for accuracy, exhibit low representation interpretability, and that SIS regularization substantially improves their interpretability, intervenability, and the quality of learned interpretable task predictors. Our work provides both theoretical and practical tools for developing CSMs that balance accuracy and interpretability in a principled manner.

[592] Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Xiaoyun Zhang, Xiaojian Yuan, Di Huang, Wang You, Chen Hu, Jingqing Ruan, Kejiang Chen, Xing Hu

Main category: cs.LG

TL;DR: AER is an adaptive entropy regularization framework that dynamically balances exploration and exploitation in RLVR training to prevent policy entropy collapse and improve reasoning performance in LLMs.

Details

Motivation: RLVR training suffers from policy entropy collapse where policies become overly deterministic, hindering exploration. Fixed entropy regularization coefficients are unstable across tasks and models.

Method: Proposed Adaptive Entropy Regularization (AER) with three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment.

Result: Experiments on multiple mathematical reasoning benchmarks show AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.

Conclusion: AER effectively addresses the limitations of fixed entropy regularization in RLVR, demonstrating that adaptive entropy control can significantly enhance reasoning performance in LLMs.

Abstract: Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER)–a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.

[593] Best-of-Both Worlds for linear contextual bandits with paid observations

Nathan Boyer, Dorian Baudry, Patrick Rebeschini

Main category: cs.LG

TL;DR: A Best-of-Both-Worlds algorithm for linear contextual bandits with paid observations that achieves minimax-optimal regret in adversarial settings and poly-logarithmic regret in stochastic regimes.

Details

Motivation: To address the problem of linear contextual bandits where observing arm losses incurs a fixed cost, requiring efficient algorithms that perform well in both adversarial and stochastic environments.

Method: Follow-the-Regularized-Leader framework with efficient estimators via Matrix Geometric Resampling, building on the BOBW framework for hard problems.

Result: The algorithm achieves Θ(T^{2/3}) minimax-optimal regret in adversarial settings and poly-logarithmic regret in (corrupted) stochastic regimes.

Conclusion: The proposed BOBW algorithm provides computationally efficient performance guarantees for linear contextual bandits with paid observations across different environmental settings.

Abstract: We study the problem of linear contextual bandits with paid observations, where at each round the learner selects an action in order to minimize its loss in a given context, and can then decide to pay a fixed cost to observe the loss of any arm. Building on the Follow-the-Regularized-Leader framework with efficient estimators via Matrix Geometric Resampling, we introduce a computationally efficient Best-of-Both-Worlds (BOBW) algorithm for this problem. We show that it achieves the minimax-optimal regret of $\Theta(T^{2/3})$ in adversarial settings, while guaranteeing poly-logarithmic regret in (corrupted) stochastic regimes. Our approach builds on the framework from \cite{BOBWhardproblems} to design BOBW algorithms for ``hard problem’’, using analysis techniques tailored for the setting that we consider.

[594] Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning

Aman Sharma, Paras Chopra

Main category: cs.LG

TL;DR: A novel entropy-based framework that uses token-level Shannon entropy as a confidence signal for early stopping in LLM reasoning tasks, achieving 25-50% computational savings while maintaining accuracy.

Details

Motivation: To improve token efficiency in large language models during reasoning tasks by exploiting emergent confidence awareness in advanced reasoning models.

Method: Uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, with entropy thresholds calculated easily using few examples from reasoning datasets.

Result: Achieves 25-50% computational savings while maintaining task accuracy, demonstrating consistent performance across reasoning-optimized model families.

Conclusion: Entropy-based confidence calibration is an emergent property of advanced post-training optimization in modern reasoning models, representing a distinguishing characteristic versus predecessors.

Abstract: We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they’ve gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.

[595] ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models

Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Ariel Kuperman, Tim Elson

Main category: cs.LG

TL;DR: ENIGMA is a novel LLM training approach that jointly improves reasoning, alignment and robustness by treating organizational policies as directions on the model’s information manifold, combining GRPO, SAMI-style mutual information, and entropic regularization.

Details

Motivation: To address the challenge that reasoning, alignment, and robustness in LLMs are typically treated as separate problems, and to demonstrate they can be unified through a single information-geometric objective without requiring reward models.

Method: Single-loop training combining Group-Relative Policy Optimization (GRPO) with Chain-of-Thought rewards, Self-Supervised Alignment with Mutual Information (SAMI), and entropic Sinkhorn optimal-transport regularization on hidden-state distributions. Introduces infoNCE metrics and Sufficiency Index (SI) for policy selection.

Result: Experiments with 1B LLMs show high-SI principles predict steadier training dynamics and improved benchmark performance over GRPO ablations. Information-geometry analysis validates desirable structural changes in the manifold.

Conclusion: ENIGMA demonstrates that reasoning, alignment, and robustness are projections of a single information-geometric objective, enabling principled reasoning without reward models and offering a path to trusted capability.

Abstract: We present Entropic Mutual-Information Geometry Large-Language Model Alignment (ENIGMA), a novel approach to Large-Language Model (LLM) training that jointly improves reasoning, alignment and robustness by treating an organisation’s policies/principles as directions to move on a model’s information manifold. Our single-loop trainer combines Group-Relative Policy Optimisation (GRPO), an on-policy, critic-free RL method with Chain-of-Thought (CoT)-format only rewards; a Self-Supervised Alignment with Mutual Information (SAMI)-style symmetric InfoNCE auxiliary; and an entropic Sinkhorn optimal-transport regulariser on hidden-state distributions to bound geometry drift. We also introduce infoNCE metrics that specialise to a standard MI lower bound under matched negatives to measure how strongly a model’s CoT encodes these policies. These metrics include a Sufficiency Index (SI) that enables the selection and creation of principles that maximise downstream performance prior to training. In our experiments using small (1B) LLMs, high-SI principles predict steadier training dynamics and improved benchmark performance over GRPO ablations. Our information-geometry analysis of trained models validates desirable structural change in the manifold. These results support our hypothesis that reasoning, alignment, and robustness are projections of a single information-geometric objective, and that models trained using ENIGMA demonstrate principled reasoning without the use of a reward model, offering a path to trusted capability

[596] Rényi Sharpness: A Novel Sharpness that Strongly Correlates with Generalization

Qiaozhe Zhang, Jun Sun, Ruijie Zhang, Yingzhuang Liu

Main category: cs.LG

TL;DR: Proposes Rényi sharpness, a novel sharpness measure based on Rényi entropy of loss Hessian eigenvalues, which shows strong correlation with generalization and outperforms existing sharpness-aware minimization methods when used as a regularizer.

Details

Motivation: Existing sharpness measures often show weak correlation with generalization despite the intuition that flatter loss landscapes should generalize better. The authors aim to bridge this gap between intuition and empirical reality.

Method: Define Rényi sharpness as negative Rényi entropy of loss Hessian eigenvalues, leveraging the insight that uniform eigenvalues are desirable for generalization. Use reparametrization invariance and data-to-weight perturbation translation to establish generalization bounds. Also propose Rényi Sharpness Aware Minimization (RSAM) as a training regularizer.

Result: Strong correlation (Kendall rank correlation) between Rényi sharpness and generalization. RSAM outperforms all existing sharpness-aware minimization methods, achieving up to 2.5% test accuracy gain over classical SAM method.

Conclusion: Rényi sharpness effectively captures the relationship between loss landscape geometry and generalization, providing both theoretical guarantees and practical improvements through RSAM regularization.

Abstract: Sharpness (of the loss minima) is a common measure to investigate the generalization of neural networks. Intuitively speaking, the flatter the landscape near the minima is, the better generalization might be. Unfortunately, the correlation between many existing sharpness measures and the generalization is usually not strong, sometimes even weak. To close the gap between the intuition and the reality, we propose a novel sharpness measure, i.e., \textit{R'enyi sharpness}, which is defined as the negative R'enyi entropy (a generalization of the classical Shannon entropy) of the loss Hessian. The main ideas are as follows: 1) we realize that \textit{uniform} (identical) eigenvalues of the loss Hessian is most desirable (while keeping the sum constant) to achieve good generalization; 2) we employ the \textit{R'enyi entropy} to concisely characterize the extent of the spread of the eigenvalues of loss Hessian. Normally, the larger the spread, the smaller the (R'enyi) entropy. To rigorously establish the relationship between generalization and (R'enyi) sharpness, we provide several generalization bounds in terms of R'enyi sharpness, by taking advantage of the reparametrization invariance property of R'enyi sharpness, as well as the trick of translating the data discrepancy to the weight perturbation. Furthermore, extensive experiments are conducted to verify the strong correlation (in specific, Kendall rank correlation) between the R'enyi sharpness and generalization. Moreover, we propose to use a variant of R'enyi Sharpness as regularizer during training, i.e., R'enyi Sharpness Aware Minimization (RSAM), which turns out to outperform all existing sharpness-aware minimization methods. It is worthy noting that the test accuracy gain of our proposed RSAM method could be as high as nearly 2.5%, compared against the classical SAM method.

[597] The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models

Konrad Löhr, Shuzhou Yuan, Michael Färber

Main category: cs.LG

TL;DR: This study investigates political bias and stereotype propagation in eight large language models using the Political Compass Test, revealing consistent left-leaning alignment and more pronounced implicit stereotypes through multilingual testing.

Details

Motivation: Understanding political biases in LLMs is crucial to prevent undue influence on public opinion and democratic processes, given their growing societal influence in information dissemination and decision-making.

Method: Used the two-dimensional Political Compass Test (PCT) to assess inherent political leanings, employed persona prompting with PCT to explore explicit stereotypes, and evaluated implicit stereotypes using multilingual versions of PCT.

Result: All investigated models showed consistent left-leaning political alignment. Implicit stereotypes elicited through language variation were more pronounced than explicit ones, with most models showing notable alignment between implicit and explicit stereotypes.

Conclusion: The study underscores the complex interplay of political bias and stereotypes in LLMs, suggesting models may have some awareness of their inherent biases, which has implications for their responsible deployment in society.

Abstract: Large Language Models (LLMs) are increasingly integral to information dissemination and decision-making processes. Given their growing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propagation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to assess the inherent political leanings of these models. Subsequently, persona prompting with the PCT is used to explore explicit stereotypes across various social dimensions. In a final step, implicit stereotypes are uncovered by evaluating models with multilingual versions of the PCT. Key findings reveal a consistent left-leaning political alignment across all investigated models. Furthermore, while the nature and extent of stereotypes vary considerably between models, implicit stereotypes elicited through language variation are more pronounced than those identified via explicit persona prompting. Interestingly, for most models, implicit and explicit stereotypes show a notable alignment, suggesting a degree of transparency or “awareness” regarding their inherent biases. This study underscores the complex interplay of political bias and stereotypes in LLMs.

[598] TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference

Michal Podstawski

Main category: cs.LG

TL;DR: Small transformer models can infer graph parameters from graph representations, with LoRA fine-tuning improving performance on the TinyGraphEstimator dataset.

Details

Motivation: To explore whether compact, resource-efficient language models can perform structural inference on graphs, as this capability remains largely unexplored compared to larger models.

Method: Introduce TinyGraphEstimator dataset with connected graphs and structural metadata; evaluate small open models on predicting graph parameters; apply LoRA fine-tuning for efficient parameter adaptation.

Result: Small language models show non-trivial reasoning capacity on graph-structured data, with LoRA fine-tuning achieving consistent improvements across all evaluated metrics.

Conclusion: Compact transformer-based models can effectively perform structural inference on graphs through efficient parameter tuning, demonstrating their potential for graph analysis tasks.

Abstract: Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.

[599] Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Fengyuan Liu, Marco Ciccone, Angelo Porrello, Simone Calderara

Main category: cs.LG

TL;DR: GradFix enables efficient transfer of task vectors across different foundation model versions by leveraging gradient sign structure, requiring only a few labeled samples without additional fine-tuning.

Details

Motivation: Current practice requires full fine-tuning for each new foundation model release, even when the same task was already solved in previous versions. Task vectors often fail to transfer due to misaligned parameter spaces.

Method: GradFix approximates the ideal gradient sign structure of the new model and uses it to mask the source task vector, achieving local alignment with the target loss landscape through minimal gradient computations.

Result: Significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning.

Conclusion: GradFix provides an efficient method for transferring task knowledge across foundation model versions with theoretical guarantees and practical effectiveness.

Abstract: When a new release of a foundation model is published, practitioners typically need to repeat full fine-tuning, even if the same task has already been solved in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, they often fail to transfer across different pre-trained models due to their misaligned parameter space. In this work, we show that the key to successful transfer lies in the sign structure of the gradients of the new model. Based on this insight, we propose GradFix, a novel method that approximates the ideal gradient sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: the adaptation is achieved by computing a few gradients at the target model and masking the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning.

[600] Latent Retrieval Augmented Generation of Cross-Domain Protein Binders

Zishen Zhang, Xiangzhe Kong, Wenbing Huang, Yang Liu

Main category: cs.LG

TL;DR: RADiAnce is a framework that uses retrieval-augmented diffusion to design protein binders by leveraging known interfaces to guide novel binder generation through cross-domain interface transfer.

Details

Motivation: Current structure-based generative models lack sufficient rationality and interpretability in generating protein interfaces, creating a fundamental challenge in drug discovery for designing functional binders.

Method: Unifies retrieval and generation in a shared contrastive latent space, identifies relevant interfaces for binding sites, and integrates them through conditional latent diffusion generator for cross-domain interface transfer.

Result: Significantly outperforms baseline models in binding affinity, geometry recovery, and interaction recovery. Validates cross-domain generalization where retrieving interfaces from diverse domains enhances binder generation performance.

Conclusion: Establishes a new paradigm for protein binder design that successfully bridges retrieval-based knowledge and generative AI, opening new possibilities for drug discovery.

Abstract: Designing protein binders targeting specific sites, which requires to generate realistic and functional interaction patterns, is a fundamental challenge in drug discovery. Current structure-based generative models are limited in generating nterfaces with sufficient rationality and interpretability. In this paper, we propose Retrieval-Augmented Diffusion for Aligned interface (RADiAnce), a new framework that leverages known interfaces to guide the design of novel binders. By unifying retrieval and generation in a shared contrastive latent space, our model efficiently identifies relevant interfaces for a given binding site and seamlessly integrates them through a conditional latent diffusion generator, enabling cross-domain interface transfer. Extensive exeriments show that RADiAnce significantly outperforms baseline models across multiple metrics, including binding affinity and recovery of geometries and interactions. Additional experimental results validate cross-domain generalization, demonstrating that retrieving interfaces from diverse domains, such as peptides, antibodies, and protein fragments, enhances the generation performance of binders for other domains. Our work establishes a new paradigm for protein binder design that successfully bridges retrieval-based knowledge and generative AI, opening new possibilities for drug discovery.

[601] Thompson Sampling via Fine-Tuning of LLMs

Nicolas Menet, Aleksandar Terzić, Michael Hersche, Andreas Krause, Abbas Rahimi

Main category: cs.LG

TL;DR: ToSFiT is a scalable Bayesian optimization method that uses Thompson sampling via fine-tuning of large language models to avoid acquisition function maximization in large discrete spaces.

Details

Motivation: Bayesian optimization in large unstructured discrete spaces is computationally expensive due to the need for acquisition function maximization without gradients.

Method: Thompson Sampling via Fine-Tuning (ToSFiT) parameterizes the probability that a candidate yields maximum reward, leveraging prompt-conditioned LLMs and incrementally adapting them toward the posterior.

Result: The method achieves strong theoretical regret bounds matching standard Thompson sampling and shows significant sample efficiency improvements on FAQ response refinement, protein search, and quantum circuit design tasks.

Conclusion: Online fine-tuning with ToSFiT provides efficient Bayesian optimization in large discrete spaces with negligible computational overhead.

Abstract: Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, Thompson Sampling via Fine-Tuning (ToSFiT) leverages the prior knowledge embedded in prompt-conditioned large language models, and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality–a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. We demonstrate that online fine-tuning significantly improves sample efficiency, with negligible impact on computational efficiency.

[602] A Hybrid Machine Learning Approach for Synthetic Data Generation with Post Hoc Calibration for Clinical Tabular Datasets

Md Ibrahim Shikder Mahin, Md Shamsul Arefin, Md Tanvir Hasan

Main category: cs.LG

TL;DR: A hybrid framework for healthcare data synthesis using multiple augmentation methods with reinforcement learning-based dynamic weight selection and advanced calibration techniques to generate high-fidelity synthetic data while preserving privacy.

Details

Motivation: Healthcare research faces data scarcity and privacy regulations (HIPAA, GDPR) that limit access to real medical data, impeding AI innovation and patient care advancements.

Method: Hybrid framework integrating five augmentation methods (noise injection, interpolation, GMM sampling, CVAE sampling, SMOTE) with reinforcement learning-based dynamic weight selection and advanced calibration techniques including moment matching, histogram matching, and iterative refinement.

Result: Achieved Wasserstein distances as low as 0.001, Kolmogorov-Smirnov statistics around 0.01, pairwise trend scores over 90%, and downstream classifiers with up to 94% accuracy and F1 scores above 93%, comparable to real data performance.

Conclusion: The scalable, privacy-preserving approach sets new benchmarks for joint-distribution fidelity in healthcare and supports sensitive AI applications while maintaining robust privacy protection.

Abstract: Healthcare research and development face significant obstacles due to data scarcity and stringent privacy regulations, such as HIPAA and the GDPR, restricting access to essential real-world medical data. These limitations impede innovation, delay robust AI model creation, and hinder advancements in patient-centered care. Synthetic data generation offers a transformative solution by producing artificial datasets that emulate real data statistics while safeguarding patient privacy. We introduce a novel hybrid framework for high-fidelity healthcare data synthesis integrating five augmentation methods: noise injection, interpolation, Gaussian Mixture Model (GMM) sampling, Conditional Variational Autoencoder (CVAE) sampling, and SMOTE, combined via a reinforcement learning-based dynamic weight selection mechanism. Its key innovations include advanced calibration techniques – moment matching, full histogram matching, soft and adaptive soft histogram matching, and iterative refinement – that align marginal distributions and preserve joint feature dependencies. Evaluated on the Breast Cancer Wisconsin (UCI Repository) and Khulna Medical College cardiology datasets, our calibrated hybrid achieves Wasserstein distances as low as 0.001 and Kolmogorov-Smirnov statistics around 0.01, demonstrating near-zero marginal discrepancy. Pairwise trend scores surpass 90%, and Nearest Neighbor Adversarial Accuracy approaches 50%, confirming robust privacy protection. Downstream classifiers trained on synthetic data achieve up to 94% accuracy and F1 scores above 93%, comparable to models trained on real data. This scalable, privacy-preserving approach matches state-of-the-art methods, sets new benchmarks for joint-distribution fidelity in healthcare, and supports sensitive AI applications.

cs.MA

[603] Benefits and Limitations of Communication in Multi-Agent Reasoning

Michael Rizvi-Martel, Satwik Bhattamishra, Neil Rathi, Guillaume Rabusseau, Michael Hahn

Main category: cs.MA

TL;DR: The paper proposes a theoretical framework to analyze the expressivity of multi-agent systems for complex reasoning tasks, deriving bounds on agent requirements, communication patterns, and speedups, with empirical validation.

Details

Motivation: Chain-of-thought prompting struggles with complex problems and long contexts, while multi-agent systems offer a promising solution but their fundamental capacities are poorly understood.

Method: Developed a theoretical framework to analyze multi-agent expressivity, applied to state tracking, recall, and k-hop reasoning algorithms, with experimental validation on pretrained LLMs using synthetic benchmarks.

Result: Derived bounds on agent requirements, communication structure, and speedups; identified regimes where communication is beneficial and tradeoffs between agent count and bandwidth; exposed limitations under resource constraints.

Conclusion: The analysis provides principled guidance for designing scalable multi-agent reasoning systems, with empirical results confirming theoretical predictions about key tradeoffs.

Abstract: Chain-of-thought prompting has popularized step-by-step reasoning in large language models, yet model performance still degrades as problem complexity and context length grow. By decomposing difficult tasks with long contexts into shorter, manageable ones, recent multi-agent paradigms offer a promising near-term solution to this problem. However, the fundamental capacities of such systems are poorly understood. In this work, we propose a theoretical framework to analyze the expressivity of multi-agent systems. We apply our framework to three algorithmic families: state tracking, recall, and $k$-hop reasoning. We derive bounds on (i) the number of agents required to solve the task exactly, (ii) the quantity and structure of inter-agent communication, and (iii) the achievable speedups as problem size and context scale. Our results identify regimes where communication is provably beneficial, delineate tradeoffs between agent count and bandwidth, and expose intrinsic limitations when either resource is constrained. We complement our theoretical analysis with a set of experiments on pretrained LLMs using controlled synthetic benchmarks. Empirical outcomes confirm the tradeoffs between key quantities predicted by our theory. Collectively, our analysis offers principled guidance for designing scalable multi-agent reasoning systems.

[604] Static Sandboxes Are Inadequate: Modeling Societal Complexity Requires Open-Ended Co-Evolution in LLM-Based Multi-Agent Simulations

Jinkun Chen, Sher Badshah, Xuemin Yu, Sijia Han, Jiechao Gao

Main category: cs.MA

TL;DR: The paper argues that current multi-agent simulations using LLMs are too static and constrained, and calls for developing open-ended, adaptive systems that can evolve unpredictably to better model real-world social complexity.

Details

Motivation: Current LLM-powered multi-agent systems and social simulations are limited by static sandboxes with predefined tasks and rigid evaluation, preventing them from capturing the complexity of real-world societies.

Method: The authors critically review emerging architectures combining LLMs with multi-agent dynamics, identify key challenges (balancing stability/diversity, evaluating unexpected behaviors, scaling complexity), and introduce a new taxonomy for the field.

Result: The paper presents a research roadmap focused on open-endedness, continuous co-evolution, and developing resilient, socially aligned AI ecosystems.

Conclusion: The community should move beyond static paradigms and help create the next generation of adaptive, socially-aware multi-agent simulations that can evolve and reshape their worlds in unpredictable ways.

Abstract: What if artificial agents could not just communicate, but also evolve, adapt, and reshape their worlds in ways we cannot fully predict? With llm now powering multi-agent systems and social simulations, we are witnessing new possibilities for modeling open-ended, ever-changing environments. Yet, most current simulations remain constrained within static sandboxes, characterized by predefined tasks, limited dynamics, and rigid evaluation criteria. These limitations prevent them from capturing the complexity of real-world societies. In this paper, we argue that static, task-specific benchmarks are fundamentally inadequate and must be rethought. We critically review emerging architectures that blend llm with multi-agent dynamics, highlight key hurdles such as balancing stability and diversity, evaluating unexpected behaviors, and scaling to greater complexity, and introduce a fresh taxonomy for this rapidly evolving field. Finally, we present a research roadmap centered on open-endedness, continuous co-evolution, and the development of resilient, socially aligned AI ecosystems. \textbf{We call on the community to move beyond static paradigms and help shape the next generation of adaptive, socially-aware multi-agent simulations.}

[605] Stop Reducing Responsibility in LLM-Powered Multi-Agent Systems to Local Alignment

Jinwei Hu, Yi Dong, Shuang Ao, Zhuoyun Li, Boxuan Wang, Lokesh Singh, Guangliang Cheng, Sarvapali D. Ramchurn, Xiaowei Huang

Main category: cs.MA

TL;DR: LLM-powered Multi-Agent Systems require a paradigm shift from local agent alignment to global systemic agreement, conceptualizing responsibility as a lifecycle-wide property with three key dimensions: agreement, uncertainty, and security.

Details

Motivation: LLM-MAS introduce new risks including unguaranteed agreement, cascading uncertainty, and adversarial vulnerabilities that require addressing responsibility at the system level rather than individual agent level.

Method: Proposes a dual-perspective governance framework combining interdisciplinary design with human-AI collaborative oversight to trace and ensure responsibility throughout the LLM-MAS lifecycle.

Result: Presents a conceptual framework for viewing LLM-MAS as unified socio-technical systems requiring principled mechanisms for responsibility dimensions.

Conclusion: Responsibility in LLM-MAS should be treated as a dynamic lifecycle property requiring systemic approaches that integrate human values with objective verifiability for ethically aligned and resilient behavior.

Abstract: LLM-powered Multi-Agent Systems (LLM-MAS) unlock new potentials in distributed reasoning, collaboration, and task generalization but also introduce additional risks due to unguaranteed agreement, cascading uncertainty, and adversarial vulnerabilities. We argue that ensuring responsible behavior in such systems requires a paradigm shift: from local, superficial agent-level alignment to global, systemic agreement. We conceptualize responsibility not as a static constraint but as a lifecycle-wide property encompassing agreement, uncertainty, and security, each requiring the complementary integration of subjective human-centered values and objective verifiability. Furthermore, a dual-perspective governance framework that combines interdisciplinary design with human-AI collaborative oversight is essential for tracing and ensuring responsibility throughout the lifecycle of LLM-MAS. Our position views LLM-MAS not as loose collections of agents, but as unified, dynamic socio-technical systems that demand principled mechanisms to support each dimension of responsibility and enable ethically aligned, verifiably coherent, and resilient behavior for sustained, system-wide agreement.

Prateek Gupta, Qiankun Zhong, Hiromu Yakura, Thomas Eisenmann, Iyad Rahwan

Main category: cs.MA

TL;DR: A CPR simulation framework that removes explicit reward signals and incorporates social learning and norm-based punishment to study emergent cooperation in LLM societies.

Details

Motivation: To study how norms and cooperation emerge in mixed-motive scenarios without explicit reward functions, mimicking human cooperation that relies on heuristics, communication, and punishment rather than full visibility into payoffs.

Method: A CPR simulation framework embedding cultural-evolutionary mechanisms: social learning (adopting strategies from successful peers) and norm-based punishment grounded in Ostrom’s principles. Agents learn individually from environmental feedback of harvesting, monitoring, and punishing.

Result: The framework reproduces key findings from human behavior studies and reveals systematic model differences in sustaining cooperation and norm formation across different environmental and social initializations (resource-rich vs. scarce; altruistic vs. selfish).

Conclusion: The framework serves as a rigorous testbed for studying emergent norms in mixed-motive LLM societies, informing AI system design for social and organizational contexts where cooperative norm alignment is critical.

Abstract: A growing body of multi-agent studies with Large Language Models (LLMs) explores how norms and cooperation emerge in mixed-motive scenarios, where pursuing individual gain can undermine the collective good. While prior work has explored these dynamics in both richly contextualized simulations and simplified game-theoretic environments, most LLM systems featuring common-pool resource (CPR) games provide agents with explicit reward functions directly tied to their actions. In contrast, human cooperation often emerges without full visibility into payoffs and population, relying instead on heuristics, communication, and punishment. We introduce a CPR simulation framework that removes explicit reward signals and embeds cultural-evolutionary mechanisms: social learning (adopting strategies and beliefs from successful peers) and norm-based punishment, grounded in Ostrom’s principles of resource governance. Agents also individually learn from the consequences of harvesting, monitoring, and punishing via environmental feedback, enabling norms to emerge endogenously. We establish the validity of our simulation by reproducing key findings from existing studies on human behavior. Building on this, we examine norm evolution across a $2\times2$ grid of environmental and social initialisations (resource-rich vs. resource-scarce; altruistic vs. selfish) and benchmark how agentic societies comprised of different LLMs perform under these conditions. Our results reveal systematic model differences in sustaining cooperation and norm formation, positioning the framework as a rigorous testbed for studying emergent norms in mixed-motive LLM societies. Such analysis can inform the design of AI systems deployed in social and organizational contexts, where alignment with cooperative norms is critical for stability, fairness, and effective governance of AI-mediated environments.

[607] Internet of Agents: Fundamentals, Applications, and Challenges

Yuntao Wang, Shaolong Guo, Yanghe Pan, Zhou Su, Fahao Chen, Tom H. Luan, Peng Li, Jiawen Kang, Dusit Niyato

Main category: cs.MA

TL;DR: The paper introduces Internet of Agents (IoA) as a unified framework for connecting heterogeneous AI agents, enabling seamless interconnection, dynamic discovery, and collaborative orchestration at scale.

Details

Motivation: With the proliferation of large language models and vision-language models, AI agents are evolving into autonomous, interactive entities operating across virtual and physical environments, creating a need for agent-centric infrastructure.

Method: Presents a general IoA architecture with hierarchical organization, analyzes key operational enablers including capability notification/discovery, adaptive communication protocols, dynamic task matching, consensus mechanisms, and incentive models.

Result: Establishes IoA as a foundational framework for agent interconnection and collaboration, identifying its distinguishing features relative to the traditional Internet and emerging applications.

Conclusion: Identifies open research directions toward building resilient and trustworthy IoA ecosystems, emphasizing the need for continued development in agent infrastructure.

Abstract: With the rapid proliferation of large language models and vision-language models, AI agents have evolved from isolated, task-specific systems into autonomous, interactive entities capable of perceiving, reasoning, and acting without human intervention. As these agents proliferate across virtual and physical environments, from virtual assistants to embodied robots, the need for a unified, agent-centric infrastructure becomes paramount. In this survey, we introduce the Internet of Agents (IoA) as a foundational framework that enables seamless interconnection, dynamic discovery, and collaborative orchestration among heterogeneous agents at scale. We begin by presenting a general IoA architecture, highlighting its hierarchical organization, distinguishing features relative to the traditional Internet, and emerging applications. Next, we analyze the key operational enablers of IoA, including capability notification and discovery, adaptive communication protocols, dynamic task matching, consensus and conflict-resolution mechanisms, and incentive models. Finally, we identify open research directions toward building resilient and trustworthy IoA ecosystems.

[608] ABMax: A JAX-based Agent-based Modeling Framework

Siddharth Chaturvedi, Ahmed El-Gazzar, Marcel van Gerven

Main category: cs.MA

TL;DR: ABMax is a JAX-based agent-based modeling framework that enables dynamic agent selection and updates while maintaining high performance through JIT compilation and vectorization.

Details

Motivation: JAX enables scaling ABMs but requires immutable array shapes, which constrains dynamic agent manipulation operations like updating variable numbers of agents with distinct changes.

Method: Developed ABMax framework with multiple JIT-compilable algorithms that support dynamic agent selection and updates while maintaining JAX’s performance benefits.

Result: Achieved comparable runtime performance to state-of-the-art implementations on predation model benchmark, and demonstrated vectorization capability for running multiple ABMs in parallel.

Conclusion: ABMax successfully bridges the gap between JAX’s performance requirements and ABM’s need for flexible agent manipulation, enabling scalable dynamic agent modeling.

Abstract: Agent-based modeling (ABM) is a principal approach for studying complex systems. By decomposing a system into simpler, interacting agents, agent-based modeling (ABM) allows researchers to observe the emergence of complex phenomena. High-performance array computing libraries like JAX can help scale such computational models to a large number of agents by using automatic vectorization and just-in-time (JIT) compilation. One of the caveats of using JAX to achieve such scaling is that the shapes of arrays used in the computational model should remain immutable throughout the simulation. In the context of agent-based modeling (ABM), this can pose constraints on certain agent manipulation operations that require flexible data structures. A subset of which is represented by the ability to update a dynamically selected number of agents by applying distinct changes to them during a simulation. To this effect, we introduce ABMax, an ABM framework based on JAX that implements multiple just-in-time (JIT) compilable algorithms to provide this functionality. On the canonical predation model benchmark, ABMax achieves runtime performance comparable to state-of-the-art implementations. Further, we show that this functionality can also be vectorized, making it possible to run many similar agent-based models in parallel. We also present two examples in the form of a traffic-flow model and a financial market model to show the use case of ABMax

cs.MM

[609] 360CityGML: Realistic and Interactive Urban Visualization System Integrating CityGML Model and 360° Videos

Tatsuro Banno, Mizuki Takenawa, Leslie Wöhler, Satoshi Ikehata, Kiyoharu Aizawa

Main category: cs.MM

TL;DR: A novel urban visualization system that integrates 3D CityGML models with 360-degree walkthrough videos to create photorealistic urban visualizations from pedestrian perspectives.

Details

Motivation: To enable intuitive interpretation of geospatial data by providing photorealistic urban visualizations from pedestrian viewpoints, bridging the gap between abstract 3D models and real-world video footage.

Method: Align 360-degree walkthrough videos with 3D urban models (CityGML) and dynamically project relevant video frames onto the model geometries.

Result: Creates photorealistic urban visualizations that allow users to intuitively interpret geospatial data from pedestrian perspectives.

Conclusion: The integrated system successfully combines 3D urban models with 360-degree videos to provide realistic, intuitive urban visualization experiences for pedestrian-level geospatial data interpretation.

Abstract: We introduce a novel urban visualization system that integrates 3D urban model (CityGML) and 360{\deg} walkthrough videos. By aligning the videos with the model and dynamically projecting relevant video frames onto the geometries, our system creates photorealistic urban visualizations, allowing users to intuitively interpret geospatial data from a pedestrian view.

[610] Deep Compositional Phase Diffusion for Long Motion Sequence Generation

Ho Yin Au, Jie Chen, Junkun Jiang, Jingyu Xiang

Main category: cs.MM

TL;DR: Compositional Phase Diffusion generates smooth transitions between multiple motion clips using semantic and transitional phase diffusion modules in the frequency domain.

Details

Motivation: Existing motion generation models struggle with maintaining continuity at transition boundaries between multiple semantically generated motion clips, causing awkward transitions and artifacts.

Method: Uses Semantic Phase Diffusion Module (SPDM) and Transitional Phase Diffusion Module (TPDM) operating in latent motion frequency domain via pre-trained Action-Centric Motion Phase Autoencoder (ACT-PAE) to incorporate semantic guidance and phase details from adjacent clips.

Result: Competitive performance in generating compositional motion sequences that align semantically with input conditions while preserving phase transitional continuity between motion clips.

Conclusion: The framework effectively addresses transition continuity issues in compositional motion generation and can be extended to motion inbetweening tasks by fixing phase parameters throughout diffusion.

Abstract: Recent research on motion generation has shown significant progress in generating semantically aligned motion with singular semantics. However, when employing these models to create composite sequences containing multiple semantically generated motion clips, they often struggle to preserve the continuity of motion dynamics at the transition boundaries between clips, resulting in awkward transitions and abrupt artifacts. To address these challenges, we present Compositional Phase Diffusion, which leverages the Semantic Phase Diffusion Module (SPDM) and Transitional Phase Diffusion Module (TPDM) to progressively incorporate semantic guidance and phase details from adjacent motion clips into the diffusion process. Specifically, SPDM and TPDM operate within the latent motion frequency domain established by the pre-trained Action-Centric Motion Phase Autoencoder (ACT-PAE). This allows them to learn semantically important and transition-aware phase information from variable-length motion clips during training. Experimental results demonstrate the competitive performance of our proposed framework in generating compositional motion sequences that align semantically with the input conditions, while preserving phase transitional continuity between preceding and succeeding motion clips. Additionally, motion inbetweening task is made possible by keeping the phase parameter of the input motion sequences fixed throughout the diffusion process, showcasing the potential for extending the proposed framework to accommodate various application scenarios. Codes are available at https://github.com/asdryau/TransPhase.

[611] Block-Partitioning Strategies for Accelerated Multi-rate Encoding in Adaptive VVC Streaming

Vignesh V Menon, Adam Wieckowski, Yiquin Liu, Benjamin Bross, Detlev Marpe

Main category: cs.MM

TL;DR: This paper proposes CU partitioning strategies for efficient multi-rate VVC encoding, achieving up to 11.69% time reduction with minimal quality loss (<0.6% bitrate overhead).

Details

Motivation: The increasing demand for UHD video content and adaptive streaming requires efficient multi-rate encoding, but VVC's computational complexity makes this resource-intensive.

Method: Proposed single- and double-bound approaches using CU depth constraints from reference encodes to guide dependent encodes across multiple QPs, evaluated with VVenC.

Result: Methods achieve up to 11.69% encoding time reduction with <0.6% bitrate overhead, with Pareto-front analysis showing superior performance over existing configurations.

Conclusion: CU-guided strategies are validated as effective for scalable multi-rate encoding in adaptive streaming applications.

Abstract: The demand for efficient multi-rate encoding techniques has surged with the increasing prevalence of ultra-high-definition (UHD) video content, particularly in adaptive streaming scenarios where a single video must be encoded at multiple bitrates to accommodate diverse network conditions. While Versatile Video Coding (VVC) significantly improves compression efficiency, it introduces considerable computational complexity, making multi-rate encoding a resource-intensive task. This paper examines coding unit (CU) partitioning strategies to minimize redundant computations in VVC while preserving high video quality. We propose single- and double-bound approaches, leveraging CU depth constraints from reference encodes to guide dependent encodes across multiple QPs. These methods are evaluated using VVenC with various presets, demonstrating consistent improvements in encoding efficiency. Our methods achieve up to 11.69 % reduction in encoding time with minimal bitrate overhead (<0.6 %). Comparative Pareto-front (PF) analysis highlights the superior performance of multi-rate approaches over existing configurations. These findings validate the potential of CU-guided strategies for scalable multi-rate encoding in adaptive streaming.

Nick Oh, Giorgos D. Vrakas, Siân J. M. Brooke, Sasha Morinière, Toju Duke

Main category: cs.MM

TL;DR: PETLP is a compliance framework that integrates GDPR, copyright, and platform terms into ETL pipelines for social media data research, using evolving Data Protection Impact Assessments and distinguishing extraction rights between research organizations and commercial entities.

Details

Motivation: Existing frameworks fail to integrate overlapping obligations under GDPR, copyright law, and platform terms for social media data research, leaving researchers without unified guidance.

Method: Introduces PETLP (Privacy-by-design Extract, Transform, Load, and Present) framework that embeds legal safeguards into extended ETL pipelines, treats Data Protection Impact Assessments as living documents, and demonstrates through systematic Reddit analysis.

Result: Shows extraction rights differ between qualifying research organizations (can invoke DSM Article 3) and commercial entities (bound by ToS), reveals true anonymisation is unachievable for social media data, and exposes legal gap between dataset creation and model distribution.

Conclusion: PETLP enables researchers to navigate regulatory complexity by structuring compliance decisions into practical workflows and simplifying institutional data management plans, bridging legal requirements with research practice.

Abstract: Social media data presents AI researchers with overlapping obligations under the GDPR, copyright law, and platform terms – yet existing frameworks fail to integrate these regulatory domains, leaving researchers without unified guidance. We introduce PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds legal safeguards directly into extended ETL pipelines. Central to PETLP is treating Data Protection Impact Assessments as living documents that evolve from pre-registration through dissemination. Through systematic Reddit analysis, we demonstrate how extraction rights fundamentally differ between qualifying research organisations (who can invoke DSM Article 3 to override platform restrictions) and commercial entities (bound by terms of service), whilst GDPR obligations apply universally. We demonstrate why true anonymisation remains unachievable for social media data and expose the legal gap between permitted dataset creation and uncertain model distribution. By structuring compliance decisions into practical workflows and simplifying institutional data management plans, PETLP enables researchers to navigate regulatory complexity with confidence, bridging the gap between legal requirements and research practice.

eess.AS

[613] Switchboard-Affect: Emotion Perception Labels from Conversational Speech

Amrit Romana, Jaya Narain, Tien Dung Tran, Andrea Davis, Jason Fong, Ramya Rasipuram, Vikramjit Mitra

Main category: eess.AS

TL;DR: The paper introduces SWB-Affect, a naturalistic conversational speech dataset with detailed emotion labeling, highlighting limitations of current SER models especially for anger detection.

Details

Motivation: Current speech emotion datasets use acted/pseudo-acted speech with exaggerated emotions and lack transparency in annotation guidelines, making it difficult to assess real-world SER model performance.

Method: Used Switchboard corpus as natural conversational speech source, trained crowd annotators to label categorical emotions (10 categories) and dimensional attributes (activation, valence, dominance), with detailed annotation guidelines and analysis of lexical/paralinguistic cues.

Result: Evaluated state-of-the-art SER models found variable performance across emotion categories with particularly poor generalization for anger detection.

Conclusion: Naturalistic datasets like SWB-Affect are crucial for evaluating SER model performance in real-world applications, and the dataset is released for further research.

Abstract: Understanding the nuances of speech emotion dataset curation and labeling is essential for assessing speech emotion recognition (SER) model potential in real-world applications. Most training and evaluation datasets contain acted or pseudo-acted speech (e.g., podcast speech) in which emotion expressions may be exaggerated or otherwise intentionally modified. Furthermore, datasets labeled based on crowd perception often lack transparency regarding the guidelines given to annotators. These factors make it difficult to understand model performance and pinpoint necessary areas for improvement. To address this gap, we identified the Switchboard corpus as a promising source of naturalistic conversational speech, and we trained a crowd to label the dataset for categorical emotions (anger, contempt, disgust, fear, sadness, surprise, happiness, tenderness, calmness, and neutral) and dimensional attributes (activation, valence, and dominance). We refer to this label set as Switchboard-Affect (SWB-Affect). In this work, we present our approach in detail, including the definitions provided to annotators and an analysis of the lexical and paralinguistic cues that may have played a role in their perception. In addition, we evaluate state-of-the-art SER models, and we find variable performance across the emotion categories with especially poor generalization for anger. These findings underscore the importance of evaluation with datasets that capture natural affective variations in speech. We release the labels for SWB-Affect to enable further analysis in this domain.

[614] Spatially Aware Self-Supervised Models for Multi-Channel Neural Speaker Diarization

Jiangyu Han, Ruoyu Wang, Yoshiki Masuyama, Marc Delcroix, Johan Rohdin, Jun Du, Lukas Burget

Main category: eess.AS

TL;DR: The paper introduces a lightweight method to make pre-trained WavLM spatially aware for multi-channel speaker diarization by inserting channel communication modules and fusing speaker embeddings with spatial attention, achieving better performance and efficiency than DOVER-Lap.

Details

Motivation: Self-supervised models like WavLM are pre-trained on single-channel recordings, limiting their effectiveness in multi-channel scenarios. Existing systems using DOVER-Lap have computational overhead and don't fully exploit spatial information.

Method: Building on DiariZen pipeline, insert channel communication modules into early WavLM layers to make it spatially aware, and fuse multi-channel speaker embeddings using spatial attention weights. The approach is agnostic to microphone channels and array topologies.

Result: Evaluations on five public datasets show consistent improvements over single-channel baselines and demonstrate superior performance and efficiency compared to DOVER-Lap.

Conclusion: The proposed lightweight approach effectively makes pre-trained WavLM spatially aware for multi-channel speaker diarization, achieving better results with improved computational efficiency.

Abstract: Self-supervised models such as WavLM have demonstrated strong performance for neural speaker diarization. However, these models are typically pre-trained on single-channel recordings, limiting their effectiveness in multi-channel scenarios. Existing diarization systems built on these models often rely on DOVER-Lap to combine outputs from individual channels. Although effective, this approach incurs substantial computational overhead and fails to fully exploit spatial information. In this work, building on DiariZen, a pipeline that combines WavLM-based local endto-end neural diarization with speaker embedding clustering, we introduce a lightweight approach to make pre-trained WavLM spatially aware by inserting channel communication modules into the early layers. Our method is agnostic to both the number of microphone channels and array topologies, ensuring broad applicability. We further propose to fuse multi-channel speaker embeddings by leveraging spatial attention weights. Evaluations on five public datasets show consistent improvements over single-channel baselines and demonstrate superior performance and efficiency compared with DOVER-Lap. Our source code is publicly available at https://github.com/BUTSpeechFIT/DiariZen.

[615] Non-invasive electromyographic speech neuroprosthesis: a geometric perspective

Harshavardhana T. Gowda, Lee M. Miller

Main category: eess.AS

TL;DR: A silent speech interface using EMG signals from face/neck muscles to directly convert silently articulated speech into text, without requiring audible speech or audio alignment.

Details

Motivation: To restore communication for people who lost speech ability due to laryngectomy, neuromuscular diseases, stroke, or trauma, overcoming limitations of prior methods that required audible speech.

Method: Records surface EMG from multiple articulatory sites during silent speech articulation and uses efficient representation of high-dimensional EMG signals for direct sequence-to-sequence EMG-to-text conversion at phonemic level.

Result: Demonstrated direct EMG-to-text translation without relying on time-aligned audio, enabling silent speech interface for individuals who cannot produce audible speech.

Conclusion: The proposed interface provides a practical solution for speech restoration by directly translating silent articulations to text using EMG signals, with all data, code, and models open-sourced for community use.

Abstract: We present a high-bandwidth, egocentric neuromuscular speech interface that translates $silently$ voiced articulations directly into text. We record surface electromyographic (EMG) signals from multiple articulatory sites on the face and neck as participants $silently$ articulate speech, enabling direct EMG-to-text translation. Such an interface has the potential to restore communication for individuals who have lost the ability to produce intelligible speech due to laryngectomy, neuromuscular disease, stroke, or trauma-induced damage (e.g., radiotherapy toxicity) to the speech articulators. Prior work has largely focused on mapping EMG collected during $audible$ articulation to time-aligned audio targets or transferring these targets to $silent$ EMG recordings, which inherently requires audio and limits applicability to patients who can no longer speak. In contrast, we propose an efficient representation of high-dimensional EMG signals and demonstrate direct sequence-to-sequence EMG-to-text conversion at the phonemic level without relying on time-aligned audio. All data, code, and model checkpoints are open-sourced at The dataset and code are available at: https://github.com/HarshavardhanaTG/emg2speech .

[616] SPIRIT: Patching Speech Language Models against Jailbreak Attacks

Amirbek Djanibekov, Nurdaulet Mukhituly, Kentaro Inui, Hanan Aldarmaki, Nils Lukas

Main category: eess.AS

TL;DR: Speech Language Models (SLMs) are highly vulnerable to adversarial jailbreak attacks via imperceptible speech noise, achieving 100% success rates. The paper proposes post-hoc patching defenses that modify activations during inference to improve robustness up to 99% without retraining.

Details

Motivation: SLMs enable natural spoken interactions but introduce new security risks compared to text models, as adversaries can better bypass safety mechanisms using imperceptible noise in speech signals.

Method: Proposed post-hoc patching defenses that intervene during inference by modifying the SLM’s activations, requiring no retraining. Conducted ablation studies to optimize defense efficacy and utility/security trade-off.

Result: SLMs are substantially more vulnerable to jailbreak attacks (100% success rate in some cases). The proposed defenses improve robustness up to 99% with negligible impact on utility, validated through large-scale SLM-specific benchmarks.

Conclusion: Post-hoc patching defenses effectively secure SLMs against adversarial attacks without requiring retraining, achieving high robustness while maintaining utility, addressing the unique security challenges of speech-based language models.

Abstract: Speech Language Models (SLMs) enable natural interactions via spoken instructions, which more effectively capture user intent by detecting nuances in speech. The richer speech signal introduces new security risks compared to text-based models, as adversaries can better bypass safety mechanisms by injecting imperceptible noise to speech. We analyze adversarial attacks and find that SLMs are substantially more vulnerable to jailbreak attacks, which can achieve a perfect 100% attack success rate in some instances. To improve security, we propose post-hoc patching defenses used to intervene during inference by modifying the SLM’s activations that improve robustness up to 99% with (i) negligible impact on utility and (ii) without any re-training. We conduct ablation studies to maximize the efficacy of our defenses and improve the utility/security trade-off, validated with large-scale benchmarks unique to SLMs.

[617] Pinhole Effect on Linkability and Dispersion in Speaker Anonymization

Kong Aik Lee, Zeyan Liu, Liping Chen, Zhenhua Ling

Main category: eess.AS

TL;DR: This paper investigates how different speaker anonymization mapping strategies affect privacy preservation, comparing common pseudo-speaker vs distinct pseudo-speaker approaches.

Details

Motivation: To understand the impact of different speaker anonymization mapping strategies on privacy preservation, specifically examining how common vs distinct pseudo-speaker mappings affect speaker linkability, dispersion, and de-identification.

Method: The study compares two mapping strategies: mapping anonymized speech to a common pseudo speaker shared across utterances vs distinct pseudo speakers unique to each utterance. It evaluates these approaches on three dimensions: speaker linkability, dispersion in anonymized speaker space, and de-identification from original identity.

Result: Using distinct pseudo speakers increases speaker dispersion and reduces linkability compared to common pseudo-speaker mapping, thereby enhancing privacy preservation. The findings support the proposed pinhole effect framework explaining the relationship between mapping strategies and anonymization performance.

Conclusion: Distinct pseudo-speaker mapping provides better privacy protection than common pseudo-speaker mapping by increasing dispersion and reducing linkability, as explained by the pinhole effect conceptual framework.

Abstract: Speaker anonymization aims to conceal speaker-specific attributes in speech signals, making the anonymized speech unlinkable to the original speaker identity. Recent approaches achieve this by disentangling speech into content and speaker components, replacing the latter with pseudo speakers. The anonymized speech can be mapped either to a common pseudo speaker shared across utterances or to distinct pseudo speakers unique to each utterance. This paper investigates the impact of these mapping strategies on three key dimensions: speaker linkability, dispersion in the anonymized speaker space, and de-identification from the original identity. Our findings show that using distinct pseudo speakers increases speaker dispersion and reduces linkability compared to common pseudo-speaker mapping, thereby enhancing privacy preservation. These observations are interpreted through the proposed pinhole effect, a conceptual framework introduced to explain the relationship between mapping strategies and anonymization performance. The hypothesis is validated through empirical evaluation.

[618] DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation

Yakun Song, Xiaobin Zhuang, Jiawei Chen, Zhikang Niu, Guanrou Yang, Chenpeng Du, Dongya Jia, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

Main category: eess.AS

TL;DR: DISTAR is a zero-shot text-to-speech framework that combines autoregressive language modeling with masked diffusion in discrete RVQ code space, enabling robust, controllable, and high-quality speech synthesis without forced alignment or duration prediction.

Details

Motivation: Previous approaches combining autoregressive sketchers with diffusion refiners over continuous speech representations are brittle under distribution shift and offer limited controllability.

Method: DISTAR operates in discrete residual vector quantization (RVQ) code space, coupling an AR language model with a masked diffusion model. It drafts block-level RVQ tokens with AR, then performs parallel masked-diffusion infilling to complete the next block, enabling long-form synthesis with blockwise parallelism.

Result: DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency while maintaining rich output diversity. It supports explicit control via classifier-free guidance, variable bit-rate, and controllable computation through RVQ layer pruning.

Conclusion: DISTAR demonstrates that tight coupling of AR and diffusion models in discrete code space enables robust, controllable, and high-quality zero-shot text-to-speech synthesis with practical advantages for real-world applications.

Abstract: Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot text-to-speech framework that operates entirely in a discrete residual vector quantization (RVQ) code space and tightly couples an AR language model with a masked diffusion model, without forced alignment or a duration predictor. Concretely, DISTAR drafts block-level RVQ tokens with an AR language model and then performs parallel masked-diffusion infilling conditioned on the draft to complete the next block, yielding long-form synthesis with blockwise parallelism while mitigating classic AR exposure bias. The discrete code space affords explicit control at inference: DISTAR produces high-quality audio under both greedy and sample-based decoding using classifier-free guidance, supports trade-offs between robustness and diversity, and enables variable bit-rate and controllable computation via RVQ layer pruning at test time. Extensive experiments and ablations demonstrate that DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency, while maintaining rich output diversity. Audio samples are provided on https://anonymous.4open.science/w/DiSTAR_demo.

eess.IV

[619] An Overview of the JPEG AI Learning-Based Image Coding Standard

Semih Esenlik, Yaojun Wu, Zhaobin Zhang, Ye-Kui Wang, Kai Zhang, Li Zhang, João Ascenso, Shan Liu

Main category: eess.IV

TL;DR: JPEG AI is a new learning-based image coding standard by JPEG that offers superior compression efficiency for both human visualization and machine tasks, with significant BD-rate improvements over existing standards.

Details

Motivation: To create a practical learning-based image coding standard that provides compact compressed domain representation for both human visualization and machine consumption, addressing the limitations of existing standards.

Method: Develops a single-stream, compact compressed domain representation using learning-based approaches, incorporating various design features for broad interoperability across devices and applications.

Result: Demonstrates significant BD-rate reductions compared to existing standards across multiple quality metrics (MS-SSIM, FSIM, VIF, VMAF, PSNR-HVS, IW-SSIM, NLPD), with the first version focusing on human vision tasks.

Conclusion: JPEG AI represents an emerging standard that successfully combines learning-based approaches with practical deployment considerations, offering improved compression efficiency while maintaining broad interoperability for diverse applications.

Abstract: JPEG AI is an emerging learning-based image coding standard developed by Joint Photographic Experts Group (JPEG). The scope of the JPEG AI is the creation of a practical learning-based image coding standard offering a single-stream, compact compressed domain representation, targeting both human visualization and machine consumption. Scheduled for completion in early 2025, the first version of JPEG AI focuses on human vision tasks, demonstrating significant BD-rate reductions compared to existing standards, in terms of MS-SSIM, FSIM, VIF, VMAF, PSNR-HVS, IW-SSIM and NLPD quality metrics. Designed to ensure broad interoperability, JPEG AI incorporates various design features to support deployment across diverse devices and applications. This paper provides an overview of the technical features and characteristics of the JPEG AI standard.

[620] Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion

Xiaojian Ding, Lin Zhao, Xian Li, Xiaoying Zhu

Main category: eess.IV

TL;DR: HSACC is a novel incomplete multi-view clustering framework that uses hierarchical semantic alignment and cooperative completion to address missing views and achieve robust cross-view fusion.

Details

Motivation: Existing deep incomplete multi-view clustering methods suffer from static fusion strategies and two-stage pipelines, leading to suboptimal fusion and error propagation issues.

Method: HSACC employs dual-level semantic spaces: low-level space ensures consistency via mutual information maximization, high-level space uses adaptive view weights based on distributional affinity for weighted fusion. It implicitly recovers missing views and jointly optimizes reconstruction and clustering.

Result: HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies confirm the effectiveness of hierarchical alignment and dynamic weighting, with parameter analysis showing robustness to hyperparameter variations.

Conclusion: The proposed HSACC framework effectively addresses incomplete multi-view clustering challenges through hierarchical semantic alignment and cooperative completion, achieving superior performance and robustness.

Abstract: Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model’s robustness to hyperparameter variations.

[621] Millimeter Wave Inverse Pinhole Imaging

Akarsh Prabhakara, Yawen Liu, Aswin C. Sankaranarayanan, Anthony Rowe, Swarun Kumar

Main category: eess.IV

TL;DR: Umbra is a mmWave high resolution imaging system that uses rotating mmWave “inverse pinholes” to enhance angular resolution for static compact radars, achieving 2.5° resolution with a single antenna.

Details

Motivation: Static mount compact mmWave radars have limited angular resolution due to their small form factor, which hinders their use in applications like hovering drones that require high-resolution imaging.

Method: The system introduces rotating mmWave “inverse pinholes” - lightweight structures that enable low-power rotation to upgrade static-mount radars. It also leverages propellers in aerial vehicles as natural inverse pinholes.

Result: Umbra achieves 2.5° angular resolution with just a single antenna, representing a 5× improvement over the 14° resolution from compact mmWave radar baselines.

Conclusion: The inverse pinhole concept enables high-resolution mmWave imaging for static compact radars, with applications in aerial vehicles where propellers can naturally serve as inverse pinholes during hovering.

Abstract: Millimeter wave (mmWave) radars are popular for perception in vision-denied contexts due to their compact size. This paper explores emerging use-cases that involve static mount or momentarily-static compact radars, for example, a hovering drone. The key challenge with static compact radars is that their limited form-factor also limits their angular resolution. This paper presents Umbra, a mmWave high resolution imaging system, that introduces the concept of rotating mmWave “inverse pinholes” for angular resolution enhancement. We present the imaging system model, design, and evaluation of mmWave inverse pinholes. The inverse pinhole is attractive for its lightweight nature, which enables low-power rotation, upgrading static-mount radars. We also show how propellers in aerial vehicles act as natural inverse pinholes and can enjoy the benefits of high-resolution imaging even while they are momentarily static, e.g., hovering. Our evaluation shows Umbra resolving up to 2.5$^{\circ}$ with just a single antenna, a 5$\times$ improvement compared to 14$^{\circ}$ from a compact mmWave radar baseline.

[622] Image-based Facial Rig Inversion

Tianxiang Yang, Marco Volino, Armin Mustafa, Greg Maguire, Robert Kosk

Main category: eess.IV

TL;DR: An image-based rig inversion framework using RGB appearance and normal maps with Hiera transformers to regress 102 FACS-based rig parameters.

Details

Motivation: To create a robust facial rig inversion system that can generalize from synthetic to scanned data for faithful facial reconstructions.

Method: Uses two independent Hiera transformer backbones to process RGB appearance and RGB-encoded normal maps, then fuses features to regress 102 FACS-based rig parameters.

Result: The method successfully generalizes to scanned data and produces faithful facial reconstructions.

Conclusion: The proposed multi-modal approach with Hiera transformers effectively enables rig inversion that transfers well from synthetic to real scanned facial data.

Abstract: We present an image-based rig inversion framework that leverages two modalities: RGB appearance and RGB-encoded normal maps. Each modality is processed by an independent Hiera transformer backbone, and the extracted features are fused to regress 102 rig parameters derived from the Facial Action Coding System (FACS). Experiments on synthetic and scanned datasets demonstrate that the method generalizes to scanned data, producing faithful reconstructions.

[623] Reinforcement Learning for Unsupervised Domain Adaptation in Spatio-Temporal Echocardiography Segmentation

Arnaud Judge, Nicolas Duchateau, Thierry Judge, Roman A. Sandler, Joseph Z. Sokol, Christian Desrosiers, Olivier Bernard, Pierre-Marc Jodoin

Main category: eess.IV

TL;DR: RL4Seg3D is an unsupervised domain adaptation framework for 2D+time echocardiography segmentation that uses reinforcement learning to improve accuracy, anatomical validity, and temporal consistency while providing uncertainty estimation.

Details

Motivation: Domain adaptation methods often struggle with reliability in target domains, especially in medical image segmentation where accuracy and anatomical validity are critical. This is exacerbated in spatio-temporal data like echocardiography where temporal consistency is essential and artifacts/noise degrade performance.

Method: RL4Seg3D integrates novel reward functions and a fusion scheme to enhance landmark precision while processing full-sized echocardiography videos. It leverages reinforcement learning for segmentation to improve multiple aspects simultaneously.

Result: The framework was tested on over 30,000 echocardiographic videos and outperformed standard domain adaptation techniques without requiring any labels on the target domain.

Conclusion: RL4Seg3D provides an effective unsupervised domain adaptation solution for echocardiography segmentation that enhances accuracy, anatomical validity, temporal consistency, and includes beneficial uncertainty estimation for improved performance.

Abstract: Domain adaptation methods aim to bridge the gap between datasets by enabling knowledge transfer across domains, reducing the need for additional expert annotations. However, many approaches struggle with reliability in the target domain, an issue particularly critical in medical image segmentation, where accuracy and anatomical validity are essential. This challenge is further exacerbated in spatio-temporal data, where the lack of temporal consistency can significantly degrade segmentation quality, and particularly in echocardiography, where the presence of artifacts and noise can further hinder segmentation performance. To address these issues, we present RL4Seg3D, an unsupervised domain adaptation framework for 2D + time echocardiography segmentation. RL4Seg3D integrates novel reward functions and a fusion scheme to enhance key landmark precision in its segmentations while processing full-sized input videos. By leveraging reinforcement learning for image segmentation, our approach improves accuracy, anatomical validity, and temporal consistency while also providing, as a beneficial side effect, a robust uncertainty estimator, which can be used at test time to further enhance segmentation performance. We demonstrate the effectiveness of our framework on over 30,000 echocardiographic videos, showing that it outperforms standard domain adaptation techniques without the need for any labels on the target domain. Code is available at https://github.com/arnaudjudge/RL4Seg3D.

[624] A Density-Informed Multimodal Artificial Intelligence Framework for Improving Breast Cancer Detection Across All Breast Densities

Siva Teja Kakileti, Bharath Govindaraju, Sudhakar Sampangi, Geetha Manjunath

Main category: eess.IV

TL;DR: A density-informed multi-modal AI framework combining mammography and thermal imaging improves breast cancer detection by dynamically selecting the optimal modality based on breast tissue density, overcoming limitations of single-modality screening.

Details

Motivation: Mammography has reduced sensitivity in women with dense breast tissue, leading to missed or delayed diagnoses. Thermal imaging captures functional vascular and metabolic cues that may complement mammographic structural data.

Method: 324 women underwent both mammography and thermal imaging. A multi-modal AI framework dynamically selects mammography AI for fatty breasts and Thermalytix AI for dense breasts, using deep learning for mammography and thermal radiomics for thermal imaging.

Result: The multi-modal AI achieved 94.55% sensitivity and 79.93% specificity, outperforming standalone mammography AI (81.82% sensitivity, 86.25% specificity) and Thermalytix AI (92.73% sensitivity, 75.46% specificity). Mammography sensitivity dropped significantly in dense breasts (67.86%) vs fatty breasts (96.30%), while Thermalytix maintained consistent sensitivity (92.59% dense, 92.86% fatty).

Conclusion: The density-informed multi-modal AI framework overcomes key limitations of unimodal screening, delivering high performance across diverse breast compositions. It is interpretable, low-cost, and easily deployable for improving breast cancer screening outcomes in various settings.

Abstract: Mammography, the current standard for breast cancer screening, has reduced sensitivity in women with dense breast tissue, contributing to missed or delayed diagnoses. Thermalytix, an AI-based thermal imaging modality, captures functional vascular and metabolic cues that may complement mammographic structural data. This study investigates whether a breast density-informed multi-modal AI framework can improve cancer detection by dynamically selecting the appropriate imaging modality based on breast tissue composition. A total of 324 women underwent both mammography and thermal imaging. Mammography images were analyzed using a multi-view deep learning model, while Thermalytix assessed thermal images through vascular and thermal radiomics. The proposed framework utilized Mammography AI for fatty breasts and Thermalytix AI for dense breasts, optimizing predictions based on tissue type. This multi-modal AI framework achieved a sensitivity of 94.55% (95% CI: 88.54-100) and specificity of 79.93% (95% CI: 75.14-84.71), outperforming standalone mammography AI (sensitivity 81.82%, specificity 86.25%) and Thermalytix AI (sensitivity 92.73%, specificity 75.46%). Importantly, the sensitivity of Mammography dropped significantly in dense breasts (67.86%) versus fatty breasts (96.30%), whereas Thermalytix AI maintained high and consistent sensitivity in both (92.59% and 92.86%, respectively). This demonstrates that a density-informed multi-modal AI framework can overcome key limitations of unimodal screening and deliver high performance across diverse breast compositions. The proposed framework is interpretable, low-cost, and easily deployable, offering a practical path to improving breast cancer screening outcomes in both high-resource and resource-limited settings.

[625] EdgeNavMamba: Mamba Optimized Object Detection for Energy Efficient Edge Devices

Romina Aalishah, Mozhgan Navardi, Tinoosh Mohsenin

Main category: eess.IV

TL;DR: EdgeNavMamba is an efficient RL-based navigation framework using a compressed Mamba object detector that reduces model size by 67% and energy consumption by 73% on edge devices while maintaining performance comparable to baseline models.

Details

Motivation: Address the challenge of deploying accurate deep learning models for autonomous navigation on resource-constrained edge devices with limited computing power and memory.

Method: Proposes EdgeNavMamba framework with Mamba object detection model as preprocessing module to extract bounding boxes from visual input, which are then used by RL policy for goal-directed navigation. Uses custom shape detection dataset collected in diverse indoor settings.

Result: Student model achieved 67% size reduction and up to 73% energy reduction per inference on NVIDIA Jetson Orin Nano and Raspberry Pi 5 while maintaining same performance as teacher model. Reduced parameters by 31% compared to baseline while maintaining high detection accuracy in simulators. Navigation policy achieved over 90% success rate across varying complexity environments.

Conclusion: EdgeNavMamba successfully demonstrates efficient and accurate autonomous navigation on edge devices through model compression and RL-based policy, enabling real-time applications with significant resource savings.

Abstract: Deployment of efficient and accurate Deep Learning models has long been a challenge in autonomous navigation, particularly for real-time applications on resource-constrained edge devices. Edge devices are limited in computing power and memory, making model efficiency and compression essential. In this work, we propose EdgeNavMamba, a reinforcement learning-based framework for goal-directed navigation using an efficient Mamba object detection model. To train and evaluate the detector, we introduce a custom shape detection dataset collected in diverse indoor settings, reflecting visual cues common in real-world navigation. The object detector serves as a pre-processing module, extracting bounding boxes (BBOX) from visual input, which are then passed to an RL policy to control goal-oriented navigation. Experimental results show that the student model achieved a reduction of 67% in size, and up to 73% in energy per inference on edge devices of NVIDIA Jetson Orin Nano and Raspberry Pi 5, while keeping the same performance as the teacher model. EdgeNavMamba also maintains high detection accuracy in MiniWorld and IsaacLab simulators while reducing parameters by 31% compared to the baseline. In the MiniWorld simulator, the navigation policy achieves over 90% success across environments of varying complexity.

[626] TABSurfer: a Hybrid Deep Learning Architecture for Subcortical Segmentation

Aaron Cao, Vishwanatha M. Rao, Kejia Liu, Xinrui Liu, Andrew F. Laine, Jia Guo

Main category: eess.IV

TL;DR: TABSurfer is a 3D patch-based CNN-Transformer hybrid model that provides fast and accurate subcortical segmentation, outperforming both FreeSurfer and FastSurferVINN.

Details

Motivation: Manual subcortical segmentation is labor-intensive, while automated tools like FreeSurfer are slow and inefficient for large datasets, creating a need for faster and more accurate alternatives.

Method: Developed TABSurfer, a 3D patch-based CNN-Transformer hybrid deep learning model specifically designed for subcortical segmentation.

Result: TABSurfer showed consistent performance across various T1w MRI datasets with significantly shorter processing times than FreeSurfer, and outperformed both FreeSurfer and FastSurferVINN when validated against manual segmentations.

Conclusion: TABSurfer is a powerful tool for fully automated subcortical segmentation with high fidelity and efficiency.

Abstract: Subcortical segmentation remains challenging despite its important applications in quantitative structural analysis of brain MRI scans. The most accurate method, manual segmentation, is highly labor intensive, so automated tools like FreeSurfer have been adopted to handle this task. However, these traditional pipelines are slow and inefficient for processing large datasets. In this study, we propose TABSurfer, a novel 3D patch-based CNN-Transformer hybrid deep learning model designed for superior subcortical segmentation compared to existing state-of-the-art tools. To evaluate, we first demonstrate TABSurfer’s consistent performance across various T1w MRI datasets with significantly shorter processing times compared to FreeSurfer. Then, we validate against manual segmentations, where TABSurfer outperforms FreeSurfer based on the manual ground truth. In each test, we also establish TABSurfer’s advantage over a leading deep learning benchmark, FastSurferVINN. Together, these studies highlight TABSurfer’s utility as a powerful tool for fully automated subcortical segmentation with high fidelity.

[627] Pruning Sparse Tensor Neural Networks Enables Deep Learning for 3D Ultrasound Localization Microscopy

Brice Rauby, Paul Xing, Jonathan Porée, Maxime Gasse, Jean Provost

Main category: eess.IV

TL;DR: Sparse tensor neural networks enable 3D ultrasound localization microscopy by reducing memory requirements, allowing higher microbubble concentrations and shorter acquisition times.

Details

Motivation: Current deep learning approaches for ULM are limited to 2D due to high memory requirements, preventing extension to 3D imaging which would provide more comprehensive micro-vessel mapping.

Method: Proposed using sparse tensor neural networks to convert ultrasound data into sparse format, studying different conversion approaches and their impact on information loss.

Result: In 2D, sparse networks reduced memory by 2x with minor performance loss. In 3D, memory reduced by two orders of magnitude while outperforming conventional ULM in high concentration settings.

Conclusion: Sparse tensor neural networks enable 3D ULM with the same benefits as 2D dense methods - higher microbubble concentrations and reduced acquisition times.

Abstract: Ultrasound Localization Microscopy (ULM) is a non-invasive technique that allows for the imaging of micro-vessels in vivo, at depth and with a resolution on the order of ten microns. ULM is based on the sub-resolution localization of individual microbubbles injected in the bloodstream. Mapping the whole angioarchitecture requires the accumulation of microbubbles trajectories from thousands of frames, typically acquired over a few minutes. ULM acquisition times can be reduced by increasing the microbubble concentration, but requires more advanced algorithms to detect them individually. Several deep learning approaches have been proposed for this task, but they remain limited to 2D imaging, in part due to the associated large memory requirements. Herein, we propose to use sparse tensor neural networks to reduce memory usage in 2D and to improve the scaling of the memory requirement for the extension of deep learning architecture to 3D. We study several approaches to efficiently convert ultrasound data into a sparse format and study the impact of the associated loss of information. When applied in 2D, the sparse formulation reduces the memory requirements by a factor 2 at the cost of a small reduction of performance when compared against dense networks. In 3D, the proposed approach reduces memory requirements by two order of magnitude while largely outperforming conventional ULM in high concentration settings. We show that Sparse Tensor Neural Networks in 3D ULM allow for the same benefits as dense deep learning based method in 2D ULM i.e. the use of higher concentration in silico and reduced acquisition time.

[628] VoxelPrompt: A Vision Agent for End-to-End Medical Image Analysis

Andrew Hoopes, Neel Dey, Victor Ion Butoi, John V. Guttag, Adrian V. Dalca

Main category: eess.IV

TL;DR: VoxelPrompt is an end-to-end image analysis agent that uses natural language prompts to generate executable code for analyzing volumetric medical images, automating complex radiological workflows.

Details

Motivation: Current radiological analysis requires practitioners to manually combine multiple specialized tools, which is time-consuming and painstaking. VoxelPrompt aims to automate these complex workflows through natural language interaction.

Method: VoxelPrompt integrates a language model that generates executable code to invoke a jointly-trained, adaptable vision network. The system takes volumetric medical images and natural language prompts, then creates analytical pipelines to address quantitative medical objectives.

Result: VoxelPrompt can delineate hundreds of anatomical and pathological features, measure complex morphological properties, and perform open-language analysis of lesion characteristics. It achieves accuracy similar to specialist single-task models while enabling broad compositional biomedical workflows.

Conclusion: VoxelPrompt successfully automates complex radiological analyses that traditionally require manual tool combination, providing similar accuracy to specialized models while offering greater flexibility through natural language interaction.

Abstract: We present VoxelPrompt, an end-to-end image analysis agent that tackles free-form radiological tasks. Given any number of volumetric medical images and a natural language prompt, VoxelPrompt integrates a language model that generates executable code to invoke a jointly-trained, adaptable vision network. This code further carries out analytical steps to address practical quantitative aims, such as measuring the growth of a tumor across visits. The pipelines generated by VoxelPrompt automate analyses that currently require practitioners to painstakingly combine multiple specialized vision and statistical tools. We evaluate VoxelPrompt using diverse neuroimaging tasks and show that it can delineate hundreds of anatomical and pathological features, measure complex morphological properties, and perform open-language analysis of lesion characteristics. VoxelPrompt performs these objectives with an accuracy similar to that of specialist single-task models for image analysis, while facilitating a broad range of compositional biomedical workflows.

[629] A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation

Renjie Liang, Zhengkang Fan, Jinqian Pan, Chenkun Sun, Bruce Daniel Steinberg, Russell Terry, Jie Xu

Main category: eess.IV

TL;DR: A two-stage AI framework for automatic renal CT report generation that combines structured feature detection with conditioned report generation to improve clinical accuracy and reduce radiologists’ workload.

Details

Motivation: The growing CT workload increases radiologists' burden and risks incomplete documentation. Automating report generation is challenging as it requires integrating visual interpretation with clinical reasoning.

Method: A two-stage framework: Stage 1 uses multi-task learning to detect structured clinical features from 2D CT images. Stage 2 uses a vision-language model to generate free-text reports conditioned on both the image and detected features.

Result: The model achieved an average AUC of 0.75 for key imaging features and a METEOR score of 0.33, showing improved report quality, clinical consistency, and fewer template-driven errors compared to baseline approaches.

Conclusion: Linking structured feature detection with conditioned report generation provides a clinically grounded approach that enhances interpretability and clinical faithfulness, highlighting the importance of domain-relevant evaluation metrics for medical AI.

Abstract: Objective Renal cancer is a common malignancy and a major cause of cancer-related deaths. Computed tomography (CT) is central to early detection, staging, and treatment planning. However, the growing CT workload increases radiologists’ burden and risks incomplete documentation. Automatically generating accurate reports remains challenging because it requires integrating visual interpretation with clinical reasoning. Advances in artificial intelligence (AI), especially large language and vision-language models, offer potential to reduce workload and enhance diagnostic quality. Methods We propose a clinically informed, two-stage framework for automatic renal CT report generation. In Stage 1, a multi-task learning model detects structured clinical features from each 2D image. In Stage 2, a vision-language model generates free-text reports conditioned on the image and the detected features. To evaluate clinical fidelity, generated clinical features are extracted from the reports and compared with expert-annotated ground truth. Results Experiments on an expert-labeled dataset show that incorporating detected features improves both report quality and clinical accuracy. The model achieved an average AUC of 0.75 for key imaging features and a METEOR score of 0.33, demonstrating higher clinical consistency and fewer template-driven errors. Conclusion Linking structured feature detection with conditioned report generation provides a clinically grounded approach to integrate structured prediction and narrative drafting for renal CT reporting. This method enhances interpretability and clinical faithfulness, underscoring the value of domain-relevant evaluation metrics for medical AI development.

[630] HANS-Net: Hyperbolic Convolution and Adaptive Temporal Attention for Accurate and Generalizable Liver and Tumor Segmentation in CT Imaging

Arefin Ittesafun Abian, Ripon Kumar Debnath, Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Md Rafiqul Islam, Asif Karim, Reem E. Mohamed, Sami Azam

Main category: eess.IV

TL;DR: HANS-Net is a novel segmentation framework that combines hyperbolic convolutions, wavelet decomposition, synaptic plasticity, and neural representations for accurate liver and tumor segmentation on CT images, achieving state-of-the-art performance on LiTS and AMOS datasets.

Details

Motivation: Accurate liver and tumor segmentation is critical for diagnosis and treatment planning but remains challenging due to complex anatomical structures, tumor appearance variability, and limited annotated data.

Method: Combines hyperbolic convolutions for hierarchical geometry, wavelet decomposition for multi-scale textures, synaptic plasticity for adaptive features, implicit neural representations for continuous boundaries, uncertainty-aware Monte Carlo dropout, and lightweight temporal attention for inter-slice consistency.

Result: Achieved 93.26% Dice score, 88.09% IoU, 0.72mm ASSD, and 11.91% VOE on LiTS dataset; 85.09% Dice, 76.66% IoU, 19.49mm ASSD, and 23.34% VOE on AMOS 2022 dataset.

Conclusion: HANS-Net provides anatomically consistent, accurate, and confident liver and tumor segmentation with strong generalization across different datasets.

Abstract: Accurate liver and tumor segmentation on abdominal CT images is critical for reliable diagnosis and treatment planning, but remains challenging due to complex anatomical structures, variability in tumor appearance, and limited annotated data. To address these issues, we introduce Hyperbolic-convolutions Adaptive-temporal-attention with Neural-representation and Synaptic-plasticity Network (HANS-Net), a novel segmentation framework that synergistically combines hyperbolic convolutions for hierarchical geometric representation, a wavelet-inspired decomposition module for multi-scale texture learning, a biologically motivated synaptic plasticity mechanism for adaptive feature enhancement, and an implicit neural representation branch to model fine-grained and continuous anatomical boundaries. Additionally, we incorporate uncertainty-aware Monte Carlo dropout to quantify prediction confidence and lightweight temporal attention to improve inter-slice consistency without sacrificing efficiency. Extensive evaluations of the LiTS dataset demonstrate that HANS-Net achieves a mean Dice score of 93.26%, an IoU of 88.09%, an average symmetric surface distance (ASSD) of 0.72 mm, and a volume overlap error (VOE) of 11.91%. Furthermore, cross-dataset validation on the AMOS 2022 dataset obtains an average Dice of 85.09%, IoU of 76.66%, ASSD of 19.49 mm, and VOE of 23.34%, indicating strong generalization across different datasets. These results confirm the effectiveness and robustness of HANS-Net in providing anatomically consistent, accurate, and confident liver and tumor segmentation.

[631] Bit Allocation Transfer for Perceptual Quality Enhancement of VVC Intra Coding

Runyu Yang, Ivan V. Bajić

Main category: eess.IV

TL;DR: A low-complexity method to enhance perceptual quality in VVC intra coding by transferring bit allocation knowledge from end-to-end image compression, achieving over 11% BD-rate reduction in MS-SSIM.

Details

Motivation: Traditional block-based codecs like H.266/VVC optimize well for PSNR but struggle with perceptual metrics like MS-SSIM, creating a need for perceptual quality enhancement.

Method: Use a lightweight model trained with perceptual losses to generate quantization step maps that capture block-level perceptual importance, then derive QP maps for VVC coding.

Result: Significant advantages in execution time and perceptual metrics, with >11% BD-rate reduction in MS-SSIM on Kodak and CLIC datasets.

Conclusion: Provides an efficient, practical pathway for perceptual enhancement of traditional codecs without major architectural changes.

Abstract: Mainstream image and video coding standards – including state-of-the-art codecs like H.266/VVC, AVS3, and AV1 – adopt a block-based hybrid coding framework. While this framework facilitates straightforward optimization for Peak Signal-to-Noise Ratio (PSNR), it struggles to effectively optimize perceptually-aligned metrics such as Multi-Scale Structural Similarity (MS-SSIM). To address this challenge, this paper proposes a low-complexity method to enhance perceptual quality in VVC intra coding by transferring bit allocation knowledge from end-to-end image compression. We introduce a lightweight model trained with perceptual losses to generate a quantization step map. This map implicitly captures block-level perceptual importance, enabling efficient derivation of a QP map for VVC. Experiments on Kodak and CLIC datasets demonstrate significant advantages, both in execution time and perceptual metric performance, with more than 11% BD-rate reduction in terms of MS-SSIM. Our scheme provides an efficient, practical pathway for perceptual enhancement of traditional codecs.

Today’s Research Highlights

Table of Contents

cs.CL

[1] Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL

[2] From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening

[3] A Linguistics-Aware LLM Watermarking via Syntactic Predictability

[4] A Robust Classification Method using Hybrid Word Embedding for Early Diagnosis of Alzheimer’s Disease

[5] Users as Annotators: LLM Preference Learning from Comparison Mode

[6] Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

[7] Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning

[8] ConDABench: Interactive Evaluation of Language Models for Data Analysis

[9] SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models

[10] Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection

[11] Meronymic Ontology Extraction via Large Language Models

[12] NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

[13] ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

[14] Serialized EHR make for good text representations

[15] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

[16] On-device System of Compositional Multi-tasking in Large Language Models

[17] Language steering in latent space to mitigate unintended code-switching

[18] Revisiting the UID Hypothesis in LLM Reasoning Traces

[19] EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing

[20] ConsistencyAI: A Benchmark to Assess LLMs’ Factual Consistency When Responding to Different Demographic Groups

[21] BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

[22] R2T: Rule-Encoded Loss Functions for Low-Resource Sequence Tagging

[23] Harnessing Consistency for Robust Test-Time LLM Ensemble

[24] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

[25] ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing

[26] Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues

[27] Unlocking the Potential of Diffusion Language Models through Template Infilling

[28] Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

[29] FRACCO: A gold-standard annotated corpus of oncological entities with ICD-O-3.1 normalisation

[30] What Layers When: Learning to Skip Compute in LLMs with Residual Gates

[31] TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

[32] Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production

[33] PAGE: Prompt Augmentation for text Generation Enhancement

[34] Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation

[35] Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization

[36] Reliable Fine-Grained Evaluation of Natural Language Math Proofs

[37] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness

[38] The Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data

[39] Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

[40] Attribution Quality in AI-Generated Content:Benchmarking Style Embeddings and LLM Judges

[41] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

[42] RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

[43] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory

[44] Schema for In-Context Learning

[45] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

[46] Interpreting the Latent Structure of Operator Precedence in Language Models

[47] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

[48] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

[49] AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

[50] Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms

[51] Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models

[52] Element2Vec: Build Chemical Element Representation from Text for Property Prediction

[53] Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

[54] FACTS: Table Summarization via Offline Template Generation with Agentic Workflows

[55] An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation

[56] BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs

[57] LLMs Can Get “Brain Rot”!

[58] Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions

[59] Big Reasoning with Small Models: Instruction Retrieval at Inference Time

[60] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

[61] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

[62] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

[63] Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems

[64] The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

[65] CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models

[66] Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games

[67] Quantifying Phonosemantic Iconicity Distributionally in 6 Languages

[68] ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

[69] DROID: Dual Representation for Out-of-Scope Intent Detection

[70] Toward Cybersecurity-Expert Small Language Models

[71] Building a Macedonian Recipe Dataset: Collection, Parsing, and Comparative Analysis

[72] RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following

[73] DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans

[74] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

[75] Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

[76] MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems

[77] Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation